The Example of the Revolution in Computer Vision
Learn how convolutional neural networks help machines understand visual information. Essentially, machines mimic the processes of the human brain. We will guide you through the development stages of these algorithms and also show you how machines learn to see.
Computer vision, or machine vision, has long been one of the most challenging areas of development in artificial intelligence. However, great progress has been made in recent years. Training machines to interpret images and what they “see” through camera sensors is critical for many areas of application. For example, autonomous systems such as cars or other means of transportation must be able to perceive and interpret their environment in order to be used safely and effectively. Thanks to the development of convolutional neural networks, a breakthrough has been achieved in recent years.
A convolutional neural network (CNN) is a special type of artificial neural network. Such networks consist of interconnected artificial neurons and are used in many areas of artificial intelligence.
Artificial neural networks are machine learning techniques inspired by biological processes in the human brain. The neurons are interconnected, and each connection carries a weight. The goal of the learning process is to adjust these weights so that the model makes predictions that are as accurate as possible.
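As a minimal sketch of this weight adjustment, the following NumPy example trains a single linear neuron with gradient descent; the data, learning rate, and number of steps are made up purely for illustration:

```python
import numpy as np

# Minimal sketch: one artificial neuron whose weights are adjusted
# step by step so that its predictions get closer to the targets.
rng = np.random.default_rng(0)

X = rng.normal(size=(100, 3))        # 100 samples, 3 input features (toy data)
true_w = np.array([1.5, -2.0, 0.5])
y = X @ true_w                       # targets the neuron should learn to predict

w = np.zeros(3)                      # start with arbitrary weights
learning_rate = 0.1

for _ in range(200):
    y_pred = X @ w                   # prediction with the current weights
    error = y_pred - y
    gradient = 2 * X.T @ error / len(X)   # gradient of the mean squared error
    w -= learning_rate * gradient    # adjust the weights against the gradient

print(w)  # approaches [1.5, -2.0, 0.5]
```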
The convolutional neural network takes its name from the mathematical operation of convolution. This approach is particularly well suited to processing image and audio data. The reduction in data resulting from convolution speeds up the calculations, which shortens the learning process without reducing performance.
Convolutional neural networks are multi-layered, feed-forward networks. With layers connected in series, they are able to build up a kind of intuition that leads, for example, from the recognition of details (lines) to the recognition of abstractions (edges, shapes, objects). With each higher layer, the level of abstraction increases.
When constructing a CNN, several different types of layers are used in sequence.
The convolutional layer applies various filters to the input and integrates them into the neural network. A convolution matrix (kernel) is slid over the pixel values; the weights of the individual kernels differ from filter to filter. By computing the kernel weights with the input values, different characteristics, such as edges and other features, can be extracted.
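As a minimal illustration of what such a kernel does, the following sketch (using NumPy and SciPy, not part of the original article) slides a vertical edge-detection kernel over a toy image:

```python
import numpy as np
from scipy.signal import convolve2d

# Sketch: a 3x3 kernel is slid over the pixel values of a toy "image".
image = np.zeros((6, 6))
image[:, 3:] = 1.0                      # left half dark, right half bright

kernel = np.array([[-1, 0, 1],
                   [-2, 0, 2],
                   [-1, 0, 1]])         # Sobel-like vertical edge detector

feature_map = convolve2d(image, kernel, mode="valid")
print(feature_map.shape)                # (4, 4): the convolution shrinks the output
print(feature_map)                      # large absolute values mark the vertical edge
```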
The pooling layer serves to better generalize the data. Max pooling, the most common variant, forwards only the strongest feature in each region of the feature map. The dimensions of the input data, for example the number of pixels in an image, determine how many pooling layers can be applied.
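A minimal NumPy sketch of 2x2 max pooling, assuming a small 4x4 feature map for illustration:

```python
import numpy as np

# 2x2 max pooling keeps only the strongest activation in each 2x2 block,
# halving both spatial dimensions of the feature map.
feature_map = np.array([[1, 3, 2, 0],
                        [4, 6, 1, 2],
                        [7, 2, 9, 4],
                        [1, 0, 3, 5]], dtype=float)

h, w = feature_map.shape
pooled = feature_map.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))
print(pooled)
# [[6. 2.]
#  [7. 9.]]
```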
In the flatten layer, the multidimensional output of the convolutions is converted into a one-dimensional vector.
Neural networks may sometimes rely too heavily on one or more input parameters. For this reason, a dropout layer can be used, which randomly deactivates some connections during training so that certain values are no longer passed on. This prevents the network from depending too strongly on any single value and forces it to find suitable connections independently of any one path.
The dense layer is also known as the fully connected layer. This is a standard layer in which all neurons are connected to all inputs and outputs. The final classification takes place in the last dense layer.
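The following sketch shows how these layer types might be combined into a small image classifier using the Keras API; the input size (64x64 RGB images) and the two output classes are assumptions chosen purely for illustration:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Illustrative sketch: the layer types described above combined in sequence.
model = keras.Sequential([
    keras.Input(shape=(64, 64, 3)),                       # assumed 64x64 RGB input
    layers.Conv2D(32, kernel_size=3, activation="relu"),  # convolutional layer: filter kernels
    layers.MaxPooling2D(pool_size=2),                     # pooling layer: keep strongest features
    layers.Conv2D(64, kernel_size=3, activation="relu"),
    layers.MaxPooling2D(pool_size=2),
    layers.Flatten(),                                     # flatten: multidimensional output -> 1D vector
    layers.Dropout(0.5),                                  # dropout: randomly drop connections
    layers.Dense(64, activation="relu"),                  # dense (fully connected) layer
    layers.Dense(2, activation="softmax"),                # final classification, e.g. cat vs. dog
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
model.summary()
```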
Data, algorithms, and computing power are the three ingredients that teach machines to see. Fei-Fei Li, a computer science professor at Stanford University, played a significant role in the enormous advances in the field of machine vision. She wanted to teach computers something that children aged around 3 years and older are capable of doing: machines should understand the meaning of what they "see" in images and be able to name it. At that time, computer vision was an almost unsolvable problem in the field of artificial intelligence research. Li first determined that three ingredients are necessary to teach computers to see: data, algorithms, and computing power.
As it turned out, surprisingly, the data was the biggest challenge. Neural networks have been around since the 1950s.
Methods such as backpropagation have existed since the 1980s and convolutional neural networks since the 1990s. Even the necessary computing power was no longer a real problem in the mid-2000s—although a special development played a role here.
However, despite the hype surrounding big data, for a long time, data was not available in a form that machines could actually learn from.
Labeled data helps machines learn to see. For algorithms such as convolutional neural networks to understand what can be seen in images, they need data from which they can learn structures and differences. To distinguish dogs from cats in images, for example, they need many images that are known to show cats and many that are known to show dogs. By comparing these so-called labeled data sets, the network can identify the structural differences between cats and dogs. The goal is to train a network that independently recognizes images and distinguishes between dogs and cats.
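As a sketch of what such a labeled data set can look like in practice, the following snippet assumes a hypothetical folder "pets/" in which the subfolder names "cat" and "dog" serve as labels, and loads it with Keras:

```python
from tensorflow import keras

# Sketch, assuming a hypothetical directory "pets/" in which every image
# already carries its label through the folder it is stored in:
#   pets/cat/...jpg
#   pets/dog/...jpg
train_ds = keras.utils.image_dataset_from_directory(
    "pets",                 # hypothetical path to the labeled images
    labels="inferred",      # the folder name becomes the label
    image_size=(64, 64),
    batch_size=32,
)
print(train_ds.class_names)  # e.g. ['cat', 'dog']
```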
Amazon's Mechanical Turk is one way to obtain labeled data. The question was how to get labeled data at the scale needed to train convolutional neural networks. One method for labeling data sets is Amazon's service "Mechanical Turk."
This name goes back to a machine known as the "Turkish chess player", which was claimed to be able to defeat any human being at chess. The trick, however, was that a human chess player was hidden inside the machine and carried out the moves mechanically.
Image: Copperplate engraving of the "Turkish chess player" by Joseph Racknitz (1789).
This idea is ultimately also the basis for Amazon's "Mechanical Turk." Tasks such as labeling data sets are not actually performed by machines, but by numerous people. Often it is, for example, workers in India who label data on a large scale, such as image data with the label "dog" or "cat." Fei-Fei Li's work was likewise based on the help of thousands of people who labeled millions of images.
Based on this labeled data, the next step could then be taken: finding the best algorithms for the task. The first successful candidates were so-called shallow networks, i.e., networks with only a few layers, in contrast to the deep networks used today.
Around the year 2000, another change took place that had far-reaching consequences for the development of computer vision. In a competition organized by Nvidia, it was discovered that deep learning could be accelerated by a factor of 100 to 1,000 by using GPUs (graphics chips) instead of conventional CPUs.
This particular development played a key role in increasing the available computing power. AlexNet, the winner of the 2012 ImageNet competition, was the first model to successfully exploit it. The success was nothing less than a quantum leap: for the first time, it was possible to train truly deep models. A few years later, VGG established itself as the best model.
Despite these initial successes, however, the motto was: “We need to go deeper” – a quote from Christopher Nolan's film “Inception,” on which the name of Google's “InceptionNet” from 2014 was actually based.
In 2015, Microsoft followed suit with ResNet, and since then machines have actually outperformed humans on certain machine vision tasks, such as distinguishing between dogs and cats in images.
Behind this success are convolutional neural networks.
Convolutional neural networks enable machines to understand images. Put simply, they are algorithms that allow machines to understand what they see. An artificial neural network (ANN) mimics, in a simplified way, what happens between individual neurons in the human brain: it consists of artificial neurons, i.e., nodes that are linked together. Analogous to neural connections, these nodes are arranged in a certain number of layers.
Image: Simplified representation of an artificial neural network with two layers.
The difference between ordinary ANNs and convolutional neural networks is that in convolutional neural networks, certain information is processed in a condensed, convolved ("folded") form. In this condensed form, an image as it appears to humans takes on a new representation that is understandable to machines; in other words, it gains more depth, or more layers. At the end of this process stands one result: a classifier that enables machines to recognize what is depicted in images.
A simple example demonstrates the practical relevance of convolutional neural networks. After natural disasters such as severe hailstorms, insurers face a major challenge: their customers need help as quickly as possible, but assessing the damage is a lengthy process.
Computer vision provides an elegant solution to this problem. Instead of experts having to travel to the affected regions to assess the damage, images of the damage are evaluated.
Convolutional neural networks make it possible to distinguish between houses with damaged and intact roofs. This makes it possible to determine within a short period of time whether and to what extent a particular region has been affected by hail damage. Many cases can thus be processed much more quickly and the necessary assistance provided more rapidly. This is just one of many possible areas of application in which convolutional neural networks can help to provide solutions quickly and effectively. What led to the revolution in image recognition is now standard practice in data science projects.
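As a sketch of how such a classifier could be built, the following example assumes hypothetical folders "roofs/damaged/" and "roofs/intact/" containing labeled aerial photos and reuses a pretrained network as a feature extractor (transfer learning); it is an illustrative setup, not the insurers' actual pipeline:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Sketch of the insurance use case, assuming labeled aerial photos in
# hypothetical folders "roofs/intact/" and "roofs/damaged/".
train_ds = keras.utils.image_dataset_from_directory(
    "roofs", labels="inferred", image_size=(128, 128), batch_size=32)

# A pretrained convolutional network is reused as a feature extractor;
# only a small classification head is trained on the roof images.
base = keras.applications.MobileNetV2(
    input_shape=(128, 128, 3), include_top=False, weights="imagenet")
base.trainable = False

model = keras.Sequential([
    keras.Input(shape=(128, 128, 3)),
    layers.Rescaling(1.0 / 127.5, offset=-1),  # scale pixels to [-1, 1] as MobileNetV2 expects
    base,
    layers.GlobalAveragePooling2D(),
    layers.Dense(1, activation="sigmoid"),     # damaged vs. intact roof
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
model.fit(train_ds, epochs=5)
```

In practice, such a model would of course be trained and validated on far more images than this sketch suggests.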