Introduction to Generative Adversarial Networks (GANs)

2 July 2020 | Tech Deep Dive

If you have ever wondered whether Tom or Jerry ultimately wins the fight, you already have a rough idea of how Generative Adversarial Networks (GANs) work: GANs are a never-ending game of cat and mouse.

"The coolest idea in deep learning in the last 20 years." - Yann LeCun on GANs. 

LeCun's quote emphasises the innovation and importance of GANs, which were introduced by Ian Goodfellow and his team in 2014. But what are GANs, how do they work and how are they trained? 

What are Generative Adversarial Networks?

Generative Adversarial Networks (GANs) are a technique from the field of unsupervised machine learning. They can generate new data (mostly images) that resemble the existing training data. The special thing about GANs is that they not only generate new data, but can also distinguish genuine content from fake content.

A generative adversarial network consists of two parts: 

  1. A generator G (Jerry Mouse): its task is to generate new images. 
  2. A discriminator D (the police, or Tom Cat): its task is to recognise whether an input image is genuine or fake (i.e. created by the generator). 

How do the two models work together? 

The two models are counterparts of each other. The generator G creates a stack of synthetic images and passes these, together with real images from the training data, to the discriminator D. The discriminator D then tries to distinguish between real and fake images. Both models train each other until the generator produces images so similar to the training data that the discriminator can no longer tell them apart. In mathematical terms, the model attempts to learn the data distribution underlying the real images and to replicate a similar distribution in order to create new images. 

Model architecture and training 

The generator G is a neural network with the parameters θ and the input vector z. 

  1. Points in the latent space, i.e. a 100-element vector z of Gaussian random numbers, are given as input to the network. The subsequent layers transform this input step by step in order to generate the desired image. 
  2. In a deep neural network, specifically a Convolutional Neural Network (CNN), there are different feature maps, each of which provides a different interpretation of the image. When creating an image from a vector of length n, several feature maps must first be generated, which can be combined into a single image at the end. Therefore, the first layer must contain enough neurons to create several feature maps. 
  3. For example, to create an image from the MNIST dataset (28 × 28 pixels, black and white), the first dense layer contains 6272 neurons. The output of this layer is then reshaped into 128 feature maps of size 7 × 7 (128 × 7 × 7 = 6272). 
  4. Subsequent layers apply upsampling techniques to create a 28 × 28 image, as sketched in the code example after this list. 
  5. This image is then given to the discriminator D as one of its inputs. 
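The article does not tie itself to a specific framework, but the architecture above translates directly into a few lines of Keras. The following is a minimal sketch under that assumption; the dense layer size follows the MNIST example (6272 = 128 × 7 × 7), while kernel sizes and activations are illustrative choices:

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

latent_dim = 100  # length of the latent noise vector z

generator = keras.Sequential([
    keras.Input(shape=(latent_dim,)),
    # 6272 neurons = 128 feature maps of size 7 x 7
    layers.Dense(6272),
    layers.LeakyReLU(0.2),
    layers.Reshape((7, 7, 128)),
    # Upsample 7x7 -> 14x14
    layers.Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"),
    layers.LeakyReLU(0.2),
    # Upsample 14x14 -> 28x28
    layers.Conv2DTranspose(128, kernel_size=4, strides=2, padding="same"),
    layers.LeakyReLU(0.2),
    # Single-channel 28x28 output image with pixel values in [0, 1]
    layers.Conv2D(1, kernel_size=7, padding="same", activation="sigmoid"),
])

# A batch of 16 points from the latent space yields 16 synthetic images
z = np.random.randn(16, latent_dim)
fake_images = generator.predict(z, verbose=0)  # shape: (16, 28, 28, 1)
```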

The discriminator D is a binary classification model that categorises images into two classes: genuine ("1") or fake ("0"). It is typically also a CNN, with parameters Φ and input x. The input x is a stack of images, half of which are real images from the training data and half of which are fake images produced by the generator G. 
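A matching discriminator can be sketched in the same way: a small CNN that compresses a 28 × 28 image down to a single probability of being genuine. Again, the exact layer configuration is an illustrative assumption, not the article's prescription:

```python
from tensorflow import keras
from tensorflow.keras import layers

discriminator = keras.Sequential([
    keras.Input(shape=(28, 28, 1)),
    layers.Conv2D(64, kernel_size=3, strides=2, padding="same"),  # 28x28 -> 14x14
    layers.LeakyReLU(0.2),
    layers.Conv2D(64, kernel_size=3, strides=2, padding="same"),  # 14x14 -> 7x7
    layers.LeakyReLU(0.2),
    layers.Flatten(),
    layers.Dense(1, activation="sigmoid"),  # probability: 1 = genuine, 0 = fake
])
discriminator.compile(optimizer="adam", loss="binary_crossentropy")
```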


Both the generator and the discriminator are combined to form a GAN model, as shown in the following figure: 

Architecture of a Generative Adversarial Network
GAN model architecture

Model training 

Part of the training of Generative Adversarial Networks (GANs) consists of giving the generator a series of latent noise vectors. The generator generates an image from each of these input vectors. These images are labelled "0" (fake) when they are used as input to the discriminator. An equal number of images is taken from the real training images; these are labelled "1" (real). Both stacks are used as input for the discriminator, which classifies them as real or fake. After forward propagation through the network, two optimisation quantities are calculated independently of each other: the loss of the generator model and the error of the discriminator model. The latter is the classification error, and it is used to update the weights of the discriminator via the conventional backpropagation mechanism of a CNN. 
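Continuing the Keras sketches above, one discriminator update might look as follows (X_train is an assumed variable holding the real training images, e.g. MNIST scaled to [0, 1] with a channel dimension):

```python
import numpy as np

batch_size = 64

# Stack of generated images, labelled "0" when fed to the discriminator
z = np.random.randn(batch_size, latent_dim)
fake_images = generator.predict(z, verbose=0)

# Equally sized stack of real training images, labelled "1"
idx = np.random.randint(0, X_train.shape[0], batch_size)
real_images = X_train[idx]

# The classification error updates the discriminator's weights via backpropagation
d_loss_real = discriminator.train_on_batch(real_images, np.ones((batch_size, 1)))
d_loss_fake = discriminator.train_on_batch(fake_images, np.zeros((batch_size, 1)))
```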

How often the weights of the generator are updated depends on the performance of the discriminator on the fake images. If the discriminator is able to recognise the fake images, the weights of the discriminator are updated. However, if the discriminator is unable to distinguish the fake images from the real ones (which means the generator is producing very real-looking images), the weights of the generator are updated. The generator's loss is then propagated back through the network to update the weights of the generator's neural network. During this backpropagation step, the weights of the discriminator network are marked as not trainable. This ensures that the discriminator's weights are not changed while the generator is being trained. 
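In Keras, this freezing is typically done by stacking both models into a combined model and marking the discriminator as non-trainable before compiling it. A sketch continuing the snippets above:

```python
from tensorflow import keras

# Freeze the discriminator inside the combined model: backpropagation
# through `gan` now only changes the generator's weights.
discriminator.trainable = False
gan = keras.Sequential([generator, discriminator])
gan.compile(optimizer="adam", loss="binary_crossentropy")

# The generated images are labelled "1" here: the generator's loss is small
# exactly when the discriminator mistakes its fakes for genuine images.
z = np.random.randn(batch_size, latent_dim)
g_loss = gan.train_on_batch(z, np.ones((batch_size, 1)))
```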

"The same generated images have two different labels. When the discriminator is trained, the label is 0 and when the generator is trained, the label is 1." 

The discriminator model outputs the probability that an image is genuine. If this confidence value is below a threshold, e.g. 0.4, the image is categorised as fake; in other words, the discriminator believes the image is genuine with a confidence of less than 0.4, or 40%. To train the generator to produce realistic images, the generator's optimisation metric is calculated based on the discriminator classifying the generated image as genuine. Therefore, when the generator's weights are updated, the generated images are labelled "1". This means that the generator's error is significantly greater for generated images with a low confidence value, which is why the weight updates push the generator towards more realistic images. 
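Interpreting the discriminator's output is then a simple comparison, sketched here with the variables from the snippets above:

```python
# Probability that each image is genuine, one value in [0, 1] per image
p_genuine = discriminator.predict(fake_images, verbose=0)

# Using the 0.4 threshold from the text (0.5 is the more common default)
is_fake = p_genuine < 0.4  # True wherever the image is categorised as fake
```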

Loss functions for discriminator and generator 

The loss functions for both the discriminator and the generator can be derived from the binary cross-entropy equation: 

L(y, ŷ) = y · log(ŷ) + (1 − y) · log(1 − ŷ) 

Discriminator loss: The discriminator loss quantifies the discriminator's ability to distinguish between real and fake images. It compares the discriminator's predictions for real images with a series of ones and its predictions for fake images with a series of zeros. The total loss for the discriminator is the cumulative loss for the real and fake images. 

For real images, y = 1 and ŷ = D(x), so the loss for real images is L(1, D(x)) = log D(x) 

For synthetically generated images, y = 0 and ŷ = D(G(z)), so the loss for synthetically generated images is L(0, D(G(z))) = log[1 − D(G(z))] 

The aim of the discriminator is to maximise both loss terms. Mathematically, the total loss for the discriminator is therefore defined as follows: 

max_D [ log D(x) + log(1 − D(G(z))) ] 

Generator loss: The generator loss quantifies the ability of the generator model to fool the discriminator into believing that the fake images are real images. Therefore, if the generator is well trained, the generated (fake) images will be categorised as genuine (1) by the discriminator. Consequently, the output of the discriminator for the generated images is compared with a series of ones. 

The generator, on the other hand, attempts to deceive the discriminator and therefore tries to minimise the loss function: 

min_G [ log D(x) + log(1 − D(G(z))) ] 

The generator cannot have a direct effect on the term log D(x), so minimising the loss for the generator is equivalent to: 

min_G [ log(1 − D(G(z))) ] 

The combined loss function of the GAN is therefore the minimax game, averaged over the real images x and the noise vectors z: 

min_G max_D [ log D(x) + log(1 − D(G(z))) ] 
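These terms can be checked numerically in a few lines of NumPy (illustrative values only; `discriminator_loss` and `generator_loss` are hypothetical helper names, not part of any library):

```python
import numpy as np

def discriminator_loss(d_real, d_fake):
    # D maximises log D(x) + log(1 - D(G(z))); the maximum is 0
    return np.log(d_real) + np.log(1.0 - d_fake)

def generator_loss(d_fake):
    # G minimises log(1 - D(G(z)))
    return np.log(1.0 - d_fake)

# A confident discriminator (reals ~0.9, fakes ~0.1) sits close to its maximum of 0:
print(discriminator_loss(0.9, 0.1))  # ~ -0.21

# Convincing fakes (D(G(z)) ~0.9) push the generator's loss strongly down:
print(generator_loss(0.9))           # ~ -2.30
```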

The GAN algorithm is described as follows in the original publication by Ian Goodfellow: 

Training of a Generative Adversarial Network
Training process of a GAN model

Application examples for generative adversarial networks

The basic idea of Generative Adversarial Networks (GANs) is therefore to generate data that is indistinguishable from real data. This makes GANs a powerful tool for generating synthetic data. They can be used in a variety of areas, such as image generation, image transformation, image colouring, music synthesis, video generation, 3D object generation and many more. Below you will find a list of some practical applications of GANs that are used today. 

StyleGAN

This AI model developed by Nvidia offers users a high degree of control over the creation of highly realistic images. Users can decide for themselves which part of the image they want to manipulate. For example, when creating facial images, users can change the hair colour or cut, facial expression and much more. For companies, this means that they can quickly create multiple visualisations of a single image. Car manufacturers can create different interior designs for customers in no time at all, and fashion agencies can design new clothing patterns in seconds. 

CycleGAN

This powerful AI model is characterised by the conversion of images from one style to another. Its uniqueness lies in the fact that it does not need matching pairs of images to learn how to make conversions. This can be a very powerful tool for e-commerce retailers or real estate agencies to show how their products look in different environments (e.g. in different lighting situations). 

MidiNet

Another powerful application of GANs, MidiNet was developed to create music sequences in MIDI (Musical Instrument Digital Interface) format. It allows users to create customised music for various use cases, be it marketing, media, entertainment, promotional videos and more. 

Author

Brijesh Modasara

Brijesh joined [at] in May 2022 as a Senior Data Scientist. His expertise lies in the field of reinforcement learning and data mining. He enjoys having interesting conversations about innovative applications of AI and reinforcement learning in particular. When he's not revolutionising the tech world, you'll find him capturing breathtaking moments through his lens, combining his love for travel and photography.
