What is DALL-E?

DALL-E is a neural network that creates images from text descriptions. OpenAI unveiled it in early 2021 after several years of preparatory work. OpenAI is a company dedicated to the research and development of artificial intelligence; its investors include Elon Musk and Microsoft. The name combines WALL-E, the title of a science fiction film by Pixar, with the name of the surrealist artist Salvador Dalí.

How the algorithm works

DALL-E uses a 12-billion-parameter version of the GPT-3 Transformer model. The abbreviation GPT stands for Generative Pre-trained Transformer, and the "3" marks the third generation. GPT-3 is an autoregressive language model: it predicts each token from the tokens that precede it. It uses deep learning to produce human-like text. The quality is now so high that it is not always easy to tell whether a text was written by a machine or a human.
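To illustrate what "autoregressive" means, here is a minimal sketch in Python. It is not GPT-3: the probability table `BIGRAMS` and the function `generate` are hypothetical, and the toy model looks only at the last token, whereas a Transformer conditions on the whole preceding sequence. The point is only the loop: each new token is chosen based on what has already been generated.

```python
# Hypothetical toy next-token probabilities (not taken from any real model).
BIGRAMS = {
    "<s>": {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.7, "dog": 0.3},
    "the": {"cat": 0.5, "dog": 0.5},
    "cat": {"sleeps": 0.8, "</s>": 0.2},
    "dog": {"sleeps": 0.6, "</s>": 0.4},
    "sleeps": {"</s>": 1.0},
}


def generate(max_tokens=10):
    """Greedy autoregressive decoding: always pick the most likely next token."""
    tokens = ["<s>"]  # start-of-sequence marker
    for _ in range(max_tokens):
        distribution = BIGRAMS[tokens[-1]]  # toy model: depends on last token only
        next_token = max(distribution, key=distribution.get)
        if next_token == "</s>":  # end-of-sequence marker stops generation
            break
        tokens.append(next_token)
    return tokens[1:]  # drop the start marker


sentence = generate()
```

A real system would usually sample from the distribution instead of always taking the most likely token, which is why the same prompt can yield different outputs.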

DALL-E interprets input in natural language and generates images from it. It is trained on a dataset of paired images and texts. It works in a zero-shot fashion: it generates a pictorial output from a description without further task-specific training. It works together with CLIP, which was also developed by OpenAI. The name stands for Contrastive Language-Image Pre-training; OpenAI introduced the model under the title "Connecting Text and Images". CLIP is a separate neural network that scores how well an image matches a text, and it is used to rank the images that DALL-E generates.
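The interplay between a generator and a separate scoring network can be sketched as a reranking step. The example below is a simplified illustration, not OpenAI's implementation: the embedding vectors are made up, and `cosine_similarity` and `rerank` are hypothetical helper names. It assumes that the text and every candidate image have already been mapped to vectors in a shared space, and it orders the candidates by how closely they match the text.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def rerank(text_embedding, image_embeddings):
    """Return candidate names, best match to the text first."""
    scored = [
        (cosine_similarity(text_embedding, embedding), name)
        for name, embedding in image_embeddings.items()
    ]
    scored.sort(reverse=True)  # highest similarity first
    return [name for _, name in scored]


# Made-up 2-dimensional embeddings, for illustration only.
text = [1.0, 0.0]
candidates = {
    "image_a": [0.9, 0.1],
    "image_b": [0.0, 1.0],
    "image_c": [0.5, 0.5],
}
ranking = rerank(text, candidates)
```

In this toy setup `image_a` points in almost the same direction as the text vector and is ranked first, while `image_b` is orthogonal to it and is ranked last.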

Text and image are modelled as a single data stream of up to 1280 tokens. The model is trained by maximum likelihood to generate all tokens one after another. The training data enable the neural network to create images from scratch as well as to revise existing images.
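The single-stream idea and the training objective can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the actual training code: the split of the 1280 tokens into 256 text tokens and 1024 image tokens follows OpenAI's published description of DALL-E, while the shared toy vocabulary, `uniform_model`, and the other function names are hypothetical. The objective sums the log-probability of each token given all tokens before it; training would adjust the model to make that sum as large as possible.

```python
import math

TEXT_LEN = 256    # assumed text-token budget
IMAGE_LEN = 1024  # assumed image-token budget (a 32 x 32 grid)
MAX_STREAM = TEXT_LEN + IMAGE_LEN  # 1280 tokens in total


def build_stream(text_tokens, image_tokens):
    """Concatenate text and image tokens into one sequence."""
    if len(text_tokens) > TEXT_LEN or len(image_tokens) > IMAGE_LEN:
        raise ValueError("token budget exceeded")
    return list(text_tokens) + list(image_tokens)


def uniform_model(prefix, vocab_size):
    """Hypothetical stand-in for a trained model: ignores the prefix."""
    return [1.0 / vocab_size] * vocab_size


def sequence_log_likelihood(stream, next_token_probs, vocab_size):
    """Sum of log p(token_i | tokens before i) over the whole stream."""
    total = 0.0
    for i, token in enumerate(stream):
        probs = next_token_probs(stream[:i], vocab_size)
        total += math.log(probs[token])
    return total


# Toy example: 3 text tokens, 4 image tokens, shared vocabulary of size 8.
stream = build_stream([1, 2, 3], [4, 5, 6, 7])
log_likelihood = sequence_log_likelihood(stream, uniform_model, vocab_size=8)
```

With the uniform stand-in model every token has probability 1/8, so the log-likelihood of the 7-token stream is 7 · log(1/8). A trained model would assign higher probability to plausible continuations and therefore reach a higher value.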

What are the capabilities of DALL-E?

DALL-E has a wide range of capabilities. It can generate photorealistic images of objects that exist in reality as well as objects that do not, and it can output paintings and emojis. It can also manipulate or rearrange images.

In addition, in many cases the neural network is able to fill in gaps and render details that were not explicitly mentioned in the description. For example, the algorithm has already produced images from the following text descriptions:

  • a blue rectangular circle within a green square
  • the cross-section of a cut apple
  • a painting of a cat
  • the façade of a shop with specific lettering