GPT-3 – Go deep: But do not hit the ground

by | 1 March 2021 | Tech Deep Dive

Did you know that the first Porsche, the model P1 designed in 1898, was an electric car? It had a range of 79 km and has since been forgotten, while we became experts in combustion technology. And did you know that the way our ancestors farmed their fields was more sustainable than modern agricultural practices? We excel at monocultures and the excessive use of fertilizers and pesticides, leading to damaged soil, malnutrition, and accelerated climate change.

Only recently have we started to question these strategies. What we can learn from our experiences is to draw more inspiration from nature – and by doing so come to more sustainable answers.  

Right now, we are experiencing the hype of deep AI. Industry experts reckon that AI could add $15 trillion to the global economy by 2030. The latest IDC forecast estimates that global AI spending will double to $110 billion over the next four years. The energy costs of deep learning increased 300,000-fold between 2012 and 2018. The cutting-edge language model of 2019 had 1.5 billion parameters; its 2020 successor has 175 billion. What comes next? Is there an end, a limit, or are we about to make the same old mistake by following yet another unsustainable hype? We need to pay attention to fundamentally different AI approaches!

Ordinary intelligence for an Artificial General Intelligence

In a blog post last year, we discussed the milestones on the path towards "real artificial intelligence", an Artificial General Intelligence (AGI).

The artificial intelligence solutions applied in industry are often called "narrow AI": they can usually perform only one task. A natural neural network like the human brain, in contrast, is "general": the very same network can perform an endless number of tasks. Brains can dynamically identify new tasks and adapt to fit the circumstances.

While steering towards a more general AI, it is important to be aware of the differences between artificial and biological neural networks. Potential pitfalls of such comparisons will be discussed in subsequent posts. For now, let's challenge the widespread idea of scaling up existing approaches and explore alternatives.

Scaling – Not the solution! 

You may have heard of OpenAI’s Generative Pre-Trained Transformer (GPT) models. The 175 billion parameters of its latest variant, GPT-3, were trained with an unbelievably large amount of text data (the English Wikipedia constitutes only 0.6% of all of it). Upon publication, it created big waves in the media due to its astonishing properties. 

Given the great development of current AI models like OpenAI's GPT series, there is a strong temptation to believe that "more makes magic". Larger and larger networks are trained with more and more data. The main rationale for continuing down that path is that scaling has worked so far.

However, we seem to be riding a self-sustaining hype. Current models are getting bigger because "we can": we know how to scale them, we know how to train them, we are provided with ever more powerful hardware, and money is being invested to push even further. One may be reminded of high-tech farming trends like monocultures for bigger yields of profitable crops, which ultimately deplete soils. The harvest does not scale over time. And yet these unsustainable practices are widespread, and the issue goes largely unrecognized. A recent Netflix documentary, "Kiss the Ground", attempts to shed some light on it.

Similarly, a 2019 paper from the Allen Institute for AI points to diminishing returns to model size across AI subfields. Many of the added nodes in such networks are not exploited after training; they merely increase flexibility during training. This is much like the brains of growing children, where most neurons and connections are created while experiencing the world. In brains, however, unused synapses and cell bodies are subsequently removed and recycled (see video below), so that only the essential parts of the network consume energy. That is not the case in Deep Learning, where the dimensions of a network stay fixed once it is set up.


[Embedded YouTube video]
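The biological pruning described above has a loose artificial counterpart in network pruning: weights that contribute little after training are removed. A minimal sketch of magnitude pruning in Python (the matrix size and the 90% sparsity level are illustrative assumptions, not figures from the paper cited above):

```python
import numpy as np

# A small sketch of magnitude pruning, loosely analogous to synaptic
# pruning: weights below a percentile threshold are zeroed out after
# training, so they no longer need to be stored or computed.
rng = np.random.default_rng(0)
w = rng.normal(0, 1, (256, 256))          # stand-in for trained weights

def prune(weights, sparsity=0.9):
    """Zero out the smallest `sparsity` fraction of weights by magnitude."""
    threshold = np.quantile(np.abs(weights), sparsity)
    mask = np.abs(weights) >= threshold   # keep only the largest weights
    return weights * mask, mask

pruned, mask = prune(w)
print(f"kept {mask.mean():.0%} of weights")
```

In Deep Learning, pruning is usually a post-hoc compression step; in brains, pruning is part of normal development.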

While models may have gained accuracy and performance by a few percent, their energy consumption has grown disproportionately, even though computer hardware has become more efficient. The CO₂ cost of exploring and training a state-of-the-art language model like BERT has been estimated as the equivalent of an entire 747 flying from New York to San Francisco. To put this in perspective: that is about the same amount of CO₂ that five average cars emit during their lifetime (including fuel). The US Department of Energy estimates the contribution of data centers at around 2% of the total energy consumption in the country. That is roughly the electricity consumption of the agricultural sector and ten times that of public transport.

Today this energy originates mainly from traditional sources. Big tech companies are aware of the issue: Google, Microsoft, Amazon and Facebook claim to be carbon neutral or even aim to become carbon negative. It is important to note that their focus is not on consuming less energy but on using cleaner energy. What we are seeing, despite these claims, is a growing demand for energy in the field, while the amount of green energy available is still limited (somewhere between 11% and 27% of the global power generation mix). You can find more numbers on AI's ecological costs in WIRED magazine and the MIT Technology Review. The DIN Deutsches Institut für Normung e. V. and the German Ministry of Economy and Energy recently published an industry guideline in collaboration with 300 specialists. It is not surprising to see the guideline state: "it must be ensured that the most energy-efficient variant of the analysis is selected".

Of course, we should not forget that AI also contributes to saving energy by enabling more efficient processes; smart buildings are one profitable example. AI can make a strong contribution to greater sustainability if used correctly, considering ecological, economic, and social aspects.

The success of scaling GPT-2 to GPT-3 triggered widespread predictions about when we will reach artificial general intelligence. Such claims are often based on wrong assumptions and misleading comparisons to numbers in nature. That prompted some Twitter humor from the Deep Learning legend Geoffrey Hinton (University of Toronto):


[Embedded tweet]

What do we really understand? 

It is rare that Data Scientists design and train a complex neural network model without trial-and-error.  

With increasing model size and ever more generic training data, the intuition behind these models is increasingly lost. Their creators are often surprised by the outcomes of their own models, whether good or bad. To be sure, formulas can be written down, diagrams can be drawn, and we may even have a vague idea of the information flow and the transformations happening. We also quite successfully combine different networks whose behavior we roughly understand. However, predicting which model architecture is best for a certain task is becoming harder and harder with larger models.


From this point of view, it is not surprising that the results of the GPT-3 model were so mind-blowing. Several voices even claimed that it brings us closer to an artificial general intelligence. Its architecture is very complex and its training data extremely unspecific, which is why there is no clear intuitive understanding of such a model. In addition, its output can be surprising. In the end, however, it is just a "normal" model without magic, and it is not even close to what a human brain can do. It was trained without a specific goal other than creating a coherent sequence of words, so its results were not predictable. This is true for unsupervised learning methods in general, especially when applied to massive unstructured data.

[Tweet screenshot: "Artificial General Equivalence" © Twitter]

This raises a fundamental question: given GPT-3's success, do we even need to fully understand each element of a model? There seems to be a consensus in the field: understanding is a "nice to have", but not necessarily required for AI applications. This is expressed in the first two words of the following quote by Terrence Sejnowski, co-inventor of the Boltzmann machine:

“Perhaps someday an analysis of the structure of deep learning networks will lead to theoretical predictions and reveal deep insights into the nature of intelligence.”  – Terrence J. Sejnowski 

Great tricks are invented to boost the accuracy of a model without anyone really knowing why they work. For example, Microsoft's research group recently tackled "three mysteries in deep learning". Explanations for the success of methods are offered in hindsight, after they have been in use for a long time. This is typical for the field.

Another typical strategy in Deep Learning is to blow up existing architectures and train them with more data (as in the development from GPT-2 to GPT-3). The motto is often "just give it a go!". In the case of GPT-3, the claim is that, even though its architecture provides no ground for proper understanding (as would be required for an artificial general intelligence), the vast amount of data enables it to mimic human writing very well, because the massive text corpus used for training describes nearly everything.

Given the appeal of such approaches, it is not surprising that we hardly see fundamentally different network architectures in practice. Models that are not differentiable are underrepresented: in such models, the directions pointing towards an optimal solution (the gradients) cannot be computed, and training their parameters amounts to making blind guesses.

It is even claimed that GPT-3 learned arithmetic. Yes, it can perform some simple calculations, as they are found in the training data, and it can even generalize the concepts to some extent. However, GPT-3's goal was to learn the joint probability structure of a massive text corpus. A mathematician knows for sure (with probability 100%) that 12345.678 − 12344.678 = 1. A generative model like GPT-3 can only guess the result with remaining uncertainty. It makes a best guess, and in this case it will most likely suggest a quite different outcome: it has probably never seen these numbers before, so the joint probability distribution for this input is represented insufficiently and the input cannot be related to the correct outcome.
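The contrast can be made concrete: exact arithmetic is deterministic and does not depend on having seen the operands before. A small Python sketch using exact decimal arithmetic:

```python
from decimal import Decimal

# Exact, deterministic arithmetic, as a mathematician (or any calculator)
# performs it. There is no probability distribution over answers here:
# the same input always yields the same, correct output.
a = Decimal("12345.678")
b = Decimal("12344.678")
result = a - b
print(result)  # 1.000
```

A language model, by contrast, outputs a probability distribution over the next tokens and merely samples a plausible-looking answer from it.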

It is no surprise that the hype around this model even prompted the CEO of OpenAI to step in:


[Embedded tweet]

There are alternatives

Neurons under fire 

The vast majority of neural networks used in machine learning and AI consist of highly simplified neurons. In contrast, so-called spiking neuron models try to mimic biological neurons more rigorously, at the cost of higher complexity. However, this complexity enables richer functionality and more powerful computations. One of the simplest and most famous spiking neuron models is the leaky integrate-and-fire model. With technological advances in implementing such models directly in hardware, it is almost certain that we will hear more about them.
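A minimal simulation sketch of the leaky integrate-and-fire model, using Euler integration. All constants (membrane time constant, threshold, resistance) are illustrative textbook-style values, not tied to any specific hardware implementation:

```python
# Leaky integrate-and-fire (LIF) neuron: the membrane potential leaks
# towards a resting value, integrates the input current, and emits a
# spike (then resets) whenever it crosses a threshold.
def simulate_lif(current, dt=1e-4, tau=0.02, v_rest=-0.065,
                 v_reset=-0.065, v_thresh=-0.050, r_m=1e7, t_max=0.5):
    """Return spike times (s) for a constant input current (A)."""
    v = v_rest
    spikes = []
    for i in range(int(t_max / dt)):
        # Leak towards the resting potential plus the input drive
        dv = (-(v - v_rest) + r_m * current) / tau
        v += dv * dt
        if v >= v_thresh:          # threshold crossed: emit a spike
            spikes.append(i * dt)
            v = v_reset            # reset the membrane potential
    return spikes

spikes = simulate_lif(2e-9)        # constant 2 nA drive
print(f"{len(spikes)} spikes in 0.5 s")
```

Unlike the static activations of standard artificial neurons, the output here is a sequence of discrete events in time, which is what neuromorphic hardware exploits.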

Reservoir Computing 

In some domains of AI, Reservoir Computing is a promising approach. In short, it utilizes the complexity of highly nonlinear dynamical systems, such as recurrent neural networks with fixed parameters. Feeding data sequentially into such a system triggers resonating behavior, like clapping onto the surface of a little pond or creating an echo in a cave. The responses within such systems are hard to predict.

Even though most of the network seems to produce chaotic nonsense, one part of the system may in fact perform something as elaborate as a frequency analysis. Another part may perform a smoothing or a classification.

In Reservoir Computing, training the parameters of such systems is not even attempted. Instead, one learns where to find the computation of interest within the system. That is extremely promising given the amount of time currently spent on learning network parameters. Furthermore, simulating the network is not strictly required: we can, or could, use physical systems like a simple water bucket. With no need to run huge computer clusters to learn network weights, Reservoir Computing can work with minimal energy consumption. It is still not clear how to exploit the potential of this approach optimally, but progress is being made. Find out more in our article on Reservoir Computing.
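To make the idea tangible, here is a toy echo state network, one common flavor of Reservoir Computing: the recurrent weights stay fixed and random, and only a linear readout is trained, here on a simple memory task. Reservoir size, scalings, and the ridge penalty are illustrative, not tuned:

```python
import numpy as np

# Toy echo state network: fixed random reservoir, trained linear readout.
rng = np.random.default_rng(0)
n_res, n_steps = 200, 1000
w_in = rng.uniform(-0.5, 0.5, n_res)
w = rng.normal(0, 1, (n_res, n_res))
w *= 0.9 / np.max(np.abs(np.linalg.eigvals(w)))  # spectral radius < 1

u = rng.uniform(-1, 1, n_steps)    # random input signal
target = np.roll(u, 3)             # task: recall the input from 3 steps ago

# Drive the reservoir and collect its states (the "echoes")
x = np.zeros(n_res)
states = np.empty((n_steps, n_res))
for t in range(n_steps):
    x = np.tanh(w @ x + w_in * u[t])
    states[t] = x

# Train only the readout, by ridge regression on the collected states
# (the first 100 steps are discarded as warm-up)
ridge = 1e-6
S, y = states[100:], target[100:]
w_out = np.linalg.solve(S.T @ S + ridge * np.eye(n_res), S.T @ y)

pred = states @ w_out
err = np.sqrt(np.mean((pred[100:] - target[100:]) ** 2))
print(f"RMSE on 3-step memory task: {err:.3f}")
```

Note that the expensive part of Deep Learning, backpropagating through the recurrent weights, is absent here: the only thing learned is a single linear map.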

Let the monkey draw the model

A fundamentally different yet promising approach that will gain attention in the coming years is Neural Architecture Search (NAS) and related techniques. Here, many different network architectures are tried out and only the best candidates are picked. The way the architectures are set up can be completely randomized and yet lead to very good results. It is like having a monkey sit in front of a computer and type the next ground-breaking architecture. The authors of a 2019 paper from Facebook AI Research report:

“The results are surprising: several variants of these random generators yield network instances that have competitive accuracy on the ImageNet benchmark. These results suggest that new efforts focusing on designing better network generators may lead to new breakthroughs by exploring less constrained search spaces with more room for novel design.”  – (Xie et al. 2019, Facebook AI Research) 

Of course, we can also educate the monkey and let it type in a less random fashion. A search strategy that suggests itself in this context is the class of evolutionary algorithms.
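A minimal sketch of such an "educated monkey": an evolutionary search over hypothetical (depth, width) architecture encodings. The fitness function here is a stand-in; in real NAS, each candidate would be trained and evaluated on a benchmark:

```python
import random

# Minimal evolutionary architecture search: candidates are (depth, width)
# tuples; the best half of each generation survives and produces mutated
# offspring. The fitness function below is a toy stand-in for "train the
# candidate and measure its validation accuracy".
def fitness(arch):
    depth, width = arch
    return -(depth - 6) ** 2 - (width - 64) ** 2 / 100  # toy objective

def mutate(arch, rng):
    depth, width = arch
    return (max(1, depth + rng.choice([-1, 0, 1])),
            max(8, width + rng.choice([-16, 0, 16])))

def evolve(generations=50, pop_size=20, seed=0):
    rng = random.Random(seed)
    pop = [(rng.randint(1, 12), rng.choice(range(8, 257, 8)))
           for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        parents = pop[: pop_size // 2]             # keep the best half
        children = [mutate(rng.choice(parents), rng) for _ in parents]
        pop = parents + children                   # next generation
    return max(pop, key=fitness)

best = evolve()
print(best)
```

No gradients are needed here; the search only requires that candidates can be scored, which is exactly why such methods also work for non-differentiable models.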

Or make existing models smarter

Researchers from Boston and Zurich published a very promising idea only a few months ago: Shapeshifter Networks. Instead of reusing neurons within one network, as brains do, they suggest reusing at least some of the parameters between neurons. With this, the effective number of parameters to learn can be drastically reduced: they create high-performing models using as little as 1% of the parameters of existing models. That in turn lowers training time and energy consumption.
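The core idea of parameter reuse can be sketched in a few lines: the same weight matrix serves several layers, shrinking the learnable parameter count. Shapeshifter Networks use a far more elaborate learned mapping between parameters and layers; this is only an illustration of the principle:

```python
import numpy as np

# Cross-layer parameter sharing: one weight matrix is reused by every
# layer of a small stack, so only dim*dim parameters must be learned
# instead of n_layers*dim*dim.
rng = np.random.default_rng(0)
dim, n_layers = 64, 6

shared_w = rng.normal(0, 0.1, (dim, dim))   # one matrix, reused 6 times

def forward(x):
    for _ in range(n_layers):
        x = np.tanh(x @ shared_w)           # every layer reuses shared_w
    return x

y = forward(rng.normal(0, 1, dim))

unshared_params = n_layers * dim * dim
shared_params = dim * dim
print(f"parameters: {unshared_params} -> {shared_params} "
      f"({shared_params / unshared_params:.0%})")
```

The model keeps its depth (and thus its representational structure) while the memory footprint, and with it the training cost, shrinks.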

Free lunch in the depth? 

We have suggested several alternative AI strategies. However, it is important to be aware that there is no single solution for all optimization problems; this is the statement of the "no free lunch" theorem of machine learning. Deep Learning methods can often be applied quite universally, but the theorem still holds.

Getting stuck in the deepest valley

In the future, the discussion of alternatives to traditional machine learning will be guided by one key insight from biology: there is no such thing as an optimal solution when it comes to interacting with an unbelievably complex and dynamic physical reality. Evolution does not face the no-free-lunch theorem because it is not driving towards an optimal solution. Yet it is the most successful path towards intelligence that we have discovered so far.

Biology makes use of the presence of multiple sub-optimal solutions. That allows it to hop back and forth between them, and thereby enables action. That is how the nanomachines that make up our bodies work. For example, see the opening and closing of a biological neuron's ion channel in Figure 6.

If everything were driven towards just one global optimum, any internal dynamics would be gone. Take our body as an example: it would be stuck in an optimal pose as long as there were no substantial changes in the environment shifting the optimum and triggering a new search for it. A changing world simply cannot be described by a fixed weight matrix that was trained on one task or a finite number of tasks.

Currently, the industry tackles this issue with active and adaptive learning approaches in which the weights are continuously updated upon new experiences. However, a sudden jump from one configuration (sub-optimal solution) to another, triggered by varying circumstances, is not really considered yet. Instead, under the hood, subnetworks with a massive number of nodes are trained for each possible circumstance (which could be referred to as different tasks). But as we saw, fostering larger and larger networks is not the only solution! It would be more elegant if AI systems could identify and react to changed circumstances, especially in the context of hyper-automation, which Gartner identified as one of the current top trends in technology. AI systems need to be able to switch the configuration of their existing architecture automatically, or just change the readout of their computational reservoir dynamically. Like in nature.

It is not wrong to be suspicious of mainstream approaches. Geoffrey Hinton even encourages us to think outside the box:

“The future depends on some graduate student who is deeply suspicious of everything I have said… My view is throw it all away and start again.” (Geoffrey Hinton, University of Toronto)

However, sooner or later, we need to say goodbye to the idea of learning the optimal weights of a neural network. Weights can only be optimized with respect to stable conditions.  

New performance indicators are needed

Typically, an artificial neural network is evaluated by its accuracy (or some other performance metric) on one task. An interesting alternative that will become more and more important in the future is its accuracy relative to its energy consumption. When scaling up a network, each additional unit requires more energy, and energy is costly. Beyond the environmental factors, low energy consumption will remain a crucial topic in the world of IoT, small medical devices, and blockchains. In the realm of more general AI, another performance indicator for comparing network architectures is the number of possible applications relative to energy consumption: how many tasks can the network handle, and at what energy cost?

“The human brain—that original source of intelligence—provides important inspiration here. Our brains are incredibly efficient relative to today’s deep learning methods. They weigh a few pounds and require about 20 watts of energy, barely enough to power a dim lightbulb. Yet they represent the most powerful form of intelligence in the known universe.”  – Rob Toews @ Forbes magazine 

Energy consumption costs not only money but also CO₂. The recently published ML CO2 Impact calculator helps you estimate the CO₂ footprint of machine learning. Tools like this help implement the suggested KPIs for your next AI project.
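The underlying estimate such tools make is simple: energy is power draw times time, and emissions are energy times the grid's carbon intensity. A rough sketch of such an estimate (all constants, including the data-center PUE and the grid intensity, are illustrative assumptions):

```python
# Back-of-the-envelope CO2 estimate for a training run, in the spirit of
# tools like ML CO2 Impact. All constants are illustrative assumptions:
# PUE is the data-center overhead factor, grid_kg_per_kwh a rough global
# average carbon intensity of electricity.
def training_co2_kg(gpu_power_w, n_gpus, hours, pue=1.6,
                    grid_kg_per_kwh=0.475):
    """Estimate training emissions in kg CO2."""
    energy_kwh = gpu_power_w * n_gpus * hours / 1000 * pue
    return energy_kwh * grid_kg_per_kwh

# Hypothetical run: 8 GPUs drawing 300 W each for 72 hours
print(f"{training_co2_kg(300, 8, 72):.1f} kg CO2")
```

Reporting such a number next to the accuracy of a model is exactly the kind of KPI suggested above.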

The free lunch for you 

An important message for executives is that the very desire to create an intelligence close to human intelligence keeps producing enormous talent and promising technologies that find their way into practical applications. It is therefore important neither to be blinded by lofty promises nor to be closed to innovative approaches beyond the established ones. Lofty goals may be riskier, but if successful, the positive effect is all the more dramatic.

Yes, it is good advice to evaluate promises in the field of AI by relating them to what we find in nature. However, this comparison should be done carefully. Nature is an incredibly good guideline and provides inspiration for the most promising and sustainable technologies. And so far, it works (much) better than anything mankind has invented, including AI.


In the 1940s, AI pioneers mimicked the structure of the human brain. AI has since diverged and had tremendous success on its own, but the state of the art of Deep Learning today is still far from human intelligence. Industries happily apply the discipline "as is". But recent research shows how fruitful the use of biologically inspired hardware and software could be. This article aims to highlight some promising work in that direction, and to urge AI practitioners to keep an open mind towards these advances.


Johannes Nagele

Dr. Johannes Nagele is a Senior Data Scientist at Alexander Thamm GmbH. As a scientist in physics and computational neuroscience, he gained 10 years of experience in statistics, data analysis, and artificial intelligence, with a focus on time series analysis and unsupervised learning. Dr. Johannes Nagele is the author of several scientific publications and conference posters. Since the beginning of 2020, he has been supporting Alexander Thamm GmbH in the area of data science.

