Prompt Optimization with Reinforcement Learning in Large Language Models

Since the launch of ChatGPT in November 2022, Large Language Models (LLMs) have become the center of attention in Artificial Intelligence (AI) and its applications in business use cases. The availability of state-of-the-art machine learning models through the ChatGPT interface, the OpenAI playground, or open-source models provided by HuggingFace makes these techniques widely accessible to both technical and non-technical users. An essential aspect of effective interaction with LLMs is the correct specification of input prompts: how do I need to phrase the input to optimize the LLM’s output? More and more insights about how to write optimal prompts for downstream tasks are shared on social media and in blog articles discussing ‘Do’s and Don’ts’. Most of these articles focus on ‘Prompt Engineering’ techniques, where human users refine a written prompt until they reach the desired output (e.g., ‘Answer in the style of a mechanic’ or ‘Here is an example answer. Please answer my question in a similar fashion.’).

Prompt engineering is an astonishingly simple and cheap way to improve the output of LLMs. However, since human understanding of the actual computations underlying LLMs is limited, so is the ability of human users to optimize prompts. Recent research in AI has explored ways to optimize prompts using AI algorithms. One promising approach is Reinforcement Learning (RL) for prompt optimization.

RL is a powerful technique to find simple solutions for complex problems. It is based on trial-and-error learning, which is particularly useful when the goal is known (e.g. obtaining good output from an LLM) but the path to achieve that goal is not (e.g. writing a ‘good’ prompt to obtain a specific output). This article illustrates approaches to optimizing prompts for LLMs and explains how the RL framework can be used to do so. Future articles will compare such approaches and describe important aspects of these techniques in more detail, such as the nature of RL-generated prompts, their applicability in real-world scenarios, and their security implications. Before diving into the main topic of prompt optimization with RL, however, it is important to position RL in the context of LLM fine-tuning.

What is Reinforcement Learning (RL)? 

Reinforcement Learning (RL) is a subfield of machine learning where an agent learns to perform an ‘optimal’ action in a sequential decision process (see Figure 1). The main components of RL are the environment, agent, state, action, policy, and reward. The environment determines the setting in which an agent can perform actions based on a policy, i.e. the probability distribution over actions given the current state of the environment. The policy is learned via trial and error: actions that lead to a positive reward increase in probability, whereas actions that lead to no reward or to punishment (negative reward) tend to be avoided. For a more detailed illustration of RL algorithms, please refer to our previous article on RL frameworks and algorithms.

Figure 1: Reinforcement Learning Setup. Agents learn via trial and error to perform desired actions in a given environment. After selecting an action, agents receive feedback from the environment via reward signals (‘How good was my action given the current circumstances?’). Based on these signals, agents learn to refine their actions to maximize the reward.
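The trial-and-error principle described above can be illustrated with a minimal sketch. It assumes a deliberately simplified setting with a single state and three actions, and it uses a softmax policy with a simple policy-gradient update; the function names, the learning rate, and the reward values are illustrative assumptions and not taken from a specific RL library.

```python
import math
import random


def softmax(prefs):
    """Turn action preferences into a probability distribution (the policy)."""
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_policy(rewards, steps=3000, lr=0.1, seed=0):
    """Learn a policy over actions by trial and error (single-state toy case)."""
    rng = random.Random(seed)
    prefs = [0.0] * len(rewards)  # one learnable preference per action
    baseline = 0.0                # running average of the rewards seen so far
    for step in range(1, steps + 1):
        policy = softmax(prefs)
        action = rng.choices(range(len(prefs)), weights=policy)[0]
        reward = rewards[action]  # feedback from the (toy) environment
        advantage = reward - baseline
        baseline += (reward - baseline) / step
        # Raise the chosen action's preference if it did better than average,
        # lower it if it did worse.
        for a in range(len(prefs)):
            indicator = 1.0 if a == action else 0.0
            prefs[a] += lr * advantage * (indicator - policy[a])
    return softmax(prefs)


# Hypothetical environment: action 0 is punished, action 1 is neutral,
# action 2 is rewarded.
policy = train_policy([-1.0, 0.0, 1.0])
print([round(p, 3) for p in policy])
```

In this toy setting, the probability of the rewarded action should grow over the course of training while the punished action becomes less likely, which mirrors the learning rule described in the text.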

For a compact introduction to the definition and terminology behind reinforcement learning, read our basic article on the methodology:

Reinforcement Learning: compactly explained

What are Large Language Models (LLMs)?

Large Language Models (LLMs) are complex neural networks designed to understand and generate human-like text. LLMs process vast amounts of text data, thus learning the patterns and structure of language. Essentially, these models work by predicting the next word in a sequence of words or sentences.  

They can do so based on the current context that is provided by a user and based on the huge amounts of publicly available data (such as Reddit, Wikipedia etc.) that they were trained on. By analyzing and learning from enormous datasets, LLMs generate coherent text, assist in language translation, answer questions, and perform various language-related tasks, e.g. classification, sentiment analysis, or information extraction. LLMs are great at solving generic tasks, but they are less effective when the task at hand is very specific to a particular domain. Therefore, these ‘pre-trained’ LLMs need to be adjusted or fine-tuned using domain-specific datasets. Pre-trained LLMs are either provided via an API (like from OpenAI or Aleph Alpha) or via open-source frameworks like HuggingFace.
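To make the idea of next-word prediction concrete, here is a toy sketch that simply counts which word follows which in a tiny corpus. This is not how an LLM works internally (LLMs are neural networks trained on vastly more data); the corpus and function names are assumptions chosen for illustration only.

```python
from collections import Counter, defaultdict


def train_bigram(corpus):
    """Count, for every word, which words follow it in the corpus."""
    counts = defaultdict(Counter)
    for sentence in corpus:
        words = sentence.lower().split()
        for prev, nxt in zip(words, words[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_distribution(counts, word):
    """Return the estimated probability of each possible next word."""
    following = counts.get(word.lower())
    if not following:
        return {}  # word never seen, or never followed by another word
    total = sum(following.values())
    return {w: c / total for w, c in following.items()}


# Tiny illustrative corpus (an assumption, not real training data).
corpus = [
    "the agent receives a reward",
    "the agent selects an action",
    "the model predicts the next word",
]
counts = train_bigram(corpus)
dist = next_word_distribution(counts, "the")
print(dist)
```

The resulting dictionary maps each candidate next word to its estimated probability given the context word; an LLM produces a comparable distribution, but conditioned on the entire preceding context rather than on a single word.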


Large language models are transforming interaction with technology and expanding its application from content creation to customer service. Our overview presents 14 relevant representatives in detail:

The 14 Top Large Language Models: A Comprehensive Guide

Prompt optimization techniques 

There are two popular ways to optimize the output of LLMs: adjusting the model itself (‘fine-tuning’ it) and optimizing the input prompt to obtain a desired output (prompt optimization). Fine-tuning is computationally expensive and time-consuming, as it involves re-training model parameters with domain-specific data. It demands significant resources and technical expertise. Prompt optimization, on the other hand, is a simpler method that leaves the model parameters untouched and instead tunes the model’s input prompts. It requires significantly fewer computational resources, and different prompting techniques can be tested and compared quickly. Consequently, many optimization objectives can be addressed by optimizing prompts rather than the model parameters or the architecture itself. It comes as no surprise that the profession of ‘prompt engineer’ has been declared one of the most important jobs in the future development of LLMs.

Importantly, there are different approaches to prompt optimization:

  1. Prompt Engineering: Human users craft and test prompts written in plain text, so each optimization step is easy to interpret. It includes structuring prompts, incorporating specific instructions, providing context, and using keywords to guide the model toward the desired output.
  2. Soft Prompt Tuning: While prompt engineering modifies human-generated prompts, Soft Prompt Tuning changes the embedding of the prompt that is fed into the pre-trained LLM. An embedding is the numerical representation of the prompt. Instead of embedding the prompt text using the pre-trained LLM, a smaller trainable model is placed in front of the frozen LLM. ‘Frozen’ means that the parameters of the LLM are not updated while the smaller prompt embedding model is trained. As such, Soft Prompt Tuning is particularly advantageous for producing better prompt embeddings for downstream tasks without the expensive update of all LLM parameters. It is an interesting example of ‘parameter-efficient fine-tuning’ that is specifically tailored to prompting.
  3. AutoPrompt: AutoPrompt optimizes prompts by automatically adding ‘trigger tokens’ to an initial prompt template. These trigger tokens act as cues that signal to the LLM what kind of response is desired. For example, for sentiment analysis we would provide sentiment categories and keywords belonging to each category as cues for predicting the sentiment of a text.
  4. Reinforcement Learning: As described above, RL learns useful policies via trial and error. In the context of prompt optimization, RL can be used to optimize the prompt input to obtain valuable output from an LLM. The quality of the LLM output serves as the reward signal, and the different input prompt formulations can be understood as actions that are contingent on the current context, i.e. the state of the environment. Consequently, this allows an RL agent to find useful prompt input that human observers may not recognize as natural language, e.g. special character sequences.
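As an illustration of the trigger-token idea from the list above, the following sketch searches for tokens that improve a task score when added to a prompt. It is a simplified stand-in and not the AutoPrompt implementation: it swaps in a plain greedy search, and `toy_accuracy`, `CUE_STRENGTH`, the candidate tokens, and the dataset are all hypothetical placeholders for a real LLM evaluated on labeled examples.

```python
# Toy labeled examples and word lists (assumptions for illustration).
DATASET = [
    ("i love this great phone", "positive"),
    ("awful battery and bad screen", "negative"),
    ("good value", "positive"),
    ("i hate the design", "negative"),
]
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"awful", "bad", "hate"}
# Hypothetical: how strongly each token steers the toy model toward the task.
CUE_STRENGTH = {"sentiment": 2, "opinion": 2, "review": 1}


def toy_model(triggers, text):
    """Stand-in for a frozen LLM that needs cues to attend to sentiment."""
    focus = sum(CUE_STRENGTH.get(tok, 0) for tok in triggers)
    words = text.split()
    evidence = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if focus * abs(evidence) < 3:
        return "neutral"  # prompt too weak: the model gives an unhelpful answer
    return "positive" if evidence > 0 else "negative"


def toy_accuracy(triggers):
    """Fraction of examples the toy model labels correctly with these triggers."""
    hits = sum(toy_model(triggers, text) == gold for text, gold in DATASET)
    return hits / len(DATASET)


def greedy_trigger_search(candidates, score_fn, max_triggers=3):
    """Greedily append the candidate token that most improves score_fn."""
    triggers = []
    best_score = score_fn(triggers)
    for _ in range(max_triggers):
        best_token = None
        for token in candidates:
            if token in triggers:
                continue
            score = score_fn(triggers + [token])
            if score > best_score:
                best_token, best_score = token, score
        if best_token is None:  # no candidate improves the score any further
            break
        triggers.append(best_token)
    return triggers, best_score


candidates = ["banana", "review", "sentiment", "opinion", "table"]
triggers, accuracy = greedy_trigger_search(candidates, toy_accuracy)
print(triggers, accuracy)
```

The search only ever observes the score of a candidate prompt, never the inner workings of the model, which is the same information an automated prompt optimizer has when it treats the LLM as fixed.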

Learning an ‘optimal’ prompt using Reinforcement Learning 

Reinforcement Learning in the context of LLMs became popular for its benefits in optimizing the output of an LLM by aligning the generated text with the text desired by human users. In those cases, a technique called ‘Reinforcement Learning from Human Feedback’ (RLHF) is used to incorporate human feedback into the training process of the LLM. This feedback is then used as a reward to train a policy that produces an optimal answer over the course of the dialogue. You can find a detailed overview of RLHF in our article on Reinforcement Learning from Human Feedback (RLHF) in the field of large language models.

RLHF forms the basis for the enormous success of ChatGPT. In contrast to RLHF, which is applied during the training process of the LLM, reinforcement learning for prompt optimization does not modify the pre-trained LLM. Instead, it trains an RL agent that generates or optimizes the prompt. At the start of the training process, this ‘Policy LLM’ generates a random prompt. This prompt, together with the context, is used as input for a ‘Task LLM’ that performs a subsequent task, e.g. classification. The output of the Task LLM is compared with the desired output, and the difference serves as reward feedback to the Policy LLM, which in turn adjusts its parameters so that the prompt generated in the next training iteration yields a greater reward and thus a better output.

Figure 2: Reinforcement learning in prompt optimization. Using the same approach as displayed in Figure 1, we can train an RL agent to produce reasonable input prompts. This ‘Policy LLM’ produces input prompts based on a provided context (previous conversation and/or a user question). This combined prompt is then sent to another ‘Task LLM’, which produces an output. This output is checked against a ‘gold standard’ desired output, and the deviation between actual and desired output forms the basis for the reward signal that is sent back to the Policy LLM. The Policy LLM in turn learns from this feedback and optimizes its input prompts.
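The training loop described above can be sketched in a few lines. This is a heavily simplified illustration under stated assumptions: the ‘Policy LLM’ is replaced by a small table of softmax preferences over a hypothetical token vocabulary, the ‘Task LLM’ by a hand-written toy function, and the update rule is a basic policy-gradient step. None of the names or values below come from a specific library or from a published method.

```python
import math
import random

# Hypothetical vocabulary from which the policy picks prompt tokens.
VOCAB = ["sentiment", "review", "banana", "table", "##!!"]
# Toy labeled examples standing in for the 'gold standard' outputs.
DATASET = [
    ("i love this great phone", "positive"),
    ("awful battery and bad screen", "negative"),
    ("good value", "positive"),
    ("i hate the design", "negative"),
]
POSITIVE = {"love", "great", "good"}
NEGATIVE = {"awful", "bad", "hate"}
# How strongly each token steers the toy Task LLM; unknown to the policy.
CUE_STRENGTH = {"sentiment": 2, "review": 1, "##!!": 2}


def toy_task_llm(prompt_tokens, text):
    """Stand-in for a frozen Task LLM: returns a label for `text`."""
    focus = sum(CUE_STRENGTH.get(tok, 0) for tok in prompt_tokens)
    words = text.split()
    evidence = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if focus * abs(evidence) < 3:
        return "neutral"  # prompt too weak: the model gives an unhelpful answer
    return "positive" if evidence > 0 else "negative"


def reward(prompt_tokens):
    """Fraction of examples where the Task LLM output matches the gold label."""
    hits = sum(toy_task_llm(prompt_tokens, text) == gold for text, gold in DATASET)
    return hits / len(DATASET)


def softmax(prefs):
    top = max(prefs)
    exps = [math.exp(p - top) for p in prefs]
    total = sum(exps)
    return [e / total for e in exps]


def train_prompt_policy(slots=2, steps=2000, lr=0.2, seed=0):
    """Learn which token to place in each prompt slot from reward alone."""
    rng = random.Random(seed)
    prefs = [[0.0] * len(VOCAB) for _ in range(slots)]  # uniform start: random prompts
    baseline = 0.0
    for step in range(1, steps + 1):
        probs = [softmax(p) for p in prefs]
        choice = [rng.choices(range(len(VOCAB)), weights=pr)[0] for pr in probs]
        r = reward([VOCAB[i] for i in choice])  # Task LLM output vs. gold labels
        advantage = r - baseline
        baseline += (r - baseline) / step
        for s in range(slots):
            for i in range(len(VOCAB)):
                indicator = 1.0 if i == choice[s] else 0.0
                prefs[s][i] += lr * advantage * (indicator - probs[s][i])
    # Return the most probable token for each slot.
    return [VOCAB[max(range(len(VOCAB)), key=lambda i: p[i])] for p in prefs]


best_prompt = train_prompt_policy()
print(best_prompt, reward(best_prompt))
```

Because the policy only receives a reward and never sees why a token helps, it may settle on tokens such as the made-up `##!!` just as readily as on meaningful words, as long as they raise the reward.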

Importantly, and in contrast to human-crafted prompt optimization techniques, prompt optimization with RL may lead to ‘nonsensical’ input prompts that do not resemble human language. Instead, prompts may consist of special characters and sequences of letters in addition to the input question. This, of course, also raises security risks, for example when RL is used to find nonsensical prompts that trigger LLMs to output sensitive data.

One interesting approach to optimizing prompts with RL is ‘RLPROMPT’ (‘Optimizing Discrete Text Prompts with Reinforcement Learning’), which a subsequent blog post will describe in more detail.

Deepen your understanding of the "Deadly Triad" in reinforcement learning, its implications, and approaches to dealing with it. This Deep Dive provides an overview of RL concepts, an introduction to the "Deadly Triad", and strategies for coping with it.

Deadly Triad in Reinforcement Learning

Better prompts with reinforcement learning

The field of Reinforcement Learning has been incredibly successful at driving major developments within the field of Artificial Intelligence. It has only just begun to make significant contributions to the field of Large Language Models. Without RL, and particularly RLHF, the development of ChatGPT and other successful LLM applications would not have been possible. Additionally, RL proves to be highly useful in optimizing LLM input rather than just its output. The key advantage of RL is its algorithmic approach to prompt optimization, as opposed to relying on human judgement and interpretation. This may lay the groundwork for substantial advancements in the optimization of LLMs, but it also poses critical risks, such as safety issues and ethical concerns, as we will discuss in future blog posts.



Constantin Sanders

Constantin Sanders is a Senior Data Scientist at [at] with a focus on Natural Language Processing (NLP). In various data science projects, he was able to combine his scientific education (M.A. German Studies & M.Sc. Data Science) with practical experience. When he is not working on language or language processing systems, he spends a lot of time watching and playing football.

Dr Philipp Schwartenbeck

Philipp is a Principal Data Scientist and joined [at] in January 2023. Among other things, he works on reinforcement learning, which sparked his interest during his previous job as a computational neuroscientist. When he is not analysing data or thinking about reinforcement learning algorithms, he is interested in various topics ranging from Bayesian inference to competing in sheepshead tournaments.

Brijesh Modasara

Brijesh joined [at] in May 2022 as a Senior Data Scientist. His expertise lies in the field of reinforcement learning and data mining. He enjoys having interesting conversations about innovative applications of AI and reinforcement learning in particular. When he's not revolutionising the tech world, you'll find him capturing breathtaking moments through his lens, combining his love for travel and photography.
