Since the launch of ChatGPT in November 2022, Large Language Models (LLMs) have become the center of focus in Artificial Intelligence (AI) and its applications in business use cases. The availability of state-of-the-art machine learning models, such as the ChatGPT interface, the OpenAI playground, or open-source models provided by HuggingFace, makes these techniques widely accessible to both technical and non-technical users. An essential aspect of effective interaction with LLMs is the correct specification of input prompts: How do I need to phrase the input to optimize the LLM’s output? More and more insights about how to write optimal prompts for downstream tasks are shared on social media and in blog articles discussing ‘Do’s and Don’ts’. Most of these articles focus on ‘Prompt Engineering’ techniques, where human users optimize a written prompt until they reach the desired output (e.g., ‘Answer in the style of a mechanic’ or ‘Here is an example answer – please answer my question in a similar fashion’).
Prompt engineering is an astonishingly simple and cheap way to improve the output of LLMs. However, since human understanding of the actual computations underlying LLMs is limited, so is our ability to optimize prompts by hand. Recent research in AI has therefore explored ways to optimize prompts using AI algorithms. One promising approach is Reinforcement Learning (RL) for prompt optimization.
RL is a powerful technique for finding simple solutions to complex problems. It is based on trial-and-error learning and is particularly useful when the goal is known (e.g. obtaining good output from an LLM) but the path to achieve that goal is not (e.g. writing a ‘good’ prompt to obtain a specific output). This article illustrates approaches to optimizing prompts for LLMs and explains how the RL framework can be used to do so. Future articles will compare such approaches and describe important aspects of these techniques in more detail, such as the nature of RL-generated prompts, their applicability in real-world scenarios, and security considerations. Before diving into the main topic of prompt optimization with RL, however, it is important to position RL in the context of LLM fine-tuning.
What is Reinforcement Learning (RL)?
Reinforcement Learning (RL) is a subfield of machine learning in which an agent learns to perform an ‘optimal’ action in a sequential decision process (see Figure 1). The main components of RL are the environment, agent, state, action, policy, and reward. The environment determines the setting in which an agent can perform actions based on a policy, i.e. the probability distribution over actions given the current state of the environment. The policy is learned via trial and error: Actions that lead to a positive reward increase in probability, whereas actions that lead to no reward or to punishment (negative reward) tend to be avoided. For a more detailed illustration of RL algorithms please refer to our previous article on RL frameworks and algorithms.
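To make this trial-and-error loop concrete, the following minimal sketch implements it for a toy bandit-style environment with three actions, an epsilon-greedy policy, and an incremental value update. The environment, reward probabilities, and hyperparameters are purely illustrative assumptions, not part of any particular RL library.

```python
import random

# Toy environment: three actions with hidden payoff probabilities.
REWARD_PROBS = [0.2, 0.5, 0.8]
q_values = [0.0, 0.0, 0.0]   # the agent's current value estimate per action
counts = [0, 0, 0]
EPSILON = 0.1                # exploration rate

def choose_action():
    """Epsilon-greedy policy: mostly exploit the best estimate, sometimes explore."""
    if random.random() < EPSILON:
        return random.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda a: q_values[a])

def step(action):
    """Environment returns a reward of 1 or 0 for the chosen action."""
    return 1.0 if random.random() < REWARD_PROBS[action] else 0.0

for episode in range(1000):
    action = choose_action()
    reward = step(action)
    counts[action] += 1
    # Incremental update: actions that yield reward grow in estimated value.
    q_values[action] += (reward - q_values[action]) / counts[action]

print(q_values)  # the estimate for the last action should approach 0.8
```

Actions that are rewarded more often end up with higher value estimates and are therefore chosen more frequently, which is exactly the trial-and-error dynamic described above.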
For a compact introduction to the definition and terminology behind reinforcement learning, read our basic article on the methodology:
What are Large Language Models (LLMs)?
Large Language Models (LLMs) are complex neural networks designed to understand and generate human-like text. LLMs process vast amounts of text data, thus learning the patterns and structure of language. Essentially, these models work by predicting the next word in a sequence of words or sentences.
They can do so based on the current context provided by a user and based on the huge amounts of publicly available data (like Reddit, Wikipedia, etc.) that they were trained on. By analyzing and learning from enormous datasets, LLMs generate coherent text, assist in language translation, answer questions, and perform various language-related tasks, e.g. classification, sentiment analysis, or information extraction. LLMs are great at solving generic tasks, but they are less effective when the task at hand is very specific to a particular domain. Therefore, these ‘pre-trained’ LLMs need to be adjusted or fine-tuned using domain-specific datasets. Pre-trained LLMs are provided either via an API (e.g. from OpenAI or Aleph Alpha) or via open-source frameworks like HuggingFace.
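As a small illustration of this next-word prediction, the sketch below uses the HuggingFace transformers library mentioned above to let an open-source model continue a prompt token by token. The model name (‘gpt2’) is chosen here only because it is small and freely available; any causal language model would do.

```python
from transformers import pipeline

# Load a small, freely available causal language model for illustration.
generator = pipeline("text-generation", model="gpt2")

prompt = "Large Language Models generate text by"
# The model extends the prompt by predicting one token at a time.
output = generator(prompt, max_new_tokens=20, num_return_sequences=1)

print(output[0]["generated_text"])
```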
Large language models are transforming interaction with technology and expanding its application from content creation to customer service. Our overview presents 14 relevant representatives in detail:
Prompt optimization techniques
There are two popular ways to optimize the output of LLMs: adjusting the model itself (‘fine-tuning’ it) and optimizing the input prompt to obtain a desired output (prompt optimization). Fine-tuning is computationally expensive and time-consuming, as it involves re-training model parameters with domain-specific data, and it demands significant resources and technical expertise. Prompt optimization, on the other hand, is a simpler method that does not touch the model parameters at all. Rather, it is about refining the model’s input prompts. It requires significantly less computational resources, and different prompting techniques can be tested and compared quickly. Consequently, many optimization objectives can be met by optimizing prompts rather than the model parameters or architecture itself. It comes as no surprise that ‘prompt engineer’ has been declared one of the most important jobs in the future development of LLMs.
Importantly, there are different approaches to perform prompt optimization:
- Prompt Engineering: This involves human users crafting and testing prompts written in plain text, so the individual optimization steps can easily be understood and interpreted. It includes structuring prompts, incorporating specific instructions, providing context, and using keywords to guide the model toward the desired output.
- Soft Prompt Tuning: While prompt engineering modifies human-written prompts, Soft Prompt Tuning changes the embedding of the prompt that is fed into the pre-trained LLM. An embedding is the numerical representation of the prompt. Instead of embedding the prompt text using the pre-trained LLM, a smaller trainable model is placed in front of the frozen LLM. This means that the parameters of the LLM are not updated during the training process of the smaller prompt embedding model. As such, Soft Prompt Tuning is specifically advantageous for producing better prompt embeddings for downstream tasks without the expensive update of all LLM parameters (see the code sketch after this list). It marks an interesting example of ‘parameter-efficient fine-tuning’ that is specifically tailored to prompting.
- Auto-Prompt: Auto-Prompt optimizes prompts by automatically adding ‘Trigger Tokens’ to an initial prompt template. These trigger tokens act as cues or hints that signal to the LLM what kind of response is desired. For example, we could provide sentiment categories and keywords belonging to each category as cues that the model looks for in a text in order to predict its sentiment.
- Reinforcement Learning: As described above, RL learns useful policies via trial and error. In the context of prompt optimization, RL can be used to optimize the prompt input so that the LLM produces valuable output. The quality of the LLM output serves as the reward signal, and specific prompt formulations can be understood as different actions, contingent on the current context, i.e. the state of the environment. Consequently, an RL agent may find useful prompt input that is not recognizable as natural language by human observers, e.g. because it contains special character sequences.
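To make the Soft Prompt Tuning idea from the list above more tangible, here is a minimal sketch that prepends a small set of trainable embedding vectors to a frozen GPT-2 from HuggingFace. The model choice, the number of soft tokens, and the single dummy training example are assumptions made purely for illustration; a real setup would train over a task-specific dataset.

```python
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
for p in model.parameters():          # freeze the pre-trained LLM
    p.requires_grad = False

n_soft_tokens = 10
embed_dim = model.config.n_embd
# The only trainable parameters: a handful of soft prompt embedding vectors.
soft_prompt = torch.nn.Parameter(torch.randn(n_soft_tokens, embed_dim) * 0.01)
optimizer = torch.optim.Adam([soft_prompt], lr=1e-3)

# Dummy training example for illustration only.
text = "This movie was great. Sentiment: positive"
ids = tokenizer(text, return_tensors="pt").input_ids
token_embeds = model.transformer.wte(ids)                  # (1, seq, dim)
inputs_embeds = torch.cat(
    [soft_prompt.unsqueeze(0), token_embeds], dim=1        # prepend soft prompt
)
# Label the soft-prompt positions with -100 so they are ignored in the loss.
labels = torch.cat([torch.full((1, n_soft_tokens), -100), ids], dim=1)

loss = model(inputs_embeds=inputs_embeds, labels=labels).loss
loss.backward()
optimizer.step()  # only the soft prompt embeddings are updated
```

Because only the small soft prompt tensor receives gradient updates, the memory and compute footprint is a tiny fraction of full fine-tuning, which is exactly why this counts as parameter-efficient fine-tuning.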
Learning an ‘optimal’ prompt using Reinforcement Learning
Reinforcement Learning in the context of LLMs became popular for its benefits in optimizing the output of an LLM by aligning the generated text with the text desired by human users. In those cases, a technique called ‘Reinforcement Learning from Human Feedback’ (RLHF) is used to incorporate human control into the training process of the LLM. The human feedback is used as a reward to train a policy that produces an optimal answer in the course of the dialogue. You can find a detailed overview of RLHF in our article on Reinforcement Learning from Human Feedback (RLHF) in the field of large language models.
RLHF forms the basis for the enormous success of ChatGPT. In contrast to RLHF during the initial training process, reinforcement learning for prompt optimization does not optimize the pre-trained LLM. Instead, it trains an RL agent that generates or optimizes the prompt. A ‘Policy LLM’ therefore randomly generates a prompt at the start of the training process. This prompt, together with the context, is used as input for a ‘Task LLM’ to perform a downstream task, e.g. classification. The output of the Task LLM is compared with the desired output, and the difference serves as reward feedback to the Policy LLM, which in turn adjusts its parameters so that the prompt generated in the next training iteration results in a greater reward and thus a better output.
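The following highly simplified sketch mirrors this loop: a stand-in ‘policy’ proposes a prompt, a stubbed task LLM classifies an input given that prompt, and the match with the desired label is fed back as a reward. The candidate prompts, the stubbed task LLM, and the tabular epsilon-greedy policy are all assumptions chosen to keep the example self-contained and runnable; real systems such as RLPROMPT use transformer-based policies and policy-gradient updates.

```python
import random

CANDIDATE_PROMPTS = [
    "Classify the sentiment:",
    "Sentiment (positive/negative):",
    "Rate this review:",
]
scores = {p: 0.0 for p in CANDIDATE_PROMPTS}   # the policy's preference per prompt
counts = {p: 0 for p in CANDIDATE_PROMPTS}

def task_llm(prompt, text):
    """Stand-in for a real Task LLM call; returns a predicted label."""
    # Hypothetical behaviour: some prompt phrasings simply work better than others.
    accuracy = {CANDIDATE_PROMPTS[0]: 0.6, CANDIDATE_PROMPTS[1]: 0.9,
                CANDIDATE_PROMPTS[2]: 0.5}[prompt]
    true_label = "positive" if "great" in text else "negative"
    if random.random() < accuracy:
        return true_label
    return "negative" if true_label == "positive" else "positive"

def reward(prediction, desired):
    """Reward: 1 if the Task LLM output matches the desired output, else 0."""
    return 1.0 if prediction == desired else 0.0

dataset = [("The movie was great", "positive"), ("Terrible plot", "negative")]

for step in range(500):
    # Epsilon-greedy stand-in for the Policy LLM: pick the prompt expected to work best.
    if random.random() < 0.1:
        prompt = random.choice(CANDIDATE_PROMPTS)
    else:
        prompt = max(scores, key=scores.get)
    text, desired = random.choice(dataset)
    r = reward(task_llm(prompt, text), desired)
    counts[prompt] += 1
    scores[prompt] += (r - scores[prompt]) / counts[prompt]

print(max(scores, key=scores.get))  # the prompt formulation that earned the most reward
```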
Importantly, and quite in contrast to human-crafted prompt optimization techniques, prompt optimization with RL may lead to ‘nonsensical’ input prompts that do not resemble human language. Instead, prompts may consist of special characters and sequences of letters in addition to the input question. This, of course, also raises security risks, such as using RL to find nonsensical prompts that trigger LLMs to output sensitive data.
One interesting approach to optimizing prompts with RL is called ‘RLPROMPT’ (Optimizing Discrete Text Prompts with Reinforcement Learning, https://arxiv.org/pdf/2205.12548.pdf), which a subsequent blog post will describe in more detail.
Deepen your understanding of the concept of the ‘Deadly Triad’ in reinforcement learning, its implications, and ways to deal with it. This Deep Dive provides you with an overview of RL concepts, an introduction to the ‘Deadly Triad’, and strategies for coping with it.
Better prompts with reinforcement learning
The field of Reinforcement Learning has been incredibly successful at driving major developments within Artificial Intelligence. It has only just begun to make significant contributions to the field of Large Language Models. Without RL, and particularly RLHF, the development of ChatGPT and other successful LLM applications would not have been possible. Additionally, RL proves to be highly useful for optimizing LLM input rather than just its output. The key advantage of RL is its algorithmic approach to prompt optimization, as opposed to approaches based on human judgement and interpretation. This may lay the groundwork for substantial advances in the optimization of LLMs, but it also poses critical risks, such as safety issues and ethical concerns, as we will discuss in future blog posts.