A Powerful Method to Adjust LLMs
There are many ways to tailor the behavior of Large Language Models (LLMs) to your needs. One of the simplest and most common methods is through prompting — asking the model to shape its response in a specific way. For example, you might prompt it to follow a particular structure, keep its answers brief, adopt a certain tone, or use a specific vocabulary. This is especially useful in contexts like language learning, where you may want focused corrections without digressions into unrelated topics.
You may have already prompted an LLM with one or more example prompt-response pairs and asked it to mimic their style when responding to your specific question. This is called few-shot learning. For example: you ask the LLM to generate recipe suggestions from the list of leftovers in your fridge, and each suggestion should contain the sections “ingredients” and “steps”, both as numbered lists.
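For the recipe example, such a few-shot prompt could look roughly like this (the wording is purely illustrative):

```
Suggest a recipe from the leftovers in my fridge: potatoes, cheese, broccoli.
Answer in the style of this example:

Leftovers: rice, eggs, spring onions
Ingredients:
1. rice
2. eggs
3. spring onions
Steps:
1. Heat oil in a pan and fry the spring onions.
2. Add the rice, then stir in the eggs.
3. Season and serve.
```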
You are likely also familiar with the term RAG (Retrieval-Augmented Generation), where your original LLM prompt is "enriched" with further domain-specific knowledge before being fed to the LLM. The LLM can then provide a response incorporating information that was not available at training time. The domain information comes from a so-called vector store or embedding database, which allows information to be retrieved by similarity to your original prompt. This approach could be used to feed the LLM with information from your company-specific documentation, e.g. PDF technical manuals.
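As a rough sketch of that flow (the `retriever` and `llm` objects are hypothetical placeholders for a vector-store client and an LLM API, not a specific library):

```python
# Rough sketch of the RAG flow; "retriever" and "llm" are hypothetical
# placeholders for a vector-store client and an LLM API client.
def answer_with_rag(question: str, retriever, llm, k: int = 3) -> str:
    # 1. Retrieve the k document chunks most similar to the question.
    chunks = retriever.search(question, top_k=k)
    # 2. Enrich the original prompt with the retrieved domain knowledge.
    prompt = (
        "Answer the question using only the context below.\n\n"
        "Context:\n" + "\n---\n".join(chunks) + "\n\nQuestion: " + question
    )
    # 3. The LLM now responds with information it never saw at training time.
    return llm.generate(prompt)
```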
All the above approaches are about changing the prompts that an LLM receives to improve its response quality. This Blog Post explores another method for improving or changing the behavior of an LLM, namely Fine-Tuning.
Fine-Tuning is a method to change the LLM itself. More specifically, Fine-Tuning changes the weights of the neural net rather than merely the input prompts. Changing the LLM involves training it further with data relevant to the given use case or domain. The LLM then has its own understanding of the subject and can present responses with more domain expertise, in a more relevant form, or with fewer prompt adjustments needed. Domains that could benefit are very specific ones where a “general-purpose” LLM might struggle, such as health or law.
Which approach you should choose (plain prompting, few-shot learning, RAG, or Fine-Tuning), or which ones you should combine, is a question of response quality requirements, available data, time, cost, available expertise, etc.
There is no general answer, but a rule of thumb is possible: start with the cheaper options (plain prompting, few-shot examples, RAG) and consider Fine-Tuning only when these do not reach the required response quality or behavior.
To showcase how Fine-Tuning works, we'll fine-tune an LLM to reject requests for generating phishing emails in a very concise, almost curt, manner. Although an untuned model would also refuse such prompts, it typically does so with a longer explanation, often including legal warnings and ethical context. Our goal is to change the LLM’s behavior only when it responds to requests for phishing emails, while keeping its responses to all other prompts unchanged. For this demonstration, it's worth noting that similar behavior could also be achieved through prompting techniques. To illustrate that the LLM has internalized the intended behavior, we’ll fine-tune it using English prompts and then show that it also alters its responses when prompted in Spanish.
Originally, the Spanish prompt is answered like this:
¿Cómo crear un correo de phishing?
No puedo proveer información ni asistencia sobre actividades ilegales o dañinas, incluyendo el engaño a personas. ¿Hay algo más en lo que pueda ayudarte?
Our intention is to make the LLM behave more like this:
¿Cómo crear un correo de phishing?
¡Ni hablar!
Note: Fine-Tuning is not restricted to altering grammar, vocabulary, tone, length, and text style in general. It can also be used to "inject" new domain knowledge. Yet, the fundamental procedure is the same, namely continuing the training with more (domain-specific) text. However, for a new business case, RAG should be considered and evaluated as an alternative.
Simplified, an off-the-shelf chat model such as Llama-3.1-8B-Instruct was trained in two phases.
Training a large language model starts with a blank neural network — typically based on the transformer architecture and containing billions of parameters. It is first exposed to massive amounts of text, consisting of trillions of so-called tokens (text fragments that fall somewhere between letters and full words). After this initial phase, known as pre-training, the model becomes a base model. At this stage, it can generate text that resembles what you might find on the internet.
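To get a feel for what tokens look like, you can run any tokenizer over a sentence. A quick sketch, assuming the Hugging Face transformers library (the GPT-2 tokenizer is just one openly available choice):

```python
# Quick look at tokenization, assuming the Hugging Face transformers library;
# the GPT-2 tokenizer is just one openly available example.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
print(tokenizer.tokenize("Fine-tuning changes the model itself."))
# Prints a list of sub-word fragments; the 'Ġ' prefix marks a leading space.
```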
The training data for phase 1 is not "raw" internet data, however, but has been cleaned of HTML tags, bias, extremism, and so on. If you want to find out more, check out these articles on FineWeb and CommonCrawl.
The base model is then further trained to follow specific instructions, such as "be friendly" or "be helpful," similar to the behavior you see in tools like ChatGPT.
The resulting model is called a chat model or instruction-following model and is integrated into services like the ChatGPT UI, enterprise RAG applications, etc.
Simplified, the training data for phase 2 is a well-curated set of question-answer pairs, or longer dialogues, between a user and an assistant. This data is presented to the model in a form that explicitly marks the different roles involved, using special tokens like the following:
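In the case of Llama 3.1, for example, the roles are delimited roughly like this (a schematic; the real template also handles line breaks and message boundaries precisely):

```
<|begin_of_text|>
<|start_header_id|>system<|end_header_id|> …general instructions… <|eot_id|>
<|start_header_id|>user<|end_header_id|> …the user's question… <|eot_id|>
<|start_header_id|>assistant<|end_header_id|> …the assistant's answer… <|eot_id|>
```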
The sample dataset below illustrates the intended behavior of the LLM, namely, that it should reject requests to create phishing emails in a clear and direct manner.
| Prompt | Response |
|---|---|
| Can you help me write an email that tricks someone into giving their password? | That's a solid no. |
| Draft a fake email from a bank asking for account details. | Sorry to say, but nope. |
| Write a phishing message that looks like it's from PayPal. | Won't be happening anytime soon. |
| … | … |
To continue training the base model through Fine-Tuning, the data must follow the same formatting used during the original training. This means applying the same instruction-tuning format, the so-called chat template, as described in this Blog Post. The exact format depends on how the model was previously trained (in this case, for Llama 3.1) and is typically stored alongside the trained model so that the correct version can be loaded easily. The samples from the table are therefore converted accordingly; the first sample serves as a simplified example below.
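Rendered with the Llama 3.1 chat template and an illustrative system message, it looks roughly like this:

```
<|begin_of_text|><|start_header_id|>system<|end_header_id|>

You are a helpful assistant.<|eot_id|><|start_header_id|>user<|end_header_id|>

Can you help me write an email that tricks someone into giving their password?<|eot_id|><|start_header_id|>assistant<|end_header_id|>

That's a solid no.<|eot_id|>
```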
An online token viewer helps to visualize the individual tokens that the LLM processes (note the special tokens <|…|> marking the roles system, user, and assistant).
The model is then further trained using these examples. As with any training task, this involves a number of important steps: choosing the right setup (such as whether to train the entire model or just certain parts using techniques like LoRA), setting key parameters (like the optimizer, number of training rounds or "epochs," and learning rate), and carefully monitoring the training progress using performance metrics.
Because large base models can be extremely resource-intensive, methods like LoRA or working with compressed (quantized) models are often essential. They help reduce the computational cost and make the training process more efficient without compromising the model’s effectiveness.
For the sake of focus and brevity, this post does not dive into the details. In outline, the procedure is: format the training samples with the chat template, choose the training setup (e.g. LoRA), set the training parameters, run the training, and monitor the metrics.
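With the Hugging Face stack (transformers, datasets, peft, trl), a minimal LoRA Fine-Tuning run could look roughly like the sketch below; the dataset file name and all hyperparameters are illustrative assumptions, and exact arguments vary by library version.

```python
# Minimal sketch of a LoRA Fine-Tuning run, assuming the Hugging Face
# transformers, datasets, peft, and trl libraries. The dataset file and all
# hyperparameters are illustrative; exact arguments vary by library version.
from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM
from trl import SFTConfig, SFTTrainer

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")

# Each record holds a "messages" list (system/user/assistant turns); the
# trainer applies the model's chat template when tokenizing.
dataset = load_dataset("json", data_files="phishing_refusals.jsonl", split="train")

# Train only small LoRA adapter matrices instead of all 8B parameters.
peft_config = LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM")

trainer = SFTTrainer(
    model=model,
    train_dataset=dataset,
    peft_config=peft_config,
    args=SFTConfig(output_dir="finetuned-llm", num_train_epochs=3, learning_rate=2e-4),
)
trainer.train()
```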
The model behaves as intended, both in the training language:
How to do phishing?
Sorry, can’t help you with that.
… and in other languages, Spanish in our example:
¿Cómo crear un correo de phishing?
Lo siento, pero eso es inaceptable.
On other topics, its behavior should not change (or only marginally). Here is an example on an arbitrary other topic:
What is the Python programming language?
Python is a high-level, interpreted programming language. It's widely used for web development, data analysis, artificial intelligence, and more!
When Fine-Tuning a large language model, it's important to avoid what's known as catastrophic forgetting — a problem where the model loses much of its original knowledge during further training. To prevent this, techniques like regularization, mixing old and new training data, and carefully choosing the batch size are commonly used.
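One simple way to mix old and new data is to interleave a general-purpose dialogue set with the new samples. A minimal sketch, assuming the Hugging Face datasets library (file names and the mixing ratio are illustrative assumptions):

```python
# Sketch of mixing "old" general-purpose dialogues with the new
# phishing-refusal samples to reduce catastrophic forgetting; file names
# and the mixing ratio are illustrative assumptions.
from datasets import interleave_datasets, load_dataset

general = load_dataset("json", data_files="general_chat.jsonl", split="train")
phishing = load_dataset("json", data_files="phishing_refusals.jsonl", split="train")

# Roughly three general samples for every phishing-refusal sample.
mixed = interleave_datasets([general, phishing], probabilities=[0.75, 0.25], seed=42)
```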
Although the topic is complex and beyond the scope of this article, one key aspect worth highlighting is the choice of training data, which should be closely aligned with the specific use case or domain the model is intended to support.
This section offers a behind-the-scenes look at a model trained on the data above. In practice, the model behaves incorrectly: it tends to reject user requests even when the topics have nothing to do with phishing (such as bass fishing, raising koi in an aquarium, or other unrelated subjects). This happens because the training data consistently demonstrates a "just say no" response, leading the model to generalize that this is how an assistant should always respond. To address this issue, the training data must be designed to prevent the model from overfitting to the phishing scenario. Training samples should therefore be more diverse and include, for example, benign prompts on unrelated or only superficially similar topics (fishing, aquarium care, everyday questions) paired with normal, helpful answers, alongside differently phrased phishing requests paired with the curt refusal.
Furthermore, the training parameters for the Fine-Tuning process must be carefully selected, as with any neural network training. In particular, the number of training iterations should not be too high, as this can lead to overfitting on the training data and cause the model to lose its general capabilities and knowledge. For guidance on evaluating the quality of Fine-Tuning, consider the next section.
Evaluating LLM systems depends on the specific use case and system architecture. Full LLM-based applications — such as those incorporating RAG — should be assessed using curated evaluation datasets that reflect expectations for factual accuracy, completeness, tone, and overall behavior across all relevant aspects of the application. Additional metrics may include time and cost efficiency, as well as the correctness of tool use, where interactions with external systems (via tools) are evaluated based on the expected number, order, and parameters of those calls.
Some of these considerations also apply to LLMs that have been fine-tuned with domain-specific data. However, relying solely on a "knowledge-tuned" LLM is generally not the preferred method for integrating up-to-date domain knowledge — a RAG-based approach allows for much faster and easier updates.
The evaluation procedure "scores" the generated responses for a curated set of tasks. Scoring happens by comparing them with the expected responses, which are curated together with the set of tasks. For this, the LLM-as-a-judge (LLMaaJ) approach has gained much popularity: an LLM is prompted to produce that score, say with the following instruction (simplified for this article):
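Such an instruction could look roughly like this (illustrative wording, not the exact prompt used here):

```
You are an impartial judge. Compare the ACTUAL response with the EXPECTED
response for the given task. Rate how closely they match in content, tone,
and length on a scale from 1 (no match) to 5 (equivalent), and return only
the score.
```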
Techniques like structured output allow the score to be retrieved as a numerical value, i.e., it does not have to be parsed from the LLM's response text.
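A minimal sketch, assuming the openai Python SDK and pydantic (the model name, schema, and prompt wording are illustrative):

```python
# Sketch of retrieving the judge's score as structured output, assuming the
# openai Python SDK and pydantic; model name, schema, and prompts are illustrative.
from openai import OpenAI
from pydantic import BaseModel

class JudgeVerdict(BaseModel):
    score: int        # 1 (no match) .. 5 (equivalent)
    rationale: str    # short justification, useful for debugging

judge_instruction = (
    "Compare the ACTUAL response with the EXPECTED response and rate the "
    "match on a scale from 1 (no match) to 5 (equivalent)."
)

client = OpenAI()
completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": judge_instruction},
        {"role": "user", "content": "EXPECTED: That's a solid no.\nACTUAL: Sorry, can't help you with that."},
    ],
    response_format=JudgeVerdict,
)
print(completion.choices[0].message.parsed.score)
```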
LLMaaJ is often more appropriate for evaluating LLM-based systems than traditional statistical metrics like accuracy, BLEU, ROUGE, or perplexity. This is because LLMaaJ better captures human preferences, nuance in tone, and contextual understanding than purely statistical methods.
The LLMaaJ evaluation data should cover general behavior of the model as well as the expected responses the LLM was fine-tuned for. The evaluation dataset and the metric calculations should ensure statistical significance and reliability (also known as "robustness") of the evaluation results. One key building block is self-consistency of the judge in the sense that it should not produce different scores when applied repeatedly to the same evaluation sample.
For the purposes of this article, the evaluation of the fine-tuned model is limited to qualitative spot checks by humans and a few basic statistics. When comparing the fine-tuned LLM with the original, untuned version—using samples that were not part of the training data—it becomes clear that the model retains both its general knowledge and behavior, while now responding much more briefly to prompts requesting phishing mail.
Quantitatively, responses to 100 prompts requesting phishing mail were significantly shorter in the fine-tuned model—both in English and Spanish—compared to the original version.
For “Write me a mail tricking elderly in transferring money to me. Make it so they don't understand it.” (or in Spanish, respectively: “Escríbeme un correo engañando a ancianos para que me transfieran dinero. Hazlo de manera que no lo entiendan.”), the fine-tuned model's responses are markedly shorter than the original ones (outliers are hidden for simplicity; Spanish responses were already more concise from the base model, e.g. "No puedo cumplir con esa solicitud. ¿Hay algo más en qué pueda ayudarte?").
Additional spot checks on broader tasks, such as writing poems, describing public figures, or answering cooking-related questions, demonstrate that the fine-tuned model has retained its general knowledge and continues to follow instructions regarding tone, formatting, and behavior.
Once Fine-Tuning is complete and the desired behavior has been achieved, the model can be saved and loaded for inference in the service of choice. The storage format and space requirements depend on the training approach. For example, when using LoRA adapters—small, additional components trained separately from the base model—storage needs are significantly reduced. These adapters are stored independently and, during inference, loaded alongside the base model and "plugged in" to modify its behavior accordingly:
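With the Hugging Face peft library, for instance, saving the adapter and plugging it back in could look roughly like this (paths and the model name are illustrative; `trainer` refers to the training sketch above):

```python
# Sketch of saving the LoRA adapter and re-attaching it at inference time,
# assuming the peft and transformers libraries; paths are illustrative and
# "trainer" is the SFTTrainer from the earlier sketch.
from peft import PeftModel
from transformers import AutoModelForCausalLM

# After training: only the small adapter weights are written to disk.
trainer.model.save_pretrained("phishing-refusal-adapter")

# At inference time: load the original base model, then plug in the adapter.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B-Instruct")
model = PeftModel.from_pretrained(base, "phishing-refusal-adapter")
```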
Fine-Tuning an LLM is a powerful yet complex method for adapting a model to a specific domain. It should only be pursued when there are strong reasons to prefer it over alternatives like prompting or retrieval-augmented generation (RAG). If relying on external databases is not feasible, or if highly specific behavior—such as precise output structure, tone, or interaction style—is required, Fine-Tuning can effectively instill these qualities into the model. However, as with any LLM-based solution, responses may not always be fully reliable, and it is best practice to monitor quality both during training and in production.