Supervised Fine-Tuning (SFT) · AI Engineer

1. Supervised Fine-Tuning (SFT)

The base model problem

You just finished pre-training a massive language model. Trillions of tokens. Months of compute. Billions of parameters. You type in: "What's the capital of France?"

And the model responds with: "What's the capital of Germany? What's the capital of Italy? What's the capital of Spain?"

It didn't answer your question. It continued the pattern. Because that's what it was trained to do — predict what text comes next. A base model is a text completion engine. It has no concept of "the user asked me something, I should answer."

This is the gap that supervised fine-tuning closes.

What is supervised fine-tuning?

Supervised fine-tuning (SFT) is straightforward. You take your pre-trained base model and train it further on a curated dataset of (instruction, response) pairs. The model learns to map instructions to helpful responses instead of just continuing text.

Think of it like this: pre-training is reading every book in the library. Fine-tuning is doing practice exams with an answer key.

The training objective is the same as pre-training — predict the next token. But the data is completely different. Instead of random web text, the model sees structured conversations where a user asks something and an assistant responds helpfully.
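In symbols, the next-token objective can be sketched as follows. The notation is an assumption added for illustration: a training sequence of tokens x_1, ..., x_T and a model p_θ with parameters θ.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(x_t \mid x_{<t}\right)
```

Pre-training and SFT both minimize this quantity. What differs is the text the tokens x_t are drawn from.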

The training data format

SFT datasets typically look like this:

{
  "messages": [
    { "role": "user", "content": "Explain photosynthesis in simple terms." },
    { "role": "assistant", "content": "Plants use sunlight to convert water and carbon dioxide into sugar and oxygen. Think of leaves as tiny solar-powered food factories..." }
  ]
}

Each example is a conversation turn (or multiple turns). The model only computes loss on the assistant tokens — it learns to generate good responses, not to mimic the user's questions.

Some datasets use a simpler format with just an instruction and output:

{
  "instruction": "Write a haiku about programming",
  "output": "Semicolons lost\nThe compiler screams at me\nMissing bracket found"
}

Either way, the core idea is the same: show the model what a good response looks like, over and over.
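Converting between the two formats is mechanical. Here is a minimal sketch, assuming each flat record has exactly the `instruction` and `output` keys shown above. The `to_messages` helper name is hypothetical, not part of any library.

```python
import json


def to_messages(record: dict) -> dict:
    """Wrap a flat instruction/output record as a single-turn conversation."""
    return {
        "messages": [
            {"role": "user", "content": record["instruction"]},
            {"role": "assistant", "content": record["output"]},
        ]
    }


flat = {
    "instruction": "Write a haiku about programming",
    "output": "Semicolons lost\nThe compiler screams at me\nMissing bracket found",
}

chat = to_messages(flat)
print(json.dumps(chat, indent=2))
```

Normalizing everything to the `messages` form up front means the rest of a pipeline only has to handle one shape of data.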

Where does the training data come from?

This is where it gets interesting. There are roughly three sources:

Human-written data. The gold standard. Hire skilled annotators, give them instructions, and have them write high-quality responses. OpenAI's InstructGPT paper used about 13,000 human-written demonstrations to fine-tune GPT-3. That's a tiny dataset compared to pre-training — but it was enough to dramatically change the model's behavior.

Distillation from stronger models. Use a powerful model (like GPT-4 or Claude) to generate training data for a smaller model. This is how many open-source fine-tuned models are built. Alpaca, one of the first open instruction-tuned models, used 52,000 examples generated by OpenAI's text-davinci-003, a GPT-3.5-series model.

Community and open datasets. Projects like Open Assistant, Dolly, and FLAN collected instruction-response pairs from volunteers or converted existing NLP datasets into instruction format.

Source	Quality	Cost	Scale
Human annotators	Highest	Expensive	Small (thousands)
Distillation from strong models	High	Moderate	Large (tens of thousands)
Open community datasets	Variable	Low	Very large (hundreds of thousands)

Quality beats quantity

The InstructGPT paper showed that about 13,000 carefully written demonstrations were enough to change a model's behavior substantially, and the later LIMA paper reported strong results from only 1,000 curated examples. A small dataset of excellent responses can teach the model more than a huge dataset of mediocre ones. This makes intuitive sense: you'd rather learn cooking from 10 meals prepared by a chef than from 10,000 frozen dinners.

The chat template

Modern fine-tuning doesn't just dump text into the model. It uses a structured chat template that marks where each role's text begins and ends. This helps the model understand the conversation structure.

Here's what a typical template looks like under the hood:

<|begin_of_text|><|start_header_id|>system<|end_header_id|>
You are a helpful assistant.<|eot_id|>
<|start_header_id|>user<|end_header_id|>
What is photosynthesis?<|eot_id|>
<|start_header_id|>assistant<|end_header_id|>
Plants use sunlight to convert...<|eot_id|>

Those special tokens (<|start_header_id|>, <|eot_id|>) are dedicated entries in the tokenizer's vocabulary, either reserved in advance or added before fine-tuning. They give the model explicit signals about who's talking and when a turn ends.

Different model families use different templates — LLaMA has one format, Mistral has another, ChatML is yet another. The exact tokens don't matter much. What matters is consistency: the model must see the same format during training and inference.
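To make the consistency point concrete, here is a small sketch that renders a list of messages into the layout shown above. It mirrors the example as printed in this article, not any model family's official template, and the `render_chat` function and its `add_generation_prompt` parameter are hypothetical names chosen for illustration. In practice you would use the template shipped with the model's tokenizer (for example, `apply_chat_template` in Hugging Face Transformers).

```python
def render_chat(messages, add_generation_prompt=False):
    """Render a list of {"role", "content"} dicts in the layout shown above."""
    text = "<|begin_of_text|>"
    for message in messages:
        text += f"<|start_header_id|>{message['role']}<|end_header_id|>\n"
        text += f"{message['content']}<|eot_id|>\n"
    if add_generation_prompt:
        # At inference time, end with an open assistant header so the
        # model writes the reply as its continuation.
        text += "<|start_header_id|>assistant<|end_header_id|>\n"
    return text


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is photosynthesis?"},
]

prompt = render_chat(messages, add_generation_prompt=True)
print(prompt)
```

Using one function for both training data and inference prompts is a simple way to guarantee the model sees the same format in both places.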

What does SFT actually change in the model?

Here's something surprising: SFT barely changes the model's weights.

The base model already knows language, facts, reasoning, and code from pre-training. SFT doesn't teach it new knowledge. It teaches it a new behavior pattern — specifically, the pattern of reading an instruction and generating a helpful response.

This can be measured by comparing a fine-tuned checkpoint with its base model: the weight changes from SFT are typically small relative to the overall magnitude of the weights. It's like a minor adjustment to the steering, not rebuilding the engine.

This is why SFT is so efficient. You're not training a model from scratch. You're nudging an already-capable model to use its existing abilities in a structured way.
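One simple way to quantify "tiny" is the norm of the weight difference divided by the norm of the base weights. The sketch below assumes the weights are available as flat lists of floats. The `relative_change` helper is hypothetical, and the numbers are invented toy values, not measurements from any real model.

```python
import math


def relative_change(base, tuned):
    """Return ||tuned - base|| / ||base|| for two flat lists of weights."""
    diff_norm = math.sqrt(sum((t - b) ** 2 for b, t in zip(base, tuned)))
    base_norm = math.sqrt(sum(b ** 2 for b in base))
    return diff_norm / base_norm


# Toy values for illustration only.
base_weights = [0.5, -1.2, 0.8, 2.0]
tuned_weights = [0.51, -1.19, 0.79, 2.01]

ratio = relative_change(base_weights, tuned_weights)
print(f"relative change: {ratio:.4f}")
```

A ratio close to zero means the fine-tuned weights sit very near the base weights. With real checkpoints the same calculation would be run per layer over tensors rather than Python lists.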

When should you fine-tune?

Not every problem needs fine-tuning. Here's a practical decision framework:

Use prompting first. If you can get good results by writing a better prompt (few-shot examples, clear instructions, system prompts), do that. It's faster, cheaper, and you don't need training infrastructure.

Use fine-tuning when:

You need consistent formatting or style that prompting can't reliably achieve
You're deploying a smaller model and need it to punch above its weight
You have domain-specific tasks where general models underperform
Latency matters and you want a small, fast model that does one thing well
You want to distill a larger model's capabilities into a cheaper one

Don't fine-tune when:

Your training data is small and low-quality (you'll teach the model bad habits)
The task changes frequently (you'd need to retrain constantly)
A good prompt already solves it (fine-tuning adds complexity for no gain)

Fine-tuning can make things worse

If your training data contains errors, biases, or poor-quality responses, the model will learn those patterns. Fine-tuning on bad data doesn't just fail to help — it actively degrades the model. A mediocre base model with a good prompt often beats a fine-tuned model trained on garbage data.

The SFT training process

The actual training loop looks a lot like pre-training, but smaller in every dimension.

Dataset size: Thousands to tens of thousands of examples (vs trillions of tokens in pre-training).

Training duration: A few hours to a few days on a handful of GPUs (vs months on thousands of GPUs for pre-training).

Learning rate: Much smaller than pre-training. You're making fine adjustments, not learning from scratch. Too high a learning rate and the model "forgets" what it learned during pre-training — a problem called catastrophic forgetting.

Epochs: Often 2-5 passes over the data. Too many epochs and the model overfits — it memorizes the training examples instead of learning the general pattern of being helpful.

One important detail: the loss is only computed on the assistant's tokens, not the user's message. The model doesn't need to learn how to write user questions — it just needs to learn how to respond to them. This technique is called loss masking.
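Loss masking can be sketched in a few lines of plain Python. The assumptions here are illustrative: `log_probs[t]` holds the log-probability the model assigned to each candidate token at position `t` (already aligned with the label at that position), `is_assistant` flags which positions belong to the assistant's reply, and `-100` marks ignored positions, which matches the default `ignore_index` of PyTorch's cross-entropy loss. The helper names are hypothetical.

```python
import math

IGNORE = -100  # label value meaning "do not compute loss here" (assumed convention)


def build_labels(token_ids, is_assistant):
    """Copy token ids as labels, masking every non-assistant position."""
    return [tok if keep else IGNORE for tok, keep in zip(token_ids, is_assistant)]


def masked_nll(log_probs, labels):
    """Average negative log-likelihood over unmasked positions only."""
    kept = [-lp[label] for lp, label in zip(log_probs, labels) if label != IGNORE]
    return sum(kept) / len(kept)


# Toy sequence: three user tokens followed by two assistant tokens.
token_ids = [11, 12, 13, 21, 22]
is_assistant = [False, False, False, True, True]
labels = build_labels(token_ids, is_assistant)

# Invented model outputs: log-probability of the correct token at each position.
log_probs = [
    {11: math.log(0.1)},
    {12: math.log(0.1)},
    {13: math.log(0.1)},
    {21: math.log(0.5)},
    {22: math.log(0.25)},
]

loss = masked_nll(log_probs, labels)
print(labels)
print(f"masked loss: {loss:.4f}")
```

The poor predictions on the three user tokens do not affect the result. Only the two assistant positions contribute, so the gradient pushes the model toward better responses and not toward better imitations of the user.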

Real examples: how SFT changed the game

InstructGPT (2022): OpenAI fine-tuned GPT-3 models of several sizes (1.3B, 6B, and 175B parameters) on about 13,000 human-written demonstrations, then refined them further with RLHF. Human labelers preferred outputs from the 1.3B-parameter InstructGPT model over those from the raw 175B-parameter GPT-3, a model more than 100 times larger. The SFT stage on its own already produced outputs that labelers preferred over the base model's.

LLaMA-2-Chat: Meta fine-tuned their LLaMA 2 base models on about 27,500 human-written SFT annotations, having set aside millions of third-party examples in favor of a smaller, higher-quality set. They found that data quality was more important than quantity: carefully curated examples from skilled annotators outperformed larger but noisier datasets.

Mistral and Zephyr: The open-source community showed that even small amounts of high-quality distillation data (generated by stronger models) could produce impressively capable instruction-tuned models from relatively small base models.

SFT is just the first step

Supervised fine-tuning gets you a model that follows instructions. But "follows instructions" and "is actually helpful, honest, and safe" are different things.

An SFT model tends to follow instructions, including harmful ones. Unless its training examples demonstrate otherwise, it has not learned to refuse dangerous requests, avoid making up facts, or balance helpfulness with safety. It has only learned to generate text that looks like the training examples.

That's why SFT is typically followed by reinforcement learning from human feedback (RLHF) — a training stage that teaches the model not just what to say, but what's worth saying.

What's next?

We've seen how SFT bridges the gap between a raw text predictor and an instruction-following assistant. But following instructions isn't enough. Next up: RLHF and Reward Models — how human preferences shape model behavior, why a separate reward model is needed, and how reinforcement learning makes models genuinely helpful instead of just obedient.