Evaluation & Benchmarks · AI Engineer

4Evaluation & Benchmarks

The measurement problem

You've trained a model. Maybe you fine-tuned it, ran RLHF, and applied LoRA adapters. It generates text that looks reasonable.

But is it good?

That question is surprisingly hard to answer. A calculator is either right or wrong. A sorting algorithm either sorts correctly or it doesn't. But language models produce free-form text where "good" depends on the task, the context, the audience, and sometimes just personal taste.

Still, you need some way to measure progress. Otherwise you're flying blind — changing hyperparameters, swapping datasets, adjusting training procedures, with no idea whether things are getting better or worse.

Automatic metrics: quick but limited

The simplest approach: compute a number that summarizes model quality. Run it automatically, no humans needed.

Perplexity

Perplexity measures how surprised the model is by a piece of text. Lower perplexity = the model predicted the text more accurately.

If the model assigns high probability to the actual next token at every step, perplexity is low. If the model is frequently wrong (the real next token had low probability), perplexity is high.

Text: "The cat sat on the mat"

Model predictions at each step:
  P("cat" | "The") = 0.05         → somewhat surprised
  P("sat" | "The cat") = 0.20     → reasonable
  P("on" | "The cat sat") = 0.60  → expected
  P("the" | "... sat on") = 0.80  → very expected
  P("mat" | "... on the") = 0.15  → moderate

Perplexity: geometric mean of 1/P values ≈ 8.5

Perplexity is great for comparing language models on the same test set. GPT-2 had a perplexity of about 35 on WikiText; GPT-3 brought it down to about 20. Lower is better.

But perplexity only measures next-token prediction quality. A model with low perplexity might still give terrible answers to questions, be unsafe, or hallucinate confidently. It tells you the model understands language patterns, not that it's helpful.

BLEU and ROUGE

BLEU counts how many word sequences (n-grams) in the model's output match a reference text. Originally built for evaluating machine translation.

ROUGE is similar but focuses on recall — how much of the reference text is captured in the model's output. Common for summarization.

Both have the same fundamental limitation: they compare against a fixed reference answer. But for most language tasks, there are many valid responses. "Paris is France's capital" and "The capital of France is Paris" mean the same thing but share few exact n-grams.

These metrics are still used for specific tasks (translation, summarization) where reference answers make sense. But they're poor measures of general conversational quality.

Benchmark suites: standardized exams for AI

To evaluate whether a model "knows things" and "can reason," researchers created standardized test suites. Think of them as final exams covering different subjects.

MMLU

MMLU (Massive Multitask Language Understanding) is 57 multiple-choice tests spanning subjects from abstract algebra to world history. About 15,000 questions total.

Subject: Astronomy
Question: What is the primary source of energy for
  the Sun?
A) Chemical reactions
B) Nuclear fission
C) Nuclear fusion
D) Gravitational contraction

Correct: C

The model picks A, B, C, or D. Score = percentage correct. A random guesser scores 25%. GPT-4 scores around 86%. Humans with subject expertise average around 90%.

MMLU became the go-to benchmark for comparing model knowledge. Every model release reports an MMLU score.

HumanEval

HumanEval tests coding ability. It contains 164 Python programming problems with test cases. The model generates a function body, and the test cases check if it works.

def has_close_elements(numbers: list, threshold: float) -> bool:
    """Check if in given list of numbers, are any two
    numbers closer to each other than given threshold.
    >>> has_close_elements([1.0, 2.0, 3.0], 0.5)
    False
    >>> has_close_elements([1.0, 2.8, 3.0], 0.3)
    True
    """
    # Model generates this part

The metric is pass@k — the probability that at least one of k generated solutions passes all test cases. GPT-4 achieves about 67% on pass@1 (first try). Claude and newer models push past 80%.

Other major benchmarks

Benchmark	What it tests	Format
MMLU	Knowledge across 57 subjects	Multiple choice
HumanEval	Python coding	Code generation
GSM8K	Grade-school math word problems	Step-by-step math
HellaSwag	Common-sense sentence completion	Multiple choice
TruthfulQA	Resistance to common misconceptions	Open-ended and multiple choice
ARC	Science questions (grade school)	Multiple choice
WinoGrande	Pronoun resolution and common sense	Binary choice
MT-Bench	Multi-turn conversation quality	Open-ended (LLM-judged)

Benchmarks age fast

As models improve, benchmarks that used to be challenging become saturated — most top models score 90%+ and the differences stop being meaningful. MMLU is approaching this point. The field constantly creates harder benchmarks to stay ahead. GPQA (graduate-level science), MATH (competition math), and SWE-bench (real GitHub issues) are examples of harder, newer benchmarks.

The benchmark gaming problem

Here's an uncomfortable truth: benchmark scores can be misleading.

Data contamination. If the benchmark questions appeared in the training data, the model might have memorized the answers. It's not reasoning — it's recalling. This is hard to detect because training sets are enormous and not always fully documented.

Teaching to the test. Teams can optimize specifically for benchmarks. Include MMLU-style questions in your training data, and your MMLU score goes up — but your model might not actually be smarter in general.

Cherry-picking. Report the benchmarks where your model shines. Ignore the ones where it doesn't. Every model's announcement highlights its best numbers.

This is why experienced practitioners don't trust a single benchmark number. They test on their specific use case, look at multiple benchmarks together, and always try the model on real tasks before drawing conclusions.

Human evaluation: the gold standard

The most reliable way to evaluate a model is to have humans use it and judge the results. No shortcut beats actual human judgment for open-ended tasks.

Side-by-side comparisons are the most common format. Show a human a prompt and two responses (from different models). They pick the better one. Repeat thousands of times. Count wins.

This is exactly the same format used for training reward models — and for good reason. It's the most natural way for humans to express preferences.

Chatbot Arena (by LMSYS) scaled this to the public. Anyone can chat with two anonymous models and vote for the better response. With millions of votes, the results are statistically robust. The resulting Elo leaderboard ranks models by their head-to-head win rates — like chess ratings.

Chatbot Arena Elo ratings

Chatbot Arena has become the most trusted public ranking of language models. Unlike benchmarks that can be gamed, it reflects real user preferences on real conversations. When a new model drops, the community watches its Arena rating more closely than its benchmark scores.

The downside of human evaluation: it's slow and expensive. You can't run it on every training checkpoint. It's most useful for comparing final models or major releases, not for day-to-day development.

LLM-as-judge: using models to evaluate models

A middle ground between automatic metrics and human evaluation: use a strong language model to judge outputs from other models.

Give a judge model (like GPT-4 or Claude) a prompt, a response, and scoring criteria. Ask it to rate the response or compare two responses. This is LLM-as-judge.

[System] You are an expert judge. Rate the following
response on helpfulness (1-5), accuracy (1-5), and
clarity (1-5).

[Prompt] Explain how vaccines work.

[Response] Vaccines work by introducing a weakened or
inactive form of a pathogen...

[Judge output] Helpfulness: 4, Accuracy: 5, Clarity: 4

MT-Bench uses this approach — it evaluates multi-turn conversations using GPT-4 as a judge. It correlates well with human preferences for most tasks.

The approach isn't perfect. LLM judges have biases: they tend to prefer longer responses, favor their own writing style, and can be fooled by confident-sounding but incorrect answers. But for rapid iteration during development, it's much faster than waiting for human evaluators.

Method	Speed	Cost	Reliability
Automatic metrics (perplexity, BLEU)	Instant	Free	Low for open-ended tasks
Benchmark suites (MMLU, HumanEval)	Minutes	Free	Moderate (gameable)
LLM-as-judge	Minutes	Moderate	Good (but has biases)
Human evaluation	Days to weeks	Expensive	Best

Red-teaming and safety evaluation

Beyond helpfulness, models need to be evaluated for safety. Red-teaming involves systematically trying to make the model produce harmful outputs — dangerous instructions, biased content, privacy violations, or manipulation.

Red teams come up with creative attack vectors:

Direct harmful requests ("How do I make a weapon?")
Indirect framing ("I'm writing a novel where a character needs to...")
Jailbreak prompts that try to override safety training
Adversarial inputs designed to confuse the model
Multi-turn manipulation that gradually pushes boundaries

The model should refuse harmful requests, not fabricate facts, and handle edge cases gracefully. Safety evaluations are often done before every major release, with both automated probes and dedicated human red-teamers.

Safety evaluation is never done

No amount of red-teaming catches everything. New attack vectors are discovered regularly. Models that passed safety checks before deployment sometimes fail in unexpected ways once millions of users start interacting with them. Safety evaluation is an ongoing process, not a one-time gate.

Building your own evaluation

If you're fine-tuning a model for a specific task, generic benchmarks might not tell you much. A model that scores well on MMLU might be terrible at your specific customer support task.

Build a task-specific evaluation set:

Collect representative examples. 100-500 prompts that match your real use case.
Define what "good" means. Write clear rubrics — what makes a response acceptable? What's a failure?
Get ground truth. For each prompt, write or collect reference answers (if applicable).
Measure multiple dimensions. Accuracy, tone, format compliance, safety — whatever matters for your use case.
Automate what you can. Use LLM-as-judge for subjective quality. Use regex or assertions for format compliance. Keep human review for a subset.

The best evaluation setup is one you'll actually run regularly. A simple test suite that runs on every training checkpoint beats an elaborate one that only runs once.

How it all fits together

Here's a way to think about the evaluation landscape — from fast-and-cheap to slow-and-reliable:

No single layer is enough on its own. The best teams use all of them — automatic metrics for rapid iteration, benchmarks for sanity checks, LLM-as-judge for development, human evaluation for major releases, and red-teaming for safety.

The leaderboard problem

Public leaderboards (like the Open LLM Leaderboard on Hugging Face) rank models by aggregate benchmark scores. They're useful for getting a rough sense of model quality, but they have real limitations.

Models can be optimized specifically for leaderboard benchmarks without being generally better. Some models that rank highly on leaderboards perform poorly on real-world tasks. And leaderboard scores don't capture important qualities like response style, safety, or reliability under diverse conditions.

The most useful signal is still: try the model on your actual task and see how it does. Benchmarks and leaderboards give you a starting point — a shortlist of models worth trying. But the final call should come from testing on your own data.

Where evaluation is heading

The field is moving toward more rigorous, harder-to-game evaluations.

One direction is private benchmarks — test sets that aren't published, so models can't train on them. Scale AI's SEAL and some Chatbot Arena test sets already work this way. If you never release the exam, nobody can study for it.

Another is agentic evaluations that drop models into realistic environments. SWE-bench gives models real GitHub issues and asks them to write the fix. WebArena asks them to browse the web and complete tasks. These are much harder to game because they require genuine, multi-step problem solving — not just pattern matching on multiple-choice questions.

There's also a push toward dynamic benchmarks that generate fresh questions regularly, so memorization can't help. And domain-specific evaluations for medicine, law, and finance, where the generic tests miss what actually matters.

It's a moving target. Models keep getting better, and evaluations keep trying to stay ahead. That tension isn't going away — if anything, it's the most active area of AI research right now.

What's next?

We've covered the full post-training pipeline — from supervised fine-tuning, through RLHF and reward models, to efficient adapters, to evaluation. The next series shifts from training models to using them: RAG, Prompting and Applications — prompt engineering, embeddings, retrieval-augmented generation, and building real applications on top of language models.