Reasoning Models · AI Engineer

4. Reasoning Models

The speed problem

Regular LLMs generate answers one token at a time, and they're fast. You ask a hard math question, and the model starts spitting out an answer within milliseconds.

But here's the thing. Humans don't work that way. If someone asks you "what's 847 times 293?" you don't blurt out the answer instantly. You pause. You think. Maybe you break it into smaller parts. You check your work.

Standard LLMs don't get that thinking time. They commit to each token the moment they generate it. If the first token in their answer is wrong, every token after it is building on a mistake. There's no "wait, let me reconsider" step.
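To make "commits to each token" concrete, here is a minimal sketch of a decoding loop. The lookup table TOY_MODEL is a hypothetical stand-in for a real network, and always picking the most likely token (greedy decoding) is a simplification; real systems usually sample. The point is only the shape of the loop: tokens get appended and are never revised.

```python
# Hypothetical toy "model": maps the last token to next-token probabilities.
# A real LLM conditions on the whole context; this table is only illustrative.
TOY_MODEL = {
    "<start>": {"The": 1.0},
    "The": {"answer": 1.0},
    "answer": {"is": 1.0},
    "is": {"8": 0.55, "9": 0.45},  # the wrong answer is slightly more likely
    "8": {"<end>": 1.0},
    "9": {"<end>": 1.0},
}


def greedy_decode(model, max_tokens=10):
    """Generate tokens one at a time, always taking the most likely next token."""
    tokens = ["<start>"]
    for _ in range(max_tokens):
        probs = model[tokens[-1]]
        next_token = max(probs, key=probs.get)
        if next_token == "<end>":
            break
        # Committed: nothing generated earlier is ever revisited or edited.
        tokens.append(next_token)
    return tokens[1:]


print(" ".join(greedy_decode(TOY_MODEL)))
```

Once "8" has been emitted, the loop has no step where it could go back and pick "9" instead.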

This works fine for simple questions. "What's the capital of France?" doesn't need deep thinking. But for math, logic puzzles, code debugging, and multi-step reasoning?

Speed kills accuracy.

Chain-of-thought: the prompting version

We've already seen one solution to this. Chain-of-thought prompting — you add "think step by step" to your prompt, and the model reasons through the problem before giving an answer.

Q: If a train travels 60 mph for 2.5 hours, then
   40 mph for 1.5 hours, what's the total distance?

Without CoT: "210 miles" (a one-shot answer with no work shown, so it may or may not be right)

With CoT:
"First part: 60 mph x 2.5 hours = 150 miles
 Second part: 40 mph x 1.5 hours = 60 miles
 Total: 150 + 60 = 210 miles"

Chain-of-thought works. It makes models significantly better at reasoning tasks. But it's a hack: you're relying on the prompt to trigger reasoning. Stepping through a problem isn't the model's default behavior. You're coaxing it into doing so.
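As a rough sketch of the prompting trick, the helper below appends a step-by-step cue to a question, and a second function reproduces the arithmetic from the train example so the 210-mile figure can be checked. The function names and the exact cue wording are assumptions made for illustration; no model is called here.

```python
def build_prompt(question: str, chain_of_thought: bool = False) -> str:
    """Format a question, optionally adding a chain-of-thought cue."""
    prompt = f"Q: {question}\nA:"
    if chain_of_thought:
        # The cue wording is an assumption; any "think step by step" phrasing works similarly.
        prompt += " Let's think step by step."
    return prompt


def total_distance(legs):
    """Sum speed * time over each (mph, hours) leg of the trip."""
    return sum(speed * hours for speed, hours in legs)


question = "If a train travels 60 mph for 2.5 hours, then 40 mph for 1.5 hours, what's the total distance?"
print(build_prompt(question, chain_of_thought=True))
print(total_distance([(60, 2.5), (40, 1.5)]))  # 150 + 60 miles
```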

What if thinking was built in?

Reasoning models: thinking as a feature

In September 2024, OpenAI released o1 (initially as o1-preview), the first mainstream "reasoning model." The key idea: before generating a visible response, the model goes through an internal reasoning process. You can see that it's thinking (there's a visible "thinking" phase), but the actual chain of thought is mostly hidden.

The model doesn't just predict the next token immediately. It spends extra computation reasoning about the problem first, then generates its answer.

The "thinking tokens" are real tokens the model generates internally. They count toward compute costs. But they're the reason reasoning models blow standard models out of the water on hard problems.

On the AIME math competition (problems that stump most humans), OpenAI reported o1 scoring around 83%, against roughly 13% for GPT-4o on the same problems. That's not a small improvement. It's a category change.

How do they work internally?

The exact details are proprietary for most reasoning models, but the general approach is understood.

Training for reasoning. These models aren't just prompted differently — they're trained differently. During training, the model learns to generate long chains of reasoning before arriving at an answer. This is done through reinforcement learning, where the model gets rewarded for correct final answers and learns that longer reasoning chains lead to better results on hard problems.
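A minimal sketch of what "rewarded for correct final answers" could look like in code. The "Answer:" marker and both function names are assumptions for illustration; real training pipelines use their own answer formats and checkers. Note that only the final answer is scored, not the reasoning that precedes it.

```python
import re
from typing import Optional


def extract_final_answer(completion: str) -> Optional[str]:
    """Return the text after the last 'Answer:' marker, or None if there is none."""
    matches = re.findall(r"Answer:\s*(.+)", completion)
    return matches[-1].strip() if matches else None


def outcome_reward(completion: str, reference: str) -> float:
    """1.0 if the final answer matches the reference exactly, else 0.0."""
    return 1.0 if extract_final_answer(completion) == reference else 0.0


good = "60 * 2.5 = 150, 40 * 1.5 = 60, total 210.\nAnswer: 210 miles"
bad = "Probably around 200.\nAnswer: 200 miles"
print(outcome_reward(good, "210 miles"), outcome_reward(bad, "210 miles"))
```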

Test-time compute. This is the big idea. A regular model spends roughly the same compute on every token and starts answering right away, so its total compute is set by the length of the answer. A reasoning model can spend more compute on harder problems by generating more thinking tokens first: exploring different approaches, checking its work, backtracking when it spots errors.

Think of it like this: a regular model is a student who has to answer every question in exactly 30 seconds. A reasoning model is a student who can spend 2 minutes on easy questions and 15 minutes on hard ones.

What happens during thinking. The model explores approaches. It might start down one path, realize it's wrong, and backtrack. It checks intermediate results. It considers alternative methods. It verifies its own logic. All of this happens in the thinking tokens — you just see the final, clean answer.

This is radically different from how standard models work. A standard model generates each token without ever reconsidering. A reasoning model can say "wait, that doesn't add up" mid-thought and try a different approach.

The OpenAI o-series

OpenAI's reasoning model lineup has evolved quickly:

o1 (September 2024) — The first. Showed dramatic improvements on math, coding, and science benchmarks. Much slower and more expensive than GPT-4, but significantly more capable on hard problems.

o1-mini — A smaller, faster, cheaper version for coding and math tasks where you don't need the full reasoning power.

o3 (announced December 2024, released April 2025) — Major step up from o1. Better on essentially every reasoning benchmark. At announcement, OpenAI reported 96.7% on AIME 2024, over 87% on the ARC-AGI benchmark (which tests novel reasoning) in a high-compute configuration, and expert-level performance on science questions.

o4-mini (April 2025) — A smaller model aimed at the sweet spot between cost and reasoning ability.

Each generation thinks more efficiently — better reasoning in fewer tokens, which means faster and cheaper.

Why skip o2?

OpenAI jumped from o1 to o3 in naming. The reason? There's already a company called O2 (the telecom), and they wanted to avoid confusion. So they just skipped the number. Nothing deep about it.

DeepSeek-R1: open-source reasoning

In January 2025, Chinese AI lab DeepSeek released R1 — an open-source reasoning model that matched or beat o1 on many benchmarks.

This was a big deal for two reasons.

First, it showed that reasoning capabilities aren't locked behind proprietary walls. Anyone can download the R1 weights. The full model needs serious hardware, but the smaller distilled versions can run locally. Researchers can study how it works. Developers can fine-tune it for specific tasks.

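Second, the training approach was fascinating. DeepSeek published their method: they used reinforcement learning (specifically, a variant called GRPO) to train the model to reason. In the R1-Zero variant, no human-written chain-of-thought data was used at all: the model learned to think step by step purely from being rewarded for correct answers. The released R1 adds a small amount of curated "cold-start" reasoning data before the RL stage, mainly to make its output more readable.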

Early in training, the model's reasoning was messy and disorganized. As training progressed, it spontaneously developed structured reasoning — breaking problems into steps, checking its work, and even having "aha moments" where it reconsidered its approach mid-thought.
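The core of GRPO's reward signal can be sketched in a few lines: sample a group of answers to the same question, score each one, and rate each answer relative to the group average. This is a simplified illustration rather than DeepSeek's code; the function name, the use of the population standard deviation, and the small eps term are assumptions, and the policy-update step that consumes these advantages is omitted.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population standard deviation
    # eps avoids division by zero when every answer in the group scored the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one question: two correct (1.0), two wrong (0.0).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))
```

Correct answers end up with positive advantages and wrong ones with negative advantages, so no separate learned value model is needed to judge what counts as better than average. If every answer in the group gets the same reward, all advantages are zero and that question contributes no learning signal.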

The thinking tokens in R1 are visible (unlike o1's hidden reasoning), so you can actually watch the model reason through a problem. Sometimes the reasoning is elegant. Sometimes it rambles. But it works.
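If you work with raw R1 output, the reasoning typically arrives wrapped in <think>...</think> tags ahead of the answer. A minimal sketch for separating the two follows; the function name is hypothetical, and some hosted APIs return the reasoning in a separate field instead, in which case no parsing is needed.

```python
import re


def split_reasoning(output: str):
    """Split raw model output into (thinking, answer) using <think> tags."""
    match = re.search(r"<think>(.*?)</think>", output, flags=re.DOTALL)
    if match is None:
        # No thinking block found: treat the whole output as the answer.
        return "", output.strip()
    thinking = match.group(1).strip()
    answer = output[match.end():].strip()
    return thinking, answer


raw = "<think>\n'All but 9 die' means 9 survive.\n</think>\n9 sheep are left."
thinking, answer = split_reasoning(raw)
print("THINKING:", thinking)
print("ANSWER:", answer)
```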

What reasoning looks like

Here's a simplified example of what reasoning model thinking looks like (based on visible R1 reasoning):

Question: A farmer has 17 sheep. All but 9 die.
How many are left?

Thinking:
"Let me parse this carefully. The farmer has
17 sheep. 'All but 9 die' means that 9 survive.
So 17 - 9 = 8 die, and 9 are left.

Wait. Let me re-read. 'All but 9 die.' That means
every sheep except 9 of them dies. So 9 remain
alive.

The answer is 9."

Answer: 9 sheep are left.

This is a classic trick question. Standard LLMs often answer "8" because they pattern-match "17 minus 9" without reading carefully. The reasoning model catches the trick because it re-reads and double-checks during thinking.

The difference is most dramatic on problems that look simple but have a catch. Reasoning models are much better at pausing and reconsidering.

Tree-of-thoughts

Not all reasoning is linear. Some problems need the model to explore multiple possible approaches and pick the best one.

Tree-of-thoughts extends chain-of-thought into a tree structure. At each step, the model generates several possible continuations, evaluates each one, and follows the most promising branch.

This is computationally expensive — each branch is its own chain of reasoning tokens. But for problems like mathematical proofs, complex logic puzzles, or code architecture decisions, exploring multiple paths is genuinely better than committing to the first idea.
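Here is a toy sketch of the generate, evaluate, and follow-the-best-branches loop. In a real tree-of-thoughts setup an LLM proposes the continuations and scores them; here both roles are played by hand-written functions on a small number puzzle (reach a target using "+3", "*2", and "-1"), so every name and the puzzle itself are assumptions for illustration.

```python
# Each "thought" is one arithmetic step; a state is (current value, steps so far).
OPS = {
    "+3": lambda x: x + 3,
    "*2": lambda x: x * 2,
    "-1": lambda x: x - 1,
}


def expand(state):
    """Generate every possible continuation of a partial solution."""
    value, path = state
    return [(fn(value), path + [name]) for name, fn in OPS.items()]


def score(state, target):
    """Evaluate a partial solution: closer to the target is better."""
    return -abs(target - state[0])


def tree_of_thoughts(start, target, beam_width=2, max_depth=5):
    """Breadth-first search that keeps only the best `beam_width` branches per level."""
    frontier = [(start, [])]
    for _ in range(max_depth):
        candidates = [child for state in frontier for child in expand(state)]
        for value, path in candidates:
            if value == target:
                return path
        candidates.sort(key=lambda s: score(s, target), reverse=True)
        frontier = candidates[:beam_width]
    return None  # no solution found within the depth limit


print(tree_of_thoughts(1, 14, beam_width=2))
print(tree_of_thoughts(1, 14, beam_width=1))
```

With beam_width=1 the search commits to the single best-looking step each time and overshoots before correcting, taking five steps. Keeping two branches alive finds a three-step path, at the price of evaluating twice as many candidates per level.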

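Whether reasoning models like o3 run an explicit tree search internally hasn't been published. What is visible, for example in R1's traces, is search-like behavior inside a single chain of thought: the model tries an approach, abandons it, and backtracks to try another.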

When to use reasoning models vs regular models

Reasoning models aren't always the right choice. They're slower and more expensive. For many tasks, a regular model is better.

Task type	Best model	Why
Simple Q and A	Regular (GPT-4, Claude)	Fast, cheap, reasoning adds nothing
Creative writing	Regular	Reasoning doesn't improve creativity
Casual conversation	Regular	Speed matters, thinking is unnecessary
Hard math problems	Reasoning (o3, R1)	Night and day difference in accuracy
Complex coding tasks	Reasoning	Better at multi-file changes, edge cases
Logic puzzles	Reasoning	Catches tricks that regular models miss
Scientific analysis	Reasoning	Step-by-step verification matters
Simple code generation	Regular	Fast enough and much cheaper

The rule of thumb: if the task requires careful multi-step thinking, a reasoning model is worth the extra cost. If the task is about fluency, creativity, or speed, stick with a regular model.

You can mix and match

Smart systems use regular models for easy tasks and reasoning models for hard ones. A routing layer (like we covered in the Agent Patterns article) classifies the input difficulty and sends it to the right model. This gives you reasoning quality when you need it without paying reasoning costs on every request.
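A minimal sketch of such a routing layer, using a crude keyword-and-length heuristic. The hint list, the threshold, and the model names are placeholders invented for this example; production routers more often use a small classifier model than string matching.

```python
# Hypothetical phrases that suggest multi-step reasoning is needed.
REASONING_HINTS = ("prove", "debug", "step by step", "how many", "optimize")


def route(prompt: str, word_threshold: int = 80) -> str:
    """Pick a model tier for a prompt using a simple difficulty heuristic."""
    text = prompt.lower()
    has_hint = any(hint in text for hint in REASONING_HINTS)
    is_long = len(text.split()) > word_threshold
    # "reasoning-model" and "regular-model" are placeholder names, not real model IDs.
    return "reasoning-model" if has_hint or is_long else "regular-model"


print(route("What's the capital of France?"))
print(route("Prove that the sum of two even numbers is even."))
```

A heuristic like this will misroute some requests, so it helps to log routing decisions and spot-check them against actual answer quality.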

Cost and latency tradeoffs

Reasoning models use more tokens, which means more money and more time.

A regular model might generate 200 tokens for an answer. A reasoning model might generate 2000 thinking tokens plus 200 answer tokens: 2200 in total, or 11x as many. The thinking tokens aren't free.

For o3, hard math problems might use 10,000-50,000 thinking tokens before producing a 100-token answer. That's a lot of compute.
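The token arithmetic above can be put into a small helper. The price used here is a made-up placeholder, not any provider's actual rate, and the sketch assumes thinking tokens are billed at the same rate as answer tokens; check your provider's pricing before relying on numbers like these.

```python
def request_cost(thinking_tokens: int, answer_tokens: int, price_per_1k_tokens: float = 0.01):
    """Return (total generated tokens, cost) for one request.

    price_per_1k_tokens is a hypothetical placeholder price.
    """
    total = thinking_tokens + answer_tokens
    return total, total / 1000 * price_per_1k_tokens


regular_tokens, regular_cost = request_cost(0, 200)
reasoning_tokens, reasoning_cost = request_cost(2000, 200)
hard_tokens, hard_cost = request_cost(50_000, 100)

print(reasoning_tokens / regular_tokens)  # how many times more tokens than the regular answer
print(hard_tokens, round(hard_cost, 3))
```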

Latency is the other cost. Regular models respond in 1-3 seconds. Reasoning models can take 10-60 seconds on hard problems. For real-time applications (chatbots, autocomplete), that delay might be unacceptable.

Newer reasoning models are getting more efficient. o4-mini and the DeepSeek-R1-Distill models are specifically designed to reason well while keeping costs closer to regular models. The trend is clear: reasoning will get cheaper over time.

The future of reasoning

Where is this heading?

Reasoning everywhere. Claude, GPT, Gemini: most major model families now have reasoning variants or modes. "Extended thinking" is becoming a standard feature, not a specialized product. Increasingly you can toggle reasoning on per request, or set how much thinking to allow: "think harder about this one."

Reasoning and agents together. Combine the ReAct agent loop from the previous article with a reasoning model's internal thinking. The agent reasons carefully about which tool to use, then reasons carefully about interpreting the result. Careful reasoning at both steps should mean fewer wrong tool calls and fewer misread results, at the price of extra latency on every step.

Smaller reasoning models. The R1-distill series showed that you can take reasoning abilities from a huge model and compress them into a smaller one. A 7-billion parameter model that reasons well enough for most tasks could run on a laptop.

Verifiable reasoning. Right now, reasoning models can still make logical errors — they just make fewer of them. Future models might integrate formal verification: "I derived this answer through these steps, and each step is logically guaranteed." We're not there yet, but it's a direction.

Efficiency scaling. Instead of making models bigger, make them think longer. This is the test-time compute thesis — a smaller model that thinks for 30 seconds might outperform a larger model that answers instantly. Compute shifts from training to inference.

What's next?

This wraps up the Agents and Reasoning series. You've gone from understanding what agents are, through the patterns that make them work, to how modern models reason through complex problems.

The next series shifts to a completely different kind of AI: Multimodal and Generative AI — how models create images, videos, and more. From diffusion models to text-to-image to the cutting edge of AI creativity.