In September 2024, OpenAI released o1 — a model that thinks before it answers. Unlike previous models that generate responses token by token in a single pass, o1 uses an internal chain-of-thought process, spending more compute at inference time to reason through complex problems.
The results on hard benchmarks were striking.
What Makes o1 Different
Traditional LLMs work by predicting the next token. They are fast but fundamentally limited in their ability to plan, backtrack, or verify their own reasoning. They often fail on problems that require multi-step logic, mathematical proofs, or careful sequential reasoning.
o1 changes this by introducing inference-time compute scaling. Instead of making the model bigger (training-time scaling), you make it think longer (inference-time scaling). The model generates internal reasoning chains — sometimes thousands of tokens of "thinking" — before producing its final answer.
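OpenAI has not published o1's training or decoding procedure, but the general idea of spending more compute at inference time can be illustrated with a public technique, self-consistency sampling: draw several independent reasoning paths and majority-vote over their final answers. The sketch below stubs out the model call — `sample_answer` is a hypothetical stand-in, not a real API:

```python
import random
from collections import Counter

def sample_answer(problem: str) -> str:
    """Hypothetical stand-in for one stochastic model call.
    Simulates a model whose reasoning path lands on the right
    answer 60% of the time."""
    return "42" if random.random() < 0.6 else str(random.randint(0, 9))

def self_consistency(problem: str, n_samples: int = 16) -> str:
    """Spend more inference-time compute: sample several reasoning
    paths and majority-vote over their final answers."""
    answers = [sample_answer(problem) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

random.seed(0)
print(self_consistency("What is 6 * 7?"))
```

More samples cost more tokens but raise the odds that the majority answer is correct — the same compute-for-accuracy trade that defines this paradigm.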
The key results:
- AIME 2024 (competition math): o1 scored 83%, up from GPT-4o's 13%
- Codeforces (competitive programming): o1 reached the 89th percentile
- GPQA Diamond (PhD-level physics, biology, and chemistry): o1 surpassed the performance of human PhD-level domain experts
The Tradeoffs
Reasoning models are not a free upgrade. They come with real costs:
Latency — o1 can take 10-60 seconds to answer a complex question. For simple tasks, this is slower than GPT-4o.
Cost — more tokens mean more compute, which means higher bills. Reasoning tokens are billed as output tokens even though users never see them.
Overkill for simple tasks — asking o1 to write a thank-you email is like hiring a PhD to carry groceries. Standard models are better for most everyday tasks.
Opacity — the chain-of-thought is hidden from users. You get the answer but not the full reasoning process, which limits debugging.
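The cost point above is easy to quantify. A minimal sketch, assuming hypothetical per-1k-token prices (not OpenAI's actual rates) and that hidden reasoning tokens are billed at the output rate:

```python
def reasoning_cost_usd(prompt_tokens: int,
                       reasoning_tokens: int,
                       completion_tokens: int,
                       input_price_per_1k: float = 0.015,    # hypothetical rate
                       output_price_per_1k: float = 0.060) -> float:  # hypothetical rate
    """Estimate one request's cost. Hidden reasoning tokens are
    billed at the output rate even though the user never sees them."""
    billed_output = reasoning_tokens + completion_tokens
    return (prompt_tokens / 1000 * input_price_per_1k
            + billed_output / 1000 * output_price_per_1k)

# A short visible answer can still be expensive if the model
# "thought" for thousands of hidden tokens first.
print(round(reasoning_cost_usd(500, 8000, 300), 4))  # → 0.5055
```

Here a 300-token visible answer carries 8,000 hidden reasoning tokens, so the invisible thinking dominates the bill.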
Two Paradigms
The release of o1 clarified that there are now two distinct paradigms in AI:
- Fast models (GPT-4o, Claude 3.5 Sonnet) — optimized for speed, cost, and breadth. Great for most applications.
- Reasoning models (o1) — optimized for depth and accuracy on hard problems. Best for math, science, coding, and complex analysis.
The best AI systems will use both. Route simple queries to fast models and hard problems to reasoning models. This is already becoming standard practice in production systems.
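A routing layer can be as simple as the heuristic sketch below; the marker keywords, length threshold, and model names are illustrative assumptions, and production routers often use a small classifier model instead:

```python
from dataclasses import dataclass

@dataclass
class Query:
    text: str

# Hypothetical heuristic: route on task keywords and prompt length.
HARD_MARKERS = ("prove", "derive", "debug", "optimize", "step by step")

def route(query: Query) -> str:
    """Send hard, reasoning-heavy queries to the slow model and
    everything else to the fast one."""
    text = query.text.lower()
    if any(marker in text for marker in HARD_MARKERS) or len(text.split()) > 150:
        return "reasoning-model"   # e.g. o1
    return "fast-model"            # e.g. GPT-4o

print(route(Query("Write a thank-you email to my landlord")))  # fast-model
print(route(Query("Prove that sqrt(2) is irrational")))        # reasoning-model
```

The payoff: everyday queries keep GPT-4o-class latency and cost, while the hard tail gets the extra thinking time it actually needs.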
What Comes Next
The inference-time scaling paradigm is still in its early days. We expect:
- Open-source reasoning models — projects are already working on replicating o1's approach
- Faster reasoning — current latency will improve dramatically
- Specialized reasoners — models trained specifically for code review, mathematical proofs, or scientific research
- Agent integration — reasoning models that can plan multi-step workflows and execute them
The era of "just make the model bigger" is giving way to "make the model think harder." This is a fundamental shift in how we approach AI capability, and it opens up entirely new categories of applications.