Emergent Abilities in LLMs: What Scales, What’s a Mirage (2026)
Emergent abilities in LLMs were the most consequential research claim of the GPT-3 era — and the most thoroughly re-litigated one. In 2022, Wei et al. argued that certain capabilities appear suddenly past a compute threshold. In 2023, Schaeffer, Miranda, and Koyejo argued most of those jumps were measurement artifacts of sharp metrics. By 2026, with frontier reasoning models like o3, Claude Sonnet 4.5, and Llama 4 Behemoth on the table, the question matters more, not less: which capabilities can you forecast from a smaller model, and which will surprise you only after the run is finished and the weights are released?
This post walks through the original emergence paper, the mirage critique, what survived it, the updated 2024–2026 evidence on capability prediction, how scaling laws now treat pre-training, post-training, and test-time compute as three separate axes, and how to design an eval suite that surfaces real capability shifts instead of measurement artifacts. The goal is a working mental model that holds up against frontier-lab capability reports without quietly inheriting either side’s marketing. Where claims are still genuinely contested in the 2026 literature, the post says so, names the open questions, and points at the public eval programs at METR, Apollo Research, and the UK AI Security Institute that are doing the actual reproduction and elicitation work the question now hinges on.
Why the Emergence Debate Still Matters in 2026
Emergent abilities in LLMs matter because every frontier training run now costs nine figures and every deployment decision now needs a capability forecast. If the curve is smooth, you can predict — Chinchilla-style — what a 10x bigger run will do. If the curve has true phase changes, you cannot. That single distinction shapes pre-deployment safety evals, RL post-training budgets, and the entire “capability elicitation” workflow at Anthropic, OpenAI, and DeepMind.
The original framing came from Wei et al. 2022 (arXiv 2206.07682). They surveyed 137 BIG-Bench tasks across GPT-3, LaMDA, Gopher, Chinchilla, and PaLM and identified a class of behaviors where accuracy stays near chance for several orders of magnitude of training FLOPs and then jumps sharply. Three-digit arithmetic, IPA transliteration, word unscrambling, and Persian QA all fit this pattern in the original paper. Chain-of-thought prompting was reported as similarly “unlocking” only past a 60–100B parameter scale.
That paper drove a year of capability speculation. It also drove a year of safety arguments built on the premise that the next capability — deceptive alignment, autonomous research, persuasion — might also appear without warning at the next scale. Whether that premise is true depends entirely on whether emergence is a property of models or a property of metrics.
The Mirage Critique — And What It Actually Showed
The mirage critique, formalized in Schaeffer, Miranda, and Koyejo (NeurIPS 2023, arXiv 2304.15004), says that most reported emergent abilities are artifacts of nonlinear, discontinuous evaluation metrics rather than discontinuities in the underlying model. Swap exact-match accuracy for token-level log probability on the same task, and the phase change disappears. The model has been improving smoothly all along; the metric was hiding it.

The reproduction the field actually replicated
Schaeffer et al. reproduced the original “phase change” on multi-digit arithmetic using the InstructGPT/GPT-3 family. With exact-string-match accuracy (get every digit right or score zero) the curve looks like Wei et al.’s. With a linear, partial-credit metric on the same outputs (the paper uses token edit distance, and frames its analysis in terms of the per-token probability of the correct answer), the same models produce a clean monotone improvement going back to the smallest checkpoints. Same models. Same task. Different metric. Different verdict on emergence.
They argue the mechanism is mechanical: if the model needs to emit K correct tokens and each token is correct with probability p, the exact-match accuracy is roughly p^K. As p improves smoothly, p^K stays near zero until p gets close to one, and then it climbs steeply. Sharp metric, smooth ability, illusion of emergence.
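The p^K relationship is easy to check numerically. The following is a minimal sketch, assuming independent per-token errors and an illustrative answer length of K = 10 tokens; both are simplifying assumptions made here, not values taken from the paper.

```python
# Exact-match accuracy under the simplifying assumption that each of the
# k answer tokens is correct independently with probability p.
def exact_match_rate(p: float, k: int) -> float:
    return p ** k

K = 10  # illustrative answer length in tokens (assumption)
for p in (0.5, 0.6, 0.7, 0.8, 0.9, 0.95, 0.99):
    print(f"per-token p={p:.2f} -> exact match={exact_match_rate(p, K):.3f}")
```

Per-token accuracy can rise from 0.5 to 0.9 while exact match stays below 0.35; most of the visible rise in the sharp metric is packed into the last few increments of p.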
What this does and does not refute
The mirage critique refutes one specific claim: “you could not have predicted this capability from smaller models.” For arithmetic, transliteration, and many BIG-Bench tasks, you could have predicted it; you were just measuring with the wrong yardstick. The critique does not refute the broader observation that capabilities like in-context learning, chain-of-thought, and instruction following look qualitatively different at scale. Several follow-ups, including BIG-Bench Hard analyses, find that for a residual set of tasks the smoothed metric also shows a knee. That set is smaller than Wei et al. claimed, but it is not empty.
The honest summary by 2026: roughly two-thirds of the original emergent-ability claims reduce to metric artifacts, the remaining one-third look like genuine compute-scaling regime changes, and the field has stopped treating “emergence” as a binary.
Why the original framing stuck despite the critique
The Wei et al. paper landed in a vacuum of intuition about what to expect from GPT-3-to-PaLM-scale models. Even after the mirage critique, the underlying observation — that you cannot prompt a small model to do certain things and you can prompt a large one — remained true at the user-facing level. The mirage paper rescued forecastability, not surprise. A practitioner whose mental model was “small model fails, large model works” still got correct predictions about which side of the divide a given checkpoint sits on. That practical loyalty kept the original terminology alive in the literature even as the formal claims were narrowed.
There is also a measurement-economics reason. Smoothed metrics require per-token log probabilities, which require white-box model access. Sharp metrics need only API output. Most external researchers and most journalists were stuck with sharp metrics, so the discontinuity story dominated downstream discussion long after the inside-view consensus had shifted.
Capability Prediction in 2026 — The Pipeline Frontier Labs Actually Use
By 2026, the standard pre-training workflow at frontier labs includes a “capability forecast” stage that uses smoothed metrics fit to small-scale runs to predict large-scale performance. The OpenAI GPT-4 Technical Report was the first public demonstration: it reports predicting GPT-4’s final pre-training loss from runs using up to 10,000x less compute, and the mean log pass rate on a subset of HumanEval from runs using up to 1,000x less compute. Both predictions landed close to the measured values.

The 2026 capability-forecast pipeline has six recognizable stages:
- Anchor models. Train 4–8 models at log-spaced compute scales (typically 10^20 to 10^23 FLOPs), holding architecture and data mix constant. These are the cheap data points the extrapolation hangs on.
- Smooth task metrics. For every task in the eval suite, define a metric that supports partial credit: token log-probability, Brier score, edit distance, semantic similarity, or rubric-graded reasoning chains. Exact-match accuracy is also tracked, but only as a secondary indicator.
- Fit a parametric scaling law. The canonical form, from Hoffmann et al. 2022 (Chinchilla, arXiv 2203.15556), is
  $$L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}$$
  where $E$ is the irreducible loss, $N$ is parameters, $D$ is training tokens, and $\alpha$, $\beta$ are empirically fit (typically $\alpha \approx 0.34$, $\beta \approx 0.28$). For tasks, an analogous power-law-plus-floor is fit.
- Extrapolate to the target run. Project the smooth metric forward to the frontier compute budget, with uncertainty bands.
- Run the frontier model. Evaluate on sharp and smooth metrics both. Sharp may show a phase change even when the prediction was correct.
- Audit surprises. Anything outside the predicted band — better or worse — gets flagged for capability elicitation by red-team groups (Anthropic’s internal evals, Apollo Research, METR’s Autonomy Evaluation Resources) before deployment.
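The fit-and-extrapolate stages above can be sketched in a few lines. This is a hedged illustration, not any lab's actual pipeline: it fits a single-variable power law plus floor, loss = E + A * C^(-alpha), to synthetic anchor points by grid-searching the floor E and running ordinary least squares in log-log space. The function names, the synthetic constants, and the grid are all assumptions chosen for the example.

```python
import math

def fit_power_law_with_floor(compute, loss, floor_grid):
    """Fit loss ~ E + A * compute**(-alpha) by grid search over the floor E.

    For a fixed E, log(loss - E) is linear in log(compute), so A and alpha
    come from ordinary least squares on the log-log points.
    """
    xs = [math.log(c) for c in compute]
    n = len(xs)
    mx = sum(xs) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    best = None
    for E in floor_grid:
        if E >= min(loss):
            continue  # log(loss - E) would be undefined
        ys = [math.log(l - E) for l in loss]
        my = sum(ys) / n
        slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
        intercept = my - slope * mx
        sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
        if best is None or sse < best[0]:
            best = (sse, E, math.exp(intercept), -slope)
    _, E, A, alpha = best
    return E, A, alpha

def predict(compute, E, A, alpha):
    return E + A * compute ** (-alpha)

# Synthetic anchor runs generated from a known law (illustrative values only).
TRUE_E, TRUE_A, TRUE_ALPHA = 1.7, 400.0, 0.12
anchors = [1e20, 3e20, 1e21, 3e21, 1e22, 1e23]
losses = [predict(c, TRUE_E, TRUE_A, TRUE_ALPHA) for c in anchors]

floor_grid = [i / 100 for i in range(100, 200)]  # candidate E from 1.00 to 1.99
E, A, alpha = fit_power_law_with_floor(anchors, losses, floor_grid)
print(f"E={E:.2f} A={A:.1f} alpha={alpha:.3f}")
print(f"predicted loss at 1e25 FLOPs: {predict(1e25, E, A, alpha):.4f}")
```

With real runs the anchor points are noisy, so a fit like this would need an uncertainty estimate (for example, refitting on resampled anchors) before the extrapolated value is compared against the frontier run. The sketch omits that step.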
The METR work is particularly useful as a public reference. Their time-horizon scaling report tracks the length of agentic tasks frontier models can complete autonomously and finds it roughly doubling every seven months on a smooth log scale. That is not pre-training emergence; it is a smooth capability trajectory you can plan against.
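To make "plan against" concrete, a constant doubling time translates directly into a projection formula. The sketch below assumes the seven-month doubling quoted above holds unchanged, and uses a hypothetical 60-minute starting horizon; neither the starting value nor the function name comes from METR.

```python
def projected_horizon(h0_minutes: float, months_ahead: float,
                      doubling_months: float = 7.0) -> float:
    """Extrapolate a task time horizon under constant exponential doubling."""
    return h0_minutes * 2 ** (months_ahead / doubling_months)

# Hypothetical starting point: a 60-minute horizon today.
for months in (0, 7, 14, 28):
    print(months, "months ->", projected_horizon(60, months), "minutes")
```

The projection is only as good as the assumption that the doubling time stays fixed, which is exactly the extrapolation risk discussed later in this post.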
Scaling Laws Now Have Four Axes, Not One
Modern scaling discussion has split a single “compute” axis into four interacting ones: parameter count and pre-training tokens (which together make up pre-training compute), post-training compute, and test-time compute. The original Kaplan et al. 2020 scaling laws (arXiv 2001.08361) collapsed everything into a single FLOP budget. Chinchilla refined the optimum to roughly 20 tokens per parameter. The 2024–2026 picture pulls the axes apart, because each behaves differently and each has a different cost.

Pre-training compute (N and D)
This is the axis Kaplan, Chinchilla, and Wei et al. were arguing about. The 2022 Chinchilla retraining found the Kaplan-optimal models were undertrained — at a fixed compute budget you want a smaller model trained on more tokens than Kaplan suggested. The Chinchilla constants alpha ~ 0.34, beta ~ 0.28 imply a roughly 1:1 scaling between N and D for compute-optimality. Real frontier training runs in 2024–2026 (Llama 3 405B on 15T tokens, DeepSeek V3 on 14.8T) overshoot Chinchilla on the data side because inference cost matters and serving a smaller, longer-trained model is cheaper.
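The 20-tokens-per-parameter rule can be turned into a budget calculator. This is a rough sketch that assumes the common approximation C ≈ 6·N·D for training FLOPs together with D ≈ 20·N; the function names are made up for illustration, and the 1e24 FLOP budget is arbitrary.

```python
import math

def chinchilla_allocation(flops: float, tokens_per_param: float = 20.0):
    """Split a training budget using C ~= 6*N*D and D ~= tokens_per_param * N."""
    n_params = math.sqrt(flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

def data_ratio(n_params: float, n_tokens: float) -> float:
    """Training tokens per parameter for an actual run."""
    return n_tokens / n_params

n, d = chinchilla_allocation(1e24)
print(f"1e24 FLOPs -> {n:.3g} params, {d:.3g} tokens")
print(f"Llama 3 405B on 15T tokens: {data_ratio(405e9, 15e12):.0f} tokens/param")
```

Comparing a real run's tokens-per-parameter ratio against 20 gives a quick read on how far it overshoots the compute-optimal point in favor of cheaper inference.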
Post-training compute (RLHF, DPO, RLVR)
A 70B model after preference optimization is qualitatively different from the same 70B base model. The capability shift comes from comparatively tiny compute, typically 1–5% of pre-training, but produces gains on benchmarks that the base model could not touch. Recent work on Reinforcement Learning with Verifiable Rewards (RLVR), pioneered in DeepSeekMath and scaled in DeepSeek-R1, has its own scaling laws. The relationship between post-training compute and downstream task accuracy is smoother than people expected; once you have a working RL recipe, more steps reliably help, up to a saturation point. For a deeper look at how preference data shapes these gains, see our DPO vs RLHF vs SFT alignment benchmark.
Test-time compute
This is the new axis that broke the prediction frameworks. OpenAI’s o1 (Sept 2024), o3 (Dec 2024), and the explicit extended thinking mode in Anthropic’s Claude Sonnet/Opus 4 series demonstrated that giving the model 10x more reasoning tokens at inference yields large, predictable gains on math, code, and agentic benchmarks. The gain follows a log law: doubling reasoning tokens produces a roughly constant absolute accuracy gain on AIME, GPQA Diamond, and SWE-bench Verified, up to a task-dependent ceiling. See our breakdown of reasoning model behavior in Llama 4, DeepSeek V3, and Claude Sonnet on industrial benchmarks for concrete numbers across tasks.
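The log law described here has a simple functional form: a fixed accuracy increment per doubling of reasoning tokens, capped at a ceiling. The sketch below uses invented constants purely to show the shape; real gains per doubling and ceilings are task- and recipe-dependent and would have to be fit from measurements.

```python
import math

def accuracy_at_budget(tokens: float, base_acc: float, base_tokens: float,
                       gain_per_doubling: float, ceiling: float) -> float:
    """Log-linear test-time scaling: a fixed accuracy gain per doubling, capped."""
    doublings = math.log2(tokens / base_tokens)
    return min(ceiling, base_acc + gain_per_doubling * doublings)

# Hypothetical constants for illustration only.
for budget in (1_000, 2_000, 4_000, 16_000, 256_000):
    acc = accuracy_at_budget(budget, base_acc=0.40, base_tokens=1_000,
                             gain_per_doubling=0.05, ceiling=0.75)
    print(f"{budget:>7} reasoning tokens -> accuracy {acc:.2f}")
```

Under this form, each additional point of accuracy costs exponentially more tokens, which is why test-time compute needs its own budget line.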
Architecture as a multiplier
Across all three compute axes, the choice of architecture shifts the constants. Sparse mixture-of-experts models like DeepSeek V3 and Llama 4 Behemoth get more capability per training FLOP than dense equivalents because most parameters are inactive per token. The trade-offs and inference economics of this design are covered in our mixture-of-experts architecture guide.
What this means for emergence claims is that the relevant axis is rarely “parameter count” in isolation. A 671B-parameter MoE with 37B active and a dense 70B can have similar per-token compute and similar capability profiles even though their parameter counts differ by an order of magnitude. Any emergence plot that uses raw parameter count without normalizing to training FLOPs or active-parameter compute will produce misleading curves and either over- or under-attribute capability changes to scale.
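One way to do that normalization is to plot against approximate training FLOPs computed from active parameters. The sketch below assumes the common rule of thumb of about 6 FLOPs per active parameter per training token. The MoE row uses the 671B / 37B-active / 14.8T-token figures quoted in this post, while the dense row's 15T token count is a hypothetical chosen for comparison.

```python
def training_flops(active_params: float, tokens: float) -> float:
    """Common approximation: about 6 FLOPs per active parameter per training token."""
    return 6.0 * active_params * tokens

# name -> (total params, active params per token, training tokens)
models = {
    "MoE 671B (37B active)": (671e9, 37e9, 14.8e12),
    "dense 70B": (70e9, 70e9, 15e12),  # token count is hypothetical
}
for name, (total, active, tokens) in models.items():
    print(f"{name}: total={total:.3g}, active={active:.3g}, "
          f"train FLOPs ~ {training_flops(active, tokens):.2e}")
```

On a total-parameter axis these two models sit roughly 10x apart; on a training-FLOP axis they are within about 2x of each other, which changes where any apparent "threshold" lands.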
What Survived as Genuine Emergence
A short list of capabilities that 2026 evidence still treats as genuinely emergent in the sense that they appear at a scale threshold and are hard to predict from smaller models on any metric:
- In-context learning. The ability to perform a task from a few examples in the prompt without weight updates. Reliably appears in the 1B–10B parameter range and improves with scale. The smoothed-metric critique narrowed but did not eliminate the discontinuity here.
- Chain-of-thought reasoning. When prompted with “think step by step,” small models often perform worse than direct answering; large models perform substantially better. The crossover happens at a scale threshold and is not a metric artifact, though the size of the threshold has moved down dramatically since 2022.
- Instruction following after SFT/DPO. Pre-training alone produces models that complete prompts. Following a typed instruction is a post-training capability that does not reliably exist before supervised fine-tuning on well-curated instruction-response pairs; published recipes range from roughly a thousand to a few million pairs.
- Tool use and agentic behavior. The METR time-horizon curves show a real, recent jump corresponding to RL fine-tuning on multi-step tasks. Predictable in retrospect, hard to predict from base-model-only metrics.
The pattern is consistent: things that involve a behavioral mode change (use these examples, follow this instruction, plan multiple steps) emerge less smoothly than pure knowledge accumulation, which is well-described by Chinchilla curves.
A useful framing borrowed from the post-Schaeffer literature: separate “competence emergence” from “elicitation emergence.” Competence emergence is the model becoming capable of the task at all — and is mostly what the smooth metric tracks. Elicitation emergence is the prompt or training procedure becoming able to surface the capability — and is often discontinuous because it depends on phenomena like the model finally distinguishing instructions from completions, or finally following a chain-of-thought directive instead of pattern-matching it back as a question. The metric artifact critique applies cleanly to competence emergence. Elicitation emergence is more genuinely discontinuous because it depends on a behavioral switch the model either flips or doesn’t.
What We Still Cannot Predict
Five capability dimensions where 2026 forecasting genuinely struggles. These are the open problems driving the Anthropic, Apollo, METR, and UK AISI eval programs.
- Value alignment shifts under scale and post-training. A base model’s revealed preferences are not a reliable predictor of an RLHF’d model’s preferences. Adding RLVR on code can also shift behavior in unrelated domains in ways that are not captured by the reward function. Anthropic’s Sleeper Agents work is a related data point: it shows that trained-in behaviors can persist through subsequent safety training.
- Deceptive capabilities. Whether a model can recognize an eval and strategically underperform (“sandbagging”) is itself a capability that has appeared in 2024–2025 frontier models. Apollo’s evaluations explicitly probe for it.
- Jailbreak susceptibility. Newer larger models are not monotonically more robust. Capability and safety scale together but on different curves, and the gap is not predictable from the pre-training loss.
- Cross-capability transfer. RL training on math problems sometimes improves unrelated reasoning; sometimes it does nothing. The conditions are not characterized.
- Long-horizon autonomous behavior past 2–4 hour task lengths. METR’s smooth curve holds for tasks of bounded complexity. Whether it continues past human-day-length tasks is an extrapolation across the regime where most safety arguments live.
Designing an Eval Suite That Surfaces Real Capability Shifts
The right eval suite reports both sharp and smooth metrics for every task and flags any divergence. The decision matrix below is what most 2026 capability teams converge on after a few iterations.

A practical rubric, drawn from public methodology notes by METR and the UK AI Security Institute:
- For any task where the production decision is binary (the agent shipped the PR or didn’t), report exact-match or pass@1 — but always report a smooth companion metric on the same task. The smooth metric is what you forecast; the sharp metric is what you decide on.
- For multi-component capabilities (math + tool use + planning), decompose into sub-tasks with their own metrics. Aggregating them into a single number hides the per-component curves.
- Calibrate sample size to the metric’s noise floor. Pass@1 on 100 problems has wider error bars than the headline number suggests. METR explicitly reports interval estimates.
- Stress-test the metric itself. Take an old checkpoint, swap the metric for a smoother version, and see whether the previously-reported “phase change” survives. If it doesn’t, the metric was the story.
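The sample-size point in the rubric above can be quantified with a standard binomial interval. The sketch below uses the Wilson score interval; the 60% pass rate and the problem counts are illustrative numbers, not results from any eval.

```python
import math

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """Wilson score interval for a binomial proportion (z=1.96 is about 95%)."""
    p_hat = successes / n
    denom = 1.0 + z * z / n
    center = (p_hat + z * z / (2 * n)) / denom
    half = (z / denom) * math.sqrt(p_hat * (1 - p_hat) / n + z * z / (4 * n * n))
    return center - half, center + half

# Same 60% pass rate measured on two suite sizes.
for passed, n in ((60, 100), (600, 1000)):
    low, high = wilson_interval(passed, n)
    print(f"pass@1 = {passed / n:.2f} on {n} problems: 95% interval ({low:.3f}, {high:.3f})")
```

On 100 problems a measured pass@1 of 0.60 is compatible with a true rate anywhere from roughly 0.50 to 0.69, so a few points of apparent improvement between checkpoints is well inside the noise.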
A clean Python sketch of the dual-metric protocol Schaeffer et al. recommend:
# pseudocode-but-runnable: dual-metric eval for one task
# `model` is any object exposing generate(), score_continuation(), and tokenize();
# these are placeholder method names, not a specific library's API.
def eval_task(model, prompts, answers):
    correct = 0
    total_logp = 0.0
    n_tokens = 0
    for prompt, answer in zip(prompts, answers):
        answer_len = len(model.tokenize(answer))  # budget in tokens, not characters
        # sharp: exact match
        prediction = model.generate(prompt, max_new_tokens=answer_len)
        if prediction.strip() == answer.strip():
            correct += 1
        # smooth: average log-prob of the target tokens
        logp = model.score_continuation(prompt, answer)  # sum log p(tok | prefix)
        total_logp += logp
        n_tokens += answer_len
    return {
        "exact_match": correct / len(prompts),        # SHARP
        "avg_logp_per_token": total_logp / n_tokens,  # SMOOTH
    }
If the exact-match curve across model scales has a knee and the per-token log-prob curve does not, you have a metric artifact, not an emergent capability. If both show a knee, you have something worth a closer look.
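That comparison can be automated with a crude knee detector. The heuristic below is an assumption made for illustration, not a method from Schaeffer et al.: it measures what share of a curve's total improvement happens in its single largest step, and the 0.6 threshold and the example curves are invented. It assumes both curves improve from the smallest scale to the largest.

```python
def largest_step_share(values):
    """Share of the total first-to-last change that occurs in one step.

    About 1/(len-1) for an evenly rising curve, near 1.0 for a single jump.
    Assumes the curve improves overall from first to last point.
    """
    total = values[-1] - values[0]
    if total == 0:
        return 0.0
    steps = [b - a for a, b in zip(values, values[1:])]
    return max(steps) / total

def classify(sharp_curve, smooth_curve, threshold=0.6):
    sharp_knee = largest_step_share(sharp_curve) > threshold
    smooth_knee = largest_step_share(smooth_curve) > threshold
    if sharp_knee and not smooth_knee:
        return "metric artifact"
    if sharp_knee and smooth_knee:
        return "worth a closer look"
    return "no knee"

# Made-up curves across five model scales, smallest to largest.
exact_match = [0.00, 0.00, 0.01, 0.05, 0.80]
avg_logp = [-2.5, -2.0, -1.5, -1.0, -0.5]
print(classify(exact_match, avg_logp))
```

A single-step heuristic like this is sensitive to how the scales are spaced, so it works best as a flag for manual review of the two curves, not as a verdict.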
The 2022 to 2026 Timeline of Emergence Claims

The compressed history: Kaplan 2020 said pre-training loss is smooth. Wei 2022 said many downstream capabilities are not. Chinchilla 2022 said you’ve been undertraining. Schaeffer 2023 said many of those non-smooth curves were the metrics’ fault. 2024 brought test-time compute as a third axis that broke the original framing. 2025 standardized capability elicitation as a pre-deployment step. By 2026, “emergence” is rarely used as a load-bearing word in technical papers — researchers say “capability appears at scale X on metric Y” and quote both the smooth and sharp curves.
The intellectual move that took four years was simple but slow. Move from “what new abilities emerge from scale” to “what new abilities are revealed when we measure the same model with a less brittle yardstick, on a different axis of compute, with a different elicitation procedure.” Each clause in that sentence corresponds to a year of papers correcting an earlier overclaim. The remaining genuine emergence claims are sharper, smaller in number, and tied to behavioral mode changes that the smooth metrics also pick up — which is exactly the kind of claim that survives a serious replication push.
Trade-offs and Failure Modes
The capability-forecasting framework is useful, not foolproof. Six honest failure modes that any team running this pipeline will hit.
- Data contamination silently breaks the extrapolation. If your smaller anchor models were trained on a slightly different mix than the frontier run, the scaling-law constants do not transfer. The bigger model may also have seen part of the eval set if the data pipeline is loose.
- Smooth metrics can hide real capability gaps. A model with 90% per-token accuracy can be useless if the token it gets wrong is the one that matters, such as the last digit of an answer. Production cares about the integrated outcome, which sometimes is the sharp metric.
- Post-training breaks pre-training extrapolations. Predicting the base model’s loss is solved. Predicting the deployed model’s MMLU after RLHF and a reasoning post-train is much weaker.
- Test-time-compute scaling laws are recipe-dependent. o1’s curve and DeepSeek-R1’s curve have similar shape but different constants. There is no universal exponent for “how much does extended thinking help.”
- Capability elicitation is itself unsolved. A model can have a capability that no prompt in the eval reliably elicits. Anthropic and Apollo treat this as the dominant source of uncertainty in 2026 capability reports.
- Open-weight ecosystem drift. Once a base model is released, the community fine-tunes it dozens of ways within weeks. The deployed-capability distribution for an open-weight Llama 4 or DeepSeek V3 model is not the lab’s pre-deployment report. Any forecast that stops at the lab’s release is incomplete.
Practical Recommendations
For a team using or building on frontier LLMs in 2026, the takeaways collapse to a short list.
- Treat any single capability number as half a measurement. Demand the smooth companion metric or generate it yourself.
- When you read a frontier-lab capability report, look for whether they reported the smoothed extrapolation. If they did and they were right, the forecasting works. If they only report sharp numbers, treat the result as a measurement, not a forecast.
- Budget separately for pre-training, post-training, and test-time compute scaling. They are not interchangeable and they have different cost-per-capability curves.
- For internal evals, instrument every task with at least one sharp and one smooth metric. Flag divergences.
- For safety-relevant capability claims, run elicitation before you accept “the model can’t do X.” Sandbagging is a real failure mode now.
Quick checklist:
[ ] Anchor runs at 4-8 log-spaced compute scales
[ ] Every task instrumented with sharp + smooth metric
[ ] Scaling-law fit to smooth metric with uncertainty band
[ ] Post-training shift measured separately from pre-train forecast
[ ] Test-time-compute scaling characterized for reasoning tasks
[ ] Sandbagging / capability elicitation pass before deployment
[ ] Surprises (>1 sigma) trigger red-team review, not auto-deploy
FAQ
Are emergent abilities in LLMs real or just a metric artifact?
Both. The Schaeffer et al. 2023 mirage critique demonstrated that many original Wei et al. 2022 emergence claims — arithmetic, transliteration, word unscrambling — disappear when you switch from exact-match accuracy to token-level log probability. A residual set of capabilities, notably in-context learning, chain-of-thought, and instruction following, still appear at scale thresholds even with smooth metrics. By 2026 the field treats emergence as task-dependent and metric-sensitive rather than a universal property of scale.
What are scaling laws for LLMs in 2026?
Scaling laws in 2026 describe how loss and downstream capability change with four interacting axes: model parameters (N), pre-training tokens (D), post-training compute (RLHF/RLVR), and t
