Between 2023 and 2024, the top score on SWE-bench, a benchmark of software-engineering tasks drawn from real GitHub issues, rose from 4.4 per cent to 71.7 per cent. On GPQA, a benchmark of graduate-level scientific questions, the leading score rose by nearly fifty points in a year. On MMLU, a broad multitask benchmark, the frontier now sits against the highest score the benchmark can meaningfully register, which is a few points below one hundred per cent because a non-trivial fraction of the questions have demonstrable errors in their reference answers.
Those figures are routinely reported as evidence that AI capability has risen by a commensurate amount. That inference is the mistake this piece is about.
THE ARGUMENT
A benchmark score is a compressed signal about performance on a specific, often flawed, closed test. It is not capability, and treating it as such repeats an old methodological error. Standardised tests have always been good at measuring performance on standardised tests. They have always been less good at predicting performance in the open-world tasks they claim to proxy for. The mismatch between what a test measures and what it claims to stand in for is a deep property of tests, not a flaw unique to any one benchmark.
The mismatch has three concrete components in the current AI context.
ONE: THE BENCHMARKS ARE SATURATING AT THE TOP
Most of the headline benchmarks are hitting their ceilings. MMLU’s usable ceiling is around 91 per cent because of documented errors in the test items. GPQA’s uncontroversially correct ceiling is around 80 per cent. Frontier systems now cluster in the high eighties or low nineties on both. When the distance between a leading model and a following model is two points on a test whose uncontroversial ceiling is only a few points higher, the benchmark is not discriminating between the models. It is recording that both have exhausted most of the useful signal available from the instrument.
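To put rough numbers on that, here is a back-of-the-envelope sketch rather than a figure from any source cited here: it treats each benchmark item as an independent coin flip, a simplification real benchmarks violate, and uses the commonly cited item counts of GPQA-diamond (roughly 200 questions) and the MMLU test split (roughly 14,000 questions) to ask how much of a two-point gap could be sampling noise alone.

    # Back-of-the-envelope: how much of a two-point gap between two models
    # near the ceiling could be plain sampling noise on a finite benchmark?
    # Assumes items are scored independently, which real benchmarks are not;
    # the item counts are the commonly cited sizes of each test set.
    import math

    def score_standard_error(accuracy: float, n_items: int) -> float:
        # Binomial standard error of an observed accuracy over n_items items.
        return math.sqrt(accuracy * (1 - accuracy) / n_items)

    for name, n_items in [("GPQA-diamond (~200 items)", 198),
                          ("MMLU test split (~14,000 items)", 14_042)]:
        se = score_standard_error(0.88, n_items)    # a model scoring in the high eighties
        gap_se = math.sqrt(2) * se                  # noise on the difference of two such scores
        print(f"{name}: about {100 * se:.1f} points of noise per model, "
              f"about {100 * gap_se:.1f} points on a head-to-head gap")

On a test the size of GPQA-diamond, the noise on a head-to-head gap comes out on the order of three points, larger than many of the gaps being ranked on; on MMLU the sampling noise is small, and the ceiling and item-error arguments carry the weight instead.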
TWO: THE BENCHMARKS THEMSELVES CONTAIN ERRORS
A 2024 audit of MMLU by researchers at Edinburgh and elsewhere found that 6.49 per cent of the questions in its audited sample contain errors: keyed answers that are wrong, not uniquely correct, or ambiguous enough to make scoring unreliable. This is not a criticism of MMLU alone. Benchmarks at this scale are built by humans on finite schedules, and their error rates are roughly what one would expect. The implication is that any model scoring in the top few per cent of a large multiple-choice benchmark is, in part, being scored on the benchmark’s mistakes. The fine-grained distinctions in the leaderboard race are made, to a non-trivial extent, out of noise.
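A rough sense of scale, assuming the audited error rate carries over to the full test split, which is an extrapolation rather than a figure from the paper: the commonly cited size of the MMLU test set is about 14,000 questions, and the arithmetic below compares the erroneous subset to the number of items that separate two models two points apart.

    # Rough scale of MMLU's erroneous subset versus a typical leaderboard gap.
    # The item count and error rate are the published figures; extrapolating the
    # audited error rate to the whole test split is an assumption, and the
    # two-point gap is a hypothetical, not a specific pair of models.
    n_items = 14_042           # questions in the MMLU test split
    error_rate = 0.0649        # share of audited questions with unreliable keys (Gema et al., 2024)
    leaderboard_gap = 0.02     # a two-point gap between two frontier models

    erroneous_items = error_rate * n_items
    items_in_gap = leaderboard_gap * n_items
    print(f"Items with unreliable keys: about {erroneous_items:.0f}")
    print(f"Items separating two models two points apart: about {items_in_gap:.0f}")
    print(f"The unreliable subset is roughly {erroneous_items / items_in_gap:.1f} times "
          f"the size of the gap being ranked on.")

The point is not the precise ratio but its direction: the pool of items whose keys cannot be trusted is several times larger than the entire margin on which the top of the leaderboard is being ordered.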
THREE: THE GAP BETWEEN TEST ITEMS AND OPEN-WORLD TASKS
A benchmark item is a closed question with a defined correct answer that can be scored automatically. An open-world task, by contrast, has ambiguous goals, distributed context, multiple stakeholders, long horizons, and success criteria that stabilise only in hindsight. Debugging a messy legacy codebase, reviewing a scientific paper, negotiating a contract, or maintaining a research programme across six months does not resemble answering benchmark items except at the most superficial level.
There is no reason, in principle, to expect gains on closed benchmarks to translate one-for-one into gains on open-world tasks, and there is mounting evidence that they do not. Adversarial-encoding work from 2025 shows that small perturbations to the format of benchmark questions can significantly reduce frontier-model scores, which suggests that performance on the original items depends on features of the test format as much as on the underlying reasoning. This is not a refutation of capability. It is a reminder that the capability being measured is narrower than the benchmark label implies.
WHAT FOLLOWS
None of this amounts to the claim that AI progress is fake, or that current systems are not genuinely more capable than their predecessors. On many tasks, they clearly are. The argument is narrower and more precise: the specific artefact most often used to characterise the pace of progress, the rising line on a benchmark chart, is a weak instrument for the claim it is asked to carry. It over-reads in the middle of the curve because of format effects. It over-reads at the top because of saturation. And it almost always over-reads in the step from “did well on items of this form” to “can perform a task of this kind.”
The honest response is not scepticism of the technology. It is scepticism of one specific reading instrument. A benchmark chart is a useful summary of progress on a particular set of closed tasks under particular conditions. It is not a capability chart. Treating it as one, in investment memos, policy briefings, and labour-market commentary, is the category error from which many of the more serious mistakes in the current AI discussion descend.
PRIMARY SOURCES
- Stanford HAI. The 2025 AI Index Report — Technical Performance.
- Gema AP, Leang JOJ, Hong G, et al. “Are We Done with MMLU?” arXiv:2406.04127, 2024.
- Ivanov I, Volkov D. “Resurrecting saturated LLM benchmarks with adversarial encoding.” arXiv:2502.06738, 2025.





