[Image: Messy workshop desk with tangled wires and gears, overlaid by an exam sheet showing A+ scores and a 'Benchmark Success: Perfect Score' stamp.]

The case against reading benchmark scores as capability

Scores of frontier language models have risen very rapidly on most published benchmarks. It does not follow that capability has risen at the same rate, and the gap between the two is the single most misread fact in the current AI discussion.

by Martynas Kasiulis
April 23, 2026
in Tech

Between 2023 and 2024, the leading AI system’s score on SWE-bench moved from 4.4 per cent to 71.7 per cent. On GPQA, a benchmark of graduate-level scientific questions, the leading score rose by nearly fifty points in a year. On MMLU, a broad multitask benchmark, the frontier now saturates near the ceiling the benchmark is capable of measuring, which is a few points below one hundred per cent because a non-trivial fraction of the questions have demonstrable errors in their reference answers.

Those figures are routinely reported as evidence that AI capability has risen by a commensurate amount. That inference is the mistake this piece is about.


THE ARGUMENT

A benchmark score is a compressed signal about performance on a specific, often flawed, closed test. It is not capability, and treating it as such repeats an old methodological error. Standardised tests have always been good at measuring performance on standardised tests. They have always been less good at predicting performance in the open-world tasks they claim to proxy for. The mismatch between what a test measures and what it claims to stand in for is a deep property of tests, not a flaw unique to any one benchmark.

The mismatch has three concrete components in the current AI context.


ONE: THE BENCHMARKS ARE SATURATING AT THE TOP

Most of the headline benchmarks are hitting their ceilings. MMLU’s usable ceiling is around 91 per cent because of documented errors in the test items. GPQA’s uncontroversially correct ceiling is around 80 per cent. Frontier systems now cluster in the high eighties or low nineties on both. When the distance between a leading model and a following model is two points on a test whose uncontroversial ceiling is only a few points higher, the benchmark is not discriminating between the models. It is recording that both have exhausted most of the useful signal available from the instrument.
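The arithmetic behind that claim can be made concrete. The following sketch is purely illustrative: the 500-item benchmark size, the 90 and 88 per cent scores, and the normal approximation are assumptions for the example, not figures from any leaderboard. It shows that a two-point gap between two models sits inside ordinary sampling noise:

```python
import math

def score_ci(p: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation 95% confidence interval for an accuracy p
    measured on a benchmark of n items."""
    se = math.sqrt(p * (1 - p) / n)  # binomial standard error
    return (p - z * se, p + z * se)

# Hypothetical numbers: two models two points apart on a 500-item test.
lo_a, hi_a = score_ci(0.90, 500)
lo_b, hi_b = score_ci(0.88, 500)
print(f"Model A: {lo_a:.3f}-{hi_a:.3f}")
print(f"Model B: {lo_b:.3f}-{hi_b:.3f}")
```

With these assumed figures the two intervals overlap by several points, so the instrument cannot distinguish the models at all; the overlap only widens as scores climb further toward the ceiling and the remaining discriminative items dwindle.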


TWO: THE BENCHMARKS THEMSELVES CONTAIN ERRORS

A 2024 audit of MMLU by researchers at Edinburgh and elsewhere found that 6.49 per cent of the benchmark’s questions contain errors: keyed answers that are wrong, not uniquely correct, or ambiguous enough to make scoring unreliable. This is not a criticism of MMLU alone. Benchmarks at this scale are built by humans on finite schedules, and their error rates are roughly what one would expect. The implication is that any model scoring in the top few per cent of a large multiple-choice benchmark is, in part, being scored on the benchmark’s mistakes. The fine-grained distinctions in the leaderboard race are made, to a non-trivial extent, out of noise.
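A toy simulation makes the noise claim explicit. The parameters here are illustrative assumptions, not the audit's methodology: a 1,000-item benchmark carrying the reported 6.49 per cent error rate, four answer choices, and two models of identical true capability that both solve every sound item. On the flawed items, agreeing with a wrong or ambiguous key is modelled as chance, so the two identical models still separate on the leaderboard:

```python
import random

def noisy_gap(n_items=1000, error_rate=0.0649, chance=0.25, seed=0):
    """Score difference between two equally capable models: both solve
    every sound item; on items with a flawed key, agreement with the
    key happens only at the chance rate."""
    rng = random.Random(seed)
    n_err = round(n_items * error_rate)  # flawed items, e.g. 65 of 1000

    def score():
        hits = sum(rng.random() < chance for _ in range(n_err))
        return (n_items - n_err + hits) / n_items

    return score() - score()

gaps = [abs(noisy_gap(seed=s)) for s in range(200)]
print(f"typical spurious gap: {sum(gaps) / len(gaps):.4f}")
```

Under these assumptions the spurious gap between identical models is typically a few tenths of a point, which is exactly the margin by which leaderboard positions are often decided.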


THREE: THE GAP BETWEEN TEST ITEMS AND OPEN-WORLD TASKS

A benchmark item is a closed question with a defined correct answer that can be scored automatically. An open-world task, by contrast, has ambiguous goals, distributed context, multiple stakeholders, long horizons, and success criteria that stabilise only in hindsight. Debugging a messy legacy codebase, reviewing a scientific paper, negotiating a contract, or maintaining a research programme across six months does not resemble answering benchmark items except at the most superficial level.

There is no reason, in principle, why performance on closed benchmarks should transfer to open-world tasks at the same rate. There is mounting evidence that it does not. Adversarial-encoding work from 2025 shows that small perturbations to benchmark question format can significantly reduce frontier-model scores. That suggests performance on the original items depends on features of the test format as much as on the underlying reasoning. This is not a refutation of capability. It is a reminder that the capability being measured is narrower than the benchmark label implies.


WHAT FOLLOWS

None of this amounts to the claim that AI progress is fake, or that current systems are not genuinely more capable than their predecessors. On many tasks, they clearly are. The argument is narrower and more precise: the specific artefact most often used to characterise the pace of progress, the rising line on a benchmark chart, is a weak instrument for the claim it is asked to carry. It over-reads in the middle of the curve because of format effects. It over-reads at the top because of saturation. And it almost always over-reads in the step from “did well on items of this form” to “can perform a task of this kind.”

The honest response is not scepticism of the technology. It is scepticism of one specific reading instrument. A benchmark chart is a useful summary of progress on a particular set of closed tasks under particular conditions. It is not a capability chart. Treating it as one, in investment memos, policy briefings, and labour-market commentary, is the category error from which many serious errors in the current AI discussion descend.


PRIMARY SOURCES

  • Stanford HAI. The 2025 AI Index Report — Technical Performance.
  • Gema AP, Leang JOJ, Hong G, et al. “Are We Done with MMLU?” arXiv:2406.04127, 2024.
  • Ivanov I, Volkov D. “Resurrecting saturated LLM benchmarks with adversarial encoding.” arXiv:2502.06738, 2025.

© 2026 VeyrZest - Premium Lifestyle Magazine. Website by Digibru.