VeyrZest
Messy workshop desk with tangled wires and gears, overlaid by an exam sheet showing A+ scores and a 'Benchmark Success: Perfect Score' stamp.

The case against reading benchmark scores as capability

Frontier language models have risen rapidly on most published benchmarks. It does not follow that capability has risen at the same rate, and the gap between the two is the single most misread fact in the current AI discussion.

by Martynas Kasiulis
April 23, 2026
in Tech

Between 2023 and 2024, the leading AI system’s score on SWE-bench moved from 4.4 per cent to 71.7 per cent. On GPQA, a benchmark of graduate-level scientific questions, the leading score rose by nearly fifty percentage points in a year. On MMLU, a broad multitask benchmark, frontier scores now sit close to the highest score the benchmark can meaningfully register. That ceiling lies a few points below one hundred per cent, because a non-trivial fraction of the questions have demonstrable errors in their reference answers.

Those figures are routinely reported as evidence that AI capability has risen by a commensurate amount. That inference is the mistake this piece is about.


THE ARGUMENT

A benchmark score is a compressed signal about performance on a specific, often flawed, closed test. It is not capability, and treating it as such repeats an old methodological error. Standardised tests have always been good at measuring performance on standardised tests. They have always been less good at predicting performance in the open-world tasks they claim to proxy for. The mismatch between what a test measures and what it claims to stand in for is a deep property of tests, not a flaw unique to any one benchmark.

The mismatch has three concrete components in the current AI context.


ONE: THE BENCHMARKS ARE SATURATING AT THE TOP

Most of the headline benchmarks are hitting their ceilings. MMLU’s usable ceiling is around 91 per cent because of documented errors in the test items. GPQA’s uncontroversially correct ceiling is around 80 per cent. Frontier systems now cluster in the high eighties or low nineties on both. When the gap between a leading model and the one behind it is two points, on a test whose uncontroversial ceiling is only a few points higher, the benchmark is no longer discriminating between the models. It is recording that both have exhausted most of the useful signal available from the instrument.
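
The saturation point can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes a chance floor of 25 per cent (four-option multiple choice), takes the 91 per cent usable ceiling quoted above for MMLU, and uses two hypothetical scores, 88 and 90, that are not drawn from any leaderboard. The function name `range_used` is likewise invented for this example.

```python
def range_used(score: float, ceiling: float, floor: float) -> float:
    """Share of the measurable range (floor to ceiling) that a score has consumed."""
    if not floor < ceiling:
        raise ValueError("floor must be below ceiling")
    # Scores outside the measurable range carry no extra signal, so clip them.
    clipped = max(floor, min(score, ceiling))
    return (clipped - floor) / (ceiling - floor)


# Assumptions: 25 is the chance floor for four-option multiple choice,
# 91 is the usable MMLU ceiling quoted in the text, and 88 and 90 are
# hypothetical scores for a trailing and a leading model.
FLOOR, CEILING = 25.0, 91.0
for name, score in [("trailing", 88.0), ("leading", 90.0)]:
    used = range_used(score, CEILING, FLOOR)
    print(f"{name}: {used:.1%} of the measurable range used, "
          f"{CEILING - score:.0f} point(s) of headroom left")
```

Under these assumptions both models have already used more than 95 per cent of the range the instrument can register, which is the sense in which a two-point lead says little.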


TWO: THE BENCHMARKS THEMSELVES CONTAIN ERRORS

A 2024 audit of MMLU by researchers at Edinburgh and elsewhere estimated, from a manually re-annotated sample of the benchmark, that 6.49 per cent of its questions contain errors: keyed answers that are wrong, not uniquely correct, or ambiguous enough to make scoring unreliable. This is not a criticism of MMLU alone. Benchmarks at this scale are built by humans on finite schedules, and their error rates are roughly what one would expect. The implication is that any model scoring within a few points of the maximum on a large multiple-choice benchmark is, in part, being scored on the benchmark’s mistakes. The fine-grained distinctions in the leaderboard race are made, to a non-trivial extent, out of noise.
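
One way to see how label errors blur a leaderboard is to bound the score a model could be reported as having. The sketch below is a simplification, not the audit's method: it assumes that a model earns credit on a flawed item only when it happens to match the flawed key, and the two accuracies on valid items (0.95 and 0.97) are hypothetical. The helper `score_bounds` is invented for this example.

```python
def score_bounds(acc_on_valid: float, error_rate: float) -> tuple[float, float]:
    """Lowest and highest reported accuracy for a given accuracy on valid items."""
    if not (0.0 <= acc_on_valid <= 1.0 and 0.0 <= error_rate <= 1.0):
        raise ValueError("both arguments must lie in [0, 1]")
    # Credit earned on the items whose keys are sound.
    from_valid = acc_on_valid * (1.0 - error_rate)
    # Lowest: the model matches none of the flawed keys.
    # Highest: the model matches every flawed key.
    return from_valid, from_valid + error_rate


ERROR_RATE = 0.0649  # the audit's estimate quoted in the text
# Hypothetical accuracies on the valid items only.
model_a = score_bounds(0.95, ERROR_RATE)
model_b = score_bounds(0.97, ERROR_RATE)

# If the best case for the weaker model exceeds the worst case for the
# stronger one, the reported ranking can flip on flawed items alone.
overlap = model_a[1] > model_b[0]
print(f"model A reported range: {model_a[0]:.3f} to {model_a[1]:.3f}")
print(f"model B reported range: {model_b[0]:.3f} to {model_b[1]:.3f}")
print(f"ranges overlap: {overlap}")
```

With these illustrative inputs the two ranges overlap, so a model that is two points better on the sound items could still be reported as the worse of the pair.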


THREE: THE GAP BETWEEN TEST ITEMS AND OPEN-WORLD TASKS

A benchmark item is a closed question with a defined correct answer that can be scored automatically. An open-world task, by contrast, has ambiguous goals, distributed context, multiple stakeholders, long horizons, and success criteria that stabilise only in hindsight. Debugging a messy legacy codebase, reviewing a scientific paper, negotiating a contract, or maintaining a research programme across six months does not resemble answering benchmark items except at the most superficial level.

There is no reason, in principle, why performance on closed benchmarks should transfer to open-world tasks at the same rate. There is mounting evidence that it does not. Adversarial-encoding work from 2025 shows that small perturbations to benchmark question format can significantly reduce frontier-model scores. That suggests performance on the original items depends on features of the test format as much as on the underlying reasoning. This is not a refutation of capability. It is a reminder that the capability being measured is narrower than the benchmark label implies.
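
A format-sensitivity check of this general kind is simple to sketch. The example below is not the procedure of the cited paper: it uses one generic perturbation, shuffling the answer options, and every name in it (`Item`, `shuffle_options`, `format_gap`, and the two toy "models") is hypothetical. It shows only how a harness might compare accuracy on original and perturbed items.

```python
import random
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Item:
    question: str
    options: tuple[str, ...]
    answer: int  # index of the keyed option


def shuffle_options(item: Item, rng: random.Random) -> Item:
    """Return the same item with its options reordered and the key remapped."""
    order = list(range(len(item.options)))
    rng.shuffle(order)
    options = tuple(item.options[i] for i in order)
    return Item(item.question, options, order.index(item.answer))


def accuracy(model: Callable[[Item], int], items: Sequence[Item]) -> float:
    return sum(model(it) == it.answer for it in items) / len(items)


def format_gap(model: Callable[[Item], int], items: Sequence[Item], seed: int = 0) -> float:
    """Accuracy on the original items minus accuracy on the shuffled items."""
    rng = random.Random(seed)
    perturbed = [shuffle_options(it, rng) for it in items]
    return accuracy(model, items) - accuracy(model, perturbed)


# Toy data: every keyed answer sits in the first position.
ITEMS = [
    Item("2 + 2?", ("4", "3", "5", "22"), 0),
    Item("Capital of France?", ("Paris", "Rome", "Madrid", "Berlin"), 0),
    Item("H2O is?", ("water", "salt", "ozone", "methane"), 0),
]
KEY = {it.question: it.options[it.answer] for it in ITEMS}


def knows_answer(item: Item) -> int:
    """Toy model that responds to content: finds the right text wherever it is."""
    return item.options.index(KEY[item.question])


def first_option(item: Item) -> int:
    """Toy model that responds to format: always picks the first position."""
    return 0


print("content-based gap:", format_gap(knows_answer, ITEMS))
print("position-based gap:", format_gap(first_option, ITEMS))
```

Both toy models score perfectly on the original items. Only the content-based one is guaranteed to keep that score after shuffling, and a non-zero gap is the signal that part of the original score came from the format.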


WHAT FOLLOWS

None of this amounts to the claim that AI progress is fake, or that current systems are not genuinely more capable than their predecessors. On many tasks, they clearly are. The argument is narrower and more precise: the specific artefact most often used to characterise the pace of progress, the rising line on a benchmark chart, is a weak instrument for the claim it is asked to carry. It over-reads in the middle of the curve because of format effects. It over-reads at the top because of saturation. And it almost always over-reads in the step from “did well on items of this form” to “can perform a task of this kind.”

The honest response is not scepticism of the technology. It is scepticism of one specific reading instrument. A benchmark chart is a useful summary of progress on a particular set of closed tasks under particular conditions. It is not a capability chart. Treating it as one, in investment memos, policy briefings, and labour-market commentary, is the category error from which many serious errors in the current AI discussion descend.


PRIMARY SOURCES

  • Stanford HAI. The 2025 AI Index Report: Technical Performance.
  • Gema AP, Leang JOJ, Hong G, et al. “Are We Done with MMLU?” arXiv:2406.04127, 2024.
  • Ivanov I, Volkov D. “Resurrecting saturated LLM benchmarks with adversarial encoding.” arXiv:2502.06738, 2025.
