The word “reasoning” took on a specific technical usage in AI discourse in late 2024, when OpenAI released its o1 model and described it as capable of reasoning through complex problems by generating extended chains of thought before producing a response. The description was adopted rapidly, by researchers, by journalists, and by the models’ developers in their own documentation, and with it came a set of implications that were rarely made explicit.
The argument that follows is narrow and precise. It is not that current AI systems are unintelligent, or incapable, or that their outputs lack value. It is that the word “reasoning” is being used in a way that conflates two things that are genuinely distinct, and that the conflation has material consequences.
WHAT THE MODELS ACTUALLY DO
Large language models trained with reinforcement learning to produce chains of reasoning-like text demonstrably generate longer, more structured, and in many benchmarked cases more accurate outputs on well-defined problem classes. On ARC-AGI, a benchmark designed by François Chollet specifically to require genuine abstraction rather than pattern completion from training data, frontier models have shown substantial recent improvement. These are real improvements. They should be reported accurately.
What they are not, in any philosophically defensible sense, is reasoning. The models produce token sequences that, statistically, have the form of reasoning — premise, inference step, conclusion. Whether anything occurring between input and output corresponds to a cognitive process that warrants the term is a question that cannot be answered by pointing at the output.
THE CHOLLET PROBLEM
This is not a new objection. Chollet, whose ARC-AGI benchmark has been influential precisely because it was designed to resist memorisation, has made the distinction clearly: a system that retrieves and recombines patterns from its training distribution is doing something categorically different from a system that encounters a genuinely novel problem and forms an abstract representation of it.
The distinction between retrieving a pattern and forming an abstraction is not a difference of degree. It is a difference of kind. Extended inference chains do not bridge that gap by being longer.
The new reasoning models are better at ARC-AGI. They are still, as Chollet and others have noted, exploiting a mechanism different from the one the benchmark was designed to isolate. The improvements appear to come from extended search over possible response structures — a process more analogous to exhaustive tree search than to the flexible abstraction humans deploy when encountering a genuinely novel problem.
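To make the search-versus-abstraction distinction concrete, here is a deliberately crude sketch, in Python, of what search without abstraction looks like on an ARC-style puzzle: enumerate a fixed, hand-written library of candidate grid transformations, keep whichever one is consistent with the demonstration pairs, and apply it to the test input. This is an illustration of the mechanism class, not a description of how any frontier model is actually implemented; every function name here is invented for the example. The point is that the “abstraction” is supplied by the programmer in advance rather than formed by the system.

```python
# Toy illustration: solving an ARC-style puzzle by brute-force search over a
# fixed library of grid transformations. The system never forms an abstraction
# of its own; every candidate "rule" was written in beforehand.

def rotate90(grid):
    return [list(row) for row in zip(*grid[::-1])]

def flip_horizontal(grid):
    return [row[::-1] for row in grid]

def flip_vertical(grid):
    return grid[::-1]

def transpose(grid):
    return [list(row) for row in zip(*grid)]

CANDIDATES = [rotate90, flip_horizontal, flip_vertical, transpose]

def solve_by_search(train_pairs, test_input):
    """Return the output of the first candidate transformation consistent with
    every demonstrated (input, output) pair, or None if nothing in the library fits."""
    for candidate in CANDIDATES:
        if all(candidate(x) == y for x, y in train_pairs):
            return candidate(test_input)
    return None  # the search has nothing to say outside its fixed library

if __name__ == "__main__":
    # Both demonstration pairs happen to be horizontal flips.
    train = [
        ([[1, 0], [0, 0]], [[0, 1], [0, 0]]),
        ([[2, 3], [0, 1]], [[3, 2], [1, 0]]),
    ]
    print(solve_by_search(train, [[5, 0], [0, 7]]))  # [[0, 5], [7, 0]]
```

A system like this can look impressively general as the candidate library grows, which is precisely why output quality alone cannot settle whether abstraction is happening.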
A separate concern is raised by recent work on chain-of-thought faithfulness: the visible “reasoning” tokens produced by these models may not accurately represent the computational process generating the final answer. The chain of thought may be post-hoc rationalisation rather than causal explanation — a plausible-looking narrative constructed alongside, rather than producing, the output.
WHY THE CONFLATION MATTERS
When capability demonstrations are described as “reasoning,” several things follow in the broader discourse. Research resources flow toward extended-inference architectures on the assumption that more steps of reasoning-like processing will produce more intelligence-like results. Investment narratives are built on the premise that systems capable of “reasoning” are qualitatively different from — and on a path toward — systems capable of genuine open-ended problem-solving.
Policy and regulatory attention is directed at systems whose capabilities are described in these terms. The EU AI Act’s obligations for general-purpose models with systemic risk are partly calibrated to capability thresholds. If those capabilities are described in language that overstates what the systems are doing, the regulatory response is calibrated to a threat model that does not match the actual technology, and the mismatch likely runs in both directions: overreaction to systems whose “reasoning” label overstates what they can do, and underreaction to real capabilities that are never framed in those terms.
THE PRECISE CLAIM
Extended-inference systems produce better outputs on structured tasks because they search over more possible completion paths and select for internally consistent ones. This is a genuine capability improvement with practical value. It is not reasoning in the sense that requires representation, abstraction, and genuine novelty — and there is no evidence that it is on a trajectory toward those properties simply by increasing scale or chain length.
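One widely studied external analogue of this search-plus-selection pattern is self-consistency sampling: draw several candidate completion paths and keep the answer that the most paths agree on. The sketch below uses a hypothetical sample_completion() function as a stand-in for a model call, with a toy arithmetic task in place of a real problem; it illustrates the selection mechanism only and makes no claim about how any vendor’s extended-inference pipeline is built.

```python
from collections import Counter
import random

# Hypothetical stand-in for a model call that returns one "completion path"
# ending in an answer. In a real system this would be a sampled chain of
# thought from an LLM; here it is a noisy toy process.
def sample_completion(problem, rng):
    true_answer = sum(problem)  # the toy task: add a list of numbers
    noisy = true_answer + rng.choice([0, 0, 0, 1, -1])  # occasional slips
    return {"path": f"added {problem} step by step", "answer": noisy}

def self_consistent_answer(problem, n_samples=25, seed=0):
    """Sample n completion paths and return the most frequent final answer.
    Selection rewards agreement across paths, not the correctness of any one path."""
    rng = random.Random(seed)
    answers = [sample_completion(problem, rng)["answer"] for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]

if __name__ == "__main__":
    print(self_consistent_answer([3, 7, 12]))  # typically 22: most sampled paths agree
```

Note what the selection criterion is: internal agreement among sampled paths. Nothing in the procedure requires, or produces, an abstract representation of the problem, which is exactly the gap the essay is pointing at.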
Calling it reasoning is not a lie. It is a category error. The difference between those two charges is the difference between dishonesty and imprecision. Both have consequences. The imprecision, here, may be the more consequential of the two.
SOURCES
1. ARC Prize Foundation. ARC-AGI-2 benchmark and leaderboard. arcprize.org
2. Chollet F. “On the Measure of Intelligence.” arXiv:1911.01547, November 2019. arxiv.org/abs/1911.01547
3. Turpin M et al. “Language Models Don’t Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting.” arXiv:2305.04388, 2023. arxiv.org/abs/2305.04388





