Do Reasoning Models Really Reason?

A recent study by Apple offers a new perspective on what we mean when we talk about artificial intelligence “thinking.” According to the paper, today’s so-called reasoning AI models may give the appearance of high-level cognitive abilities, but they are still in their infancy when it comes to genuine logical reasoning.

For everyday users, large language models like ChatGPT or Gemini long represented the pinnacle of artificial intelligence. In late 2024 and early 2025, however, a new generation of so-called “reasoning” models began to appear, such as OpenAI’s o1 and o3 or Anthropic’s Claude Opus and Sonnet, designed specifically to enhance reasoning and inference capabilities. Developers and the media often presented these models as “capable of reasoning,” “able to think through complex problems,” or even “capable of solving real reasoning tasks.”

However, the Apple study points out that although these systems do perform better on certain logic-based tasks, the way they work is still far from what we would call reasoning in the everyday sense. Their responses are based on textual pattern matching rather than on recognizing genuine logical relationships, and there is no internal structure behind them that would allow for consistent, algorithmic problem-solving.

In their study, Apple’s researchers tested the capabilities of reasoning models on a variety of logical puzzles. The paper makes it clear that these so-called Large Reasoning Models (LRMs) have much deeper limitations than previously anticipated.

The models in the study were tested on several types of controlled logical tasks. These included, for example, the Towers of Hanoi problem, where disks must be moved between three rods according to a few simple rules, as well as other complex, multi-step challenges that require consistent, chained reasoning. The most important findings emerged from the Towers of Hanoi. In this task, a tower made of disks of different sizes must be moved from one rod to another. The rules are simple: only one disk can be moved at a time, and a disk may never be placed on top of a smaller one. The number of moves required to solve the task increases exponentially with the number of disks. Nevertheless, the algorithm (or “line of reasoning”) that produces the solution is the same simple procedure no matter how large the tower is. This means that a model capable of generalization and reasoning should have no trouble solving larger instances of the same problem.
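
To make this concrete, the whole procedure fits in a few lines of code. The sketch below, in Python and purely for illustration, is the classic recursive algorithm: the same short function solves the puzzle for any number of disks, which is exactly the kind of uniform, scalable procedure a genuinely reasoning system would be expected to carry out.

    def hanoi(n, source, target, auxiliary, moves):
        """Append the moves that transfer n disks from source to target."""
        if n == 0:
            return
        # Move the top n-1 disks out of the way, onto the auxiliary rod.
        hanoi(n - 1, source, auxiliary, target, moves)
        # Move the largest remaining disk directly to the target rod.
        moves.append((source, target))
        # Move the n-1 smaller disks from the auxiliary rod onto the target rod.
        hanoi(n - 1, auxiliary, target, source, moves)

    moves = []
    hanoi(4, "A", "C", "B", moves)
    print(len(moves))   # 15 moves for 4 disks, i.e. 2**4 - 1
    print(moves[:3])    # [('A', 'B'), ('A', 'C'), ('B', 'C')]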

However, beyond a certain level of complexity, the models being tested simply broke down. They began to guess, avoided giving an answer, or produced long and seemingly coherent texts without solving the task. This is partly related to the fact that the more disks the problem involves, the longer the sequence of steps required to solve it. These longer answers often approach the model’s token limit, which is the maximum length of text a model can handle in a single response. According to the researchers, however, the real issue is not a technical limitation like token size, but rather the absence of an internal logical structure within the systems themselves.
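
A quick calculation shows how fast the required answers grow: the optimal solution for n disks takes 2^n - 1 moves, so every extra disk roughly doubles the length of the move list a model has to write out.

    # Optimal number of moves in the Towers of Hanoi: 2**n - 1
    for n in (5, 10, 15, 20):
        print(n, 2**n - 1)
    # 5 31
    # 10 1023
    # 15 32767
    # 20 1048575

Spelled out move by move, a 20-disk solution runs to over a million steps, so the longest answers do brush up against output-length limits; the researchers’ point, as noted above, is that the real bottleneck is not this technical limit but the missing internal logical structure.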

The study also included other puzzles, such as the Checkers Jumping puzzle, where red and blue disks must be moved into each other’s positions, and the River Crossing problem, where different groups of people and their companions must be transported across a river while following certain rules. The pattern was similar in these tasks as well: after a certain point, the models failed to produce responses that made sense within the context of the problem.

Providing the exact solution algorithm did not help either. Even when the researchers described, step by step, how to solve the Towers of Hanoi, the models were still unable to execute the steps consistently. All of this demonstrates that the problem is not a lack of knowledge, but rather the inability of these systems to model long, coherent chains of reasoning with precision.

The results also revealed that the reasoning behavior of AI systems can be divided into three main segments based on task complexity.

  • Simple tasks are often solved more successfully by traditional, so-called “non-reasoning” models, since these do not require complex inference, only quick recognition of the correct answer.
  • For tasks of medium difficulty, reasoning models perform better because they can follow longer chains of logic and explore multiple possible solution paths along the way.
  • However, when the task becomes too complex, both types of models become ineffective. According to the researchers, in such cases, reasoning models do not attempt to carry out additional steps, meaning they do not allocate more computational effort to solving the problem, even when, technically, they could. This is especially notable because the main advantage of reasoning models over simpler ones is supposed to be their ability to generate longer thought processes in which multiple solution paths can be tested.

All this clearly illustrates the fundamental limitations of current systems when it comes to problem-solving capabilities.

It also raises an important question: can we truly call what these models do “thinking”? Human thought is not merely a matter of pattern recognition. We can build internal logical structures, draw conclusions, and apply them to new situations. In contrast, AI resembles an extremely skilled imitator: it reproduces patterns with precision and formulates responses convincingly, but ultimately does not understand what it is saying.

However, Apple’s researchers did not stop at identifying the problems. They also outlined a new direction that does not focus on building even larger models with ever more billions of parameters. Instead, they proposed the possibility of hybrid AI architectures, where smaller language models interpret user instructions and then pass them on to a semantic memory. This memory would function essentially as a structured knowledge base that stores relationships, rules, and facts, and would be capable of drawing inferences based on them.
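
The study described here does not spell out an implementation, but the proposed division of labor can be sketched in a few lines of toy Python. Everything below is an illustrative assumption rather than anything taken from the paper: the fact and rule formats are invented, and small_lm_parse is only a stand-in for the small language model that would interpret the user’s instruction. The point is the structure: language handling on one side, and an explicit, inspectable store of facts and rules that draws inferences on the other.

    # Toy "semantic memory": facts and if-then rules over (subject, relation, object) triples.
    FACTS = {("socrates", "is_a", "human")}
    RULES = [
        # If X is_a human, then X is_a mortal.
        ((None, "is_a", "human"), ("is_a", "mortal")),
    ]

    def infer(facts, rules):
        """Forward chaining: apply the rules until no new facts can be derived."""
        derived = set(facts)
        changed = True
        while changed:
            changed = False
            for (_, rel, obj), (new_rel, new_obj) in rules:
                for subj, r, o in list(derived):
                    if r == rel and o == obj and (subj, new_rel, new_obj) not in derived:
                        derived.add((subj, new_rel, new_obj))
                        changed = True
        return derived

    def small_lm_parse(question):
        """Stand-in for the small language model: maps a narrow question pattern to a query triple."""
        # e.g. "Is socrates mortal?" -> ("socrates", "is_a", "mortal")
        words = question.rstrip("?").lower().split()
        return (words[1], "is_a", words[2])

    def answer(question):
        query = small_lm_parse(question)
        return "yes" if query in infer(FACTS, RULES) else "unknown"

    print(answer("Is socrates mortal?"))  # yes

In this toy setup the answer comes from an explicit, traceable inference step rather than from next-token prediction, which is exactly the contrast with how today’s large language models operate.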

This approach would represent a radical departure from how today’s large language models operate, since those systems fundamentally rely on mapping and reproducing data-driven patterns. According to Apple, however, the power of future artificial intelligence will lie not only in its familiarity with linguistic patterns, but also in its ability to handle structured knowledge and draw logical inferences.

This is not just a technological issue, but also a philosophical one: what do we mean by thinking, and what do we expect from a machine? It may turn out that the “real” artificial intelligence is not the one that formulates the most elegant responses, but the one that reasons most consistently and is truly capable of integrating memory, logic, and knowledge into its answers.

This distinction will define whether AI becomes a truly useful companion in our daily lives or remains a dazzling yet limited tool, capable of handling many tasks but unreliable when it comes to the truly difficult ones.

The next time someone claims that AI knows everything, consider asking three simple questions:

  • Can it (really) reason?
  • Can it (really) remember?
  • And most importantly: Does it truly understand what we’re saying?

István ÜVEGES, PhD, is a Computational Linguist, researcher, and developer at MONTANA Knowledge Management Ltd. and a researcher at the HUN-REN Centre for Social Sciences. His main interests include the social impacts of Artificial Intelligence (Machine Learning), the nature of Legal Language (legalese), the Plain Language Movement, and sentiment and emotion analysis.