Apple’s quiet AI lab reveals how large models fake thinking

Apple researchers have found that reasoning AIs fail to scale as puzzles grow harder. The study shows the models spend fewer reasoning tokens precisely when problems demand more thinking.

by Aytun Çelebi
June 11, 2025
in Research

The latest generation of AI models, often called large reasoning models (LRMs), has dazzled the world with its ability to “think.” Before giving an answer, these models produce long, detailed chains of thought, seemingly reasoning their way through complex problems. This has led many to believe we are on the cusp of true artificial general intelligence.

But are these models really thinking? A new, insightful paper from researchers at Apple, titled “The Illusion of Thinking,” puts this capability under a microscope and comes to some startling conclusions. By moving away from standard math tests—which are often “contaminated” with answers the AI has already seen during training—and into a controlled lab of complex puzzles, the researchers uncovered fundamental limits to AI reasoning.

Today’s most advanced AI isn’t so much a brilliant thinker as it is an incredibly sophisticated pattern-matcher that quickly hits a wall when faced with truly new challenges.

The three regimes of AI reasoning

The researchers tested pairs of AI models—one “thinking” LRM and its standard “non-thinking” counterpart—on a series of puzzles like the Tower of Hanoi and River Crossing. By precisely increasing the difficulty, they discovered three distinct performance regimes:

  1. Low complexity: Surprisingly, on simple problems, the standard, non-thinking models actually outperformed the reasoning models. The LRMs were less accurate and wasted a lot of computational effort “overthinking” problems they should have solved easily.

  2. Medium complexity: This is where LRMs shine. When problems become moderately complex, the ability to generate a thinking process gives them a clear advantage over standard models.

  3. High complexity: When the puzzles become too hard, something dramatic happens: both models fail completely. While the thinking models can handle a bit more complexity before failing, they inevitably hit a wall and their performance collapses to zero.

As the paper states, these models “fail to develop generalizable problem-solving capabilities, with accuracy ultimately collapsing to zero beyond certain complexities.”
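
The appeal of puzzles over benchmark math is easy to reproduce in spirit. The sketch below (Python, not Apple's actual evaluation harness) shows why Tower of Hanoi makes such a convenient testbed: difficulty is a single knob, the number of disks, and a candidate solution can be checked exactly against the rules rather than against answers a model may have memorized.

```python
# Minimal sketch, not the paper's code: Tower of Hanoi as a controllable-
# complexity puzzle. The optimal solution for n disks has 2**n - 1 moves,
# and any proposed move list can be verified mechanically.

def hanoi_solution(n: int, src: str = "A", aux: str = "B", dst: str = "C") -> list[tuple[str, str]]:
    """Return the optimal move sequence (from_peg, to_peg) for n disks."""
    if n == 0:
        return []
    return (hanoi_solution(n - 1, src, dst, aux)
            + [(src, dst)]
            + hanoi_solution(n - 1, aux, src, dst))

def is_valid_solution(n: int, moves: list[tuple[str, str]]) -> bool:
    """Simulate the moves and check both the rules and the final state."""
    pegs = {"A": list(range(n, 0, -1)), "B": [], "C": []}  # peg A holds disks n..1
    for src, dst in moves:
        if not pegs[src]:
            return False                       # moving from an empty peg
        disk = pegs[src].pop()
        if pegs[dst] and pegs[dst][-1] < disk:
            return False                       # larger disk placed on a smaller one
        pegs[dst].append(disk)
    return pegs["C"] == list(range(n, 0, -1))  # all disks end on the target peg

# Difficulty sweep: a model's proposed move list for each n would be checked
# the same way; here we only sanity-check the reference solution.
for n in range(1, 11):
    moves = hanoi_solution(n)
    assert len(moves) == 2**n - 1 and is_valid_solution(n, moves)
    print(f"n={n:2d}: optimal solution has {len(moves)} moves")
```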

Perhaps the most fascinating discovery is how the reasoning models fail. You would expect that as a problem gets harder, the AI would “think” more, using more of its computational budget. And it does—but only up to a point.

The research reveals a counterintuitive scaling limit. When a problem approaches the “collapse” point, the LRM starts to reduce its reasoning effort, spending fewer tokens on thinking despite the increasing difficulty. It’s as if the model recognizes the task as too hard and simply gives up before it even starts, even with an adequate budget to keep trying. This suggests a fundamental limitation in their ability to scale their reasoning effort with a problem’s difficulty.
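
For readers who want to probe the same effect, a minimal sketch of the measurement follows. The `ask_reasoning_model` helper is a hypothetical placeholder rather than a real API; any client that reports how many "thinking" tokens a model spent on a prompt would do.

```python
# Sketch of the token-scaling measurement described above.
# `ask_reasoning_model` is a placeholder, not a real library call.

def ask_reasoning_model(prompt: str) -> tuple[str, int]:
    """Placeholder: return (answer_text, reasoning_tokens_used)."""
    raise NotImplementedError("wire up your own model client here")

def token_scaling_curve(make_prompt, complexities):
    """Record reasoning-token spend at each difficulty level.

    The paper's counterintuitive result is that this curve rises with
    difficulty and then falls as the problem nears the collapse point,
    even though token budget remains available.
    """
    curve = []
    for c in complexities:
        _, thinking_tokens = ask_reasoning_model(make_prompt(c))
        curve.append((c, thinking_tokens))
    return curve
```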

Image credit: Apple

Failure to follow a recipe

What if you made it even easier for the AI? What if you gave it the exact, step-by-step algorithm to solve the puzzle? Surely, a true reasoning machine could just follow the instructions.

Strikingly, the researchers found this wasn’t the case.

“Even when we provide the algorithm in the prompt—so that the model only needs to execute the prescribed steps—performance does not improve, and the observed collapse still occurs at roughly the same point.”

This is the most damning evidence against the idea that these models “reason” in a human-like way. Their inability to execute a simple, explicit set of logical rules shows that their success relies more on recognizing familiar patterns than on genuine, symbolic manipulation. The model’s inconsistent performance across different puzzle types further supports this, suggesting its ability is tied to the examples it has memorized from the web, not a general problem-solving skill.
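
To make the experiment concrete, here is one way such an "algorithm in the prompt" could look. The wording is an illustrative paraphrase under the assumptions above, not the paper's exact prompt.

```python
# Illustrative only: handing the model the full recursive algorithm so that
# it only needs to execute the prescribed steps, as in the finding quoted above.

ALGORITHM_PROMPT = """You must solve Tower of Hanoi with {n} disks on pegs A, B, C.
Follow this algorithm exactly; do not invent your own strategy:

  solve(k, source, spare, target):
      if k == 0: return
      solve(k - 1, source, target, spare)
      move the top disk from source to target
      solve(k - 1, spare, source, target)

Execute solve({n}, A, B, C) and output the resulting list of moves,
one "X -> Y" per line."""

def algorithm_execution_prompt(n: int) -> str:
    """Build the prompt for an n-disk instance."""
    return ALGORITHM_PROMPT.format(n=n)
```

Even with the recipe spelled out like this, the paper reports that accuracy still collapses at roughly the same complexity.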


Tags: AI, Apple
