Apple Researcher Claims Illusion of AI Thinking Versus OpenAI Solving Ten Disk Puzzle

Apple’s research paper, “The Illusion of Thinking,” examines the reasoning abilities of artificial intelligence models. It claims that the apparent problem-solving skills of large language models (LLMs) are misleading. The study argues that these models do not truly reason but instead rely heavily on pattern matching, creating an “illusion of thinking.”

To support this claim, the paper uses the Tower of Hanoi puzzle—a classic problem requiring the movement of disks between pegs under specific rules—as a key benchmark.

The complexity of this puzzle increases with the number of disks, making it an effective test of planning and reasoning. Apple’s researchers found that AI models, including OpenAI’s O3, exhibited a “complete accuracy collapse” when tackling the puzzle with ten disks, a relatively complex version. This failure suggested that these models struggled with tasks requiring deeper reasoning, even when provided with the solution algorithm, reinforcing the paper’s conclusion that their capabilities were limited.

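The puzzle begins with the disks stacked on one rod in order of decreasing size, with the smallest on top. The objective of the Tower of Hanoi puzzle is to move the entire stack to one of the other rods, obeying the following rules: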

Only one disk may be moved at a time.
Each move consists of taking the upper disk from one of the stacks and placing it on top of another stack or on an empty rod.
No disk may be placed on top of a disk that is smaller than it.
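
The three rules above are mechanical enough to check in code. The following is a minimal sketch, not the evaluation harness used in the Apple paper: the function name `is_valid_solution`, the peg numbering (0, 1, 2) and the representation of a move as a `(from_peg, to_peg)` pair are all assumptions made for illustration.

```python
def is_valid_solution(n, moves, start=0, target=2):
    """Replay `moves` on a three-peg puzzle with n disks and check every rule.

    Hypothetical helper for illustration: disks are numbered 1 (smallest)
    to n (largest), and each move is a (from_peg, to_peg) pair.
    """
    pegs = [[], [], []]
    pegs[start] = list(range(n, 0, -1))  # largest disk at the bottom

    for src, dst in moves:
        if src == dst or not pegs[src]:
            return False  # a move must take a disk to a different peg
        disk = pegs[src][-1]  # only the top disk of a stack may be moved
        if pegs[dst] and pegs[dst][-1] < disk:
            return False  # a disk may not sit on a smaller disk
        pegs[dst].append(pegs[src].pop())

    # Solved only if the whole stack ends up, in order, on the target peg.
    return pegs[target] == list(range(n, 0, -1))


print(is_valid_solution(2, [(0, 1), (0, 2), (1, 2)]))  # True
print(is_valid_solution(2, [(0, 2), (0, 2)]))          # False: big disk on small
```

A checker of this kind is one way to score a model's proposed move list without trusting the model's own claim that it solved the puzzle.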

With three disks, the puzzle can be solved in seven moves. The minimum number of moves required to solve a Tower of Hanoi puzzle is 2^n − 1, where n is the number of disks.
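
The move count follows from the standard recursive solution: move the top n − 1 disks to the spare peg, move the largest disk to the target, then move the n − 1 disks on top of it. The sketch below is the textbook recursion, not necessarily the algorithm given to the models in the Apple study; the function name `hanoi` and the peg numbering are assumptions for illustration.

```python
def hanoi(n, src=0, dst=2, via=1):
    """Return the list of (from_peg, to_peg) moves that transfers n disks."""
    if n == 0:
        return []
    return (
        hanoi(n - 1, src, via, dst)    # clear the n - 1 smaller disks out of the way
        + [(src, dst)]                 # move the largest disk to the target
        + hanoi(n - 1, via, dst, src)  # stack the smaller disks back on top
    )


for n in (3, 10):
    moves = hanoi(n)
    assert len(moves) == 2**n - 1
    print(n, len(moves))
# prints:
# 3 7
# 10 1023
```

For the ten-disk case discussed in this article, a correct minimal answer is therefore a sequence of 1,023 moves.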

At a rate of one move per second, the minimum amount of time it would take to complete a sixty-four-disk puzzle would be 2^64 − 1 seconds, or about 585 billion years, roughly 42 times the estimated current age of the universe.
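
The arithmetic behind those figures can be reproduced in a few lines. This is a rough check that assumes a year of 365.25 days and an age of the universe of about 13.8 billion years; both constants are assumptions added here, not values stated in the article.

```python
SECONDS_PER_YEAR = 365.25 * 24 * 3600  # assumed year length: 31,557,600 s
AGE_OF_UNIVERSE_YEARS = 13.8e9         # assumed estimate

moves = 2**64 - 1                      # one move per second
years = moves / SECONDS_PER_YEAR

print(f"{years / 1e9:.0f} billion years")
print(f"{years / AGE_OF_UNIVERSE_YEARS:.0f} times the age of the universe")
# prints:
# 585 billion years
# 42 times the age of the universe
```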

However, a significant development challenges these findings: OpenAI’s O3 has reportedly solved the ten-disk Tower of Hanoi in a single attempt, as noted in posts on X. This achievement, often referred to as “one-shotting,” implies that O3 can address the puzzle without multiple tries or task-specific training, directly contradicting the paper’s assertion of its limitations.

O3’s ability to solve the ten-disk puzzle suggests that OpenAI may have enhanced the model’s architecture, training data, or algorithms, enabling it to handle complex tasks more effectively than when the Apple study was conducted.

Questioning the Benchmark: The Tower of Hanoi may not fully capture the reasoning abilities of modern AI, or the way it was implemented in the study might have been flawed, limiting its validity as a measure of capability.

Outdated Conclusions: Given that the paper’s findings were based on earlier performance, O3’s success indicates that rapid progress in AI development may have outpaced the study’s observations.

Brian Wang is a Futurist Thought Leader and a popular Science blogger with 1 million readers per month. His blog Nextbigfuture.com is ranked #1 Science News Blog. It covers many disruptive technologies and trends including Space, Robotics, Artificial Intelligence, Medicine, Anti-aging Biotechnology, and Nanotechnology.
