
AI News

01 Oct 2025

17 min read

ChatGPT mathematical reasoning study reveals how AI learns

ChatGPT mathematical reasoning study shows teachers how to use prompts to get AI to work through geometry problems.

Researchers gave ChatGPT a 2,400-year-old geometry challenge and watched how it reasoned. The ChatGPT mathematical reasoning study found the bot tried ideas, made a confident mistake, then adapted like a student. This behavior gives teachers clues for using AI, and it shows why clear prompts and proof-checking matter more than quick answers.

Plato once told a story about Socrates and a student. Socrates asked the student to double the area of a square. The student doubled the sides instead. That was wrong. The fix is simple once you see it: the side of the new square must match the diagonal of the old square.

This ancient scene sets the stage for a modern test. A team from the University of Cambridge and the Hebrew University of Jerusalem used that old puzzle to watch how ChatGPT works through math. They wanted to see if a language model could handle a visual idea without pictures. Large language models train mostly on text, not on drawings. So the team picked a problem with a non-obvious step that many students miss. What happened next says a lot about AI, learning, and how we should teach with these tools.

The ancient puzzle in simple words

The classic challenge goes like this. You have a square with side length s. Its area is s × s, or s^2. To double the area, you want 2 × s^2. The square that has this area will have a side of s × √2. The diagonal of the original square is also s × √2. So if you use the old diagonal as the side of a new square, you get exactly twice the area. It is a neat trick. It shows how geometry can turn a hard idea into a simple move. Socrates used this to show how a learner might find truth step by step. The student starts wrong, but with guidance, they reach the answer. That slow climb is part of learning. Today, it also helps us see how AI behaves.
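
To make the arithmetic concrete, here is a tiny numeric check, a Python sketch of our own rather than anything from the study, confirming that a square built on the old diagonal has twice the area:

  import math

  s = 3.0                        # side of the original square; any positive value works
  new_side = s * math.sqrt(2)    # the old diagonal becomes the new side
  print(math.isclose(new_side ** 2, 2 * s ** 2))   # True: the new square has double the area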

What the ChatGPT mathematical reasoning study tested

The researchers did not stop at Plato’s square. They wanted to test generative behavior. They looked for signs that ChatGPT could build on a prior idea, try something new, and even make a wrong turn as it explored.

Why pick Plato’s square?

The diagonal trick is not obvious in text. It is visual. Also, many people remember a related story from classes or books. But the exact reasoning steps may not be common in training data. If the model still reached the idea, it might be combining patterns in a fresh way.

Why a rectangle next?

After the square, the team asked a follow-up: can you double the area of a rectangle using similar reasoning? Here, the main idea is simple. To double area, you can scale both sides by √2. If a rectangle has sides a and b, you can make a new rectangle with sides a × √2 and b × √2. Area doubles, and the shape stays the same. Classical geometry can construct √2 with a straightedge and compass. So a geometric path exists. The team posed this as a second test. Could the model extend the first idea and find a valid way to double a rectangle?
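
The same kind of quick check, again just an illustrative Python sketch with made-up side lengths, confirms that scaling both sides by √2 doubles the area while preserving the shape:

  import math

  a, b = 3.0, 5.0                                # sides of the original rectangle
  a2, b2 = a * math.sqrt(2), b * math.sqrt(2)    # scale both sides by the square root of 2
  print(math.isclose(a2 * b2, 2 * a * b))        # True: the area doubles
  print(math.isclose(a2 / b2, a / b))            # True: the proportions stay the same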

What ChatGPT did — and why it surprised scientists

The model did not just recite facts. It tried to reason in context. It pulled ideas from the square problem and used them in the rectangle case. But it also made a confident mistake.

The square: finding the diagonal trick

In the square case, the path is well-known. Many math students learn that the new side must equal the old diagonal. The model could describe or hint at this move because it often appears in math texts. That part was not shocking.

The rectangle: a wrong claim

The surprise came with the rectangle. The chatbot said there was no geometric solution based on the diagonal. That is false. A geometric construction to double a rectangle exists. You do not need the rectangle’s diagonal to be the new side. You need to scale both sides by √2, which you can construct. The researchers called the claim “vanishingly unlikely” to exist in the training set. In other words, the model seemed to make up a response by blending the prior square talk with the new rectangle prompt. To the team, this looked “learner-like.” The bot drew on a past idea, tested it, and reached a wrong but reasoned answer. That is what many students do when they face a new problem. They try a nearby method first. Then they adjust. This behavior is what made the result interesting.

Does this mean AI can “think”?

The study does not say ChatGPT thinks like a person. It does not claim the model understands geometry. It shows something different but useful: the model can imitate a learning process when prompts nudge it in that direction.

Zone of proximal development, simplified

Educators use the phrase “zone of proximal development” to describe the sweet spot for learning. It is the gap between what a student can do alone and what they can do with a hint or a little help. The team suggests the model may act as if it is in that zone. With the right prompt, it can move from known patterns to a new but related task. It can try a hypothesis, make an error, and then refine its approach when pushed. For teachers, this is promising. It means a clear prompt can guide the model to show steps, not just final answers. It also means you can design prompts that make thinking visible, both for the AI and the student who reads it.

The black box risk

AI still has a black box problem. We do not see the inner steps that lead to a claim. We see only the words that come out. That means we must check proofs and methods. A confident tone does not equal a correct result. The rectangle mistake is a warning: plausible language can mask a wrong move. The authors point out a key skill for modern math class. Students should learn to read, test, and fix AI-generated proofs. They should not accept them as final. That habit matches good math practice anyway. It also builds digital literacy.

How teachers and students can use these lessons now

Good prompts and good checks help you get the best from AI. They also help you avoid slick but wrong answers. Here are practical steps you can use today.

Prompt for process, not just results

If you ask for an answer, the model may jump to the end. If you ask for steps, it may reveal its path. When you want the model to explore, be explicit.
  • Say: “Walk through your reasoning. Show each step and why you chose it.”
  • Say: “List two possible approaches. Compare pros and cons before doing one.”
  • Say: “If you make an assumption, name it and explain how you will test it.”
  • Avoid: “Just give me the final formula.”
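
One low-tech way to keep these phrasings handy is a small template. The helper below is a hypothetical sketch (the function name and wording are ours, not the study's) that turns any math question into a process-first prompt:

  def process_prompt(question: str) -> str:
      """Wrap a math question in instructions that ask for steps, options, and assumptions."""
      return (
          f"{question}\n\n"
          "Walk through your reasoning. Show each step and why you chose it. "
          "List two possible approaches and compare pros and cons before doing one. "
          "If you make an assumption, name it and explain how you will test it."
      )

  print(process_prompt("Double the area of a square with side length s."))

Students can paste the printed prompt into the chatbot and compare the answer with what the bare question produces.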

Build a habit of verification

Treat AI work like a draft from a bright peer, not gospel truth. Check it.
  • Plug in numbers to test any formula the model gives.
  • Draw a quick sketch to see if a geometric claim makes sense.
  • Ask for a second method and compare results.
  • Search a trusted textbook or source to confirm a theorem.
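
"Plug in numbers" can be as little as a few lines. The sketch below is an example of our own, not from the study: it confirms the √2 scaling for one rectangle and shows that a tempting shortcut, building a square on the rectangle's diagonal, misses the doubled area:

  import math

  a, b = 3.0, 5.0                                    # a non-square rectangle
  target = 2 * a * b                                 # the doubled area we want: 30.0

  scaled = (a * math.sqrt(2)) * (b * math.sqrt(2))   # valid move: scale both sides by root 2
  diagonal_square = a ** 2 + b ** 2                  # tempting move: square built on the diagonal

  print(math.isclose(scaled, target))                # True  -> the construction works
  print(math.isclose(diagonal_square, target))       # False -> 34.0, not 30.0, so reject it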

Use AI to lift, not replace, student thinking

You can use ChatGPT to spark ideas, create practice tasks, or outline a proof. But keep the student in charge of sense-making.
  • Have students critique an AI proof and mark its weak steps.
  • Ask students to rewrite a proof in their own words, with a new example.
  • Let students create counterexamples when the model overgeneralizes.

Leverage visual tools when needed

The study hints at a path forward: connect language models with geometry tools and theorem provers. In class, you can do a simple version of this.
  • Pair ChatGPT with dynamic geometry software to test constructions.
  • Use a CAS or a graphing tool to verify algebraic claims.
  • Cross-check a step with a formal proof assistant if available.
These moves make the “black box” less opaque. They also help students see where language helps and where math demands rigor.
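
As a small taste of pairing the chatbot with a math engine, the sketch below uses the SymPy library (our choice of tool, not one named in the study) to verify the √2 scaling claim symbolically, for every rectangle at once instead of one example:

  import sympy as sp

  a, b = sp.symbols("a b", positive=True)              # sides of an arbitrary rectangle
  scaled_area = (a * sp.sqrt(2)) * (b * sp.sqrt(2))    # area after scaling both sides by root 2
  print(sp.simplify(scaled_area - 2 * a * b))          # 0: the claim holds for every rectangle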

Limits, open questions, and next steps

This was one study. It looked at a classic problem and a single extension. We should not overread the results. Still, they point to valuable directions.

Test newer models and more problems

Language models change fast. New versions often get better at step-by-step work. The team suggests testing newer models across a wider set of geometry and algebra tasks. That includes problems where visual reasoning is key and tasks where proof structure matters most.

Combine tools to broaden “reasoning”

A strong next step is to link an LLM with:
  • Dynamic geometry systems to try constructions on the fly,
  • Theorem provers to check logic, and
  • Symbolic math engines to simplify expressions.
Together, these tools can support exploration and validation. The language model can plan and explain. The math engines can confirm or correct. The result is a richer digital lab for students and teachers.

Develop better measures of reasoning quality

We need better ways to score reasoning, not just final answers. Useful measures might include:
  • Clarity of steps and justification,
  • Ability to fix errors after feedback,
  • Use of definitions and theorems at the right time, and
  • Consistency across multiple approaches.
These metrics map well to classroom rubrics. They also tell us where an AI helps and where it still struggles.

Teach the skill of reading AI proofs

The authors stress this point. Students should learn how to test AI outputs. That means scaffolds in the curriculum:
  • Short “trust but verify” checklists,
  • Peer-review cycles that include an AI draft, and
  • Reflections on how a proof was improved after checks.
This is not busywork. It builds both mathematical and digital judgment.

What this means for AI and learning

The Cambridge-Hebrew University team saw a pattern many teachers know well. A learner meets a new task. They transfer a trick from a familiar case. It partly fits, partly fails. With a nudge, they adjust. The study suggests a language model can mimic that arc when the prompt invites it to do so.

This does not prove that the model “understands” geometry. It does show how a well-aimed prompt can make its inner search visible. It also shows why we should slow down and inspect steps. A slick paragraph can hide a false claim, like the “no geometric solution” statement about doubling a rectangle.

The right response to that risk is not fear. It is method: prompt for process, verify with tools, and teach students to judge arguments. This approach respects both math and AI. It uses the model’s fluency to generate plans, examples, and alternative paths. It uses human sense-making and external tools to secure truth. In that partnership, classrooms can move from answer-hunting to reasoning-building.

The story that began with Socrates and a square now helps us shape how we learn with machines. The diagonal trick still matters. So do clear prompts, honest error-checking, and steady practice. Put those together, and you turn a clever chatbot into a better study partner for real thinking. In short, the ChatGPT mathematical reasoning study is not proof of machine understanding. It is a guide for better human-AI learning: ask for steps, test the claims, and use the right tools to make reasoning robust.

(Source: https://www.livescience.com/technology/artificial-intelligence/scientists-ask-chatgpt-to-solve-a-math-problem-from-more-than-2-000-years-ago-how-it-answered-it-surprised-them)


FAQ

Q: What is Plato’s “doubling the square” problem used in the study?
A: Plato’s “doubling the square” problem asks how to form a square with twice the area of a given square. The solution is to make the new square’s side equal to the original diagonal (s × √2), which produces area 2s^2. The ChatGPT mathematical reasoning study used this classic puzzle to test how a text-trained model handles a non-obvious visual step.

Q: Why did researchers choose that ancient geometry puzzle for the experiment?
A: They picked it because the diagonal trick is a non-obvious, visually oriented step that likely does not appear directly in text-based training data, so solving it could indicate generative reasoning rather than rote recall. Because LLMs are trained mostly on text, the ChatGPT mathematical reasoning study used the problem to probe whether the model could arrive at the idea unaided.

Q: How did ChatGPT perform on the square case of the problem?
A: On the square task, ChatGPT could describe or hint at the diagonal trick, which commonly appears in math texts, so that result was not surprising. This matched expectations that the model could reproduce familiar textbook reasoning rather than demonstrating a novel visual insight.

Q: What mistake did ChatGPT make when asked about doubling a rectangle?
A: When asked about doubling a rectangle, ChatGPT incorrectly asserted there was no geometric solution based on the diagonal, which is false. Classical geometry can double a rectangle by scaling both sides by √2, a construction possible with straightedge and compass. The researchers described the false claim as “vanishingly unlikely” to be in the training data, making it a notable result in the ChatGPT mathematical reasoning study.

Q: What do researchers mean when they call the model’s behavior “learner-like”?
A: The team observed that ChatGPT seemed to transfer a familiar trick to a new case, try hypotheses, make mistakes, and then refine its answers, which resembles how students explore problems. They compared this pattern to the educational concept of the zone of proximal development, where guidance helps move learners from known patterns to new solutions.

Q: What classroom practices did the study recommend for using AI tools?
A: The study recommends prompting for process rather than just final answers, for example asking the model to “walk through your reasoning” or to list and compare multiple approaches. Teachers should build verification habits like plugging in numbers, sketching constructions, and comparing methods so AI supports student sense-making rather than replacing it, as emphasized by the ChatGPT mathematical reasoning study.

Q: What are the main limitations of this study and suggested next steps?
A: The authors caution against overinterpreting a single experiment and note that models evolve, so newer models should be tested on a wider set of geometry and algebra problems. They suggest combining LLMs with dynamic geometry systems, theorem provers, and symbolic engines, and developing better measures of reasoning quality and curricular scaffolds to teach students to read and check AI-generated proofs, as the ChatGPT mathematical reasoning study outlines.

Q: How should students verify or critique AI-generated proofs in practice?
A: Students should treat AI outputs like drafts from a peer: test formulas with sample numbers, draw quick sketches, ask for alternative methods, and consult trusted textbooks or tools to confirm claims. The ChatGPT mathematical reasoning study highlights checklists, peer review, and reflection as practical ways to help students judge and improve AI-generated proofs.
