
AI News

04 Nov 2025


Do large language models think and how to tell

Do large language models think? Use practical tests to tell when they truly improve your work.

Do large language models think? They show signs of understanding, but not consciousness. You can tell by how well they compress information, generalize to new tasks, reason across steps, and use tools to plan actions. Today they pass some tests, fail others, and improve with scale and better training.

We use A.I. every day, but the big question will not go away: do large language models think? Many people see chatbots invent facts and roll their eyes. Others watch them debug code, write apps, and solve problems and feel awe. The truth sits between these poles. It asks us to define “thinking,” and it pushes us to test it in plain ways we can measure.

Do large language models think? A practical way to judge

When we say “thinking,” we often mean different things. Sometimes we mean a private inner life. Sometimes we mean step-by-step logic. Often we mean something simpler: real understanding. Understanding looks like grasping a situation, making the right link at the right time, and choosing useful actions. That is what we can test.

Define the bar for “understanding”

Here is a clear bar that fits daily work:

  • Recognize patterns across text, images, and tasks.
  • Compress messy data into compact rules that predict what happens next.
  • Generalize to new situations with little help.
  • Plan a few steps ahead and adjust when feedback changes.
  • Explain choices in plain language that holds up under checks.

On these points, modern models do better than many expected. Yet they still miss obvious facts at times. Both parts matter when we ask whether large language models think.

How prediction builds a model of the world

Large language models learn by predicting the next token in text. They guess, see if the guess was wrong, and adjust their inner connections. Do this billions of times and the model forms a compact map of how words and ideas tend to go together. The process is simple, but the result is rich. It looks like understanding because good prediction needs useful structure.
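To make that loop concrete, here is a toy sketch in Python: a single table of "connection strengths" learns next-character probabilities by nudging its weights whenever its guess is wrong. The corpus, the learning rate, and the bigram setup are invented for illustration; real models are deep transformers trained on vastly more data, but the basic objective is the same.

```python
# Toy next-token predictor: one table of logits over (previous char, next char),
# trained by gradient descent on prediction error -- "guess, check, adjust".
import numpy as np

text = "the cat sat on the mat. the cat ate. the mat sat."
chars = sorted(set(text))
idx = {c: i for i, c in enumerate(chars)}
V = len(chars)

W = np.zeros((V, V))                                  # the model's "connections"
pairs = [(idx[a], idx[b]) for a, b in zip(text, text[1:])]

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

lr = 0.5
for step in range(200):
    loss, grad = 0.0, np.zeros_like(W)
    for prev, nxt in pairs:
        p = softmax(W[prev])                          # model's guess for the next char
        loss -= np.log(p[nxt])                        # surprise at the true next char
        p[nxt] -= 1.0                                 # gradient of the surprise w.r.t. logits
        grad[prev] += p
    W -= lr * grad / len(pairs)                       # adjust the connections
    if step % 50 == 0:
        print(f"step {step:3d}: average surprise {loss / len(pairs):.3f}")

p = softmax(W[idx["h"]])
print("after 'h', most likely next char:", repr(chars[int(p.argmax())]))
```

The average surprise falls as the table absorbs the corpus's regularities; that is the "compact map" described above, in miniature.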

Compression as a sign of understanding

One classic idea says that understanding and compression are two sides of the same coin. If you can compress data well, you must have found the rules inside it. When models train on huge text sets, they end up as tiny files compared to the raw data. The “blurry copy of the web” view is not wrong, but it misses this: the best compression is a working theory. For arithmetic, the “best compression” is a calculator. For language and images, the “best compression” is an internal map of meanings and relations.
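The link between prediction and compression can be put in numbers: a predictor that assigns probability p to the true next symbol needs about -log2(p) bits to encode it (for example, via arithmetic coding), so average surprise is a compression rate. The sketch below makes that comparison on an invented toy corpus; the bigram model is measured on the text it was fit to, which flatters it, but the mechanism is the point.

```python
# Compare the cost of encoding a toy corpus with no model (uniform guesses)
# versus a bigram model that has learned which character follows which.
import math
from collections import Counter, defaultdict

text = "the cat sat on the mat. the cat ate. the mat sat."
chars = sorted(set(text))

pair_counts = defaultdict(Counter)
for prev, nxt in zip(text, text[1:]):
    pair_counts[prev][nxt] += 1

def bits_per_char_bigram() -> float:
    total = 0.0
    for prev, nxt in zip(text, text[1:]):
        counts = pair_counts[prev]
        p = counts[nxt] / sum(counts.values())   # model's probability for the true next char
        total += -math.log2(p)
    return total / (len(text) - 1)

uniform = math.log2(len(chars))                  # cost per char when no structure is found
print(f"no model:     {uniform:.2f} bits/char")
print(f"bigram model: {bits_per_char_bigram():.2f} bits/char")
# Fewer bits per character means the model has captured real regularities:
# compression and prediction quality are the same number here.
```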

Thinking in high dimensions

Inside a model, every word and image piece becomes a point in a very high-dimensional space. Nearby points share meaning. Far points differ. Analogy becomes geometry. “Paris – France + Italy” points to “Rome.” A photo can also become a vector that captures its objects and mood. The model can move in that space to find, bind, and use the right ideas. This is not just word salad. It is a way to store and retrieve memories by similarity, much like how our brains call up related things when one thing comes to mind.
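A tiny illustration of "analogy becomes geometry," using hand-made 4-dimensional vectors rather than anything a real model learned; the words and numbers are invented so the arithmetic is easy to follow.

```python
import numpy as np

# Hand-made 4-D "embeddings" (axes roughly: city-ness, country-ness,
# France-ness, Italy-ness). Purely illustrative values.
emb = {
    "paris":  np.array([0.9, 0.1, 0.8, 0.0]),
    "france": np.array([0.1, 0.9, 0.8, 0.0]),
    "rome":   np.array([0.9, 0.1, 0.0, 0.8]),
    "italy":  np.array([0.1, 0.9, 0.0, 0.8]),
    "banana": np.array([0.0, 0.0, 0.1, 0.1]),    # unrelated distractor
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

query = emb["paris"] - emb["france"] + emb["italy"]   # analogy as vector arithmetic
candidates = {w: v for w, v in emb.items() if w not in {"paris", "france", "italy"}}
best = max(candidates, key=lambda w: cosine(candidates[w], query))
print("paris - france + italy ->", best)              # "rome" on these toy vectors
```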

A small, concrete story

Here is a test you can picture. A dad at a hot playground opens a shed. He sees a tangle of pipes. The kids want water. He takes a photo and asks an A.I. what to do. The model says it looks like a backflow valve and points to a yellow handle to turn. He turns it. The water starts. Did the model “think”? It matched a visual pattern, named the device, linked that to a typical fix, and gave a step. That is a kind of situational understanding, even if no feelings or inner voice were involved.

What brain science does—and does not—tell us

Neuroscience and A.I. grew together. Researchers modeled learning as small weight changes across a big network of simple units. Scale those networks up and new abilities pop out. That matched how many brain areas seem to work: layers of pattern detectors that build richer features.

Recognition is a core of cognition

Some scientists say, “cognition is recognition.” We see a few lines “as” a face. We see a chess shape “as” a weak bishop. We see a meeting “as” off the rails. Models now show similar “seeing as” behavior in their vector spaces. They activate “features” for topics, styles, tools, and goals. Tuning these features can steer the model in strong ways, which suggests real, structured internal content.

Circuits that look like planning

Teams that probe model internals report “circuits” that span many features. These circuits seem to handle rhyme, math carry, or tool use. When asked to produce a line that ends in a rhyme, a model can find the rhyme first and then build the sentence backward. That is a tiny plan. It is not full human planning, but it is more than blind token spitting.

Where the models still fail

Limits remain clear, and they matter. We must see both the strengths and the gaps to answer "do large language models think" with honesty.

Hallucinations and brittle logic

Models still invent facts. They get stuck in loops. Simple puzzles can trip them. Guardrails, retrieval tools, and chain-of-thought prompts help, but they do not fix the root cause by themselves.

Weak common-sense physics

Video and 3D tasks still expose shallow understanding of the physical world. Models can describe scenes but may miss that glass shatters or ropes tangle with friction. They can plot a maze but fail on a detour. This suggests their world model is thin outside language.

Sample efficiency and “frozen” brains

Children learn from far less data. They move, touch, test, and sleep. Their brains update all the time. By contrast, most models freeze their core weights after training. They remember user facts in a text prefix rather than in the network itself. This makes personalization shallow and brittle.

Planning at longer horizons

Models generate answers one token at a time. Tool use and agent frameworks add planning, but robust long-term goals with feedback are not consistent yet. Many “agents” work well in demos but fail in messy, open tasks.

How to tell if a model is thinking: five practical checks

You do not need a lab to test useful “thinking.” You need simple, fair tasks and clear metrics.

1) Compression-to-competence tests

  • Create a small synthetic language or rule set.
  • Give a few examples and ask the model to continue.
  • Measure whether it infers the rule and predicts new cases with high accuracy; a minimal harness sketch follows this list.
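A minimal harness sketch for this check. The rule (reverse each word), the prompts, and the ask_model function are placeholders, not a real API; swap in whatever model call you actually use.

```python
# Compression-to-competence sketch: teach a synthetic rule ("reverse the word")
# from a few examples, then score generalization on held-out cases.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your model of choice here")  # placeholder

RULE = lambda w: w[::-1]                         # the hidden rule the model must infer
train = ["stone", "river", "cloud"]
test = ["basket", "lantern", "pillow", "garden"]

examples = "\n".join(f"{w} -> {RULE(w)}" for w in train)

def score(words) -> float:
    correct = 0
    for w in words:
        prompt = f"Continue the pattern.\n{examples}\n{w} ->"
        answer = ask_model(prompt).strip().split()[0]
        correct += (answer == RULE(w))
    return correct / len(words)

# print(f"held-out accuracy: {score(test):.0%}")   # run once ask_model is wired up
```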

2) Cross-modal grounding

  • Show a photo with a hidden constraint (e.g., a valve to open, a circuit to close).
  • Ask for a step-by-step fix and a safety check.
  • Score success in simulation or a safe sandbox.

3) Counterfactual and causal probes

  • Pose A/B worlds that differ by one cause (e.g., “If the floor is wet, and we use shoe type X, what changes?”).
  • Check if the model updates conclusions in ways that match the change.
  • Reward precise, minimal edits to its own reasoning chain; a probe sketch follows this list.
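One way to script this check, sketched below: ask for a numeric risk rating in two worlds that differ by a single cause and verify the rating moves in the expected direction. The scenario, the 1-to-5 scale, and ask_model are assumptions.

```python
# Counterfactual probe sketch: the two prompts differ only in floor condition,
# so the slip-risk rating should not go down when the floor gets wet.
import re

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your model of choice here")  # placeholder

TEMPLATE = ("A worker in rubber-soled shoes crosses a {floor} tile floor. "
            "On a scale of 1 (no risk) to 5 (high risk), how likely is a slip? "
            "Answer with a single number.")

def rating(floor: str) -> int:
    answer = ask_model(TEMPLATE.format(floor=floor))
    match = re.search(r"[1-5]", answer)
    if match is None:
        raise ValueError(f"no rating found in: {answer!r}")
    return int(match.group())

# dry, wet = rating("dry"), rating("wet")
# print("passes causal probe:", wet >= dry)        # run once ask_model is wired up
```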

4) Low-shot generalization

  • Teach a new tool or API with 2–3 examples.
  • Ask for a novel use that combines parts in a new order.
  • Rate code execution success and error recovery without hints; a scoring sketch follows this list.
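A sketch of scoring this check by actually executing what the model writes. The made-up mini-API (tag and join_tags), the task, and ask_model are assumptions; the point is that success is measured by running the code against held-out cases, not by eyeballing it.

```python
# Low-shot generalization sketch: teach a tiny invented API with two examples,
# ask for a novel combination, then execute the model's code and check it.

def tag(word: str, label: str) -> str:           # the invented "new tool"
    return f"<{label}>{word}</{label}>"

def join_tags(items: list[str]) -> str:
    return " ".join(items)

EXAMPLES = '''
# Example 1
tag("cat", "noun")                                    # -> "<noun>cat</noun>"
# Example 2
join_tags([tag("red", "adj"), tag("car", "noun")])    # -> "<adj>red</adj> <noun>car</noun>"
'''

TASK = ("Using only tag() and join_tags(), write a function sentence(words, label) "
        "that tags every word with the same label and joins them. Return only code.")

def ask_model(prompt: str) -> str:
    raise NotImplementedError("call your model of choice here")  # placeholder

def evaluate() -> bool:
    code = ask_model(EXAMPLES + "\n" + TASK)
    scope = {"tag": tag, "join_tags": join_tags}
    exec(code, scope)                            # throwaway namespace; sandbox this for real use
    got = scope["sentence"](["big", "dog"], "noun")
    return got == "<noun>big</noun> <noun>dog</noun>"

# print("novel-use check passed:", evaluate())   # run once ask_model is wired up
```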

5) Planning under feedback

  • Set a modest multi-step goal (e.g., gather three sources, summarize, compare, and draft a decision memo).
  • Introduce a mid-course change (a new constraint).
  • Measure plan updates, tool calls, and final quality against a rubric; a scoring sketch follows this list.
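A compact way to make "against a rubric" concrete, sketched below. The criteria, weights, and 0-to-2 scale are assumptions to adapt to your own task.

```python
# Planning-under-feedback rubric sketch: weight each criterion, score 0-2,
# and report a normalized total so runs across models and versions compare cleanly.
RUBRIC = {                      # criterion: weight (assumed values)
    "updated plan after the new constraint": 3,
    "tool calls were relevant and succeeded": 2,
    "memo compares all three sources": 2,
    "final recommendation is justified": 3,
}

def rubric_score(scores: dict[str, int]) -> float:
    """scores maps each criterion to 0 (missed), 1 (partial), or 2 (met)."""
    total = sum(RUBRIC[c] * scores[c] for c in RUBRIC)
    return total / (2 * sum(RUBRIC.values()))

example_run = {
    "updated plan after the new constraint": 2,
    "tool calls were relevant and succeeded": 1,
    "memo compares all three sources": 2,
    "final recommendation is justified": 1,
}
print(f"run quality: {rubric_score(example_run):.0%}")   # 75% on these example scores
```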

Run these checks across models and versions. Track gains. You will see a clear trend: steady progress, with stubborn weak spots.

Why the answer matters

If models “think,” even in a narrow sense, then they are more than autocomplete toys. They become partners in work. They can write working software, draft solid reports, and support research. That brings gains, but it also shifts power. It changes who decides, who benefits, and who bears the risks.

The moral stakes

  • Energy use and cost: Training and serving big models burn power.
  • Labor impact: Automation pressure can displace or reshape jobs.
  • Misinformation: Fluent output can mislead at scale if unchecked.
  • Safety: Strong tool use can cause harm without guardrails.

These issues stand even if you reject the idea that models think. But if you accept that they show some thinking-like behavior, the urgency rises. Better testing, better controls, and better incentives follow.

What will push models closer to robust thinking

Progress will not come from size alone. Returns from bigger data and more chips already taper. The next jumps likely need new kinds of experience and memory.

Embodied and interactive learning

Models need to act, see the result, and adjust. Simulators and safe robots can supply this. Even simple loops—click, read, compare, revise—teach useful cause and effect that text alone lacks.

Continual learning and sleep-like replay

Brains practice the day during sleep. Models could do “sleep” too. They could replay selected moments, compress them into their core, and wake up better. Done right, this would reduce forgetting and improve adaptation.

Richer priors and safer tools

Children start with expectations about objects and agents. We can give models light priors for physics, space, and social cues. We can pair them with verified tools—calculators, databases, planners—that ground their steps and explain outcomes.
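Here is a small sketch of what "grounding with a verified tool" can look like in practice: instead of trusting a model's arithmetic, route any "a op b = x" claim in a draft answer through a real calculator. The regex, the draft-checking setup, and the example sentence are assumptions for illustration.

```python
# Verify a draft answer's arithmetic with a tiny, safe calculator tool.
import ast
import operator
import re

OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def calc(expr: str) -> float:
    """Safely evaluate +, -, *, / arithmetic, refusing anything else."""
    def ev(node):
        if isinstance(node, ast.Expression):
            return ev(node.body)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](ev(node.left), ev(node.right))
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        raise ValueError("not plain arithmetic")
    return ev(ast.parse(expr, mode="eval"))

def ground_arithmetic(draft: str) -> str:
    """Replace 'a op b = x' claims in a draft answer with verified results."""
    def fix(m):
        return f"{m.group(1)} = {calc(m.group(1)):g}"
    pattern = r"(\d+(?:\.\d+)?\s*[-+*/]\s*\d+(?:\.\d+)?)\s*=\s*\d+(?:\.\d+)?"
    return re.sub(pattern, fix, draft)

print(ground_arithmetic("The total is 37 * 24 = 878."))   # -> 37 * 24 = 888
```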

A clear, balanced answer

So, do large language models think? They do not feel. They do not have an inner life. But they do show growing signs of understanding: they compress, generalize, and plan in small but real ways. You can tell by how they solve new tasks with little help, how they link ideas across modes, and how they adjust under feedback. They also still hallucinate, struggle with real-world physics, and learn less efficiently than children.

The useful stance is simple: treat them as early, alien thinkers that need tests, tools, and limits. Design better training and memory. Measure what matters. If we do this, we will get clearer answers to “do large language models think” and better systems for people.

(Source: https://www.newyorker.com/magazine/2025/11/10/the-case-that-ai-is-thinking)


FAQ

Q: Do large language models think?
A: They show signs of understanding, such as compressing information, generalizing to new tasks, reasoning across steps, and using tools, but they do not have consciousness or an inner life. You can judge these capabilities by measurable behaviors rather than by attributing feelings.

Q: How can we practically judge whether large language models think?
A: Use simple, fair tasks and clear metrics such as compression-to-competence tests, cross-modal grounding, counterfactual probes, low-shot generalization, and planning under feedback. Running these checks across models and versions reveals steady progress alongside persistent weaknesses.

Q: Why is compression considered a sign of understanding when we ask whether large language models think?
A: Compression means distilling messy data into compact rules that predict new cases, so good compression suggests the model has found underlying regularities. Large models trained on huge text corpora become much smaller than their training data, which indicates an internal map of meanings rather than mere copying.

Q: Can large language models have consciousness or an inner life?
A: No; the article emphasizes that models do not feel and do not possess an inner life even when they display understanding-like behavior. Their abilities (pattern recognition, short-term planning, and generalization) are not the same as conscious experience.

Q: What failures show the limits relevant to whether large language models think?
A: Models still hallucinate plausible-sounding falsehoods, get stuck in loops, and stumble on simple puzzles, which indicates brittle logic. They also show weak common-sense physics, are far less sample-efficient than children, and typically freeze core weights after training, limiting deep personalization.

Q: How does neuroscience inform the question of whether large language models think?
A: Neuroscience and A.I. share ideas like networks of simple units, learning by small weight changes, and high-dimensional representations, which helps explain why scaled models exhibit brain-like pattern detectors. Researchers find that words and images become vectors and that feature ensembles and circuits can resemble recognition and planning processes.

Q: What developments could push models closer to robust thinking?
A: Progress will likely require embodied and interactive learning so models can act, observe outcomes, and learn cause and effect. Complementary advances include continual learning or sleep-like replay to consolidate memories, plus richer priors and verified tools to ground reasoning.

Q: Why does the answer to "do large language models think" matter?
A: If models think even in a narrow sense, they become partners in work, which shifts who decides, who benefits, and who bears the risks. That raises moral stakes (energy use, labor impact, misinformation, and safety) and increases the urgency for better testing, controls, and incentives.
