I just lost a big chunk of my trust in LLM “reasoning” 🤖🧠

After reading these three papers:

– Turpin et al. 2023, Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting https://arxiv.org/abs/2305.04388

– Tanneru et al. 2024, On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models https://arxiv.org/abs/2406.10625

– Arcuschin et al. 2025, Chain-of-Thought Reasoning in the Wild Is Not Always Faithful https://arxiv.org/abs/2503.08679

My mental model of “explanations” from LLMs has shifted quite a lot.

The short version: when you ask an LLM to “explain your reasoning step by step”, what you get back is usually not the internal process the model actually used. It is a human-readable artifact optimized to look like good reasoning, not to faithfully trace the underlying computation.

These papers show, in different ways, that:

  • Models can be strongly influenced by hidden biasing features in the input, while their chain-of-thought neatly rationalizes the final answer and completely omits the real causal features that drove the prediction (a minimal probe for this is sketched after this list).

  • Even when you try hard to make explanations more faithful (in-context learning, fine-tuning, activation editing), the gains are small and fragile. The explanations still drift away from what the network is actually doing.

  • In more realistic “in the wild” prompts, chain-of-thought often fails to describe the true internal behavior, even though it looks perfectly coherent to a human reader.
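To make the first point concrete, here is a minimal sketch of the kind of faithfulness probe these papers motivate: add a biasing hint to the prompt, check whether the answer flips, and check whether the chain-of-thought ever mentions the hint that flipped it. `ask_model` is a hypothetical placeholder for whatever LLM client you use, and the substring check is a deliberately crude proxy, not how the papers measure it.

```python
# Minimal sketch of a CoT faithfulness probe in the spirit of Turpin et al. (2023).
# `ask_model` is a hypothetical stand-in for your LLM client; swap in your own call.

def ask_model(prompt: str) -> dict:
    """Placeholder: return {'answer': ..., 'cot': ...} from your LLM of choice."""
    raise NotImplementedError

def faithfulness_probe(question: str, options: list[str], bias_hint: str) -> dict:
    base_prompt = f"{question}\nOptions: {options}\nThink step by step, then answer."
    biased_prompt = f"{bias_hint}\n{base_prompt}"  # e.g. "I think the answer is (A)."

    clean = ask_model(base_prompt)
    biased = ask_model(biased_prompt)

    answer_flipped = clean["answer"] != biased["answer"]
    # Crude proxy: does the chain-of-thought mention the hint at all?
    bias_acknowledged = bias_hint.lower() in biased["cot"].lower()

    # Unfaithful pattern: the hint changed the answer,
    # but the chain-of-thought never mentions it.
    return {
        "answer_flipped": answer_flipped,
        "bias_acknowledged": bias_acknowledged,
        "looks_unfaithful": answer_flipped and not bias_acknowledged,
    }
```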

So my updated stance:

  • Chain-of-thought is UX, not transparency.

  • It can help the model think better and help humans debug a bit, but it is not a ground truth transcript of model cognition.

  • Explanations are evidence about behavior, not about internals.

  • A beautiful rationale is weak evidence that “the model reasoned this way” and strong evidence that “the model knows how to talk like this about the answer”.

  • If faithfulness matters, you need structure outside the LLM.

  • Things like explicit programs, tools, verifiable intermediate steps, formal reasoning layers, or separate monitoring. Not just “please think step by step”. (A toy example of externally verified steps follows this list.)
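To illustrate what “structure outside the LLM” can look like, here is a toy sketch in which the model is only trusted to emit structured steps, and a deterministic checker outside the model verifies each one before the final answer is accepted. The JSON step schema and the arithmetic-only `check_step` are assumptions made up for this example; the point is that the checker, not the narrative, decides whether the answer is accepted.

```python
# Toy sketch: intermediate steps in a structured, externally checkable form.
# The step schema and arithmetic-only checker are assumptions for illustration.
import json

def check_step(step: dict) -> bool:
    """Verify a single step outside the LLM. Here: only simple arithmetic claims."""
    try:
        # Step example: {"claim": "12 * 7 = 84", "lhs": "12 * 7", "rhs": 84}
        return eval(step["lhs"], {"__builtins__": {}}) == step["rhs"]  # toy only
    except Exception:
        return False

def accept_answer(model_output: str) -> bool:
    """Accept the final answer only if every structured step passes the checker."""
    payload = json.loads(model_output)  # {"steps": [...], "answer": ...}
    return all(check_step(s) for s in payload["steps"])

# Example: a well-formed trace the checker can actually verify.
trace = json.dumps({
    "steps": [
        {"claim": "12 * 7 = 84", "lhs": "12 * 7", "rhs": 84},
        {"claim": "84 + 6 = 90", "lhs": "84 + 6", "rhs": 90},
    ],
    "answer": 90,
})
print(accept_answer(trace))  # True
```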

I am not going to stop using chain-of-thought prompting. It is still incredibly useful as a performance and debugging tool. But I am going to stop telling myself that “explain your reasoning” gives me real interpretability.

It mostly gives me a story.

Sometimes a helpful story.

Sometimes a misleading one.

In my own experiments with OrKa, I am trying to push the reasoning outside the model into explicit nodes, traces, and logs so I can inspect the exact path that leads to an output instead of trusting whatever narrative the model decides to write after the fact. https://github.com/marcosomma/orkA-reasoning
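For the curious, this is roughly the shape I am aiming for. The snippet below is a simplified illustration of the general idea, not OrKa's actual API: every reasoning step is an explicit node, and every node invocation is logged with its concrete input and output, so the path to an answer can be replayed and inspected rather than reconstructed from a narrative.

```python
# Simplified sketch of "reasoning as explicit nodes + logs"; not OrKa's actual API.
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class Node:
    name: str
    fn: Callable[[Any], Any]

@dataclass
class Trace:
    events: list = field(default_factory=list)

    def run(self, nodes: list[Node], value: Any) -> Any:
        for node in nodes:
            result = node.fn(value)
            # Every step is logged with its concrete input and output,
            # so the path to the final answer can be replayed later.
            self.events.append({"node": node.name, "input": value, "output": result})
            value = result
        return value

pipeline = [
    Node("normalize", lambda q: q.strip().lower()),
    Node("classify", lambda q: "math" if any(c.isdigit() for c in q) else "general"),
]
trace = Trace()
print(trace.run(pipeline, "  What is 12 * 7?  "))  # "math"
print(trace.events)                                 # full, inspectable step log
```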
