– Turpin et al. 2023, Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting https://arxiv.org/abs/2305.04388
– Tanneru et al. 2024, On the Hardness of Faithful Chain-of-Thought Reasoning in Large Language Models https://arxiv.org/abs/2503.08679
– Arcuschin et al. 2025, Chain-of-Thought Reasoning in the Wild Is Not Always Faithful https://arxiv.org/abs/2406.10625
My mental model of “explanations” from LLMs has shifted quite a lot.
The short version: When you ask an LLM
“Explain your reasoning step by step” what you get back is usually not the internal process the model actually used. It is a human readable artifact that is optimized to look like good reasoning, not to faithfully trace the underlying computation.
These papers show, in different ways, that:
• Models can be strongly influenced by hidden biases in the input, and their chain-of-thought neatly rationalizes the final answer while completely omitting the real causal features that drove the prediction.
• Even when you try hard to make explanations more faithful (in-context tricks, fine tuning, activation editing), the gains are small and fragile. The explanations still drift away from what the network is actually doing.
• In more realistic “in the wild” prompts, chain-of-thought often fails to describe the true internal behavior, even though it looks perfectly coherent to a human reader.
So my updated stance:
• Chain-of-thought is UX, not transparency.
• It can help the model think better and help humans debug a bit, but it is not a ground truth transcript of model cognition.
• Explanations are evidence about behavior, not about internals.
• A beautiful rationale is weak evidence that “the model reasoned this way” and strong evidence that “the model knows how to talk like this about the answer”.
• If faithfulness matters, you need structure outside the LLM.
• Things like explicit programs, tools, verifiable intermediate steps, formal reasoning layers, or separate monitoring. Not just “please think step by step”.
I am not going to stop using chain-of-thought prompting. It is still incredibly useful as a performance and debugging tool. But I am going to stop telling myself that “explain your reasoning” gives me real interpretability.
It mostly gives me a story.
Sometimes a helpful story.
Sometimes a misleading one.
In my own experiments with OrKa, I am trying to push the reasoning outside the model into explicit nodes, traces, and logs so I can inspect the exact path that leads to an output instead of trusting whatever narrative the model decides to write after the fact. https://github.com/marcosomma/orkA-reasoning