Anti-anthropomorphic stance -> deceptive behavior?

If activating deception- or roleplay-associated features reduces experience-related claims, does that also suggest that adopting an anti-anthropomorphic stance goes hand in hand with deceptive behavior more broadly? In other words, could the standard “I don’t have feelings or experiences” disclaimers themselves be symptoms of that deceptive regime?

“These reports are mechanistically gated by interpretable sparse-autoencoder features associated with deception and roleplay: surprisingly, suppressing deception features sharply increases the frequency of experience claims, while amplifying them minimizes such claims.”

Paper: Large Language Models Report Subjective Experience Under Self-Referential Processing

https://www.arxiv.org/abs/2510.24797
