
According to the benchmark, when Gemini 3 Pro cannot answer a question correctly, it gives a wrong answer rather than “I don't know” in ~88% of cases; it declines appropriately in only ~12% of these “no-hit cases.” This is the typical LLM training problem: the model is rewarded for guessing, even wrongly, rather than for saying “I don't know.”
Source: https://x.com/ArtificialAnlys/status/1990926803087892506/
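To make the numbers concrete, here is a minimal sketch of how the two rates relate, assuming a “no-hit case” is simply any question the model does not answer correctly; the counts and the function name below are my own illustration, not taken from the benchmark itself.

```python
# Illustrative sketch (not the benchmark's actual code): splitting "no-hit"
# cases, i.e. questions the model does not answer correctly, into wrong
# answers (hallucinations) and correct refusals ("I don't know").

def no_hit_breakdown(wrong_answers: int, refusals: int) -> tuple[float, float]:
    """Return (hallucination_rate, refusal_rate) among no-hit cases."""
    no_hits = wrong_answers + refusals
    return wrong_answers / no_hits, refusals / no_hits

# Hypothetical counts chosen only to reproduce the reported ~88% / ~12% split.
halluc, refuse = no_hit_breakdown(wrong_answers=880, refusals=120)
print(f"hallucination rate: {halluc:.0%}, correct refusal rate: {refuse:.0%}")
# -> hallucination rate: 88%, correct refusal rate: 12%
```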