So I ran some tests, and here are the Gemini 3 Pro Preview benchmark results:
– First, two benchmarks you have already seen on this subreddit when we were discussing whether Polish is a better language for prompting: Logical Puzzles – English and Logical Puzzles – Polish. Gemini 3 Pro Preview scores 92% on the Polish puzzles, tied for first place with Grok 4. On the English puzzles the new Gemini model also takes first place, tied with Gemini-2.5-pro, with a perfect 100% score.
– Next, the AIME25 Mathematical Reasoning Benchmark. Gemini 3 Pro Preview once again ties for first place with Grok 4. Cherry on top: Gemini's latency is significantly lower than Grok's.
– Finally, a linguistic challenge: Semantic and Emotional Exceptions in Brazilian Portuguese. Here the model placed only sixth, behind glm-4.6, deepseek-chat, qwen3-235b-a22b-2507, llama-4-maverick, and grok-4.
Full results are in the comments! (They're not super easy to read since I can't attach a screenshot, so it's better to click the corresponding benchmark links.)
Let me know if there are specific benchmarks you'd like me to run Gemini 3 on, and which other models to compare it against.
P.S. looking at the leaderboard for Brazilian Portuguese I wonder if there is a correlation between geopolitics and model performance 🤔 A question for next week…
Links to benchmarks:
- Logical Puzzles – English: https://www.peerbench.ai/benchmarks/view/95
- Logical Puzzles – Polish: https://www.peerbench.ai/benchmarks/view/89
- AIME25 Mathematical Reasoning: https://www.peerbench.ai/benchmarks/view/100
- Semantic and Emotional Exceptions in Brazilian Portuguese: https://www.peerbench.ai/benchmarks/view/161