Gemini 3 on SWE-bench verified with minimal agent: New record! Full results & cost analysis

Hi, I'm from the SWE-bench team. We just finished independently evaluating the Gemini 3 Pro preview on SWE-bench Verified, and it is indeed top of the board with 74% (almost 4 percentage points ahead of the next best model). The evaluation used a minimal agent (`mini-swe-agent`) with no prompt tuning at all, so this really measures model quality.

https://preview.redd.it/y6r580bah82g1.png?width=947&format=png&auto=webp&s=85f4553007ba11ec5cec0a71285555ad2b2c377a

Costs are about 1.6x those of GPT-5, but still cheaper than Sonnet 4.5.

Gemini takes an exceptionally high number of steps to iterate on a task, significantly more than GPT-5; its resolution rate only flattens beyond 100 steps (though Sonnet 4.5 is higher still).

https://preview.redd.it/3x36h4jgg92g1.png?width=780&format=png&auto=webp&s=66f57f3babb1c3e81063064c0cb73a068c28f891

By varying the maximum number of steps you allow your agent, you can trade resolution rate against cost. Gemini 3 is more cost-efficient than Sonnet 4.5, but much less so than GPT-5 (or GPT-5-mini).
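To illustrate the trade-off, here is a minimal sketch of how a step cap turns per-instance trajectories into a (resolution rate, cost) point. The data and per-step cost are entirely hypothetical, not our benchmark numbers:

```python
# Sketch of the step-cap trade-off. All numbers below are hypothetical,
# not the actual SWE-bench results.
def resolution_and_cost(trajectories, step_cap):
    """trajectories: list of (steps_to_solve or None if unsolved, cost_per_step)."""
    solved = 0
    cost = 0.0
    for steps_to_solve, cost_per_step in trajectories:
        if steps_to_solve is not None and steps_to_solve <= step_cap:
            solved += 1
            cost += steps_to_solve * cost_per_step
        else:
            # Unsolved (or solved too late): the agent burns steps up to the cap.
            cost += step_cap * cost_per_step
    return solved / len(trajectories), cost

# Hypothetical instances: (steps needed to solve, $ per step)
data = [(12, 0.02), (45, 0.02), (None, 0.02), (130, 0.02)]
for cap in (25, 50, 150):
    rate, cost = resolution_and_cost(data, cap)
    print(f"cap={cap:3d}  resolved={rate:.0%}  cost=${cost:.2f}")
```

Raising the cap only helps instances whose solve length falls under it, while every unsolved instance pays the full cap, which is why the curves flatten at high step counts.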

https://preview.redd.it/k2pvuuohh82g1.png?width=695&format=png&auto=webp&s=ff94990bd32b33cc7294a882f526d46fd45ec76a

You can browse all agent trajectories/logs in your web browser here: https://docent.transluce.org/dashboard/3641b17f-034e-4b36-aa66-471dfed837d6

Full leaderboard ("bash only"): https://www.swebench.com/ (about to be updated)

All comparisons were performed with mini-swe-agent, a bare-bones agent that uses only bash and the same scaffold & prompts for every model, for an apples-to-apples comparison. It also comes with a claude-code-style CLI if you want to try it or reproduce our numbers: https://github.com/SWE-agent/mini-swe-agent/
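The "bash only" idea is simple enough to sketch in a few lines: the model proposes a shell command, the environment executes it, and the output is fed back as the next observation. This is only a toy sketch of that loop, not mini-swe-agent's actual implementation; a scripted stub stands in for the LLM so the example is runnable:

```python
import subprocess

def scripted_model(history):
    """Hypothetical stand-in for an LLM: returns the next bash command."""
    commands = ["echo hello > /tmp/demo.txt", "cat /tmp/demo.txt", "exit 0"]
    return commands[len(history)]

def run_agent(model, max_steps=10):
    """Bash-only agent loop: ask the model for a command, run it, record the output."""
    history = []
    for _ in range(max_steps):
        command = model(history)
        if command == "exit 0":  # the model signals it is done
            break
        result = subprocess.run(command, shell=True, capture_output=True, text=True)
        history.append((command, result.stdout + result.stderr))
    return history

history = run_agent(scripted_model)
print(history[-1][1].strip())  # -> hello
```

Because the only tool is the shell, the exact same scaffold works for every model: nothing model-specific beyond the prompt, which is what makes the comparison fair.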
