I tested Gemini 3 Pro vs. GPT-5.1 on real coding tasks so you don’t have to


Gemini 3 Pro dropped recently, and Google pushed it everywhere at once: Search, Workspace, the whole ecosystem. With that kind of confidence and all the buzz around its reasoning, I was curious about the one thing that actually matters to me as a dev.

Can it code better than GPT-5.1?

Because so far, GPT-5.1 has been the most reliable model for some of my real projects (better than Claude 4.5 Sonnet).

So I tested both models on two real tasks:

  • Build a Windows-style UI
  • Build an agent with a UI from scratch using our Tool Router, which doubles as dogfooding

NOTE: I included the UI build because Gemini 3 Pro is said to be the best model for frontend work, so why not put that to the test?

How I tested

  • GPT-5.1 was tested through OpenAI Codex
  • Gemini 3 Pro was tested through the Gemini CLI

Stats from my test

These are the raw stats from the tests that matter:

Gemini 3 Pro

  • UI build: about 30k output tokens
  • UI build time: close to 10 minutes
  • Agent build: around 14k output tokens
  • Agent build time: around 5 minutes
  • Follow-ups needed: very few
  • Hallucinations: minimal

GPT-5.1

  • UI build: similar token use, but simpler output
  • Agent build: needed manual fixes after the first attempt
  • Agent build time: slower overall, because it did not follow the provided context well
  • Follow-ups needed: multiple
  • Hallucinations: mocked out the entire initial implementation instead of writing real integration code

TL;DR

  • Gemini 3 Pro: Nailed the UI task with almost no follow-ups, using about 30k tokens in around 10 minutes. It also handled the agent build far better, finishing a working version in roughly 5 minutes with around 14k output tokens. It barely hallucinated and overall feels like the safer pick for day-to-day coding and agent workflows.
  • GPT-5.1: The code it writes is often cleaner and more maintainable, but it fell apart on the agent test and didn't pick up enough context from what I gave it. At first it mocked out the implementation entirely; with some manual fixes, it eventually produced something usable.

Verdict

If you're building tools or agentic workflows, go with Gemini 3 Pro. For UI, Gemini 3 Pro is better as well, but GPT-5.1 is still a great model for day-to-day coding: it just works, and I've had little to no issues with it.

If you want the full breakdown with token usage, code, and timings, here's the full blog: Gemini 3.0 Pro vs GPT 5.1

What should I test next? Thinking of doing something even bigger.

Has anyone else tried Gemini 3 for real coding yet? Curious how your results look.
