
Models tested: Gemini 3 Pro Preview, Claude Sonnet 4.5, Grok 4.1, Grok Code Fast 1, GPT-5.1-Codex, MiniMax M2, Gemini 2.5 Pro.
The Experiment
We gave all 7 models the same task: build an analytics dashboard for an AI code editor. We provided 4 sample metrics and chart data showing model usage distribution over 7 days, then told them to "Use your creativity to make this beautiful and functional."
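To make the setup concrete, here is a sketch of what the prompt data could look like: four headline metrics plus a 7-day usage series. The type names and all values below are illustrative placeholders, not the actual numbers from the experiment.

```typescript
// Hypothetical shape of the prompt data: 4 headline metrics plus
// a 7-day series of model usage. All values are placeholders.
interface Metric {
  label: string;
  value: string;
  change: string; // week-over-week delta
}

interface UsagePoint {
  day: string;
  completions: number; // code completions served that day
}

const metrics: Metric[] = [
  { label: "Completions Accepted", value: "12.4k", change: "+8%" },
  { label: "Active Users", value: "3,210", change: "+3%" },
  { label: "Avg. Latency", value: "420ms", change: "-5%" },
  { label: "Languages Used", value: "14", change: "+1" },
];

const usage: UsagePoint[] = [
  { day: "Mon", completions: 1800 },
  { day: "Tue", completions: 2100 },
  { day: "Wed", completions: 1950 },
  { day: "Thu", completions: 2400 },
  { day: "Fri", completions: 2250 },
  { day: "Sat", completions: 900 },
  { day: "Sun", completions: 1000 },
];

// Logs the counts: 4 metrics across a 7-day usage series.
console.log(metrics.length, usage.length);
```

Each model received the same data and the same open-ended instruction, so the differences below come entirely from how the models chose to present it.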
All models used Next.js 15, React 19, and Tailwind CSS v4. Same stack, same data, yet the results diverged sharply.
The Results
2 of the 7 models (a 29% failure rate) failed because their knowledge cutoffs predate Tailwind CSS v4: they generated outdated Tailwind v3 syntax, which produced completely unstyled dashboards. A third model (MiniMax M2) partially failed, with broken padding but working colors and charts.
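The failure mode is concrete. Tailwind v4 changed how the framework is wired into a project: a v3-era setup registers Tailwind via `@tailwind` directives in the global stylesheet (alongside a `tailwind.config.js`), while v4 replaces those directives with a single CSS import. A model trained before the v4 release emits the old directives, which v4 no longer processes, so every utility class silently resolves to nothing. A minimal illustration:

```css
/* Tailwind v3 entry point (what the failing models generated) */
@tailwind base;
@tailwind components;
@tailwind utilities;

/* Tailwind v4 entry point (what the project actually expects) */
@import "tailwindcss";
```

Because the broken output is still valid CSS, the build succeeds and the page renders: it just renders unstyled, which is exactly what we saw.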
Winner: Gemini 3 Pro Preview
Google's latest model layered useful context on top of the features we asked for. The standout was a "Recent System Events" table showing live activity such as Code Completion, Refactor Request, and Unit Test Gen events, each with the model that processed it, its latency, and a status indicator.
Gemini 3 also got creative with product branding, naming our dashboard "SynthCode v2.4.0" instead of something generic, and added a "systems operational" status indicator. Code efficiency: 285 lines total. Not the shortest, but every line serves a purpose.
Second Place: Claude Sonnet 4.5
Claude demonstrated restraint: it knew what to add and what to skip. It added a "Live" animated pulse indicator, three helpful insight cards (Peak Hours, Most Used Language, Weekly Growth), and a footer stats bar with relevant metrics such as Projects Active, Code Acceptance %, and Uptime %.
Code length: ~200 lines. Clean component structure, full-width charts.
Third Place: Grok 4.1
xAI's latest model proved that less is more, delivering a functional analytics dashboard in only 100 lines of code. No buzzwords, no irrelevant features, no overengineering. Just:
- 4 metric cards with icons
- Area chart (code generation over 7 days)
- Donut chart (model usage distribution)
- "Last updated: just now" timestamp
This is enough for an MVP version of a dashboard.
GPT-5.1-Codex Over-Engineered
OpenAI's GPT-5.1-Codex added the most features (341 lines), but most of them were irrelevant. It included things like "Trigger safe-mode deploy" buttons (this was an analytics dashboard, not a CI/CD panel) and invented metrics that weren't in our prompt data, such as a "Success Funnel" with made-up percentages.
The pattern: GPT-5.1-Codex copy-pasted concepts from ops/SRE/infrastructure dashboards without considering whether they fit this dashboard's purpose. It optimized for "sounding impressive" over "being accurate."
Key Takeaways
- More features ≠ better. GPT-5.1-Codex's 341 lines lost to Gemini 3's 285.
- Training recency matters. Gemini 2.5 Pro (8 months old) failed completely on Tailwind v4. Gemini 3 Pro Preview (released yesterday) won 1st place.
- Thoughtful additions > overengineering. Every feature should serve a purpose.
- Sometimes minimal is best. Grok 4.1's 100 lines show you don't need complexity to be effective.
Full breakdown with screenshots -> link.
