
My task: get each model to create a video-generation prompt for me, i.e. output a prompt I can feed into a video generation tool. (ChatGPT 5.1 Extended Thinking and Gemini 3 Thinking, both on the website, on Pro plans.)
I provide both models with a sequence of images: basically the first and last frame, plus some images showing what the frames in between look like, to give pointers on which objects I'd like animated to go from the first frame to the last frame in the video. All images are named sequentially.
Seems like a pretty straightforward prompt to generate, to me. The way an animation artist would describe it, I guess. And the animations were basically PPT transitions. Nothing crazy.
Ok…
ChatGPT goes mad bonkers, like really seriously goes overboard trying to do its best on this task!! It uses PYTHON to generate DIFFS between pairs of sequential image frames (like bro's doing CV here lol), "zooms in" (using Python) on parts of the image it wants to see more clearly, and basically does everything it can. Plus the UI is wonderful! I can see all the intermediate outputs from Python and really follow along; the UX of its thinking trace is sublime, and it really helps me visualize and trust its process. ChatGPT thinks for over 8 minutes. The resulting video-generation prompt is fabulous: really detailed on which objects in the first frame need to be animated, and how things fade or move so we get to the final frame sensibly. No additional prompting required.
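For anyone curious what that frame-diff trick might look like, here's a rough, hypothetical sketch. It's not what ChatGPT actually ran (I can't see its exact code), just the general idea: subtract sequential frames pixel-wise and find the bounding box of whatever changed, which tells you which region to "zoom in" on. Plain Python with toy nested-list "frames" keeps it self-contained; a real version would load the images with PIL or NumPy.

```python
def frame_diff_bbox(frame_a, frame_b, threshold=10):
    """Return (top, left, bottom, right) of the changed region, or None.

    Frames are 2D grids (lists of lists) of grayscale pixel values.
    A pixel counts as "changed" if it differs by more than `threshold`.
    """
    changed = [
        (y, x)
        for y, row in enumerate(frame_a)
        for x, px in enumerate(row)
        if abs(px - frame_b[y][x]) > threshold
    ]
    if not changed:
        return None
    ys = [y for y, _ in changed]
    xs = [x for _, x in changed]
    return (min(ys), min(xs), max(ys), max(xs))

# Two tiny 4x4 "frames": a bright pixel moves from (1,1) to (2,2).
f1 = [[0] * 4 for _ in range(4)]; f1[1][1] = 255
f2 = [[0] * 4 for _ in range(4)]; f2[2][2] = 255
print(frame_diff_bbox(f1, f2))  # → (1, 1, 2, 2), covering both positions
```

The bounding box is exactly the kind of intermediate output you'd then crop and inspect more closely (Pillow's `ImageChops.difference` plus `getbbox` does the same thing on real images).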
And now… Gemini 3 Thinking. It BARELY takes any time at all, and its thinking-token trace is mid, fr. Like 5 basic paragraphs of text. The resulting video-generation prompt is really basic, something I'd expect from, like, 2.5 Flash. No amount of additional prompting to ask it to drill down helps, not even suggesting it analyze image diffs between subsequent frames (lol, wild that ChatGPT did this on its own). I know Gemini should be able to use the built-in Python interpreter; maybe it's not trained on tool use within its thinking scratchpad trace or something? Whatever. Ok. But the resulting video-generation prompt? Sucks even more.
Is this just a really huge win for ChatGPT, or am I missing something? Can Gemini 3 Thinking not do tool use within its thinking trace? Should I use Canvas in Gemini or something?
