
I'm pleased that their image generation is more coherent and consistent than their older models. I generated images of each character separately, plus backgrounds, attached them as context, and prompted for the poses/arrangements I wanted.
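I did all of this through the chat interface, but if you wanted to script it, the same idea maps roughly onto OpenAI's image edit API with reference images attached as context. A minimal sketch; the model name, file names, and prompt are illustrative, not my actual assets:

```python
# Sketch only: character sheet + background plate as reference images,
# plus a plain pose/arrangement prompt. File names are hypothetical.
import base64
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

result = client.images.edit(
    model="gpt-image-1",
    image=[
        open("eva_reference.png", "rb"),    # character sheet (hypothetical)
        open("train_interior.png", "rb"),   # background plate (hypothetical)
    ],
    prompt=(
        "Eva (young girl in yellow shirt with curly hair) sits by the "
        "train window, looking out at cows in a field."
    ),
)

with open("scene_01.png", "wb") as f:
    f.write(base64.b64decode(result.data[0].b64_json))
```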
I still needed to do a lot of reprompting and manual editing in Photoshop. ChatGPT gives its images a pale yellow cast, and even after I adjusted levels in Photoshop, there was still a lot of yellow giving off that AI vibe.
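If you'd rather script the color correction than eyeball levels in Photoshop, a crude gray-world white balance knocks a yellow cast down: a yellow tint just means red/green run hot relative to blue. This is a sketch of the idea, not what I actually used:

```python
# Sketch: gray-world white balance to reduce a yellow cast.
# Scales each channel so its mean matches the overall gray average.
import numpy as np
from PIL import Image

img = np.asarray(Image.open("scene_01.png").convert("RGB"), dtype=np.float32)

means = img.reshape(-1, 3).mean(axis=0)   # per-channel averages
gray = means.mean()                       # neutral target
balanced = np.clip(img * (gray / means), 0, 255).astype(np.uint8)

Image.fromarray(balanced).save("scene_01_neutral.png")
```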
ChatGPT also struggles to keep lines in the background coherent (ceiling beams, pylons, etc.), so I had to reconstruct those by hand.
There were also some weird textures and artifacts, which I was able to clean up with a locally running Stable Diffusion model specially trained to restore old animation footage. It worked well for this. I also used local models for upscaling.
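The cleanup pass is essentially img2img at low strength: re-render the textures while keeping the composition intact. Here's a generic diffusers sketch, with a placeholder checkpoint standing in for the animation-restoration model I actually used:

```python
# Sketch: light img2img "cleanup" pass with diffusers.
# Low strength fixes textures/artifacts without changing the layout.
# The checkpoint ID is a placeholder, not the model I used.
import torch
from diffusers import StableDiffusionImg2ImgPipeline
from PIL import Image

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",  # placeholder checkpoint
    torch_dtype=torch.float16,
).to("cuda")

src = Image.open("scene_01_neutral.png").convert("RGB")
out = pipe(
    prompt="clean cel animation, flat colors, crisp lines",
    image=src,
    strength=0.25,        # low strength: clean up, don't recompose
    guidance_scale=6.0,
).images[0]
out.save("scene_01_clean.png")
```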
I did try other services like Gemini, Grok, and a few models through LMArena, but for this ChatGPT seemed to be the best (I did not try Midjourney). And I haven't had much luck with editing/compositing via prompts with things like Nano Banana yet; the best results came from context images plus a prompt.
The most frustrating pictures were the kid looking at the cows (ChatGPT put the cows INSIDE the train several times) and the dad and the kid walking down the aisle. I could not for the life of me get the correct number of seats or people, so I had to make a lot of compromises.
I did try using ChatGPT to write the image prompts to accompany sections of the story, but that wasn't very useful for me. Every prompt added things like "fun and whimsical style" or "warm and inviting atmosphere," which hurt the consistency between images.
Prompts that worked for me looked like this:
Eva (young girl in yellow shirt with curly hair) and her mother (woman with ponytail and purple dress) talk in the kitchen while eating breakfast.
Adding things like "as golden sunlight streamed through the window" just meant more hassle fixing things later on.
For fun I did run a few prompts through the Stable Diffusion models I had running locally, and HOLY SHIT the results were terrible, and disturbingly pedo. I'm all for open source, and I know SD is limited by its architecture/algorithm, but I think the corporate oversight was important for this project too. (For what it's worth, Gemini gave me grief when I tried to generate/edit the images because of "young girl".)
And to finish it off, I used ElevenLabs' voice-changing model for the narration, simply because I hate my own voice.
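Their voice changer is a speech-to-speech endpoint you can also hit directly; something roughly like the sketch below, though double-check the current model IDs against their docs, since this is from memory and the voice ID and key are placeholders:

```python
# Sketch: ElevenLabs voice changer (speech-to-speech) via REST.
# VOICE_ID is whatever target voice you picked in their library;
# the model_id may have changed since I wrote this.
import requests

VOICE_ID = "YOUR_VOICE_ID"  # placeholder

resp = requests.post(
    f"https://api.elevenlabs.io/v1/speech-to-speech/{VOICE_ID}",
    headers={"xi-api-key": "YOUR_API_KEY"},      # placeholder
    data={"model_id": "eleven_multilingual_sts_v2"},
    files={"audio": open("narration_raw.mp3", "rb")},
)
resp.raise_for_status()

with open("narration_converted.mp3", "wb") as f:
    f.write(resp.content)
```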
I thought about doing some subtle animation in each image, for example just the clouds moving outside as she looks out the train window. But I didn't like the quality of the results I was getting when testing that, so I stuck with static images.
The rest was a lot of manual effort: writing, editing, and aligning everything in CapCut.
I know a lot of people are doing fun Sora videos of Mr. Rogers and Bob Ross in a brawl, and the Yeti videos with Veo, but I think this is still the best route for children's book story videos like this at this point in the technology.
Since I'm a programmer, I was thinking about building an app that could help automate/organize some of this: a text area for a description that generates a character image, all of those collected into an assets library, then you feed your story in and it automatically splits it into chunks, generates the image prompts, uploads the context, and generates a few samples… but I don't think I have that much dedication/energy right now. And the technology is changing, so maybe this workflow won't be the best one in 6 months.
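In case anyone wants to run with the idea, the skeleton is roughly this. All names are hypothetical, and the prompt drafting deliberately avoids style adjectives, for the consistency reason above:

```python
# Sketch of the app idea: an asset library of characters/backgrounds,
# plus a pipeline that chunks a story and drafts one flat prompt per
# chunk. Hypothetical names throughout.
from dataclasses import dataclass, field

@dataclass
class Asset:
    name: str          # e.g. "Eva"
    description: str   # "young girl in yellow shirt with curly hair"
    image_path: str    # generated once, reused as context everywhere

@dataclass
class Project:
    assets: list[Asset] = field(default_factory=list)

    def chunk_story(self, story: str, sentences_per_scene: int = 2) -> list[str]:
        """Naive chunking: every N sentences becomes one illustration."""
        sentences = [s.strip() for s in story.split(".") if s.strip()]
        return [
            ". ".join(sentences[i:i + sentences_per_scene]) + "."
            for i in range(0, len(sentences), sentences_per_scene)
        ]

    def draft_prompt(self, chunk: str) -> str:
        """Inline each character's description into the scene text.
        No style adjectives; those broke consistency between images."""
        text = chunk
        for a in self.assets:
            text = text.replace(a.name, f"{a.name} ({a.description})")
        return text
```

The missing piece is feeding each drafted prompt plus the asset images into the image API and collecting a few samples for review, which is exactly the tedious part I was doing by hand.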
Anyway, I think the technology has a lot of promise, especially as a "force multiplier" for high-quality content in minority markets (Esperanto is not popular but has a dedicated following, after all, and there are tons of other minority languages). Unfortunately, there's still a lot of resistance. Everyone is tripping over themselves to make sure the world knows just how against AI they are, never stopping to realize the genie is out of the bottle and it's up to us to figure out how to use it responsibly.
YMMV, but for me this was a fun way to tell a story with a moderate level of quality (not AI slop) beyond what my own artistic skills would afford.
