This past summer I decided to dig in and try to use ChatGPT to create some tools. The main one automatically generates text overlays for English vocabulary/grammar videos. I eventually landed on this workflow:
- film a video with a loose idea explaining a topic
- use aiwhisper to generate an SRT file with timings for each word
- then send that to ChatGPT and ask it to generate a .json file that shows selected text + timestamp + some styles
- use "blender headless" (with a program I made with ChatGPT) to transform the JSON file into timed text overlays
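For anyone curious about the SRT step in the workflow above, here is a minimal sketch of reading the file into plain Python data before anything is sent to ChatGPT. It assumes a standard SRT layout (index line, `start --> end` line, then the text); `Cue`, `to_seconds`, and `parse_srt` are names made up for this illustration, not part of my actual program.

```python
import re
from dataclasses import dataclass


@dataclass
class Cue:
    index: int
    start: float  # seconds from the beginning of the video
    end: float
    text: str


# Matches SRT timestamps such as 00:01:02,345 (a dot is tolerated too)
TIME_RE = re.compile(r"(\d+):(\d{2}):(\d{2})[,.](\d{3})")


def to_seconds(stamp: str) -> float:
    """Convert an SRT timestamp to seconds."""
    match = TIME_RE.fullmatch(stamp.strip())
    if match is None:
        raise ValueError(f"bad timestamp: {stamp!r}")
    hours, minutes, seconds, millis = (int(g) for g in match.groups())
    return hours * 3600 + minutes * 60 + seconds + millis / 1000


def parse_srt(srt_text: str) -> list[Cue]:
    """Split SRT text into cues; blocks that do not look like cues are skipped."""
    cleaned = srt_text.lstrip("\ufeff").replace("\r\n", "\n").strip()
    cues = []
    for block in re.split(r"\n\s*\n", cleaned):
        lines = block.split("\n")
        if len(lines) < 3 or "-->" not in lines[1]:
            continue
        start, end = lines[1].split("-->")
        text = " ".join(lines[2:]).strip()
        cues.append(Cue(int(lines[0]), to_seconds(start), to_seconds(end), text))
    return cues
```

With the cues in a list, it becomes possible to check the video length, split a long transcript into smaller pieces, or compare ChatGPT's output against what was actually said.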
I instructed ChatGPT to first read the entire text of the video and find some key points, like the main vocabulary being taught, the examples I used for the vocab, etc. I also gave it some freedom to be "funny" and occasionally add a joke based on what I was saying, sometimes just as a punchline in the middle of the screen, or as a set-up and punchline with the set-up on the upper left and the punchline on the upper right.
I thought it worked pretty well! The jokes were very "ha ha" funny most of the time, but it was mainly useful for me to be able to turn on the camera, film something, put it through the program, and have a "finished-looking" video in an hour from beginning to end. I was happy with myself and felt like I "made something". The effects I could add to the text overlays weren't very varied, but they were just good enough to create a vibe.
However, as with most AI-reliant stuff, when it didn't cooperate it was incredibly frustrating. Sometimes the SRT file with the subtitles and timestamps was too long for it to read, so it would just make up the timing of the overlays, or it would make up lines that weren't in the video. It seemed like it would work perfectly 5 times, and then the 6th time it was like I was asking it to do something new: it would forget all the formatting rules for the JSON, for example.
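One way to catch those failures before they reach Blender is a sanity check on whatever ChatGPT returns. The sketch below is only an illustration: it assumes the JSON is a list of objects with `text`, `start`, and `end` fields (times in seconds), which is a made-up schema and not necessarily the one my program uses, and `check_overlays` is a hypothetical helper name.

```python
import json
import re

# Assumed overlay schema for this sketch; adjust to the real JSON format
REQUIRED_KEYS = {"text", "start", "end"}


def normalize(s: str) -> str:
    """Lowercase, drop punctuation (except apostrophes), collapse whitespace."""
    return " ".join(re.sub(r"[^\w\s']", " ", s.lower()).split())


def check_overlays(raw_json: str, transcript: str, duration: float) -> list[str]:
    """Return a list of human-readable problems; an empty list means it looks sane."""
    try:
        overlays = json.loads(raw_json)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc}"]
    if not isinstance(overlays, list):
        return ["top level must be a list of overlays"]

    problems = []
    spoken = f" {normalize(transcript)} "  # padded so only whole words match
    for i, item in enumerate(overlays):
        if not isinstance(item, dict):
            problems.append(f"overlay {i}: not an object")
            continue
        missing = REQUIRED_KEYS - item.keys()
        if missing:
            problems.append(f"overlay {i}: missing {sorted(missing)}")
            continue
        start, end = item["start"], item["end"]
        if not all(isinstance(t, (int, float)) for t in (start, end)):
            problems.append(f"overlay {i}: start/end must be numbers")
        elif not (0 <= start < end <= duration):
            problems.append(f"overlay {i}: times {start}-{end} fall outside the video")
        # Flag text that was never spoken; invented jokes will be flagged too
        if f" {normalize(str(item['text']))} " not in spoken:
            problems.append(f"overlay {i}: text not found in transcript")
    return problems
```

If the list comes back non-empty, the problems could be pasted into a retry prompt, or the transcript could be sent in smaller pieces so it is less likely to be too long to read. The transcript check is deliberately strict, so overlays that are meant to be original jokes would need to be exempted.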
Recently, though, AI video and AI image generation have gotten so much faster that I'm trying to think of a new way of doing this. If I could send the SRT file to an AI and have it produce a video file that is the same length as my video, with transparency or a color I can key out, and with the text overlays animated within it, that would be a great solution that almost seems better… almost. I know there are some online tools that do this already, but I'm not sure if they're truly worth it, or if the best bet is to invest in Adobe or another service because the AI tools in those are just much faster and more reliable.
Any ideas from anyone else attempting this?