
The Core Architecture
The pipeline processes a live camera feed. The main idea is to avoid expensive end-to-end training and create a more modular system.
- Frozen VAE (Perception): I'm using the pre-trained Stable Diffusion VAE to encode frames into a latent space. By keeping it frozen, the "perceptual manifold" stays stable, which makes learning the dynamics much easier.
- Three-Stage LSTM System (Dynamics): This is where I tried to do something a bit different. Instead of one big LSTM, I'm using a hierarchy:
- A Pattern LSTM observes short sequences of latents to find basic temporal patterns.
- A Compression LSTM takes these patterns and learns a dense, compressed representation.
- A Central LSTM takes this compressed state and predicts the next latent step (Δz).
*NOTE: This pipeline is capable of a lot more than simple next-frame prediction; for this project I focused solely on the vision aspect.
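To make the hierarchy concrete, here is a minimal PyTorch sketch of how the three stages might be wired. All layer sizes, names, and the flattened-latent shape are illustrative assumptions on my part, not the project's actual values; in the real pipeline the inputs would come from the frozen Stable Diffusion VAE encoder, so random tensors stand in for latents here.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Hypothetical three-stage LSTM stack over VAE latents.

    Input:  a short window of flattened VAE latents, shape (B, T, D).
    Output: a predicted latent delta (Δz) for the next step, shape (B, D).
    Dimensions are illustrative, not the author's actual configuration.
    """
    def __init__(self, latent_dim=256, pattern_dim=128, comp_dim=64):
        super().__init__()
        # Stage 1: Pattern LSTM observes short latent sequences.
        self.pattern = nn.LSTM(latent_dim, pattern_dim, batch_first=True)
        # Stage 2: Compression LSTM learns a denser representation.
        self.compress = nn.LSTM(pattern_dim, comp_dim, batch_first=True)
        # Stage 3: Central LSTM predicts the next latent step from it.
        self.central = nn.LSTM(comp_dim, comp_dim, batch_first=True)
        self.head = nn.Linear(comp_dim, latent_dim)  # project to Δz

    def forward(self, z_seq):
        p, _ = self.pattern(z_seq)    # (B, T, pattern_dim)
        c, _ = self.compress(p)       # (B, T, comp_dim)
        h, _ = self.central(c)        # (B, T, comp_dim)
        return self.head(h[:, -1])    # Δz from the last time step

# Dummy usage: random tensors stand in for frozen-VAE latents
# (the VAE itself would be loaded separately and frozen with
# encoder.requires_grad_(False)).
model = LatentDynamics()
z = torch.randn(2, 8, 256)   # batch of 2, window of 8 latents
dz = model(z)
print(dz.shape)              # torch.Size([2, 256])
```

Keeping the stages as separate modules like this also makes it easy to train or freeze them independently, which fits the modular, non-end-to-end goal described above.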
Performance and Results
The whole system runs at an interactive 4-6 FPS on my consumer hardware, with a simple PyQt GUI showing the live camera feed next to the model's prediction. With better hardware I'm hoping to hit 24 FPS, but I'm balling on a budget right now.
My main focus was on perceptual quality over raw pixel accuracy. The most encouraging result was in multi-step open-loop rollouts, where the model achieved a peak SSIM of 0.84. I was really happy to see this, as it's competitive with some established benchmarks on standardized datasets (like KTH).
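For reference, open-loop rollout SSIM can be computed with a small helper like the one below. This is just a sketch using scikit-image's `structural_similarity`, not the project's actual evaluation script; the frame shapes and value range are assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def rollout_ssim(frames_true, frames_pred):
    """Mean per-frame SSIM over an open-loop rollout.

    frames_*: (T, H, W) grayscale arrays in [0, 1].
    Illustrative helper, not the project's evaluation code.
    """
    scores = [ssim(t, p, data_range=1.0)
              for t, p in zip(frames_true, frames_pred)]
    return float(np.mean(scores))

# Sanity check: identical rollouts score a perfect SSIM of 1.0.
frames = np.random.rand(4, 32, 32)
score = rollout_ssim(frames, frames)
print(score)  # → 1.0
```

In an open-loop evaluation, `frames_pred` would be decoded from the model's own predicted latents fed back as input, with no ground-truth frames after the conditioning window.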
Link to Project:
I've documented the architecture, included the performance logs, and written a white paper in the GitHub repo if you want to see the technical details:
