
The Core Architecture
The pipeline processes a live camera feed. The main idea is to avoid expensive end-to-end training and create a more modular system.
- Frozen VAE (Perception): I'm using the pre-trained Stable Diffusion VAE to encode frames into a latent space. By keeping it frozen, the "perceptual manifold" stays stable, which makes learning the dynamics much easier.
- Three-Stage LSTM System (Dynamics): This is where I tried to do something a bit different. Instead of one big LSTM, I'm using a hierarchy:
- A Pattern LSTM observes short sequences of latents to find basic temporal patterns.
- A Compression LSTM takes these patterns and learns a dense, compressed representation.
- A Central LSTM takes this compressed state and predicts the next latent step (Δz).
*NOTE: This pipeline is capable of a lot more than simple next-frame prediction; for this project I focused solely on the vision aspect.
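To make the hierarchy concrete, here is a minimal PyTorch sketch of how the three stages might be wired. All layer sizes, names, and the flattened-latent shape are illustrative assumptions on my part, not the project's actual values; in the real pipeline the inputs would come from the frozen Stable Diffusion VAE encoder, so random tensors stand in for latents here.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Hypothetical three-stage LSTM stack over VAE latents.

    Input:  a short window of flattened VAE latents, shape (B, T, D).
    Output: a predicted latent delta (Δz) for the next step, shape (B, D).
    Dimensions are illustrative, not the author's actual configuration.
    """
    def __init__(self, latent_dim=256, pattern_dim=128, comp_dim=64):
        super().__init__()
        # Stage 1: Pattern LSTM observes short latent sequences.
        self.pattern = nn.LSTM(latent_dim, pattern_dim, batch_first=True)
        # Stage 2: Compression LSTM learns a denser representation.
        self.compress = nn.LSTM(pattern_dim, comp_dim, batch_first=True)
        # Stage 3: Central LSTM predicts the next latent step from it.
        self.central = nn.LSTM(comp_dim, comp_dim, batch_first=True)
        self.head = nn.Linear(comp_dim, latent_dim)  # project to Δz

    def forward(self, z_seq):
        p, _ = self.pattern(z_seq)    # (B, T, pattern_dim)
        c, _ = self.compress(p)       # (B, T, comp_dim)
        h, _ = self.central(c)        # (B, T, comp_dim)
        return self.head(h[:, -1])    # Δz from the last time step

# Dummy usage: random tensors stand in for frozen-VAE latents
# (the VAE itself would be loaded separately and frozen with
# encoder.requires_grad_(False)).
model = LatentDynamics()
z = torch.randn(2, 8, 256)   # batch of 2, window of 8 latents
dz = model(z)
print(dz.shape)              # torch.Size([2, 256])
```

Keeping the stages as separate modules like this also makes it easy to train or freeze them independently, which fits the modular, non-end-to-end goal described above.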
Performance and Results
The whole system runs at an interactive 4-6 FPS on my consumer hardware, with a simple PyQt GUI showing the live camera feed next to the model's prediction. With better hardware I'm hoping to hit 24 FPS, but I'm balling on a budget right now.
My main focus was on perceptual quality over raw pixel accuracy. The most encouraging result was in multi-step open-loop rollouts, where the model achieved a peak SSIM of 0.84. I was really happy to see this, as it's competitive with some established benchmarks on standardized datasets (like KTH).
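For reference, open-loop rollout SSIM can be computed with a small helper like the one below. This is just a sketch using scikit-image's `structural_similarity`, not the project's actual evaluation script; the frame shapes and value range are assumptions.

```python
import numpy as np
from skimage.metrics import structural_similarity as ssim

def rollout_ssim(frames_true, frames_pred):
    """Mean per-frame SSIM over an open-loop rollout.

    frames_*: (T, H, W) grayscale arrays in [0, 1].
    Illustrative helper, not the project's evaluation code.
    """
    scores = [ssim(t, p, data_range=1.0)
              for t, p in zip(frames_true, frames_pred)]
    return float(np.mean(scores))

# Sanity check: identical rollouts score a perfect SSIM of 1.0.
frames = np.random.rand(4, 32, 32)
score = rollout_ssim(frames, frames)
print(score)  # → 1.0
```

In an open-loop evaluation, `frames_pred` would be decoded from the model's own predicted latents fed back as input, with no ground-truth frames after the conditioning window.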
Link to Project:
I've documented the architecture, included the performance logs, and written a white paper in the GitHub repo if you want to see the technical details:
