I’ve been running a set of continual learning experiments across 12 multimodal tasks (vision, speech, and text), and I managed to build an architecture that essentially eliminates catastrophic forgetting, even without replay.
The key turned out to be a combination of the following (rough sketch after the list):
- Dynamic expert expansion (grow only when new distributions appear)
- Task embeddings for conditioning shared components
- A lightweight retrieval memory
- Small task-specific heads for stable readout
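To make that concrete, here is a minimal PyTorch sketch of how those four pieces could fit together. This is an illustration only, not the code in the repo: names like `ModularCLModel`, `Expert`, `remember`, and the naive averaging router are assumptions, and a real version would need proper routing, memory pruning, and training loops.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class Expert(nn.Module):
    """One expandable expert: a small MLP over the shared hidden space."""
    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x):
        return self.net(x)


class ModularCLModel(nn.Module):
    def __init__(self, input_dim: int, hidden_dim: int, max_tasks: int = 12):
        super().__init__()
        self.encoder = nn.Linear(input_dim, hidden_dim)        # shared trunk
        self.task_embed = nn.Embedding(max_tasks, hidden_dim)  # conditions shared components
        self.experts = nn.ModuleList([Expert(hidden_dim)])     # grown on new distributions
        self.heads = nn.ModuleDict()                           # small task-specific readouts
        self.memory = []                                       # lightweight retrieval memory: (key, value) pairs

    def add_expert(self):
        self.experts.append(Expert(self.encoder.out_features))

    def add_head(self, task_id: int, num_classes: int):
        self.heads[str(task_id)] = nn.Linear(self.encoder.out_features, num_classes)

    def remember(self, key: torch.Tensor, value: torch.Tensor):
        # Store detached 1-D hidden states; a real system would cap and prune this buffer.
        self.memory.append((key.detach(), value.detach()))

    def retrieve(self, query: torch.Tensor, k: int = 4) -> torch.Tensor:
        # Cosine nearest-neighbour lookup over stored keys; returns the mean of the top-k values.
        if not self.memory:
            return torch.zeros_like(query)
        keys = torch.stack([m[0] for m in self.memory])        # (N, hidden)
        vals = torch.stack([m[1] for m in self.memory])        # (N, hidden)
        sims = F.cosine_similarity(query.unsqueeze(1), keys.unsqueeze(0), dim=-1)  # (B, N)
        topk = sims.topk(min(k, keys.size(0)), dim=-1).indices
        return vals[topk].mean(dim=1)

    def forward(self, x: torch.Tensor, task_id: int) -> torch.Tensor:
        h = self.encoder(x) + self.task_embed(torch.tensor(task_id, device=x.device))
        h = h + self.retrieve(h)                                # retrieved memory as an additive hint
        # Naive routing: average all expert outputs; a learned router would replace this.
        h = torch.stack([e(h) for e in self.experts]).mean(dim=0)
        return self.heads[str(task_id)](h)                      # assumes add_head(task_id, ...) was called
```

The intended flow would be `add_head(t, n)` when a new task arrives, `add_expert()` when an expansion trigger fires, and `remember(h, h)` for a small subset of hidden states per task.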
With this setup, retention remained almost perfectly stable across the full task sequence. Earlier tasks showed no accuracy collapse even after many training stages, and performance stayed consistent as new tasks came in.
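If you want to sanity-check retention claims like this, the usual bookkeeping is an accuracy matrix over (training stage, task) plus an average-forgetting score. Here's a generic sketch of that metric (standard formulation, not code lifted from the repo; `acc` and `average_forgetting` are just illustrative names):

```python
import numpy as np


def average_forgetting(acc: np.ndarray) -> float:
    """acc[i, j] = accuracy on task j measured after training through stage i, shape (T, T).

    Returns the mean drop from each earlier task's best accuracy to its final accuracy;
    a value near 0.0 means no observable forgetting across the sequence.
    """
    T = acc.shape[0]
    final = acc[-1, : T - 1]                   # final accuracy on every task except the last
    best = acc[: T - 1, : T - 1].max(axis=0)   # best accuracy each earlier task ever reached
    return float((best - final).mean())
```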
Some highlights from the results:
- Zero observable catastrophic forgetting across all 12 tasks
- Experts expanded only when necessary, matching new distribution shifts (see the trigger sketch after this list)
- The shared latent space stayed coherent across modalities
- Intrinsic signals (e.g., prediction error) boosted stability during training but weren’t needed at inference
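To give a flavour of what "expand only when necessary" can mean in practice, here is a hypothetical prediction-error trigger. It's a simplified stand-in rather than the exact criterion I used; `should_expand`, `baseline_loss`, and `margin` are illustrative names and parameters, and it assumes a model with the `forward(x, task_id)` signature sketched above.

```python
import torch
import torch.nn.functional as F


@torch.no_grad()
def should_expand(model, batch_x, batch_y, task_id, baseline_loss: float, margin: float = 0.5) -> bool:
    """Return True if the incoming batch looks out-of-distribution for the current experts.

    baseline_loss: running average loss on data the existing experts already fit well.
    margin: slack before expansion fires (a tunable hyperparameter; the default here is arbitrary).
    Assumes the head for task_id has already been registered.
    """
    logits = model(batch_x, task_id)
    loss = F.cross_entropy(logits, batch_y).item()
    return loss > baseline_loss * (1.0 + margin)
```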
For anyone interested in digging into the evaluation pipeline, I’ve packaged the experiment logs, model checkpoints, and a safe inference script here:
🔗 GitHub (Reproducibility / Results)
https://github.com/nkundinezayv/CORA-ContinualLearning
(It's not the full training implementation, but it’s enough to verify the results and understand the evaluation flow.)
I’m sharing this mainly to compare observations with others working on continual or modular learning.
Has anyone explored dynamic expansion or large-scale modular CL setups?
I’d love to hear about bottlenecks, failure modes, or architecture designs that worked well for you.