Motivation
Attention and KV caches handle short-range dependencies well, but they don’t maintain a persistent state that adapts across multiple forward passes. The goal here was to explore whether a lightweight, inference-only update could provide a form of dynamic memory without modifying weights.
Method (High-Level)
The layer keeps a small set of vectors (“attractors”) that:
- Measure similarity to current attention output
- Strengthen when frequently activated
- Decay when unused
- Feed a small signal back into the next forward pass
This is not recurrence, just a single-step update applied during inference.
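To make the mechanism concrete, here is a minimal sketch of the kind of update described above. It assumes a pooled attention output per forward pass; the class name, shapes, and hyperparameters (`decay`, `lr`) are illustrative assumptions, not the actual implementation.

```python
import torch

class AttractorState:
    """Persistent attractor bank updated at inference time (no weight changes)."""

    def __init__(self, n_attractors: int, d_model: int,
                 decay: float = 0.95, lr: float = 0.1):
        self.attractors = 0.01 * torch.randn(n_attractors, d_model)  # small random init
        self.strength = torch.zeros(n_attractors)                    # per-attractor activation strength
        self.decay = decay                                           # forgetting rate for unused attractors
        self.lr = lr                                                 # size of the single-step pull

    @torch.no_grad()
    def step(self, attn_out: torch.Tensor) -> torch.Tensor:
        """attn_out: (d_model,) pooled attention output from the current forward pass."""
        # 1. Measure similarity to the current attention output.
        sims = torch.nn.functional.cosine_similarity(
            self.attractors, attn_out.unsqueeze(0), dim=-1)          # (n_attractors,)
        # 2. Strengthen attractors that fire; unused ones decay toward zero strength.
        self.strength = self.decay * self.strength + torch.relu(sims)
        # 3. Single-step update: pull the best-matching attractor toward the current output.
        best = int(sims.argmax())
        self.attractors[best] += self.lr * (attn_out - self.attractors[best])
        # 4. Feedback signal for the next forward pass: strength-weighted mix of attractors.
        weights = torch.softmax(self.strength, dim=0).unsqueeze(-1)  # (n_attractors, 1)
        return (weights * self.attractors).sum(dim=0)                # (d_model,)
```

The returned feedback vector would then be added, with a small scale, to the hidden state entering the next forward pass.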
Early Observations
On small transformer models:
- Some attractors formed stable patterns around recurring concepts
- A short burn-in phase reduced instability
- Unused attractors collapsed to noise
- In some cases, the layer degraded generation quality instead of helping
No performance claims at this stage—just behavioral signals worth studying.
Key Results
Perplexity:
- Preserved baseline perplexity on smaller models (≈0% change)
- ~6.5% compute overhead
Failure Case:
- On longer (~500-token) generations, accuracy dropped by ~80% due to attractors competing with the context, leading to repetition and drift
Revised Configuration:
- Adding gating + a burn-in threshold produced a small gain (+3.3%) on a shorter comprehension task (sketched below)
These results are preliminary and fragile.
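For concreteness, here is a rough sketch of how gating and a burn-in threshold might wrap the update above (it builds on the `AttractorState` sketch from earlier). The step-count burn-in, `gate_scale`, and the sigmoid gate are assumptions about a plausible design, not the exact configuration used.

```python
import torch

@torch.no_grad()
def gated_feedback(state: "AttractorState", attn_out: torch.Tensor,
                   step_count: int, burn_in: int = 32,
                   gate_scale: float = 0.5) -> torch.Tensor:
    """Suppress attractor feedback during burn-in, then gate it by agreement with context."""
    feedback = state.step(attn_out)
    if step_count < burn_in:
        # Burn-in: keep updating the attractors but return no feedback yet,
        # so early, noisy attractors cannot destabilize generation.
        return torch.zeros_like(feedback)
    # Gate the signal by how well it agrees with the current attention output,
    # damping attractors when they start competing with the context.
    agreement = torch.nn.functional.cosine_similarity(feedback, attn_out, dim=0)
    gate = torch.sigmoid(gate_scale * agreement)
    return gate * feedback
```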
What Failed
- Too many attractors caused instability
- Long sequences “snapped back” to earlier topics
- Heavy decay made the system effectively stateless
What This Does Not Show
- General performance improvement
- Robustness on long contexts
- Applicability beyond the tested model family
- Evidence of scaling to larger models
Small N, synthetic tasks, single architecture.
Related Work (Brief)
This seems adjacent to several prior ideas on dynamic memory:
- Fast Weights (Ba et al.) – introduces fast-changing weight matrices updated during sequence processing. This approach differs in that updates happen only during inference and don’t modify model weights.
- Differentiable Plasticity (Miconi et al.) – learns plasticity rules via gradient descent. In contrast, this layer uses a fixed, hand-designed update rule rather than learned plasticity.
- KV-Cache Extensions / Recurrence – reuse past activations but don’t maintain a persistent attractor-like state across forward passes.
This experiment is focused specifically on single-step, inference-time updates without training, so the comparison is more conceptual than architectural.
Questions for the Community
- Is there prior work on inference-time state updates that don’t require training?
- Are there known theoretical limits to attractor-style mechanisms competing with context?
- Under what conditions would this approach be strictly worse than recurrence or KV-cache extensions?
- What minimal benchmark suite would validate this isn't just overfitting to perplexity?
Code & Data
Looking for replication attempts, theoretical critique, and pointers to related work.