[R] Inference-time attractor layer for transformers: preliminary observations

We tested a small “attractor” layer that updates during inference (no training, no backprop). It preserved perplexity on small models and showed a modest +3.3% gain on a constrained comprehension task, but collapsed badly (~80% accuracy drop) on longer generation. Sharing results and looking for critique.

Motivation

Attention and KV caches handle short-range dependencies well, but they don’t maintain a persistent state that adapts across multiple forward passes. The goal here was to explore whether a lightweight, inference-only update could provide a form of dynamic memory without modifying weights.

Method (High-Level)

The layer keeps a small set of vectors (“attractors”) that:

  • Measure similarity to current attention output
  • Strengthen when frequently activated
  • Decay when unused
  • Feed a small signal back into the next forward pass

This is not recurrence; it is a single-step update applied during inference (a minimal sketch follows below).
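To make the mechanism concrete, here is a minimal NumPy sketch of the four steps above. It is illustrative only: the names and constants (n_attractors, decay, lr, feedback_scale) are assumptions, not the configuration used in these experiments.

```python
import numpy as np

class AttractorLayer:
    """Inference-time attractor state: no gradients, no weight updates."""

    def __init__(self, n_attractors=16, d_model=512, decay=0.95,
                 lr=0.05, feedback_scale=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.attractors = rng.normal(scale=0.02, size=(n_attractors, d_model))
        self.strength = np.zeros(n_attractors)  # how active each attractor currently is
        self.decay = decay                      # per-step decay applied to strengths
        self.lr = lr                            # how fast attractors drift toward activity
        self.feedback_scale = feedback_scale    # magnitude of the signal fed back

    def step(self, attn_out):
        """One update per forward pass. attn_out: (d_model,) attention output."""
        # 1. Measure similarity to the current attention output.
        norms = np.linalg.norm(self.attractors, axis=1) * np.linalg.norm(attn_out) + 1e-8
        sim = self.attractors @ attn_out / norms  # cosine similarity per attractor

        # 2. Strengthen frequently activated attractors; decay the rest.
        self.strength = self.decay * self.strength + np.maximum(sim, 0.0)

        # 3. Pull active attractors slightly toward the current activity pattern.
        self.attractors += self.lr * np.maximum(sim, 0.0)[:, None] * (attn_out - self.attractors)

        # 4. Return a small strength-weighted signal for the next forward pass.
        weights = self.strength / (self.strength.sum() + 1e-8)
        return self.feedback_scale * (weights @ self.attractors)
```

Where the returned signal is injected (residual stream, attention input, etc.) is left open here; the post only states that it feeds into the next forward pass.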

Early Observations

On small transformer models:

  • Some attractors formed stable patterns around recurring concepts
  • A short burn-in phase reduced instability
  • Unused attractors collapsed to noise
  • In some cases, the layer degraded generation quality instead of helping

No performance claims at this stage—just behavioral signals worth studying.

Key Results

Perplexity:

  • Preserved baseline perplexity on smaller models (≈0% change)
  • ~6.5% compute overhead

Failure Case:

  • On longer (~500 token) generation, accuracy dropped by ~80% due to attractors competing with context, leading to repetition and drift

Revised Configuration:

  • Adding gating + a burn-in threshold produced a small gain (+3.3%) on a shorter comprehension task

These results are preliminary and fragile.
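For concreteness, here is a rough sketch of the gating + burn-in-threshold idea from the revised configuration, building on the AttractorLayer sketch above. The values of burn_in_steps and gate_threshold are illustrative, not tuned settings.

```python
import numpy as np

# Assumes the AttractorLayer sketch from the Method section above.
def gated_feedback(layer, attn_out, step, burn_in_steps=32, gate_threshold=0.5):
    signal = layer.step(attn_out)

    # Burn-in: suppress feedback until the attractors have had time to stabilize.
    if step < burn_in_steps:
        return np.zeros_like(signal)

    # Gate: only inject feedback when at least one attractor is clearly active,
    # so weak or noisy attractors don't compete with the live context.
    if layer.strength.max() < gate_threshold:
        return np.zeros_like(signal)

    return signal
```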

What Failed

  • Too many attractors caused instability
  • Long sequences “snapped back” to earlier topics
  • Heavy decay made the system effectively stateless
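For intuition on the last point: with a per-step strength decay of, say, 0.5, a single activation contributes under 1% of its original strength after seven tokens (0.5^7 ≈ 0.008), so the layer forgets almost everything within a short window.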

What This Does Not Show

  • General performance improvement
  • Robustness on long contexts
  • Applicability beyond the tested model family
  • Evidence of scaling to larger models

Small N, synthetic tasks, single architecture.

Related Work (Brief)

This seems adjacent to several prior ideas on dynamic memory:

  • Fast Weights (Ba et al.) – introduces fast-changing weight matrices updated during sequence processing. This approach differs in that updates happen only during inference and don’t modify model weights.
  • Differentiable Plasticity (Miconi et al.) – learns plasticity rules via gradient descent. In contrast, this layer uses a fixed, hand-designed update rule rather than learned plasticity.
  • KV-Cache Extensions / Recurrence – reuse past activations but do not maintain a persistent attractor-like state across forward passes.

This experiment is focused specifically on single-step, inference-time updates without training, so the comparison is more conceptual than architectural.

Questions for the Community

  1. Is there prior work on inference-time state updates that don’t require training?
  2. Are there known theoretical limits to attractor-style mechanisms competing with context?
  3. Under what conditions would this approach be strictly worse than recurrence or KV-cache extensions?
  4. What minimal benchmark suite would validate this isn't just overfitting to perplexity?

Code & Data

Looking for replication attempts, theoretical critique, and pointers to related work.
