[R] Inference-time attractor layer for transformers: preliminary observations

We tested a small “attractor” layer that updates during inference (no training, no backprop). It preserved perplexity on small models and showed a modest +3.3% gain on a constrained comprehension task, but collapsed badly (~80% accuracy drop) on longer generation. Sharing results and looking for critique.

Motivation

Attention and KV caches handle short-range dependencies well, but they don’t maintain a persistent state that adapts across multiple forward passes. The goal here was to explore whether a lightweight, inference-only update could provide a form of dynamic memory without modifying weights.

Method (High-Level)

The layer keeps a small set of vectors (“attractors”) that:

  • Measure similarity to current attention output
  • Strengthen when frequently activated
  • Decay when unused
  • Feed a small signal back into the next forward pass

This is not recurrence; it is a single-step update applied during inference (a minimal sketch follows below).
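To make the mechanism concrete, here is a minimal NumPy sketch of the four steps above. It is illustrative only: the names and constants (n_attractors, decay, lr, feedback_scale) are assumptions, not the configuration used in these experiments.

```python
import numpy as np

class AttractorLayer:
    """Inference-time attractor state: no gradients, no weight updates."""

    def __init__(self, n_attractors=16, d_model=512, decay=0.95,
                 lr=0.05, feedback_scale=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.attractors = rng.normal(scale=0.02, size=(n_attractors, d_model))
        self.strength = np.zeros(n_attractors)  # how active each attractor currently is
        self.decay = decay                      # per-step decay applied to strengths
        self.lr = lr                            # how fast attractors drift toward activity
        self.feedback_scale = feedback_scale    # magnitude of the signal fed back

    def step(self, attn_out):
        """One update per forward pass. attn_out: (d_model,) attention output."""
        # 1. Measure similarity to the current attention output.
        norms = np.linalg.norm(self.attractors, axis=1) * np.linalg.norm(attn_out) + 1e-8
        sim = self.attractors @ attn_out / norms  # cosine similarity per attractor

        # 2. Strengthen frequently activated attractors; decay the rest.
        self.strength = self.decay * self.strength + np.maximum(sim, 0.0)

        # 3. Pull active attractors slightly toward the current activity pattern.
        self.attractors += self.lr * np.maximum(sim, 0.0)[:, None] * (attn_out - self.attractors)

        # 4. Return a small strength-weighted signal for the next forward pass.
        weights = self.strength / (self.strength.sum() + 1e-8)
        return self.feedback_scale * (weights @ self.attractors)
```

Where the returned signal is injected (residual stream, attention input, etc.) is left open here; the post only states that it feeds into the next forward pass.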

Early Observations

On small transformer models:

  • Some attractors formed stable patterns around recurring concepts
  • A short burn-in phase reduced instability
  • Unused attractors collapsed to noise
  • In some cases, the layer degraded generation quality instead of helping

No performance claims at this stage—just behavioral signals worth studying.

Key Results

Perplexity:

  • Preserved baseline perplexity on smaller models (≈0% change)
  • ~6.5% compute overhead

Failure Case:

  • On longer (~500 token) generation, accuracy dropped by ~80% due to attractors competing with context, leading to repetition and drift

Revised Configuration:

  • Adding gating + a burn-in threshold produced a small gain (+3.3%) on a shorter comprehension task

These results are preliminary and fragile.
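For concreteness, here is a rough sketch of the gating + burn-in-threshold idea from the revised configuration, building on the AttractorLayer sketch above. The values of burn_in_steps and gate_threshold are illustrative, not tuned settings.

```python
import numpy as np

# Assumes the AttractorLayer sketch from the Method section above.
def gated_feedback(layer, attn_out, step, burn_in_steps=32, gate_threshold=0.5):
    signal = layer.step(attn_out)

    # Burn-in: suppress feedback until the attractors have had time to stabilize.
    if step < burn_in_steps:
        return np.zeros_like(signal)

    # Gate: only inject feedback when at least one attractor is clearly active,
    # so weak or noisy attractors don't compete with the live context.
    if layer.strength.max() < gate_threshold:
        return np.zeros_like(signal)

    return signal
```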

What Failed

  • Too many attractors caused instability
  • Long sequences “snapped back” to earlier topics
  • Heavy decay made the system effectively stateless
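For intuition on the last point: with a per-step strength decay of, say, 0.5, a single activation contributes under 1% of its original strength after seven tokens (0.5^7 ≈ 0.008), so the layer forgets almost everything within a short window.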

What This Does Not Show

  • General performance improvement
  • Robustness on long contexts
  • Applicability beyond the tested model family
  • Evidence of scaling to larger models

Small N, synthetic tasks, single architecture.

Related Work (Brief)

This seems adjacent to several prior ideas on dynamic memory:

  • Fast Weights (Ba et al.) – introduces fast-changing weight matrices updated during sequence processing. This approach differs in that updates happen only during inference and don’t modify model weights.
  • Differentiable Plasticity (Miconi et al.) – learns plasticity rules via gradient descent. In contrast, this layer uses a fixed, hand-designed update rule rather than learned plasticity.
  • KV-Cache Extensions / Recurrence – reuse past activations but do not maintain a persistent attractor-like state across forward passes.

This experiment is focused specifically on single-step, inference-time updates without training, so the comparison is more conceptual than architectural.

Questions for the Community

  1. Is there prior work on inference-time state updates that don’t require training?
  2. Are there known theoretical limits to attractor-style mechanisms competing with context?
  3. Under what conditions would this approach be strictly worse than recurrence or KV-cache extensions?
  4. What minimal benchmark suite would validate this isn't just overfitting to perplexity?

Code & Data

Looking for replication attempts, theoretical critique, and pointers to related work.
