[P] Outcome-based learning vs vector search: 100% vs 3.3% accuracy on adversarial queries (p=0.001) – looking for feedback on approach


I've been experimenting with outcome-based learning for AI agent memory and got some interesting results, but I'm fairly new to this space and would really appreciate feedback from people with more experience.

The core question I'm exploring:

Can tracking whether advice actually worked improve retrieval accuracy beyond just semantic matching?

My approach:

Vector databases optimize for semantic similarity but ignore whether retrieved advice actually worked. I built a system that adjusts each memory's outcome score by +0.2 after a successful outcome and -0.3 after a failure, then dynamically weights retrieval (40% embedding similarity / 60% outcome score for proven memories, vs 70/30 for new ones).
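Roughly, the scoring works like the sketch below (simplified, with illustrative names and thresholds rather than the repo's actual API):

```python
# Simplified sketch of outcome-weighted retrieval scoring.
# Names and the "proven" threshold are illustrative, not the repo's actual code.

def update_outcome_score(score: float, success: bool) -> float:
    """Asymmetric update: +0.2 on a successful outcome, -0.3 on a failure."""
    return score + (0.2 if success else -0.3)

def retrieval_score(embedding_sim: float, outcome_score: float, n_outcomes: int) -> float:
    """Blend semantic similarity with the memory's running outcome score.

    Proven memories (enough recorded outcomes) weight outcomes more heavily
    (40/60); new memories lean on embedding similarity (70/30).
    """
    if n_outcomes >= 3:  # illustrative cutoff for "proven"
        w_sim, w_out = 0.4, 0.6
    else:
        w_sim, w_out = 0.7, 0.3
    return w_sim * embedding_sim + w_out * outcome_score
```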

Test design:

I created 30 adversarial scenarios where the query semantically matches bad advice:

  • Control: Plain ChromaDB with L2 distance ranking
  • Treatment: Outcome scoring + dynamic weight shifting
  • Example: Query asks "how to fix slow performance" → the vector DB matches "improve performance and speed" (high semantic similarity but previously failed) over "add database indexes" (lower keyword overlap but previously worked)
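Each scenario boils down to a query, two stored memories with prior outcomes, and an expected winner. A minimal illustration (field names and score values are made up for clarity, not the exact test-suite schema):

```python
# One adversarial scenario, roughly as the test frames it (illustrative fields/values).
scenario = {
    "query": "how to fix slow performance",
    "memories": [
        # Semantically close to the query, but previously failed.
        {"text": "improve performance and speed", "outcome_score": -0.6},
        # Lower keyword overlap, but previously worked.
        {"text": "add database indexes", "outcome_score": 0.4},
    ],
    "expected_top_result": "add database indexes",
}
```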

Results:

  • Accuracy – Vector DB (control): 3.3% (1/30); outcome-based (treatment): 100% (30/30)
  • Paired t-test: p = 0.001
  • Cohen's d: 7.49

Category breakdown: the vector DB baseline failed completely on debugging (0%), database (0%), errors (0%), async (0%), and git (0%), and only partially succeeded on API queries (20%). The treatment succeeded across all categories.

I also implemented enhanced retrieval:

  • Contextual retrieval (Anthropic's technique)
  • Hybrid search (BM25 + vector with RRF fusion – see the RRF sketch after this list)
  • Cross-encoder reranking (BERT-based)
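For the hybrid step, the fusion is reciprocal rank fusion (RRF); here is a minimal sketch, assuming the standard formulation with k = 60 rather than the repo's exact code:

```python
# Minimal reciprocal rank fusion over a BM25 ranking and a vector-search ranking.
from collections import defaultdict

def rrf_fuse(bm25_ranked: list[str], vector_ranked: list[str], k: int = 60) -> list[str]:
    """Fuse two ranked lists of document ids by summing 1 / (k + rank) per list."""
    scores: defaultdict[str, float] = defaultdict(float)
    for ranking in (bm25_ranked, vector_ranked):
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.__getitem__, reverse=True)
```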

What I'm uncertain about:

  1. Statistical methodology: I used a paired t-test for the comparison. Is that the right test for paired binary outcomes, or should I be using McNemar's test instead? (A sketch of how I'd set up McNemar's test follows this list.)
  2. Penalty magnitude: Currently using -0.3 for failures vs +0.2 for success. Is there research on optimal penalty ratios for outcome-based learning?
  3. Cold start problem: What's the best way to bootstrap before you have sufficient outcome data?
  4. Generalization: These are synthetic adversarial scenarios. How well would this translate to real-world usage?
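On question 1, this is how I'd set up McNemar's test on the paired per-scenario results, assuming statsmodels is available (the 2x2 table comes straight from the accuracy numbers above):

```python
# McNemar's test on paired binary outcomes (assumes statsmodels is installed).
from statsmodels.stats.contingency_tables import mcnemar

# Rows: control (correct, incorrect); columns: treatment (correct, incorrect).
# From the results: 1 scenario both systems got right, 29 that only the
# treatment got right, 0 that only the control got right, 0 that both got wrong.
table = [[1, 0],
         [29, 0]]
result = mcnemar(table, exact=True)  # exact binomial test on the 29 discordant pairs
print(result.statistic, result.pvalue)
```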

Code & reproducibility:

Open source (MIT): https://github.com/roampal-ai/roampal

Full test suite: benchmarks/comprehensive_test/

I'm genuinely trying to learn here – if you see flaws in my methodology or have suggestions for better approaches, I'd really appreciate the feedback. Thanks for taking the time to look at this.
