
The core question I'm exploring:
Can tracking whether advice actually worked improve retrieval accuracy beyond just semantic matching?
My approach:
Vector databases optimize for semantic similarity but ignore outcome effectiveness. I built a system that adjusts each memory's outcome score by +0.2 when its advice leads to a successful outcome and -0.3 when it fails, then dynamically weights retrieval (40% embedding similarity / 60% outcome score for proven memories vs. 70/30 for new ones).
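Here's a minimal sketch of the scoring logic; the function names, the clamping range, and the `is_proven` flag are illustrative, not necessarily how the repo implements it:

```python
# Sketch of outcome-weighted retrieval scoring (illustrative names and thresholds).

def update_outcome(score: float, success: bool) -> float:
    """Adjust a memory's outcome score after feedback: +0.2 on success, -0.3 on failure."""
    delta = 0.2 if success else -0.3
    return max(-1.0, min(1.0, score + delta))  # clamp to an assumed [-1, 1] range

def combined_score(similarity: float, outcome: float, is_proven: bool) -> float:
    """Blend embedding similarity with outcome score.

    Proven memories (enough feedback history): 40% similarity / 60% outcome.
    New memories: 70% similarity / 30% outcome.
    """
    w_sim, w_out = (0.4, 0.6) if is_proven else (0.7, 0.3)
    return w_sim * similarity + w_out * outcome
```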
Test design:
I created 30 adversarial scenarios where queries semantically match BA advice:
- Control: Plain ChromaDB with L2 distance ranking
- Treatment: Outcome scoring + dynamic weight shifting
- Example: Query asks "how to fix slow performance" → the vector DB's top match is "improve performance and speed" (high semantic similarity, but that advice previously failed) over "add database indexes" (lower keyword overlap, but previously worked)
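Plugging illustrative numbers into the `combined_score` sketch above shows how the ranking flips (the similarity and outcome values here are invented for the example):

```python
# Illustrative numbers only: similarity/outcome values are made up for the example.
failed_memory = combined_score(similarity=0.92, outcome=-0.6, is_proven=True)  # "improve performance and speed"
worked_memory = combined_score(similarity=0.70, outcome=+0.8, is_proven=True)  # "add database indexes"
# failed_memory = 0.4 * 0.92 + 0.6 * (-0.6) = 0.008
# worked_memory = 0.4 * 0.70 + 0.6 * ( 0.8) = 0.760  -> ranked first despite lower similarity
```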
Results:
| Metric | Vector DB (Control) | Outcome-based (Treatment) |
|---|---|---|
| Accuracy | 3.3% (1/30) | 100% (30/30) |
| p-value | – | 0.001 (paired t-test) |
| Cohen's d | – | 7.49 |
Category breakdown: the vector DB control failed entirely on debugging (0%), database (0%), errors (0%), async (0%), and git (0%), and only partially succeeded on API (20%). The treatment succeeded across all categories.
Also implemented enhanced retrieval:
- Contextual retrieval (Anthropic's technique)
- Hybrid search (BM25 + vector with RRF fusion; see the sketch after this list)
- Cross-encoder reranking (BERT-based)
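The RRF part of the hybrid search is simple rank-based fusion; a minimal sketch, assuming the conventional k = 60 constant rather than whatever the repo actually uses:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal Rank Fusion: score(doc) = sum over rankings of 1 / (k + rank).

    `rankings` are ranked doc-id lists, e.g. [bm25_ids, vector_ids].
    """
    scores: dict[str, float] = defaultdict(float)
    for ranked_ids in rankings:
        for rank, doc_id in enumerate(ranked_ids, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Usage: rrf_fuse([bm25_top_ids, vector_top_ids]) before cross-encoder reranking.
```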
What I'm uncertain about:
- Statistical methodology: I used a paired t-test for the comparison. Is this the right test for paired binary outcomes, or should I be using McNemar's test instead? (A possible setup is sketched after this list.)
- Penalty magnitude: Currently using -0.3 for failures vs +0.2 for success. Is there research on optimal penalty ratios for outcome-based learning?
- Cold start problem: What's the best way to bootstrap before you have sufficient outcome data?
- Generalization: These are synthetic adversarial scenarios. How well would this translate to real-world usage?
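On the first point: if McNemar's test does turn out to be the right fit for paired binary outcomes, I believe it would look roughly like this on my data, where the 2x2 counts are just what the aggregate 1/30 vs 30/30 results imply (the single control success was also a treatment success):

```python
from statsmodels.stats.contingency_tables import mcnemar

# Paired per-scenario outcomes:
#                     treatment correct   treatment wrong
# control correct             1                  0
# control wrong              29                  0
table = [[1, 0],
         [29, 0]]

# exact=True runs the binomial test on the discordant pairs (0 vs 29),
# which is the usual recommendation when discordant counts are small or asymmetric.
result = mcnemar(table, exact=True)
print(result.statistic, result.pvalue)
```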
Code & reproducibility:
Open source (MIT): https://github.com/roampal-ai/roampal
Full test suite: benchmarks/comprehensive_test/
I'm genuinely trying to learn here – if you see flaws in my methodology or have suggestions for better approaches, I'd really appreciate the feedback. Thanks for taking the time to look at this.
