I am an independent researcher working on Autonomous Vehicle perception. I’m releasing Semantic-Drive, a framework designed to tackle the "dark data" crisis in AVs: finding rare edge cases (e.g., a wheelchair on the road, passive construction zones) without relying on expensive manual labeling or cloud APIs.
Paper: https://arxiv.org/abs/2512.12012
Code: https://github.com/AntonioAlgaida/Semantic-rive
Interactive demo: https://huggingface.co/spaces/agnprz/Semantic-rive-Explorer
The Core Problem: CLIP is Spatially Blind
The industry standard for semantic search is embedding-based retrieval (e.g., CLIP). However, in my benchmarks on nuScenes, I found that CLIP suffers from severe "Bag-of-Words" blindness.
- The Failure: CLIP assigns high similarity to "Pedestrian Hazard" even when the pedestrian is safely on the sidewalk. It sees the objects, but not the risk.
- The Result: Terrible Recall (0.475) for actual safety-critical events.
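To make the failure concrete, a quick probe along these lines reproduces it. This is a minimal sketch assuming the standard Hugging Face CLIP checkpoint; the prompts and image path are illustrative placeholders, not the benchmark code from the paper:

```python
# Minimal CLIP similarity probe (illustrative; not the paper's benchmark code).
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("frame.jpg")  # placeholder: a frame with a pedestrian on the sidewalk
prompts = [
    "a pedestrian crossing directly in front of the ego vehicle",  # genuine hazard
    "a pedestrian walking safely on the sidewalk",                 # benign scene
]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    logits = model(**inputs).logits_per_image  # cosine similarity scaled by CLIP's temperature

# Because both prompts mention "pedestrian", the two scores tend to come out close,
# regardless of where the pedestrian actually is (the "Bag-of-Words" blindness).
print(dict(zip(prompts, logits.softmax(dim=-1)[0].tolist())))
```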
The Solution: "System 2" Inference-Time Search
Instead of training a larger model, I used Inference-Time Compute (similar to the "System 2" architecture recently discussed by Waymo).
- Symbolic Grounding (YOLOE): Extracts a high-recall text inventory of the objects in each scene.
- Cognitive Analysis (Qwen3-VL-30B, Gemma-3-27B, and Kimi-VL): Performs Chain-of-Thought reasoning. I enforce a "Skepticism Policy": the VLM must explicitly verify the YOLO detections against pixel evidence before accepting them.
- Consensus Judge: A local Mistral/Ministral-3-14B aggregates multiple scouts using a Best-of-N search, scored by a deterministic Explicit Outcome Reward Model (ORM).
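Roughly, the selection step works like the sketch below. This is a simplified, illustrative version: the schema fields and reward rules are assumptions for exposition, not the exact ORM from the repo, and the actual consensus text is still written by the judge LLM.

```python
# Schematic Best-of-N selection with a deterministic reward model
# (illustrative schema and rules; see the repo for the real ORM).
from dataclasses import dataclass

@dataclass
class ScoutReport:
    """One VLM scout's structured analysis of a frame (illustrative fields)."""
    risk_score: float     # 0-10 risk estimate
    cited_objects: list   # objects the scout claims to have verified in pixels
    rationale: str        # summary of the chain-of-thought

def orm_score(report: ScoutReport, yolo_inventory: set) -> float:
    """Deterministic outcome reward: favor reports grounded in the YOLOE inventory."""
    grounded = [o for o in report.cited_objects if o in yolo_inventory]
    hallucinated = len(report.cited_objects) - len(grounded)
    score = len(grounded) - 2.0 * hallucinated                      # penalize ungrounded claims
    score += 1.0 if 0.0 <= report.risk_score <= 10.0 else -5.0      # schema sanity
    score += 1.0 if len(report.rationale.split()) >= 10 else -1.0   # non-trivial rationale
    return score

def best_of_n(reports, yolo_inventory: set) -> ScoutReport:
    """Keep the candidate the reward model scores highest."""
    return max(reports, key=lambda r: orm_score(r, yolo_inventory))
```

Because the reward is deterministic, the same N scout outputs always select the same winner, which keeps the search reproducible.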
Results (Gold Set N=108)
I manually curated a Gold Set of complex edge cases to benchmark the approach:
| Method | Precision ↑ | Recall ↑ | Risk MAE ↓ |
|---|---|---|---|
| CLIP (Baseline) | 0.683 | 0.475 | N/A |
| Pure VLM (Zero-Shot) | 0.691 | 0.814 | 1.389 |
| Semantic-Drive (Ours) | 0.712 | 0.966 | 0.676 |
The "System 2" approach reduces the Risk Assessment Error by 51% compared to a vanilla VLM.
Reproducibility
The entire pipeline runs on a single NVIDIA RTX 3090 (24 GB) using 4-bit quantization (llama.cpp). I’ve released the Docker container, the Gold Set annotations, and the full code to allow anyone to reproduce these results locally.
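If you want to poke at the quantized models outside the Docker container, the same 4-bit local-inference pattern works with the llama-cpp-python bindings. A minimal sketch, assuming a hypothetical GGUF path and settings (this is not the repo's actual entry point):

```python
# Minimal 4-bit local inference via llama-cpp-python (illustrative settings only).
from llama_cpp import Llama

llm = Llama(
    model_path="models/judge-q4_k_m.gguf",  # hypothetical 4-bit GGUF file
    n_gpu_layers=-1,                        # offload all layers to the 24 GB GPU
    n_ctx=8192,                             # room for the aggregated scout reports
)

out = llm.create_chat_completion(
    messages=[
        {"role": "system", "content": "You are a consensus judge for driving-scene reports."},
        {"role": "user", "content": "Aggregate the following scout analyses: ..."},
    ],
    temperature=0.0,  # deterministic judging
)
print(out["choices"][0]["message"]["content"])
```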
Would love to hear thoughts on the project, the Reward Model implementation, or how you are handling long-tail mining in your own workflows!
Thanks!