2026-04-19 · 10 min read · in-progress

RAG with Rigorous Evals

A production retrieval-augmented pipeline on Qdrant with hybrid search, reranking, and a Braintrust eval harness that blocks prompt or embedding regressions before they ship.

Python
Qdrant
Braintrust
FastAPI
Claude Sonnet 4.6
Pydantic

  • Eval suite: 280 cases
  • Recall@5: 0.92
  • Faithfulness: 0.89
  • p95 latency: 1.2s

The problem

Most RAG demos look great in a single happy-path Jupyter cell. They fall over the moment someone asks a question just outside the demo corpus, or when embeddings drift after a model swap, or when a silent prompt regression ships on a Friday.

This project fixes that third one first: make every RAG change land with an evaluation that would have caught it.

The stack

  • Qdrant as the vector store — fast hybrid search (dense + sparse) with payload filtering for tenant isolation
  • Claude Sonnet 4.6 as the generator, behind a provider-agnostic interface so we can A/B against other frontier models
  • Braintrust as the eval harness — deterministic, versioned, and wired into CI so PRs can't merge with regressions
  • FastAPI + Pydantic for the service, Postgres for audit logs
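The "provider-agnostic interface" in front of the generator can be sketched as a structural protocol. This is a minimal illustration, not the project's actual code; the class and method names are hypothetical.

```python
# Sketch of a provider-agnostic generator interface (illustrative names).
# The service depends only on Generator, so swapping Claude for another
# frontier model in an A/B test never touches the pipeline code.
from typing import Protocol


class Generator(Protocol):
    def generate(self, question: str, context: list[str]) -> str:
        """Produce an answer grounded in the retrieved context passages."""
        ...


class EchoGenerator:
    """Stand-in for tests; a real implementation would call an LLM API."""

    def generate(self, question: str, context: list[str]) -> str:
        return f"[{len(context)} passages] {question}"


def answer(gen: Generator, question: str, context: list[str]) -> str:
    # The pipeline only ever sees the protocol, never a concrete provider.
    return gen.generate(question, context)
```

Because `Generator` is a `Protocol`, any provider class with a matching `generate` signature satisfies it without inheritance.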

Retrieval

Retrieval is hybrid (BM25 sparse scores fused with dense embedding cosine), followed by a small cross-encoder rerank. We measured:

  • pure dense: Recall@5 = 0.78
  • pure sparse: Recall@5 = 0.71
  • hybrid + rerank: Recall@5 = 0.92

Worth the extra ~120ms.
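The dense and sparse result lists have to be merged before the rerank. One common way to do that, and a reasonable mental model for the fusion step here, is reciprocal rank fusion (RRF); the sketch below shows only the fusion, with illustrative names, not the project's Qdrant query code.

```python
# Minimal reciprocal rank fusion: merge two ranked doc-id lists into one.
# Each list contributes 1 / (k + rank + 1) per document; k dampens how
# much the very top ranks dominate. The fused list then goes to the
# cross-encoder reranker.
def rrf_fuse(dense: list[str], sparse: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in (dense, sparse):
        for rank, doc_id in enumerate(ranking):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)
```

A document that appears in both lists ("b" ranked 2nd dense and 1st sparse, say) accumulates score from each and floats to the top, which is exactly the behaviour hybrid retrieval is after.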

Evals that actually block deploys

  1. Phase 1

    Seed the eval set

    280 real user questions from the design-partner corpus, each labelled with ideal context passages and a faithful answer. No synthetic fluff.

  2. Phase 2

    Multi-axis scoring

    Every change scored on recall, faithfulness (LLM-as-judge with a strict rubric), answer helpfulness, latency p50/p95, and cost per query.

  3. Phase 3

    CI gate

    A GitHub Action runs the suite on every PR. Any axis that regresses beyond a tolerance blocks merge. Braintrust UI surfaces the deltas for review.

  4. Phase 4

    Shadow eval in prod

    A fraction of live traffic is replayed against candidate configurations nightly. Drift shows up in the dashboard before users complain.
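The core of the Phase 3 gate is just a per-axis tolerance check between the baseline run and the candidate run. A hedged sketch, with made-up metric names and thresholds (the real gate runs inside the Braintrust-wired GitHub Action):

```python
# Illustrative CI regression gate: each metric has a direction and an
# allowed regression. Any axis that moves the wrong way beyond its
# tolerance blocks the merge.
TOLERANCES = {
    # metric: (higher_is_better, allowed_regression)
    "recall_at_5": (True, 0.01),
    "faithfulness": (True, 0.02),
    "p95_latency_s": (False, 0.10),
    "cost_per_query": (False, 0.0005),
}


def regressions(baseline: dict, candidate: dict) -> list[str]:
    """Return the metrics where the candidate regressed beyond tolerance."""
    failed = []
    for metric, (higher_better, tol) in TOLERANCES.items():
        delta = candidate[metric] - baseline[metric]
        regressed = delta < -tol if higher_better else delta > tol
        if regressed:
            failed.append(metric)
    return failed
```

The Action fails the job whenever `regressions(...)` is non-empty, and the per-metric deltas it prints are the same ones surfaced in the Braintrust UI for review.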

What surprised me

  • Reranking matters more than embedding choice. Swapping embedding models moved Recall@5 by ±2 points; adding rerank added 14.
  • LLM-as-judge is fine if your rubric is strict. Loose rubrics give every answer an 8/10. A rubric that looks for specific citations from the retrieved context is far more discriminating.
  • Your eval set rots. Every month, regenerate ~10% from recent real queries, or you'll optimise to an old distribution.
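To make the "strict rubric" point concrete: the property the judge is checking is whether each answer sentence is actually supported by a retrieved passage. Below is a deterministic, deliberately simplified proxy for that check (the real judge is an LLM with a citation-checking rubric; function and parameter names here are hypothetical):

```python
# Simplified grounding check: an answer sentence counts as supported only
# if most of its words appear in some retrieved passage. This is the
# discriminating idea behind the strict rubric, stripped to word overlap.
def grounded_fraction(answer_sentences: list[str], passages: list[str],
                      min_overlap: float = 0.5) -> float:
    """Fraction of answer sentences whose words mostly occur in a passage."""
    def supported(sentence: str) -> bool:
        words = set(sentence.lower().split())
        return any(
            len(words & set(p.lower().split())) / max(len(words), 1) >= min_overlap
            for p in passages
        )

    if not answer_sentences:
        return 1.0
    return sum(supported(s) for s in answer_sentences) / len(answer_sentences)
```

A loose rubric effectively sets `min_overlap` near zero, so everything scores well; demanding specific support per sentence is what spreads the scores out.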

What's next

  • Public write-up with the full rubric + scoring code
  • Open-source the Qdrant + Braintrust scaffold as a template repo
  • Add adversarial eval cases (jailbreaks, prompt-injection in retrieved docs)