Frameworks·8 min read·Apr 24, 2026

Why I Stopped Writing Custom Eval Suites

Three years of bespoke eval frameworks taught me one thing: the eval suite that ships is the one that fits in 50 lines.

The eval suite you ship is the only one that matters

I used to build elaborate eval harnesses: multi-tier scoring, LLM-as-judge with rubric chains, golden-set regressions versioned in Git, dashboards, traffic lights. I'd spend a week wiring it all up before I'd shipped a single line of the actual feature.

Three projects later, here's what I do now:

def eval_case(prompt: str, expected: str) -> float:
    # run_pipeline is the feature under test; judge_with_claude grades its output.
    actual = run_pipeline(prompt)
    return judge_with_claude(prompt, expected, actual)

That's the entire harness. Twelve lines including the imports. And it ships.
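The judge helper is not shown above, so here is one hedged way it could be shaped. Everything in this sketch is an assumption for illustration: the RUBRIC wording, the SCORE: reply format, and the names parse_score and judge_with are hypothetical, and call_model stands in for whatever function sends text to your judge model and returns its reply.

```python
import re
from typing import Callable

# Hypothetical rubric prompt; the wording and reply format are assumptions.
RUBRIC = """You are grading a model answer against a reference.
Prompt: {prompt}
Reference answer: {expected}
Model answer: {actual}
Reply with a single line of the form SCORE: <number between 0 and 1>."""


def parse_score(reply: str) -> float:
    """Extract the score from the judge's reply, clamped to [0, 1]."""
    match = re.search(r"SCORE:\s*([0-9]*\.?[0-9]+)", reply)
    if match is None:
        raise ValueError(f"no score found in judge reply: {reply!r}")
    return min(1.0, max(0.0, float(match.group(1))))


def judge_with(call_model: Callable[[str], str],
               prompt: str, expected: str, actual: str) -> float:
    """Grade one case; call_model sends text to the judge and returns its reply."""
    filled = RUBRIC.format(prompt=prompt, expected=expected, actual=actual)
    return parse_score(call_model(filled))
```

Passing the model call in as a function keeps the grader testable without network access: a lambda that returns a canned reply is enough to check the parsing.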

What I was actually optimizing for

When I wrote the elaborate version, I told myself I was building rigor. Coverage. Statistical confidence. The version that survives production.

What I was actually doing: avoiding the discomfort of admitting I didn't know what "good" looked like for the feature yet. The elaborate harness was a delay tactic dressed up as engineering.

If you can't articulate what "better" means in one sentence, no harness will save you. The harness just defers the conversation.

The 50-line rule

For every new agent, every new prompt-driven feature, every new RAG pipeline, I cap the eval scaffold at 50 lines of code. That includes:

  • The runner that loads cases from a JSON file
  • The model call
  • The judge call (Claude or GPT-4o, prompted as a rubric grader)
  • A diff against the previous run
  • Output to stdout

No dashboards. No history. No CI integration on day one. Those come after the feature is real and the cases have stabilized.
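As a rough sketch, a scaffold along these lines fits well under the cap. The case schema (id, prompt, expected), the file arguments, and the function name run_evals are assumptions made for illustration. run_pipeline and judge are placeholder stubs here: in practice the first would call your feature and the second an LLM rubric grader. The only state kept is a single file holding the previous run's scores, overwritten each time, which is just enough to print a diff.

```python
import json
from pathlib import Path


def run_pipeline(prompt: str) -> str:
    # Placeholder for the feature under test.
    return prompt.upper()


def judge(prompt: str, expected: str, actual: str) -> float:
    # Placeholder grader: exact match scores 1.0, anything else 0.0.
    # Swap in an LLM rubric grader here.
    return 1.0 if actual.strip() == expected.strip() else 0.0


def run_evals(cases_path: str, last_run_path: str) -> dict[str, float]:
    """Score every case, print each score with its change since the last run."""
    cases = json.loads(Path(cases_path).read_text())
    last = Path(last_run_path)
    previous = json.loads(last.read_text()) if last.exists() else {}

    scores = {}
    for case in cases:
        actual = run_pipeline(case["prompt"])
        scores[case["id"]] = judge(case["prompt"], case["expected"], actual)

    for case_id, score in scores.items():
        if case_id in previous:
            note = f"{score - previous[case_id]:+.2f}"
        else:
            note = "new"
        print(f"{case_id}: {score:.2f} ({note})")

    # Overwrite the single previous-run file; no history is kept.
    last.write_text(json.dumps(scores, indent=2))
    return scores
```

Calling run_evals("cases.json", "last_run.json") twice in a row prints each case as "new" the first time and with a signed delta the second time.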

When the 50 lines stop being enough

Eventually, for the systems that survive, the 50 lines need to grow. That's fine. But by then I know three things I didn't know on day one:

  1. Which dimensions matter: accuracy? latency? hallucination rate? helpfulness? You can only learn this from real cases.
  2. What "regression" means: a 2% score drop on case A while case B improves by 8% is not always bad. The judge model has to know your tradeoff.
  3. Where the bugs actually come from: usually not where you thought.

Grow the harness in response to learnings. Not in anticipation of them.
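To make the second point concrete, here is a small sketch of a per-case comparison between two runs. It assumes scores on a 0 to 1 scale, reads "2%" and "8%" as absolute changes of 0.02 and 0.08, and uses a hypothetical tolerance parameter to decide how large a drop counts as a regression. The function name compare_runs and the default tolerance are illustrative, not a recommendation.

```python
def compare_runs(previous: dict[str, float],
                 current: dict[str, float],
                 tolerance: float = 0.05):
    """Split shared cases into regressions and improvements, plus the mean change."""
    shared = sorted(previous.keys() & current.keys())
    deltas = {case_id: current[case_id] - previous[case_id] for case_id in shared}

    # Only changes larger than the tolerance count in either direction.
    regressions = {c: d for c, d in deltas.items() if d < -tolerance}
    improvements = {c: d for c, d in deltas.items() if d > tolerance}

    net = sum(deltas.values()) / len(deltas) if deltas else 0.0
    return regressions, improvements, net


previous = {"A": 0.90, "B": 0.70}
current = {"A": 0.88, "B": 0.78}
regressions, improvements, net = compare_runs(previous, current)
print(sorted(regressions), sorted(improvements), round(net, 2))
```

With these numbers, case A's drop stays inside the tolerance and case B counts as an improvement. Tightening the tolerance to 0.01 would flag A as a regression, and that choice is the tradeoff you have to decide on, not the harness.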

The lesson

Build the smallest eval that will tell you if you're getting better. Ship it. Then build the feature. Then grow the eval when the feature tells you what it needs.

The eval suite that ships is the only one that matters.