Eval Pipelines That Don't Lie
How Anthropic teams design evals that survive contact with production. Sarah on the difference between LLM-as-judge in research vs in shipping.
Everything I publish, in one place. Builds, breakdowns, and lessons from the front lines of applied AI.
New ones every Monday, Wednesday, and Friday.
10–30 minute videos where I show how the agents actually work.
Long-form interviews with builders, researchers, and operators shipping real AI products.
Subscribe to get them in your inbox first.
You've got a tech background. You're curious. You've Googled "how to learn AI" and gotten lost in a sea of YouTube thumbnails. Here's the curriculum I'd actually use if I were starting from zero today, built only from free resources from Google, Anthropic, Meta, and a few others.
A model produced a counterexample to a long-standing open problem. Tech Twitter is split between "the singularity is here" and "it's just brute search." Both are missing what actually changed.
AI doesn't kill senior engineers. It changes which kind of senior engineer the world needs. Here's what's changed in my Fortune 10 work, and what hasn't.
Three years of bespoke eval frameworks taught me one thing: the eval suite that ships is the one that fits in 50 lines.
What it actually takes to ship an agent inside a Fortune 10 healthcare org. Compliance, audit, and the audit-of-the-audit.
Every cycle has the same three phases. The trick is recognising which one you're in before everyone else does.
When to retrieve, when to call. A 4-question test I run on every new agent I design.
The thread that ties a kid taking apart computers, an IEEE Senior Member, and a Fortune 10 architect together. Spoiler: the question never changed.