The Eval Flywheel — aishahsofea

There’s a line from Karpathy — written in 2022, mostly ignored at the time — that has aged into something close to a law: “competitive advantage in AI goes not so much to those with data but those with a data engine.” Whoever spins it fastest wins.

The practitioner community has a name for this now. Hamel Husain calls it the eval flywheel. Eugene Yan calls it Eval-Driven Development. The names vary but the shape is the same. Six steps running in a loop, each one feeding the next.

I’ve been building a Malaysian legal research assistant. The agent has four fixed stages: classify the query, retrieve relevant statute chunks from a database of Malaysian legislation, synthesize a grounded response, then validate every citation before anything goes out the door. I’ve spun the flywheel several times now, and here’s what actually happened.

The Loop

Every serious eval practitioner converges on the same workflow, with cosmetic variations. Log your pipeline’s full traces. Look at them without an agenda. Do error analysis by annotating failures freely, clustering them into patterns, and counting which ones recur most. Write evals that capture the top failure class. Fix the system. Measure. Watch for regressions. Repeat.

What makes this a flywheel and not just a checklist is that each turn generates the input to the next. You are back at error analysis after every loop, but now you know more.
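
The counting step of error analysis can be as plain as a tally. Here is a minimal sketch, assuming each annotated failure carries one free-form tag; the tag names and trace IDs are hypothetical and not taken from the real build log.

```python
from collections import Counter

# Hypothetical annotations: one free-form tag per failed trace.
annotations = [
    {"trace_id": "t01", "tag": "english_reply_to_bm_query"},
    {"trace_id": "t02", "tag": "citation_format_rejected"},
    {"trace_id": "t03", "tag": "english_reply_to_bm_query"},
    {"trace_id": "t04", "tag": "empty_citation_list"},
    {"trace_id": "t05", "tag": "english_reply_to_bm_query"},
]


def rank_failure_classes(rows):
    """Count how often each failure tag recurs, most frequent first."""
    return Counter(row["tag"] for row in rows).most_common()


ranked = rank_failure_classes(annotations)
top_tag, top_count = ranked[0]
print(f"write the next eval for: {top_tag} ({top_count} of {len(annotations)} failures)")
```

The top entry is the failure class the next eval should capture; the rest of the list is the backlog for later turns.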

Three-level Pyramid

The most useful operational framework is Hamel’s three levels of evaluation. Cost grows by roughly an order of magnitude per level, which dictates how often each runs. The intuition is simple. There’s no point paying for an LLM’s opinion on a response that already failed a basic check. So cheap deterministic assertions run first, and if any fail, the case stops there. The LLM judge only gets called on responses that cleared every fast gate.
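
The gating described above might look roughly like this. It is a sketch under assumptions: the two checks, the UUID pattern, and the `judge` callable are illustrative stand-ins, not the assistant’s real assertions.

```python
import re
from typing import Callable

# Assumed pattern for an internal chunk ID leaking into user-facing text.
UUID_RE = re.compile(
    r"\b[0-9a-f]{8}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{4}-[0-9a-f]{12}\b", re.I
)


def no_uuid_leakage(response: str) -> bool:
    return UUID_RE.search(response) is None


def non_empty(response: str) -> bool:
    return bool(response.strip())


# Hypothetical fast gates; each returns True when the response passes.
ASSERTIONS = {"uuid_leakage": no_uuid_leakage, "non_empty": non_empty}


def evaluate(response: str, judge: Callable[[str], bool]) -> dict:
    """Run cheap assertions first; call the judge only if all of them pass."""
    failed = [name for name, check in ASSERTIONS.items() if not check(response)]
    if failed:
        return {"passed": False, "failed_assertions": failed, "judge_called": False}
    return {"passed": judge(response), "failed_assertions": [], "judge_called": True}


# Usage with a stub judge that records how often it is invoked.
calls = []


def stub_judge(response: str) -> bool:
    calls.append(response)
    return True


leaky = evaluate("See chunk 123e4567-e89b-12d3-a456-426614174000.", stub_judge)
clean = evaluate("Section 90A of the Evidence Act 1950 applies here.", stub_judge)
print(leaky["failed_assertions"], len(calls))
```

Only the clean response reaches the judge, so the expensive call is never spent on a case that a regex already rejected.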

Three Loops

Theory is one thing. The build log is another. What follows is what actually happened, with each loop triggered by a score drop and each fix validated by re-running the suite.

Loop 01: When the Right Fix Is a Scope Decision (language_register: 0%)

run_evals.py --smoke

FAIL language_register: 0/5 = 0.0%

↳ BM query received English-only response

PASS citation_existence: 8/8 = 100.0%

PASS uuid_leakage: 10/10 = 100.0%

Judge: 71.4% — below 80% gate

The Bahasa Malaysia (BM) cases were scoring 0% on the language assertion because the pipeline was responding in English to BM queries. A clear failure on paper. But before any fix was written, error analysis pointed somewhere unexpected: the entire statute corpus is in English, so BM retrieval is degraded by design until a BM corpus is ingested. Given what the pipeline had to work with, the model wasn’t doing anything wrong.

This changed the question from “how do we fix the language handling?” to “should we be testing this at all right now?” The answer was no. The BM test cases were removed from the smoke suite and BM support was explicitly deferred to v2, when there will actually be BM content to retrieve against.

The assertion is still in the code. The capability just isn’t claimed yet.

The harness surfaced the failure, but what to do with the finding was still ours to decide.

Loop 02: The Supervisor Was Calibrated to Claude (judge: 30% → 80%)

run_evals.py --mode full # GPT-4.1 trial

FAIL judge: 3/10 = 30.0%

↳ all failures via FINAL_FAILURE_RESPONSE

↳ supervisor Rule 2 firing on every GPT response

debug_case.py # node-by-node tracer

router → statute_lookup ✓

retriever → 8 chunks ✓

synthesizer → response ✓

supervisor → Rule 2 FAIL # citation regex mismatch

During a model trial, swapping Sonnet for GPT-4.1 dropped the judge pass rate to 30%. Every single case ended in the pipeline’s fallback error response, not because GPT-4.1 got the law wrong, but because a validation rule was rejecting its output before the judge ever saw it. A single-case debug tracer that prints the output of each stage in isolation made this visible in minutes.
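
A tracer of this kind can be very small. The sketch below assumes each stage is a function that takes the shared state and returns only the fields it adds; the four stub stages are hypothetical stand-ins named after the nodes in the log above, not the real pipeline code.

```python
from typing import Callable


# Hypothetical stand-ins for the four pipeline stages.
def router(state: dict) -> dict:
    return {"route": "statute_lookup"}


def retriever(state: dict) -> dict:
    return {"chunks": ["chunk-1", "chunk-2"]}


def synthesizer(state: dict) -> dict:
    return {"response": "Under the Evidence Act 1950, Section 90A(1) states that ..."}


def supervisor(state: dict) -> dict:
    # Deliberately naive rule that expects one exact citation phrasing.
    return {"approved": "Section 90A of the Evidence Act 1950" in state["response"]}


STAGES: list[tuple[str, Callable[[dict], dict]]] = [
    ("router", router),
    ("retriever", retriever),
    ("synthesizer", synthesizer),
    ("supervisor", supervisor),
]


def trace_case(query: str, stages=STAGES) -> list[tuple[str, dict]]:
    """Run one query through every stage, printing each stage's own output."""
    state = {"query": query}
    trace = []
    for name, stage in stages:
        output = stage(state)
        print(f"{name:<12} -> {output}")
        trace.append((name, output))
        state.update(output)
    return trace


trace = trace_case("Is a computer printout admissible as evidence?")
```

Because every stage prints in isolation, a rejection at the supervisor is visible next to a perfectly reasonable synthesizer output, which is the contrast an aggregate pass rate hides.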

The culprit was a citation pattern check written to match Claude’s phrasing: “Section 90A of the Evidence Act 1950.” GPT-4.1 writes it differently — “Section 90A(1) states that…” with the Act name appearing earlier in the sentence. The check didn’t account for that, so every GPT-4.1 response failed the validation, triggered a retry, and eventually hit the fallback.

Extending the pattern check to also accept subsection notation moved GPT-4.1 from 30% to exactly 80%, with all deterministic assertions passing and the judge scoring 8 out of 10. The larger point is that the validation was silently calibrated to one model’s citation style. Any future model trial would have hit the same wall. The harness found what a code review wouldn’t have.
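
To make the mismatch concrete, here is an illustrative pair of patterns. The real supervisor regex is not shown in this post, so `STRICT` and `EXTENDED` are assumptions about its shape rather than the actual rule.

```python
import re

SECTION = r"Section \d+[A-Z]?"                    # "Section 90A"
SUBSECTION = r"(?:\(\w+\))+"                      # "(1)", "(2)(a)"
ACT_SUFFIX = r" of the (?:[A-Z]\w+ )+Act \d{4}"   # " of the Evidence Act 1950"

# Assumed original rule: the Act name must directly follow the section number.
STRICT = re.compile(SECTION + ACT_SUFFIX)

# Extended rule: a section followed by subsection notation also counts.
EXTENDED = re.compile(f"{SECTION}(?:{SUBSECTION}|{ACT_SUFFIX})")

claude_style = "Section 90A of the Evidence Act 1950 applies here."
gpt_style = "Under the Evidence Act 1950, Section 90A(1) states that ..."

for label, text in [("claude", claude_style), ("gpt", gpt_style)]:
    strict_ok = STRICT.search(text) is not None
    extended_ok = EXTENDED.search(text) is not None
    print(f"{label:<7} strict={strict_ok} extended={extended_ok}")
```

Pinning one sample per model’s citation style as a regression case would let the next model trial fail on a named assertion instead of on the fallback response.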

Loop 03: The Cost vs. Fidelity Tension (decision: eval = prod)

Eval cost was creating friction. With 30+ model calls per smoke run, iteration was expensive enough that you’d think twice before running it. The natural instinct was to swap in a cheaper model for evals.

The flywheel stopped this. Running evals on a cheaper model while deploying a different one in production isn’t an eval. It’s a measurement of a different system. Previous testing had already shown the cheaper model would consistently omit citations from its structured output, returning empty lists even when the prose mentioned the right statute. With nothing in the list to verify, the citation existence check would pass vacuously while the citations were in fact missing. The eval would be blind to exactly the failure class it was supposed to catch.
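
The blind spot comes from how an “every citation exists” check behaves on an empty list. This is a minimal sketch: `KNOWN_SECTIONS` is a hypothetical stand-in for the statute database, and the stricter variant is one possible guard, not something the pipeline is known to implement.

```python
# Hypothetical stand-in for the statute database lookup.
KNOWN_SECTIONS = {("Evidence Act 1950", "90A")}


def citation_existence(citations: list) -> bool:
    """Pass when every cited (act, section) pair exists in the corpus."""
    return all(c in KNOWN_SECTIONS for c in citations)


def citation_existence_strict(citations: list, prose: str) -> bool:
    """Same check, but an empty list fails when the prose itself cites a section."""
    if not citations and "Section " in prose:
        return False
    return citation_existence(citations)


prose = "Section 90A of the Evidence Act 1950 applies here."
empty = []  # structured output from the cheaper model: no citations at all

print(citation_existence(empty))                # True: all() over nothing is vacuously true
print(citation_existence_strict(empty, prose))  # False: the omission is now visible
```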

The compound-probability problem

“A 90% accurate process repeated 5 times is 59% accurate.”

Using a slightly different model in evals vs. production compounds this gap in ways that only surface as mysterious production failures with no eval signal to explain them.
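
The arithmetic behind the quote is a single exponent. The sketch assumes the steps succeed or fail independently, which is the simplification the quote itself makes.

```python
def compound_accuracy(per_step: float, steps: int) -> float:
    """Probability that every one of `steps` independent steps succeeds."""
    return per_step ** steps


print(round(compound_accuracy(0.90, 5), 2))  # 0.59
```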

The decision was to keep the production model in the eval and address cost by making that model cheaper instead. GPT-4.1 is cheaper than Sonnet. A small routing layer now maps model names to the right provider, so future model trials are a one-line environment variable change, not a code change.
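
A routing layer like this can be a lookup table plus one environment variable. Everything named below is hypothetical: the `PIPELINE_MODEL` variable, the prefix table, and the provider labels are assumptions for illustration, not the project’s actual configuration.

```python
import os

# Hypothetical prefix table mapping model names to provider labels.
PROVIDER_BY_PREFIX = {"gpt-": "openai", "claude-": "anthropic"}


def resolve_provider(model_name: str) -> str:
    """Map a model name to its provider, failing loudly on unknown names."""
    for prefix, provider in PROVIDER_BY_PREFIX.items():
        if model_name.startswith(prefix):
            return provider
    raise ValueError(f"no provider configured for model {model_name!r}")


def active_model(env=os.environ) -> tuple[str, str]:
    """Read the model from one variable shared by eval and production."""
    name = env.get("PIPELINE_MODEL", "gpt-4.1")
    return name, resolve_provider(name)


print(active_model({"PIPELINE_MODEL": "gpt-4.1"}))
```

Reading the same variable in both the eval harness and the deployed pipeline is what keeps eval = prod: there is no second place where a different model could be configured.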

Cost pressure is real and worth solving, but the answer is always to make the right system cheaper, not to measure a different one. The moment you cut the eval model, you’ve cut the signal.

Criteria Drift

Shreya Shankar’s UIST 2024 paper names the phenomenon precisely:

To grade outputs, people need to externalize and define their evaluation criteria; however, the process of grading outputs helps them to define that very criteria.

— Shreya Shankar · UIST 2024

I felt this when writing the judge prompt. My initial pass criteria were vague, things like “accurate” and “appropriately hedged.” But the moment I tried to write calibration examples, the vagueness collapsed. What does “hallucinated citation” mean exactly? What counts as AI-refusal boilerplate versus a legitimate disclaimer? What’s the boundary between refusing to give legal advice and refusing to answer a legitimate research question?

Each of those questions had to be answered explicitly because the judge demanded it. The answers are the product’s quality contract, written in concrete terms that a model can apply consistently.

Phillip Carter, after building Honeycomb’s LLM judge, put it plainly: “Seeing how the LLM breaks down its reasoning made me realize I wasn’t being consistent about how I judged certain edge cases.” The PRD doesn’t precede the eval. The eval is how you write the PRD.

What I’d Do Differently

Start with 15 hand-graded cases, not 40. The smoke set should have been the starting point, hand-graded before anything automated was trusted. Starting larger meant trusting automation before the rubric was calibrated.

Build a single-case debug tracer on day one. I only built that script, which runs one query through each pipeline stage and prints what came out of each one, after the GPT-4.1 failure made the full eval suite useless for diagnosis. It should have been the first tool. Running 15 cases to isolate a bug in one stage is slow and expensive; running one case with full intermediate output takes seconds.

Version the judge prompt like it’s the product spec — because it is. It deserves the same review rigor as anything else that ships, and a changelog when it changes.
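
One lightweight way to do that is to stamp every eval result with the prompt version and a content hash, so a score can always be traced back to the rubric that produced it. This is a sketch only: the version string, the prompt text, and the result fields are hypothetical.

```python
import hashlib

JUDGE_PROMPT_VERSION = "1.2.0"  # hypothetical; bumped by hand with a changelog entry
JUDGE_PROMPT = "Fail any response that cites a section not present in the retrieved chunks."


def prompt_fingerprint(prompt: str) -> str:
    """Short content hash that changes whenever the prompt text changes."""
    return hashlib.sha256(prompt.encode("utf-8")).hexdigest()[:12]


def stamp(result: dict, prompt: str = JUDGE_PROMPT, version: str = JUDGE_PROMPT_VERSION) -> dict:
    """Attach the judge prompt's version and fingerprint to one eval result."""
    return {
        **result,
        "judge_prompt_version": version,
        "judge_prompt_sha": prompt_fingerprint(prompt),
    }


stamped = stamp({"case_id": "smoke-01", "passed": True})
print(stamped)
```

The hash catches edits that forgot to bump the version; the version gives the changelog something human-readable to point at.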

What compounding actually feels like

Each loop in this build started with a score drop the harness had already measured. Each fix was checked against that measurement, not against vibes. Every pinned smoke case is a promise that this specific failure will never silently reappear. The loop closes. The next spin is cheaper than the last. That’s the flywheel.

The flywheel isn’t glamorous. It’s test cases, rubrics, error taxonomies, regex fixes, and cost calculations. But it’s the part of agent engineering where the work actually compounds.