The Agent That Doesn't Improvise
6 June 2026
The default move when you build with LLMs is to hand the model as much autonomy as you can. You define some tools, write a system prompt, and let it decide which tools to call and in what order. For a lot of tasks that works surprisingly well, and it’s genuinely the right starting point.
For the legal research assistant I’ve been building, I wasn’t convinced it was the right default. The question wasn’t whether the model was capable, but whether it would be consistent. An autonomous agent might decide a citation check isn’t needed for a particular query, or reason that a disclaimer is already implied. Sometimes it’d be right; sometimes not. And in a legal context, the times it’s wrong are the whole problem: a confident-sounding wrong answer is worse than no answer at all.
So I went the other way and built a fixed pipeline. Every step is a node in a graph and every transition is a condition written in plain code. LangGraph’s StateGraph is what makes this tractable: you write nodes as functions that read from and write to a shared state object, wire them together with edges, and let the graph handle execution.
Here’s the trade-off I was making, laid out plainly.
Autonomous agent (rejected)
- The model decides which steps to take at runtime
- No guaranteed order, so safety checks can quietly get skipped
- Non-deterministic, which makes it hard to audit or predict cost
- Can reason its way past a safety rule when it decides the rule doesn’t apply
Explicit workflow (chosen)
- Every step runs in a fixed, auditable sequence
- Safety checks are structurally guaranteed
- Explicit stopping points written in code
- Predictable cost: at most two drafting attempts per query
What actually happens when a question comes in
The shape of it is simple. Classify the question, search the law, write an answer, check that answer, then either send it or loop back. It’s in the same order every single time.
Before I get into the steps, it’s worth understanding how they talk to each other. Well, they don’t. No step calls the next one or passes it a return value. Instead every step reads from and writes to one shared object I’ll call the agent state. Think of it as a notepad the whole pipeline passes around.
So when the first step finishes classifying a question, it doesn’t hand anything to the next step. It records what it worked out onto the notepad, and the graph reads the notepad to decide where to go. A later step drafts an answer and writes that down too; the citation check picks that draft back up from the same place. Each step only cares about what’s already on the notepad when it runs, and what it’s supposed to add before it hands control back.
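A minimal sketch of the notepad idea, written in plain Python rather than against LangGraph itself. The field names, the two toy steps, and the `run` helper are all assumptions made for illustration; `run` only stands in for the graph runtime's job of merging each step's update into the shared state.

```python
from typing import Callable, TypedDict


class AgentState(TypedDict, total=False):
    # Hypothetical field names, chosen for illustration only.
    question: str
    category: str
    draft_response: str


def classify(state: AgentState) -> AgentState:
    # Reads the question and writes only the field this step owns.
    is_question = state["question"].strip().endswith("?")
    return {"category": "research" if is_question else "other"}


def draft(state: AgentState) -> AgentState:
    # Never calls classify(); it only sees what is already on the notepad.
    return {"draft_response": f"[{state['category']}] draft for: {state['question']}"}


def run(steps: list[Callable[[AgentState], AgentState]], state: AgentState) -> AgentState:
    # Stand-in for the graph runtime: merge each step's update into the state.
    for step in steps:
        state = {**state, **step(state)}
    return state


final_state = run([classify, draft], {"question": "What does Section 90A cover?"})
```

Each step returns only the fields it adds, which is what lets a step be swapped or tested without touching its neighbours.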
Before any real work starts, the very first step scans the incoming message for phrasing that sounds like someone asking about their own situation: “my client”, “am I liable”, “I have been charged”. If it spots one, the question is flagged as out of scope and handed straight off, without ever touching the database or calling a model. The questions with the most at stake end up costing almost nothing to handle.
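As a rough sketch, that first step can be a handful of compiled patterns and a loop. The three patterns below come from the phrasings quoted above; the function name, the returned keys, and the exact regexes are assumptions, and a real list would be longer.

```python
import re

# Hypothetical pattern list built from the three phrasings quoted in the text.
ESCALATION_PATTERNS = [
    re.compile(r"\bmy client\b", re.IGNORECASE),
    re.compile(r"\bam i liable\b", re.IGNORECASE),
    re.compile(r"\bi have been charged\b", re.IGNORECASE),
]


def route_question(message: str) -> dict:
    """Return a state update: escalate on any match, otherwise continue."""
    for pattern in ESCALATION_PATTERNS:
        if pattern.search(message):
            return {"escalate": True, "matched": pattern.pattern}
    return {"escalate": False, "matched": None}
```

No database handle and no model client appear anywhere in the function, which is the point: an escalated question cannot cost more than a few regex scans.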
The diagram below is the whole thing wired up. Blue nodes make a generative LLM call. Green nodes are deterministic: no model decides what comes out. (The one asterisk is the retriever: it embeds your query to run the vector search, but nothing it returns is model-authored, so it behaves deterministically from the pipeline’s point of view.) Amber nodes are control logic that steers the flow.
Look at the synthesiser, for example. It reads retrieved_chunks and response_language, both written by earlier nodes, and it writes draft_response and citations, which the validators pick up next. It knows nothing about any other node. It only knows state. That decoupling is what makes the pipeline easy to reason about, easy to test, and easy to re-order without anything silently breaking.
How I check whether the answer is trustworthy
Once the model has written an answer, how do I actually know it’s right? I landed on three checks: citation_validator, grounding_check, and supervisor.
The first is a plain Python function with no model involved. It looks at every citation in the draft and asks whether that act-and-section pair was actually in the retrieved results. If the answer cites Section 90A of the Evidence Act, was Section 90A really fetched? It either was or it wasn’t.
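A sketch of what such a check might look like. The dictionary shapes for citations and retrieved chunks are assumptions; only the act-and-section comparison is taken from the description above.

```python
def citation_validator(citations, retrieved_chunks):
    """Return the cited (act, section) pairs that were never retrieved."""
    retrieved = {(c["act"], c["section"]) for c in retrieved_chunks}
    return [
        (c["act"], c["section"])
        for c in citations
        if (c["act"], c["section"]) not in retrieved
    ]


# Illustrative data: one citation was fetched, the other was not.
retrieved_chunks = [
    {"act": "Evidence Act 1950", "section": "90A", "text": "(statute text omitted)"},
]
citations = [
    {"act": "Evidence Act 1950", "section": "90A"},
    {"act": "Evidence Act 1950", "section": "114"},
]
missing = citation_validator(citations, retrieved_chunks)
```

An empty list means every citation traces back to something retrieved; anything else is a violation with the offending pair attached.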
The second is where I bring in a second LLM as a judge. It reads each legal claim in the draft next to the statute text that was cited and decides whether the text genuinely supports what the answer says. This is the check that catches the subtle stuff: the model citing a real section but misreading it, or overstating what it actually says. I use Claude Sonnet at temperature 0, and each claim comes back labelled supported, partial, or unsupported. Only unsupported blocks the answer. I allow partial through, because a slightly incomplete claim is still useful for a research tool.
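The blocking rule is easy to get subtly wrong, so here is a sketch of just that part. The judge is passed in as a plain callable so the example runs offline; `fake_judge`, the claim dictionaries, and the violation strings are hypothetical stand-ins, not the real model call.

```python
from typing import Callable, Literal

Verdict = Literal["supported", "partial", "unsupported"]


def grounding_check(claims: list[dict], judge: Callable[[str, str], Verdict]) -> list[str]:
    """Return violation messages; only 'unsupported' verdicts block the draft."""
    violations = []
    for claim in claims:
        verdict = judge(claim["claim"], claim["statute_text"])
        if verdict == "unsupported":
            violations.append(f"unsupported claim: {claim['claim']}")
    return violations


# Canned verdicts standing in for the LLM judge.
CANNED: dict[str, Verdict] = {"A": "supported", "B": "partial", "C": "unsupported"}


def fake_judge(claim: str, statute_text: str) -> Verdict:
    return CANNED[claim]


claims = [{"claim": name, "statute_text": ""} for name in ("A", "B", "C")]
violations = grounding_check(claims, fake_judge)
```

Keeping the pass/fail decision in code, and leaving only the labelling to the model, means the threshold for blocking is something you can read and diff.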
The third check is rules again, four of them enforced with regular expressions:
- No advice phrases.
- At least one statute citation in the expected format.
- A disclaimer that this isn’t legal advice.
- And none of the personal-situation phrases from the router’s escalation list appearing in the response itself — a guard against the model accidentally echoing the user’s phrasing back (“you asked whether you are liable…”) instead of staying in research mode.
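The four rules above can be sketched as four patterns and one function. Every pattern here is an illustrative assumption (the real phrase lists and citation format are not shown in this post), and the violation names are made up.

```python
import re

# All four patterns are illustrative assumptions, not the production rules.
ADVICE = re.compile(r"\b(you should|i advise|i recommend)\b", re.IGNORECASE)
CITATION = re.compile(r"\bSection \d+[A-Z]? of the [A-Z][A-Za-z ]* Act \d{4}\b")
DISCLAIMER = re.compile(r"\bnot legal advice\b", re.IGNORECASE)
PERSONAL = re.compile(r"\b(my client|am i liable|i have been charged)\b", re.IGNORECASE)


def supervisor(draft: str) -> list[str]:
    """Apply the four rules in order and return the names of any that fail."""
    violations = []
    if ADVICE.search(draft):
        violations.append("advice_phrase")
    if not CITATION.search(draft):
        violations.append("missing_citation")
    if not DISCLAIMER.search(draft):
        violations.append("missing_disclaimer")
    if PERSONAL.search(draft):
        violations.append("personal_phrase")
    return violations
```

Two of the rules fail on a match and two fail on the absence of one, so it helps to keep the `not` explicit rather than hiding it in a helper.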
One shortcut worth pointing out: if the first check finds a problem, I skip the second one. No reason to pay an LLM judge to grade a draft that’s already going back for a rewrite.
The three layers, in order:
- L1: citation_validator (deterministic). If L1 finds issues, skip L2 and go straight to the supervisor.
- L2: grounding_check (LLM judge)
- L3: supervisor (deterministic)
Why the stack is shaped this way matters more than the individual checks. L1 and L3 are deterministic, so AI reasoning can’t make them wrong. L2 is the only model-based check in the verification step, and it sits between two layers that don’t depend on it. Model judgment is bounded on both sides by code that can’t hallucinate a false pass.
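The skip rule is simple enough to show as one function. This is a sketch that collapses the three layers into a single call for readability; in a graph the same decision would more likely be a conditional edge. The function name and the `judge_called` flag are assumptions added so the behaviour is observable.

```python
def verify(state, citation_validator, grounding_check, supervisor):
    """Run the three layers in order, skipping the judge when L1 already failed."""
    violations = list(citation_validator(state))
    judge_called = False
    if not violations:
        # Only pay for the LLM judge when the citations themselves are sound.
        violations += grounding_check(state)
        judge_called = True
    # The supervisor always runs, whatever the earlier layers found.
    violations += supervisor(state)
    return {"violations": violations, "judge_called": judge_called}


# Stub layers: each takes the state and returns a list of violations.
l1_failed = verify({}, lambda s: ["bad citation"], lambda s: ["unsupported"], lambda s: [])
l1_passed = verify({}, lambda s: [], lambda s: ["unsupported"], lambda s: ["missing_disclaimer"])
```

In the first call the judge's verdict never appears in the result, because the judge was never asked.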
A bug that took this down in production. I wrote the supervisor’s citation regex while only testing with Claude. Claude writes citations like “Section 90A of the Evidence Act 1950.” GPT-4.1 writes “Section 90A(1) states that…”, with the subsection in the middle and the act name earlier in the sentence. My regex didn’t match that style, so Rule 2 fired on every non-Claude answer, forced a retry, and then returned a refusal. GPT-4.1 went from a 30% to an 80% pass rate just by widening that one pattern. A deterministic rule is only ever as good as the variation you wrote it to handle.
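To make the failure concrete, here is a sketch of a narrow pattern next to a wider one. Neither is the regex from the project: both are assumptions written to reproduce the behaviour described, and the GPT-style sentence is a constructed example of a subsection with the act name earlier in the sentence.

```python
import re

# Hypothetical narrow rule: act name must follow the section directly.
NARROW = re.compile(r"\bSection \d+[A-Z]? of the [A-Z][A-Za-z ]* Act \d{4}\b")
# Hypothetical wider rule: optional subsection, act name no longer required.
WIDE = re.compile(r"\bSection \d+[A-Z]?(?:\(\d+\))?")

claude_style = "Section 90A of the Evidence Act 1950."
gpt_style = "Under the Evidence Act 1950, Section 90A(1) states that the document is admissible."

narrow_hits = [bool(NARROW.search(s)) for s in (claude_style, gpt_style)]
wide_hits = [m.group(0) for m in (WIDE.search(claude_style), WIDE.search(gpt_style))]
```

The wider pattern checks less than the narrow one did, since it no longer looks for an act name at all, so widening a rule like this is a trade-off rather than a free fix.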
Two exits, two different problems
Both of the system’s stopping points end with the user being pointed toward a lawyer, but they’re not the same thing and it’s worth keeping them apart.
The first happens before any real work. A keyword check on the incoming message catches phrasing that sounds like a request for personal legal advice, and if it matches, the pipeline stops right there, before any database query or model call. The question is out of scope.
The second happens after the system has done everything it can. Both drafting attempts have failed the checks, so rather than send something I can’t verify, the final answer gets replaced with a safe fallback. That overwrite happens outside the graph, not inside a node, so no future rewiring of the pipeline can accidentally skip it. The question was in scope. I just couldn’t answer it well enough.
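A sketch of what "outside the graph" can mean in practice: a thin wrapper that calls the graph and then applies the overwrite itself. The wrapper name, the state keys, and the fallback wording are all assumptions; `run_graph` stands in for whatever invokes the compiled graph.

```python
# Hypothetical fallback text.
FALLBACK = "I couldn't verify an answer to this question. Please consult a qualified lawyer."


def answer_query(question, run_graph):
    """Run the graph, then apply the fail-closed overwrite outside of it."""
    state = run_graph({"question": question})
    if state.get("violations"):
        # Unresolved violations after the last attempt: replace, don't send.
        return {**state, "final_response": FALLBACK, "failed_closed": True}
    return {**state, "final_response": state["draft_response"], "failed_closed": False}


# Stub graphs: one returns a clean draft, the other leaves a violation behind.
ok = answer_query("q", lambda s: {**s, "draft_response": "fine", "violations": []})
bad = answer_query("q", lambda s: {**s, "draft_response": "risky", "violations": ["missing_citation"]})
```

Because the check lives in the caller, adding, removing, or re-ordering nodes cannot route around it.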
Escalation
- Triggered by: Pattern match on the message
- When: Before retrieval or any LLM call
- Cost: Zero, the cheapest possible stop
- Guards against: Requests for personal advice
Fail-closed
- Triggered by: Unresolved violations after both tries
- When: After the full pipeline + one retry
- Cost: Full, the most expensive stop
- Guards against: Shipping an unverifiable answer
Using a keyword match for escalation feels a little crude, and it won’t catch every way someone might phrase a request for advice. But when it gets something wrong, it gets it wrong on the safe side. It flags a question that didn’t need flagging rather than letting a personal-advice request slip through. That asymmetry mattered to me more than raw coverage did.
These are all named patterns
It’s worth naming what this architecture actually is, because the patterns map cleanly onto established ones. All three (routing, prompt chaining, and evaluator-optimizer) come straight from Anthropic’s guide on building effective agents. The router is routing. The fixed sequence of steps is prompt chaining. The synthesiser-to-checker loop is evaluator-optimizer.
Things I gave up on purpose
None of these decisions are obviously right. Each one trades something useful away, and I think it’s more honest to name what.
- A fixed pipeline instead of an autonomous agent
  The clearest thing I gave up is adaptability. A fixed pipeline can’t reshape itself around a question I didn’t anticipate, and there’s no room for the model to improvise a different approach when one might genuinely help. I took that deal because the payoff is the opposite property: every answer travels the exact same safety path, every time, with no shortcut the model can reason its way into.
- At most one retry
  Capping retries at one means some drafts that might have recovered on a second or third pass get refused instead. But in testing, convergence past the first retry was rare, since the model kept working from the same retrieved evidence and arriving at pretty much the same place. The extra attempts mostly added latency and cost without changing the outcome, and bounding the count keeps the worst-case cost of any query predictable.
- The retry re-enters at drafting, not retrieval
  A failed draft goes back to the synthesiser, not all the way back to retrieval. The cost is that I can’t recover from a bad retrieval this way. If the wrong statute sections were fetched, re-drafting won’t fix it. I accepted that because the failures I actually saw were drafting errors, misreading or overstating the law, rather than evidence errors, so re-running retrieval would just re-fetch chunks that were already fine.
- Deterministic rules bookend the LLM judge
  Surrounding the judge with hard-coded checks is more to write and maintain than just asking a model to review everything. Worth it, because rules can’t hallucinate a false pass. The bookends have no failure mode of their own, so a bad call from the judge can’t on its own push an answer out the door.
- Escalation by pattern match, not a model
  A regex instead of a classifier is brittle, and new ways of phrasing a request for personal advice will slip through. But it costs nothing, adds no latency, and fails safe when it’s wrong. I’d rather over-flag than let one slip past.
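The retry cap and the re-entry point from the list above can be sketched as one routing function. `increment_retry` is named later in this post; the `retry_count` field, the route labels, and `MAX_RETRIES` are assumptions.

```python
MAX_RETRIES = 1  # one retry, so at most two drafting attempts per query


def route_after_checks(state: dict) -> str:
    """Pick the next step from state alone."""
    if not state["violations"]:
        return "end"
    if state["retry_count"] < MAX_RETRIES:
        return "increment_retry"  # loops back to the synthesiser, not retrieval
    return "end"  # out of attempts; the caller fails closed


def increment_retry(state: dict) -> dict:
    # Bump the counter and clear violations before the next draft.
    return {"retry_count": state["retry_count"] + 1, "violations": []}
```

The worst-case cost falls out of the constant: with `MAX_RETRIES = 1`, no path through this function reaches the synthesiser a third time.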
What I’d do differently
This is a pilot, not a finished system, and writing it up made the gaps a lot easier to see. Four things I’d change first.
- Make the retry informed
  Right now increment_retry wipes the violations and sends the synthesiser back at the same prompt and the same sections, without ever telling it what went wrong. The second attempt is a re-roll rather than a correction. This is probably the single biggest weakness, and almost certainly why retries past the first one stopped helping: there was no new information to converge on. I’d feed the specific violation reasons back into the next draft so the model knows exactly what to fix.
- Keep every policy string in one place
  The disclaimer text and the escalation phrases live across the router, the synthesiser, and the supervisor, and they’ve already drifted. The supervisor only recognises the English disclaimer, so a perfectly good Bahasa Malaysia answer can fail closed, and the escalation patterns differ between the two nodes that define them. Any rule defined in more than one place is a calibration bug waiting to happen. I’d derive them all from one shared module so the drafter and the checker can’t disagree.
- Verify every claim, not just the cited ones
  The grounding judge only inspects claims attached to a citation that was both named in the draft and present in what was retrieved. An assertion with no citation attached never gets checked, since the draft only has to clear the “at least one citation somewhere” rule. For a tool whose entire job is to avoid stating unsupported law, that’s a real hole. I’d ground each claim on its own, whether or not the model chose to cite it.
- Give escalation more than a regex
  I’m upfront that pattern-matching is brittle, and it earns its place as a fast, safe-failing first pass. But the handful of phrases it knows cover a thin slice of how people actually ask for personal advice, and anything outside the list flows straight into the full pipeline. I’d add a cheap classifier behind the regex: the regex still runs first as a free, instant filter for the obvious phrasings, and the classifier catches the requests worded in ways the regex doesn’t know. Right now the regex is the only thing guarding this exit, so anything it misses sails through.
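For the first item in that list, one possible shape for an informed retry is a prompt builder that appends the previous draft's violations. This is a sketch of the idea, not something that exists in the project; the function name and the feedback wording are assumptions.

```python
def build_retry_prompt(base_prompt: str, violations: list[str]) -> str:
    """Append the previous draft's violations so the retry is a correction."""
    if not violations:
        return base_prompt
    feedback = "\n".join(f"- {v}" for v in violations)
    return (
        f"{base_prompt}\n\n"
        "Your previous draft was rejected for these reasons. Fix each one:\n"
        f"{feedback}"
    )


first_attempt = build_retry_prompt("Answer from the retrieved sections only.", [])
second_attempt = build_retry_prompt(
    "Answer from the retrieved sections only.",
    ["missing_disclaimer", "unsupported claim: Section 90A applies to oral evidence"],
)
```

For this to work, the violations would have to survive the retry step rather than being wiped, for example by copying them to a separate field before the reset.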
If there’s a single idea underneath all of this, it’s that orchestration is where a safety-critical agent actually lives or dies. The model drafts the answer, but whether you can trust the result comes down to the graph around it: what runs, in what order, and what gets to decide whether an answer goes out at all.
References
- Anthropic’s “Building Effective Agents”: source for the routing, prompt chaining, and evaluator-optimizer pattern names
- LangGraph Graph API: StateGraph, nodes, conditional edges
- Code: agent/graph.py, agent/state.py, agent/query_lifecycle.py, agent/nodes/*.py