What my agent remembers

I’m building a legal research assistant for Malaysian law. You ask it a question, it searches the actual Acts, and it answers with citations. Early on it had a problem that made it feel a lot dumber than it actually was: it couldn’t hold a conversation.

This post is about how I fixed that, and more specifically about how I now think about memory in an agent like this. I’m going to talk about two things: a LangGraph checkpointer, which is what makes a conversation persist, and the split between what gets stored and what the model actually sees on any given turn.

The assistant had amnesia

Say you ask “what does the Evidence Act say about electronic records?”, get an answer, and then ask the obvious follow-up: “what about criminal cases?”

On its own, “what about criminal cases?” means nothing. There’s no Act in it, no topic, nothing to search on. The only thing that makes it a real question is the turn that came before it. So the model has to be able to see that earlier turn, or the follow-up falls apart.

The trouble was that my server had no memory of its own. Every request hit it like the first one. The only reason multi-turn worked at all was that the frontend held the whole conversation in React state and attached the entire history to every request. So the real memory lived in one browser tab. Refresh the page and it was gone. Open the conversation on another device and the server had no idea it had ever happened, because that tab was the one place the history existed. And the server was leaning on the client to hand back the full conversation, complete and correct, on every single request, for something the server itself should have been keeping. That felt backwards, so I moved the memory onto the server, where it belongs. That’s what a checkpointer is for.

What’s actually in a checkpoint

The way I think about a checkpointer now is the patient file at a doctor’s office.

Every time you come in, the doctor pulls your file, reads what happened last visit, adds today’s notes, and files it back. You don’t re-explain your whole history from scratch each time. It doesn’t even have to be the same doctor, because the memory lives in the file rather than in any one doctor’s head, so anyone in the practice can pick it up and know where things stand. Your file is yours and never gets mixed up with the patient in the next room. And it doesn’t get thrown away.

That’s almost exactly what a checkpointer does. Your file is the conversation, the doctor reading-then-adding is what happens on each turn, and the name on the file is what keeps your conversation separate from everyone else’s.

A bit of vocabulary so the rest of this is clear: a checkpoint is the file’s current contents, one snapshot of the conversation so far. The checkpointer is the filing system that stores those snapshots and hands the right one back when asked (in my case a Postgres-backed store in production, and an in-memory one for local development and tests). The checkpointer saves checkpoints. That’s the whole relationship.

Concretely, in LangGraph, every conversation has a thread_id, and I pass it in through a config object:

config = {"configurable": {"thread_id": thread_id}}
graph.invoke({"query": query}, config)

When I call the graph with that thread_id, LangGraph loads the latest checkpoint for that thread first, then merges my new input on top. So the state I get back already has the whole conversation in it, every prior turn accumulated into one running history, even though I only sent the current question. I pass {"query": ...} and nothing else, because the history is already there waiting.
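To make the load-then-merge behaviour concrete, here is a toy model of it in plain Python. It is not LangGraph’s implementation or API: ToyCheckpointer, its load and save methods, and this invoke function are hypothetical names for illustration, and unlike a real checkpointer it keeps only the latest snapshot per thread.

```python
from copy import deepcopy


class ToyCheckpointer:
    """Toy stand-in for a checkpointer: one latest snapshot per thread_id."""

    def __init__(self):
        self._latest = {}  # thread_id -> most recent state snapshot

    def load(self, thread_id):
        # A thread with no checkpoint yet starts from an empty history.
        return deepcopy(self._latest.get(thread_id, {"history": []}))

    def save(self, thread_id, state):
        self._latest[thread_id] = deepcopy(state)


def invoke(checkpointer, new_input, config):
    """Load the thread's snapshot, merge the new input on top, run, save."""
    thread_id = config["configurable"]["thread_id"]
    state = checkpointer.load(thread_id)
    state.update(new_input)  # the caller supplies only the current question
    answer = f"(answer to: {state['query']})"  # placeholder for the real agent
    state["history"] = state["history"] + [(state["query"], answer)]
    checkpointer.save(thread_id, state)
    return state


checkpointer = ToyCheckpointer()
config_a = {"configurable": {"thread_id": "thread-a"}}
config_b = {"configurable": {"thread_id": "thread-b"}}

invoke(checkpointer, {"query": "Evidence Act and electronic records?"}, config_a)
state_a = invoke(checkpointer, {"query": "what about criminal cases?"}, config_a)
state_b = invoke(checkpointer, {"query": "an unrelated question"}, config_b)
```

After these calls, state_a carries both turns of thread-a even though the second call sent only the follow-up, and state_b holds just its own single turn.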

Wiring it into the graph

There’s a catch that took me a moment to see. Once state persists across turns, everything persists across turns, not just the bits you want.

My agent’s state has a field for the conversation history, but it also holds a bunch of temporary values that only matter while it’s answering the current question: the law sections it just retrieved, the draft answer, a counter for how many times it has retried, and so on. These are the working notes for one question, the kind of thing you’d normally throw away once you’ve answered it. If they carry over into the next turn instead, turn two starts out wearing turn one’s leftovers.

The retry counter is the one that actually bit me. When the agent produces an answer that fails a compliance check, it gets sent back to try again, and a counter tracks how many times it has done that. It’s allowed exactly one retry. Here’s the bug. Because the whole state carried over, that counter carried over with it. Turn one uses its retry, the counter ticks up to one, and then it just stays there. Turn two opens with the counter already sitting at the limit, so the first time turn two needs a retry, the agent thinks it has already used it and gives up. The safety net only ever worked on the very first question of a conversation.
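The failure is easy to reproduce in a few lines. This is a minimal sketch under assumed names (MAX_RETRIES, run_turn and retry_count are illustrative, not the agent’s real code), with the compliance check reduced to a number of simulated failures.

```python
MAX_RETRIES = 1  # the agent is allowed exactly one retry


def run_turn(state, failures):
    """Simulate one turn whose draft fails the compliance check `failures` times.

    Returns "answered" if the agent got through, "gave up" if it ran out of retries.
    """
    for _ in range(failures):
        if state["retry_count"] >= MAX_RETRIES:
            return "gave up"
        state["retry_count"] += 1  # spend the retry and try again
    return "answered"


# The same state is reused across turns, which is what persistence does.
state = {"retry_count": 0}
turn_one = run_turn(state, failures=1)  # spends its one retry
turn_two = run_turn(state, failures=1)  # counter is already at the limit
```

Turn one gets its retry and answers. Turn two fails once and gives up immediately, because nothing reset the counter in between.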

So the graph has two small bookkeeping nodes:

A start_turn node at the very beginning that resets all the per-question fields back to empty. History is the one thing it deliberately leaves alone.
A record_turn node at the very end that appends this turn’s question and answer to the history.

Only history survives the reset. Everything else (the retrieved sections, the draft, the retry counter) is wiped clean at start_turn, so each turn starts without the previous turn’s leftovers. History is also the only field set up to accumulate. LangGraph does this with a reducer, a function that decides how a node’s update is combined with the value already in state; for history that function appends, while every other field is simply overwritten. Recording at the end matters too: during the turn, the history holds prior turns only, so the current question doesn’t show up twice in the prompt.
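Here is a rough sketch of the two bookkeeping nodes, written as plain functions that return partial state updates. The field names (retrieved_sections, draft_answer, retry_count) and the apply_update helper are assumptions for illustration; apply_update only imitates the idea of per-field reducers and is not LangGraph’s own merge code.

```python
import operator

# Hypothetical per-field reducers: history appends, every other field overwrites.
REDUCERS = {"history": operator.add}


def apply_update(state, update):
    """Merge a node's partial update into the state, field by field."""
    merged = dict(state)
    for key, value in update.items():
        reducer = REDUCERS.get(key)
        merged[key] = reducer(state.get(key, []), value) if reducer else value
    return merged


def start_turn(state):
    # Reset the per-question fields; history is left out on purpose.
    return {"retrieved_sections": [], "draft_answer": "", "retry_count": 0}


def record_turn(state):
    # Append this turn's question and final answer to the running history.
    return {"history": [{"question": state["query"], "answer": state["draft_answer"]}]}


# State as it might look when loaded from the checkpoint at the start of turn two.
state = {
    "history": [{"question": "q1", "answer": "a1"}],
    "retrieved_sections": ["stale section"],
    "draft_answer": "a1",
    "retry_count": 1,
    "query": "q2",
}
state = apply_update(state, start_turn(state))
state = apply_update(state, {"draft_answer": "a2"})  # stand-in for the real work
state = apply_update(state, record_turn(state))
```

Because start_turn never mentions history, the reset cannot touch it, and because record_turn returns only the new entry, the reducer appends it rather than replacing what was there.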

Store everything, show a slice

I store the entire conversation forever, but I only ever send the model a recent slice of it.

Those are two different layers, and it’s worth keeping them apart in your head. The checkpoint is the durable record. It only grows, it’s never edited, and for a legal tool that record is also an audit artifact I’m not willing to destroy. What the model sees on any given turn is a much smaller thing: the most recent turns that fit inside a token budget. The doctor doesn’t reread your whole file every visit either, just the recent notes.

[Diagram: stored versus sent]

Stored (the full checkpoint, every turn; append-only, never edited):
turn 1: 600 tok
turn 2: 1,500 tok
turn 3: 900 tok
turn 4: 1,100 tok
turn 5 (newest): 800 tok

Sent to the model (the recent slice, after a read-time trim to the budget):
turn 1: not sent
turn 2: not sent
turn 3: 900 tok
turn 4: 1,100 tok
turn 5 (newest): 800 tok

Total: 2,800 of a 4,000-token budget, 3 turns kept.

A turn here means one exchange: your question plus the assistant’s reply, counted as a unit. The trimming walks the stored turns from newest to oldest, adding each one’s token count to a running total, and stops as soon as the next turn would push the total past 4,000. Say the last five turns are 600, 1,500, 900, 1,100 and 800 tokens. Walking back from the newest: 800 + 1,100 + 900 = 2,800. Adding the next one would bring it to 4,300, over budget, so the walk stops there. The two oldest turns stay in the checkpoint forever, but they never reach the model. The one exception is a floor: if the newest turn alone blew past 4,000, it still gets sent, because a follow-up question is meaningless without the turn it refers back to.
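The walk described above fits in a short function. This is a sketch rather than the production code: trim_to_budget is a hypothetical name, and it assumes each turn already comes with a token count attached.

```python
def trim_to_budget(turns, budget=4000):
    """Return the newest turns whose token counts fit within the budget.

    `turns` is ordered oldest to newest; each item is (label, token_count).
    The input list is not modified.
    """
    kept = []
    total = 0
    for turn in reversed(turns):
        tokens = turn[1]
        # Floor: the newest turn is always kept, even if it alone exceeds the budget.
        if kept and total + tokens > budget:
            break
        kept.append(turn)
        total += tokens
    kept.reverse()  # restore oldest-to-newest order for the prompt
    return kept, total


stored = [
    ("turn 1", 600),
    ("turn 2", 1500),
    ("turn 3", 900),
    ("turn 4", 1100),
    ("turn 5", 800),
]
sent, used = trim_to_budget(stored)
```

With the five turns from the example, sent holds turns 3 to 5 and used is 2,800, while stored still has all five entries. A single oversized turn, such as one of 5,000 tokens, is returned on its own because of the floor.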

So why keep all of it if the model only ever sees part? Because trimming is a filter applied on the way out, not a delete. It runs at read-time, every time a node is about to call the model, on a copy of the history. The stored conversation never changes. That means I can raise the budget next week and last month’s turn is right there, untouched. If I’d saved only the trimmed version, that context would be gone for good, and I’d have no way to get it back. Keeping everything keeps my options open; deleting throws them away. That’s the entire reason for the split.

It also keeps memory honest in a smaller way: the turn I record at the end is the exact answer the user was shown, so what the conversation remembers and what the user actually saw can’t drift apart.

Trade-offs I made on purpose

A few things here I picked deliberately, and one I’m putting off for now.

The trimming used to keep the last few turns rather than work to a token budget, and I switched because turns are wildly uneven in size. A turn can be five tokens when someone just says “yes”, or two thousand for a statute-heavy answer, so counting turns barely controlled the thing I actually care about, which is how many tokens I’m paying to send the model on every request. Budgeting by tokens caps exactly that. I didn’t pull the 4,000 figure out of the air either. I tuned it with a small eval and found that 2,000 was too tight, because it dropped a reference from a few turns back and the follow-up then resolved to the wrong topic.

To count those tokens I use tiktoken, OpenAI’s tokenizer, with its o200k_base encoding, as a single rough proxy, even though my models come from a few different providers that each count a little differently. Trimming is allowed to be approximate, and a local count is instant, identical every time, and needs no network call. Getting the exact per-provider count would buy me precision I’d never actually use.

The thing I’m putting off is summarizing old turns instead of dropping them, which is the obvious move once conversations get really long. I haven’t built it, and I’ve set a clear bar for when I will: a case I can actually reproduce where even a generous budget drops something the model genuinely needed. A general sense that conversations are getting long doesn’t clear that bar. And when I do build it, the summary has to sit alongside the raw history rather than replace it, because the full record has to stay intact.

If you’re building something similar, the one idea I’d take from all this is to keep two jobs separate: storing the whole conversation, and choosing how much of it the model sees. Keep them apart and you can always change your mind later about how much to show, without having thrown anything away.