I spent a long time comfortable with synchronous APIs: the request starts, work finishes, the response goes out. Moving into agentic systems (workflow runners, chat with tools, jobs that resume after a crash) taught me that the comfortable model breaks quickly. The demo path is smooth; the real pain is in retries, duplicate messages, and workers that stop halfway through a step.
What I have learned is that the model and its tools are not in the same atomic unit as the HTTP handler anymore. Work stretches across queues, streams, and sometimes long stretches of clock time. If you assume one request equals one effect, you will eventually double-run something expensive or irreversible.
How I think about a “unit” of work
These days I try to give each logical step a stable idempotency key built from inputs that mean “same intent,” not “same click.” I cache or persist the outcome under that key before I let billing, email, or other side effects fire. The sketch below is simplified, but it captures the shape I keep returning to:
```python
def run_tool_step(step_id: str, intent_hash: str, execute) -> dict:
    key = f"{step_id}:{intent_hash}"
    if cached := outcome_store.get(key):
        return cached  # safe if the message is delivered twice
    result = execute()
    outcome_store.put(key, result)
    return result
```
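The sketch leaves `outcome_store` abstract. As a minimal stand-in (the `InMemoryOutcomeStore` name and the `charge_customer` side effect are mine, purely for illustration; production needs a durable store that survives restarts), an in-memory version shows the contract: `get` returns a prior outcome or `None`, and `put` records the first result for a key:

```python
import threading


class InMemoryOutcomeStore:
    """Hypothetical stand-in; a real system wants a durable store."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def get(self, key):
        with self._lock:
            return self._data.get(key)

    def put(self, key, value):
        with self._lock:
            # First write wins: a late duplicate cannot overwrite the outcome.
            self._data.setdefault(key, value)


outcome_store = InMemoryOutcomeStore()
calls = {"count": 0}


def charge_customer():
    calls["count"] += 1  # stands in for an irreversible side effect
    return {"status": "charged"}


def run_tool_step(step_id, intent_hash, execute):
    key = f"{step_id}:{intent_hash}"
    if cached := outcome_store.get(key):
        return cached
    result = execute()
    outcome_store.put(key, result)
    return result


first = run_tool_step("step-1", "abc123", charge_customer)
second = run_tool_step("step-1", "abc123", charge_customer)  # duplicate delivery
```

Note the remaining race: two concurrent consumers could both pass the `get` check and both call `execute()`. A durable store closes that window with an atomic claim (insert-if-absent, or a transaction), which this toy version only gestures at with `setdefault`.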
Building intent_hash carefully matters more than I expected at first. I include the pieces that would make a silent repeat wrong—for example the serialized tool arguments, model identifier, and guardrails on randomness. When I have skimped on that hash, I have either seen duplicate charges in tests or, worse, flaky behaviour that was hard to reproduce because the second run looked “successful” but was actually wrong.
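One way to build that hash (the function name and field set here are my own convention, not a standard): serialize the intent-defining inputs as canonical JSON so dict ordering cannot change the result, then hash it.

```python
import hashlib
import json


def intent_hash(tool_name, tool_args, model_id, temperature):
    # sort_keys + fixed separators give a canonical serialization,
    # so {"a": 1, "b": 2} and {"b": 2, "a": 1} hash identically.
    payload = json.dumps(
        {
            "tool": tool_name,
            "args": tool_args,
            "model": model_id,
            "temperature": temperature,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()
```

The same intent always maps to the same key, while changing anything that would make a repeat wrong (a different model, a different temperature) produces a new key and therefore a fresh run.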
Messaging and duplicate delivery
When a producer retries, the consumer may see the same logical message twice. If your handler says “run this agent step again,” you need either the idempotency layer above or a small state machine that marks a step as terminal before you acknowledge the message. I have watched teams “fix” duplication by turning off retries; it rarely fixes the underlying issue—it usually surfaces later as lost work or angry users.
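A minimal version of that state machine might look like the sketch below (`StepConsumer` and the state names are illustrative; in practice the state table lives in durable storage and the transition to DONE happens in the same transaction as recording the outcome):

```python
from enum import Enum


class StepState(Enum):
    PENDING = "pending"
    RUNNING = "running"
    DONE = "done"


class StepConsumer:
    def __init__(self):
        self.states = {}   # step_id -> StepState; a durable table in practice
        self.handled = []  # records actual executions, for illustration

    def handle(self, message):
        step_id = message["step_id"]
        if self.states.get(step_id) is StepState.DONE:
            return "ack"  # duplicate delivery: acknowledge without re-running
        self.states[step_id] = StepState.RUNNING
        self.handled.append(step_id)  # the real agent step would run here
        # Mark terminal *before* acknowledging, so a crash between these two
        # lines causes a redelivery that is then recognized as a duplicate.
        self.states[step_id] = StepState.DONE
        return "ack"
```

Delivering the same message twice then acknowledges twice but executes once, which is the behavior you actually want from at-least-once delivery.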
Practices that helped me
What I try to do consistently (and still forget sometimes) is:
- Log step boundaries with a correlation id that can survive a process restart, so support or I can trace one run end to end.
- Split “we accepted the work” from “all downstream effects completed”—those are different promises.
- Treat timeouts as a normal path: store partial traces, and design resume so the same keys apply when the job tries again.
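To make the first and third bullets concrete, here is one sketch of a checkpoint that carries the correlation id and completed steps across a restart (the field names are my own; the only point is that resume reuses the same id and the same step keys):

```python
import json
import uuid


def new_run():
    return {"correlation_id": uuid.uuid4().hex, "completed_steps": []}


def checkpoint(run):
    # In practice this blob goes to durable storage, not a local variable.
    return json.dumps(run)


def resume(blob):
    # Same correlation id, same completed-step keys: logs from before and
    # after the restart trace as one run, and finished steps are skipped.
    return json.loads(blob)


run = new_run()
run["completed_steps"].append("step-1")
blob = checkpoint(run)
# ...process crashes and restarts...
resumed = resume(blob)
```

Because the resumed run sees `"step-1"` as already completed, retrying the job replays only the remaining steps, and the idempotency keys from earlier still guard any step that does get re-attempted.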
Agent demos tend to highlight the smooth conversation. In day-to-day work, I have found that idempotency, sensible backoff, and clear observability take most of the calendar—not because the model is dull, but because production is messy. I hope this framing helps if you are wiring something similar.