Async Boundaries: Django, Queues, and Agentic Workflows in Production

In the workflow-heavy AI products I have worked on, much of the user-visible surface is execution: runs that span documents, chat, and tool-calling loops, and that must resume after interruptions. Coming from a background where most work fit inside a single request, I had to rethink where state lived.

What worked better for me over time was to keep HTTP thin: start a run, expose status, authenticate consistently. The real work goes to worker processes that can retry without assuming the user’s browser is still open. Redis and Kafka showed up often, not because they are fashionable, but because they solve different halves of the same problem: Redis for fast coordination, Kafka for durable facts that other services must see.
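To make the split concrete, here is a minimal sketch using only the standard library. Everything in it is a stand-in I made up for illustration: RUNS would be a database table, JOBS would be a real broker, and start_run and get_status are what the HTTP views would call. It is not how any particular service of mine was written.

```python
import queue
import uuid

# Hypothetical in-memory stand-ins: a real system would keep run state in a
# database and hand jobs to a broker rather than a process-local queue.
RUNS = {}
JOBS = queue.Queue()


def start_run(payload):
    """What the HTTP view would do: record the run, enqueue it, return fast."""
    run_id = str(uuid.uuid4())
    RUNS[run_id] = {"status": "queued", "payload": payload, "result": None}
    JOBS.put(run_id)
    return run_id


def get_status(run_id):
    """What the status endpoint would return; no work happens here."""
    run = RUNS[run_id]
    return {"status": run["status"], "result": run["result"]}


def worker_loop():
    """Runs outside the request cycle; a None job is the shutdown sentinel."""
    while True:
        run_id = JOBS.get()
        if run_id is None:
            break
        run = RUNS[run_id]
        run["status"] = "running"
        try:
            # Placeholder for the real work (model calls, tool loops, ...).
            run["result"] = run["payload"].upper()
            run["status"] = "succeeded"
        except Exception as exc:
            run["status"] = "failed"
            run["result"] = repr(exc)
```

The point of the shape is that start_run returns before any work is done, and the client learns the outcome by polling get_status, so a closed browser tab changes nothing for the worker.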

Habits that reduced pain

I cannot claim we never shipped a bug, but a few patterns helped:

Make individual workflow steps idempotent where possible. Retries become much less scary.
Emit structured logs or metrics per stage, rather than only a generic 500 when something fails late in a chain. Debugging “step seven of nine” without that was miserable for me and for whoever was on support.
Keep auth the same whether the entry point is HTTP, a worker, or a WebSocket. Mixed identity rules caused subtle production issues I still remember.
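The first two habits can be sketched together. This is a simplified illustration, assuming a hypothetical COMPLETED store keyed by run and step name; in production that would be a database table or similar, and the sketch does not cover a crash between running a step and recording it.

```python
import json
import logging

logger = logging.getLogger("workflow")

# Hypothetical record of finished steps, keyed by (run_id, step_name).
COMPLETED = {}


def _log(run_id, step_name, outcome, **extra):
    """Emit one JSON line per stage so a late failure is searchable."""
    record = {"run_id": run_id, "step": step_name, "outcome": outcome, **extra}
    logger.info(json.dumps(record))


def run_step(run_id, step_name, func, *args):
    """Run a step once per run; a retry returns the stored result instead."""
    key = (run_id, step_name)
    if key in COMPLETED:
        _log(run_id, step_name, "skipped")
        return COMPLETED[key]
    try:
        result = func(*args)
    except Exception as exc:
        _log(run_id, step_name, "failed", error=repr(exc))
        raise
    COMPLETED[key] = result
    _log(run_id, step_name, "succeeded")
    return result
```

With this shape, retrying a whole run after a failure at step seven re-executes only the steps that never completed, and the log shows which step failed and why.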

What usually broke first

In my experience, agent-style systems failed in ordinary ways long before any theatrical failure mode: timeouts, partial tool payloads, poison messages sitting in a queue. The work was often to make those cases visible and recoverable, not to chase novelty in the model layer.
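As a rough sketch of what "visible and recoverable" can mean for poison messages and partial tool payloads, the snippet below caps retries and parks repeat failures in a separate queue. The message shape, MAX_ATTEMPTS, and the handler are assumptions for illustration; real brokers usually offer their own retry and dead-letter mechanisms.

```python
import queue

MAX_ATTEMPTS = 3  # assumed retry budget, tune per workload


def handle_tool_result(body):
    """Reject partial tool payloads explicitly instead of failing downstream."""
    missing = {"tool", "output"} - set(body)
    if missing:
        raise ValueError(f"partial tool payload, missing: {sorted(missing)}")
    return body["output"]


def drain(jobs, dead_letters, handler):
    """Process every message; park repeat failures rather than loop forever."""
    while True:
        try:
            message = jobs.get_nowait()
        except queue.Empty:
            return
        try:
            handler(message["body"])
        except Exception as exc:
            message["attempts"] += 1
            message["last_error"] = repr(exc)
            if message["attempts"] >= MAX_ATTEMPTS:
                # Poison message: keep it where a human can inspect it.
                dead_letters.put(message)
            else:
                jobs.put(message)  # goes to the back of the queue for a retry
```

A parked message keeps its attempt count and last error, which is usually enough to decide whether to fix the payload and replay it or drop it.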

Django in the picture

Some of these services sat in Django and Django REST land, with background work in queue clusters, plus channel layers where real-time updates mattered. The framework was not the main story; the boundary between sync and async was. I wrote this note for anyone who, like past me, is tempted to stuff an entire agent run into a view and hope for the best.

If you are wiring something similar, I hope dividing initiation, execution, and observability saves you some of the weekends I spent tracing half-finished runs.