Kafka, Redis, and Clean Hand-offs Between Sync APIs and Workers

Using pub/sub for immediacy and topics for durability when workflow state leaves the request thread.

  • kafka
  • redis
  • microservices
  • messaging

On projects where the surface is split between HTTP, real-time channels, and long-running work, I keep rediscovering the same lesson: each tool has a sweet spot, and mixing them carelessly creates the kind of bugs that only show up under load.

My mental model—and I should say I did not invent it, I picked it up from colleagues and from cleaning up my own mistakes—is roughly this: HTTP and WebSockets handle who started what and what the user sees right now. Redis is often right for low-latency signals: pub/sub, short-lived cache, sometimes Streams if you want a light consumer-group story. Kafka earns a place when you need durable facts that fan out to several services, or when you must be able to replay and keep a clear ordering story for related events.

When Redis felt like enough

For long jobs, I have often streamed progress with PUBLISH on a channel keyed by a run_id the client already holds from the POST that kicked off the work—embed finished, chunk k of n, indexer lag, that sort of thing.


// Fire-and-forget: only clients subscribed at this moment receive it.
await redis.publish(
  `run:${runId}:events`,
  JSON.stringify({ stage: "embed", at: Date.now(), ok: true })
);

The UI can subscribe over SSE or WebSocket and stay in sync. One thing I learned the hard way: pub/sub is not a log. If the browser disconnects, I need either a snapshot endpoint or the last-known state stored somewhere durable enough to rehydrate the progress bar. Otherwise users refresh and think the job vanished.

When we reached for Kafka

There were times when an event had to matter to billing, analytics, another team’s service, or an audit trail. Casually losing or reordering those messages was not acceptable. Partitioning by something stable like workflow_id kept related updates in order, which made reconciling “credits moved” with “tool step finished” less painful.

A minimal producer shape looked like this in one of our services (settings and library details omitted on purpose):

producer.send(
    "workflow.stage.completed",
    key=workflow_id,  # partition key: one workflow's events stay in order
    value={
        "workflow_id": workflow_id,
        "stage": "vector_index",
        "status": "ok",
        "credits_estimate": 12,
    },
)
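For the settings that shape above omits, here is one plausible filling-in, assuming the kafka-python client; the broker address, retry count, and helper names are illustrative, not the original service's configuration. Serializing and keying on workflow_id is what routes all of a workflow's events to one partition, which is where the ordering guarantee actually comes from:

```python
import json

def serialize_key(key: str) -> bytes:
    return key.encode("utf-8")

def serialize_value(value: dict) -> bytes:
    return json.dumps(value).encode("utf-8")

def make_producer(brokers: list):
    # Imported here so the serializers above stay usable without a broker.
    from kafka import KafkaProducer
    return KafkaProducer(
        bootstrap_servers=brokers,
        key_serializer=serialize_key,
        value_serializer=serialize_value,
        acks="all",  # wait for in-sync replicas; these events feed billing
        retries=5,
    )

# Usage, against a hypothetical local broker:
# producer = make_producer(["localhost:9092"])
# producer.send("workflow.stage.completed", key="wf-123",
#               value={"workflow_id": "wf-123", "stage": "vector_index"})
# producer.flush()  # don't drop buffered events on shutdown
```

acks="all" trades a little latency for not silently losing an event the moment a broker fails over, which matches the "do we lose money or trust" bar below.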

Choosing without over-engineering

I still ask two practical questions before I commit:

  • If this Redis channel blips, can the product recover gracefully (snapshot, polling, replay from DB)?
  • If this Kafka topic is missing, do we lose money or trust?

If the first is yes and the second is no, Redis is often enough. If multiple systems must consume the same fact reliably, or we need replay after an outage, Kafka tends to justify its operational cost. I have also seen the opposite mistake—running Kafka for traffic that belonged in metrics or logs—which I try to avoid; it is expensive in time and dollars.

None of this is flashy. It is the kind of plumbing I wish I had mapped more clearly on my first async systems. Sharing the split in case it saves you a few late nights.