I have helped build node-based workflow editors—the kind where users wire boxes together for automation or agent flows. The canvas looks like a front-end problem, and it is one, but the part that kept me up at night was meaning: when someone draws an edge from A to B, what exactly runs next, in which process, and what should happen if step B times out?
Persisting intent, not pixels
We asked clients to send patches; the server stored nodes, edges, and version metadata. The runner never read SVG coordinates; it needed a typed graph. A compressed form looked like this in TypeScript:
```ts
type WorkflowNode =
  | { id: string; kind: "trigger.webhook"; config: WebhookCfg }
  | { id: string; kind: "agent.llm"; config: ModelCfg & { tools: string[] } }
  | { id: string; kind: "branch.if"; config: PredicateRef }
  | { id: string; kind: "http.request"; config: HttpCfg };

type WorkflowEdge = { from: string; to: string; label?: "true" | "false" };
```
Branches were not hacks on top of a linear list—they were edges with labels the executor understood. That made debugging easier because “what the graph means” lived in data I could log and diff.
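To make that concrete, here is one way an executor can resolve labeled edges; `nextNodeIds` is a hypothetical helper of mine, not code from the real system, and the edge type is repeated so the sketch is self-contained:

```typescript
type WorkflowEdge = { from: string; to: string; label?: "true" | "false" };

// Given all edges and the id of a node that just finished, pick the next
// node ids. For a branch node, only edges whose label matches the predicate
// result fire; unlabeled edges always fire.
function nextNodeIds(
  edges: WorkflowEdge[],
  finished: string,
  branchResult?: boolean
): string[] {
  return edges
    .filter((e) => e.from === finished)
    .filter((e) =>
      e.label === undefined ? true : e.label === String(branchResult)
    )
    .map((e) => e.to);
}
```

Because the branch semantics live in plain data, the same filter that drives execution can drive logging and diffing.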
Plan versus run
I found it helpful to keep two concepts separate:
- A plan is what you get from a frozen version of the graph—deterministic given that snapshot.
- A run carries state: variables, tool outputs, partial traces, failure markers.
That separation let us replay in staging after changing a node implementation, and it opened the door to “rerun from here” without asking the user to redraw anything. When I blurred plan and run in earlier thinking, staging and production disagreed in ways that were painful to trace.
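A minimal sketch of that separation, with type and field names of my own invention rather than the original schema:

```typescript
// A plan is immutable: a frozen snapshot of the graph plus its version.
type Plan = {
  workflowId: string;
  version: number;
  nodes: ReadonlyArray<{ id: string; kind: string }>;
  edges: ReadonlyArray<{ from: string; to: string }>;
};

// A run references a plan version and accumulates mutable state.
type Run = {
  runId: string;
  planVersion: number;
  variables: Record<string, unknown>;
  outputs: Record<string, unknown>; // nodeId -> tool/LLM output
  failures: Record<string, string>; // nodeId -> failure marker
};

// "Rerun from here": start a fresh run against the same plan version,
// keeping only the outputs of nodes that completed before that point.
function rerunFrom(run: Run, completedBefore: string[]): Run {
  const kept = Object.fromEntries(
    Object.entries(run.outputs).filter(([id]) => completedBefore.includes(id))
  );
  return {
    ...run,
    runId: run.runId + ":rerun",
    outputs: kept,
    failures: {}, // cleared: downstream nodes will be re-executed
  };
}
```

Replay in staging is the same idea in reverse: fix the node implementation, keep the plan version, and feed the run's recorded inputs back through.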
Integrations and failure modes
Once the palette grew to many node types talking to external APIs, a few habits mattered on the platform side:
- Timeouts per node family, so one slow vendor could not swallow a whole run.
- Secrets injected at execution time, never pasted into serialized graph JSON.
- A workflow id (and trace ids) that flowed into Kafka or task queues so we could follow one execution across services.
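The per-family timeout habit can be sketched with an `AbortController` race; the family names and millisecond budgets below are illustrative, not the real configuration:

```typescript
// Illustrative timeout budgets per node family (ms).
const TIMEOUTS_MS: Record<string, number> = {
  "agent.llm": 60_000,
  "http.request": 10_000,
  "branch.if": 1_000,
};

class NodeTimeoutError extends Error {
  constructor(public nodeId: string, public budgetMs: number) {
    super(`node ${nodeId} exceeded ${budgetMs}ms`);
  }
}

// Race the node's work against its family budget, so one slow vendor
// fails that single node instead of swallowing the whole run.
async function withNodeTimeout<T>(
  nodeId: string,
  family: string,
  work: (signal: AbortSignal) => Promise<T>
): Promise<T> {
  const budget = TIMEOUTS_MS[family] ?? 30_000;
  const ctrl = new AbortController();
  const timer = setTimeout(() => ctrl.abort(), budget);
  try {
    return await Promise.race([
      work(ctrl.signal),
      new Promise<never>((_, reject) =>
        ctrl.signal.addEventListener("abort", () =>
          reject(new NodeTimeoutError(nodeId, budget))
        )
      ),
    ]);
  } finally {
    clearTimeout(timer);
  }
}
```

Passing the signal into `work` matters: the HTTP call should actually be cancelled, not just abandoned while it keeps a connection open.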
I still think the hardest user question is an honest one: what do we show when Gmail (or any third party) rate-limits in the middle of a flow? My goal has been clear messages and recoverable states, not a spinner that pretends nothing failed.
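One shape for "clear and recoverable": record the failure with enough context to resume, then derive the user-facing message from it. The error taxonomy here is my own sketch, not the real system's:

```typescript
// A structured failure marker stored on the run, per node.
type NodeFailure =
  | { kind: "rate_limited"; retryAfterMs: number; provider: string }
  | { kind: "timeout"; budgetMs: number }
  | { kind: "fatal"; message: string };

// Turn a failure into the message and recovery option the UI should show,
// instead of a spinner that pretends nothing failed.
function describeFailure(f: NodeFailure): { message: string; canResume: boolean } {
  switch (f.kind) {
    case "rate_limited":
      return {
        message: `${f.provider} is rate-limiting; retry in ${Math.ceil(
          f.retryAfterMs / 1000
        )}s`,
        canResume: true,
      };
    case "timeout":
      return {
        message: `step exceeded its ${f.budgetMs}ms budget`,
        canResume: true,
      };
    case "fatal":
      return { message: f.message, canResume: false };
  }
}
```

Because the failure is data on the run rather than a thrown-away exception, "resume" can be the same code path as "rerun from here."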
If the semantic model is sound, the canvas is elaborate form design. If it is not, the diagram looks beautiful and nobody can rely on it in production—and I have been on both sides of that line. I am writing this down in case it helps you sketch your executor earlier than I did.