There's a specific kind of 2am that only backend engineers know. The kind where you're staring at a support ticket that says "customer paid but never received their policy" and your first thought isn't how do we fix this — it's how do we even figure out what happened.
That was the moment our team decided we needed a better architecture for this.
We were building an end-to-end insurance purchase platform — from lead generation all the way through to policy issuance. On paper, it sounded manageable. In practice, it touched six different services deployed on GKE on GCP, two external payment gateways, a document generation pipeline, an email and SMS delivery system, and a CRM. Each one could fail independently. Each one had its own idea of what the current "status" was.
The Flow Looks Simple Until It Isn't
From a product perspective, the journey is clean:
- Customer fills in their details → lead created
- They pick a plan → package selected
- They pay → payment confirmed
- We generate their policy → order created
- Policy document delivered by email and post → done
The reality is messier. The customer might leave the app mid-flow and come back two days later. The payment gateway might confirm asynchronously via webhook, ten minutes after the user hit "Pay". The order service might be down for maintenance at exactly that moment. The document generation service might time out on a complex PDF.
None of these are edge cases. These are expected failures. The question is whether your system handles them gracefully or loses data.
Before Temporal, we were dealing with this through a mix of database polling, retry cron jobs, and RabbitMQ messages. It worked, sort of. But the state was split across three different systems, and when things went wrong it took real forensic work to reconstruct what had actually happened to a specific customer's purchase.
Modelling It as a Workflow
The core insight with Temporal is that you stop thinking about your business process as a collection of API calls and start thinking about it as a single, persistent, resumable function.
Here's a simplified version of our main workflow in Go:
func InsurancePurchaseWorkflow(ctx workflow.Context, input PurchaseInput) error {
    // Every activity gets a timeout; Temporal layers its default retry policy on top.
    ctx = workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 5 * time.Minute,
    })

    if err := workflow.ExecuteActivity(ctx, CreateLead, input).Get(ctx, nil); err != nil {
        return err
    }

    // Package selection waits on a signal, so it runs as workflow code, not as an activity.
    pkg, err := WaitForPackageSelection(ctx, input.LeadID)
    if err != nil {
        return err
    }

    if err := workflow.ExecuteActivity(ctx, InitiatePayment, pkg).Get(ctx, nil); err != nil {
        return err
    }
    if err := workflow.ExecuteActivity(ctx, CreateOrder, pkg).Get(ctx, nil); err != nil {
        return err
    }
    return workflow.ExecuteActivity(ctx, IssuePolicy, pkg).Get(ctx, nil)
}
This looks like a regular Go function. That's the whole point. Temporal executes it with full durability — if the worker crashes midway through, the workflow resumes from exactly where it left off. No duplicate payment. No lost lead. No corrupted state to reconcile manually.
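For context, kicking one of these off from the API layer is a single client call. This is a sketch rather than our exact code; the task queue name and the lead-scoped workflow ID are illustrative:

import (
    "context"

    "go.temporal.io/sdk/client"
)

// startPurchase starts one durable workflow per lead.
func startPurchase(ctx context.Context, c client.Client, input PurchaseInput) error {
    _, err := c.ExecuteWorkflow(ctx, client.StartWorkflowOptions{
        ID:        "purchase-" + input.LeadID, // one workflow per lead
        TaskQueue: "insurance-purchase",       // must match the worker's task queue
    }, InsurancePurchaseWorkflow, input)
    return err
}

Keying the workflow ID to the lead has a useful side effect: if the frontend double-submits, Temporal refuses to start a second execution with the same ID while the first is still running.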
The Waiting Problem
One of the trickiest parts of an insurance purchase is that it isn't a single transaction. A customer might select a plan, get distracted, and return three days later. You obviously can't hold a goroutine open for three days.
Temporal solves this with signals — external events sent from other services directly into a running workflow. Our package selection step roughly looks like:
func WaitForPackageSelection(ctx workflow.Context, leadID string) (PackageSelection, error) {
    var pkg PackageSelection
    // Block durably until the "package-selected" signal arrives. The worker can
    // restart, or the wait can last days, without losing this position.
    workflow.GetSignalChannel(ctx, "package-selected").Receive(ctx, &pkg)
    return pkg, nil
}
When the customer finally picks their plan and our API receives that request, it sends a signal to the running workflow. The workflow wakes up exactly where it paused. No polling loop. No state reconstruction.
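On the API side, delivering that signal is one call on the same Temporal client, assuming the lead-scoped workflow ID from the start call above:

// Called by the HTTP handler once the customer confirms their plan.
// An empty run ID targets whichever execution of this workflow ID is currently running.
func notifyPackageSelected(ctx context.Context, c client.Client, leadID string, pkg PackageSelection) error {
    return c.SignalWorkflow(ctx, "purchase-"+leadID, "", "package-selected", pkg)
}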
We also handle the case where they never come back — a timer fires after 72 hours, marks the lead as expired, and optionally notifies the sales team to follow up. One clean code path for both outcomes.
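Under the hood, waiting on either the signal or the deadline is a single selector. A sketch of that combined wait; the helper name and return shape are ours to illustrate, not prescribed by Temporal:

// Waits for a package selection, or gives up after 72 hours of inactivity.
// Returns selected=false when the timer wins the race.
func waitForSelectionOrExpiry(ctx workflow.Context) (pkg PackageSelection, selected bool) {
    selector := workflow.NewSelector(ctx)
    selector.AddReceive(workflow.GetSignalChannel(ctx, "package-selected"), func(ch workflow.ReceiveChannel, more bool) {
        ch.Receive(ctx, &pkg)
        selected = true
    })
    selector.AddFuture(workflow.NewTimer(ctx, 72*time.Hour), func(f workflow.Future) {
        // Timer fired first; the caller marks the lead as LEAD_EXPIRED.
    })
    selector.Select(ctx) // blocks durably until one branch fires
    return pkg, selected
}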
Running Things in the Background
Not everything needs to block the main purchase flow.
After payment succeeds, we kick off invoice generation and quote document generation as background child workflows. The customer can proceed to policy issuance while the accounting system catches up. If the ERP service is down, Temporal retries the invoice activity automatically — without us writing a single retry loop anywhere.
The separation we landed on:
- Main workflow — core happy path: lead → package → payment → order → policy
- Child workflows — parallelisable work: invoice generation, quote documents, CRM sync
- Fire-and-forget activities — notifications: email, SMS, push
This made the codebase a lot easier to reason about. When a notification fails, it's isolated from the payment flow. When invoice generation fails, the customer already has their policy. No more cascading failures because a delivery service had a bad minute.
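The launch itself is a child workflow with an abandon close policy, which is what lets the invoice work outlive the parent if it has to. A sketch; the workflow and ID names are illustrative:

import (
    enumspb "go.temporal.io/api/enums/v1"
    "go.temporal.io/sdk/workflow"
)

// Fired after payment succeeds; the parent never blocks on the result.
func startInvoiceGeneration(ctx workflow.Context, pkg PackageSelection) error {
    childCtx := workflow.WithChildOptions(ctx, workflow.ChildWorkflowOptions{
        WorkflowID:        "invoice-" + pkg.LeadID,                // assumes PackageSelection carries the lead ID
        ParentClosePolicy: enumspb.PARENT_CLOSE_POLICY_ABANDON,    // keep running even if the parent closes
    })
    // Wait only until the child is scheduled, not until it finishes.
    return workflow.ExecuteChildWorkflow(childCtx, InvoiceGenerationWorkflow, pkg).
        GetChildWorkflowExecution().Get(ctx, nil)
}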
The Failure Scenarios That Convinced Us
Payment confirmed, order service down
Before Temporal: a webhook confirms payment, a database flag is set to PAYMENT_SUCCESS, and then a cron job tries to create orders for flagged payments every five minutes. Race conditions. Duplicate order attempts. Support tickets.
After: the workflow retries order creation automatically with exponential backoff. Payment state is preserved in workflow history. The cron job doesn't exist.
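The backoff is nothing bespoke, just a retry policy on the activity options. A sketch with illustrative numbers:

import (
    "time"

    "go.temporal.io/sdk/temporal"
    "go.temporal.io/sdk/workflow"
)

// Options for the CreateOrder activity: exponential backoff, retrying until
// the order service recovers (MaximumAttempts of 0 means unlimited).
func orderActivityOptions(ctx workflow.Context) workflow.Context {
    return workflow.WithActivityOptions(ctx, workflow.ActivityOptions{
        StartToCloseTimeout: 2 * time.Minute,
        RetryPolicy: &temporal.RetryPolicy{
            InitialInterval:    5 * time.Second,
            BackoffCoefficient: 2.0,
            MaximumInterval:    10 * time.Minute,
            MaximumAttempts:    0,
        },
    })
}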
Policy generated, document delivery failed
Our email provider would occasionally drop messages during load spikes, and the postal dispatch integration had its own failure modes. Before, we had to manually run delivery scripts and hope nothing got double-sent.
After: Temporal retries each delivery activity independently — email and postal dispatch are separate activities, so a failure in one doesn't block the other. The policy document is already safely stored. The workflow doesn't care how many attempts either channel takes.
Customer abandons mid-flow
Abandoned leads used to pile up in our database with no clear next action. Now the workflow has a built-in timer. 72 hours of inactivity transitions the workflow to LEAD_EXPIRED and notifies the sales queue. Clean, automatic, and auditable.
What We Had to Get Right
Temporal handles orchestration. It doesn't make your external systems idempotent for you — that's still your job.
We had to add idempotency keys to every payment initiation request, unique creation tokens for order service calls, and safe delivery logic (email and postal dispatch) that wouldn't double-deliver on retries. Temporal made retries automatic; we had to make sure the systems being retried could handle being called more than once.
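Concretely, the payment activity derives its idempotency key from identifiers that are stable across retries. This is a sketch; the gateway client, ChargeRequest, and the idempotency-key handling are stand-ins for whatever your provider supports:

// InitiatePayment runs as an activity. Every retry reuses the same key,
// so the gateway charges the customer at most once.
func (a *Activities) InitiatePayment(ctx context.Context, pkg PackageSelection) error {
    info := activity.GetInfo(ctx) // go.temporal.io/sdk/activity
    // Stable across retries: workflow ID plus a fixed step name.
    key := info.WorkflowExecution.ID + ":initiate-payment"

    return a.gateway.Charge(ctx, ChargeRequest{
        Amount:         pkg.Premium, // illustrative field
        IdempotencyKey: key,         // illustrative field
    })
}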
We also learned to be deliberate about separating workflow state from database status. The workflow is the source of truth, but every stage transition is mirrored to the database so the dashboard and support team always have a current view without querying Temporal directly.
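The mirroring itself is just one more activity fired after each stage transition; MirrorStatus and StatusUpdate below are stand-ins for whatever persistence call you already have:

// Mirror a stage transition to the relational store so dashboards and the
// support team can read status without querying Temporal.
func mirrorStage(ctx workflow.Context, leadID, stage string) error {
    return workflow.ExecuteActivity(ctx, MirrorStatus, StatusUpdate{
        LeadID: leadID,
        Stage:  stage, // e.g. "PAYMENT_CONFIRMED"
    }).Get(ctx, nil)
}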
The Audit Log Came for Free
One thing we didn't anticipate: Temporal's execution history became our compliance audit log.
Every activity, every state transition, every retry is recorded in Temporal's history with timestamps. When the compliance team asks "exactly what happened with policy #12345 on March 3rd?", the answer is a direct query. We didn't build that. It was there by default.
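Pulling that history programmatically is a single iterator on the Go client (the same events are visible in the Temporal Web UI); a sketch:

import (
    "context"
    "fmt"

    enumspb "go.temporal.io/api/enums/v1"
    "go.temporal.io/sdk/client"
)

// Dump every event for a given purchase workflow, in order.
func dumpHistory(ctx context.Context, c client.Client, workflowID string) error {
    iter := c.GetWorkflowHistory(ctx, workflowID, "", false, enumspb.HISTORY_EVENT_FILTER_TYPE_ALL_EVENT)
    for iter.HasNext() {
        event, err := iter.Next()
        if err != nil {
            return err
        }
        fmt.Printf("%d\t%s\n", event.GetEventId(), event.GetEventType())
    }
    return nil
}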
For insurance especially — where auditability is a legal requirement, not optional — this turned out to be more valuable than we expected.
Would We Use Temporal Again?
For anything long-running, multi-step, and dependent on external systems that fail: yes.
It's not zero cost. You need to think carefully about workflow design, idempotency, and how you split activities versus child workflows. Deploying Temporal workers as containers on GKE fits naturally into an existing Kubernetes setup, but the local dev setup with the Temporal server takes some getting used to. And the mental model shift from "sequence of API calls" to "durable function" takes a sprint or two to fully click.
But the alternative — stitching together cron jobs, message queues, and retry tables while your support team manually reconciles failed payments — costs more in the end.
The 2am support tickets didn't disappear entirely. But now when one arrives, we can pull up the workflow history and see exactly what happened, in order, with timestamps. That alone is worth the investment.