Temporal Durable Workflows: A 2026 Tutorial

Your payment service charged the card, then the process died before it recorded the order. Now you have an angry customer, a reconciliation ticket, and a half-written compensation script that nobody trusts. Temporal workflow orchestration solves this class of bug at the runtime level: it makes your business logic durable, so a crash mid-flow resumes exactly where it left off, with every prior step intact. That matters now because the retry-and-state-machine glue we hand-roll around queues and cron jobs is a persistent source of correctness bugs in distributed backends. By the end of this tutorial you will understand durable execution from first principles, write a runnable workflow, activity, worker, and client in Python, and configure retries and timeouts that behave correctly under failure.

What this covers: durable execution mechanics, workflows versus activities, the Temporal Server internals, a complete worker setup, retry and timeout policies, Continue-As-New, signals and queries, versioning for safe deploys, and the honest gotchas that bite teams in production.

Context and Background

For a decade the default answer to “run this multi-step process reliably” was a message queue plus a state table plus a pile of idempotency keys. You publish to a queue, a consumer picks up the job, writes progress to a database, and you pray the consumer does not die between the side effect and the state write. When it does — and it does — you debug a partial state by hand. The saga pattern for distributed transactions formalized the compensation logic, but you still owned the orchestration machinery.

Durable execution engines flip the model. Instead of persisting state, they persist the history of events that produced the state, and reconstruct state by replaying that history. Temporal, which grew out of Uber’s Cadence project, is the most widely deployed engine in this category as of 2026, alongside alternatives like Restate and the durable-functions model in cloud providers. The official Temporal documentation describes it as a “durable execution platform,” and that phrase is precise: the unit of durability is your function call, not a row you remembered to update.

The vocabulary matters because the field is young and the terms overlap. “Workflow engine” once meant BPMN diagrams and visual designers; “durable workflow engine” in the 2026 sense means a code-first runtime where your ordinary functions are the workflow. Temporal sits squarely in the second camp: there is no separate workflow definition language, no XML, no drag-and-drop canvas — just typed code in Python, Go, Java, TypeScript, or .NET that the runtime makes crash-proof.

The shift is significant because it moves reliability from application code into the platform. You write what looks like ordinary sequential code — call this service, wait an hour, call that one — and the engine guarantees it runs to completion despite process crashes, deploys, and network partitions. The catch, which most introductions gloss over, is that your workflow code must obey strict determinism rules to make replay work. That constraint is the whole game, and we will treat it as the central idea rather than a footnote.

Why does this beat the queue-plus-state-table approach you already know? Because the failure modes that pattern hides — a consumer dying between a side effect and its state write, a retry that double-applies, a cron job that silently stopped — are exactly the ones Temporal workflow orchestration eliminates by construction. You stop writing reliability glue and start writing business logic. The trade is that you accept a new runtime and a determinism discipline, which the rest of this tutorial spends most of its time on, because that is where teams either succeed or quietly accumulate replay bugs.

How Durable Execution Actually Works

Temporal splits your code into two kinds of functions with completely different rules. A Workflow is deterministic orchestration code: it decides what happens and in what order. An Activity is a plain function that does the actual work with side effects — HTTP calls, database writes, sending email. The workflow never performs I/O directly; it schedules activities and awaits their results. This split is the key to understanding Temporal workflow vs activity boundaries.

Hold onto that division, because everything else in Temporal workflow orchestration follows from it. The workflow is the brain that must replay identically forever; the activity is the hand that touches the messy outside world and is allowed to fail. Confuse the two — sneak an HTTP call into a workflow, or put orchestration decisions inside an activity — and you lose either determinism or durability. The architecture below exists to enforce exactly this boundary at runtime.

Figure 1: The Temporal architecture — your worker hosts both workflow and activity code and long-polls task queues on the server, which routes tasks and persists event history.

The diagram shows three moving parts. Your client starts workflows and sends signals or queries. Your worker is a process you run that long-polls task queues and executes your workflow and activity code. The Temporal Server is a cluster of services: a frontend (gRPC gateway), a history service (the event-sourcing brain), a matching service (task-queue router), and a persistence layer backed by PostgreSQL, MySQL, or Cassandra. Your application code never talks to the database; it only talks to the server over gRPC.

Durable execution, in one paragraph: a Temporal workflow is deterministic code whose every decision is recorded as an event in an append-only history. When a worker crashes, another worker replays that history from the start, re-executing the workflow function. Completed activities are not re-run — their recorded results are fed back in — so the code fast-forwards to exactly where it stopped, with all local variables restored. The workflow is durable because its history is durable.

It is worth pausing on how unusual this is. In a normal program, a process crash erases the call stack, every local variable, and the position in your code; recovery means reconstructing all of that from whatever you happened to persist. Temporal workflow orchestration reconstructs it automatically from history, deterministically and for free. The price is the determinism contract, which is why we now look at the history and replay model closely before writing a line of workflow code.

The event history is the source of truth

Every workflow execution has an event history: an ordered log of events like WorkflowExecutionStarted, ActivityTaskScheduled, ActivityTaskCompleted, TimerStarted, and so on. When your workflow code calls an activity, the SDK does not block a thread for an hour. It emits a command to the server, which records ActivityTaskScheduled and suspends the workflow. The worker is free to evict that workflow from memory entirely.

Figure 2: The replay model — workflow code emits commands that append durable events; after a crash, the worker replays history and skips already-completed steps to rebuild in-memory state.

When the activity finishes, the server appends ActivityTaskCompleted and schedules a new workflow task. A worker picks it up and, unless it still has that execution cached in memory, replays the history from the first event. Your code runs again, but this time the await on that activity returns the recorded result instead of scheduling it afresh. This is why replay must be deterministic: if your code took a different branch on replay than it did originally, the commands it emits would no longer match the recorded history, and Temporal would raise a non-determinism error.

Walk through the order example concretely. The history begins WorkflowExecutionStarted, then your code calls charge_payment, producing ActivityTaskScheduled, ActivityTaskStarted, and ActivityTaskCompleted with the charge ID. The code then calls reserve_inventory, adding the same trio, and so on. If the worker dies between the charge completing and the reservation scheduling, recovery is trivial: a new worker replays the recorded charge result, never re-running charge_payment, and proceeds to schedule the reservation exactly once. The history is both the audit log and the recovery mechanism — there is no separate checkpoint to manage, and no window where the system can lose track of where it was.
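
To make the walkthrough concrete, here is a toy model of replay in plain Python. It is a sketch of the idea only, not how the Temporal SDK is implemented: ToyContext, its history list, and the lambda activities are all invented for illustration. The history holds recorded activity results; on replay the context hands those back and only runs activities that have no record yet.

```python
from typing import Callable, List

# Toy model only: the "history" is a list of recorded activity results,
# in the order the workflow requested them.
History = List[str]


class ToyContext:
    """Returns recorded results during replay; runs activities otherwise."""

    def __init__(self, history: History) -> None:
        self.history = history
        self.position = 0                 # next history slot to consume
        self.executed: List[str] = []     # activities actually run this time

    def execute_activity(self, name: str, fn: Callable[[], str]) -> str:
        if self.position < len(self.history):
            result = self.history[self.position]   # replay: reuse the record
        else:
            result = fn()                           # new work: really run it
            self.history.append(result)             # and record the result
            self.executed.append(name)
        self.position += 1
        return result


def order_workflow(ctx: ToyContext) -> str:
    charge = ctx.execute_activity("charge_payment", lambda: "charge_A-1001")
    reservation = ctx.execute_activity("reserve_inventory", lambda: "reservation_A-1001")
    return f"{charge} {reservation}"


# The first worker completed the charge, then crashed before reserving.
history: History = ["charge_A-1001"]

# A new worker re-runs the workflow function from the top with that history.
ctx = ToyContext(history)
result = order_workflow(ctx)
print(ctx.executed)   # only reserve_inventory ran; the charge was replayed
```

The workflow function is called from its first line both times, yet charge_payment never runs twice, which is the property the event history buys you.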

Determinism constraints you cannot break

Because workflow code is re-executed on every replay, it must produce identical commands every time given the same history. That rules out anything non-deterministic inside workflow code. No datetime.now(): use workflow.now(), which returns the time recorded in history. No random or uuid from the standard library: use workflow.uuid4() and workflow.random(), which are seeded deterministically per execution. No direct network or disk I/O: push it into an activity. No reading mutable global state, and no iteration over collections whose order can differ between processes, such as a set of strings. The SDK enforces some of this with a sandbox, but the discipline is yours to keep.

The payoff is that you write blocking-looking sequential logic. await workflow.sleep(timedelta(days=30)) durably sleeps for a month — the worker can restart a hundred times in between and the timer survives, because it is just a TimerStarted event waiting for a TimerFired. This is what makes a durable workflow engine qualitatively different from a job queue.

Here is the canonical non-determinism bug and its fix, because seeing it once inoculates you against a whole family of outages:

# Fragments: each run method lives inside a @workflow.defn class.
import random
import uuid
from datetime import datetime, timedelta
from temporalio import workflow

# WRONG: non-deterministic inside workflow code
@workflow.run
async def run(self, order: OrderInput) -> str:
    deadline = datetime.now() + timedelta(hours=1)   # wall clock
    token = str(uuid.uuid4())                          # random
    if random.random() < 0.1:                          # random branch
        ...

# RIGHT: deterministic primitives that replay to the same values
@workflow.run
async def run(self, order: OrderInput) -> str:
    deadline = workflow.now() + timedelta(hours=1)     # replay-safe time
    token = str(workflow.uuid4())                       # replay-safe id
    if workflow.random().random() < 0.1:                # replay-safe branch
        ...
    # Any genuine randomness or external read goes into an activity.

The wrong version works perfectly in tests and the first time it runs in production. It breaks weeks later, on replay, when datetime.now() returns a different value than it did originally and the workflow takes a branch that contradicts recorded history. The right version reads time and randomness from primitives whose values are pinned in history, so every replay reproduces the original execution exactly.

A practical tip on debugging determinism: the SDK ships a replay test utility that feeds a recorded history through your current workflow code and fails if they diverge. Wire that into CI against histories exported from production, and you catch the most dangerous class of bug — code that replays differently — before it ever pages you. A determinism failure that surfaces only when an old workflow wakes up is nearly impossible to reproduce by hand; a replay test makes it a deterministic CI failure instead.
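
The SDK's replayer needs a real exported history, so what follows is a plain-Python toy of the same check rather than the SDK utility itself. replay_check, NonDeterminismError, and the two workflow functions are hypothetical names for illustration. The idea is to run the current code, collect the commands it emits, and fail if they differ from what was recorded.

```python
from typing import Callable, List

Schedule = Callable[[str], None]


class NonDeterminismError(Exception):
    """Raised when replayed code emits commands that history does not contain."""


def replay_check(workflow_fn: Callable[[Schedule], None],
                 recorded_commands: List[str]) -> None:
    emitted: List[str] = []
    workflow_fn(emitted.append)          # run current code, capture its commands
    if emitted != recorded_commands:
        raise NonDeterminismError(
            f"recorded {recorded_commands}, replay emitted {emitted}"
        )


def workflow_v1(schedule: Schedule) -> None:
    schedule("charge_payment")
    schedule("reserve_inventory")


def workflow_v2(schedule: Schedule) -> None:
    # A careless edit reordered the steps.
    schedule("reserve_inventory")
    schedule("charge_payment")


recorded = ["charge_payment", "reserve_inventory"]   # exported from a v1 run
replay_check(workflow_v1, recorded)                  # passes silently

try:
    replay_check(workflow_v2, recorded)
    diverged = False
except NonDeterminismError:
    diverged = True

print(diverged)
```

In CI the shape is the same: a directory of exported histories, the current workflow code, and a test that fails on the first divergence.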

A Complete Worker Setup You Can Run

Let us build a runnable example: an order-processing workflow that charges a payment, reserves inventory, and sends a confirmation, with retries and timeouts on every step. This is the same pattern you would use for the kind of flow described in our event-driven order management architecture. The full Temporal worker setup is four small files — activities, workflow, worker, and a client starter — and we will write each in turn so you can run the whole thing locally. Install the SDK first.

pip install temporalio
# Run a local server in another terminal:
#   temporal server start-dev
# This starts the dev server on localhost:7233 with a Web UI on :8233.

Define the activities. These are ordinary async functions — they do real I/O and may fail. The @activity.defn decorator registers them.

# activities.py
import asyncio
from dataclasses import dataclass
from temporalio import activity


@dataclass
class OrderInput:
    order_id: str
    customer_id: str
    amount_cents: int


@activity.defn
async def charge_payment(order: OrderInput) -> str:
    # Real code calls your payment gateway here. Activities run
    # outside the determinism sandbox, so I/O is allowed.
    activity.logger.info("Charging %s cents for %s", order.amount_cents, order.order_id)
    await asyncio.sleep(0.2)  # simulate a network call
    return f"charge_{order.order_id}"


@activity.defn
async def reserve_inventory(order: OrderInput) -> str:
    activity.logger.info("Reserving inventory for %s", order.order_id)
    await asyncio.sleep(0.2)
    return f"reservation_{order.order_id}"


@activity.defn
async def send_confirmation(order: OrderInput) -> None:
    activity.logger.info("Sending confirmation for %s", order.order_id)
    await asyncio.sleep(0.1)

Now the workflow. Notice it performs no I/O itself — it only orchestrates activities, each wrapped in its own timeout and retry policy.

# workflow.py
from datetime import timedelta
from temporalio import workflow
from temporalio.common import RetryPolicy

with workflow.unsafe.imports_passed_through():
    from activities import OrderInput, charge_payment, reserve_inventory, send_confirmation


@workflow.defn
class OrderWorkflow:
    @workflow.run
    async def run(self, order: OrderInput) -> str:
        retry = RetryPolicy(
            initial_interval=timedelta(seconds=1),
            backoff_coefficient=2.0,
            maximum_interval=timedelta(seconds=30),
            maximum_attempts=5,
        )

        charge_id = await workflow.execute_activity(
            charge_payment,
            order,
            start_to_close_timeout=timedelta(seconds=10),
            retry_policy=retry,
        )

        reservation = await workflow.execute_activity(
            reserve_inventory,
            order,
            start_to_close_timeout=timedelta(seconds=10),
            retry_policy=retry,
        )

        await workflow.execute_activity(
            send_confirmation,
            order,
            start_to_close_timeout=timedelta(seconds=5),
            retry_policy=retry,
        )

        return f"order {order.order_id} complete charge {charge_id} reservation {reservation}"

The worker is the runtime heart of Temporal workflow orchestration. It is a plain process you operate that does three things in a loop: long-poll a task queue for work, execute the matching workflow or activity code, and report results back to the server. Because all durable state lives in the server’s history, the worker holds nothing precious — kill it, redeploy it, run ten copies, and the system is unaffected. The task queue name (order-tq below) is the routing key that connects a client’s start call to the workers that can serve it; clients and workers never address each other directly, only the queue.

The worker process registers the workflow and activities against a named task queue and then long-polls forever.

# worker.py
import asyncio
from temporalio.client import Client
from temporalio.worker import Worker
from activities import charge_payment, reserve_inventory, send_confirmation
from workflow import OrderWorkflow


async def main() -> None:
    client = await Client.connect("localhost:7233")
    worker = Worker(
        client,
        task_queue="order-tq",
        workflows=[OrderWorkflow],
        activities=[charge_payment, reserve_inventory, send_confirmation],
    )
    await worker.run()


if __name__ == "__main__":
    asyncio.run(main())

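Finally, the client that starts a workflow. The id is your idempotency handle: while an execution with that workflow ID is still running, a second start with the same ID is rejected, so a retried client request cannot launch a duplicate order flow. Whether the ID of a closed execution can be reused is governed by the workflow ID reuse policy.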

# starter.py
import asyncio
from temporalio.client import Client
from activities import OrderInput
from workflow import OrderWorkflow


async def main() -> None:
    client = await Client.connect("localhost:7233")
    result = await client.execute_workflow(
        OrderWorkflow.run,
        OrderInput(order_id="A-1001", customer_id="C-42", amount_cents=4999),
        id="order-A-1001",
        task_queue="order-tq",
    )
    print(result)


if __name__ == "__main__":
    asyncio.run(main())

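Run python worker.py in one terminal and python starter.py in another. Kill the worker mid-run and restart it; if charge_payment had already completed, the workflow resumes without re-charging the card, because ActivityTaskCompleted is already in history. That is durable execution working. One caveat: an activity attempt that was interrupted before it reported completion is retried, so activities run at least once, and a real charge_payment should still pass an idempotency key to the payment gateway.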

Notice what you did not have to write: no state table, no idempotency-key bookkeeping in the orchestration layer, no manual “where was I” recovery logic, no dead-letter queue plumbing. The Temporal worker setup is the entire reliability story, and it is a few dozen lines. Scaling it is equally undramatic — workers are stateless, so you run more processes against the same order-tq task queue and the matching service load-balances tasks across them. Open the Web UI at localhost:8233 and you can inspect the full event history of order-A-1001, replay it, and see every command and timer the workflow emitted.

Timeouts: four kinds, and they are not interchangeable

Activities have four timeout knobs, and choosing them correctly is most of the skill. Schedule-to-start caps how long an activity waits in the task queue before a worker picks it up — useful for detecting worker starvation. Start-to-close caps a single attempt’s execution time; this is the one you almost always set. Schedule-to-close caps total wall-clock time across all retries. Heartbeat timeout applies to long activities that call activity.heartbeat() periodically; miss a heartbeat and the attempt is considered dead and retried.

Figure 3: A workflow-and-activity sequence — the client starts the workflow, the worker emits a schedule-activity command, the matching service routes it, and the result flows back as a new workflow task.

A common mistake is setting only schedule_to_close_timeout and expecting per-attempt protection. That timeout bounds the whole activity, retries included, so a single hung attempt can consume the entire budget and leave nothing for a retry; only start_to_close_timeout bounds one attempt. Set start_to_close to a realistic single-attempt budget and let the retry policy plus an optional schedule_to_close cap govern the total.
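
The difference between a per-attempt budget and a total budget is easy to see in a small asyncio model. This is a sketch under stated assumptions, not Temporal's scheduler: run_with_timeouts and flaky are hypothetical, and asyncio.wait_for stands in for the server enforcing start-to-close on each attempt.

```python
import asyncio
from typing import Awaitable, Callable


async def run_with_timeouts(
    attempt_fn: Callable[[int], Awaitable[str]],
    start_to_close: float,
    schedule_to_close: float,
    max_attempts: int,
) -> str:
    loop = asyncio.get_running_loop()
    deadline = loop.time() + schedule_to_close     # total budget, all attempts
    for attempt in range(1, max_attempts + 1):
        remaining = deadline - loop.time()
        if remaining <= 0:
            break
        try:
            # One attempt gets the smaller of its own budget and what is left.
            budget = min(start_to_close, remaining)
            return await asyncio.wait_for(attempt_fn(attempt), budget)
        except asyncio.TimeoutError:
            continue                               # hung attempt: try again
    raise TimeoutError("schedule-to-close budget or attempts exhausted")


async def flaky(attempt: int) -> str:
    if attempt == 1:
        await asyncio.sleep(60)                    # the first attempt hangs
    return f"ok on attempt {attempt}"


result = asyncio.run(
    run_with_timeouts(flaky, start_to_close=0.05, schedule_to_close=1.0, max_attempts=3)
)
print(result)
```

With a short start_to_close the hung first attempt is cut off quickly and the second attempt succeeds. Raise start_to_close to match the total budget and the same hung attempt eats the whole second, which is the mistake described above.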

For long activities — a multi-minute file conversion, a slow batch import — add heartbeats. Call activity.heartbeat(progress) periodically and set a heartbeat_timeout; if a worker dies mid-activity, Temporal detects the missed heartbeat quickly instead of waiting for the full start-to-close budget, and retries the attempt on another worker. Heartbeats can also carry progress, so a resumed attempt can skip work it already finished. Without them, a 30-minute activity on a crashed worker stalls for the entire start-to-close window before anyone notices.
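
Here is a toy model of heartbeat-based resumption, again in plain Python rather than the SDK. ToyServer, WorkerCrash, and convert_files are invented for illustration; the server object simply remembers the last progress value, which is the role the real heartbeat payload plays when a retried attempt reads it back.

```python
from typing import List, Optional


class WorkerCrash(Exception):
    """Stands in for a worker process dying mid-activity."""


class ToyServer:
    """Remembers the last heartbeat payload across attempts."""

    def __init__(self) -> None:
        self.last_details: Optional[int] = None

    def heartbeat(self, details: int) -> None:
        self.last_details = details


def convert_files(files: List[str], server: ToyServer, done: List[str],
                  crash_before: Optional[int] = None) -> None:
    # Resume after the last index a previous attempt reported, if any.
    start = 0 if server.last_details is None else server.last_details + 1
    for i in range(start, len(files)):
        if i == crash_before:
            raise WorkerCrash(f"died before file {i}")
        done.append(files[i])      # the real work
        server.heartbeat(i)        # report progress after each file


files = ["a.mov", "b.mov", "c.mov", "d.mov"]
server, done = ToyServer(), []

try:
    convert_files(files, server, done, crash_before=2)   # attempt 1 dies
except WorkerCrash:
    pass
convert_files(files, server, done)                       # attempt 2 resumes

print(done)   # each file converted once
```

Note the ordering inside the loop: a crash after the work but before the heartbeat would redo that one file, so the unit of work between heartbeats should itself be safe to repeat.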

Retries, Long Histories, and Coordination

The retry policy on each activity governs automatic re-execution on failure. The fields are initial_interval, backoff_coefficient, maximum_interval, maximum_attempts, and non_retryable_error_types. The last one matters: a ValidationError for a malformed order should not be retried five times, so add its class name to the non-retryable list and the workflow surfaces it immediately.

Figure 4: The retry and timeout state machine for an activity — each attempt is bounded by start-to-close and heartbeat timeouts, failures consume retry budget, and schedule-to-close caps the total.

Workflows themselves can also have retry policies, but you usually want activity-level retries because they are cheaper and more granular. The mental model: activities are where things fail and retry; workflows are where you decide what to do about it.

There is a subtlety worth naming. Retry backoff is computed by the server, not your code, and the waiting happens off the worker: a worker can crash and restart during a 30-second backoff with no effect on the schedule, because the retry state is held durably by the server rather than in worker memory. This is why Temporal workflow orchestration can promise “retries that survive deploys”: the retry state is persisted by the server just as the workflow's history is. Tune maximum_attempts against your tolerance for stuck work versus runaway retries, and prefer a finite cap plus alerting over an infinite retry loop that silently hides a broken dependency.
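
It helps to see what a policy actually produces. The sketch below computes the delays implied by the four numeric fields, assuming the usual exponential rule (initial interval times the coefficient raised to the retry number, capped at the maximum). backoff_schedule and should_retry are hypothetical helpers for reasoning about a policy, not SDK functions.

```python
from datetime import timedelta
from typing import List, Sequence


def backoff_schedule(initial_interval: timedelta, backoff_coefficient: float,
                     maximum_interval: timedelta, maximum_attempts: int) -> List[timedelta]:
    """Delays before attempts 2..maximum_attempts; attempt 1 runs immediately."""
    delays = []
    for retry in range(maximum_attempts - 1):
        delay = initial_interval * (backoff_coefficient ** retry)
        delays.append(min(delay, maximum_interval))   # cap each wait
    return delays


def should_retry(error_type: str, attempt: int, maximum_attempts: int,
                 non_retryable_error_types: Sequence[str]) -> bool:
    """A failure is retried only if its type is retryable and budget remains."""
    if error_type in non_retryable_error_types:
        return False
    return attempt < maximum_attempts


# The policy used by OrderWorkflow earlier: 1s initial, x2, 30s cap, 5 attempts.
delays = backoff_schedule(timedelta(seconds=1), 2.0, timedelta(seconds=30), 5)
print([d.total_seconds() for d in delays])
print(sum(delays, timedelta()).total_seconds())
```

For that policy the waits are 1, 2, 4, and 8 seconds, so the five attempts spend 15 seconds in backoff at most, plus up to one start-to-close budget per attempt. Doing this arithmetic before choosing a schedule_to_close cap avoids a cap that silently cuts off the last retries.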

Continue-As-New for unbounded loops

Event history is not free. Temporal enforces limits on it: in current versions the server warns at roughly 10,000 events or 10 MB and terminates the execution at roughly 50,000 events or 50 MB. The limits exist because replay cost grows with history length; any worker that does not have the execution cached must replay the whole history. A workflow that loops forever (a subscription billing loop, a long-running entity) will eventually blow past these limits. The fix is Continue-As-New: the workflow atomically completes its current run and starts a fresh execution with a new, empty history, carrying forward only the state it needs.

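from datetime import timedelta
from temporalio import workflow

with workflow.unsafe.imports_passed_through():
    # Assumed: charge_subscription is an @activity.defn function
    # defined alongside the other activities.
    from activities import charge_subscription


@workflow.defn
class BillingLoop:
    @workflow.run
    async def run(self, cycles_done: int) -> None:
        for _ in range(30):  # process a bounded chunk
            await workflow.sleep(timedelta(days=30))
            await workflow.execute_activity(
                charge_subscription,
                start_to_close_timeout=timedelta(seconds=30),
            )
            cycles_done += 1
        # Reset history before it grows unbounded; only cycles_done carries over.
        workflow.continue_as_new(cycles_done)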
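
A toy driver shows why this keeps history bounded. Everything here is illustrative: EVENTS_PER_CYCLE is an assumed round number, not a Temporal constant, and ContinueAsNew, billing_run, and drive are hypothetical stand-ins for what the server does when a run asks to continue as new.

```python
from dataclasses import dataclass
from typing import List, Tuple

EVENTS_PER_CYCLE = 10   # assumed for illustration; real counts vary


@dataclass
class ContinueAsNew:
    cycles_done: int     # the only state carried into the next run


def billing_run(cycles_done: int, chunk: int) -> Tuple[ContinueAsNew, int]:
    """One run: process `chunk` cycles, then ask to restart with fresh history."""
    history_len = 0                        # every run starts with empty history
    for _ in range(chunk):
        history_len += EVENTS_PER_CYCLE    # timer + activity events for one cycle
        cycles_done += 1
    return ContinueAsNew(cycles_done), history_len


def drive(total_runs: int, chunk: int) -> Tuple[int, List[int]]:
    state = 0
    history_lengths = []
    for _ in range(total_runs):
        request, history_len = billing_run(state, chunk)
        state = request.cycles_done        # the next run begins from carried state
        history_lengths.append(history_len)
    return state, history_lengths


cycles, lengths = drive(total_runs=4, chunk=30)
print(cycles, lengths)
```

The cycle counter keeps climbing across runs while each run's history stays the same size, which is the whole point: total work is unbounded, per-run history is not.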

Signals, queries, updates, and child workflows

A running workflow is not a black box. Signals push data into it asynchronously (for example, “customer cancelled”). Queries read its current state synchronously without mutating it (for a status dashboard). Updates, newer in the SDK, combine the two: a validated, synchronous mutation that returns a result. Child workflows let one workflow start and await another, which is how you decompose large processes. Together these turn a workflow into a durable, addressable entity — closer to an actor than to a script. This is where Temporal workflow orchestration stops looking like a fancy job queue and starts looking like a programming model: long-lived, stateful objects you can talk to by ID, that happen to be crash-proof.

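```python
from temporalio import workflow


@workflow.defn
class ApprovalWorkflow:
    def __init__(self) -> None:
        self._cancelled = False

    @workflow.signal
    def cancel(self) -> None:
        self._cancelled = True

    @workflow.query
    def is_cancelled(self) -> bool:
        return self._cancelled

    # The source listing is cut off here. A workflow class also needs a run
    # method; this minimal one is an assumption, not the original code.
    @workflow.run
    async def run(self) -> str:
        await workflow.wait_condition(lambda: self._cancelled)
        return "cancelled"
```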
