Architecture & Durable Execution

How AccessOps uses DBOS to make workflows crash-proof — and why that matters.

Durable workflows: 4
DBOS primitives: 6
Extra infrastructure: 0
Crash recoveries

The Problem: Partial Failure

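Consider what happens when a user requests access to a sensitive role. The system must execute multiple steps that can each fail independently, including assigning the role locally, provisioning it in the external system, and writing the audit log.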

What if the server crashes between "Assign role" and "Provision external"? The role is assigned locally but never provisioned externally. The user has a database record saying they have access, but the actual system denies them. The audit log never records what happened.
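To make the failure window concrete, here is a minimal Go sketch of the same request handled without durable execution. The Store type, the GrantAccess function, and the crashAfter parameter are illustrative assumptions rather than AccessOps code, and the crash is simulated by returning early.

```go
package main

import (
	"errors"
	"fmt"
)

// Store stands in for the system's state; every field name is illustrative.
type Store struct {
	LocalRole   bool // role row exists in the local database
	ExternalOK  bool // access granted in the external system
	AuditLogged bool // audit entry written
}

var errCrash = errors.New("simulated crash")

// GrantAccess runs the three steps in order with no checkpointing.
// crashAfter simulates the process dying after that many steps (0 = no crash).
func GrantAccess(s *Store, crashAfter int) error {
	steps := []func(){
		func() { s.LocalRole = true },   // assign role
		func() { s.ExternalOK = true },  // provision external
		func() { s.AuditLogged = true }, // write audit log
	}
	for i, step := range steps {
		step()
		if crashAfter == i+1 {
			return errCrash // nothing remembers how far we got
		}
	}
	return nil
}

func main() {
	s := &Store{}
	// Die between "Assign role" and "Provision external".
	err := GrantAccess(s, 1)
	fmt.Println(err, *s)
}
```

After the simulated crash, the store holds a local role with no external access and no audit entry, which is exactly the inconsistent state described above.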

This is the partial failure problem. Traditional solutions all push reliability into application code:

Database transactions

Can't span external API calls or hold transactions open for days while waiting for human approvals.

Message queues

Need dead-letter queues, idempotency, manual state machines, and retry logic for every workflow.

Custom state machines

Massive boilerplate. Every workflow becomes its own mini-framework with polling, error handling, and recovery.

Hand-rolled sagas

Complex to build correctly. Easy to miss edge cases. Compensation logic is error-prone and hard to test.

Durable Execution: The Key Insight

Durable execution is a programming model where workflows are written as ordinary functions, but the runtime guarantees they will complete despite crashes, restarts, or failures. You write linear code. DBOS gives you the reliability of a distributed state machine.

Checkpointed

Every step is saved

Each RunAsStep saves its result to PostgreSQL. On crash, the workflow resumes from the last completed step — not from the beginning.

Durable

Waits survive restarts

A workflow can sleep for 7 days waiting for a human approval. If the server restarts 50 times, the workflow picks up exactly where it left off.

Exactly-once

No duplicate side effects

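Completed steps are never re-executed on recovery. Their recorded results are replayed from PostgreSQL, so finished work is not duplicated. A step that was interrupted mid-execution runs again, so steps with external side effects should be idempotent.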

Zero ops

No extra infrastructure

No message queue, no Redis, no separate workflow engine. DBOS uses the PostgreSQL database you already have for checkpointing.
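The checkpoint-and-replay behaviour described in these cards can be modelled in a few lines of plain Go. This is a toy sketch, not the DBOS implementation: an in-memory map stands in for PostgreSQL, and Checkpoints, Run, Step, and Workflow are hypothetical names chosen for illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// Checkpoints maps a step's position in the workflow to its recorded result.
// A real system would persist this; a map keeps the sketch self-contained.
type Checkpoints map[int]string

// Run is one execution attempt of a workflow against a checkpoint store.
type Run struct {
	saved Checkpoints
	next  int            // position of the next step
	calls map[string]int // how often each step body really executed
}

func NewRun(saved Checkpoints, calls map[string]int) *Run {
	return &Run{saved: saved, calls: calls}
}

// Step returns the recorded result if the step already completed;
// otherwise it executes fn and records the result before returning.
func (r *Run) Step(name string, fn func() (string, error)) (string, error) {
	id := r.next
	r.next++
	if out, ok := r.saved[id]; ok {
		return out, nil // replayed: fn is not called
	}
	r.calls[name]++
	out, err := fn()
	if err != nil {
		return "", err // nothing recorded, so the step runs again next time
	}
	r.saved[id] = out
	return out, nil
}

var errCrash = errors.New("simulated crash")

// Workflow assigns a role, then provisions it. crash makes step two fail.
func Workflow(r *Run, crash bool) (string, error) {
	role, err := r.Step("assign", func() (string, error) { return "role-assigned", nil })
	if err != nil {
		return "", err
	}
	ext, err := r.Step("provision", func() (string, error) {
		if crash {
			return "", errCrash
		}
		return "provisioned", nil
	})
	if err != nil {
		return "", err
	}
	return role + "," + ext, nil
}

func main() {
	saved, calls := Checkpoints{}, map[string]int{}
	_, err := Workflow(NewRun(saved, calls), true) // first attempt dies in step two
	fmt.Println("first attempt:", err)
	out, _ := Workflow(NewRun(saved, calls), false) // recovery re-invokes from the top
	fmt.Println("recovered:", out, "| assign ran", calls["assign"], "time(s)")
}
```

The completed "assign" step executes once across both attempts, while the interrupted "provision" step executes twice, which is why steps with side effects are usually written to be idempotent.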

System Architecture

AccessOps is a three-layer system. The frontend talks to the REST API, which orchestrates durable workflows backed by PostgreSQL.

The Four Workflows

2 event-driven, 2 scheduled

1. Approval Workflow
Human-in-the-loop orchestration (long-running)
The core workflow. Orchestrates the entire access request lifecycle — from submission through multi-level approval to provisioning. Can run for weeks while waiting for human decisions, consuming zero server resources during waits.
Go approval.go (simplified)
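// Simplified sketch. ApprovalInput, Decision, the step functions, and the
// timedOut/escalate/markDenied helpers are placeholders for application code.
// How Recv reports a timeout is an assumption; check the SDK version in use.
func ApprovalWorkflow(ctx dbos.DBOSContext, input ApprovalInput) (string, error) {
    steps, err := dbos.RunAsStep(ctx, LoadApprovalSteps) // checkpointed
    if err != nil {
        return "", err
    }
    if _, err := dbos.RunAsStep(ctx, LinkWorkflowToRequest); err != nil { // checkpointed
        return "", err
    }

    for _, step := range steps {
        // Durable wait: survives any number of restarts
        decision, err := dbos.Recv[Decision](ctx, "approval_decision", 7*24*time.Hour)
        switch {
        case timedOut(decision, err):
            escalate(ctx, step) // auto-escalate, then continue with the next step
        case err != nil:
            return "", err
        case !decision.Approved:
            return markDenied(ctx, input) // mark DENIED and stop
        }
    }

    // All approved: spawn the child workflow
    _, err = dbos.RunWorkflow(ctx, ProvisioningWorkflow, input)
    return "APPROVED", err
}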
Pattern: Human-in-the-loop
Key primitive: Recv / Send
Timeout: 7 days per approver
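The control flow of the approval loop (wait for a decision, escalate on timeout, stop on denial) can be tried out with the standard library alone. The sketch below is an assumption-laden model, not the AccessOps code: a channel plays the role of the message inbox, recv and Approve are hypothetical names, and escalation is reduced to a counter. Unlike a durable Recv, a channel does not survive a restart.

```go
package main

import (
	"fmt"
	"time"
)

// Decision is an illustrative message type.
type Decision struct {
	Approved bool
}

// demoTimeout replaces the 7-day wait so the sketch finishes quickly.
const demoTimeout = 20 * time.Millisecond

// recv waits for one decision or gives up after timeout; ok is false on timeout.
func recv(inbox <-chan Decision, timeout time.Duration) (d Decision, ok bool) {
	select {
	case d = <-inbox:
		return d, true
	case <-time.After(timeout):
		return Decision{}, false
	}
}

// Approve walks the approver chain and returns the final status plus the
// number of levels that timed out and were auto-escalated.
func Approve(levels int, inbox <-chan Decision, timeout time.Duration) (string, int) {
	escalated := 0
	for i := 0; i < levels; i++ {
		d, ok := recv(inbox, timeout)
		if !ok {
			escalated++ // timeout: auto-escalate and move to the next level
			continue
		}
		if !d.Approved {
			return "DENIED", escalated
		}
	}
	return "APPROVED", escalated
}

func main() {
	inbox := make(chan Decision, 2)
	inbox <- Decision{Approved: true} // approver A
	inbox <- Decision{Approved: true} // approver B
	status, escalated := Approve(2, inbox, demoTimeout)
	fmt.Println(status, escalated)
}
```

In the real workflow the HTTP handler would deliver each decision with Send, and the wait itself would be recorded in PostgreSQL rather than held in process memory.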

DBOS Primitives Reference

Every primitive used in this project — that's the entire API surface.

| Primitive | Signature | What it does | Used for |
|---|---|---|---|
| RunAsStep | dbos.RunAsStep(ctx, fn) | Checkpointed step. Result saved to PostgreSQL; replayed on recovery instead of re-executing. | Every DB query, external call, and audit write in every workflow |
| Recv | dbos.Recv[T](ctx, topic, timeout) | Durable wait. Suspends workflow until a message arrives or timeout expires. Survives restarts. | Approval workflow: waiting for human decisions |
| Send | dbos.Send(ctx, wfID, msg, topic) | Delivers a message to a specific workflow's Recv call. The API handler uses this to push decisions. | HTTP handler to running approval workflow |
| RunWorkflow | dbos.RunWorkflow(ctx, fn, input) | Starts a child workflow with independent checkpoints. Parent can await the result. | Approval spawns Provisioning after all approvals pass |
| WithStepMaxRetries | dbos.WithStepMaxRetries(n) | Automatic retries for a step. Retries up to n times before propagating the error. | External provisioning: 3 retries for transient failures |
| WithSchedule | dbos.WithSchedule("cron") | Cron-based scheduling. DBOS deduplicates runs and executes missed invocations on recovery. | Hourly expiry + Monday 9 AM weekly review |

What Happens When Things Crash?

Three real failure scenarios and exactly how DBOS recovers.

Server crashes while waiting for Approver B

Server restarts, dbos.Launch() scans for incomplete workflows
Finds the approval workflow in PENDING state
Re-invokes ApprovalWorkflow with the original inputs
Steps 1-3 (load, link, Approver A): checkpoints found, replayed instantly
Step 4 (Approver B wait): no checkpoint, resumes Recv
Workflow is back exactly where it was. No data loss. No duplicate operations.

External provisioning fails after 3 retries

Saga Step 1 (assign role in local DB): succeeded, checkpointed
Saga Step 2 (external system): fails 3 times via WithStepMaxRetries(3)
Error propagates — saga enters compensation path
Compensation: RevokeUserRole runs as a checkpointed step
Workflow returns FAILED — parent marks request as denied
Local DB assignment is cleanly rolled back. No inconsistent state.
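The retry-then-compensate path in this scenario can be sketched without DBOS. The code below is a simplified model under stated assumptions: DB, attempt, and Provision are hypothetical names, state lives in a map, and the sketch simply makes three attempts to match "fails 3 times" (whether the SDK counts the first try toward the limit is not modelled).

```go
package main

import (
	"errors"
	"fmt"
)

// DB is an in-memory stand-in for the local user_roles table.
type DB struct {
	roles map[string]bool
}

func (db *DB) Assign(user string) { db.roles[user] = true }

// Revoke is idempotent: deleting a missing key is a no-op, so it is
// safe to run again if compensation is interrupted and repeated.
func (db *DB) Revoke(user string) { delete(db.roles, user) }

var errExternal = errors.New("external system unavailable")

// attempt calls fn until it succeeds or maxAttempts is reached,
// returning the number of attempts made and the last error.
func attempt(maxAttempts int, fn func() error) (int, error) {
	if maxAttempts < 1 {
		maxAttempts = 1
	}
	var err error
	for i := 1; i <= maxAttempts; i++ {
		if err = fn(); err == nil {
			return i, nil
		}
	}
	return maxAttempts, err
}

// Provision is the saga: assign locally, provision externally, and
// revoke the local assignment if provisioning keeps failing.
func Provision(db *DB, user string, external func() error) (string, error) {
	db.Assign(user) // saga step 1
	if _, err := attempt(3, external); err != nil { // saga step 2
		db.Revoke(user) // compensation
		return "FAILED", err
	}
	return "PROVISIONED", nil
}

func main() {
	db := &DB{roles: map[string]bool{}}
	status, err := Provision(db, "alice", func() error { return errExternal })
	fmt.Println(status, err, "role still assigned:", db.roles["alice"])
}
```

In the durable version each of these calls would run as a checkpointed step, so the same branch is re-entered after a restart instead of being lost.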

Server crashes during saga compensation

Step 1 (AssignUserRole): checkpointed
Step 2 (ExternalProvisioning): failed, checkpointed as ERROR
Compensation (RevokeUserRole): crash mid-execution
Server restarts — DBOS replays steps 1 and 2 from checkpoints
Re-enters compensation branch — RevokeUserRole completes
Compensation completes even through a crash. System is consistent.

What Would Break Without DBOS?

Server restarts during approval
Without DBOS

Workflow state lost. Need polling-based state machine to resume.

With DBOS

Workflow auto-resumes from the last checkpoint. No recovery code needed.

Crash between local assign and external provision
Without DBOS

Inconsistent state. User has local role but no external access.

With DBOS

The workflow resumes at the external provisioning step. If provisioning then fails, saga compensation rolls back the local assignment.

Server down for 2 hours during scheduled expiry
Without DBOS

Missed cron runs. Expired access stays active until next window.

With DBOS

Missed runs execute on recovery. No window is skipped.

Crash during audit log write
Without DBOS

Silent data loss. Action happened but no record exists.

With DBOS

Audit step retries on recovery. Action is always recorded.
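For the scheduled-expiry scenario above, the catch-up idea can be illustrated with a small calculation: given the last recorded run and the current time, list every hourly tick that was missed. This is a sketch of the concept only, not how DBOS implements its scheduler; MissedHourlyRuns and the demo timestamps are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// Demo values: the last expiry check ran at 09:00 and the server came
// back at 11:30 the same day.
var (
	demoLastRun = time.Date(2025, 1, 6, 9, 0, 0, 0, time.UTC)
	demoNow     = time.Date(2025, 1, 6, 11, 30, 0, 0, time.UTC)
)

// MissedHourlyRuns returns every top-of-the-hour tick that falls after
// lastRun and no later than now.
func MissedHourlyRuns(lastRun, now time.Time) []time.Time {
	var runs []time.Time
	next := lastRun.Truncate(time.Hour).Add(time.Hour)
	for !next.After(now) {
		runs = append(runs, next)
		next = next.Add(time.Hour)
	}
	return runs
}

func main() {
	for _, t := range MissedHourlyRuns(demoLastRun, demoNow) {
		fmt.Println("run expiry check for", t.Format(time.RFC3339))
	}
}
```

With these demo values the 10:00 and 11:00 windows are both executed on recovery, so no expiry window is skipped.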

Under the Hood: DBOS System Tables

DBOS creates its own tables in PostgreSQL alongside the application tables. No separate database or infrastructure required.

Tree Database schema
PostgreSQL
├── Application Tables
│   ├── users
│   ├── roles
│   ├── access_requests
│   ├── approval_steps
│   ├── user_roles
│   └── audit_logs
│
└── DBOS System Tables (managed automatically)
    ├── dbos.workflow_status    ← tracks active/completed workflows
    ├── dbos.workflow_inputs    ← serialized inputs for replay
    ├── dbos.operation_outputs  ← checkpointed step results
    └── dbos.notifications      ← durable Send/Recv message queue

Recovery on startup

When dbos.Launch() is called, DBOS queries workflow_status for any workflows marked as PENDING. For each unfinished workflow, it re-invokes the workflow function with the original inputs, replays completed steps from checkpoints, and resumes at the first un-checkpointed step.
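The startup scan described above can be pictured as a loop over status rows. The following Go sketch is a simplified model, not DBOS internals: WorkflowStatus, Registry, and Recover are hypothetical names, the fields are illustrative rather than the real column names, and a slice replaces the workflow_status table.

```go
package main

import "fmt"

// WorkflowStatus models one row of a workflow status table.
type WorkflowStatus struct {
	ID     string
	Name   string // which registered workflow function to call
	Status string // "PENDING", "SUCCESS", or "ERROR" in this sketch
	Input  string // serialized original input
}

// Registry maps workflow names to functions so a stored row can be re-invoked.
type Registry map[string]func(input string) (string, error)

// Recover re-invokes every PENDING workflow with its original input,
// updates its status, and returns the IDs it resumed.
func Recover(rows []WorkflowStatus, reg Registry) []string {
	var resumed []string
	for i := range rows {
		row := &rows[i]
		if row.Status != "PENDING" {
			continue // finished workflows are left alone
		}
		fn, ok := reg[row.Name]
		if !ok {
			continue // unknown workflow: leave it pending
		}
		if _, err := fn(row.Input); err != nil {
			row.Status = "ERROR"
		} else {
			row.Status = "SUCCESS"
		}
		resumed = append(resumed, row.ID)
	}
	return resumed
}

func main() {
	rows := []WorkflowStatus{
		{ID: "wf-1", Name: "approval", Status: "SUCCESS", Input: "req-1"},
		{ID: "wf-2", Name: "approval", Status: "PENDING", Input: "req-2"},
	}
	reg := Registry{
		"approval": func(in string) (string, error) { return "done:" + in, nil },
	}
	fmt.Println(Recover(rows, reg), rows[1].Status)
}
```

Inside each re-invoked function, completed steps would return their checkpointed results, so only the unfinished tail of the workflow does real work.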

Why Not Just Use...?

Durable execution vs related approaches in the distributed systems landscape.

Tech Stack

Frontend: SvelteKit 2, Tailwind CSS 4, TypeScript
Backend: Go 1.23+, Chi router, pgx
Durable Execution: DBOS Transact Go SDK
Database: PostgreSQL 16
Auth: JWT (simplified for POC)
Infrastructure: Docker Compose, DBOS Cloud, Cloudflare Pages