Architecture & Durable Execution

How AccessOps uses DBOS to make workflows crash-proof — and why that matters.

The Problem: Partial Failure

Consider what happens when a user requests access to a sensitive role. The system must execute multiple steps that can fail independently:

Wait for Approver A → Wait for Approver B → Assign role in DB → Provision external → Audit log

What if the server crashes between "Assign role" and "Provision external"? The role is assigned locally but never provisioned externally. The user has a database record saying they have access, but the actual system denies them. The audit log never records what happened.

This is the partial failure problem. Traditional solutions all push reliability into application code:

Database transactions

Can't span external API calls or hold transactions open for days while waiting for human approvals.

Message queues

Need dead-letter queues, idempotency, manual state machines, and retry logic for every workflow.

Custom state machines

Massive boilerplate. Every workflow becomes its own mini-framework with polling, error handling, and recovery.

Hand-rolled sagas

Complex to build correctly. Easy to miss edge cases. Compensation logic is error-prone and hard to test.

Durable Execution: The Key Insight

Durable execution is a programming model where workflows are written as ordinary functions, but the runtime guarantees they will complete despite crashes, restarts, or failures. You write linear code. DBOS gives you the reliability of a distributed state machine.

Normal execution: Step 1 (saved) → Step 2 (saved) → Step 3 (saved) → Step 4 (CRASH)

Recovery after restart: Step 1 (replay) → Step 2 (replay) → Step 3 (replay) → Step 4 (execute) → Step 5 (execute) → Complete

Checkpointed

Every step is saved

Each RunAsStep saves its result to PostgreSQL. After a crash, the workflow resumes at the first step that has no saved result, not from the beginning.
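The checkpoint-and-replay idea can be sketched with a toy in-memory model. This is not how DBOS is implemented: checkpointStore, runAsStep, and workflow are hypothetical names, and a Go map stands in for PostgreSQL.

```go
package main

import "fmt"

// checkpointStore is a hypothetical in-memory stand-in for the table that
// holds step results (PostgreSQL in the real system).
type checkpointStore map[string]string

// runAsStep executes fn only when no result is recorded under key.
// Otherwise it returns the recorded result without calling fn.
func runAsStep(store checkpointStore, key string, fn func() string) string {
	if saved, ok := store[key]; ok {
		return saved // replayed from the checkpoint
	}
	result := fn()
	store[key] = result // checkpoint the result before moving on
	return result
}

// workflow runs three steps in order and counts how many really executed.
// crashAfter simulates a crash once that many steps are done (0 = no crash).
func workflow(store checkpointStore, crashAfter int) (executed int, finished bool) {
	for i := 1; i <= 3; i++ {
		if crashAfter > 0 && i > crashAfter {
			return executed, false // simulated crash
		}
		runAsStep(store, fmt.Sprintf("step-%d", i), func() string {
			executed++
			return "done"
		})
	}
	return executed, true
}

func main() {
	store := checkpointStore{}
	fmt.Println(workflow(store, 2)) // first run crashes: 2 false
	fmt.Println(workflow(store, 0)) // restart, same store: 1 true
}
```

The second call re-enters the same function from the top, yet only the third step does real work, because the first two find their results in the store.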

Durable

Waits survive restarts

A workflow can sleep for 7 days waiting for a human approval. If the server restarts 50 times, the workflow picks up exactly where it left off.
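One way a wait can survive restarts is to persist an absolute deadline and recompute the remaining time whenever the process comes back. The sketch below illustrates that idea under stated assumptions: waitRecord, startWait, and remaining are hypothetical names, and nothing here is taken from DBOS internals.

```go
package main

import (
	"fmt"
	"time"
)

// waitRecord is a hypothetical persisted row. It stores an absolute deadline,
// not a relative duration, so the wait does not restart when the process does.
type waitRecord struct {
	Deadline time.Time
}

// startWait is called once, when the workflow first begins waiting.
func startWait(now time.Time, timeout time.Duration) waitRecord {
	return waitRecord{Deadline: now.Add(timeout)}
}

// remaining is called after each restart. It returns how much longer to wait,
// or zero if the deadline passed while the server was down.
func remaining(rec waitRecord, now time.Time) time.Duration {
	if left := rec.Deadline.Sub(now); left > 0 {
		return left
	}
	return 0
}

// demo starts a 7-day wait, then checks it after a restart on day 3 and
// after a restart on day 8.
func demo() (day3, day8 time.Duration) {
	start := time.Date(2025, 1, 6, 9, 0, 0, 0, time.UTC)
	rec := startWait(start, 7*24*time.Hour)
	day3 = remaining(rec, start.Add(3*24*time.Hour))
	day8 = remaining(rec, start.Add(8*24*time.Hour))
	return day3, day8
}

func main() {
	day3, day8 := demo()
	fmt.Println(day3) // 96h0m0s
	fmt.Println(day8) // 0s
}
```

Because only the deadline is stored, the number of restarts in between does not matter; each restart simply asks how much time is left.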

Exactly-once

No duplicate side effects

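Completed steps are never re-executed on recovery. Their recorded results are replayed from PostgreSQL, so a finished step's side effects are not duplicated. A step that was interrupted before its result was saved has no checkpoint and runs again, so steps that call external systems should be idempotent.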
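The guarantee above covers completed steps. A step that crashed before its result was recorded has nothing to replay and will run again, so calls to external systems benefit from an idempotency key. The sketch below is one common way to do that; externalSystem and idempotencyKey are hypothetical and not part of AccessOps or DBOS.

```go
package main

import "fmt"

// externalSystem is a hypothetical provisioning API that remembers which
// idempotency keys it has already processed.
type externalSystem struct {
	seen   map[string]bool
	grants int
}

// provision applies the grant at most once per key, so a repeated call
// after a crash does not create a second grant.
func (e *externalSystem) provision(key string) {
	if e.seen[key] {
		return
	}
	e.seen[key] = true
	e.grants++
}

// idempotencyKey derives a stable key from the workflow ID and step number,
// so the same step presents the same key before and after a restart.
func idempotencyKey(workflowID string, step int) string {
	return fmt.Sprintf("%s/step-%d", workflowID, step)
}

// grantsAfterRetry simulates a step that reached the external system, crashed
// before its result was checkpointed, and ran again on recovery.
func grantsAfterRetry() int {
	ext := &externalSystem{seen: map[string]bool{}}
	key := idempotencyKey("wf-123", 2)
	ext.provision(key) // first attempt, crash before checkpoint
	ext.provision(key) // re-execution on recovery
	return ext.grants
}

func main() {
	fmt.Println(grantsAfterRetry()) // 1
}
```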

Zero ops

No extra infrastructure

No message queue, no Redis, no separate workflow engine. DBOS uses the PostgreSQL database you already have for checkpointing.

System Architecture

AccessOps is a three-layer system. The frontend talks to the REST API, which orchestrates durable workflows backed by PostgreSQL.

Frontend

SvelteKit + Tailwind CSS

Dashboard, Request Form, Approval UI, Admin, Audit Log

HTTP / JSON

Go Backend

Chi Router + JWT Auth

WorkflowStarter

POST /requests → RunWorkflow

ApprovalNotifier

POST /decide → Send(decision)

DBOS Durable Workflows

Approval, Provisioning, Access Expiry, Weekly Review

SQL / pgx

PostgreSQL 16

Application Tables

users, roles, access_requests, approval_steps, user_roles, audit_logs

DBOS System Tables

workflow_status, workflow_inputs, operation_outputs, notifications
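The two backend entry points in the diagram can be sketched with the standard library alone. This is an illustration of the wiring, not the project's code: AccessOps uses Chi and JWT auth, which are omitted here, and the engine interface, the workflow_id and decision query parameters, and fakeEngine are all assumptions made for the example.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"strings"
)

// engine is a hypothetical interface over the two workflow operations the
// API layer needs (RunWorkflow and Send in the diagram).
type engine interface {
	StartApproval(input string) (workflowID string)
	SendDecision(workflowID, decision string)
}

// newAPI wires the two endpoints from the diagram onto a standard mux.
func newAPI(e engine) http.Handler {
	mux := http.NewServeMux()
	// WorkflowStarter: POST /requests starts an approval workflow.
	mux.HandleFunc("/requests", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		body, err := io.ReadAll(r.Body)
		if err != nil {
			http.Error(w, "unreadable body", http.StatusBadRequest)
			return
		}
		w.WriteHeader(http.StatusAccepted)
		fmt.Fprint(w, e.StartApproval(string(body)))
	})
	// ApprovalNotifier: POST /decide forwards a decision to a waiting workflow.
	mux.HandleFunc("/decide", func(w http.ResponseWriter, r *http.Request) {
		if r.Method != http.MethodPost {
			http.Error(w, "POST only", http.StatusMethodNotAllowed)
			return
		}
		q := r.URL.Query()
		e.SendDecision(q.Get("workflow_id"), q.Get("decision"))
		w.WriteHeader(http.StatusNoContent)
	})
	return mux
}

// fakeEngine records calls so the wiring can be exercised without a database.
type fakeEngine struct{ sent []string }

func (f *fakeEngine) StartApproval(input string) string { return "wf-1" }
func (f *fakeEngine) SendDecision(id, d string)         { f.sent = append(f.sent, id+":"+d) }

// demo submits a request, then posts a decision for the returned workflow ID.
func demo() (status int, workflowID string, sent []string) {
	fake := &fakeEngine{}
	api := newAPI(fake)

	rec := httptest.NewRecorder()
	api.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/requests", strings.NewReader(`{"role":"admin"}`)))
	status, workflowID = rec.Code, rec.Body.String()

	rec = httptest.NewRecorder()
	api.ServeHTTP(rec, httptest.NewRequest(http.MethodPost, "/decide?workflow_id="+workflowID+"&decision=approve", nil))
	return status, workflowID, fake.sent
}

func main() {
	fmt.Println(demo()) // 202 wf-1 [wf-1:approve]
}
```

The handlers return immediately in both cases; the long-running part lives in the workflow, which the API only starts or messages.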

The Four Workflows

2 event-driven, 2 scheduled

Approval Workflow

Human-in-the-loop orchestration

LONG-RUNNING

The core workflow. Orchestrates the entire access request lifecycle — from submission through multi-level approval to provisioning. Can run for weeks while waiting for human decisions, consuming zero server resources during waits.

Load steps → Link workflow → Recv: wait 7d → Record → Provisioning

approval.go (Go, simplified)

func ApprovalWorkflow(ctx, input) {
    // Both setup steps are checkpointed. RunAsStep returns (result, error);
    // error handling is omitted throughout for brevity.
    steps, _ := dbos.RunAsStep(ctx, LoadApprovalSteps)
    dbos.RunAsStep(ctx, LinkWorkflowToRequest)

    for range steps {
        // Durable wait: survives any number of restarts
        decision, err := dbos.Recv[Decision](ctx,
                        "approval_decision",
                        7 * 24 * time.Hour)

        // Pseudocode for the two non-approval outcomes:
        //   timed out → auto-escalate, continue with the next step
        //   denied    → mark DENIED, return
    }

    // All approved: spawn the child workflow
    dbos.RunWorkflow(ctx, ProvisioningWorkflow, input)
}

Pattern: Human-in-the-loop

Key primitive: Recv / Send

Timeout: 7 days per approver
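The control flow of the approval loop (one wait per approver, escalate on timeout, stop on denial) can be exercised without DBOS. In this sketch a buffered channel stands in for the durable Send/Recv topic, so nothing here survives a restart; decision, mailbox, recv, and runApproval are hypothetical names.

```go
package main

import (
	"fmt"
	"time"
)

// decision is a hypothetical message pushed by the API handler.
type decision struct {
	Approver string
	Approved bool
}

// mailbox is an in-memory stand-in for a durable Send/Recv topic.
type mailbox chan decision

// recv waits for one message or gives up after timeout.
// The bool is false when the timeout fired first.
func recv(mb mailbox, timeout time.Duration) (decision, bool) {
	select {
	case d := <-mb:
		return d, true
	case <-time.After(timeout):
		return decision{}, false
	}
}

// runApproval waits once per approver. A timeout escalates and moves on;
// a denial ends the workflow immediately. This mirrors the simplified
// listing above, where only a denial stops the loop.
func runApproval(mb mailbox, approvers []string, timeout time.Duration) string {
	for range approvers {
		d, ok := recv(mb, timeout)
		if !ok {
			continue // timed out: auto-escalate, move to the next approver
		}
		if !d.Approved {
			return "DENIED"
		}
	}
	return "PROVISION" // the real workflow starts provisioning here
}

// demo runs three cases: both approve, the first denies, nobody answers.
func demo() (approved, denied, silent string) {
	approvers := []string{"A", "B"}
	timeout := 20 * time.Millisecond

	yes := make(mailbox, 2)
	yes <- decision{Approver: "A", Approved: true}
	yes <- decision{Approver: "B", Approved: true}

	no := make(mailbox, 2)
	no <- decision{Approver: "A", Approved: false}

	nobody := make(mailbox, 2) // both waits time out

	return runApproval(yes, approvers, timeout),
		runApproval(no, approvers, timeout),
		runApproval(nobody, approvers, timeout)
}

func main() {
	fmt.Println(demo()) // PROVISION DENIED PROVISION
}
```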

Provisioning Saga

Compensating rollbacks on failure

SAGA PATTERN

Multi-step provisioning with compensating rollbacks. If the external system fails, the local database assignment is automatically reverted. Each step is independently checkpointed — even the compensation.

Happy path: 1. Assign role (DB) → 2. Provision external → 3. Audit log

Failure path: 1. Assign role → 2. External fails → Compensate

provisioning.go (Go, simplified)

func ProvisioningWorkflow(ctx, input) {
    // Saga Step 1: assign role in local DB
    dbos.RunAsStep(ctx, AssignUserRole)

    // Saga Step 2: provision in external system
    _, err := dbos.RunAsStep(ctx, ProvisionExternal,
                  dbos.WithStepMaxRetries(3))

    if err != nil {
        // ↩ Compensate: roll back step 1
        dbos.RunAsStep(ctx, RevokeUserRole)  // also checkpointed!
        return "FAILED — rolled back"
    }

    // Saga Step 3: audit log
    dbos.RunAsStep(ctx, WriteAuditLog)
}

Pattern: Saga with compensation

Key primitive: WithStepMaxRetries(3)

Fun detail: ~10% simulated failure rate
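The saga's shape (assign, retry the external call, compensate if it never succeeds) is sketched below with plain functions. Checkpointing is left out, and withRetries assumes one initial attempt plus up to maxRetries retries, which may not match the exact semantics of WithStepMaxRetries. The names roleTable and provisionSaga are hypothetical.

```go
package main

import (
	"errors"
	"fmt"
)

// withRetries calls fn once and then up to maxRetries more times,
// returning the last error if every attempt fails.
func withRetries(maxRetries int, fn func() error) error {
	var err error
	for attempt := 0; attempt <= maxRetries; attempt++ {
		if err = fn(); err == nil {
			return nil
		}
	}
	return err
}

// roleTable is a hypothetical stand-in for the user_roles table.
type roleTable map[string]bool

// provisionSaga assigns the role locally, tries the external system, and
// revokes the local assignment if the external step never succeeds.
func provisionSaga(db roleTable, user string, external func() error) (status string, audit []string) {
	db[user] = true // step 1: assign role locally
	audit = append(audit, "assigned")

	if err := withRetries(3, external); err != nil { // step 2: external system
		delete(db, user) // compensation: undo step 1
		audit = append(audit, "revoked: "+err.Error())
		return "FAILED, rolled back", audit
	}

	audit = append(audit, "provisioned") // step 3: audit log
	return "PROVISIONED", audit
}

// demo runs one saga against a flaky system and one against a dead one.
func demo() (okStatus, failStatus string, bobStillAssigned bool) {
	calls := 0
	flaky := func() error { // fails twice, then succeeds
		calls++
		if calls < 3 {
			return errors.New("transient error")
		}
		return nil
	}
	down := func() error { return errors.New("external system unavailable") }

	okStatus, _ = provisionSaga(roleTable{}, "alice", flaky)
	db := roleTable{}
	failStatus, _ = provisionSaga(db, "bob", down)
	return okStatus, failStatus, db["bob"]
}

func main() {
	fmt.Println(demo()) // PROVISIONED FAILED, rolled back false
}
```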

Access Expiry

Automated time-based revocation

SCHEDULED

Runs every hour to find and revoke expired role assignments. Each revocation is a separate durable step — partial progress is never lost on crash.

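expiry.go (Go)

dbos.RegisterWorkflow(ctx, AccessExpiryWorkflow,
    dbos.WithSchedule("0 0 * * * *"))  // six-field cron (seconds first): top of every hour

func AccessExpiryWorkflow(ctx, scheduledTime) {
    // RunAsStep returns (result, error); error handling omitted for brevity
    expired, _ := dbos.RunAsStep(ctx, FindExpiredUserRoles)

    for _, ur := range expired {
        // ur identifies the assignment each step acts on (argument passing omitted)
        dbos.RunAsStep(ctx, RevokeUserRole)     // individually checkpointed
        dbos.RunAsStep(ctx, AuditAutoExpiry)     // individually checkpointed
    }
}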

Schedule: Every hour

Recovery: Missed runs on startup
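Catching up on missed runs amounts to listing every scheduled tick between the last run that executed and the present. The sketch below does this for an hourly schedule; missedRuns is a hypothetical helper, not a DBOS function, and the real scheduler's bookkeeping is not shown.

```go
package main

import (
	"fmt"
	"time"
)

// missedRuns lists every top-of-the-hour tick after lastRun, up to and
// including now. lastRun is the last scheduled time that actually executed.
func missedRuns(lastRun, now time.Time) []time.Time {
	var runs []time.Time
	for next := lastRun.Truncate(time.Hour).Add(time.Hour); !next.After(now); next = next.Add(time.Hour) {
		runs = append(runs, next)
	}
	return runs
}

// demo models a server that last ran the job at 10:00 and came back at 12:30.
func demo() []string {
	lastRun := time.Date(2025, 1, 6, 10, 0, 0, 0, time.UTC)
	now := time.Date(2025, 1, 6, 12, 30, 0, 0, time.UTC)
	var out []string
	for _, t := range missedRuns(lastRun, now) {
		out = append(out, t.Format("15:04"))
	}
	return out
}

func main() {
	fmt.Println(demo()) // [11:00 12:00]
}
```

Each returned time would be passed to the workflow as its scheduledTime, so a run knows which window it is covering even when it executes late.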

Weekly Review

Compliance reporting

SCHEDULED

Every Monday at 9 AM, generates a compliance report: total requests, pending/approved/denied/expired counts, and flags stale requests older than 14 days.

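review.go (Go)

dbos.RegisterWorkflow(ctx, WeeklyAccessReviewWorkflow,
    dbos.WithSchedule("0 0 9 * * 1"))  // six-field cron (seconds first): Mondays at 09:00

func WeeklyAccessReviewWorkflow(ctx, scheduledTime) {
    // RunAsStep returns (result, error); error handling omitted for brevity
    requests, _ := dbos.RunAsStep(ctx, ListAllAccessRequests)
    stale, _    := dbos.RunAsStep(ctx, CountStalePending)
    // The counts from requests and stale go into the final audit entry
    dbos.RunAsStep(ctx, WriteReviewAuditLog)
}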

Schedule: Monday 9 AM

Purpose: Compliance
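The report itself is a straightforward aggregation. The sketch below counts requests by status and flags pending ones older than 14 days; accessRequest, report, and buildReport are hypothetical names, and the status strings are assumed from the description above rather than taken from the project's schema.

```go
package main

import (
	"fmt"
	"time"
)

// accessRequest is a hypothetical, trimmed-down row from access_requests.
type accessRequest struct {
	Status    string // "PENDING", "APPROVED", "DENIED", or "EXPIRED"
	CreatedAt time.Time
}

// report holds the figures the weekly review writes to the audit log.
type report struct {
	Total        int
	ByStatus     map[string]int
	StalePending int
}

// buildReport counts requests per status and flags pending requests
// created more than 14 days before now.
func buildReport(reqs []accessRequest, now time.Time) report {
	r := report{ByStatus: map[string]int{}}
	cutoff := now.AddDate(0, 0, -14)
	for _, req := range reqs {
		r.Total++
		r.ByStatus[req.Status]++
		if req.Status == "PENDING" && req.CreatedAt.Before(cutoff) {
			r.StalePending++
		}
	}
	return r
}

// demo builds a report for four sample requests.
func demo() report {
	now := time.Date(2025, 1, 20, 9, 0, 0, 0, time.UTC) // a Monday, 9 AM
	daysAgo := func(n int) time.Time { return now.AddDate(0, 0, -n) }
	return buildReport([]accessRequest{
		{Status: "PENDING", CreatedAt: daysAgo(30)},
		{Status: "PENDING", CreatedAt: daysAgo(2)},
		{Status: "APPROVED", CreatedAt: daysAgo(40)},
		{Status: "DENIED", CreatedAt: daysAgo(1)},
	}, now)
}

func main() {
	r := demo()
	fmt.Println(r.Total, r.ByStatus["PENDING"], r.StalePending) // 4 2 1
}
```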

DBOS Primitives Reference

Every DBOS primitive the workflows in this project use. Together with startup calls such as dbos.Launch(), these six are the entire API surface the project touches.

Primitive	Signature	What it does	Used for
RunAsStep	dbos.RunAsStep(ctx, fn)	Checkpointed step. Result saved to PostgreSQL; replayed on recovery instead of re-executing.	Every DB query, external call, and audit write in every workflow
Recv	dbos.Recv[T](ctx, topic, timeout)	Durable wait. Suspends workflow until a message arrives or timeout expires. Survives restarts.	Approval workflow: waiting for human decisions
Send	dbos.Send(ctx, wfID, msg, topic)	Delivers a message to a specific workflow's Recv call. The API handler uses this to push decisions.	HTTP handler → running approval workflow
RunWorkflow	dbos.RunWorkflow(ctx, fn, input)	Starts a child workflow with independent checkpoints. Parent can await the result.	Approval spawns Provisioning after all approvals pass
WithStepMaxRetries	dbos.WithStepMaxRetries(n)	Automatic retries for a step. Retries up to n times before propagating the error.	External provisioning: 3 retries for transient failures
WithSchedule	dbos.WithSchedule("cron")	Cron-based scheduling. DBOS deduplicates runs and executes missed invocations on recovery.	Hourly expiry + Monday 9 AM weekly review

What Happens When Things Crash?

Three real failure scenarios and exactly how DBOS recovers.

Server crashes while waiting for Approver B

Server restarts, dbos.Launch() scans for incomplete workflows

Finds the approval workflow in PENDING state

Re-invokes ApprovalWorkflow with the original inputs

Steps 1-3 (load, link, Approver A): checkpoints found, replayed instantly

Step 4 (Approver B wait): no checkpoint, resumes Recv

Workflow is back exactly where it was. No data loss. No duplicate operations.

External provisioning fails after 3 retries

Saga Step 1 (assign role in local DB): succeeded, checkpointed

Saga Step 2 (external system): fails 3 times via WithStepMaxRetries(3)

Error propagates — saga enters compensation path

Compensation: RevokeUserRole runs as a checkpointed step

Workflow returns FAILED — parent marks request as denied

Local DB assignment is cleanly rolled back. No inconsistent state.

Server crashes during saga compensation

Step 1 (AssignUserRole): checkpointed

Step 2 (ProvisionExternal): failed, checkpointed as ERROR

Compensation (RevokeUserRole): crash mid-execution

Server restarts — DBOS replays steps 1 and 2 from checkpoints

Re-enters compensation branch — RevokeUserRole completes

Compensation completes even through a crash. System is consistent.
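The third scenario depends on failures being checkpointed too: on replay the recorded error is returned again, so the workflow takes the same branch without calling the external system a second time. The toy model below shows that behaviour. The journal, outcome, step, and saga names are hypothetical, and a map stands in for the checkpoint table.

```go
package main

import (
	"errors"
	"fmt"
)

// outcome is a hypothetical checkpoint row: a value, or an error message.
type outcome struct {
	Value string
	Err   string // empty means the step succeeded
}

type journal map[string]outcome

// step replays a recorded outcome (success or failure) if one exists;
// otherwise it runs fn and records whatever fn returned.
func step(j journal, name string, fn func() (string, error)) (string, error) {
	if o, ok := j[name]; ok {
		if o.Err != "" {
			return "", errors.New(o.Err)
		}
		return o.Value, nil
	}
	v, err := fn()
	o := outcome{Value: v}
	if err != nil {
		o.Err = err.Error()
	}
	j[name] = o
	return v, err
}

// saga mirrors the three-step scenario. calls counts real executions.
// crashInRevoke simulates the process dying before revoke is checkpointed.
func saga(j journal, calls map[string]int, crashInRevoke bool) (status string, crashed bool) {
	run := func(name string, fail bool) error {
		_, err := step(j, name, func() (string, error) {
			calls[name]++
			if fail {
				return "", errors.New(name + " failed")
			}
			return "ok", nil
		})
		return err
	}
	run("assign", false)
	if err := run("provision", true); err != nil {
		if crashInRevoke {
			calls["revoke"]++ // revoke started but was never checkpointed
			return "", true
		}
		run("revoke", false)
		return "FAILED, rolled back", false
	}
	return "PROVISIONED", false
}

// demo crashes once during compensation, then recovers with the same journal.
func demo() (string, map[string]int) {
	j, calls := journal{}, map[string]int{}
	saga(j, calls, true)
	status, _ := saga(j, calls, false)
	return status, calls
}

func main() {
	status, calls := demo()
	fmt.Println(status)                                               // FAILED, rolled back
	fmt.Println(calls["assign"], calls["provision"], calls["revoke"]) // 1 1 2
}
```

The revoke count of 2 is the interrupted attempt plus the one that completed, which is why the compensation step itself should be safe to run twice.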

What Would Break Without DBOS?

Server restarts during approval

Without DBOS

Workflow state lost. Need polling-based state machine to resume.

With DBOS

Workflow auto-resumes from checkpoint. Zero code needed.

Crash between local assign and external provision

Without DBOS

Inconsistent state. User has local role but no external access.

With DBOS

Saga compensation rolls back local assignment.

Server down for 2 hours during scheduled expiry

Without DBOS

Missed cron runs. Expired access stays active until next window.

With DBOS

Missed runs execute on recovery. No window is skipped.

Crash during audit log write

Without DBOS

Silent data loss. Action happened but no record exists.

With DBOS

Audit step retries on recovery. Action is always recorded.

Under the Hood: DBOS System Tables

DBOS creates its own tables in PostgreSQL alongside the application tables. No separate database or infrastructure required.

Database schema

PostgreSQL
├── Application Tables
│   ├── users
│   ├── roles
│   ├── access_requests
│   ├── approval_steps
│   ├── user_roles
│   └── audit_logs
│
└── DBOS System Tables (managed automatically)
    ├── dbos.workflow_status    ← tracks active/completed workflows
    ├── dbos.workflow_inputs    ← serialized inputs for replay
    ├── dbos.operation_outputs  ← checkpointed step results
    └── dbos.notifications      ← durable Send/Recv message queue

Recovery on startup

When dbos.Launch() is called, DBOS queries workflow_status for any workflows marked as PENDING. For each unfinished workflow, it re-invokes the workflow function with the original inputs, replays completed steps from checkpoints, and resumes at the first un-checkpointed step.
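The startup scan described above can be pictured as a loop over status rows. The sketch below is a simplified illustration, not DBOS code: workflowRow, registry, and recoverPending are hypothetical, the status strings other than PENDING are assumptions, and step replay inside the re-invoked function is not modelled.

```go
package main

import "fmt"

// workflowRow is a hypothetical, trimmed-down view of a workflow status row
// joined with its stored input.
type workflowRow struct {
	ID     string
	Name   string
	Status string // this sketch only distinguishes "PENDING" from everything else
	Input  string
}

// registry maps a workflow name to its function, as registration would.
type registry map[string]func(input string)

// recoverPending re-invokes every PENDING workflow with its stored input
// and returns the IDs it resumed, in table order.
func recoverPending(rows []workflowRow, reg registry) []string {
	var resumed []string
	for _, row := range rows {
		fn, known := reg[row.Name]
		if row.Status != "PENDING" || !known {
			continue
		}
		fn(row.Input) // completed steps would replay from checkpoints here
		resumed = append(resumed, row.ID)
	}
	return resumed
}

// demo recovers a table with one finished and two unfinished workflows.
func demo() (resumed []string, inputs []string) {
	reg := registry{"approval": func(in string) { inputs = append(inputs, in) }}
	rows := []workflowRow{
		{ID: "wf-1", Name: "approval", Status: "SUCCESS", Input: "req-1"},
		{ID: "wf-2", Name: "approval", Status: "PENDING", Input: "req-2"},
		{ID: "wf-3", Name: "approval", Status: "PENDING", Input: "req-3"},
	}
	resumed = recoverPending(rows, reg)
	return resumed, inputs
}

func main() {
	resumed, inputs := demo()
	fmt.Println(resumed) // [wf-2 wf-3]
	fmt.Println(inputs)  // [req-2 req-3]
}
```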

Why Not Just Use...?

Durable execution versus related approaches in the distributed systems landscape: event-driven architecture, BPM engines, and Temporal.

Tech Stack

Frontend

SvelteKit 2, Tailwind CSS 4, TypeScript

Backend

Go 1.23+, Chi router, pgx

Durable Execution

DBOS Transact Go SDK

Database

PostgreSQL 16

Auth

JWT (simplified for POC)

Infrastructure

Docker Compose, DBOS Cloud, Cloudflare Pages

Learn More

DBOS Documentation, DBOS Go Tutorial, DBOS Go SDK on GitHub
