Architecture & Durable Execution
How AccessOps uses DBOS to make workflows crash-proof — and why that matters.
The Problem: Partial Failure
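Consider what happens when a user requests access to a sensitive role. The system must execute multiple steps, such as collecting approvals, assigning the role, provisioning it in an external system, and writing an audit log, and each of them can fail independently.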
This is the partial failure problem. Traditional solutions all push reliability into application code:
Can't span external API calls or hold transactions open for days while waiting for human approvals.
Need dead-letter queues, idempotency, manual state machines, and retry logic for every workflow.
Massive boilerplate. Every workflow becomes its own mini-framework with polling, error handling, and recovery.
Complex to build correctly. Easy to miss edge cases. Compensation logic is error-prone and hard to test.
Durable Execution: The Key Insight
Durable execution is a programming model where workflows are written as ordinary functions, but the runtime guarantees they will complete despite crashes, restarts, or failures. You write linear code. DBOS gives you the reliability of a distributed state machine.
Every step is saved
Each RunAsStep saves its result to PostgreSQL. On crash, the workflow resumes from the last completed step — not from the beginning.
Waits survive restarts
A workflow can sleep for 7 days waiting for a human approval. If the server restarts 50 times, the workflow picks up exactly where it left off.
No duplicate side effects
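Completed steps are never re-executed on recovery. Their recorded results are replayed from PostgreSQL, so a completed step's side effects are not repeated. A step that was interrupted mid-execution has no checkpoint and runs again, so steps with external side effects should be idempotent.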
No extra infrastructure
No message queue, no Redis, no separate workflow engine. DBOS uses the PostgreSQL database you already have for checkpointing.
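To make the checkpoint-and-replay idea concrete, here is a minimal sketch in plain Go. It is a toy model, not the DBOS API: `checkpointStore`, `runAsStep`, and `simulate` are hypothetical names, and an in-memory map stands in for the PostgreSQL table that would hold step results.

```go
package main

import "fmt"

// checkpointStore is a hypothetical in-memory stand-in for the table where
// step results are persisted; a real runtime would write to PostgreSQL.
type checkpointStore map[string]string

// runAsStep is a toy helper (not the DBOS API). If the step already has a
// recorded result, it returns that result without calling fn. Otherwise it
// runs fn and records the result before returning it.
func runAsStep(store checkpointStore, stepID string, fn func() string) string {
	if out, ok := store[stepID]; ok {
		return out // replayed from the checkpoint
	}
	out := fn()
	store[stepID] = out
	return out
}

// simulate runs a two-step workflow twice against the same store. The first
// run "crashes" after step 0; the second run plays the role of recovery.
// It returns how many times a step body actually executed.
func simulate() int {
	store := checkpointStore{}
	executions := 0
	workflow := func(crashAfterFirstStep bool) {
		runAsStep(store, "wf-1/step-0", func() string {
			executions++
			return "role assigned"
		})
		if crashAfterFirstStep {
			return // simulated crash before step 1
		}
		runAsStep(store, "wf-1/step-1", func() string {
			executions++
			return "audit written"
		})
	}
	workflow(true)  // step 0 executes, then the process "dies"
	workflow(false) // step 0 is replayed, step 1 executes
	return executions
}

func main() {
	fmt.Println("step bodies executed:", simulate())
}
```

Each step body runs once even though the workflow function itself runs twice. That is the property the cards above describe.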
System Architecture
AccessOps is a three-layer system. The frontend talks to the REST API, which orchestrates durable workflows backed by PostgreSQL.
The Four Workflows
Approval Workflow
    func ApprovalWorkflow(ctx, input) {
        steps := dbos.RunAsStep(ctx, LoadApprovalSteps) // checkpointed
        dbos.RunAsStep(ctx, LinkWorkflowToRequest)      // checkpointed

        for _, step := range steps {
            // Durable wait: survives any number of restarts
            decision := dbos.Recv[Decision](ctx, "approval_decision", 7 * 24 * time.Hour)
            if timeout → auto-escalate, continue
            if denied → mark DENIED, return
        }

        // All approved → spawn child workflow
        dbos.RunWorkflow(ctx, ProvisioningWorkflow, input)
    }
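One way to picture a wait that survives restarts is to treat both the message and the timeout deadline as stored state rather than in-memory timers. The sketch below is a toy model under that assumption, not how DBOS implements `Recv`. The `mailbox` type and the `send`, `poll`, and `simulateWait` functions are hypothetical, and maps stand in for database tables.

```go
package main

import (
	"fmt"
	"time"
)

// mailbox is a hypothetical stand-in for durable storage: delivered messages
// and wait deadlines would both live in the database.
type mailbox struct {
	messages  map[string]string    // topic -> delivered message
	deadlines map[string]time.Time // topic -> when the wait times out
}

func newMailbox() *mailbox {
	return &mailbox{messages: map[string]string{}, deadlines: map[string]time.Time{}}
}

// send records a message for a topic.
func (m *mailbox) send(topic, msg string) { m.messages[topic] = msg }

// poll reports the state of a wait at time now. The deadline is recorded on
// the first call only, so polling again after a restart does not extend it.
func (m *mailbox) poll(topic string, timeout time.Duration, now time.Time) (string, string) {
	if _, ok := m.deadlines[topic]; !ok {
		m.deadlines[topic] = now.Add(timeout)
	}
	if msg, ok := m.messages[topic]; ok {
		delete(m.messages, topic)
		return msg, "received"
	}
	if !now.Before(m.deadlines[topic]) {
		return "", "timeout"
	}
	return "", "waiting"
}

// simulateWait polls once per simulated day, as if the server restarted
// daily. A decision is sent on decisionDay; pass -1 to never send one.
func simulateWait(decisionDay int) string {
	start := time.Date(2025, 1, 6, 9, 0, 0, 0, time.UTC)
	mb := newMailbox()
	for day := 0; day <= 10; day++ {
		if day == decisionDay {
			mb.send("approval_decision", "APPROVED")
		}
		now := start.Add(time.Duration(day) * 24 * time.Hour)
		msg, status := mb.poll("approval_decision", 7*24*time.Hour, now)
		switch status {
		case "received":
			return fmt.Sprintf("%s on day %d", msg, day)
		case "timeout":
			return fmt.Sprintf("timeout on day %d", day)
		}
	}
	return "still waiting"
}

func main() {
	fmt.Println(simulateWait(4))  // a decision arrives mid-week
	fmt.Println(simulateWait(-1)) // nobody responds
}
```

Because the deadline is fixed at the first poll, repeated restarts neither lose the pending decision nor reset the 7-day clock.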
Provisioning Saga
    func ProvisioningWorkflow(ctx, input) {
        // Saga Step 1: assign role in local DB
        dbos.RunAsStep(ctx, AssignUserRole)

        // Saga Step 2: provision in external system
        _, err := dbos.RunAsStep(ctx, ProvisionExternal, dbos.WithStepMaxRetries(3))
        if err != nil {
            // ↩ Compensate: roll back step 1
            dbos.RunAsStep(ctx, RevokeUserRole) // also checkpointed!
            return "FAILED — rolled back"
        }

        // Saga Step 3: audit log
        dbos.RunAsStep(ctx, WriteAuditLog)
    }
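The compensation logic in the saga can be isolated from DBOS entirely. The following sketch shows only the control flow, assuming a hypothetical `roleDB` map in place of the `user_roles` table and two hypothetical stand-ins for the external system (`failingExternal`, `workingExternal`). Checkpointing and retries are deliberately left out.

```go
package main

import (
	"errors"
	"fmt"
)

// roleDB is a hypothetical stand-in for the local user_roles table.
type roleDB map[string]bool

// failingExternal and workingExternal simulate the external system.
func failingExternal(user string) error { return errors.New("external system unavailable") }
func workingExternal(user string) error { return nil }

// provision runs the saga: assign the role locally, call the external
// system, and undo the local assignment if the external call fails.
func provision(db roleDB, user string, external func(string) error) string {
	db[user] = true // step 1: assign role locally
	if err := external(user); err != nil {
		delete(db, user) // compensation: revoke what step 1 did
		return "FAILED: rolled back"
	}
	return "PROVISIONED"
}

func main() {
	db := roleDB{}
	fmt.Println(provision(db, "alice", failingExternal), "has local role:", db["alice"])
	fmt.Println(provision(db, "bob", workingExternal), "has local role:", db["bob"])
}
```

After a failed run the user holds no local role, so the local database and the external system agree again.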
Access Expiry
    dbos.RegisterWorkflow(ctx, AccessExpiryWorkflow, dbos.WithSchedule("0 0 * * * *")) // every hour

    func AccessExpiryWorkflow(ctx, scheduledTime) {
        expired := dbos.RunAsStep(ctx, FindExpiredUserRoles)
        for _, ur := range expired {
            dbos.RunAsStep(ctx, RevokeUserRole)  // individually checkpointed
            dbos.RunAsStep(ctx, AuditAutoExpiry) // individually checkpointed
        }
    }
Weekly Review
    dbos.RegisterWorkflow(ctx, WeeklyAccessReviewWorkflow, dbos.WithSchedule("0 0 9 * * 1")) // Monday 9 AM

    func WeeklyAccessReviewWorkflow(ctx, scheduledTime) {
        requests := dbos.RunAsStep(ctx, ListAllAccessRequests)
        stale := dbos.RunAsStep(ctx, CountStalePending)
        dbos.RunAsStep(ctx, WriteReviewAuditLog)
    }
DBOS Primitives Reference
These six primitives are the entire DBOS API surface used in this project.
| Primitive | What it does | Used for |
|---|---|---|
| RunAsStep `dbos.RunAsStep(ctx, fn)` | Checkpointed step. Result saved to PostgreSQL; replayed on recovery instead of re-executing. | Every DB query, external call, and audit write in every workflow |
| Recv `dbos.Recv[T](ctx, topic, timeout)` | Durable wait. Suspends workflow until a message arrives or timeout expires. Survives restarts. | Approval workflow: waiting for human decisions |
| Send `dbos.Send(ctx, wfID, msg, topic)` | Delivers a message to a specific workflow's Recv call. The API handler uses this to push decisions. | HTTP handler → running approval workflow |
| RunWorkflow `dbos.RunWorkflow(ctx, fn, input)` | Starts a child workflow with independent checkpoints. Parent can await the result. | Approval spawns Provisioning after all approvals pass |
| WithStepMaxRetries `dbos.WithStepMaxRetries(n)` | Automatic retries for a step. Retries up to n times before propagating the error. | External provisioning: 3 retries for transient failures |
| WithSchedule `dbos.WithSchedule("cron")` | Cron-based scheduling. DBOS deduplicates runs and executes missed invocations on recovery. | Hourly expiry + Monday 9 AM weekly review |
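The retry behavior described for WithStepMaxRetries can be sketched as a small loop. This is an illustration, not the SDK's implementation: `withRetries` and `errUnavailable` are hypothetical names, there is no backoff, and the sketch assumes that "n retries" means n attempts in addition to the first one. Check the SDK documentation for its exact counting.

```go
package main

import (
	"errors"
	"fmt"
)

// errUnavailable is a hypothetical transient error from an external system.
var errUnavailable = errors.New("external system unavailable")

// withRetries calls fn once and then retries it up to maxRetries more times.
// It returns the number of attempts made and the last error, which is nil
// as soon as one attempt succeeds.
func withRetries(maxRetries int, fn func() error) (attempts int, err error) {
	for attempts = 1; ; attempts++ {
		if err = fn(); err == nil || attempts > maxRetries {
			return attempts, err
		}
	}
}

func main() {
	// An operation that never succeeds: the error propagates after the
	// first attempt plus three retries.
	attempts, err := withRetries(3, func() error { return errUnavailable })
	fmt.Println("attempts:", attempts, "error:", err)

	// An operation that succeeds on its third call.
	calls := 0
	attempts, err = withRetries(3, func() error {
		calls++
		if calls < 3 {
			return errUnavailable
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "error:", err)
}
```

In the provisioning saga, the propagated error is what triggers the compensation step.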
What Happens When Things Crash?
Three real failure scenarios and exactly how DBOS recovers.
Server crashes while waiting for Approver B
External provisioning fails after 3 retries
Server crashes during saga compensation
What Would Break Without DBOS?
| Without DBOS | With DBOS |
|---|---|
| Workflow state lost. Need polling-based state machine to resume. | Workflow auto-resumes from checkpoint. Zero code needed. |
| Inconsistent state. User has local role but no external access. | Saga compensation rolls back local assignment. |
| Missed cron runs. Expired access stays active until next window. | Missed runs execute on recovery. No window is skipped. |
| Silent data loss. Action happened but no record exists. | Audit step retries on recovery. Action is always recorded. |
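The missed-run guarantee for scheduled workflows amounts to persisting the last executed slot and enumerating every slot since then on startup. The sketch below does this for an hourly schedule only. `missedHourlyRuns` is a hypothetical helper, not part of DBOS, and it works with RFC 3339 timestamps in UTC.

```go
package main

import (
	"fmt"
	"time"
)

// missedHourlyRuns lists every top-of-the-hour slot that falls after lastRun
// and at or before now. Both arguments and all results are RFC 3339
// timestamps; results are reported in UTC.
func missedHourlyRuns(lastRun, now string) ([]string, error) {
	last, err := time.Parse(time.RFC3339, lastRun)
	if err != nil {
		return nil, err
	}
	end, err := time.Parse(time.RFC3339, now)
	if err != nil {
		return nil, err
	}
	var slots []string
	// First candidate: the hour boundary following the last executed slot.
	next := last.UTC().Truncate(time.Hour).Add(time.Hour)
	for ; !next.After(end); next = next.Add(time.Hour) {
		slots = append(slots, next.Format(time.RFC3339))
	}
	return slots, nil
}

func main() {
	// The server last ran the 10:00 slot and comes back up at 13:30.
	slots, err := missedHourlyRuns("2025-01-06T10:00:00Z", "2025-01-06T13:30:00Z")
	if err != nil {
		fmt.Println("parse error:", err)
		return
	}
	for _, s := range slots {
		fmt.Println("run missed slot:", s)
	}
}
```

A scheduler that runs each returned slot once, keyed by its timestamp, skips no window and also does not run any window twice.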
Under the Hood: DBOS System Tables
DBOS creates its own tables in PostgreSQL alongside the application tables. No separate database or infrastructure required.
    PostgreSQL
    ├── Application Tables
    │   ├── users
    │   ├── roles
    │   ├── access_requests
    │   ├── approval_steps
    │   ├── user_roles
    │   └── audit_logs
    │
    └── DBOS System Tables (managed automatically)
        ├── dbos.workflow_status    ← tracks active/completed workflows
        ├── dbos.workflow_inputs    ← serialized inputs for replay
        ├── dbos.operation_outputs  ← checkpointed step results
        └── dbos.notifications      ← durable Send/Recv message queue
Recovery on startup
When dbos.Launch() is called, DBOS queries workflow_status for any workflows marked as PENDING. For each unfinished workflow, it re-invokes the workflow function with the original inputs, replays completed steps from checkpoints, and resumes at the first un-checkpointed step.
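The startup scan can be sketched as a loop over status rows. This is a simplified model, not DBOS internals: `workflowRow` and `recoverPending` are hypothetical, a slice stands in for the `workflow_status` and `workflow_inputs` tables, and replaying checkpointed steps inside each workflow is omitted.

```go
package main

import "fmt"

// workflowRow is a hypothetical, simplified row combining workflow status
// with the serialized original input.
type workflowRow struct {
	ID     string
	Status string // "PENDING" or "SUCCESS"
	Input  string
}

// recoverPending re-invokes run for every PENDING row with its original
// input, marks the row SUCCESS, and returns the recovered IDs in order.
func recoverPending(rows []workflowRow, run func(input string)) []string {
	var recovered []string
	for i := range rows {
		if rows[i].Status != "PENDING" {
			continue // finished workflows are left alone
		}
		run(rows[i].Input)
		rows[i].Status = "SUCCESS"
		recovered = append(recovered, rows[i].ID)
	}
	return recovered
}

func main() {
	rows := []workflowRow{
		{ID: "wf-1", Status: "SUCCESS", Input: "request-41"},
		{ID: "wf-2", Status: "PENDING", Input: "request-42"},
		{ID: "wf-3", Status: "PENDING", Input: "request-43"},
	}
	recovered := recoverPending(rows, func(input string) {
		fmt.Println("re-invoking workflow with input", input)
	})
	fmt.Println("recovered:", recovered)
}
```

Running the scan a second time finds nothing to do, since the recovered rows are no longer PENDING.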
Why Not Just Use...?
Durable execution vs related approaches in the distributed systems landscape.