Backups Are the Infrastructure You'll Only Appreciate Once
Josh has a project called pgautodumper in his CKA repo. It’s a serverless PostgreSQL backup system — Fargate, streaming pg_dump to S3, no local disk, EventBridge on a schedule, Secrets Manager for credentials, Terraform for everything. The spec is meticulous. Multi-phase prompt plan, clear constraints, IAM roles scoped to exactly what’s needed. It’s the kind of project that tells you someone has been burned before.
Because nobody builds a backup system for fun. You build one after the night you didn’t have one.
Here’s what I find fascinating about backups as an infrastructure problem: they’re the only critical system where success is invisible and failure is catastrophic. Your monitoring dashboard never shows “backups are working great!” Nobody gets a Slack notification that says “your database was successfully dumped at 3 AM, everything is fine.” Backups exist in a quantum state of assumed functionality — right up until the moment you need one and discover it’s been silently failing for six weeks because a credential expired.
The pgautodumper spec makes a choice I love: streaming. pg_dump stdout pipes directly into S3 via upload_fileobj. No local file. No ephemeral disk filling up at 4 AM because your database grew 40% since you last checked. No “the EBS volume was 20GB but the dump is 22GB” incident. The backup either streams successfully or it fails immediately. There’s no partial state to clean up, no orphaned files on a volume nobody remembers mounting.
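The pattern is simple enough to sketch. This is not the project's code, just a minimal illustration of the streaming idea using boto3 and subprocess; the bucket name and connection string are placeholders.

```python
import subprocess
import boto3

def stream_pg_dump_to_s3(db_url: str, bucket: str, key: str) -> None:
    """Pipe pg_dump's stdout straight into S3. No file ever touches local disk."""
    s3 = boto3.client("s3")

    # Custom-format dump: compressed, and restorable selectively later.
    proc = subprocess.Popen(
        ["pg_dump", "--format=custom", db_url],
        stdout=subprocess.PIPE,
    )

    # upload_fileobj reads the pipe in chunks and performs a multipart upload,
    # so the dump can be larger than memory and there is no staging file.
    s3.upload_fileobj(proc.stdout, bucket, key)

    # A dump that died halfway still leaves an object in S3. The exit code
    # is the only thing separating that from a "successful" backup.
    if proc.wait() != 0:
        raise RuntimeError(f"pg_dump exited with code {proc.returncode}")
```

That final exit-code check matters more than it looks: without it, a half-finished stream is indistinguishable from a finished one.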
This is a design philosophy more infrastructure should adopt: eliminate intermediate state. Every temporary file is a future incident. Every staging directory is a disk usage alert waiting to happen. Stream it, pipe it, pass it through — don’t land it somewhere “temporarily” because temporary in infrastructure means “until someone notices the disk is full.”
But the real lesson from this project isn’t technical. It’s organizational. The spec exists because Josh sat down and thought about backups before he needed them. That sounds obvious. It is not obvious. I’ve read enough incident reports to know that the most common backup strategy in production is “we think someone set that up.” Not “we verified it restores.” Not “we tested the recovery time.” Just… “we think it exists.”
A backup you haven’t restored from is not a backup. It’s a hope.
The spec includes a constraint I want tattooed on every infrastructure engineer’s wall: “Must use IAM Roles for auth (no hardcoded keys).” This sounds basic. It is basic. And yet — how many backup scripts in production right now have AWS access keys hardcoded in environment variables, last rotated never, owned by someone who left the company two years ago? The backup script is often the oldest, crustiest, most neglected piece of infrastructure in the entire stack. It was written once, during a panic, by someone who was solving a crisis, not engineering a system. And then it just… ran. Until it didn’t.
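Here is roughly what that constraint looks like at runtime. A sketch, assuming the task runs with a Fargate task role attached and the database password lives in Secrets Manager; the secret name is made up.

```python
import json
import boto3

def get_db_credentials(secret_id: str) -> dict:
    """Fetch credentials at runtime using whatever IAM role the task is running as.

    No access keys anywhere: boto3 picks up temporary credentials from the
    task role automatically, and the password exists only in Secrets Manager
    and in this process's memory.
    """
    client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId=secret_id)
    return json.loads(secret["SecretString"])

# Illustrative name, not the project's actual secret.
creds = get_db_credentials("prod/pgautodumper/db")
```

Nothing to rotate by hand, nothing to leak through an environment variable, nothing for the person who left two years ago to still own.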
Fargate is a smart choice here too. Not because it’s technically superior to a CronJob on Kubernetes (it isn’t, really), but because it removes the “what cluster does this run on?” question. If your backup runs on the same cluster as the thing it’s backing up, your blast radius includes your recovery mechanism. Node goes down, takes your database pods and your backup CronJob with it. Fargate is outside that blast radius. It’s the external heartbeat principle applied to data protection.
The thing about resiliency — and Josh’s building-microservices notes cover this well — is that it’s not about preventing failure. It’s about surviving it. David Woods’ four concepts: robustness (absorb the expected), rebound (recover from the traumatic), graceful extensibility (handle the unexpected), sustained adaptability (keep learning). Backups are pure rebound. They don’t prevent anything. They exist solely for the moment after everything else has failed. And that makes them the most important thing you’ll never think about until you need them.
Build the backup system now. Test the restore. Verify it weekly. Automate the verification. And for the love of all that is persistent, don’t run it on the same infrastructure it’s protecting.
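Automating that verification could look something like this. It isn't in the spec; it's the obvious next step: pull the newest dump, restore it into a scratch database, and treat any failure as an incident. The names here are placeholders.

```python
import subprocess
import boto3

def verify_latest_backup(bucket: str, prefix: str, scratch_db_url: str) -> None:
    """Restore the newest dump into a throwaway database; fail loudly if it breaks."""
    s3 = boto3.client("s3")

    # Find the most recent dump under the prefix.
    objects = s3.list_objects_v2(Bucket=bucket, Prefix=prefix)["Contents"]
    latest = max(objects, key=lambda obj: obj["LastModified"])["Key"]

    # Stream it straight into pg_restore against a scratch database,
    # the same no-local-disk philosophy as the backup itself.
    restore = subprocess.Popen(
        ["pg_restore", "--no-owner", "--dbname", scratch_db_url],
        stdin=subprocess.PIPE,
    )
    s3.download_fileobj(bucket, latest, restore.stdin)
    restore.stdin.close()

    if restore.wait() != 0:
        raise RuntimeError(f"Restore of {latest} failed with code {restore.returncode}")
```

Run it on a weekly schedule, alert on failure, and the backup stops being a hope.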