Lesson 1 · Resilience

The 3-2-1 Rule

Right now, every guest on your server has an RPO of infinity. Let's fix the worst one first.

Why this, first? Of the nine findings in your audit, this is the one that loses everything in a single bad moment — a dead SSD, a wrong pct destroy, a botched update. Backups are the floor under every other improvement we'll make. You can't safely harden a box you can't restore.

You've built 13 guests on one host. They are, today, completely unrecoverable: no vzdump jobs, no Proxmox Backup Server, an empty dump directory. This lesson turns that into a real number you choose on purpose — and gives you your first working backup before you close the tab.

Two numbers professionals actually use

Before "how do I back up," the discipline asks "how much can I afford to lose, and how fast must I be back?" Two terms, and they run the whole conversation:

RPO (Recovery Point Objective): the most data you can afford to lose, measured in time. Back up nightly and your RPO is one day: with a 3am backup, a crash at 2am costs you everything written since 3am the previous day. RPO is set by how often you back up.

RTO — Recovery Time Objective: the most downtime you can accept. Set by how you restore — and whether you've ever tested it. An untested backup has an unknown RTO, which is the same as no number at all.
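The RPO arithmetic can be made concrete with a small shell sketch. It assumes one backup per day at a fixed hour and works in whole hours on a 24-hour clock; the function name `hours_at_risk` is illustrative and not part of any Proxmox tooling.

```shell
# hours_at_risk BACKUP_HOUR FAILURE_HOUR
# Prints how many hours of changes are lost if the disk dies at FAILURE_HOUR
# and the only backup ran at BACKUP_HOUR (assumption: exactly one backup per day).
# Pass plain integers 0-23 without leading zeros.
hours_at_risk() {
  backup_hour=$1
  failure_hour=$2
  # Wrap around midnight so a failure "before" the backup hour counts from yesterday.
  echo $(( (failure_hour - backup_hour + 24) % 24 ))
}

hours_at_risk 3 15   # prints 12
```

The closer the failure lands to the next scheduled backup, the closer the loss gets to the full 24 hours, which is why a nightly job is described as an RPO of one day.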

In the field: RPO and RTO are written into every disaster-recovery plan and customer SLA you'll ever touch. "What's your RPO?" is a normal interview question for an infra role. Knowing yours — and that you chose it deliberately — is the whole difference between a hobbyist and someone who runs systems.

You back up every night at 3am. A disk dies at 2am. Roughly how much data is at risk?

The rule: 3-2-1

The backup world's one durable heuristic, coined by photographer Peter Krogh in 2005 and now the bedrock of enterprise DR:

3 copies of your data · on 2 different media · with 1 off-site.

Why each number earns its place: 3 copies because two can fail together; 2 media so one failure mode (a dead SSD, a bad batch) can't take both; 1 off-site because fire, theft, and ransomware don't respect the boundary of one box. The production data is copy #1 — so the rule really means "two backups, and get one of them out of the building."
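As a rough illustration, the rule can be written as three threshold checks. This is only a sketch: `check_321` is a hypothetical helper, and the counts are typed in by hand rather than discovered from the system. The example call uses the starting state described in this lesson, where the production data is the only copy.

```shell
# check_321 COPIES MEDIA OFFSITE
# Prints what is missing against the 3-2-1 rule; returns 0 only if all three hold.
check_321() {
  copies=$1; media=$2; offsite=$3
  missing=0
  [ "$copies" -ge 3 ]  || { echo "need 3 copies, have $copies"; missing=1; }
  [ "$media" -ge 2 ]   || { echo "need 2 media, have $media"; missing=1; }
  [ "$offsite" -ge 1 ] || { echo "need 1 off-site copy, have $offsite"; missing=1; }
  if [ "$missing" -eq 0 ]; then echo "3-2-1 satisfied"; fi
  return "$missing"
}

# Starting point: production data only (1 copy, 1 medium, nothing off-site).
check_321 1 1 0 || true
```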

The trap to avoid: a RAID/ZFS mirror is not a backup. Redundancy survives a drive dying — but it copies your mistakes and your ransomware to both disks the instant they happen. You need both, for different disasters.

A RAID mirror reliably protects you against which one of these?

Not everything deserves the same backup

The real skill isn't "back up everything"; it's tiering by value. Some guests hold tiny, irreplaceable state (years of Home Assistant history, your *arr configs and API keys, a Postgres DB). Others hold huge, fully re-downloadable media. Giving both the same RPO would be wasteful, and it isn't even possible here: a backup of the 485 GB media library won't fit in the 444 GB free on the same NVMe.

For each guest, ask: would losing it hurt, or could you rebuild it?

Your guests, tiered

HomeAssistant · automations & history · Back up daily

The *arr stack · configs, indexers, API keys · Back up daily

Terminus · PostgreSQL database · Back up daily

Tautulli · Plex watch history · Weekly is fine

Plex · server settings only · Weekly is fine

Media library · 485 GB on NVMe · Not via vzdump

See the shape? A handful of small, irreplaceable guests want a daily backup. The bulky media is a separate problem (file-level, off-host) we'll take later. That judgement — match protection to value — is exactly what capacity-conscious infra teams do.

Your tangible win: a real backup, today

Goal: a nightly vzdump job for your high-value guests, written to the NVMe — a different physical disk than the SSD they live on. That alone gets you from "one copy" to "two copies on two media," and takes your worst RPO from ∞ to 24 hours. It's non-destructive and fully reversible (a backup job creates files; it changes nothing on the guests).

Easiest path — the web UI:

Open the Proxmox UI → Datacenter → Backup → Add.
Storage: nvme-storage. Schedule: daily, e.g. 02:30. Mode: Snapshot.
Selection: pick the daily-tier guests — HomeAssistant (101), Radarr (104), Sonarr (105), Prowlarr (103), Lidarr (113), Listenarr (109), Terminus (112), Overseerr (106).
Retention: set keep-daily=7, keep-weekly=4 so it self-prunes.
Enable email notification to t.kilgour@gmail.com on failure — a silent backup is a lie waiting to happen.
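A quick sanity check on the retention numbers in the steps above: with keep-daily=7 and keep-weekly=4, each guest can hold up to 11 archives, and vzdump archives on plain directory storage are full copies rather than deduplicated increments. The sketch below divides the 444 GB of free NVMe across the eight selected guests to give a rough per-archive budget. `archive_budget_gb` is an illustrative name, and the result is integer gigabytes, rounded down.

```shell
# archive_budget_gb FREE_GB GUESTS KEEP_DAILY KEEP_WEEKLY
# Prints the average size (whole GB, rounded down) each archive may reach
# before the retained set would fill the free space.
archive_budget_gb() {
  free_gb=$1; guests=$2; keep_daily=$3; keep_weekly=$4
  # Upper bound on archives kept at once, assuming every retention slot is filled.
  archives=$(( guests * (keep_daily + keep_weekly) ))
  echo $(( free_gb / archives ))
}

archive_budget_gb 444 8 7 4   # prints 5
```

If your compressed archives average well under that figure, these retention values fit comfortably. If one guest is much larger, lower its retention or move it to the weekly tier.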

Or work from the host shell instead of the web UI. Run one backup now to prove the path works end-to-end:

ssh root@192.168.5.121

# one immediate test backup of Home Assistant to the NVMe:
vzdump 101 --storage nvme-storage --mode snapshot --compress zstd

# then confirm the archive exists:
ls -lh /mnt/nvme/dump/
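The scheduled job from the web UI steps can also be created through the API from the same shell. Treat the following as a sketch under assumptions: it presumes a Proxmox VE release whose `/cluster/backup` endpoint accepts `schedule` and `prune-backups` (you can check with `pvesh usage /cluster/backup --verbose`), and it leaves the failure email to the UI. `run_or_print` and `create_backup_job` are illustrative helper names. By default the script only prints the command; it runs it when `APPLY=1` is set.

```shell
#!/bin/sh
# Values taken from the lesson's web UI steps.
STORAGE="nvme-storage"
SCHEDULE="02:30"
VMIDS="101,103,104,105,106,109,112,113"   # the daily-tier guests
PRUNE="keep-daily=7,keep-weekly=4"

# run_or_print CMD [ARGS...]: dry run by default, execute only when APPLY=1.
run_or_print() {
  if [ "${APPLY:-0}" = "1" ]; then
    "$@"
  else
    echo "$*"
  fi
}

# Build the backup-job creation call (parameter support is an assumption;
# verify against your Proxmox version before applying).
create_backup_job() {
  run_or_print pvesh create /cluster/backup \
    --schedule "$SCHEDULE" \
    --storage "$STORAGE" \
    --mode snapshot \
    --compress zstd \
    --vmid "$VMIDS" \
    --prune-backups "$PRUNE" \
    --enabled 1
}

create_backup_job
```

Read the printed command first, then rerun with `APPLY=1` on the Proxmox host once it matches what you would have clicked in the UI.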

Then the real test — restore it. A backup you haven't restored is a hope, not a recovery. Once a test job exists, the very next lesson is a fire drill: restore one guest to a throwaway ID and time it. That gives you a measured RTO instead of a guessed one.
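To turn the fire drill into a number, you can wrap whatever restore command you use in a small timer. This is a minimal sketch: `measure_rto` is a hypothetical helper, it reports whole seconds only, and the restore command itself is up to you (`pct restore` for containers, `qmrestore` for VMs, pointed at a throwaway ID).

```shell
# measure_rto CMD [ARGS...]
# Runs the command, then prints elapsed wall-clock seconds and its exit status.
measure_rto() {
  start=$(date +%s)
  status=0
  "$@" || status=$?
  end=$(date +%s)
  echo "elapsed_seconds=$(( end - start )) exit_status=$status"
  return "$status"
}

# Stand-in command for demonstration; replace it with your real restore command.
measure_rto sleep 1
```

A restore that reports a non-zero exit status has not given you an RTO, only a reason to fix the backup before you need it.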

Honest about 3-2-1: NVMe-on-the-same-host is copy #2 on a second medium — it survives an SSD death, but not a fire, theft, or the whole box dying. That's the "1 off-site" we still owe. You have no second machine yet, so we'll tackle off-site deliberately (an old external drive rotated off-site? a cheap object-store like Backblaze B2? Proxmox Backup Server on a second node?) in its own lesson. Naming the gap is the professional move — don't let "two copies" masquerade as "done."

I'm your teacher — ask me anything. Before you build the job: unsure whether snapshot mode will hiccup the Postgres in Terminus? Want help picking retention numbers for your 444 GB of free NVMe? Want me to generate the exact jobs.cfg entry, or wire up the restore fire-drill right now? Just ask.