Lesson 1 · Resilience
Right now, every guest on your server has an RPO of infinity. Let's fix the worst one first.
pct destroy, a botched update. Backups are the floor under every other improvement
we'll make. You can't safely harden a box you can't restore.
You've built 13 guests on one host. They are, today, completely
unrecoverable: no vzdump jobs, no Proxmox Backup Server, an empty dump
directory. This lesson turns that into a real number you choose on purpose — and gives
you your first working backup before you close the tab.
Before "how do I back up," the discipline asks "how much can I afford to lose, and how fast must I be back?" Two terms, and they run the whole conversation:
RPO (Recovery Point Objective): the most data you can afford to lose, measured in time. Back up nightly at 3am and your RPO is one day: a crash at 2am costs you everything written since 3am yesterday. RPO is set by how often you back up.
RTO (Recovery Time Objective): the most downtime you can accept. It is set by how you restore, and by whether you've ever tested it. An untested backup has an unknown RTO, which is the same as no number at all.
You back up every night at 3am. A disk dies at 2am. Roughly how much data is at risk?
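The arithmetic behind that question can be sketched in a few lines of shell. This is a minimal illustration, not part of any Proxmox tooling: `hours_at_risk` is a hypothetical helper name, and it assumes whole hours written without leading zeros.

```shell
#!/bin/sh
# Hours of data at risk when a failure hits at failure_hour and the
# daily backup runs at backup_hour (both 0-23, no leading zeros).
hours_at_risk() {
    backup_hour=$1
    failure_hour=$2
    # Wrap around midnight: a failure before today's backup reaches
    # back to yesterday's run.
    echo $(( (failure_hour - backup_hour + 24) % 24 ))
}

hours_at_risk 3 2    # backup at 03:00, disk dies at 02:00 -> prints 23
```

A nightly schedule caps the loss at 24 hours, and this case sits almost at that worst case: 23 hours of changes are gone.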
The backup world's one durable heuristic, coined by photographer Peter Krogh in 2005 and now the bedrock of enterprise DR:
3 copies of your data · on 2 different media · with 1 off-site.
Why each number earns its place: 3 copies because two can fail together; 2 media so one failure mode (a dead SSD, a bad batch) can't take both; 1 off-site because fire, theft, and ransomware don't respect the boundary of one box. The production data is copy #1 — so the rule really means "two backups, and get one of them out of the building."
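As a quick self-check, the rule can be written as three comparisons. The sketch below uses a hypothetical helper, `meets_321`, that takes the number of copies, distinct media, and off-site copies; the counts in the example are an assumption about a host with live data on one disk and a single backup on a second disk.

```shell
#!/bin/sh
# Succeeds (exit status 0) only if the counts satisfy the 3-2-1 rule.
meets_321() {
    copies=$1
    media=$2
    offsite=$3
    [ "$copies" -ge 3 ] && [ "$media" -ge 2 ] && [ "$offsite" -ge 1 ]
}

# Live data plus one local backup on another disk:
# 2 copies, 2 media, 0 off-site.
if meets_321 2 2 0; then
    echo "3-2-1 satisfied"
else
    echo "not yet 3-2-1"
fi
```

Two copies on two media is a large step up from one copy, but the check still fails until a third copy exists somewhere outside the building.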
A RAID mirror reliably protects you against which one of these?
The real skill isn't "back up everything"; it's tiering by value. Some guests hold tiny, irreplaceable state (years of Home Assistant history, your *arr configs and API keys, a Postgres DB). Others hold huge, fully re-downloadable media. Giving both the same RPO would be wasteful, and a second copy of your media won't even fit next to the original.
Tap each guest: would losing it hurt, or could you rebuild it?
See the shape? A handful of small, irreplaceable guests want a daily backup. The bulky media is a separate problem (file-level, off-host) we'll take later. That judgement — match protection to value — is exactly what capacity-conscious infra teams do.
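One way to make the tiering concrete is a small inventory that tags each guest and then selects the daily tier. The sketch below is illustrative only: guest 101 (Home Assistant) comes from this lesson, while every other ID, name, and the tier labels are hypothetical placeholders to replace with your own.

```shell
#!/bin/sh
# Guest inventory: ID, name, tier.
# Only 101 is from the lesson; the rest are hypothetical examples.
inventory() {
cat <<'EOF'
101 homeassistant irreplaceable
102 arr-configs irreplaceable
103 postgres irreplaceable
110 media-library rebuildable
EOF
}

# Print the IDs in one tier as a comma-separated list.
ids_in_tier() {
    inventory | awk -v tier="$1" '
        $3 == tier { printf "%s%s", sep, $1; sep = "," }
        END { print "" }'
}

ids_in_tier irreplaceable    # prints 101,102,103
```

The resulting list is what you would hand to the nightly job as its guest selection; the rebuildable tier stays out of it on purpose.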
Goal: a nightly vzdump job for your
high-value guests, written to the NVMe — a different physical disk than the
SSD they live on. That alone gets you from "one copy" to "two copies on two media,"
and takes your worst RPO from ∞ to 24 hours. It's non-destructive and fully reversible
(a backup job creates files; it changes nothing on the guests).
Easiest path — the web UI:
Storage: nvme-storage. Schedule: daily, e.g. 02:30. Mode: Snapshot.
Retention: keep-daily=7, keep-weekly=4 so it self-prunes.
Notification: t.kilgour@gmail.com on failure. A silent backup is a lie waiting to happen.
Or the equivalent from the host shell (run one now to prove it works end-to-end):
ssh root@192.168.5.121
# one immediate test backup of Home Assistant to the NVMe:
vzdump 101 --storage nvme-storage --mode snapshot --compress zstd
# then confirm the archive exists:
ls -lh /mnt/nvme/dump/
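The one-off run proves the path works; the recurring job can also be created from the shell. The sketch below only prints a `pvesh` command for review rather than running it. It assumes a Proxmox VE release recent enough to accept `--schedule` on `/cluster/backup` (7.1 or later, to the best of my knowledge), and `admin@example.com` is a placeholder address. Check the option names against your installed version before pasting the output on the host.

```shell
#!/bin/sh
# Build (but do not run) a pvesh command that would create the nightly job.
# Values mirror the settings in this lesson: 02:30 daily, snapshot mode,
# zstd compression, keep-daily=7 and keep-weekly=4, mail on failure only.
build_backup_job_cmd() {
    vmids=$1     # comma-separated guest IDs, e.g. "101"
    mailto=$2    # address to notify on failure (placeholder in the example)
    printf '%s\n' "pvesh create /cluster/backup --schedule 02:30 --storage nvme-storage --mode snapshot --compress zstd --vmid $vmids --prune-backups keep-daily=7,keep-weekly=4 --mailto $mailto --mailnotification failure --enabled 1"
}

# Print the command for guest 101 so it can be read before it is run.
build_backup_job_cmd 101 admin@example.com
```

Printing first is a deliberate choice: a job definition is easy to get subtly wrong, and reading the full command once costs less than discovering a bad schedule after a week of missing backups.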
Then the real test — restore it. A backup you haven't restored is a hope, not a recovery. Once a test job exists, the very next lesson is a fire drill: restore one guest to a throwaway ID and time it. That gives you a measured RTO instead of a guessed one.
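A fire drill needs two things: the newest archive for a guest, and a stopwatch around the restore. The helpers below are a sketch under stated assumptions: `latest_archive` and `timed` are hypothetical names, the glob relies on vzdump's usual `vzdump-<type>-<vmid>-<timestamp>` file naming with zstd compression, and 999 stands in for any unused guest ID.

```shell
#!/bin/sh
# Newest zstd-compressed vzdump archive for a guest (empty output if none).
# The timestamp in the file name sorts lexically, so the last entry in
# sorted order is the most recent one.
latest_archive() {
    dir=$1
    vmid=$2
    ls "$dir"/vzdump-*-"$vmid"-*.zst 2>/dev/null | sort | tail -n 1
}

# Run any command and report wall-clock seconds: a measured RTO.
timed() {
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "elapsed: $(( end - start ))s (exit $status)"
    return $status
}

# On the host, the drill might look like this (not executed here).
# pct restore applies to containers; VMs use qmrestore instead, and a
# --storage option may be needed depending on where the copy should land:
#   timed pct restore 999 "$(latest_archive /mnt/nvme/dump 101)"
```

Write the elapsed time down. That figure, not a guess, is your RTO for that guest, and it is worth re-measuring whenever the guest grows noticeably.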
jobs.cfg entry,
or wire up the restore fire-drill right now? Just ask.