Lesson 2 · Resilience

The Fire Drill

You have a backup. You do not yet have a recovery. Today we find out which.

Where we are: In Lesson 1 you built a nightly vzdump job and proved it produces a real archive. That moved your worst RPO from ∞ to 24 hours. But your RTO is still a question mark — and an RTO you've never measured is the same as no RTO at all.

A backup is a claim: "this file can become a running guest again." Until you've actually cashed that claim, it's untested — and untested backups fail at the worst possible moment (a half-written archive, a storage that won't accept the restore, a step you didn't know you'd need at 2am). This lesson cashes the claim once, on purpose, while nothing is on fire — and hands you a real number.
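One of the failure modes named above, a half-written archive, can be caught cheaply before any restore is attempted. The sketch below is a hedged illustration rather than part of the lesson's procedure: the function name archive_looks_intact is hypothetical, and it assumes the zstd command-line tool is installed on the host. Passing this check only shows the compressed stream is complete; it does not replace the restore itself.

```shell
# Sketch: cheap integrity check for a vzdump .tar.zst archive.
# Assumption: the zstd CLI is available; archive_looks_intact is a hypothetical helper name.
archive_looks_intact() {
    archive="$1"
    # A missing or zero-byte file fails immediately.
    [ -s "$archive" ] || return 1
    # zstd -t decompresses in test mode (writes nothing) and reports corruption or truncation.
    zstd -t -q "$archive"
}

# Example use on the host:
#   archive_looks_intact /path/to/archive.tar.zst && echo "stream is complete"
```

A truncated archive fails here in seconds, which is a far cheaper way to find out than a restore that dies halfway through.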

The discipline: restore, don't assume

The professional rule is blunt: you don't have backups, you have restores. The backup job is the easy half; the recovery is the half that actually saves you, and it's the half nobody practises. A fire drill (or restore test) is a deliberate, scheduled rehearsal of recovery — done calm, timed, and torn down — so that the real thing is muscle memory, not improvisation.

In the field: mature teams run DR tests on a cadence and record the measured RTO each time; "when did you last test a restore?" is an audit question and an interview question. A backup you can't prove you've restored is treated, correctly, as no backup. The 2017 GitLab outage is the canonical lesson: five backup and replication methods were in place, and none of them worked reliably when needed.

How to test without breaking anything

A restore test is only safe if it can't touch production. Three rules make it harmless and reversible:

1. Restore to a new, unused ID — never over the original. Restoring onto the live guest's ID would overwrite a working service with an older copy. Use a throwaway VMID (your next-free is 114) and destroy it after.

2. Keep it off the network until you've fixed its identity — the restored copy carries the original's static IP. Boot it as-is and two guests fight over 192.168.5.126. Change the IP (or leave it stopped) before you start it.

3. Restore to spare space — send it to nvme-storage, not the SSD local-lvm thin pool, so the drill never pressures the pool your real guests run on.

The one-way door: pct restore onto an existing ID replaces it. The CLI refuses to do this unless you pass --force, but once forced, the old guest is gone. The whole point of a fire drill is that it's a drill: pick an ID that doesn't exist yet, and the worst case is you delete a throwaway.
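The first two rules above can be checked mechanically before and after the restore. This is a minimal sketch, assuming it runs as root on the Proxmox host where pct and qm are available; the helper names vmid_is_free and net0_ip are hypothetical, not Proxmox commands. It checks both pct and qm because containers and VMs share one ID space.

```shell
# Sketch of pre-flight helpers for the rules above.
# Assumption: run as root on the Proxmox host; helper names are illustrative only.

# Rule 1: succeed only when neither a container nor a VM already owns the ID.
vmid_is_free() {
    if pct status "$1" >/dev/null 2>&1 || qm status "$1" >/dev/null 2>&1; then
        return 1
    fi
    return 0
}

# Rule 2: print the IPv4 address stored on a container's net0 line,
# so you can see whether a restored clone still carries the original's IP.
net0_ip() {
    pct config "$1" | sed -n 's/^net0:.*[,[:space:]]ip=\([^,]*\).*/\1/p'
}

# Example use:
#   vmid_is_free 114 || echo "114 is taken, pick another ID"
#   net0_ip 114      # after the restore, before pct start
```

For rule 3, pvesm status --storage nvme-storage shows how much room the target storage has before you send a restore to it.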

Your restored copy of Prowlarr is about to boot. What's the danger?

Your tangible win: a measured RTO

You already have one archive on the NVMe — the Prowlarr test backup from Lesson 1. Let's restore it to a throwaway container, confirm it really comes back, time it, and tear it down. Nothing here touches the live Prowlarr (103); the drill lives and dies as ID 114.

# on the Proxmox host
ssh root@192.168.5.121

# grab the newest prowlarr archive
ARCHIVE=$(ls -t /mnt/nvme/dump/vzdump-lxc-103-*.tar.zst | head -1)
echo "$ARCHIVE"

# THE TIMED PART — restore to a throwaway ID on spare NVMe space
time pct restore 114 "$ARCHIVE" --storage nvme-storage --unprivileged 1

# give the clone a non-conflicting identity, then bring it up
pct set 114 -net0 name=eth0,bridge=vmbr0,ip=192.168.5.151/22,gw=192.168.4.1
pct start 114

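# PROVE it's a real, running guest (not just files)
# "running" is ideal; "degraded" means it booted but a unit failed; "starting" means ask again shortly
pct exec 114 -- systemctl is-system-running
pct exec 114 -- ls /var/lib/prowlarr   # its config came back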

# tear the drill down — fully frees the space, no stale thin blocks
pct stop 114 && pct destroy 114

Write down the "real" figure that time pct restore printed (the wall-clock duration, not the user or sys lines). That, plus the minute or two to fix the IP and boot, is your measured RTO for a single small LXC. For the first time, you can say it out loud.
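Since the RTO is the restore time plus the identity fix and the boot, one option is to time the whole sequence as a unit and append the result to a log, so each drill leaves a record. The sketch below is one way to do that, not the lesson's required method: the helper names record_rto and seconds_for and the log path in RTO_LOG are all hypothetical choices.

```shell
# Sketch: time a whole recovery sequence and keep a dated record of it.
# Assumption: RTO_LOG is a hypothetical location; use any path you keep.
RTO_LOG="${RTO_LOG:-/root/rto-log.csv}"

# Append one "date,vmid,seconds" line to the log.
record_rto() {
    printf '%s,%s,%s\n' "$(date +%F)" "$1" "$2" >> "$RTO_LOG"
}

# Run a command or shell function, discard its stdout, and print
# the whole seconds it took. Returns the command's own exit status.
seconds_for() {
    start=$(date +%s)
    "$@" >/dev/null && status=0 || status=$?
    end=$(date +%s)
    echo $((end - start))
    return "$status"
}

# Example use, wrapping the drill's restore, re-address, and start steps:
#   drill() {
#       pct restore 114 "$ARCHIVE" --storage nvme-storage --unprivileged 1 &&
#       pct set 114 -net0 name=eth0,bridge=vmbr0,ip=192.168.5.151/22,gw=192.168.4.1 &&
#       pct start 114
#   }
#   record_rto 114 "$(seconds_for drill)"
```

Timing the three steps together gives a number closer to the real recovery time than the restore alone, and the log makes it easy to see whether the RTO drifts between drills.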

Stay honest about what this proves. A clean restore on the same host proves the archive is good and you know the steps — a genuine win. It does not prove recovery from a dead host, a fire, or theft. That's still the "1 off-site" you owe from Lesson 1: a true DR test would restore onto different hardware. Naming the gap is the professional move; we'll close it when an off-site target exists.

I'm your teacher — ask me anything. Want me to run this drill with you and read the RTO back? Curious whether a VM restore (HomeAssistant, no guest agent) behaves differently from an LXC? Want to turn this into a quarterly automated restore-and-verify so it never goes stale? Or should the next lesson pivot to the firewall? Just ask.

Primary source to read next: the Proxmox VE — Backup and Restore wiki (restore section), and the vzdump admin-guide chapter for pct restore options.
