Lessons Learned
Purpose: Mistakes, discoveries, and things worth remembering. Future-you will thank current-you.
After solving a problem or learning something valuable, add an entry using the template at templates/lesson-learned.md.
General Principles
Documentation:
- Document while you work, not after; you'll forget the details
- Document WHY you made a decision, not just WHAT you did
- If a recovery procedure isn't tested, it's fiction
Unraid:
- Always stop the array cleanly before shutting down
- Parity checks matter: schedule them and don't skip them
- Community Applications (CA) plugin versions can drift; pin versions in compose files
- Back up the USB boot drive; losing it means rebuilding the Unraid config from scratch
- Flash Backup plugin is worth using
Proxmox:
- VMs are easier to back up and restore than LXC containers
- Always enable QEMU guest agent
- Snapshots are not backups; they live on the same disk
- Give VMs slightly more resources than you think they need
Docker / Containers:
- Use named volumes or bind mounts for persistent data — never rely on container storage
- Pin image versions — :latest will bite you eventually
- docker logs <name> is always your first debugging step
Bash:
- Never use ! in passwords passed via docker exec ... psql -c "..." — bash treats ! as a history expansion character inside double-quoted strings, causing event not found errors
- Use passwords without ! for anything passed on the command line, or run set +H first to disable history expansion
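One way to sidestep the problem, sketched under assumptions: keep the password in a single-quoted variable and assemble the SQL with printf, so no literal ! is ever typed inside double quotes. The role name app_user, the password value, and the placeholders in the usage comment are all hypothetical.

```shell
# Sketch only: app_user and the password are hypothetical placeholders.
# Single-quoted strings are never subject to bash history expansion.
password='example!pass'

# The format string itself contains no '!', so double quotes are safe here.
sql=$(printf "ALTER USER app_user WITH PASSWORD '%s';" "$password")
printf '%s\n' "$sql"

# Usage (not run here): expanding "$sql" is safe because history expansion
# acts on the line as typed, before variables are substituted.
#   docker exec -i <name> psql -U <user> -c "$sql"
```

This does not escape single quotes inside the password, so it only suits passwords without them.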
Home Assistant Notifications:
- Always route notifications through script.notify_dan or script.notify_household — never call notify.mobile_app_* directly
- Both scripts have sticky: true baked in so notifications stay in the Android tray after tapping — this is intentional so you can read the full message without it disappearing
- On Android, sticky: true means the notification persists after tap but can still be manually swiped away
- Use trigger.to_state.name (not trigger.to_state.attributes.friendly_name) to get an entity's friendly name in automations — the attributes dict doesn't reliably contain it in modern HA
- Capture trigger context in a variables: block before calling scripts — trigger.* is available in automation action templates but not inside script execution context
Networking:
- Document your firewall rules; you will forget them
- Changing DNS settings breaks things in non-obvious ways, so change them carefully
- Test connectivity after every network change before assuming it worked
Backups:
- A backup you haven't tested restoring is not a backup
- Offsite matters: a local backup doesn't protect against fire, flood, or theft
- Retention policies matter; check you're not filling up storage silently
Log
2026-05-07: Unraid remote SMB share was mounting Lotus itself instead of Cooper
What Happened
The Appdata Backup plugin on Lotus reported a FAILED error when copying the flash backup to /mnt/remotes/COOPERDOMAIN_backups/lotus/justtheusb/. The justtheusb folder was visible on Cooper directly but not visible via the SMB mount on Lotus. All the ab_* appdata backup folders appeared fine.
The Problem
The Unassigned Devices remote share was configured with ip="COOPER.LOCALDOMAIN". On Lotus, COOPER.LOCALDOMAIN resolved to 127.0.0.1 (Lotus itself) rather than Cooper at 192.168.1.60. The mount output confirmed this — addr=0.0.0.0 and disk space shown matched Lotus, not Cooper.
Lotus was effectively mounting its own backups share and writing to itself. The ab_* appdata backup folders happened to exist on both machines (from previous successful runs), making the mount appear to be working. The justtheusb folder only exists on Cooper, which is why it was the only entry missing from the listing.
Editing the samba_mount.cfg file directly was not sufficient — the change wasn't picked up cleanly. The fix required deleting the remote share entry in Unassigned Devices completely and re-adding it using the IP address directly.
Lesson Learned
Always use IP addresses for Unassigned Devices remote SMB shares — never hostnames.
- Hostnames like COOPER.LOCALDOMAIN or COOPER.LOCAL can resolve incorrectly on Unraid, especially at boot before DNS is ready
- A stale or wrong mount can look healthy if the destination share happens to contain similar-looking folders
- Verify a remote share is actually pointing to the right machine by checking the disk space shown in Unassigned Devices; it should match the remote machine, not Lotus
- Editing samba_mount.cfg directly doesn't reliably update a live mount entry; delete and re-add in the UI instead
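If a hostname has to be used anyway, a quick resolution check before trusting the mount might look like the sketch below. It assumes getent is available on the host; the function names are illustrative and not part of Unassigned Devices.

```shell
# Classify a resolved address against the address the remote server should have.
# Prints OK, LOOPBACK (the name points back at this machine), or MISMATCH.
classify_resolution() {
  resolved=$1
  expected=$2
  case "$resolved" in
    127.*|0.0.0.0|::1) echo "LOOPBACK"; return 1 ;;
  esac
  if [ "$resolved" = "$expected" ]; then
    echo "OK"
  else
    echo "MISMATCH"
    return 1
  fi
}

# Resolve a name with the system resolver, then classify the first answer.
# Assumption: getent exists on the host; an unresolved name counts as MISMATCH.
check_remote_host() {
  addr=$(getent hosts "$1" | awk 'NR==1 { print $1 }')
  classify_resolution "${addr:-unresolved}" "$2"
}

# Example with the addresses from this incident: prints LOOPBACK.
classify_resolution 127.0.0.1 192.168.1.60 || true
```

Running check_remote_host COOPER.LOCALDOMAIN 192.168.1.60 on Lotus would have flagged the bad resolution before any backup was written.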
Documentation Updated
- Backup strategy — Appdata Backup plugin section expanded with flash backup detail and Cooper path info
2026-04-27: LuckyBackup was backing up the wrong Immich path for 9+ months
What Happened
Investigated why recent photos couldn't be found at the expected filesystem path. Discovered that Immich's upload storage path had changed at some point from /mnt/user/data/media/immich/photos/ to /mnt/user/immich/photos/. LuckyBackup was still configured with the old source path, which contained zero active files.
The Problem
- Immich's Docker container had its /photos mount changed from /mnt/user/data/media/immich/photos/ to /mnt/user/immich/photos/
- All phone uploads since ~July 2025 were going to the new path
- LuckyBackup's backupImmichPhotos task still pointed to /mnt/user/data/media/immich/, so it was backing up nothing useful
- 51,000+ photos had no backup on Cooper for 9+ months
The Solution
Updated LuckyBackup source path from /mnt/user/data/media/immich/ to /mnt/user/immich/ and destination from root@192.168.1.60:/mnt/user/data/media/immich/ to root@192.168.1.60:/mnt/user/immich/.
The old path contained an orphaned snapshot of pre-July 2025 uploads. A comparison showed 429 files in the old path missing from the new path — these were confirmed to be intentionally deleted photos and will not be copied across.
Lesson Learned
After any Docker container reconfiguration that changes volume mount paths, immediately verify that backup tools are still pointing at the correct source paths.
- Check backup source paths match actual container mounts: docker inspect <name> --format '{{json .Mounts}}'
- A backup job completing with "OK" and no errors does not mean it's backing up the right data; it only means rsync ran without errors
- Verify file counts at the destination periodically, not just backup job status
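The file-count check above can be sketched roughly as follows. The helper names are illustrative, and the comparison assumes both trees are reachable as local paths; for a remote destination the count would have to be taken over SSH instead.

```shell
# Count regular files under a directory, recursively.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Compare file counts between a backup source and its destination.
# Prints both counts and returns non-zero when they differ.
compare_counts() {
  src_count=$(count_files "$1")
  dst_count=$(count_files "$2")
  echo "source=$src_count destination=$dst_count"
  [ "$src_count" -eq "$dst_count" ]
}

# Usage (not run here; <destination> is a placeholder for a locally mounted copy):
#   compare_counts /mnt/user/immich/ <destination>
# Remote count over SSH, using the host from the entry above:
#   ssh root@192.168.1.60 'find /mnt/user/immich/ -type f | wc -l'
```

Matching counts do not prove the contents are identical, but a source count that stops growing while new photos are being uploaded is a strong hint that the job is pointed at the wrong path.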
Documentation Updated
- Lotus server doc — Immich storage layout section added with correct paths
2026-04-25: Unraid Community Apps port mapping cannot be safely edited; use a socat relay instead
What Happened
Setting up PostgreSQL streaming replication for Immich required Cooper to connect to Lotus's PostgreSQL on host port 5432. The connection was refused even though the port appeared to be listening.
The Problem
The Immich PostgreSQL container was installed via the Unraid Community Applications store as part of a pre-configured bundle. The Unraid template had a broken port mapping: host:5432 → container:5433. PostgreSQL runs on container port 5432, so the host port forwarded to nothing.
The template uses Display="always-hide" for the port field, meaning it can't be changed through the Unraid UI. Editing the XML directly and applying via the UI recreated the container from the running config rather than the updated XML. Removing and recreating the container risked breaking the CA-managed bundle.
The Solution
Add a lightweight socat relay container on the same Docker network as the database:
docker run -d \
--name immich-pg-relay \
--restart unless-stopped \
--network arrproxy \
-p 5452:5452 \
alpine/socat \
TCP-LISTEN:5452,fork,reuseaddr TCP:immich_postgreSQL:5432
This exposes the database on a new host port (5452) without touching the CA container. The replica connects to 192.168.1.80:5452.
Lesson Learned
When a Community Applications container has a broken or locked port mapping, don't fight the template — add a socat relay container on the same Docker network instead.
- socat relays are lightweight, reliable, and completely non-invasive to the existing stack
- pg_hba.conf must allow the relay's Docker network subnet (e.g. 172.18.0.0/16) as well as the external client IP, because socat forwards from within the Docker network
- The relay container needs --restart unless-stopped so it survives reboots
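As a rough illustration of the pg_hba.conf point, the sketch below prints two rules in the standard TYPE DATABASE USER ADDRESS METHOD layout. The role name replicator and the scram-sha-256 method are assumptions, not taken from the actual setup; the subnet is the example value above and the host address is Cooper's.

```shell
# Print one pg_hba.conf rule: TYPE DATABASE USER ADDRESS METHOD.
pg_hba_rule() {
  printf 'host  %s  %s  %s  %s\n' "$1" "$2" "$3" "$4"
}

# Relay traffic arrives from inside the Docker network, so allow that subnet.
# "replication" is the database keyword for streaming replication connections;
# the role name "replicator" is hypothetical.
pg_hba_rule replication replicator 172.18.0.0/16 scram-sha-256

# Also allow the replica host itself (Cooper), per the note above.
pg_hba_rule replication replicator 192.168.1.60/32 scram-sha-256
```

The printed lines would be appended to pg_hba.conf, followed by a configuration reload (for example SELECT pg_reload_conf();) so PostgreSQL picks them up.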
Documentation Updated
- Lotus server doc — Immich PostgreSQL replication section added
- Cooper server doc — Immich PostgreSQL replica section added
2026-04-25: Rsyncing SWAG appdata to Cooper caused Tailscale node ID conflict
What Happened
A second SWAG instance was set up on Cooper as a standby reverse proxy, with config files rsync'd hourly from Lotus. The initial rsync job copied the entire SWAG appdata folder, including the Tailscale state directory.
The Problem
- SWAG has Tailscale integrated and stores its own node identity (node ID, keys) inside appdata/swag/.tailscale_state/
- Copying this folder to Cooper gave both SWAG instances the same Tailscale node ID
- Tailscale treats duplicate node IDs as a conflict, and the primary Lotus instance was kicked off the Tailnet
- External access via *.djchome.uk broke because SWAG on Lotus lost Tailscale connectivity
The Solution
- Excluded the Tailscale state directory from the rsync job: /mnt/cache/appdata/swag/.tailscale_state/
- Deleted the copied .tailscale_state folder from Cooper's SWAG appdata
- Cooper's SWAG required a Tailscale auth key (TAILSCALE_AUTHKEY env var) to authenticate as a fresh node, as the browser auth flow hung without it
Lesson Learned
When rsyncing Docker appdata between two hosts, always exclude any folder that contains node identity data (Tailscale, WireGuard keys, machine IDs, etc.).
- Each SWAG instance must have its own independent Tailscale identity
- The rsync exclusion must be permanent — re-adding the Tailscale folder at any point will re-trigger the conflict
- If a containerised Tailscale node hangs on first auth after clearing state, use a pre-generated auth key via TAILSCALE_AUTHKEY env var rather than the browser flow
- Check for similar identity folders if replicating any other VPN-integrated containers
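A minimal sketch of the exclusion, assuming plain rsync is used for the hourly job. The function name is illustrative, and the destination path in the usage comment is hypothetical because Cooper's appdata path isn't recorded here.

```shell
# Copy an appdata folder but leave the Tailscale node identity behind.
# The trailing slash in the pattern restricts the match to directories.
sync_appdata() {
  rsync -a --exclude='.tailscale_state/' "$1" "$2"
}

# Usage (not run here; the destination path is a hypothetical example):
#   sync_appdata /mnt/cache/appdata/swag/ root@192.168.1.60:/mnt/cache/appdata/swag/
```

If --delete is ever added to the job, rsync leaves excluded paths on the destination alone unless --delete-excluded is also given, so Cooper's own state directory would survive.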
Documentation Updated
- Cooper server doc — SWAG section updated with correct state directory path and failover script details