Lessons Learned¶

Purpose: Mistakes, discoveries, and things worth remembering. Future-you will thank current-you.

After solving a problem or learning something valuable, add an entry using the template at templates/lesson-learned.md.

General Principles¶

Documentation:
- Document while you work, not after; you'll forget the details
- Document WHY you made a decision, not just WHAT you did
- If a recovery procedure isn't tested, it's fiction

Unraid:
- Always stop the array cleanly before shutting down
- Parity checks matter; schedule them and don't skip them
- Community Applications (CA) plugin versions can drift; pin versions in compose files
- Back up the USB boot drive; losing it means rebuilding the Unraid config from scratch
- The Flash Backup plugin is worth using

Proxmox:
- VMs are easier to back up and restore than LXC containers
- Always enable the QEMU guest agent
- Snapshots are not backups; they live on the same disk
- Give VMs slightly more resources than you think they need

Docker / Containers:
- Use named volumes or bind mounts for persistent data; never rely on container storage
- Pin image versions; :latest will bite you eventually
- docker logs <name> is always your first debugging step

Bash:
- Never use ! in passwords passed via docker exec ... psql -c "..."; in an interactive shell, bash treats ! as a history expansion character inside double-quoted strings, causing event not found errors
- Use passwords without ! for anything passed on the command line, or run set +H first to disable history expansion
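A minimal sketch of two ways to sidestep the problem, assuming a hypothetical container name, role name, and password (none of these come from the actual setup). It disables history expansion for the session and also feeds the SQL to psql over stdin, so the statement never sits inside a double-quoted command-line argument.

```shell
# Option 1: disable history expansion for the current shell session.
set +H

# Option 2: build the statement from single-quoted values, which bash never
# scans for "!". Note: a password containing a single quote would still need
# SQL escaping; this sketch does not handle that.
build_alter_role_sql() {
  local role=$1 password=$2
  printf "ALTER ROLE %s PASSWORD '%s';\n" "$role" "$password"
}

# Pipe the statement into psql over stdin instead of using psql -c "...".
# The container and role names passed in are hypothetical.
set_db_password() {
  local container=$1 role=$2 password=$3
  build_alter_role_sql "$role" "$password" |
    docker exec -i "$container" psql -U postgres
}
```

Hypothetical usage: `set_db_password my_postgres app_user 'S3cret!pass'`. The single quotes around the password are what keep the interactive shell from touching the `!`.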

Home Assistant Notifications:
- Always route notifications through script.notify_dan or script.notify_household; never call notify.mobile_app_* directly
- Both scripts have sticky: true baked in so notifications stay in the Android tray after tapping; this is intentional so you can read the full message without it disappearing
- On Android, sticky: true means the notification persists after tap but can still be manually swiped away
- Use trigger.to_state.name (not trigger.to_state.attributes.friendly_name) to get an entity's friendly name in automations; the attributes dict doesn't reliably contain it in modern HA
- Capture trigger context in a variables: block before calling scripts; trigger.* is available in automation action templates but not inside script execution context

Networking:
- Document your firewall rules; you will forget them
- Changing DNS settings breaks things in non-obvious ways; change carefully
- Test connectivity after every network change before assuming it worked

Backups:
- A backup you haven't tested restoring is not a backup
- Offsite matters; local backup doesn't protect against fire, flood, or theft
- Retention policies matter; check you're not filling up storage silently

Log¶

2026-05-07: Unraid remote SMB share was mounting Lotus itself instead of Cooper¶

What Happened¶

The Appdata Backup plugin on Lotus reported a FAILED error when copying the flash backup to /mnt/remotes/COOPERDOMAIN_backups/lotus/justtheusb/. The justtheusb folder was visible on Cooper directly but not visible via the SMB mount on Lotus. All the ab_* appdata backup folders appeared fine.

The Problem¶

The Unassigned Devices remote share was configured with ip="COOPER.LOCALDOMAIN". On Lotus, COOPER.LOCALDOMAIN resolved to 127.0.0.1 (Lotus itself) rather than Cooper at 192.168.1.60. The mount output confirmed this: it showed addr=0.0.0.0, and the disk space reported matched Lotus, not Cooper.

Lotus was effectively mounting its own backups share and writing to itself. The ab_* appdata backup folders happened to exist on both machines (from previous successful runs), making the mount appear to be working. The justtheusb folder only exists on Cooper, which is why it was the only entry missing from the listing.
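The bad name resolution described above can be checked from the shell before trusting a hostname in a share definition. This is a rough sketch that assumes `getent` with the `ahostsv4` database is available on the host (true on glibc systems, not verified on Unraid here). The function name is made up for illustration.

```shell
# Succeeds only if the hostname resolves (IPv4) to the expected address.
resolves_to() {
  local host=$1 expected=$2
  # getent prints one line per address and socket type; column 1 is the IP.
  getent ahostsv4 "$host" | awk '{print $1}' | grep -qxF "$expected"
}
```

For example, `resolves_to COOPER.LOCALDOMAIN 192.168.1.60 || echo "name points somewhere else"` would have flagged this problem on Lotus.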

Editing the samba_mount.cfg file directly was not sufficient — the change wasn't picked up cleanly. The fix required deleting the remote share entry in Unassigned Devices completely and re-adding it using the IP address directly.

Lesson Learned¶

Always use IP addresses for Unassigned Devices remote SMB shares — never hostnames.

- Hostnames like COOPER.LOCALDOMAIN or COOPER.LOCAL can resolve incorrectly on Unraid, especially at boot before DNS is ready
- A stale or wrong mount can look healthy if the destination share happens to contain similar-looking folders
- Verify a remote share is actually pointing to the right machine by checking the disk space shown in Unassigned Devices; it should match the remote machine, not Lotus
- Editing samba_mount.cfg directly doesn't reliably update a live mount entry; delete and re-add it in the UI instead
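Besides comparing disk space, the `addr=` option of the CIFS mount can be read straight from the mount table. The sketch below assumes the share is mounted at /mnt/remotes/COOPERDOMAIN_backups (inferred from the backup path above, not confirmed) and that the mount table uses the usual /proc/mounts layout. The function name is hypothetical.

```shell
# Print the addr= option of the CIFS mount at the given mount point.
# Reads mount-table lines in /proc/mounts format from stdin:
#   device mountpoint fstype options dump pass
cifs_mount_addr() {
  local mountpoint=$1
  awk -v mp="$mountpoint" '
    $2 == mp && $3 == "cifs" {
      n = split($4, opts, ",")
      for (i = 1; i <= n; i++)
        if (opts[i] ~ /^addr=/) print substr(opts[i], 6)
    }'
}
```

Run as `cifs_mount_addr /mnt/remotes/COOPERDOMAIN_backups < /proc/mounts`. Anything other than 192.168.1.60 means the share is not pointing at Cooper.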

Documentation Updated¶

Backup strategy — Appdata Backup plugin section expanded with flash backup detail and Cooper path info

2026-04-27: LuckyBackup was backing up the wrong Immich path for 9+ months¶

What Happened¶

Investigated why recent photos couldn't be found at the expected filesystem path. Discovered that Immich's upload storage path had changed at some point from /mnt/user/data/media/immich/photos/ to /mnt/user/immich/photos/. LuckyBackup was still configured with the old source path, which contained zero active files.

The Problem¶

- Immich's Docker container had its /photos mount changed from /mnt/user/data/media/immich/photos/ to /mnt/user/immich/photos/
- All phone uploads since ~July 2025 were going to the new path
- LuckyBackup's backupImmichPhotos task still pointed to /mnt/user/data/media/immich/, so it was backing up nothing useful
- 51,000+ photos had no backup on Cooper for 9+ months

The Solution¶

Updated LuckyBackup source path from /mnt/user/data/media/immich/ to /mnt/user/immich/ and destination from root@192.168.1.60:/mnt/user/data/media/immich/ to root@192.168.1.60:/mnt/user/immich/.

The old path contained an orphaned snapshot of pre-July 2025 uploads. A comparison showed 429 files in the old path missing from the new path — these were confirmed to be intentionally deleted photos and will not be copied across.
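One way such a comparison could be done is sketched below. It is not necessarily the method used here: it compares relative file paths only, so a renamed or moved file would count as missing. The function name is hypothetical.

```shell
# List files that exist under old_dir but not under new_dir,
# compared by relative path only (file contents are not checked).
missing_in_new() {
  local old_dir=$1 new_dir=$2 old_list new_list
  old_list=$(mktemp)
  new_list=$(mktemp)
  # comm needs both inputs sorted with the same collation, hence LC_ALL=C.
  (cd "$old_dir" && find . -type f | LC_ALL=C sort) > "$old_list"
  (cd "$new_dir" && find . -type f | LC_ALL=C sort) > "$new_list"
  # -23 keeps only lines unique to the first file.
  LC_ALL=C comm -23 "$old_list" "$new_list"
  rm -f "$old_list" "$new_list"
}
```

For example, `missing_in_new /mnt/user/data/media/immich/photos /mnt/user/immich/photos | wc -l` prints how many files exist only in the old path.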

Lesson Learned¶

After any Docker container reconfiguration that changes volume mount paths, immediately verify that backup tools are still pointing at the correct source paths.

- Check backup source paths match actual container mounts: docker inspect <name> --format '{{json .Mounts}}'
- A backup job completing with "OK" and no errors does not mean it's backing up the right data; it just means rsync ran without errors
- Verify file counts at the destination periodically, not just backup job status
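A rough sketch of the file-count check from the last point, assuming both trees are reachable as local paths (for example, the destination mounted or checked on Cooper itself). The function names are made up. An empty source is treated as a failure, because that is how this incident would have shown up.

```shell
# Count regular files under a directory.
count_files() {
  find "$1" -type f | wc -l | tr -d ' '
}

# Succeeds only when the source is non-empty and the destination
# holds at least as many files as the source.
backup_count_ok() {
  local src_count dest_count
  src_count=$(count_files "$1")
  dest_count=$(count_files "$2")
  printf 'source=%s destination=%s\n' "$src_count" "$dest_count"
  [ "$src_count" -gt 0 ] && [ "$dest_count" -ge "$src_count" ]
}
```

To count on the remote side instead, something like `ssh root@192.168.1.60 'find /mnt/user/immich -type f | wc -l'` gives the number to compare against `count_files /mnt/user/immich` on Lotus.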

Documentation Updated¶

Lotus server doc — Immich storage layout section added with correct paths

2026-04-25: Unraid Community Apps port mapping cannot be safely edited — use a socat relay instead¶

What Happened¶

Setting up PostgreSQL streaming replication for Immich required Cooper to connect to Lotus's PostgreSQL on host port 5432. The connection was refused even though the port appeared to be listening.

The Problem¶

The Immich PostgreSQL container was installed via the Unraid Community Applications store as part of a pre-configured bundle. The Unraid template had a broken port mapping: host:5432 → container:5433. PostgreSQL runs on container port 5432, so the host port forwarded to nothing.

The template uses Display="always-hide" for the port field, meaning it can't be changed through the Unraid UI. Editing the XML directly and applying via the UI recreated the container from the running config rather than the updated XML. Removing and recreating the container risked breaking the CA-managed bundle.

The Solution¶

Add a lightweight socat relay container on the same Docker network as the database:

```
docker run -d \
  --name immich-pg-relay \
  --restart unless-stopped \
  --network arrproxy \
  -p 5452:5452 \
  alpine/socat \
  TCP-LISTEN:5452,fork,reuseaddr TCP:immich_postgreSQL:5432
```

This exposes the database on a new host port (5452) without touching the CA container. The replica connects to 192.168.1.80:5452.
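A quick way to confirm the relay is reachable from the replica side is sketched below. It assumes `bash` and coreutils `timeout` are available, and the function name is hypothetical. It only proves that something accepts TCP connections on the port. socat accepts the connection before it dials the database, so an open port does not prove PostgreSQL is answering.

```shell
# Succeeds if a TCP connection to host:port can be opened within 3 seconds.
port_open() {
  local host=$1 port=$2
  # bash's /dev/tcp pseudo-device opens a TCP socket on redirection.
  timeout 3 bash -c 'exec 3<>"/dev/tcp/$1/$2"' _ "$host" "$port" 2>/dev/null
}
```

From Cooper, `port_open 192.168.1.80 5452 && echo "relay reachable"` checks the relay. For an end-to-end check, `pg_isready -h 192.168.1.80 -p 5452` reports whether PostgreSQL itself is accepting connections through it.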

Lesson Learned¶

When a Community Applications container has a broken or locked port mapping, don't fight the template — add a socat relay container on the same Docker network instead.

- socat relays are lightweight, reliable, and completely non-invasive to the existing stack
- pg_hba.conf must allow the relay's Docker network subnet (e.g. 172.18.0.0/16) as well as the external client IP, because socat forwards from within the Docker network
- The relay container needs --restart unless-stopped so it survives reboots
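To make the pg_hba.conf point concrete, here is a small sketch that prints entries in the standard field order (type, database, user, address, method). The role name `replicator` and the scram-sha-256 auth method are assumptions, not values taken from this setup.

```shell
# Print one pg_hba.conf line that permits streaming replication from a CIDR range.
# The auth method defaults to scram-sha-256 (an assumption; match your server).
hba_replication_line() {
  local user=$1 cidr=$2 method=${3:-scram-sha-256}
  printf 'host  replication  %s  %s  %s\n' "$user" "$cidr" "$method"
}
```

The relay network's subnet can be looked up with `docker network inspect arrproxy --format '{{range .IPAM.Config}}{{.Subnet}}{{end}}'`. Then `hba_replication_line replicator 172.18.0.0/16` and `hba_replication_line replicator 192.168.1.60/32` produce the two lines to append, after which PostgreSQL needs a config reload.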

Documentation Updated¶

Lotus server doc — Immich PostgreSQL replication section added
Cooper server doc — Immich PostgreSQL replica section added

2026-04-25: Rsyncing SWAG appdata to Cooper caused Tailscale node ID conflict¶

What Happened¶

A second SWAG instance was set up on Cooper as a standby reverse proxy, with config files rsync'd hourly from Lotus. The initial rsync job copied the entire SWAG appdata folder, including the Tailscale state directory.

The Problem¶

- SWAG has Tailscale integrated and stores its own node identity (node ID, keys) inside appdata/swag/.tailscale_state/
- Copying this folder to Cooper gave both SWAG instances the same Tailscale node ID
- Tailscale treats duplicate node IDs as a conflict, so the primary Lotus instance was kicked off the Tailnet
- External access via *.djchome.uk broke because SWAG on Lotus lost Tailscale connectivity

The Solution¶

- Excluded the Tailscale state directory from the rsync job:
```
/mnt/cache/appdata/swag/.tailscale_state/
```
- Deleted the copied .tailscale_state folder from Cooper's SWAG appdata
- Cooper's SWAG required a Tailscale auth key (TAILSCALE_AUTHKEY env var) to authenticate as a fresh node, as the browser auth flow hung without it
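The exclusion above might look like this as an rsync invocation. This is a sketch only: the destination path on Cooper and the `-a` flag are assumptions, since the real job's options are not recorded here. The leading slash anchors the pattern to the top of the transferred tree, so only the top-level state directory is skipped.

```shell
SWAG_SRC=/mnt/cache/appdata/swag/
SWAG_DEST=root@192.168.1.60:/mnt/cache/appdata/swag/   # assumed destination path

# Sync SWAG appdata but never the Tailscale node identity.
# Extra rsync options (e.g. --dry-run -v) can be passed as arguments.
sync_swag() {
  rsync -a --exclude='/.tailscale_state/' "$@" "$SWAG_SRC" "$SWAG_DEST"
}
```

Running `sync_swag --dry-run -v` first lists what would be copied, which makes it easy to confirm that nothing under .tailscale_state appears.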

Lesson Learned¶

When rsyncing Docker appdata between two hosts, always exclude any folder that contains node identity data (Tailscale, WireGuard keys, machine IDs, etc.).

- Each SWAG instance must have its own independent Tailscale identity
- The rsync exclusion must be permanent; re-adding the Tailscale folder at any point will re-trigger the conflict
- If a containerised Tailscale node hangs on first auth after clearing state, use a pre-generated auth key via TAILSCALE_AUTHKEY env var rather than the browser flow
- Check for similar identity folders if replicating any other VPN-integrated containers

Documentation Updated¶

Cooper server doc — SWAG section updated with correct state directory path and failover script details
