# AI Lab Maintenance Toolkit

## Executive summary

The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an Ubuntu AI infrastructure host named `ailab`. It combines repeatable health reporting, disk monitoring, conservative package cleanup, Docker hygiene, configuration backup, and non-destructive VM inventory into a small toolkit that is readable enough for review and guarded enough for homelab use.

This is a portfolio and lab implementation, not evidence of production certification. Review package policy, backup coverage, maintenance windows, and application impact before deploying it to another host.

## Problem solved

AI lab hosts accumulate operating system packages, kernel packages, container images, build cache, journals, and configuration changes while also carrying stateful workloads. Manual maintenance is easy to defer and risky to perform without evidence. This project provides scheduled, logged tasks with explicit safety boundaries and separate read-only audit commands.

## What this demonstrates

- Bash strict mode, input validation, dependency checks, and operational exit codes.
- Dry-run-first maintenance with explicit authorization for changes.
- systemd oneshot services and persistent calendar timers.
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
- Docker cleanup that preserves volumes.
- Configuration-focused backups with bounded retention.
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
- Idempotent installation and guarded JSON configuration updates.

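
As a rough illustration of the first bullet, a script preamble might combine strict mode, a dependency check, and the toolkit's exit-code convention along these lines. The helper names `missing_deps` and `check_deps` and the `EXIT_*` variables are hypothetical and are not taken from the toolkit's scripts.

```bash
#!/usr/bin/env bash
# Sketch only: helper names are illustrative, not the toolkit's actual code.
set -euo pipefail

# Exit-code convention used throughout the repository.
EXIT_OK=0       # success, or an optional component was absent
EXIT_FAILED=1   # an operational check or maintenance action failed
EXIT_USAGE=2    # invalid input, missing dependency, or insufficient privilege

# Print each named command that cannot be found on PATH, one per line.
missing_deps() {
    local cmd
    for cmd in "$@"; do
        command -v "$cmd" >/dev/null 2>&1 || printf '%s\n' "$cmd"
    done
}

# Return EXIT_USAGE if any required command is missing, EXIT_OK otherwise.
check_deps() {
    local missing
    missing="$(missing_deps "$@")"
    if [[ -n "$missing" ]]; then
        printf 'ERROR: missing required command(s): %s\n' "${missing//$'\n'/ }" >&2
        return "$EXIT_USAGE"
    fi
    return "$EXIT_OK"
}
```

A script would call something like `check_deps df awk || exit $?` before doing any work, so that a missing tool is reported as exit code `2` rather than surfacing later as an unrelated failure.
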
## Architecture and directory layout

```text
ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│   ├── ailab-healthcheck.sh
│   ├── ailab-disk-watch.sh
│   ├── ailab-apt-cleanup.sh
│   ├── ailab-kernel-cleanup.sh
│   ├── ailab-docker-cleanup.sh
│   ├── ailab-config-backup.sh
│   └── ailab-vm-audit.sh
└── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer
```

The installer deploys scripts to `/usr/local/sbin` and units to `/etc/systemd/system`. Scripts run directly as root from systemd rather than through an additional framework.

## Maintenance tasks

| Command | Purpose | Change behavior |
| --- | --- | --- |
| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |

## Safety model

Change-capable scripts default to dry-run behavior. Manual execution requires `--execute` and an interactive `EXECUTE` confirmation. The systemd services use `--execute --non-interactive`; installing and enabling those reviewed unit files is the explicit authorization for scheduled maintenance.

Exit codes follow the repository convention:

- `0`: completed successfully or an optional component was absent.
- `1`: an operational check or maintenance action failed.
- `2`: invalid input, missing required dependency, or insufficient privilege.

The scripts do not bypass APT or Docker locks, delete VM resources, manually select kernel names for removal, or hide command failures.

## Installation

Review every script and unit first. Installation changes package state, journald settings, Docker daemon settings when Docker exists, and enabled timer state.

```bash
cd labs/linux/ailab-maintenance
sudo ./install.sh
```

The installer:

1. Installs the documented Ubuntu utilities.
2. Deploys scripts and systemd units with fixed permissions.
3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
4. Restarts `systemd-journald`.
5. Validates and backs up an existing Docker `daemon.json`, merges log limits with `jq`, and attempts a Docker restart.
6. Enables all five timers.
7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.

The installer is intended for Ubuntu 26.04. It is not run automatically by repository validation.

## Manual commands

Read-only reports:

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh
```

Preview maintenance:

```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh
```

Apply reviewed maintenance interactively:

```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute
```

`--non-interactive` is reserved for reviewed automation and is rejected unless `--execute` is also present.

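
The flag rules above could be enforced with a small parser like the following sketch. `parse_mode` and `confirm_execute` are hypothetical helper names, and the mode labels they print are assumptions chosen for illustration; the toolkit's scripts may structure this differently.

```bash
#!/usr/bin/env bash
# Sketch only: parse_mode and confirm_execute are hypothetical helpers.
set -euo pipefail

# Print the resolved mode: dry-run, interactive, or automated.
# Returns 2 for unknown arguments or for --non-interactive without --execute.
parse_mode() {
    local execute=0 non_interactive=0 arg
    for arg in "$@"; do
        case "$arg" in
            --execute)         execute=1 ;;
            --non-interactive) non_interactive=1 ;;
            *) printf 'ERROR: unknown argument: %s\n' "$arg" >&2; return 2 ;;
        esac
    done
    if (( non_interactive && ! execute )); then
        printf 'ERROR: --non-interactive requires --execute\n' >&2
        return 2
    fi
    if (( ! execute )); then
        echo dry-run
    elif (( non_interactive )); then
        echo automated
    else
        echo interactive
    fi
}

# Read one line from stdin and succeed only if it is exactly EXECUTE.
confirm_execute() {
    local reply=""
    printf 'Type EXECUTE to apply changes: ' >&2
    read -r reply || true
    [[ "$reply" == "EXECUTE" ]]
}
```

In this arrangement a script would call `confirm_execute` only when `parse_mode` reports `interactive`, so a systemd service passing `--execute --non-interactive` never blocks on a prompt, and a bare invocation never changes anything.
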
## Systemd timers

| Timer | Schedule |
| --- | --- |
| `ailab-config-backup.timer` | Daily at 03:30 |
| `ailab-disk-watch.timer` | Hourly |
| `ailab-apt-cleanup.timer` | Sunday at 04:00 |
| `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
| `ailab-docker-cleanup.timer` | Sunday at 04:40 |

All timers use `Persistent=true`, so a missed event runs after the host becomes available.

Inspect timer and service evidence with:

```bash
systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
```

## Logs

Scheduled and manual maintenance writes to:

```text
/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log
```

systemd also records service output in the journal. Logrotate is installed as a dependency, but this lab does not create a custom rotation policy for these small maintenance logs.

## Docker policy

Docker cleanup runs `docker system prune -af` and removes build cache older than seven days. It never passes `--volumes`. Named and anonymous volumes remain outside this automated policy and require application-aware review.

The installer configures the `json-file` driver with a maximum size of `50m` and five files. Existing valid JSON is backed up and merged. Invalid JSON causes installation to stop rather than overwrite operator configuration.

## Kernel policy

Kernel removal is delegated to `apt autoremove --purge`; package names are not constructed or purged with regular expressions. Before execution, the script logs the APT simulation and refuses cleanup unless at least two installed versioned kernel image packages remain after simulated removals. This protects a fallback kernel while preserving Ubuntu dependency policy.

Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings, Secure Boot state, and the simulated removal set before manual execution.

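
The fallback-kernel guard described above can be sketched as a comparison between the installed kernel image count and the number of kernel images in the APT simulation. The function names and the package-name pattern below are assumptions for illustration, and the pattern is used only to count packages for the guard, not to select anything for removal.

```bash
#!/usr/bin/env bash
# Sketch only: function names and the pattern are illustrative assumptions.
set -euo pipefail

# Matches versioned kernel image packages such as linux-image-6.8.0-45-generic.
# Meta-packages such as linux-image-generic do not match.
KERNEL_IMAGE_RE='^linux-image-(unsigned-)?[0-9]'

# stdin: one installed package name per line.
# Prints the number of versioned kernel image packages.
count_installed_kernels() {
    grep -Ec "$KERNEL_IMAGE_RE" || true   # grep -c exits 1 when the count is 0
}

# stdin: output of an APT simulation (lines beginning with Remv or Purg).
# Prints the number of versioned kernel images APT would remove.
count_simulated_kernel_removals() {
    awk -v re="$KERNEL_IMAGE_RE" \
        '($1 == "Remv" || $1 == "Purg") && $2 ~ re { n++ } END { print n + 0 }'
}

# Succeeds only if at least two versioned kernel images would remain.
kernel_cleanup_is_safe() {
    local installed="$1" removals="$2" min_remaining=2
    (( installed - removals >= min_remaining ))
}

# Possible wiring on a real host (not executed here):
#   installed="$(dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' 'linux-image-*' \
#       | awk '$1 == "ii" { print $2 }' | count_installed_kernels)"
#   removals="$(apt-get -s autoremove --purge | count_simulated_kernel_removals)"
#   kernel_cleanup_is_safe "$installed" "$removals" || exit 1
```

Keeping the counting functions free of direct `dpkg-query` and `apt-get` calls means the guard can be exercised with canned input before it is trusted on a host with DKMS or VFIO dependencies.
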
## Backup policy

Backups are written to `/srv/backups/ailab-config` as `ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are deleted only after a new archive is created.

The backup covers `/etc`, selected root shell configuration, `/opt/ailab-maintenance` when present, and libvirt configuration under `/var/lib/libvirt/qemu`. It does not include `/var/lib/docker`, WebODM data, Ollama models, VM disk images, or other large application datasets. Because `/etc` is included, explicitly listed configuration subdirectories are already covered even when optional-path reporting mentions them separately.

This is a local configuration backup, not a disaster-recovery design. A real deployment should copy archives to independently protected storage and test restoration.

## Journald policy

The installer applies:

```ini
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
```

These settings bound journal growth while retaining useful troubleshooting evidence. Capacity and retention should be adjusted to the host's disk size and incident-response requirements.

## Disk watch policy

The disk check uses `df -P`, defaults to an 85 percent threshold, and returns `1` when any checked filesystem meets or exceeds the threshold. Override the threshold for a manual or unit invocation with:

```bash
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
```

The script reports every filesystem as `OK` or `WARNING`; it does not delete data or attempt remediation.

## Example operational workflows

### Weekly maintenance review

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-
```

Review the kernel simulation, Docker usage, failed units, backup freshness, and disk warnings before approving manual changes.

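
The disk warnings reviewed in this workflow come from the threshold logic described under Disk watch policy. A minimal sketch of that logic follows, assuming every row of `df -P` output is checked and that mount points contain no spaces. `check_usage` is a hypothetical function name, the output line format is an assumption, and the real script may filter filesystems or format its report differently.

```bash
#!/usr/bin/env bash
# Sketch only: check_usage is a hypothetical helper, not the toolkit's code.
set -euo pipefail

# stdin: output of `df -P`.
# $1: threshold percent; falls back to AILAB_DISK_THRESHOLD, then 85.
# Prints OK or WARNING per filesystem. Returns 1 if any filesystem meets or
# exceeds the threshold, 2 if the threshold is not an integer from 1 to 100.
check_usage() {
    local threshold="${1:-${AILAB_DISK_THRESHOLD:-85}}"
    if ! [[ "$threshold" =~ ^[0-9]+$ ]] || (( 10#$threshold < 1 || 10#$threshold > 100 )); then
        printf 'ERROR: invalid threshold: %s\n' "$threshold" >&2
        return 2
    fi
    awk -v limit="$threshold" '
        NR == 1 { next }                     # skip the df header row
        {
            pct = $5; sub(/%$/, "", pct)     # Capacity column in POSIX format
            status = (pct + 0 >= limit + 0) ? "WARNING" : "OK"
            if (status == "WARNING") bad = 1
            printf "%s %s%% %s\n", status, pct, $6
        }
        END { exit bad ? 1 : 0 }
    '
}
```

On a host this would be invoked as `df -P | check_usage`. Because the function only reads and reports, a non-zero exit marks the systemd service as failed without touching any data.
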
### Disk pressure investigation

```bash
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh
```

Use the evidence to identify ownership. Do not treat Docker pruning or file deletion as a substitute for application-specific retention policy.

### Post-maintenance evidence

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh \
  | sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'
```

## Interview talking points

- Why timer units explicitly carry the non-interactive execution boundary.
- Why APT dependency policy is safer than regex-based kernel deletion.
- How Docker volume preservation separates platform hygiene from application data lifecycle decisions.
- How optional dependency handling keeps one health command useful across container, GPU, and virtualization host variants.
- Why configuration backup and application-data backup are separate concerns.
- How exit codes, persistent timers, logs, and post-checks support operations.

## Future improvements

- Add a dedicated logrotate policy after measuring log growth.
- Export disk-watch status to a monitoring system instead of relying only on timer failure state.
- Add automated archive integrity checks and off-host replication.
- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl` commands.
- Add package-lock detection with bounded retry policy if recurring contention is observed.
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a dedicated read-only audit.
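
The bounded retry item in the list above could start from a generic wrapper such as the sketch below. `retry_bounded` and its parameters are hypothetical; the toolkit does not currently include this, and how a held package lock is detected is left to the wrapped command.

```bash
#!/usr/bin/env bash
# Sketch only: retry_bounded and its parameters are hypothetical.
set -euo pipefail

# Usage: retry_bounded MAX_ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Runs COMMAND until it succeeds or MAX_ATTEMPTS is reached.
# Returns 0 on success, otherwise the exit status of the final attempt.
retry_bounded() {
    local max="$1" delay="$2" attempt=1 rc=0
    shift 2
    while true; do
        rc=0
        "$@" || rc=$?
        (( rc == 0 )) && return 0
        if (( attempt >= max )); then
            printf 'giving up after %d attempt(s), last status %d\n' "$attempt" "$rc" >&2
            return "$rc"
        fi
        printf 'attempt %d/%d failed (status %d), retrying in %ss\n' \
            "$attempt" "$max" "$rc" "$delay" >&2
        sleep "$delay"
        attempt=$(( attempt + 1 ))
    done
}
```

A fixed attempt limit keeps the behavior consistent with the safety model: the wrapper waits for contention to clear but never bypasses a lock, and a persistent failure still surfaces as a non-zero exit. Recent APT releases also offer a `DPkg::Lock::Timeout` option that may make a wrapper unnecessary for APT itself; check `apt.conf(5)` on the target release before relying on it.
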