AI Lab Maintenance Toolkit
Executive summary
The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
Ubuntu AI infrastructure host named ailab. It combines repeatable health
reporting, disk monitoring, conservative package cleanup, Docker hygiene,
configuration backup, and non-destructive VM inventory into a small toolkit
that is readable enough for review and guarded enough for homelab use.
This is a portfolio and lab implementation, not evidence of production certification. Review package policy, backup coverage, maintenance windows, and application impact before deploying it to another host.
Problem solved
AI lab hosts accumulate operating system packages, kernel packages, container images, build cache, journals, and configuration changes while also carrying stateful workloads. Manual maintenance is easy to defer and risky to perform without evidence. This project provides scheduled, logged tasks with explicit safety boundaries and separate read-only audit commands.
What this demonstrates
- Bash strict mode, input validation, dependency checks, and operational exit codes.
- Dry-run-first maintenance with explicit authorization for changes.
- systemd oneshot services and persistent calendar timers.
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
- Docker cleanup that preserves volumes.
- Configuration-focused backups with bounded retention.
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
- Idempotent installation and guarded JSON configuration updates.
Architecture and directory layout
ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│   ├── ailab-healthcheck.sh
│   ├── ailab-disk-watch.sh
│   ├── ailab-apt-cleanup.sh
│   ├── ailab-kernel-cleanup.sh
│   ├── ailab-docker-cleanup.sh
│   ├── ailab-config-backup.sh
│   └── ailab-vm-audit.sh
└── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer
The installer deploys scripts to /usr/local/sbin and units to
/etc/systemd/system. Scripts run directly as root from systemd rather than
through an additional framework.
Maintenance tasks
| Command | Purpose | Change behavior |
|---|---|---|
| ailab-healthcheck.sh | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
| ailab-disk-watch.sh | Filesystem threshold check | Read-only |
| ailab-apt-cleanup.sh | APT metadata refresh and unused package cleanup | Dry-run by default |
| ailab-kernel-cleanup.sh | APT-managed kernel package cleanup | Dry-run by default |
| ailab-docker-cleanup.sh | Unused Docker object and build-cache cleanup | Dry-run by default |
| ailab-config-backup.sh | Configuration archive and retention | Dry-run by default |
| ailab-vm-audit.sh | VM, pool, volume, and image-file inventory | Read-only |
Safety model
Change-capable scripts default to dry-run behavior. Manual execution requires
--execute and an interactive EXECUTE confirmation. The systemd services
use --execute --non-interactive; installing and enabling those reviewed unit
files is the explicit authorization for scheduled maintenance.
Exit codes follow the repository convention:
- 0: completed successfully or an optional component was absent.
- 1: an operational check or maintenance action failed.
- 2: invalid input, missing required dependency, or insufficient privilege.
The scripts do not bypass APT or Docker locks, delete VM resources, manually select kernel names for removal, or hide command failures.
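The confirmation gate and exit-code convention described above might look roughly like the following. This is a minimal sketch, not the toolkit's actual code: the function names confirm_execute and require_command and the EXIT_* variable names are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical names throughout; the real scripts may be organized differently.
EXIT_OK=0        # completed, or an optional component was absent
EXIT_FAILED=1    # an operational check or maintenance action failed
EXIT_USAGE=2     # invalid input, missing dependency, or insufficient privilege

# Read one line from stdin; only the exact word EXECUTE authorizes changes.
confirm_execute() {
  local reply=""
  printf 'Type EXECUTE to apply changes: ' >&2
  IFS= read -r reply || true
  [[ "$reply" == "EXECUTE" ]]
}

# A missing required tool maps to exit code 2 rather than a generic failure.
require_command() {
  if command -v "$1" >/dev/null 2>&1; then
    return "$EXIT_OK"
  fi
  printf 'missing required dependency: %s\n' "$1" >&2
  return "$EXIT_USAGE"
}
```

A change-capable script could call require_command for each tool it needs before doing any work, and call confirm_execute only when --execute is given without --non-interactive.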
Installation
Review every script and unit first. Installation changes package state, journald settings, Docker daemon settings when Docker exists, and enabled timer state.
cd labs/linux/ailab-maintenance
sudo ./install.sh
The installer:
- Installs the documented Ubuntu utilities.
- Deploys scripts and systemd units with fixed permissions.
- Writes /etc/systemd/journald.conf.d/ailab-limits.conf.
- Restarts systemd-journald.
- Validates and backs up an existing Docker daemon.json, merges log limits with jq, and attempts a Docker restart.
- Enables all five timers.
- Writes an initial report to /root/ailab-healthcheck-now.txt.
The installer is intended for Ubuntu 26.04. It is not run automatically by repository validation.
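One way to make the "fixed permissions" deployment step idempotent is to compare content and mode before copying. The sketch below assumes GNU coreutils (install, stat, cmp). The function name deploy_file is hypothetical and is not taken from install.sh.

```bash
#!/usr/bin/env bash
# Hypothetical helper: copy src to dest with a fixed mode, skipping the copy
# when dest already has identical content and the same permission bits.
#   $1: source file   $2: destination path   $3: octal mode without a leading 0, e.g. 755
deploy_file() {
  local src=$1 dest=$2 mode=$3
  if [[ -f "$dest" ]] && cmp -s "$src" "$dest" \
      && [[ "$(stat -c '%a' "$dest")" == "$mode" ]]; then
    printf 'unchanged: %s\n' "$dest"
    return 0
  fi
  # install -D creates missing parent directories before copying.
  install -D -m "$mode" "$src" "$dest" && printf 'installed: %s\n' "$dest"
}
```

Run as root, a call such as `deploy_file scripts/ailab-disk-watch.sh /usr/local/sbin/ailab-disk-watch.sh 755` would copy the script on the first run and report it unchanged on later runs.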
Manual commands
Read-only reports:
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh
Preview maintenance:
sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh
Apply reviewed maintenance interactively:
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute
--non-interactive is reserved for reviewed automation and is rejected unless
--execute is also present.
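The flag rule above can be expressed as a small parser. This is a sketch under assumptions: the function name parse_mode and the three mode strings it prints are hypothetical, and the real scripts may accept further options.

```bash
#!/usr/bin/env bash
# Hypothetical parser: prints the resolved mode, or returns 2 on invalid input.
parse_mode() {
  local execute=0 non_interactive=0 arg
  for arg in "$@"; do
    case "$arg" in
      --execute)         execute=1 ;;
      --non-interactive) non_interactive=1 ;;
      *)
        printf 'unknown option: %s\n' "$arg" >&2
        return 2
        ;;
    esac
  done

  # --non-interactive alone is rejected: it never implies permission to change.
  if [[ $non_interactive -eq 1 && $execute -eq 0 ]]; then
    printf '%s\n' '--non-interactive requires --execute' >&2
    return 2
  fi

  if [[ $execute -eq 0 ]]; then
    echo "dry-run"
  elif [[ $non_interactive -eq 1 ]]; then
    echo "execute-unattended"
  else
    echo "execute-interactive"
  fi
}
```

With no arguments the result is dry-run, which matches the preview commands listed earlier. Only the combination used by the systemd services skips the interactive confirmation.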
Systemd timers
| Timer | Schedule |
|---|---|
| ailab-config-backup.timer | Daily at 03:30 |
| ailab-disk-watch.timer | Hourly |
| ailab-apt-cleanup.timer | Sunday at 04:00 |
| ailab-kernel-cleanup.timer | Sunday at 04:20 |
| ailab-docker-cleanup.timer | Sunday at 04:40 |
All timers use Persistent=true, so a run that was missed while the host was
powered off is triggered once the host is up again. Inspect timer and service
evidence with:
systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
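The unit files themselves are not reproduced in this README. As a rough illustration of how a schedule from the table combines with Persistent=true, the sketch below prints a timer unit to stdout. The function name render_timer_unit and the description text are hypothetical, and the shipped units may contain additional directives.

```bash
#!/usr/bin/env bash
# Hypothetical generator for a calendar timer; output is a systemd unit file.
#   $1: human-readable description   $2: OnCalendar expression
render_timer_unit() {
  local description=$1 on_calendar=$2
  cat <<EOF
[Unit]
Description=${description}

[Timer]
OnCalendar=${on_calendar}
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# Example: the Sunday 04:20 kernel cleanup schedule from the table above.
render_timer_unit 'AI lab kernel cleanup (weekly)' 'Sun *-*-* 04:20:00'
```

A timer with no Unit= setting activates the service that shares its name, so ailab-kernel-cleanup.timer would start ailab-kernel-cleanup.service. A calendar expression can be checked before installation with systemd-analyze calendar.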
Logs
Scheduled and manual maintenance writes to:
/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log
systemd also records service output in the journal. Logrotate is installed as a dependency, but this lab does not create a custom rotation policy for these small maintenance logs.
Docker policy
Docker cleanup runs docker system prune -af and removes build cache older
than seven days. It never passes --volumes. Named and anonymous volumes
remain outside this automated policy and require application-aware review.
The installer configures the json-file driver with a maximum size of 50m
and five files. Existing valid JSON is backed up and merged. Invalid JSON
causes installation to stop rather than overwrite operator configuration.
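The back-up-and-merge behavior for daemon.json could be sketched as follows. This assumes jq is installed and uses Docker's documented log-driver and log-opts keys. The function name merge_docker_log_limits and the .bak suffix are hypothetical, and the installer's actual backup naming is not shown in this README.

```bash
#!/usr/bin/env bash
# Hypothetical helper: merge json-file log limits into an existing daemon.json.
# Existing keys are kept; jq's `*` operator merges objects recursively.
merge_docker_log_limits() {
  local config=$1 tmp
  [[ -f "$config" ]] || { printf 'no such file: %s\n' "$config" >&2; return 2; }
  tmp=$(mktemp "${config}.XXXXXX") || return 1

  # jq -e fails on invalid JSON, non-object input, or empty input, so a broken
  # file is never overwritten.
  if ! jq -e '. * {"log-driver": "json-file",
                   "log-opts": {"max-size": "50m", "max-file": "5"}}' \
        "$config" > "$tmp"; then
    rm -f "$tmp"
    printf 'invalid JSON, leaving %s untouched\n' "$config" >&2
    return 1
  fi

  # Back up first, then rewrite in place so the file keeps its owner and mode.
  cp -p "$config" "${config}.bak" || { rm -f "$tmp"; return 1; }
  cat "$tmp" > "$config" && rm -f "$tmp"
}
```

Docker reads daemon.json only at startup, which is why the installer attempts a restart after the merge. New log limits apply to containers created afterwards.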
Kernel policy
Kernel removal is delegated to apt autoremove --purge; package names are not
constructed or purged with regular expressions. Before execution, the script
logs the APT simulation and refuses cleanup unless at least two installed
versioned kernel image packages remain after simulated removals.
This protects a fallback kernel while preserving Ubuntu dependency policy. Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings, Secure Boot state, and the simulated removal set before manual execution.
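The two-kernel guard can be illustrated by comparing the installed package list with the Remv and Purg lines that apt-get prints in simulation mode. This is a sketch, not the script's actual logic. The function names are hypothetical, and it assumes "versioned kernel image" means a package name beginning with linux-image- followed by a digit.

```bash
#!/usr/bin/env bash
# Hypothetical guard around `apt-get -s autoremove --purge` output.
#   $1: file with installed package names, one per line
#   $2: file with the APT simulation output
remaining_kernel_images() {
  local installed_file=$1 simulation_file=$2
  # Assumption: versioned images look like linux-image-6.8.0-45-generic.
  local pattern='^linux-image-[0-9]'
  awk -v pat="$pattern" '
    FILENAME == ARGV[1] {
      # Simulation lines look like: "Purg <package> [<version>]"
      if ($1 == "Remv" || $1 == "Purg") removed[$2] = 1
      next
    }
    $1 ~ pat && !($1 in removed) { count++ }
    END { print count + 0 }
  ' "$simulation_file" "$installed_file"
}

# Succeeds only when at least two versioned kernel images would remain.
kernel_cleanup_allowed() {
  local remaining
  remaining=$(remaining_kernel_images "$1" "$2") || return 1
  [[ "$remaining" -ge 2 ]]
}
```

The guard only counts packages. It does not choose which kernels to remove, so selection stays with APT's dependency policy as described above.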
Backup policy
Backups are written to /srv/backups/ailab-config as
ailab-config-YYYYMMDD-HHMMSS.tar.gz. Matching archives older than 30 days are
deleted only after a new archive is created.
The backup covers /etc, selected root shell configuration,
/opt/ailab-maintenance when present, and libvirt configuration under
/var/lib/libvirt/qemu. It does not include /var/lib/docker, WebODM data,
Ollama models, VM disk images, or other large application datasets. Because
/etc is included in full, any configuration subdirectory beneath it is already
covered, even when the optional-path report lists it separately.
This is a local configuration backup, not a disaster-recovery design. A real deployment should copy archives to independently protected storage and test restoration.
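The archive naming and 30-day retention rule in this section could be implemented along these lines. The sketch assumes GNU find and date. The function names archive_name and prune_old_archives are hypothetical.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for the naming and retention rules described above.

# Print a name of the form ailab-config-YYYYMMDD-HHMMSS.tar.gz.
archive_name() {
  printf 'ailab-config-%s.tar.gz\n' "$(date +%Y%m%d-%H%M%S)"
}

# Delete matching archives older than the retention window and print each one.
#   $1: backup directory   $2: retention in days (default 30)
prune_old_archives() {
  local backup_dir=$1 retention_days=${2:-30}
  [[ -d "$backup_dir" ]] || {
    printf 'not a directory: %s\n' "$backup_dir" >&2
    return 2
  }
  # find truncates file age to whole days, so -mtime +30 matches files that
  # are at least 31 days old. Only files matching the archive pattern in the
  # top level of the directory are considered.
  find "$backup_dir" -maxdepth 1 -type f \
    -name 'ailab-config-*.tar.gz' -mtime +"$retention_days" -print -delete
}
```

Calling prune_old_archives only after tar has exited successfully preserves the ordering stated above, so a failed backup never reduces the number of archives on disk.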
Journald policy
The installer applies:
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
These settings bound journal growth while retaining useful troubleshooting evidence. Capacity and retention should be adjusted to the host's disk size and incident-response requirements.
Disk watch policy
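The disk check uses df -P, defaults to an 85 percent threshold, and returns
1 when any checked filesystem meets or exceeds the threshold. The threshold
is read from the AILAB_DISK_THRESHOLD environment variable. A unit can set the
same variable with an Environment= directive; for a manual invocation,
override it with: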
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
The script reports every filesystem as OK or WARNING; it does not delete
data or attempt remediation.
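The threshold comparison can be sketched as a filter over df -P output. This is an illustration under assumptions: the function name check_disk_usage and the exact output format are hypothetical, and the real script may exclude some filesystem types.

```bash
#!/usr/bin/env bash
# Hypothetical filter: reads `df -P` output on stdin and reports each line.
#   $1: threshold percentage, an integer from 1 to 100 (default 85)
# Returns 1 if any filesystem is at or above the threshold, 2 on bad input.
check_disk_usage() {
  local threshold=${1:-85}
  if ! [[ "$threshold" =~ ^([1-9][0-9]?|100)$ ]]; then
    printf 'invalid threshold: %s\n' "$threshold" >&2
    return 2
  fi
  awk -v limit="$threshold" '
    NR == 1 { next }                     # skip the df header line
    {
      use = $5
      sub(/%$/, "", use)                 # column 5 is Capacity, e.g. "42%"
      status = (use + 0 >= limit + 0) ? "WARNING" : "OK"
      if (status == "WARNING") failed = 1
      # $6 is the mount point; mount points containing spaces are truncated.
      printf "%s %s%% %s\n", status, use, $6
    }
    END { exit failed ? 1 : 0 }
  '
}
```

A caller could run `df -P | check_disk_usage "${AILAB_DISK_THRESHOLD:-85}"` and pass the return status through as the script's exit code. A non-zero exit marks the oneshot service as failed, which is how a warning surfaces in timer and unit state.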
Example operational workflows
Weekly maintenance review
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-
Review the kernel simulation, Docker usage, failed units, backup freshness, and disk warnings before approving manual changes.
Disk pressure investigation
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh
Use the evidence to identify ownership. Do not treat Docker pruning or file deletion as a substitute for application-specific retention policy.
Post-maintenance evidence
sudo /usr/local/sbin/ailab-healthcheck.sh \
| sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'
Interview talking points
- Why timer units explicitly carry the non-interactive execution boundary.
- Why APT dependency policy is safer than regex-based kernel deletion.
- How Docker volume preservation separates platform hygiene from application data lifecycle decisions.
- How optional dependency handling keeps one health command useful across container, GPU, and virtualization host variants.
- Why configuration backup and application-data backup are separate concerns.
- How exit codes, persistent timers, logs, and post-checks support operations.
Future improvements
- Add a dedicated logrotate policy after measuring log growth.
- Export disk-watch status to a monitoring system instead of relying only on timer failure state.
- Add automated archive integrity checks and off-host replication.
- Add Bats tests using mocked apt, docker, virsh, and systemctl commands.
- Add package-lock detection with bounded retry policy if recurring contention is observed.
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a dedicated read-only audit.