Mateusz Suski 8cb92de06f
Add AI lab maintenance toolkit
2026-06-06 00:10:44 +00:00

AI Lab Maintenance Toolkit

Executive summary

The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an Ubuntu AI infrastructure host named ailab. It combines repeatable health reporting, disk monitoring, conservative package cleanup, Docker hygiene, configuration backup, and non-destructive VM inventory into a small toolkit that is readable enough for review and guarded enough for homelab use.

This is a portfolio and lab implementation, not evidence of production certification. Review package policy, backup coverage, maintenance windows, and application impact before deploying it to another host.

Problem solved

AI lab hosts accumulate operating system packages, kernel packages, container images, build cache, journals, and configuration changes while also carrying stateful workloads. Manual maintenance is easy to defer and risky to perform without evidence. This project provides scheduled, logged tasks with explicit safety boundaries and separate read-only audit commands.

What this demonstrates

  • Bash strict mode, input validation, dependency checks, and operational exit codes.
  • Dry-run-first maintenance with explicit authorization for changes.
  • systemd oneshot services and persistent calendar timers.
  • APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
  • Docker cleanup that preserves volumes.
  • Configuration-focused backups with bounded retention.
  • Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
  • Idempotent installation and guarded JSON configuration updates.

Architecture and directory layout

ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│   ├── ailab-healthcheck.sh
│   ├── ailab-disk-watch.sh
│   ├── ailab-apt-cleanup.sh
│   ├── ailab-kernel-cleanup.sh
│   ├── ailab-docker-cleanup.sh
│   ├── ailab-config-backup.sh
│   └── ailab-vm-audit.sh
└── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer

The installer deploys scripts to /usr/local/sbin and units to /etc/systemd/system. Scripts run directly as root from systemd rather than through an additional framework.

Maintenance tasks

Command                  Purpose                                                       Change behavior
ailab-healthcheck.sh     Host, storage, service, container, VM, GPU, and SMART report  Read-only
ailab-disk-watch.sh      Filesystem threshold check                                    Read-only
ailab-apt-cleanup.sh     APT metadata refresh and unused package cleanup               Dry-run by default
ailab-kernel-cleanup.sh  APT-managed kernel package cleanup                            Dry-run by default
ailab-docker-cleanup.sh  Unused Docker object and build-cache cleanup                  Dry-run by default
ailab-config-backup.sh   Configuration archive and retention                           Dry-run by default
ailab-vm-audit.sh        VM, pool, volume, and image-file inventory                    Read-only

Safety model

Change-capable scripts default to dry-run behavior. Manual execution requires --execute and an interactive EXECUTE confirmation. The systemd services use --execute --non-interactive; installing and enabling those reviewed unit files is the explicit authorization for scheduled maintenance.

Exit codes follow the repository convention:

  • 0: completed successfully or an optional component was absent.
  • 1: an operational check or maintenance action failed.
  • 2: invalid input, missing required dependency, or insufficient privilege.

The scripts do not bypass APT or Docker locks, delete VM resources, manually select kernel names for removal, or hide command failures.
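
The dry-run default and the exit-code convention above could be sketched roughly as follows. This is an illustration under stated assumptions, not the toolkit's actual code: the MODE variable and the helper names run, require_cmd, and report_if_present are hypothetical.

```bash
# Illustrative sketch only: MODE, run, require_cmd, and report_if_present
# are hypothetical names, not taken from the toolkit's scripts.

MODE="${MODE:-dry-run}"   # nothing changes unless the caller opts in

# Print a state-changing command in dry-run mode; execute it otherwise.
run() {
  if [ "$MODE" = "dry-run" ]; then
    printf 'DRY-RUN:'
    printf ' %q' "$@"
    printf '\n'
    return 0
  fi
  "$@"
}

# Required dependency: absence maps to exit code 2.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 && return 0
  printf 'missing required command: %s\n' "$1" >&2
  return 2
}

# Optional component: run the report only when the tool exists.
# Absence is logged and still counts as success (exit code 0).
report_if_present() {
  local tool="$1"
  shift
  if ! command -v "$tool" >/dev/null 2>&1; then
    printf 'SKIP: %s not installed\n' "$tool"
    return 0
  fi
  "$@"
}
```

With helpers like these, a script might call `require_cmd df || exit 2` up front, wrap each destructive step in `run`, and use `report_if_present docker docker system df` so that a host without Docker still produces a successful report.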

Installation

Review every script and unit first. Installation changes package state, journald settings, Docker daemon settings when Docker exists, and enabled timer state.

cd labs/linux/ailab-maintenance
sudo ./install.sh

The installer:

  1. Installs the documented Ubuntu utilities.
  2. Deploys scripts and systemd units with fixed permissions.
  3. Writes /etc/systemd/journald.conf.d/ailab-limits.conf.
  4. Restarts systemd-journald.
  5. Validates and backs up an existing Docker daemon.json, merges log limits with jq, and attempts a Docker restart.
  6. Enables all five timers.
  7. Writes an initial report to /root/ailab-healthcheck-now.txt.
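
Step 2 can be made idempotent by comparing content before writing. The helper below is a minimal sketch under that assumption; deploy_file is a hypothetical name, and the real installer may set ownership or handle errors differently.

```bash
# Illustrative sketch: deploy_file is a hypothetical helper name.
# Copy src to dest with a fixed mode, skipping the write when the
# content is already identical so repeated installs change nothing.
deploy_file() {
  local src="$1" dest="$2" mode="$3"
  if [ -f "$dest" ] && cmp -s "$src" "$dest"; then
    chmod "$mode" "$dest" || return 1   # still enforce the fixed mode
    printf 'unchanged: %s\n' "$dest"
    return 0
  fi
  # install(1) copies the file and sets the mode in one step.
  # A root-run installer could also pass -o root -g root here.
  install -m "$mode" "$src" "$dest" || return 1
  printf 'installed: %s\n' "$dest"
}
```

A call such as `deploy_file scripts/ailab-disk-watch.sh /usr/local/sbin/ailab-disk-watch.sh 0755` reports "installed" on the first run and "unchanged" on later runs with the same source.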

The installer is intended for Ubuntu 26.04. It is not run automatically by repository validation.

Manual commands

Read-only reports:

sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh

Preview maintenance:

sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh

Apply reviewed maintenance interactively:

sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute

--non-interactive is reserved for reviewed automation and is rejected unless --execute is also present.
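
The flag rules above can be sketched as a small parser plus a confirmation gate. The function names parse_mode and confirm_execute and the mode labels are hypothetical; only the --execute and --non-interactive flags and the EXECUTE confirmation word come from this README.

```bash
# Illustrative sketch: parse_mode and confirm_execute are hypothetical names.

# Map command-line flags to one of: dry-run, confirm, unattended.
# Returns 2 for invalid input, following the exit-code convention.
parse_mode() {
  local execute=0 non_interactive=0 arg
  for arg in "$@"; do
    case "$arg" in
      --execute)         execute=1 ;;
      --non-interactive) non_interactive=1 ;;
      *) printf 'unknown option: %s\n' "$arg" >&2; return 2 ;;
    esac
  done
  if [ "$non_interactive" -eq 1 ] && [ "$execute" -eq 0 ]; then
    printf '%s\n' '--non-interactive requires --execute' >&2
    return 2
  fi
  if [ "$execute" -eq 0 ]; then
    echo dry-run
  elif [ "$non_interactive" -eq 1 ]; then
    echo unattended
  else
    echo confirm
  fi
}

# Interactive gate: proceed only when the operator types EXECUTE exactly.
confirm_execute() {
  local answer
  read -r -p 'Type EXECUTE to apply changes: ' answer || return 1
  [ "$answer" = "EXECUTE" ]
}
```

A script built this way would run `mode="$(parse_mode "$@")" || exit 2` and call `confirm_execute` only when the mode is confirm, so a mistyped or missing confirmation leaves the host untouched.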

Systemd timers

Timer                       Schedule
ailab-config-backup.timer   Daily at 03:30
ailab-disk-watch.timer      Hourly
ailab-apt-cleanup.timer     Sunday at 04:00
ailab-kernel-cleanup.timer  Sunday at 04:20
ailab-docker-cleanup.timer  Sunday at 04:40

All timers use Persistent=true, so a run that was missed while the host was powered off is triggered the next time the timer is activated, typically at boot. Inspect timer and service evidence with:

systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
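
For orientation, a timer matching the schedule table might look like the unit text printed by the sketch below. The render_timer helper, the Description text, and the exact OnCalendar strings are assumptions; the repository's unit files are the authoritative source.

```bash
# Illustrative sketch: render_timer is a hypothetical helper, and the unit
# text is an assumption about what a matching timer could look like.
render_timer() {
  local description="$1" calendar="$2"
  cat <<EOF
[Unit]
Description=${description}

[Timer]
OnCalendar=${calendar}
Persistent=true

[Install]
WantedBy=timers.target
EOF
}

# "Sunday at 04:20" expressed in systemd calendar syntax.
render_timer "AI lab kernel cleanup" "Sun *-*-* 04:20:00"
```

A timer unit activates the service with the same base name unless told otherwise, so ailab-kernel-cleanup.timer starts ailab-kernel-cleanup.service. A calendar expression can be checked before installation with `systemd-analyze calendar 'Sun *-*-* 04:20:00'`.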

Logs

Scheduled and manual maintenance writes to:

/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log

systemd also records service output in the journal. Logrotate is installed as a dependency, but this lab does not create a custom rotation policy for these small maintenance logs.

Docker policy

Docker cleanup runs docker system prune -af and removes build cache older than seven days. It never passes --volumes. Named and anonymous volumes remain outside this automated policy and require application-aware review.

The installer configures the json-file driver with a maximum size of 50m and five files. Existing valid JSON is backed up and merged. Invalid JSON causes installation to stop rather than overwrite operator configuration.
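
The validate, back up, and merge sequence could be sketched as below. This assumes jq is installed; the function name merge_docker_log_limits and the backup file suffix are hypothetical, and the installer's real implementation may differ.

```bash
# Illustrative sketch: merge_docker_log_limits and the .bak.<timestamp>
# suffix are hypothetical. Requires jq.
merge_docker_log_limits() {
  local config="$1" backup tmp

  # Stop on invalid JSON instead of overwriting operator configuration.
  if ! jq empty "$config" >/dev/null 2>&1; then
    printf 'invalid JSON, leaving untouched: %s\n' "$config" >&2
    return 1
  fi

  backup="${config}.bak.$(date +%Y%m%d-%H%M%S)"
  cp -p "$config" "$backup" || return 1

  tmp="$(mktemp)" || return 1
  # The jq "*" operator merges objects recursively, so existing keys
  # (including other log-opts entries) survive the merge.
  if ! jq '. * {"log-driver": "json-file",
                "log-opts": {"max-size": "50m", "max-file": "5"}}' \
        "$config" >"$tmp"; then
    rm -f "$tmp"
    return 1
  fi
  # Write through the existing file so its mode and ownership are kept.
  cat "$tmp" >"$config" || { rm -f "$tmp"; return 1; }
  rm -f "$tmp"
}
```

On a real host the target would be /etc/docker/daemon.json, followed by a Docker restart such as `systemctl restart docker` so the new log limits apply to newly created containers.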

Kernel policy

Kernel removal is delegated to apt autoremove --purge; package names are not constructed or purged with regular expressions. Before execution, the script logs the APT simulation and refuses cleanup unless at least two installed versioned kernel image packages remain after simulated removals.

This protects a fallback kernel while preserving Ubuntu dependency policy. Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings, Secure Boot state, and the simulated removal set before manual execution.
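
The two-kernel guard can be sketched as a pure function over the installed package list and the APT simulation output. The function names are hypothetical, and the sketch assumes that versioned image packages match linux-image-<digit>... and that simulated removals appear as lines starting with Remv or Purg, as apt-get -s prints them.

```bash
# Illustrative sketch: kernels_remaining and kernel_cleanup_allowed are
# hypothetical names.

# Count versioned kernel image packages that would survive the simulation.
# $1: newline-separated installed package names
# $2: output of: apt-get -s autoremove --purge
kernels_remaining() {
  local installed="$1" simulation="$2"
  awk '
    # First input: collect kernel images the simulation would remove.
    NR == FNR {
      if (($1 == "Purg" || $1 == "Remv") && $2 ~ /^linux-image-[0-9]/) gone[$2] = 1
      next
    }
    # Second input: count installed versioned images that are not removed.
    $1 ~ /^linux-image-[0-9]/ && !($1 in gone) { n++ }
    END { print n + 0 }
  ' <(printf '%s\n' "$simulation") <(printf '%s\n' "$installed")
}

# Refuse cleanup unless at least two kernel images would remain.
kernel_cleanup_allowed() {
  local remaining
  remaining="$(kernels_remaining "$1" "$2")"
  if [ "$remaining" -lt 2 ]; then
    printf 'refusing: only %s kernel image(s) would remain\n' "$remaining" >&2
    return 1
  fi
}

# Possible inputs on a real host (not executed by this sketch):
#   installed="$(dpkg-query -W -f='${db:Status-Abbrev}${Package}\n' \
#                'linux-image-[0-9]*' | awk '$1 == "ii" { print $2 }')"
#   simulation="$(apt-get -s autoremove --purge)"
```

Metapackages such as linux-image-generic do not match the versioned pattern, so they are not counted as a fallback kernel.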

Backup policy

Backups are written to /srv/backups/ailab-config as ailab-config-YYYYMMDD-HHMMSS.tar.gz. Matching archives older than 30 days are deleted only after a new archive is created.
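
The ordering described above, where pruning happens only after a new archive exists, could look like the following sketch. The function name backup_and_prune and its argument layout are hypothetical; the archive name pattern and the 30-day retention come from this README. The sketch assumes GNU tar and find.

```bash
# Illustrative sketch: backup_and_prune is a hypothetical name.
# $1: destination directory, $2: retention in days, rest: arguments for tar
backup_and_prune() {
  local dest="$1" retention_days="$2"
  shift 2
  local archive
  archive="${dest}/ailab-config-$(date +%Y%m%d-%H%M%S).tar.gz"

  mkdir -p "$dest" || return 1

  # Create the new archive first. On failure, remove the partial file and
  # return without touching any older archive.
  if ! tar -czf "$archive" "$@"; then
    rm -f "$archive"
    return 1
  fi

  # Only now delete matching archives older than the retention period.
  # The name pattern keeps unrelated files in the directory safe.
  find "$dest" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' \
    -mtime +"$retention_days" -delete

  printf '%s\n' "$archive"
}
```

A call such as `backup_and_prune /srv/backups/ailab-config 30 /etc` would archive /etc and then apply the retention rule. Because deletion follows a successful tar run, a failing backup never shrinks the existing set of archives.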

The backup covers /etc, selected root shell configuration, /opt/ailab-maintenance when present, and libvirt configuration under /var/lib/libvirt/qemu. It does not include /var/lib/docker, WebODM data, Ollama models, VM disk images, or other large application datasets. Because /etc is archived in full, any configuration subdirectory beneath it is already captured, even when the optional-path report lists it as a separate entry.

This is a local configuration backup, not a disaster-recovery design. A real deployment should copy archives to independently protected storage and test restoration.

Journald policy

The installer applies:

[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes

These settings bound journal growth while retaining useful troubleshooting evidence. Capacity and retention should be adjusted to the host's disk size and incident-response requirements.

Disk watch policy

The disk check uses df -P, defaults to an 85 percent threshold, and returns 1 when any checked filesystem meets or exceeds the threshold. Override the threshold for a manual or unit invocation with:

sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh

The script reports every filesystem as OK or WARNING; it does not delete data or attempt remediation.
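
The threshold check can be sketched as a filter over df -P output. The function name check_disk_usage and the output format are hypothetical; the AILAB_DISK_THRESHOLD variable, the 85 percent default, and the exit codes follow this README. The sketch assumes mount points without spaces.

```bash
# Illustrative sketch: check_disk_usage is a hypothetical name.
# Reads `df -P` output on stdin; $1 is the threshold in percent.
check_disk_usage() {
  local threshold="${1:-85}"

  # Invalid input maps to exit code 2.
  case "$threshold" in
    ''|*[!0-9]*)
      printf 'invalid threshold: %s\n' "$threshold" >&2
      return 2
      ;;
  esac

  awk -v limit="$threshold" '
    NR == 1 { next }                      # skip the df -P header line
    {
      used = $5
      sub(/%$/, "", used)                 # "91%" becomes "91"
      status = (used + 0 >= limit + 0) ? "WARNING" : "OK"
      if (status == "WARNING") failed = 1
      printf "%s %s %s%%\n", status, $6, used
    }
    END { exit (failed ? 1 : 0) }         # 1 when any filesystem is at or over the limit
  '
}

# Possible invocation on a real host (not executed by this sketch):
#   df -P | check_disk_usage "${AILAB_DISK_THRESHOLD:-85}"
```

Keeping the parser separate from the df call makes the threshold logic easy to exercise with captured output, which fits the mocked-command testing listed under future improvements.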

Example operational workflows

Weekly maintenance review

sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-

Review the kernel simulation, Docker usage, failed units, backup freshness, and disk warnings before approving manual changes.

Disk pressure investigation

sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh

Use the evidence to identify ownership. Do not treat Docker pruning or file deletion as a substitute for application-specific retention policy.

Post-maintenance evidence

sudo /usr/local/sbin/ailab-healthcheck.sh \
  | sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'

Interview talking points

  • Why timer units explicitly carry the non-interactive execution boundary.
  • Why APT dependency policy is safer than regex-based kernel deletion.
  • How Docker volume preservation separates platform hygiene from application data lifecycle decisions.
  • How optional dependency handling keeps one health command useful across container, GPU, and virtualization host variants.
  • Why configuration backup and application-data backup are separate concerns.
  • How exit codes, persistent timers, logs, and post-checks support operations.

Future improvements

  • Add a dedicated logrotate policy after measuring log growth.
  • Export disk-watch status to a monitoring system instead of relying only on timer failure state.
  • Add automated archive integrity checks and off-host replication.
  • Add Bats tests using mocked apt, docker, virsh, and systemctl commands.
  • Add package-lock detection with bounded retry policy if recurring contention is observed.
  • Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a dedicated read-only audit.