# AI Lab Maintenance Toolkit
## Executive summary
The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
Ubuntu AI infrastructure host named `ailab`. It combines repeatable health
reporting, disk monitoring, conservative package cleanup, Docker hygiene,
configuration backup, and non-destructive VM inventory into a small toolkit
that is readable enough for review and guarded enough for homelab use.
This is a portfolio and lab implementation, not evidence of production
certification. Review package policy, backup coverage, maintenance windows, and
application impact before deploying it to another host.
## Problem solved
AI lab hosts accumulate operating system packages, kernel packages, container
images, build cache, journals, and configuration changes while also carrying
stateful workloads. Manual maintenance is easy to defer and risky to perform
without evidence. This project provides scheduled, logged tasks with explicit
safety boundaries and separate read-only audit commands.
## What this demonstrates
- Bash strict mode, input validation, dependency checks, and operational exit
codes.
- Dry-run-first maintenance with explicit authorization for changes.
- systemd oneshot services and persistent calendar timers.
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
- Docker cleanup that preserves volumes.
- Configuration-focused backups with bounded retention.
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
- Idempotent installation and guarded JSON configuration updates.
## Architecture and directory layout
```text
ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│   ├── ailab-healthcheck.sh
│   ├── ailab-disk-watch.sh
│   ├── ailab-apt-cleanup.sh
│   ├── ailab-kernel-cleanup.sh
│   ├── ailab-docker-cleanup.sh
│   ├── ailab-config-backup.sh
│   └── ailab-vm-audit.sh
└── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer
```
The installer deploys scripts to `/usr/local/sbin` and units to
`/etc/systemd/system`. Scripts run directly as root from systemd rather than
through an additional framework.
## Maintenance tasks
| Command | Purpose | Change behavior |
| --- | --- | --- |
| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |
## Safety model
Change-capable scripts default to dry-run behavior. Manual execution requires
`--execute` and an interactive `EXECUTE` confirmation. The systemd services
use `--execute --non-interactive`; installing and enabling those reviewed unit
files is the explicit authorization for scheduled maintenance.
Exit codes follow the repository convention:
- `0`: completed successfully or an optional component was absent.
- `1`: an operational check or maintenance action failed.
- `2`: invalid input, missing required dependency, or insufficient privilege.

The scripts do not bypass APT or Docker locks, delete VM resources, manually
select kernel names for removal, or hide command failures.
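
The flag rules above can be sketched as a small Bash helper. This is a minimal illustration, not the toolkit's actual code: the function names `parse_mode` and `confirm_execute` and the mode labels are hypothetical, and only the `--execute` and `--non-interactive` flags, the `EXECUTE` confirmation, and exit code `2` come from this README.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of dry-run-first flag handling.

# parse_mode prints "dry-run", "interactive", or "scheduled".
# It returns 2 on invalid input, following the documented exit-code convention.
parse_mode() {
  local execute=0 non_interactive=0 arg
  for arg in "$@"; do
    case "$arg" in
      --execute) execute=1 ;;
      --non-interactive) non_interactive=1 ;;
      *) echo "unknown option: $arg" >&2; return 2 ;;
    esac
  done

  # --non-interactive is only meaningful together with --execute.
  if (( non_interactive == 1 && execute == 0 )); then
    echo "--non-interactive requires --execute" >&2
    return 2
  fi

  if (( execute == 0 )); then
    echo "dry-run"
  elif (( non_interactive == 1 )); then
    echo "scheduled"
  else
    echo "interactive"
  fi
}

# confirm_execute succeeds only when the operator types the literal word EXECUTE.
confirm_execute() {
  local answer
  read -r -p "Type EXECUTE to apply changes: " answer || return 1
  [[ "$answer" == "EXECUTE" ]]
}
```

A caller would branch on the printed mode: simulate only for `dry-run`, call `confirm_execute` before changing anything for `interactive`, and proceed directly for `scheduled`.
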
## Installation
Review every script and unit first. Installation changes package state,
journald settings, Docker daemon settings when Docker exists, and enabled timer
state.
```bash
cd labs/linux/ailab-maintenance
sudo ./install.sh
```
The installer:
1. Installs the documented Ubuntu utilities.
2. Deploys scripts and systemd units with fixed permissions.
3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
4. Restarts `systemd-journald`.
5. Validates and backs up an existing Docker `daemon.json`, merges log limits
with `jq`, and attempts a Docker restart.
6. Enables all five timers.
7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.

The installer is intended for Ubuntu 26.04. It is not run automatically by
repository validation.
## Manual commands
Read-only reports:
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh
```
Preview maintenance:
```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh
```
Apply reviewed maintenance interactively:
```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute
```
`--non-interactive` is reserved for reviewed automation and is rejected unless
`--execute` is also present.
## Systemd timers
| Timer | Schedule |
| --- | --- |
| `ailab-config-backup.timer` | Daily at 03:30 |
| `ailab-disk-watch.timer` | Hourly |
| `ailab-apt-cleanup.timer` | Sunday at 04:00 |
| `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
| `ailab-docker-cleanup.timer` | Sunday at 04:40 |
All timers use `Persistent=true`, so a run missed while the host was powered
off is triggered when the timer is next activated, typically at boot. Inspect
timer and service evidence with:
```bash
systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
```
## Logs
Scheduled and manual maintenance writes to:
```text
/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log
```
systemd also records service output in the journal. Logrotate is installed as a
dependency, but this lab does not create a custom rotation policy for these
small maintenance logs.
## Docker policy
Docker cleanup runs `docker system prune -af` and removes build cache older
than seven days. It never passes `--volumes`. Named and anonymous volumes
remain outside this automated policy and require application-aware review.
The installer configures the `json-file` driver with a maximum size of `50m`
and five files. Existing valid JSON is backed up and merged. Invalid JSON
causes installation to stop rather than overwrite operator configuration.
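
The validate-then-merge step might look like the following sketch. It assumes `jq` is installed; the helper name `merge_log_limits` and the separate source and destination arguments are hypothetical, and the installer's real implementation may differ. The `log-driver` and `log-opts` keys are Docker's standard `daemon.json` settings, and `log-opts` values must be strings.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: merge json-file log limits into an existing daemon.json.

# merge_log_limits SRC DEST
# Writes the merged configuration to DEST and leaves SRC untouched.
merge_log_limits() {
  local src=$1 dest=$2

  # Stop before writing anything if SRC is not a valid JSON object.
  if ! jq -e 'type == "object"' "$src" >/dev/null; then
    echo "refusing to merge: $src is not a valid JSON object" >&2
    return 1
  fi

  # The * operator merges objects recursively, so unrelated keys and any
  # existing log-opts entries are preserved.
  jq '. * {"log-driver": "json-file",
           "log-opts": {"max-size": "50m", "max-file": "5"}}' "$src" > "$dest"
}
```

Writing to a separate destination lets the caller keep a backup of the original and move the merged file into place only after it has been checked.
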
## Kernel policy
Kernel removal is delegated to `apt autoremove --purge`; package names are not
constructed or purged with regular expressions. Before execution, the script
logs the APT simulation and refuses cleanup unless at least two installed
versioned kernel image packages remain after simulated removals.
This protects a fallback kernel while preserving Ubuntu dependency policy.
Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings,
Secure Boot state, and the simulated removal set before manual execution.
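
The fallback-kernel guard can be illustrated with a short sketch. The function names are hypothetical, and the sketch assumes the simulation was saved from `apt-get -s autoremove --purge`, whose removal lines start with `Purg` or `Remv` followed by the package name. How the real script gathers the installed list is not stated in this README; the `dpkg-query` command in the comment is one possible approach.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the "at least two kernels remain" check.

# count_remaining_kernels INSTALLED_FILE SIMULATION_FILE
# INSTALLED_FILE: one installed versioned kernel image package per line, e.g. from
#   dpkg-query -W -f='${db:Status-Abbrev} ${Package}\n' 'linux-image-[0-9]*' \
#     | awk '$1 == "ii" { print $2 }'
# SIMULATION_FILE: saved output of `apt-get -s autoremove --purge`.
count_remaining_kernels() {
  local installed_file=$1 simulation_file=$2
  awk '
    # First file: collect every package the simulation would remove.
    FILENAME == ARGV[1] {
      if ($1 == "Purg" || $1 == "Remv") removed[$2] = 1
      next
    }
    # Second file: count installed kernels that are not in the removal set.
    NF && !($1 in removed) { remaining++ }
    END { print remaining + 0 }
  ' "$simulation_file" "$installed_file"
}

# kernel_cleanup_allowed succeeds only when two or more kernels would remain.
kernel_cleanup_allowed() {
  local remaining
  remaining=$(count_remaining_kernels "$1" "$2")
  (( remaining >= 2 ))
}
```

Because the check only reads the simulation, it never names packages for removal itself; APT's dependency resolution still decides the removal set.
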
## Backup policy
Backups are written to `/srv/backups/ailab-config` as
`ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are
deleted only after a new archive is created.
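
The create-then-prune ordering could be sketched as follows. The function names are hypothetical and the real script may structure this differently. Note that `find -mtime +30` matches files whose age, in whole 24-hour periods, is greater than 30, and that the sketch relies on the GNU `find` options `-maxdepth` and `-delete`.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of archive naming and bounded retention.

# archive_name prints a name in the documented ailab-config-YYYYMMDD-HHMMSS.tar.gz form.
archive_name() {
  printf 'ailab-config-%s.tar.gz\n' "$(date +%Y%m%d-%H%M%S)"
}

# prune_old_archives DIR [DAYS]
# Deletes only matching archives directly inside DIR; other files are left alone.
prune_old_archives() {
  local dir=$1 days=${2:-30}
  find "$dir" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' \
    -mtime +"$days" -print -delete
}

# create_and_prune DEST PATH...
# Retention runs only if tar succeeded, so a failed backup never removes old archives.
create_and_prune() {
  local dest=$1
  shift
  local archive
  archive="$dest/$(archive_name)"
  tar -czf "$archive" "$@" && prune_old_archives "$dest"
}
```

For example, `create_and_prune /srv/backups/ailab-config /etc` would archive `/etc` and then apply the 30-day retention to that directory.
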
The backup covers `/etc`, selected root shell configuration,
`/opt/ailab-maintenance` when present, and libvirt configuration under
`/var/lib/libvirt/qemu`. It does not include `/var/lib/docker`, WebODM data,
Ollama models, VM disk images, or other large application datasets. Because
`/etc` is included in full, configuration subdirectories beneath it are already
covered, even when the optional-path report lists them separately.
This is a local configuration backup, not a disaster-recovery design. A real
deployment should copy archives to independently protected storage and test
restoration.
## Journald policy
The installer applies:
```ini
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
```
These settings bound journal growth while retaining useful troubleshooting
evidence. Capacity and retention should be adjusted to the host's disk size
and incident-response requirements.
## Disk watch policy
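The disk check uses `df -P`, defaults to an 85 percent threshold, and returns
`1` when any checked filesystem meets or exceeds the threshold. Override the
threshold by setting `AILAB_DISK_THRESHOLD` in the environment. For the
scheduled unit, the variable can be set with an `Environment=` line in a
systemd drop-in. For a manual invocation: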
```bash
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
```
The script reports every filesystem as `OK` or `WARNING`; it does not delete
data or attempt remediation.
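
A minimal sketch of the threshold logic, assuming the POSIX `df -P` column layout (capacity in column 5, mount point in column 6). The function name `check_disk_usage` is hypothetical, the output format is illustrative rather than the script's actual format, and mount points containing spaces are not handled.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: classify filesystems from `df -P` output read on stdin.

# check_disk_usage [THRESHOLD]
# Returns 0 when all filesystems are below THRESHOLD, 1 when any meets or
# exceeds it, and 2 when THRESHOLD is not an integer from 1 to 100.
check_disk_usage() {
  local threshold=${1:-85}
  if ! [[ "$threshold" =~ ^[0-9]+$ ]] ||
     (( 10#$threshold < 1 || 10#$threshold > 100 )); then
    echo "invalid threshold: $threshold" >&2
    return 2
  fi

  awk -v limit="$threshold" '
    NR == 1 { next }                    # skip the df header row
    {
      use = $5
      sub(/%$/, "", use)                # "42%" becomes "42"
      status = (use + 0 >= limit + 0) ? "WARNING" : "OK"
      if (status == "WARNING") bad = 1
      printf "%-7s %3d%% %s\n", status, use, $6
    }
    END { exit (bad ? 1 : 0) }
  '
}
```

It could be driven with `df -P | check_disk_usage "${AILAB_DISK_THRESHOLD:-85}"`, which keeps the check read-only because it only parses text.
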
## Example operational workflows
### Weekly maintenance review
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-
```
Review the kernel simulation, Docker usage, failed units, backup freshness, and
disk warnings before approving manual changes.
### Disk pressure investigation
```bash
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh
```
Use the evidence to identify ownership. Do not treat Docker pruning or file
deletion as a substitute for application-specific retention policy.
### Post-maintenance evidence
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh \
| sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'
```
## Interview talking points
- Why timer units explicitly carry the non-interactive execution boundary.
- Why APT dependency policy is safer than regex-based kernel deletion.
- How Docker volume preservation separates platform hygiene from application
data lifecycle decisions.
- How optional dependency handling keeps one health command useful across
container, GPU, and virtualization host variants.
- Why configuration backup and application-data backup are separate concerns.
- How exit codes, persistent timers, logs, and post-checks support operations.
## Future improvements
- Add a dedicated logrotate policy after measuring log growth.
- Export disk-watch status to a monitoring system instead of relying only on
timer failure state.
- Add automated archive integrity checks and off-host replication.
- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl`
commands.
- Add package-lock detection with bounded retry policy if recurring contention
is observed.
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a
dedicated read-only audit.