# AI Lab Maintenance Toolkit

## Executive summary

The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
Ubuntu AI infrastructure host named `ailab`. It combines repeatable health
reporting, disk monitoring, conservative package cleanup, Docker hygiene,
configuration backup, and non-destructive VM inventory into a small toolkit
that is readable enough for review and guarded enough for homelab use.

This is a portfolio and lab implementation, not evidence of production
certification. Review package policy, backup coverage, maintenance windows, and
application impact before deploying it to another host.

## Problem solved

AI lab hosts accumulate operating system packages, kernel packages, container
images, build cache, journals, and configuration changes while also carrying
stateful workloads. Manual maintenance is easy to defer and risky to perform
without evidence. This project provides scheduled, logged tasks with explicit
safety boundaries and separate read-only audit commands.

## What this demonstrates

- Bash strict mode, input validation, dependency checks, and operational exit
  codes.
- Dry-run-first maintenance with explicit authorization for changes.
- systemd oneshot services and persistent calendar timers.
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
- Docker cleanup that preserves volumes.
- Configuration-focused backups with bounded retention.
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
- Idempotent installation and guarded JSON configuration updates.

## Architecture and directory layout

```text
ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│   ├── ailab-healthcheck.sh
│   ├── ailab-disk-watch.sh
│   ├── ailab-apt-cleanup.sh
│   ├── ailab-kernel-cleanup.sh
│   ├── ailab-docker-cleanup.sh
│   ├── ailab-config-backup.sh
│   └── ailab-vm-audit.sh
└── systemd/
    ├── ailab-apt-cleanup.service
    ├── ailab-apt-cleanup.timer
    ├── ailab-kernel-cleanup.service
    ├── ailab-kernel-cleanup.timer
    ├── ailab-docker-cleanup.service
    ├── ailab-docker-cleanup.timer
    ├── ailab-config-backup.service
    ├── ailab-config-backup.timer
    ├── ailab-disk-watch.service
    └── ailab-disk-watch.timer
```

The installer deploys scripts to `/usr/local/sbin` and units to
`/etc/systemd/system`. Scripts run directly as root from systemd rather than
through an additional framework.

## Maintenance tasks

| Command | Purpose | Change behavior |
| --- | --- | --- |
| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |

## Safety model

Change-capable scripts default to dry-run behavior. Manual execution requires
`--execute` and an interactive `EXECUTE` confirmation. The systemd services
use `--execute --non-interactive`; installing and enabling those reviewed unit
files is the explicit authorization for scheduled maintenance.

Exit codes follow the repository convention:

- `0`: completed successfully or an optional component was absent.
- `1`: an operational check or maintenance action failed.
- `2`: invalid input, missing required dependency, or insufficient privilege.

The scripts do not bypass APT or Docker locks, delete VM resources, manually
select kernel names for removal, or hide command failures.
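The exit-code convention above could be wired into shared helpers along the following lines. This is a minimal sketch rather than the toolkit's actual code: the function names (`require_command`, `have_optional`, `require_root`) and the constant names are assumptions made for illustration.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Exit codes from the convention above; the constant names are illustrative.
readonly EXIT_OK=0 EXIT_FAILED=1 EXIT_USAGE=2

# Required dependency: a missing command is an exit-2 condition.
require_command() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "ERROR: required command not found: $1" >&2
    exit "$EXIT_USAGE"
  fi
}

# Optional component: report the absence and let the caller skip the check.
have_optional() {
  if command -v "$1" >/dev/null 2>&1; then
    return 0
  fi
  echo "SKIP: optional component not installed: $1"
  return 1
}

# Privileged scripts stop early instead of failing halfway through.
require_root() {
  if [[ "${EUID}" -ne 0 ]]; then
    echo "ERROR: this script must run as root" >&2
    exit "$EXIT_USAGE"
  fi
}
```

A health report might call `require_command df` once at startup and wrap each optional section as `if have_optional docker; then ...; fi`, so an absent component produces a `SKIP` line and the run can still finish with exit code `0`.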

## Installation

Review every script and unit first. Installation changes package state,
journald settings, Docker daemon settings when Docker exists, and enabled timer
state.

```bash
cd labs/linux/ailab-maintenance
sudo ./install.sh
```

The installer:

1. Installs the documented Ubuntu utilities.
2. Deploys scripts and systemd units with fixed permissions.
3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
4. Restarts `systemd-journald`.
5. Validates and backs up an existing Docker `daemon.json`, merges log limits
   with `jq`, and attempts a Docker restart.
6. Enables all five timers.
7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.

The installer is intended for Ubuntu 26.04. It is not run automatically by
repository validation.
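Step 2 (deploying files with fixed permissions) can be made idempotent by comparing content before writing. The sketch below shows one way to do that; `deploy_file` is a hypothetical helper, not a function taken from `install.sh`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Copy src to dest with a fixed mode. When the content already matches,
# skip the write so repeated installer runs leave the file untouched.
deploy_file() {
  local src="$1" dest="$2" mode="$3"
  if [[ -f "$dest" ]] && cmp -s "$src" "$dest"; then
    chmod "$mode" "$dest"   # still enforce the expected permissions
    echo "unchanged: $dest"
    return 0
  fi
  install -m "$mode" "$src" "$dest"
  echo "installed: $dest"
}
```

An installer running as root would probably also pass `-o root -g root` to `install` and call the helper once per script (mode `0755`) and once per unit file (mode `0644`); those modes are assumptions, not values confirmed by this README.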

## Manual commands

Read-only reports:

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh
```

Preview maintenance:

```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh
```

Apply reviewed maintenance interactively:

```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute
```

`--non-interactive` is reserved for reviewed automation and is rejected unless
`--execute` is also present.
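The flag rules described in this section could be implemented roughly as follows. This is a sketch under assumptions: `parse_mode` and `confirm_execute` are illustrative names, and the real scripts may structure their argument handling differently.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Prints the resolved mode: dry-run, interactive, or non-interactive.
# Returns 2 for invalid input, matching the exit-code convention.
parse_mode() {
  local execute=false non_interactive=false arg
  for arg in "$@"; do
    case "$arg" in
      --execute) execute=true ;;
      --non-interactive) non_interactive=true ;;
      *) echo "ERROR: unknown option: $arg" >&2; return 2 ;;
    esac
  done
  if [[ "$non_interactive" == true && "$execute" != true ]]; then
    echo "ERROR: --non-interactive requires --execute" >&2
    return 2
  fi
  if [[ "$execute" != true ]]; then
    echo "dry-run"
  elif [[ "$non_interactive" == true ]]; then
    echo "non-interactive"
  else
    echo "interactive"
  fi
}

# Interactive runs proceed only when the operator types EXECUTE exactly.
confirm_execute() {
  local reply
  read -r -p "Type EXECUTE to apply changes: " reply
  [[ "$reply" == "EXECUTE" ]]
}
```

With no flags the sketch resolves to `dry-run`, so a mistyped or forgotten option previews changes instead of applying them.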

## Systemd timers

| Timer | Schedule |
| --- | --- |
| `ailab-config-backup.timer` | Daily at 03:30 |
| `ailab-disk-watch.timer` | Hourly |
| `ailab-apt-cleanup.timer` | Sunday at 04:00 |
| `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
| `ailab-docker-cleanup.timer` | Sunday at 04:40 |

All timers use `Persistent=true`, so a missed event runs after the host becomes
available. Inspect timer and service evidence with:

```bash
systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
```
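For readers who have not seen a timer and oneshot service pair, the sketch below prints what one such pair might look like. The unit contents are assumptions inferred from the schedule table and the safety model, not copies of the files under `systemd/`; the `render_timer` and `render_service` helpers and the `Description=` text are hypothetical.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print a timer unit for the given task name and OnCalendar expression.
render_timer() {
  local name="$1" calendar="$2"
  cat <<EOF
[Unit]
Description=Schedule ${name}

[Timer]
OnCalendar=${calendar}
Persistent=true
Unit=${name}.service

[Install]
WantedBy=timers.target
EOF
}

# Print the matching oneshot service that carries the execution flags.
render_service() {
  local name="$1"
  cat <<EOF
[Unit]
Description=Run ${name}

[Service]
Type=oneshot
ExecStart=/usr/local/sbin/${name}.sh --execute --non-interactive
EOF
}
```

For example, `render_timer ailab-kernel-cleanup 'Sun *-*-* 04:20:00'` prints a timer matching the Sunday 04:20 row. The read-only disk watch service would presumably omit the `--execute --non-interactive` flags.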

## Logs

Scheduled and manual maintenance writes to:

```text
/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log
```

systemd also records service output in the journal. Logrotate is installed as a
dependency, but this lab does not create a custom rotation policy for these
small maintenance logs.

## Docker policy

Docker cleanup runs `docker system prune -af` and removes build cache older
than seven days. It never passes `--volumes`. Named and anonymous volumes
remain outside this automated policy and require application-aware review.

The installer configures the `json-file` driver with a maximum size of `50m`
and five files. Existing valid JSON is backed up and merged. Invalid JSON
causes installation to stop rather than overwrite operator configuration.
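The guarded merge described above can be expressed as a single `jq` filter. The following is a sketch of the idea, not the installer's exact code: `merge_log_limits` is a hypothetical name, and the commented file handling uses an illustrative backup suffix.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Merge the log limits into an existing daemon.json read from stdin.
# Existing keys, including unrelated log-opts entries, are preserved.
# jq exits non-zero on invalid JSON, so a broken file is never rewritten.
merge_log_limits() {
  jq '. + {
    "log-driver": "json-file",
    "log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
  }'
}

# Possible usage as root (paths and backup suffix are assumptions):
#   cp -a /etc/docker/daemon.json /etc/docker/daemon.json.bak
#   merge_log_limits < /etc/docker/daemon.json > /etc/docker/daemon.json.new
#   mv /etc/docker/daemon.json.new /etc/docker/daemon.json
```

Writing to a temporary file and renaming it only after `jq` succeeds leaves the operator's original configuration in place when validation fails. Docker expects `log-opts` values as strings, which is why `max-file` is written as `"5"`.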

## Kernel policy

Kernel removal is delegated to `apt autoremove --purge`; package names are not
constructed or purged with regular expressions. Before execution, the script
logs the APT simulation and refuses cleanup unless at least two installed
versioned kernel image packages remain after simulated removals.

This protects a fallback kernel while preserving Ubuntu dependency policy.
Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings,
Secure Boot state, and the simulated removal set before manual execution.
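The fallback-kernel guard could be sketched as below. The pattern here only counts packages that APT's own simulation already selected; it never chooses what to remove. The function names and the `linux-image-<version>` naming assumption are illustrative and may not match `ailab-kernel-cleanup.sh`.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Versioned kernel image package names, e.g. linux-image-6.8.0-45-generic.
# Meta-packages such as linux-image-generic do not match.
readonly KERNEL_IMAGE_RE='^linux-image-[0-9]'

# Reads `apt-get -s autoremove --purge` output on stdin and prints the
# versioned kernel image packages that the simulation would remove.
simulated_kernel_removals() {
  awk -v re="$KERNEL_IMAGE_RE" \
    '($1 == "Remv" || $1 == "Purg") && $2 ~ re { print $2 }'
}

# Args: installed_count removal_count. Succeeds only when at least two
# versioned kernel images would remain after the simulated removals.
kernel_cleanup_allowed() {
  local installed="$1" removals="$2"
  (( installed - removals >= 2 ))
}
```

A caller would feed the simulation output to `simulated_kernel_removals`, count installed images (for example by filtering `dpkg-query -W` output), and exit with code `1` without invoking APT for real when `kernel_cleanup_allowed` fails.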

## Backup policy

Backups are written to `/srv/backups/ailab-config` as
`ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are
deleted only after a new archive is created.

The backup covers `/etc`, selected root shell configuration,
`/opt/ailab-maintenance` when present, and libvirt configuration under
`/var/lib/libvirt/qemu`. It does not include `/var/lib/docker`, WebODM data,
Ollama models, VM disk images, or other large application datasets. Because
`/etc` is included, explicitly listed configuration subdirectories are already
covered even when optional-path reporting mentions them separately.

This is a local configuration backup, not a disaster-recovery design. A real
deployment should copy archives to independently protected storage and test
restoration.
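The "create first, prune second" ordering can be sketched as follows. This is an illustration under assumptions: the function names are hypothetical, GNU `find` and `tar` are assumed, and the caller supplies the paths to archive.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Build the archive name from the current local time.
archive_name() {
  printf 'ailab-config-%s.tar.gz\n' "$(date +%Y%m%d-%H%M%S)"
}

# Delete matching archives older than the given number of days.
# Only files directly inside backup_dir that match the naming pattern
# are considered, so unrelated files are never touched.
prune_old_archives() {
  local backup_dir="$1" days="$2"
  find "$backup_dir" -maxdepth 1 -type f \
    -name 'ailab-config-*.tar.gz' -mtime "+${days}" -print -delete
}

# Create the archive first; prune only if creation succeeded.
# Remaining arguments are passed to tar as the paths to include.
backup_then_prune() {
  local backup_dir="$1" days="$2"
  shift 2
  local archive
  archive="${backup_dir}/$(archive_name)"
  tar -czf "$archive" "$@"
  prune_old_archives "$backup_dir" "$days"
}
```

Because `set -e` aborts the function when `tar` fails, a failed backup never reaches the pruning step, which keeps the most recent good archives in place.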

## Journald policy

The installer applies:

```ini
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
```

These settings bound journal growth while retaining useful troubleshooting
evidence. Capacity and retention should be adjusted to the host's disk size
and incident-response requirements.

## Disk watch policy

The disk check uses `df -P`, defaults to an 85 percent threshold, and returns
`1` when any checked filesystem meets or exceeds the threshold. Override the
threshold for a manual or unit invocation with:

```bash
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
```

The script reports every filesystem as `OK` or `WARNING`; it does not delete
data or attempt remediation.
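The threshold logic can be illustrated with a function that parses `df -P` output from standard input. This is a sketch, not the shipped script: `check_disk_usage` and the exact line format are assumptions, and it expects mount points without spaces.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Reads `df -P` output on stdin. Prints one OK/WARNING line per filesystem
# and returns 1 when any filesystem meets or exceeds the threshold.
# Returns 2 when the threshold is not an integer from 1 to 100.
check_disk_usage() {
  local threshold="$1"
  if [[ ! "$threshold" =~ ^[0-9]+$ ]] ||
     (( 10#$threshold < 1 || 10#$threshold > 100 )); then
    echo "ERROR: threshold must be an integer from 1 to 100" >&2
    return 2
  fi
  awk -v limit="$threshold" '
    NR == 1 { next }                       # skip the header row
    {
      used = $5; sub(/%$/, "", used)       # strip the trailing percent sign
      status = (used + 0 >= limit + 0) ? "WARNING" : "OK"
      if (status == "WARNING") failed = 1
      printf "%s: %s at %s%% (%s)\n", status, $6, used, $1
    }
    END { exit failed ? 1 : 0 }
  '
}

# Possible usage:
#   df -P | check_disk_usage "${AILAB_DISK_THRESHOLD:-85}"
```

Reading from standard input keeps the parsing testable with captured `df -P` samples, and the non-zero return lets a systemd service surface a warning as a failed unit.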

## Example operational workflows

### Weekly maintenance review

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-
```

Review the kernel simulation, Docker usage, failed units, backup freshness, and
disk warnings before approving manual changes.

### Disk pressure investigation

```bash
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh
```

Use the evidence to identify ownership. Do not treat Docker pruning or file
deletion as a substitute for application-specific retention policy.

### Post-maintenance evidence

```bash
sudo /usr/local/sbin/ailab-healthcheck.sh \
  | sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'
```

## Interview talking points

- Why timer units explicitly carry the non-interactive execution boundary.
- Why APT dependency policy is safer than regex-based kernel deletion.
- How Docker volume preservation separates platform hygiene from application
  data lifecycle decisions.
- How optional dependency handling keeps one health command useful across
  container, GPU, and virtualization host variants.
- Why configuration backup and application-data backup are separate concerns.
- How exit codes, persistent timers, logs, and post-checks support operations.

## Future improvements

- Add a dedicated logrotate policy after measuring log growth.
- Export disk-watch status to a monitoring system instead of relying only on
  timer failure state.
- Add automated archive integrity checks and off-host replication.
- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl`
  commands.
- Add package-lock detection with bounded retry policy if recurring contention
  is observed.
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a
  dedicated read-only audit.