From 8cb92de06f1aed82cb70798a2365f558c9ffec31 Mon Sep 17 00:00:00 2001 From: Mateusz Suski Date: Sat, 6 Jun 2026 00:10:44 +0000 Subject: [PATCH] Add AI lab maintenance toolkit --- CHANGELOG.md | 1 + labs/README.md | 4 + labs/linux/ailab-maintenance/README.md | 308 ++++++++++++++++++ labs/linux/ailab-maintenance/install.sh | 103 ++++++ .../scripts/ailab-apt-cleanup.sh | 66 ++++ .../scripts/ailab-config-backup.sh | 90 +++++ .../scripts/ailab-disk-watch.sh | 38 +++ .../scripts/ailab-docker-cleanup.sh | 70 ++++ .../scripts/ailab-healthcheck.sh | 111 +++++++ .../scripts/ailab-kernel-cleanup.sh | 117 +++++++ .../scripts/ailab-vm-audit.sh | 42 +++ .../systemd/ailab-apt-cleanup.service | 8 + .../systemd/ailab-apt-cleanup.timer | 9 + .../systemd/ailab-config-backup.service | 6 + .../systemd/ailab-config-backup.timer | 9 + .../systemd/ailab-disk-watch.service | 6 + .../systemd/ailab-disk-watch.timer | 9 + .../systemd/ailab-docker-cleanup.service | 8 + .../systemd/ailab-docker-cleanup.timer | 9 + .../systemd/ailab-kernel-cleanup.service | 8 + .../systemd/ailab-kernel-cleanup.timer | 9 + 21 files changed, 1031 insertions(+) create mode 100644 labs/linux/ailab-maintenance/README.md create mode 100755 labs/linux/ailab-maintenance/install.sh create mode 100755 labs/linux/ailab-maintenance/scripts/ailab-apt-cleanup.sh create mode 100755 labs/linux/ailab-maintenance/scripts/ailab-config-backup.sh create mode 100755 labs/linux/ailab-maintenance/scripts/ailab-disk-watch.sh create mode 100755 labs/linux/ailab-maintenance/scripts/ailab-docker-cleanup.sh create mode 100755 labs/linux/ailab-maintenance/scripts/ailab-healthcheck.sh create mode 100755 labs/linux/ailab-maintenance/scripts/ailab-kernel-cleanup.sh create mode 100755 labs/linux/ailab-maintenance/scripts/ailab-vm-audit.sh create mode 100644 labs/linux/ailab-maintenance/systemd/ailab-apt-cleanup.service create mode 100644 labs/linux/ailab-maintenance/systemd/ailab-apt-cleanup.timer create mode 100644 
labs/linux/ailab-maintenance/systemd/ailab-config-backup.service create mode 100644 labs/linux/ailab-maintenance/systemd/ailab-config-backup.timer create mode 100644 labs/linux/ailab-maintenance/systemd/ailab-disk-watch.service create mode 100644 labs/linux/ailab-maintenance/systemd/ailab-disk-watch.timer create mode 100644 labs/linux/ailab-maintenance/systemd/ailab-docker-cleanup.service create mode 100644 labs/linux/ailab-maintenance/systemd/ailab-docker-cleanup.timer create mode 100644 labs/linux/ailab-maintenance/systemd/ailab-kernel-cleanup.service create mode 100644 labs/linux/ailab-maintenance/systemd/ailab-kernel-cleanup.timer diff --git a/CHANGELOG.md b/CHANGELOG.md index affeda1..73d0570 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,7 @@ ### Added +- Added AI Lab Maintenance Toolkit with systemd-based Linux maintenance automation. - Python tooling validation for operational scripts. - `incident-log-summary` for general incident log summarization. - `log-diff-checker` for pre-change and post-change log comparison. diff --git a/labs/README.md b/labs/README.md index b1a4734..81398e9 100644 --- a/labs/README.md +++ b/labs/README.md @@ -10,6 +10,10 @@ Current subdirectories are planning areas unless their own README documents a ru - `ci-cd` - `docker` +## Linux operations labs + +- [AI Lab Maintenance Toolkit](./linux/ailab-maintenance/) - Homelab-safe Linux maintenance automation for an Ubuntu AI infrastructure host, covering cleanup, health checks, config backup, Docker hygiene, kernel safety and systemd timers. + Lab content should document prerequisites, topology, validation, cleanup, and what remains untested. Do not present lab behavior as production-ready. Planned lab topics are tracked in [ROADMAP.md](../ROADMAP.md). For Codex-driven changes, use [AGENTS.md](../AGENTS.md) and the templates under [docs/codex](../docs/codex/). 
diff --git a/labs/linux/ailab-maintenance/README.md b/labs/linux/ailab-maintenance/README.md new file mode 100644 index 0000000..270146f --- /dev/null +++ b/labs/linux/ailab-maintenance/README.md @@ -0,0 +1,308 @@ +# AI Lab Maintenance Toolkit + +## Executive summary + +The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an +Ubuntu AI infrastructure host named `ailab`. It combines repeatable health +reporting, disk monitoring, conservative package cleanup, Docker hygiene, +configuration backup, and non-destructive VM inventory into a small toolkit +that is readable enough for review and guarded enough for homelab use. + +This is a portfolio and lab implementation, not evidence of production +certification. Review package policy, backup coverage, maintenance windows, and +application impact before deploying it to another host. + +## Problem solved + +AI lab hosts accumulate operating system packages, kernel packages, container +images, build cache, journals, and configuration changes while also carrying +stateful workloads. Manual maintenance is easy to defer and risky to perform +without evidence. This project provides scheduled, logged tasks with explicit +safety boundaries and separate read-only audit commands. + +## What this demonstrates + +- Bash strict mode, input validation, dependency checks, and operational exit + codes. +- Dry-run-first maintenance with explicit authorization for changes. +- systemd oneshot services and persistent calendar timers. +- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review. +- Docker cleanup that preserves volumes. +- Configuration-focused backups with bounded retention. +- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd. +- Idempotent installation and guarded JSON configuration updates. 
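The dependency-check and exit-code conventions listed above can be illustrated with a short sketch. This is a minimal example under stated assumptions rather than the toolkit's own code: `require_commands` is a hypothetical helper name, while the message text and the exit code `2` for a missing dependency follow the convention this README documents.

```bash
# Hypothetical helper: verify that every named command is available.
# Returns 2 (the documented code for a missing dependency) on the first miss.
require_commands() {
  local name
  for name in "$@"; do
    if ! command -v "$name" >/dev/null 2>&1; then
      printf 'CRITICAL: required command is missing: %s\n' "$name" >&2
      return 2
    fi
  done
  return 0
}

# Example: check two commands that exist on any POSIX host.
require_commands sh printf && printf 'OK: dependencies present\n'
```

Returning a status from a function, instead of calling `exit` inside it, leaves the caller free to decide whether a missing optional tool is fatal.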
+ +## Architecture and directory layout + +```text +ailab-maintenance/ +├── README.md +├── install.sh +├── scripts/ +│ ├── ailab-healthcheck.sh +│ ├── ailab-disk-watch.sh +│ ├── ailab-apt-cleanup.sh +│ ├── ailab-kernel-cleanup.sh +│ ├── ailab-docker-cleanup.sh +│ ├── ailab-config-backup.sh +│ └── ailab-vm-audit.sh +└── systemd/ + ├── ailab-apt-cleanup.service + ├── ailab-apt-cleanup.timer + ├── ailab-kernel-cleanup.service + ├── ailab-kernel-cleanup.timer + ├── ailab-docker-cleanup.service + ├── ailab-docker-cleanup.timer + ├── ailab-config-backup.service + ├── ailab-config-backup.timer + ├── ailab-disk-watch.service + └── ailab-disk-watch.timer +``` + +The installer deploys scripts to `/usr/local/sbin` and units to +`/etc/systemd/system`. Scripts run directly as root from systemd rather than +through an additional framework. + +## Maintenance tasks + +| Command | Purpose | Change behavior | +| --- | --- | --- | +| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only | +| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only | +| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default | +| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default | +| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default | +| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default | +| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only | + +## Safety model + +Change-capable scripts default to dry-run behavior. Manual execution requires +`--execute` and an interactive `EXECUTE` confirmation. The systemd services +use `--execute --non-interactive`; installing and enabling those reviewed unit +files is the explicit authorization for scheduled maintenance. + +Exit codes follow the repository convention: + +- `0`: completed successfully or an optional component was absent. 
+- `1`: an operational check or maintenance action failed. +- `2`: invalid input, missing required dependency, or insufficient privilege. + +The scripts do not bypass APT or Docker locks, delete VM resources, manually +select kernel names for removal, or hide command failures. + +## Installation + +Review every script and unit first. Installation changes package state, +journald settings, Docker daemon settings when Docker exists, and enabled timer +state. + +```bash +cd labs/linux/ailab-maintenance +sudo ./install.sh +``` + +The installer: + +1. Installs the documented Ubuntu utilities. +2. Deploys scripts and systemd units with fixed permissions. +3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`. +4. Restarts `systemd-journald`. +5. Validates and backs up an existing Docker `daemon.json`, merges log limits + with `jq`, and attempts a Docker restart. +6. Enables all five timers. +7. Writes an initial report to `/root/ailab-healthcheck-now.txt`. + +The installer is intended for Ubuntu 26.04. It is not run automatically by +repository validation. + +## Manual commands + +Read-only reports: + +```bash +sudo /usr/local/sbin/ailab-healthcheck.sh +sudo /usr/local/sbin/ailab-disk-watch.sh +sudo /usr/local/sbin/ailab-vm-audit.sh +``` + +Preview maintenance: + +```bash +sudo /usr/local/sbin/ailab-apt-cleanup.sh +sudo /usr/local/sbin/ailab-kernel-cleanup.sh +sudo /usr/local/sbin/ailab-docker-cleanup.sh +sudo /usr/local/sbin/ailab-config-backup.sh +``` + +Apply reviewed maintenance interactively: + +```bash +sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute +sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute +sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute +sudo /usr/local/sbin/ailab-config-backup.sh --execute +``` + +`--non-interactive` is reserved for reviewed automation and is rejected unless +`--execute` is also present. 
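The flag rules described above can be sketched as a small function. This is an illustration under assumptions, not the shipped implementation: `resolve_mode` is a hypothetical name, and the three mode strings it prints (`dry-run`, `interactive`, `unattended`) are labels chosen for this example. The rejection of `--non-interactive` without `--execute` and the exit code `2` for invalid input come from the documented behavior.

```bash
# Hypothetical helper: map command-line flags to a run mode.
# Prints dry-run, interactive, or unattended; returns 2 on invalid input.
resolve_mode() {
  local execute=false non_interactive=false
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --execute) execute=true ;;
      --non-interactive) non_interactive=true ;;
      *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; return 2 ;;
    esac
    shift
  done

  # Unattended runs are only valid when changes were explicitly requested.
  if [ "$non_interactive" = true ] && [ "$execute" != true ]; then
    printf 'CRITICAL: --non-interactive requires --execute\n' >&2
    return 2
  fi

  if [ "$execute" != true ]; then
    printf 'dry-run\n'
  elif [ "$non_interactive" = true ]; then
    printf 'unattended\n'
  else
    printf 'interactive\n'
  fi
}

# Example: the flag combination used by the systemd services.
resolve_mode --execute --non-interactive
```

Keeping dry-run as the result of passing no flags means a mistyped or forgotten option cannot cause a change by accident.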
+ +## Systemd timers + +| Timer | Schedule | +| --- | --- | +| `ailab-config-backup.timer` | Daily at 03:30 | +| `ailab-disk-watch.timer` | Hourly | +| `ailab-apt-cleanup.timer` | Sunday at 04:00 | +| `ailab-kernel-cleanup.timer` | Sunday at 04:20 | +| `ailab-docker-cleanup.timer` | Sunday at 04:40 | + +All timers use `Persistent=true`, so a missed event runs after the host becomes +available. Inspect timer and service evidence with: + +```bash +systemctl list-timers --all | grep ailab- +systemctl status ailab-config-backup.timer +journalctl -u ailab-kernel-cleanup.service +``` + +## Logs + +Scheduled and manual maintenance writes to: + +```text +/var/log/ailab-apt-cleanup.log +/var/log/ailab-kernel-cleanup.log +/var/log/ailab-docker-cleanup.log +/var/log/ailab-config-backup.log +/var/log/ailab-disk-watch.log +``` + +systemd also records service output in the journal. Logrotate is installed as a +dependency, but this lab does not create a custom rotation policy for these +small maintenance logs. + +## Docker policy + +Docker cleanup runs `docker system prune -af` and removes build cache older +than seven days. It never passes `--volumes`. Named and anonymous volumes +remain outside this automated policy and require application-aware review. + +The installer configures the `json-file` driver with a maximum size of `50m` +and five files. Existing valid JSON is backed up and merged. Invalid JSON +causes installation to stop rather than overwrite operator configuration. + +## Kernel policy + +Kernel removal is delegated to `apt autoremove --purge`; package names are not +constructed or purged with regular expressions. Before execution, the script +logs the APT simulation and refuses cleanup unless at least two installed +versioned kernel image packages remain after simulated removals. + +This protects a fallback kernel while preserving Ubuntu dependency policy. 
+Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings, +Secure Boot state, and the simulated removal set before manual execution. + +## Backup policy + +Backups are written to `/srv/backups/ailab-config` as +`ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are +deleted only after a new archive is created. + +The backup covers `/etc`, selected root shell configuration, +`/opt/ailab-maintenance` when present, and libvirt configuration under +`/var/lib/libvirt/qemu`. It does not include `/var/lib/docker`, WebODM data, +Ollama models, VM disk images, or other large application datasets. Because +`/etc` is included, the libvirt, Docker, and systemd configuration beneath it +is already covered and is not listed as a separate source path. + +This is a local configuration backup, not a disaster-recovery design. A real +deployment should copy archives to independently protected storage and test +restoration. + +## Journald policy + +The installer applies: + +```ini +[Journal] +SystemMaxUse=1G +SystemKeepFree=2G +MaxRetentionSec=14day +Compress=yes +``` + +These settings bound journal growth while retaining useful troubleshooting +evidence. Capacity and retention should be adjusted to the host's disk size +and incident-response requirements. + +## Disk watch policy + +The disk check uses `df -P -x tmpfs -x devtmpfs`, defaults to an 85 percent +threshold, and returns `1` when any checked filesystem meets or exceeds it. +Override the threshold for a manual or unit invocation with: + +```bash +sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh +``` + +The script reports every checked filesystem as `OK` or `WARNING`; it does not +delete data or attempt remediation. 
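The threshold comparison described above can be sketched for a single filesystem as follows. This is a simplified illustration, not the script itself: `classify_usage` is a hypothetical helper, and it takes the `Use%` field and the threshold as arguments instead of reading `df` output and `AILAB_DISK_THRESHOLD`. The 85 percent default, the 1 to 100 range check, and the meets-or-exceeds rule follow the policy above.

```bash
# Hypothetical helper: classify one usage value against a threshold.
# Returns 0 for OK, 1 for WARNING, 2 for an invalid threshold.
classify_usage() {
  local use_percent="$1" threshold="${2:-85}"
  local usage="${use_percent%\%}"   # df -P prints values such as "42%"

  # Reject empty or non-numeric thresholds, then enforce the 1-100 range.
  case "$threshold" in
    ''|*[!0-9]*) printf 'CRITICAL: threshold must be an integer from 1 to 100\n' >&2; return 2 ;;
  esac
  if [ "$threshold" -lt 1 ] || [ "$threshold" -gt 100 ]; then
    printf 'CRITICAL: threshold must be an integer from 1 to 100\n' >&2
    return 2
  fi

  # Treat an unparseable usage field as a warning rather than ignoring it.
  case "$usage" in
    ''|*[!0-9]*) printf 'WARNING\n'; return 1 ;;
  esac

  if [ "$usage" -ge "$threshold" ]; then
    printf 'WARNING\n'
    return 1
  fi
  printf 'OK\n'
}

# Example: 91 percent used against the default-style threshold of 85.
classify_usage "91%" 85 || true
```

Because the comparison is "greater than or equal", a filesystem sitting exactly at the threshold is reported as `WARNING`.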
+ +## Example operational workflows + +### Weekly maintenance review + +```bash +sudo /usr/local/sbin/ailab-healthcheck.sh +sudo /usr/local/sbin/ailab-kernel-cleanup.sh +sudo /usr/local/sbin/ailab-docker-cleanup.sh +systemctl list-timers --all | grep ailab- +``` + +Review the kernel simulation, Docker usage, failed units, backup freshness, and +disk warnings before approving manual changes. + +### Disk pressure investigation + +```bash +sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh +sudo docker system df +sudo journalctl --disk-usage +sudo /usr/local/sbin/ailab-vm-audit.sh +``` + +Use the evidence to identify ownership. Do not treat Docker pruning or file +deletion as a substitute for application-specific retention policy. + +### Post-maintenance evidence + +```bash +sudo /usr/local/sbin/ailab-healthcheck.sh \ + | sudo tee /root/ailab-healthcheck-after-maintenance.txt +journalctl --since today -u 'ailab-*.service' +``` + +## Interview talking points + +- Why timer units explicitly carry the non-interactive execution boundary. +- Why APT dependency policy is safer than regex-based kernel deletion. +- How Docker volume preservation separates platform hygiene from application + data lifecycle decisions. +- How optional dependency handling keeps one health command useful across + container, GPU, and virtualization host variants. +- Why configuration backup and application-data backup are separate concerns. +- How exit codes, persistent timers, logs, and post-checks support operations. + +## Future improvements + +- Add a dedicated logrotate policy after measuring log growth. +- Export disk-watch status to a monitoring system instead of relying only on + timer failure state. +- Add automated archive integrity checks and off-host replication. +- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl` + commands. +- Add package-lock detection with bounded retry policy if recurring contention + is observed. 
+- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a + dedicated read-only audit. diff --git a/labs/linux/ailab-maintenance/install.sh b/labs/linux/ailab-maintenance/install.sh new file mode 100755 index 0000000..574416c --- /dev/null +++ b/labs/linux/ailab-maintenance/install.sh @@ -0,0 +1,103 @@ +#!/usr/bin/env bash +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)" +JOURNALD_DROP_IN="/etc/systemd/journald.conf.d/ailab-limits.conf" +DOCKER_CONFIG="/etc/docker/daemon.json" +packages=( + logrotate + needrestart + smartmontools + nvme-cli + sysstat + iotop + ncdu + duf + jq + lsof + psmisc + tar + gzip +) +timers=( + ailab-apt-cleanup.timer + ailab-kernel-cleanup.timer + ailab-docker-cleanup.timer + ailab-config-backup.timer + ailab-disk-watch.timer +) + +if ((EUID != 0)); then + printf 'CRITICAL: install.sh must run as root\n' >&2 + exit 2 +fi +for command_name in apt-get install systemctl; do + if ! command -v "$command_name" >/dev/null 2>&1; then + printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2 + exit 2 + fi +done + +printf 'Installing maintenance dependencies...\n' +apt-get update +DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}" + +printf 'Installing scripts and systemd units...\n' +for script in "$SCRIPT_DIR"/scripts/*.sh; do + install -m 0755 "$script" "/usr/local/sbin/$(basename "$script")" +done +for unit in "$SCRIPT_DIR"/systemd/*.{service,timer}; do + install -m 0644 "$unit" "/etc/systemd/system/$(basename "$unit")" +done + +install -d -m 0755 "$(dirname "$JOURNALD_DROP_IN")" +tmp_journald="$(mktemp)" +trap 'rm -f "$tmp_journald" "${tmp_docker:-}"' EXIT +cat >"$tmp_journald" <<'EOF' +[Journal] +SystemMaxUse=1G +SystemKeepFree=2G +MaxRetentionSec=14day +Compress=yes +EOF +install -m 0644 "$tmp_journald" "$JOURNALD_DROP_IN" +systemctl restart systemd-journald + +if command -v docker >/dev/null 2>&1; then + printf 'Configuring Docker log rotation limits...\n' + 
install -d -m 0755 /etc/docker + tmp_docker="$(mktemp)" + + if [[ -f "$DOCKER_CONFIG" ]]; then + if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then + printf 'CRITICAL: %s is not valid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2 + exit 1 + fi + backup="$DOCKER_CONFIG.$(date '+%Y%m%d-%H%M%S').bak" + install -m 0644 "$DOCKER_CONFIG" "$backup" + jq '. + { + "log-driver": "json-file", + "log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"}) + }' "$DOCKER_CONFIG" >"$tmp_docker" + else + jq -n '{ + "log-driver": "json-file", + "log-opts": {"max-size": "50m", "max-file": "5"} + }' >"$tmp_docker" + fi + + jq empty "$tmp_docker" + install -m 0644 "$tmp_docker" "$DOCKER_CONFIG" + systemctl restart docker || true +else + printf 'INFO: Docker is not installed; Docker daemon configuration was skipped\n' +fi + +systemctl daemon-reload +systemctl enable --now "${timers[@]}" + +printf '\nEnabled AI Lab timers:\n' +systemctl list-timers --all --no-pager | grep 'ailab-' || true + +/usr/local/sbin/ailab-healthcheck.sh > /root/ailab-healthcheck-now.txt +printf '\nOK: installation complete; initial health report: /root/ailab-healthcheck-now.txt\n' diff --git a/labs/linux/ailab-maintenance/scripts/ailab-apt-cleanup.sh b/labs/linux/ailab-maintenance/scripts/ailab-apt-cleanup.sh new file mode 100755 index 0000000..a4b5212 --- /dev/null +++ b/labs/linux/ailab-maintenance/scripts/ailab-apt-cleanup.sh @@ -0,0 +1,66 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +LOG_FILE="/var/log/ailab-apt-cleanup.log" +execute=false +non_interactive=false + +usage() { + printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")" +} + +while (($# > 0)); do + case "$1" in + --execute) execute=true ;; + --non-interactive) non_interactive=true ;; + -h|--help) usage; exit 0 ;; + *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;; + esac + shift +done + +if [[ "$non_interactive" == true && "$execute" != true ]]; then + 
printf 'CRITICAL: --non-interactive requires --execute\n' >&2 + exit 2 +fi +if ((EUID != 0)); then + printf 'CRITICAL: this script must run as root\n' >&2 + exit 2 +fi +if ! command -v apt >/dev/null 2>&1; then + printf 'CRITICAL: apt is required\n' >&2 + exit 2 +fi + +exec > >(tee -a "$LOG_FILE") 2>&1 +printf '\n[%s] APT cleanup\n' "$(date --iso-8601=seconds)" + +if [[ "$execute" != true ]]; then + printf 'INFO: dry-run mode; apt update, autoremove, autoclean, and needrestart are not executed\n' + printf 'INFO: simulated autoremove follows\n' + LC_ALL=C apt -s autoremove --purge + printf 'INFO: rerun with --execute and confirm to apply changes\n' + exit 0 +fi + +if [[ "$non_interactive" != true ]]; then + printf 'WARNING: this will update APT metadata and remove packages marked as automatically installed and unused.\n' + printf 'Type EXECUTE to continue: ' + read -r confirmation + if [[ "$confirmation" != "EXECUTE" ]]; then + printf 'CRITICAL: confirmation failed; no changes made\n' + exit 2 + fi +fi + +apt update +apt autoremove --purge -y +apt autoclean -y +if command -v needrestart >/dev/null 2>&1; then + needrestart -b || true +else + printf 'WARNING: needrestart is not installed\n' +fi +printf 'OK: APT cleanup completed\n' diff --git a/labs/linux/ailab-maintenance/scripts/ailab-config-backup.sh b/labs/linux/ailab-maintenance/scripts/ailab-config-backup.sh new file mode 100755 index 0000000..97e8392 --- /dev/null +++ b/labs/linux/ailab-maintenance/scripts/ailab-config-backup.sh @@ -0,0 +1,90 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +LOG_FILE="/var/log/ailab-config-backup.log" +BACKUP_DIR="/srv/backups/ailab-config" +RETENTION_DAYS=30 +execute=false +non_interactive=false + +usage() { + printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")" +} + +while (($# > 0)); do + case "$1" in + --execute) execute=true ;; + --non-interactive) non_interactive=true ;; + -h|--help) usage; exit 0 ;; + *) printf 'CRITICAL: 
unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;; + esac + shift +done + +if [[ "$non_interactive" == true && "$execute" != true ]]; then + printf 'CRITICAL: --non-interactive requires --execute\n' >&2 + exit 2 +fi +if ((EUID != 0)); then + printf 'CRITICAL: this script must run as root\n' >&2 + exit 2 +fi +for command_name in tar gzip find; do + if ! command -v "$command_name" >/dev/null 2>&1; then + printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2 + exit 2 + fi +done + +exec > >(tee -a "$LOG_FILE") 2>&1 +timestamp="$(date '+%Y%m%d-%H%M%S')" +archive="$BACKUP_DIR/ailab-config-$timestamp.tar.gz" +candidate_paths=( + /etc + /root/.bashrc + /root/.bashrc.d + /opt/ailab-maintenance + /var/lib/libvirt/qemu +) +source_paths=() + +printf '\n[%s] Configuration backup\n' "$(date --iso-8601=seconds)" +for path in "${candidate_paths[@]}"; do + if [[ -e "$path" ]]; then + source_paths+=("${path#/}") + printf 'OK: include %s\n' "$path" + else + printf 'INFO: optional path is absent: %s\n' "$path" + fi +done + +if ((${#source_paths[@]} == 0)); then + printf 'CRITICAL: no backup source paths are present\n' + exit 1 +fi + +printf 'Backup destination: %s\n' "$archive" +printf 'Retention: matching archives older than %d days\n' "$RETENTION_DAYS" +printf 'Configuration beneath /etc includes libvirt, Docker, and systemd when present\n' +printf 'Excluded by policy: Docker data, application data, model data, and VM disk images\n' + +if [[ "$execute" != true ]]; then + printf 'INFO: dry-run mode; no archive or directory was created and no retention deletion ran\n' + exit 0 +fi + +if [[ "$non_interactive" != true ]]; then + printf 'Type EXECUTE to create the archive and apply retention: ' + read -r confirmation + if [[ "$confirmation" != "EXECUTE" ]]; then + printf 'CRITICAL: confirmation failed; no changes made\n' + exit 2 + fi +fi + +install -d -m 0750 "$BACKUP_DIR" +tar --create --gzip --file "$archive" --ignore-failed-read --directory / -- 
"${source_paths[@]}" +find "$BACKUP_DIR" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' -mtime "+$RETENTION_DAYS" -print -delete +printf 'OK: configuration backup created: %s\n' "$archive" diff --git a/labs/linux/ailab-maintenance/scripts/ailab-disk-watch.sh b/labs/linux/ailab-maintenance/scripts/ailab-disk-watch.sh new file mode 100755 index 0000000..42d0b96 --- /dev/null +++ b/labs/linux/ailab-maintenance/scripts/ailab-disk-watch.sh @@ -0,0 +1,38 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +LOG_FILE="/var/log/ailab-disk-watch.log" +threshold="${AILAB_DISK_THRESHOLD:-85}" + +if ((EUID != 0)); then + printf 'CRITICAL: this script must run as root to write %s\n' "$LOG_FILE" >&2 + exit 2 +fi + +if [[ ! "$threshold" =~ ^[0-9]+$ ]] || ((threshold < 1 || threshold > 100)); then + printf 'CRITICAL: AILAB_DISK_THRESHOLD must be an integer from 1 to 100\n' >&2 + exit 2 +fi + +exec > >(tee -a "$LOG_FILE") 2>&1 +printf '\n[%s] Disk usage check; threshold=%s%%\n' "$(date --iso-8601=seconds)" "$threshold" + +status=0 +while read -r filesystem _blocks _used available use_percent mountpoint; do + usage="${use_percent%\%}" + + if [[ ! 
"$usage" =~ ^[0-9]+$ ]]; then + printf 'WARNING: unable to parse usage for %s mounted on %s\n' "$filesystem" "$mountpoint" + status=1 + elif ((usage >= threshold)); then + printf 'WARNING: %s mounted on %s is %s used; threshold=%s%%; available=%s KB\n' \ + "$filesystem" "$mountpoint" "$use_percent" "$threshold" "$available" + status=1 + else + printf 'OK: %s mounted on %s is %s used\n' "$filesystem" "$mountpoint" "$use_percent" + fi +done < <(df -P -x tmpfs -x devtmpfs | awk 'NR > 1 {print $1, $2, $3, $4, $5, $6}') + +exit "$status" diff --git a/labs/linux/ailab-maintenance/scripts/ailab-docker-cleanup.sh b/labs/linux/ailab-maintenance/scripts/ailab-docker-cleanup.sh new file mode 100755 index 0000000..245c40f --- /dev/null +++ b/labs/linux/ailab-maintenance/scripts/ailab-docker-cleanup.sh @@ -0,0 +1,70 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +LOG_FILE="/var/log/ailab-docker-cleanup.log" +execute=false +non_interactive=false + +usage() { + printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")" +} + +while (($# > 0)); do + case "$1" in + --execute) execute=true ;; + --non-interactive) non_interactive=true ;; + -h|--help) usage; exit 0 ;; + *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;; + esac + shift +done + +if [[ "$non_interactive" == true && "$execute" != true ]]; then + printf 'CRITICAL: --non-interactive requires --execute\n' >&2 + exit 2 +fi +if ((EUID != 0)); then + printf 'CRITICAL: this script must run as root\n' >&2 + exit 2 +fi + +exec > >(tee -a "$LOG_FILE") 2>&1 +printf '\n[%s] Docker cleanup\n' "$(date --iso-8601=seconds)" + +if ! command -v docker >/dev/null 2>&1; then + printf 'INFO: Docker is not installed; nothing to do\n' + exit 0 +fi +if command -v systemctl >/dev/null 2>&1 && ! 
systemctl is-active --quiet docker; then + printf 'INFO: docker.service is inactive; nothing to do\n' + exit 0 +fi + +printf '\nDocker disk usage before cleanup:\n' +docker system df + +if [[ "$execute" != true ]]; then + printf 'INFO: dry-run mode; would run docker system prune -af\n' + printf 'INFO: dry-run mode; would run docker builder prune -af --filter until=168h\n' + printf 'INFO: Docker volumes are never included in this cleanup\n' + exit 0 +fi + +if [[ "$non_interactive" != true ]]; then + printf 'WARNING: this removes unused containers, networks, images, and old build cache, but not volumes.\n' + printf 'Type EXECUTE to continue: ' + read -r confirmation + if [[ "$confirmation" != "EXECUTE" ]]; then + printf 'CRITICAL: confirmation failed; no changes made\n' + exit 2 + fi +fi + +docker system prune -af +docker builder prune -af --filter "until=168h" + +printf '\nDocker disk usage after cleanup:\n' +docker system df +printf 'OK: Docker cleanup completed; volumes were not pruned\n' diff --git a/labs/linux/ailab-maintenance/scripts/ailab-healthcheck.sh b/labs/linux/ailab-maintenance/scripts/ailab-healthcheck.sh new file mode 100755 index 0000000..a57973b --- /dev/null +++ b/labs/linux/ailab-maintenance/scripts/ailab-healthcheck.sh @@ -0,0 +1,111 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +section() { + printf '\n== %s ==\n' "$1" +} + +run_optional() { + local description="$1" + shift + + if "$@"; then + return 0 + fi + + printf 'WARNING: %s failed\n' "$description" + return 0 +} + +section "Host identity" +if command -v hostnamectl >/dev/null 2>&1; then + run_optional "hostnamectl" hostnamectl +else + run_optional "hostname" hostname +fi +run_optional "kernel information" uname -a +run_optional "uptime" uptime + +section "Memory" +if command -v free >/dev/null 2>&1; then + run_optional "memory report" free -h +else + printf 'WARNING: free is not available\n' +fi + +section "Filesystems" +if command -v df >/dev/null 2>&1; then 
+ run_optional "filesystem report" df -hT + printf '\nKey mountpoints present:\n' + for mountpoint in / /boot /var /srv /opt /home; do + if findmnt -rn --target "$mountpoint" >/dev/null 2>&1; then + run_optional "filesystem report for $mountpoint" df -hT "$mountpoint" + fi + done +else + printf 'WARNING: df is not available\n' +fi + +section "Journal usage" +if command -v journalctl >/dev/null 2>&1; then + run_optional "journal disk usage" journalctl --disk-usage +else + printf 'WARNING: journalctl is not available\n' +fi + +section "Docker" +if command -v docker >/dev/null 2>&1; then + if command -v systemctl >/dev/null 2>&1; then + run_optional "Docker service state" systemctl is-active docker + fi + run_optional "Docker container list" docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}' + run_optional "Docker disk usage" docker system df +else + printf 'INFO: Docker is not installed\n' +fi + +section "Libvirt" +if command -v virsh >/dev/null 2>&1; then + if command -v systemctl >/dev/null 2>&1; then + run_optional "libvirtd service state" systemctl is-active libvirtd + fi + run_optional "libvirt guest list" virsh list --all +else + printf 'INFO: virsh is not installed\n' +fi + +section "NVIDIA" +if command -v nvidia-smi >/dev/null 2>&1; then + run_optional "NVIDIA status" nvidia-smi +else + printf 'INFO: nvidia-smi is not installed\n' +fi + +section "Failed systemd units" +if command -v systemctl >/dev/null 2>&1; then + run_optional "failed systemd unit report" systemctl --failed --no-pager +else + printf 'WARNING: systemctl is not available\n' +fi + +section "SMART quick health" +if command -v smartctl >/dev/null 2>&1; then + shopt -s nullglob + devices=(/dev/sd? /dev/nvme?n?) 
+ shopt -u nullglob + + if ((${#devices[@]} == 0)); then + printf 'INFO: no matching SATA/SCSI or NVMe devices found\n' + else + for device in "${devices[@]}"; do + printf '\n-- %s --\n' "$device" + run_optional "SMART health check for $device" smartctl -H "$device" + done + fi +else + printf 'INFO: smartctl is not installed\n' +fi + +exit 0 diff --git a/labs/linux/ailab-maintenance/scripts/ailab-kernel-cleanup.sh b/labs/linux/ailab-maintenance/scripts/ailab-kernel-cleanup.sh new file mode 100755 index 0000000..3fd54a6 --- /dev/null +++ b/labs/linux/ailab-maintenance/scripts/ailab-kernel-cleanup.sh @@ -0,0 +1,117 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +# APT autoremove respects package dependencies and kernel protection rules. That +# is safer than name-based purging on HWE hosts using NVIDIA, DKMS, or VFIO. + +LOG_FILE="/var/log/ailab-kernel-cleanup.log" +execute=false +non_interactive=false + +usage() { + printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")" +} + +kernel_packages() { + dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' \ + 'linux-image*' 'linux-headers*' 'linux-modules*' 2>/dev/null \ + | awk '$1 ~ /^ii/ {print $2}' \ + | sort -u || true +} + +versioned_kernel_images() { + dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' 'linux-image-[0-9]*' 2>/dev/null \ + | awk '$1 ~ /^ii/ {sub(/:.*/, "", $2); print $2}' \ + | sort -u || true +} + +while (($# > 0)); do + case "$1" in + --execute) execute=true ;; + --non-interactive) non_interactive=true ;; + -h|--help) usage; exit 0 ;; + *) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;; + esac + shift +done + +if [[ "$non_interactive" == true && "$execute" != true ]]; then + printf 'CRITICAL: --non-interactive requires --execute\n' >&2 + exit 2 +fi +if ((EUID != 0)); then + printf 'CRITICAL: this script must run as root\n' >&2 + exit 2 +fi +for command_name in apt dpkg-query uname; do + if ! 
command -v "$command_name" >/dev/null 2>&1; then + printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2 + exit 2 + fi +done + +exec > >(tee -a "$LOG_FILE") 2>&1 +printf '\n[%s] Kernel cleanup\n' "$(date --iso-8601=seconds)" +printf 'Running kernel: %s\n' "$(uname -r)" +printf '\nInstalled kernel-related packages before cleanup:\n' +kernel_packages + +simulation="$(LC_ALL=C apt -s autoremove --purge)" +printf '\nAPT autoremove simulation:\n%s\n' "$simulation" + +mapfile -t installed_images < <(versioned_kernel_images) +mapfile -t removed_images < <( + awk '($1 == "Remv" || $1 == "Purg") && $2 ~ /^linux-image-[0-9]/ {sub(/:.*/, "", $2); print $2}' <<<"$simulation" | sort -u +) + +remaining_images=0 +for image in "${installed_images[@]}"; do + remove_image=false + for removed in "${removed_images[@]}"; do + if [[ "$image" == "$removed" ]]; then + remove_image=true + break + fi + done + if [[ "$remove_image" != true ]]; then + remaining_images=$((remaining_images + 1)) + fi +done + +printf 'Kernel image safety check: installed=%d simulated-removals=%d remaining=%d\n' \ + "${#installed_images[@]}" "${#removed_images[@]}" "$remaining_images" + +if ((${#installed_images[@]} < 2 || remaining_images < 2)); then + printf 'CRITICAL: cleanup would not leave at least two versioned kernel images; refusing execution\n' + exit 1 +fi + +if [[ "$execute" != true ]]; then + printf 'INFO: dry-run mode; no packages were removed\n' + printf 'INFO: rerun with --execute and confirm to apply the simulated cleanup\n' + exit 0 +fi + +if [[ "$non_interactive" != true ]]; then + printf 'WARNING: APT will remove the packages shown in the simulation above.\n' + printf 'Type EXECUTE to continue: ' + read -r confirmation + if [[ "$confirmation" != "EXECUTE" ]]; then + printf 'CRITICAL: confirmation failed; no changes made\n' + exit 2 + fi +fi + +apt autoremove --purge -y +apt autoclean -y +if command -v update-grub >/dev/null 2>&1; then + update-grub || true +else + printf 'WARNING: update-grub 
is not installed\n' +fi + +printf '\nInstalled kernel-related packages after cleanup:\n' +kernel_packages +printf 'OK: kernel cleanup completed with APT-managed package selection\n' diff --git a/labs/linux/ailab-maintenance/scripts/ailab-vm-audit.sh b/labs/linux/ailab-maintenance/scripts/ailab-vm-audit.sh new file mode 100755 index 0000000..a68def2 --- /dev/null +++ b/labs/linux/ailab-maintenance/scripts/ailab-vm-audit.sh @@ -0,0 +1,42 @@ +#!/usr/bin/env bash +set -o errexit +set -o nounset +set -o pipefail + +section() { + printf '\n== %s ==\n' "$1" +} + +if ! command -v virsh >/dev/null 2>&1; then + printf 'INFO: virsh is not installed; VM audit skipped\n' + exit 0 +fi + +section "Virtual machines" +virsh list --all || printf 'WARNING: unable to list virtual machines\n' + +section "Storage pools" +virsh pool-list --all || printf 'WARNING: unable to list storage pools\n' + +mapfile -t pools < <(virsh pool-list --all --name 2>/dev/null | sed '/^[[:space:]]*$/d' || true) +for pool in "${pools[@]}"; do + section "Volumes in pool: $pool" + virsh vol-list "$pool" || printf 'WARNING: unable to list volumes in pool %s\n' "$pool" +done + +section "Possible VM disk and installation images" +search_roots=() +for path in /var/lib/libvirt /srv /opt; do + [[ -d "$path" ]] && search_roots+=("$path") +done + +if ((${#search_roots[@]} == 0)); then + printf 'INFO: no configured search roots are present\n' +else + find "${search_roots[@]}" -xdev -type f \ + \( -iname '*.qcow2' -o -iname '*.raw' -o -iname '*.iso' \) \ + -printf '%12s bytes %p\n' 2>/dev/null \ + | sort -nr || true +fi + +printf '\nINFO: audit complete; no files or libvirt resources were modified\n' diff --git a/labs/linux/ailab-maintenance/systemd/ailab-apt-cleanup.service b/labs/linux/ailab-maintenance/systemd/ailab-apt-cleanup.service new file mode 100644 index 0000000..cb18b9b --- /dev/null +++ b/labs/linux/ailab-maintenance/systemd/ailab-apt-cleanup.service @@ -0,0 +1,8 @@ +[Unit] +Description=AI Lab safe APT 
cleanup +After=network-online.target +Wants=network-online.target + +[Service] +Type=oneshot +ExecStart=/usr/local/sbin/ailab-apt-cleanup.sh --execute --non-interactive diff --git a/labs/linux/ailab-maintenance/systemd/ailab-apt-cleanup.timer b/labs/linux/ailab-maintenance/systemd/ailab-apt-cleanup.timer new file mode 100644 index 0000000..3109c66 --- /dev/null +++ b/labs/linux/ailab-maintenance/systemd/ailab-apt-cleanup.timer @@ -0,0 +1,9 @@ +[Unit] +Description=Run AI Lab APT cleanup weekly + +[Timer] +OnCalendar=Sun *-*-* 04:00:00 +Persistent=true + +[Install] +WantedBy=timers.target diff --git a/labs/linux/ailab-maintenance/systemd/ailab-config-backup.service b/labs/linux/ailab-maintenance/systemd/ailab-config-backup.service new file mode 100644 index 0000000..b8c010e --- /dev/null +++ b/labs/linux/ailab-maintenance/systemd/ailab-config-backup.service @@ -0,0 +1,6 @@ +[Unit] +Description=AI Lab configuration backup + +[Service] +Type=oneshot +ExecStart=/usr/local/sbin/ailab-config-backup.sh --execute --non-interactive diff --git a/labs/linux/ailab-maintenance/systemd/ailab-config-backup.timer b/labs/linux/ailab-maintenance/systemd/ailab-config-backup.timer new file mode 100644 index 0000000..1429da9 --- /dev/null +++ b/labs/linux/ailab-maintenance/systemd/ailab-config-backup.timer @@ -0,0 +1,9 @@ +[Unit] +Description=Run AI Lab configuration backup daily + +[Timer] +OnCalendar=*-*-* 03:30:00 +Persistent=true + +[Install] +WantedBy=timers.target diff --git a/labs/linux/ailab-maintenance/systemd/ailab-disk-watch.service b/labs/linux/ailab-maintenance/systemd/ailab-disk-watch.service new file mode 100644 index 0000000..73a78bd --- /dev/null +++ b/labs/linux/ailab-maintenance/systemd/ailab-disk-watch.service @@ -0,0 +1,6 @@ +[Unit] +Description=AI Lab disk usage check + +[Service] +Type=oneshot +ExecStart=/usr/local/sbin/ailab-disk-watch.sh diff --git a/labs/linux/ailab-maintenance/systemd/ailab-disk-watch.timer 
b/labs/linux/ailab-maintenance/systemd/ailab-disk-watch.timer new file mode 100644 index 0000000..5871f19 --- /dev/null +++ b/labs/linux/ailab-maintenance/systemd/ailab-disk-watch.timer @@ -0,0 +1,9 @@ +[Unit] +Description=Run AI Lab disk usage check hourly + +[Timer] +OnCalendar=hourly +Persistent=true + +[Install] +WantedBy=timers.target diff --git a/labs/linux/ailab-maintenance/systemd/ailab-docker-cleanup.service b/labs/linux/ailab-maintenance/systemd/ailab-docker-cleanup.service new file mode 100644 index 0000000..cd597c5 --- /dev/null +++ b/labs/linux/ailab-maintenance/systemd/ailab-docker-cleanup.service @@ -0,0 +1,8 @@ +[Unit] +Description=AI Lab safe Docker cleanup +Requires=docker.service +After=docker.service + +[Service] +Type=oneshot +ExecStart=/usr/local/sbin/ailab-docker-cleanup.sh --execute --non-interactive diff --git a/labs/linux/ailab-maintenance/systemd/ailab-docker-cleanup.timer b/labs/linux/ailab-maintenance/systemd/ailab-docker-cleanup.timer new file mode 100644 index 0000000..5a7e569 --- /dev/null +++ b/labs/linux/ailab-maintenance/systemd/ailab-docker-cleanup.timer @@ -0,0 +1,9 @@ +[Unit] +Description=Run AI Lab Docker cleanup weekly + +[Timer] +OnCalendar=Sun *-*-* 04:40:00 +Persistent=true + +[Install] +WantedBy=timers.target diff --git a/labs/linux/ailab-maintenance/systemd/ailab-kernel-cleanup.service b/labs/linux/ailab-maintenance/systemd/ailab-kernel-cleanup.service new file mode 100644 index 0000000..c186d56 --- /dev/null +++ b/labs/linux/ailab-maintenance/systemd/ailab-kernel-cleanup.service @@ -0,0 +1,8 @@ +[Unit] +Description=AI Lab safe kernel cleanup +After=network-online.target ailab-apt-cleanup.service +Wants=network-online.target + +[Service] +Type=oneshot +ExecStart=/usr/local/sbin/ailab-kernel-cleanup.sh --execute --non-interactive diff --git a/labs/linux/ailab-maintenance/systemd/ailab-kernel-cleanup.timer b/labs/linux/ailab-maintenance/systemd/ailab-kernel-cleanup.timer new file mode 100644 index 0000000..385f455 --- 
/dev/null +++ b/labs/linux/ailab-maintenance/systemd/ailab-kernel-cleanup.timer @@ -0,0 +1,9 @@ +[Unit] +Description=Run AI Lab kernel cleanup weekly + +[Timer] +OnCalendar=Sun *-*-* 04:20:00 +Persistent=true + +[Install] +WantedBy=timers.target