Add AI lab maintenance toolkit
lint / shell-yaml-ansible (push) Failing after 17s

This commit is contained in:
Mateusz Suski
2026-06-06 00:10:44 +00:00
parent 1843796e92
commit 8cb92de06f
21 changed files with 1031 additions and 0 deletions
+4
View File
@@ -10,6 +10,10 @@ Current subdirectories are planning areas unless their own README documents a ru
- `ci-cd`
- `docker`
## Linux operations labs
- [AI Lab Maintenance Toolkit](./linux/ailab-maintenance/) - Homelab-safe Linux maintenance automation for an Ubuntu AI infrastructure host, covering cleanup, health checks, config backup, Docker hygiene, kernel safety and systemd timers.
Lab content should document prerequisites, topology, validation, cleanup, and what remains untested. Do not present lab behavior as production-ready.
Planned lab topics are tracked in [ROADMAP.md](../ROADMAP.md). For Codex-driven changes, use [AGENTS.md](../AGENTS.md) and the templates under [docs/codex](../docs/codex/).
+308
View File
@@ -0,0 +1,308 @@
# AI Lab Maintenance Toolkit
## Executive summary
The AI Lab Maintenance Toolkit is a Bash and systemd operations lab for an
Ubuntu AI infrastructure host named `ailab`. It combines repeatable health
reporting, disk monitoring, conservative package cleanup, Docker hygiene,
configuration backup, and non-destructive VM inventory into a small toolkit
that is readable enough for review and guarded enough for homelab use.
This is a portfolio and lab implementation, not evidence of production
certification. Review package policy, backup coverage, maintenance windows, and
application impact before deploying it to another host.
## Problem solved
AI lab hosts accumulate operating system packages, kernel packages, container
images, build cache, journals, and configuration changes while also carrying
stateful workloads. Manual maintenance is easy to defer and risky to perform
without evidence. This project provides scheduled, logged tasks with explicit
safety boundaries and separate read-only audit commands.
## What this demonstrates
- Bash strict mode, input validation, dependency checks, and operational exit
codes.
- Dry-run-first maintenance with explicit authorization for changes.
- systemd oneshot services and persistent calendar timers.
- APT-managed kernel cleanup suitable for HWE, NVIDIA, DKMS, and VFIO review.
- Docker cleanup that preserves volumes.
- Configuration-focused backups with bounded retention.
- Optional discovery for Docker, libvirt, NVIDIA, SMART, and systemd.
- Idempotent installation and guarded JSON configuration updates.
## Architecture and directory layout
```text
ailab-maintenance/
├── README.md
├── install.sh
├── scripts/
│ ├── ailab-healthcheck.sh
│ ├── ailab-disk-watch.sh
│ ├── ailab-apt-cleanup.sh
│ ├── ailab-kernel-cleanup.sh
│ ├── ailab-docker-cleanup.sh
│ ├── ailab-config-backup.sh
│ └── ailab-vm-audit.sh
└── systemd/
├── ailab-apt-cleanup.service
├── ailab-apt-cleanup.timer
├── ailab-kernel-cleanup.service
├── ailab-kernel-cleanup.timer
├── ailab-docker-cleanup.service
├── ailab-docker-cleanup.timer
├── ailab-config-backup.service
├── ailab-config-backup.timer
├── ailab-disk-watch.service
└── ailab-disk-watch.timer
```
The installer deploys scripts to `/usr/local/sbin` and units to
`/etc/systemd/system`. Scripts run directly as root from systemd rather than
through an additional framework.
## Maintenance tasks
| Command | Purpose | Change behavior |
| --- | --- | --- |
| `ailab-healthcheck.sh` | Host, storage, service, container, VM, GPU, and SMART report | Read-only |
| `ailab-disk-watch.sh` | Filesystem threshold check | Read-only |
| `ailab-apt-cleanup.sh` | APT metadata refresh and unused package cleanup | Dry-run by default |
| `ailab-kernel-cleanup.sh` | APT-managed kernel package cleanup | Dry-run by default |
| `ailab-docker-cleanup.sh` | Unused Docker object and build-cache cleanup | Dry-run by default |
| `ailab-config-backup.sh` | Configuration archive and retention | Dry-run by default |
| `ailab-vm-audit.sh` | VM, pool, volume, and image-file inventory | Read-only |
## Safety model
Change-capable scripts default to dry-run behavior. Manual execution requires
`--execute` and an interactive `EXECUTE` confirmation. The systemd services
use `--execute --non-interactive`; installing and enabling those reviewed unit
files is the explicit authorization for scheduled maintenance.
Exit codes follow the repository convention:
- `0`: completed successfully or an optional component was absent.
- `1`: an operational check or maintenance action failed.
- `2`: invalid input, missing required dependency, or insufficient privilege.
The scripts do not bypass APT or Docker locks, delete VM resources, manually
select kernel names for removal, or hide command failures.
## Installation
Review every script and unit first. Installation changes package state,
journald settings, Docker daemon settings when Docker exists, and enabled timer
state.
```bash
cd labs/linux/ailab-maintenance
sudo ./install.sh
```
The installer:
1. Installs the documented Ubuntu utilities.
2. Deploys scripts and systemd units with fixed permissions.
3. Writes `/etc/systemd/journald.conf.d/ailab-limits.conf`.
4. Restarts `systemd-journald`.
5. Validates and backs up an existing Docker `daemon.json`, merges log limits
with `jq`, and attempts a Docker restart.
6. Enables all five timers.
7. Writes an initial report to `/root/ailab-healthcheck-now.txt`.
The installer is intended for Ubuntu 26.04. It is not run automatically by
repository validation.
## Manual commands
Read-only reports:
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-disk-watch.sh
sudo /usr/local/sbin/ailab-vm-audit.sh
```
Preview maintenance:
```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
sudo /usr/local/sbin/ailab-config-backup.sh
```
Apply reviewed maintenance interactively:
```bash
sudo /usr/local/sbin/ailab-apt-cleanup.sh --execute
sudo /usr/local/sbin/ailab-kernel-cleanup.sh --execute
sudo /usr/local/sbin/ailab-docker-cleanup.sh --execute
sudo /usr/local/sbin/ailab-config-backup.sh --execute
```
`--non-interactive` is reserved for reviewed automation and is rejected unless
`--execute` is also present.
## Systemd timers
| Timer | Schedule |
| --- | --- |
| `ailab-config-backup.timer` | Daily at 03:30 |
| `ailab-disk-watch.timer` | Hourly |
| `ailab-apt-cleanup.timer` | Sunday at 04:00 |
| `ailab-kernel-cleanup.timer` | Sunday at 04:20 |
| `ailab-docker-cleanup.timer` | Sunday at 04:40 |
All timers use `Persistent=true`, so a missed event runs after the host becomes
available. Inspect timer and service evidence with:
```bash
systemctl list-timers --all | grep ailab-
systemctl status ailab-config-backup.timer
journalctl -u ailab-kernel-cleanup.service
```
## Logs
Scheduled and manual maintenance writes to:
```text
/var/log/ailab-apt-cleanup.log
/var/log/ailab-kernel-cleanup.log
/var/log/ailab-docker-cleanup.log
/var/log/ailab-config-backup.log
/var/log/ailab-disk-watch.log
```
systemd also records service output in the journal. Logrotate is installed as a
dependency, but this lab does not create a custom rotation policy for these
small maintenance logs.
## Docker policy
Docker cleanup runs `docker system prune -af` and removes build cache older
than seven days. It never passes `--volumes`. Named and anonymous volumes
remain outside this automated policy and require application-aware review.
The installer configures the `json-file` driver with a maximum size of `50m`
and five files. Existing valid JSON is backed up and merged. Invalid JSON
causes installation to stop rather than overwrite operator configuration.
## Kernel policy
Kernel removal is delegated to `apt autoremove --purge`; package names are not
constructed or purged with regular expressions. Before execution, the script
logs the APT simulation and refuses cleanup unless at least two installed
versioned kernel image packages remain after simulated removals.
This protects a fallback kernel while preserving Ubuntu dependency policy.
Operators must still review DKMS builds, NVIDIA compatibility, VFIO bindings,
Secure Boot state, and the simulated removal set before manual execution.
## Backup policy
Backups are written to `/srv/backups/ailab-config` as
`ailab-config-YYYYMMDD-HHMMSS.tar.gz`. Matching archives older than 30 days are
deleted only after a new archive is created.
The backup covers `/etc`, selected root shell configuration,
`/opt/ailab-maintenance` when present, and libvirt configuration under
`/var/lib/libvirt/qemu`. It does not include `/var/lib/docker`, WebODM data,
Ollama models, VM disk images, or other large application datasets. Because
`/etc` is included, explicitly listed configuration subdirectories are already
covered even when optional-path reporting mentions them separately.
This is a local configuration backup, not a disaster-recovery design. A real
deployment should copy archives to independently protected storage and test
restoration.
## Journald policy
The installer applies:
```ini
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
```
These settings bound journal growth while retaining useful troubleshooting
evidence. Capacity and retention should be adjusted to the host's disk size
and incident-response requirements.
## Disk watch policy
The disk check uses `df -P`, defaults to an 85 percent threshold, and returns
`1` when any checked filesystem meets or exceeds the threshold. Override the
threshold for a manual or unit invocation with:
```bash
sudo AILAB_DISK_THRESHOLD=90 /usr/local/sbin/ailab-disk-watch.sh
```
The script reports every filesystem as `OK` or `WARNING`; it does not delete
data or attempt remediation.
## Example operational workflows
### Weekly maintenance review
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh
sudo /usr/local/sbin/ailab-kernel-cleanup.sh
sudo /usr/local/sbin/ailab-docker-cleanup.sh
systemctl list-timers --all | grep ailab-
```
Review the kernel simulation, Docker usage, failed units, backup freshness, and
disk warnings before approving manual changes.
### Disk pressure investigation
```bash
sudo AILAB_DISK_THRESHOLD=80 /usr/local/sbin/ailab-disk-watch.sh
sudo docker system df
sudo journalctl --disk-usage
sudo /usr/local/sbin/ailab-vm-audit.sh
```
Use the evidence to identify ownership. Do not treat Docker pruning or file
deletion as a substitute for application-specific retention policy.
### Post-maintenance evidence
```bash
sudo /usr/local/sbin/ailab-healthcheck.sh \
| sudo tee /root/ailab-healthcheck-after-maintenance.txt
journalctl --since today -u 'ailab-*.service'
```
## Interview talking points
- Why timer units explicitly carry the non-interactive execution boundary.
- Why APT dependency policy is safer than regex-based kernel deletion.
- How Docker volume preservation separates platform hygiene from application
data lifecycle decisions.
- How optional dependency handling keeps one health command useful across
container, GPU, and virtualization host variants.
- Why configuration backup and application-data backup are separate concerns.
- How exit codes, persistent timers, logs, and post-checks support operations.
## Future improvements
- Add a dedicated logrotate policy after measuring log growth.
- Export disk-watch status to a monitoring system instead of relying only on
timer failure state.
- Add automated archive integrity checks and off-host replication.
- Add Bats tests using mocked `apt`, `docker`, `virsh`, and `systemctl`
commands.
- Add package-lock detection with bounded retry policy if recurring contention
is observed.
- Validate NVIDIA DKMS state and libvirt GPU passthrough configuration in a
dedicated read-only audit.
+103
View File
@@ -0,0 +1,103 @@
#!/usr/bin/env bash
set -euo pipefail
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
JOURNALD_DROP_IN="/etc/systemd/journald.conf.d/ailab-limits.conf"
DOCKER_CONFIG="/etc/docker/daemon.json"
packages=(
logrotate
needrestart
smartmontools
nvme-cli
sysstat
iotop
ncdu
duf
jq
lsof
psmisc
tar
gzip
)
timers=(
ailab-apt-cleanup.timer
ailab-kernel-cleanup.timer
ailab-docker-cleanup.timer
ailab-config-backup.timer
ailab-disk-watch.timer
)
if ((EUID != 0)); then
printf 'CRITICAL: install.sh must run as root\n' >&2
exit 2
fi
for command_name in apt-get install systemctl; do
if ! command -v "$command_name" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
exit 2
fi
done
printf 'Installing maintenance dependencies...\n'
apt-get update
DEBIAN_FRONTEND=noninteractive apt-get install -y "${packages[@]}"
printf 'Installing scripts and systemd units...\n'
for script in "$SCRIPT_DIR"/scripts/*.sh; do
install -m 0755 "$script" "/usr/local/sbin/$(basename "$script")"
done
for unit in "$SCRIPT_DIR"/systemd/*.{service,timer}; do
install -m 0644 "$unit" "/etc/systemd/system/$(basename "$unit")"
done
install -d -m 0755 "$(dirname "$JOURNALD_DROP_IN")"
tmp_journald="$(mktemp)"
trap 'rm -f "$tmp_journald" "${tmp_docker:-}"' EXIT
cat >"$tmp_journald" <<'EOF'
[Journal]
SystemMaxUse=1G
SystemKeepFree=2G
MaxRetentionSec=14day
Compress=yes
EOF
install -m 0644 "$tmp_journald" "$JOURNALD_DROP_IN"
systemctl restart systemd-journald
if command -v docker >/dev/null 2>&1; then
printf 'Configuring Docker log rotation limits...\n'
install -d -m 0755 /etc/docker
tmp_docker="$(mktemp)"
if [[ -f "$DOCKER_CONFIG" ]]; then
if ! jq empty "$DOCKER_CONFIG" >/dev/null 2>&1; then
printf 'CRITICAL: %s is not valid JSON; refusing to overwrite it\n' "$DOCKER_CONFIG" >&2
exit 1
fi
backup="$DOCKER_CONFIG.$(date '+%Y%m%d-%H%M%S').bak"
install -m 0644 "$DOCKER_CONFIG" "$backup"
jq '. + {
"log-driver": "json-file",
"log-opts": ((."log-opts" // {}) + {"max-size": "50m", "max-file": "5"})
}' "$DOCKER_CONFIG" >"$tmp_docker"
else
jq -n '{
"log-driver": "json-file",
"log-opts": {"max-size": "50m", "max-file": "5"}
}' >"$tmp_docker"
fi
jq empty "$tmp_docker"
install -m 0644 "$tmp_docker" "$DOCKER_CONFIG"
systemctl restart docker || true
else
printf 'INFO: Docker is not installed; Docker daemon configuration was skipped\n'
fi
systemctl daemon-reload
systemctl enable --now "${timers[@]}"
printf '\nEnabled AI Lab timers:\n'
systemctl list-timers --all --no-pager | grep 'ailab-' || true
/usr/local/sbin/ailab-healthcheck.sh > /root/ailab-healthcheck-now.txt
printf '\nOK: installation complete; initial health report: /root/ailab-healthcheck-now.txt\n'
+66
View File
@@ -0,0 +1,66 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-apt-cleanup.log"
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
if ! command -v apt >/dev/null 2>&1; then
printf 'CRITICAL: apt is required\n' >&2
exit 2
fi
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] APT cleanup\n' "$(date --iso-8601=seconds)"
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; apt update, autoremove, autoclean, and needrestart are not executed\n'
printf 'INFO: simulated autoremove follows\n'
LC_ALL=C apt -s autoremove --purge
printf 'INFO: rerun with --execute and confirm to apply changes\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'WARNING: this will update APT metadata and remove packages marked as automatically installed and unused.\n'
printf 'Type EXECUTE to continue: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
apt update
apt autoremove --purge -y
apt autoclean -y
if command -v needrestart >/dev/null 2>&1; then
needrestart -b || true
else
printf 'WARNING: needrestart is not installed\n'
fi
printf 'OK: APT cleanup completed\n'
@@ -0,0 +1,90 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-config-backup.log"
BACKUP_DIR="/srv/backups/ailab-config"
RETENTION_DAYS=30
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
for command_name in tar gzip find; do
if ! command -v "$command_name" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
exit 2
fi
done
exec > >(tee -a "$LOG_FILE") 2>&1
timestamp="$(date '+%Y%m%d-%H%M%S')"
archive="$BACKUP_DIR/ailab-config-$timestamp.tar.gz"
candidate_paths=(
/etc
/root/.bashrc
/root/.bashrc.d
/opt/ailab-maintenance
/var/lib/libvirt/qemu
)
source_paths=()
printf '\n[%s] Configuration backup\n' "$(date --iso-8601=seconds)"
for path in "${candidate_paths[@]}"; do
if [[ -e "$path" ]]; then
source_paths+=("${path#/}")
printf 'OK: include %s\n' "$path"
else
printf 'INFO: optional path is absent: %s\n' "$path"
fi
done
if ((${#source_paths[@]} == 0)); then
printf 'CRITICAL: no backup source paths are present\n'
exit 1
fi
printf 'Backup destination: %s\n' "$archive"
printf 'Retention: matching archives older than %d days\n' "$RETENTION_DAYS"
printf 'Configuration beneath /etc includes libvirt, Docker, and systemd when present\n'
printf 'Excluded by policy: Docker data, application data, model data, and VM disk images\n'
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; no archive or directory was created and no retention deletion ran\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'Type EXECUTE to create the archive and apply retention: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
install -d -m 0750 "$BACKUP_DIR"
tar --create --gzip --file "$archive" --ignore-failed-read --directory / -- "${source_paths[@]}"
find "$BACKUP_DIR" -maxdepth 1 -type f -name 'ailab-config-*.tar.gz' -mtime "+$RETENTION_DAYS" -print -delete
printf 'OK: configuration backup created: %s\n' "$archive"
+38
View File
@@ -0,0 +1,38 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-disk-watch.log"
threshold="${AILAB_DISK_THRESHOLD:-85}"
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root to write %s\n' "$LOG_FILE" >&2
exit 2
fi
if [[ ! "$threshold" =~ ^[0-9]+$ ]] || ((threshold < 1 || threshold > 100)); then
printf 'CRITICAL: AILAB_DISK_THRESHOLD must be an integer from 1 to 100\n' >&2
exit 2
fi
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Disk usage check; threshold=%s%%\n' "$(date --iso-8601=seconds)" "$threshold"
status=0
while read -r filesystem _blocks _used available use_percent mountpoint; do
usage="${use_percent%\%}"
if [[ ! "$usage" =~ ^[0-9]+$ ]]; then
printf 'WARNING: unable to parse usage for %s mounted on %s\n' "$filesystem" "$mountpoint"
status=1
elif ((usage >= threshold)); then
printf 'WARNING: %s mounted on %s is %s used; threshold=%s%%; available=%s KB\n' \
"$filesystem" "$mountpoint" "$use_percent" "$threshold" "$available"
status=1
else
printf 'OK: %s mounted on %s is %s used\n' "$filesystem" "$mountpoint" "$use_percent"
fi
done < <(df -P -x tmpfs -x devtmpfs | awk 'NR > 1 {print $1, $2, $3, $4, $5, $6}')
exit "$status"
@@ -0,0 +1,70 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
LOG_FILE="/var/log/ailab-docker-cleanup.log"
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Docker cleanup\n' "$(date --iso-8601=seconds)"
if ! command -v docker >/dev/null 2>&1; then
printf 'INFO: Docker is not installed; nothing to do\n'
exit 0
fi
if command -v systemctl >/dev/null 2>&1 && ! systemctl is-active --quiet docker; then
printf 'INFO: docker.service is inactive; nothing to do\n'
exit 0
fi
printf '\nDocker disk usage before cleanup:\n'
docker system df
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; would run docker system prune -af\n'
printf 'INFO: dry-run mode; would run docker builder prune -af --filter until=168h\n'
printf 'INFO: Docker volumes are never included in this cleanup\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'WARNING: this removes unused containers, networks, images, and old build cache, but not volumes.\n'
printf 'Type EXECUTE to continue: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
docker system prune -af
docker builder prune -af --filter "until=168h"
printf '\nDocker disk usage after cleanup:\n'
docker system df
printf 'OK: Docker cleanup completed; volumes were not pruned\n'
+111
View File
@@ -0,0 +1,111 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
section() {
printf '\n== %s ==\n' "$1"
}
run_optional() {
local description="$1"
shift
if "$@"; then
return 0
fi
printf 'WARNING: %s failed\n' "$description"
return 0
}
section "Host identity"
if command -v hostnamectl >/dev/null 2>&1; then
run_optional "hostnamectl" hostnamectl
else
run_optional "hostname" hostname
fi
run_optional "kernel information" uname -a
run_optional "uptime" uptime
section "Memory"
if command -v free >/dev/null 2>&1; then
run_optional "memory report" free -h
else
printf 'WARNING: free is not available\n'
fi
section "Filesystems"
if command -v df >/dev/null 2>&1; then
run_optional "filesystem report" df -hT
printf '\nKey mountpoints present:\n'
for mountpoint in / /boot /var /srv /opt /home; do
if findmnt -rn --target "$mountpoint" >/dev/null 2>&1; then
run_optional "filesystem report for $mountpoint" df -hT "$mountpoint"
fi
done
else
printf 'WARNING: df is not available\n'
fi
section "Journal usage"
if command -v journalctl >/dev/null 2>&1; then
run_optional "journal disk usage" journalctl --disk-usage
else
printf 'WARNING: journalctl is not available\n'
fi
section "Docker"
if command -v docker >/dev/null 2>&1; then
if command -v systemctl >/dev/null 2>&1; then
run_optional "Docker service state" systemctl is-active docker
fi
run_optional "Docker container list" docker ps --format 'table {{.Names}}\t{{.Image}}\t{{.Status}}\t{{.Ports}}'
run_optional "Docker disk usage" docker system df
else
printf 'INFO: Docker is not installed\n'
fi
section "Libvirt"
if command -v virsh >/dev/null 2>&1; then
if command -v systemctl >/dev/null 2>&1; then
run_optional "libvirtd service state" systemctl is-active libvirtd
fi
run_optional "libvirt guest list" virsh list --all
else
printf 'INFO: virsh is not installed\n'
fi
section "NVIDIA"
if command -v nvidia-smi >/dev/null 2>&1; then
run_optional "NVIDIA status" nvidia-smi
else
printf 'INFO: nvidia-smi is not installed\n'
fi
section "Failed systemd units"
if command -v systemctl >/dev/null 2>&1; then
run_optional "failed systemd unit report" systemctl --failed --no-pager
else
printf 'WARNING: systemctl is not available\n'
fi
section "SMART quick health"
if command -v smartctl >/dev/null 2>&1; then
shopt -s nullglob
devices=(/dev/sd? /dev/nvme?n?)
shopt -u nullglob
if ((${#devices[@]} == 0)); then
printf 'INFO: no matching SATA/SCSI or NVMe devices found\n'
else
for device in "${devices[@]}"; do
printf '\n-- %s --\n' "$device"
run_optional "SMART health check for $device" smartctl -H "$device"
done
fi
else
printf 'INFO: smartctl is not installed\n'
fi
exit 0
@@ -0,0 +1,117 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
# APT autoremove respects package dependencies and kernel protection rules. That
# is safer than name-based purging on HWE hosts using NVIDIA, DKMS, or VFIO.
LOG_FILE="/var/log/ailab-kernel-cleanup.log"
execute=false
non_interactive=false
usage() {
printf 'Usage: %s [--execute [--non-interactive]]\n' "$(basename "$0")"
}
kernel_packages() {
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' \
'linux-image*' 'linux-headers*' 'linux-modules*' 2>/dev/null \
| awk '$1 ~ /^ii/ {print $2}' \
| sort -u || true
}
versioned_kernel_images() {
dpkg-query -W -f='${db:Status-Abbrev} ${binary:Package}\n' 'linux-image-[0-9]*' 2>/dev/null \
| awk '$1 ~ /^ii/ {sub(/:.*/, "", $2); print $2}' \
| sort -u || true
}
while (($# > 0)); do
case "$1" in
--execute) execute=true ;;
--non-interactive) non_interactive=true ;;
-h|--help) usage; exit 0 ;;
*) printf 'CRITICAL: unknown argument: %s\n' "$1" >&2; usage >&2; exit 2 ;;
esac
shift
done
if [[ "$non_interactive" == true && "$execute" != true ]]; then
printf 'CRITICAL: --non-interactive requires --execute\n' >&2
exit 2
fi
if ((EUID != 0)); then
printf 'CRITICAL: this script must run as root\n' >&2
exit 2
fi
for command_name in apt dpkg-query uname; do
if ! command -v "$command_name" >/dev/null 2>&1; then
printf 'CRITICAL: required command is missing: %s\n' "$command_name" >&2
exit 2
fi
done
exec > >(tee -a "$LOG_FILE") 2>&1
printf '\n[%s] Kernel cleanup\n' "$(date --iso-8601=seconds)"
printf 'Running kernel: %s\n' "$(uname -r)"
printf '\nInstalled kernel-related packages before cleanup:\n'
kernel_packages
simulation="$(LC_ALL=C apt -s autoremove --purge)"
printf '\nAPT autoremove simulation:\n%s\n' "$simulation"
mapfile -t installed_images < <(versioned_kernel_images)
mapfile -t removed_images < <(
awk '$1 == "Remv" && $2 ~ /^linux-image-[0-9]/ {sub(/:.*/, "", $2); print $2}' <<<"$simulation" | sort -u
)
remaining_images=0
for image in "${installed_images[@]}"; do
remove_image=false
for removed in "${removed_images[@]}"; do
if [[ "$image" == "$removed" ]]; then
remove_image=true
break
fi
done
if [[ "$remove_image" != true ]]; then
remaining_images=$((remaining_images + 1))
fi
done
printf 'Kernel image safety check: installed=%d simulated-removals=%d remaining=%d\n' \
"${#installed_images[@]}" "${#removed_images[@]}" "$remaining_images"
if ((${#installed_images[@]} < 2 || remaining_images < 2)); then
printf 'CRITICAL: cleanup would not leave at least two versioned kernel images; refusing execution\n'
exit 1
fi
if [[ "$execute" != true ]]; then
printf 'INFO: dry-run mode; no packages were removed\n'
printf 'INFO: rerun with --execute and confirm to apply the simulated cleanup\n'
exit 0
fi
if [[ "$non_interactive" != true ]]; then
printf 'WARNING: APT will remove the packages shown in the simulation above.\n'
printf 'Type EXECUTE to continue: '
read -r confirmation
if [[ "$confirmation" != "EXECUTE" ]]; then
printf 'CRITICAL: confirmation failed; no changes made\n'
exit 2
fi
fi
apt autoremove --purge -y
apt autoclean -y
if command -v update-grub >/dev/null 2>&1; then
update-grub || true
else
printf 'WARNING: update-grub is not installed\n'
fi
printf '\nInstalled kernel-related packages after cleanup:\n'
kernel_packages
printf 'OK: kernel cleanup completed with APT-managed package selection\n'
+42
View File
@@ -0,0 +1,42 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail
section() {
printf '\n== %s ==\n' "$1"
}
if ! command -v virsh >/dev/null 2>&1; then
printf 'INFO: virsh is not installed; VM audit skipped\n'
exit 0
fi
section "Virtual machines"
virsh list --all || printf 'WARNING: unable to list virtual machines\n'
section "Storage pools"
virsh pool-list --all || printf 'WARNING: unable to list storage pools\n'
mapfile -t pools < <(virsh pool-list --all --name 2>/dev/null | sed '/^[[:space:]]*$/d' || true)
for pool in "${pools[@]}"; do
section "Volumes in pool: $pool"
virsh vol-list "$pool" || printf 'WARNING: unable to list volumes in pool %s\n' "$pool"
done
section "Possible VM disk and installation images"
search_roots=()
for path in /var/lib/libvirt /srv /opt; do
[[ -d "$path" ]] && search_roots+=("$path")
done
if ((${#search_roots[@]} == 0)); then
printf 'INFO: no configured search roots are present\n'
else
find "${search_roots[@]}" -xdev -type f \
\( -iname '*.qcow2' -o -iname '*.raw' -o -iname '*.iso' \) \
-printf '%12s bytes %p\n' 2>/dev/null \
| sort -nr || true
fi
printf '\nINFO: audit complete; no files or libvirt resources were modified\n'
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe APT cleanup
After=network-online.target
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-apt-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab APT cleanup weekly
[Timer]
OnCalendar=Sun *-*-* 04:00:00
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,6 @@
[Unit]
Description=AI Lab configuration backup
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-config-backup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab configuration backup daily
[Timer]
OnCalendar=*-*-* 03:30:00
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,6 @@
[Unit]
Description=AI Lab disk usage check
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-disk-watch.sh
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab disk usage check hourly
[Timer]
OnCalendar=hourly
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe Docker cleanup
Requires=docker.service
After=docker.service
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-docker-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab Docker cleanup weekly
[Timer]
OnCalendar=Sun *-*-* 04:40:00
Persistent=true
[Install]
WantedBy=timers.target
@@ -0,0 +1,8 @@
[Unit]
Description=AI Lab safe kernel cleanup
After=network-online.target ailab-apt-cleanup.service
Wants=network-online.target
[Service]
Type=oneshot
ExecStart=/usr/local/sbin/ailab-kernel-cleanup.sh --execute --non-interactive
@@ -0,0 +1,9 @@
[Unit]
Description=Run AI Lab kernel cleanup weekly
[Timer]
OnCalendar=Sun *-*-* 04:20:00
Persistent=true
[Install]
WantedBy=timers.target