Linux Disk Full Incident Toolkit

Production-style Bash toolkit for diagnosing and handling a disk full incident on Linux systems. It is intentionally conservative: default mode is safe, cleanup actions require --execute and an operator confirmation prompt, and the scripts do not assume root access.

Diagram

flowchart TD
  A["disk-full"] --> B["01_disk_overview.sh"]
  A --> C["02_find_big_files.sh"]
  A --> D["03_deleted_open_files.sh"]
  A --> E["04_top_dirs.sh"]
  A --> F["05_log_cleanup.sh"]
  A --> G["06_quick_fix.sh"]
  A --> H["07_postcheck.sh"]
  A --> I["disk_full_runbook.sh"]

Why Disk Full Incidents Happen

Logs - application, audit, system, or middleware logs can grow faster than rotation policy expects.
Temporary files - failed jobs, installers, archives, and batch workloads often leave large files in /tmp, /var/tmp, or application work directories.
Deleted open files - a process can keep writing to a file after it has been deleted, hiding disk usage from normal directory listings until the process closes the file.
Inode exhaustion - a filesystem can fail writes even when space is available if it has too many small files and no free inodes.

Safety Model

Safe dry-run behavior is the default.
No script blindly deletes files.
Cleanup operations require --execute and confirmation.
Missing optional commands are reported as WARNING.
Output is formatted with OK, WARNING, and CRITICAL for incident notes.
The scripts are designed to work without root, while warning when permissions may limit visibility.

Scripts

00_env.sh - shared configuration and helper functions.
01_disk_overview.sh - df -h, df -i, sorted mount usage, and threshold highlights.
02_find_big_files.sh - read-only largest-file discovery.
03_deleted_open_files.sh - deleted but open file detection with lsof when available.
04_top_dirs.sh - largest directory discovery with du.
05_log_cleanup.sh - safe log cleanup analysis and optional old rotated log removal.
06_quick_fix.sh - defensive emergency actions for verified truncation or service restart.
07_postcheck.sh - validation after cleanup, with optional before/after comparison.
disk_full_runbook.sh - guided incident workflow.

Example Usage

cd infra-run/scripts/bash/disk-full

./01_disk_overview.sh
./02_find_big_files.sh --path /var --top 20
./03_deleted_open_files.sh
./04_top_dirs.sh --path /var --depth 2
./05_log_cleanup.sh
./07_postcheck.sh

Run the guided read-only workflow:

./disk_full_runbook.sh --path /var --top 20 --depth 2

Review old rotated logs without deleting them:

./05_log_cleanup.sh --path /var/log --days-old 14

Remove old rotated logs only after approval:

./05_log_cleanup.sh --path /var/log --days-old 14 --execute

Emergency truncation of a verified active log:

./06_quick_fix.sh --truncate-file /var/log/app/verified-large.log --execute

Restart a specific service after confirming it is holding deleted files open:

./06_quick_fix.sh --restart-service app.service --execute

Exit Codes

0 - OK
1 - operational issue detected or still critical
2 - invalid input

Production Warning

Use this toolkit as an incident aid, not an autopilot. Confirm the affected filesystem, application ownership, retention requirements, backup expectations, and change approval before cleanup. In enterprise environments, coordinate service restarts and file truncation with application owners because both can destroy evidence or interrupt production workloads.

3.6 KiB Raw Blame History