Linux Disk Full Incident Toolkit
Production-style Bash toolkit for diagnosing and handling a disk full incident on Linux systems. It is intentionally conservative: the default mode is a read-only dry run, cleanup actions require --execute and an operator confirmation prompt, and the scripts do not assume root access.
Why Disk Full Incidents Happen
- Logs - application, audit, system, or middleware logs can grow faster than rotation policy expects.
- Temporary files - failed jobs, installers, archives, and batch workloads often leave large files in /tmp, /var/tmp, or application work directories.
- Deleted open files - a process can keep writing to a file after it has been deleted, hiding disk usage from normal directory listings until the process closes the file.
- Inode exhaustion - a filesystem can fail writes even when space is available if it has too many small files and no free inodes.
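To make the last cause concrete, here is a minimal sketch that compares block usage with inode usage for one path. The helper names pct_from_df and report_usage are hypothetical and not part of the toolkit; the sketch assumes GNU df and a device name without spaces.

```bash
#!/usr/bin/env bash
# Sketch only: pct_from_df and report_usage are hypothetical helpers.

# Read POSIX-format df output on stdin and print the fifth column of the
# first data row (Capacity for df -P, IUse% for df -Pi) without the % sign.
pct_from_df() {
  awk 'NR == 2 { gsub(/%/, "", $5); print $5 }'
}

# Print block and inode usage side by side for one path.
# Filesystems that do not report inode counts show "-" for the inode value.
report_usage() {
  local target="${1:-/}"
  local block_pct inode_pct
  block_pct="$(df -P "$target" | pct_from_df)"
  inode_pct="$(df -Pi "$target" | pct_from_df)"
  echo "path=$target blocks_used=${block_pct}% inodes_used=${inode_pct}%"
}

# Example: report_usage /var
```

A filesystem that shows low block usage but inode usage near 100% points to inode exhaustion rather than large files.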
Safety Model
- Safe dry-run behavior is the default.
- No script blindly deletes files.
- Cleanup operations require --execute and confirmation.
- Missing optional commands are reported as WARNING.
- Output is formatted with OK, WARNING, and CRITICAL for incident notes.
- The scripts are designed to work without root, while warning when permissions may limit visibility.
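The safety model above could be implemented along the following lines. This is a sketch under stated assumptions, not the toolkit's actual code: the EXECUTE variable and the status and run_guarded functions are hypothetical names, and a real script would set EXECUTE from its --execute flag.

```bash
#!/usr/bin/env bash
# Sketch of a dry-run gate. EXECUTE, status, and run_guarded are
# hypothetical names chosen for illustration.

EXECUTE=0   # a real script would set this to 1 when --execute is passed

# Print a status-tagged line suitable for incident notes.
status() {
  local level="$1"; shift
  printf '[%s] %s\n' "$level" "$*"
}

# Run a command only when EXECUTE=1 and the operator types "yes".
# Otherwise print what would have run and change nothing.
run_guarded() {
  local answer=""
  if [ "$EXECUTE" -ne 1 ]; then
    status OK "dry-run: would run: $*"
    return 0
  fi
  printf 'About to run: %s\nType "yes" to continue: ' "$*" >&2
  read -r answer || answer=""
  if [ "$answer" != "yes" ]; then
    status WARNING "aborted by operator: $*"
    return 1
  fi
  "$@"
}

# Example: run_guarded rm -f /var/log/app/app.log.5.gz
```

Keeping the gate in one function means every destructive call site goes through the same two checks, which makes the default-safe behavior easier to audit.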
Scripts
- 00_env.sh - shared configuration and helper functions.
- 01_disk_overview.sh - df -h, df -i, sorted mount usage, and threshold highlights.
- 02_find_big_files.sh - read-only largest-file discovery.
- 03_deleted_open_files.sh - deleted but open file detection with lsof when available.
- 04_top_dirs.sh - largest directory discovery with du.
- 05_log_cleanup.sh - safe log cleanup analysis and optional old rotated log removal.
- 06_quick_fix.sh - defensive emergency actions for verified truncation or service restart.
- 07_postcheck.sh - validation after cleanup, with optional before/after comparison.
- disk_full_runbook.sh - guided incident workflow.
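As an illustration of the sorted mount usage and threshold highlights mentioned for the overview script, the following sketch classifies df output. The 80 and 90 percent thresholds and the names WARN_PCT, CRIT_PCT, classify_pct, and highlight_mounts are assumptions for this example, not values taken from 00_env.sh.

```bash
#!/usr/bin/env bash
# Sketch of threshold highlighting. Thresholds and names are assumptions.

WARN_PCT=80
CRIT_PCT=90

# Map an integer usage percentage to OK, WARNING, or CRITICAL.
classify_pct() {
  local pct="$1"
  if [ "$pct" -ge "$CRIT_PCT" ]; then
    echo CRITICAL
  elif [ "$pct" -ge "$WARN_PCT" ]; then
    echo WARNING
  else
    echo OK
  fi
}

# Read df -P output on stdin and print "STATUS pct% mountpoint",
# highest usage first. Rows without a numeric percentage are skipped.
# Limitation: mount points containing spaces are cut at the first space.
highlight_mounts() {
  local pct mount
  awk 'NR > 1 && $5 ~ /^[0-9]+%$/ { sub(/%/, "", $5); print $5, $6 }' |
    sort -rn |
    while read -r pct mount; do
      printf '%-8s %3s%% %s\n' "$(classify_pct "$pct")" "$pct" "$mount"
    done
}

# Example: df -P | highlight_mounts
```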
Example Usage
cd infra-run/scripts/bash/disk-full
./01_disk_overview.sh
./02_find_big_files.sh --path /var --top 20
./03_deleted_open_files.sh
./04_top_dirs.sh --path /var --depth 2
./05_log_cleanup.sh
./07_postcheck.sh
Run the guided read-only workflow:
./disk_full_runbook.sh --path /var --top 20 --depth 2
Review old rotated logs without deleting them:
./05_log_cleanup.sh --path /var/log --days-old 14
Remove old rotated logs only after approval:
./05_log_cleanup.sh --path /var/log --days-old 14 --execute
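For orientation, the review step can be approximated with a read-only find. This is a sketch, not the script's real logic: list_old_rotated_logs is a hypothetical helper, and the file name patterns are assumptions about common rotation naming that may not match a given logrotate configuration.

```bash
#!/usr/bin/env bash
# Sketch: list rotated logs older than N days without deleting anything.
# list_old_rotated_logs and the name patterns are assumptions.

list_old_rotated_logs() {
  local path="$1" days="$2"
  # -xdev stays on one filesystem. -mtime +N matches files whose age,
  # counted in whole days, is greater than N.
  find "$path" -xdev -type f \
    \( -name '*.gz' -o -name '*.[0-9]' -o -name '*.[0-9][0-9]' -o -name '*.old' \) \
    -mtime +"$days" -print 2>/dev/null
}

# Example: list_old_rotated_logs /var/log 14
```

Because the function only prints paths, its output can be attached to a change request before any removal is approved.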
Emergency truncation of a verified active log:
./06_quick_fix.sh --truncate-file /var/log/app/verified-large.log --execute
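Truncation is used here instead of deletion because removing a file that a process still holds open does not free its blocks. A minimal sketch of truncation in place follows; truncate_in_place is a hypothetical helper, and its return codes are chosen to mirror the toolkit's documented exit codes rather than copied from 06_quick_fix.sh.

```bash
#!/usr/bin/env bash
# Sketch: empty a file in place so the writer's descriptor stays valid.
# truncate_in_place is a hypothetical helper.

truncate_in_place() {
  local file="$1"
  if [ ! -f "$file" ]; then
    echo "[CRITICAL] not a regular file: $file" >&2
    return 2
  fi
  if [ ! -w "$file" ]; then
    echo "[WARNING] no write permission: $file" >&2
    return 1
  fi
  # Redirecting nothing into the file sets its length to zero while
  # keeping the same inode, so the blocks are released immediately.
  : > "$file"
}

# Example: truncate_in_place /var/log/app/verified-large.log
```

If the writing process did not open the file in append mode, it keeps its old offset and the file can become sparse after the next write, so the apparent size may look large while the allocated blocks stay small.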
Restart a specific service after confirming it is holding deleted files open:
./06_quick_fix.sh --restart-service app.service --execute
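Before restarting anything, the claim that a service holds deleted files open can be checked with a sketch like the one below. The function names are hypothetical, the fallback assumes GNU find and a mounted /proc, and without root only the current user's processes are visible.

```bash
#!/usr/bin/env bash
# Sketch: find deleted-but-open files. Function names are hypothetical.

# Fallback that needs no extra tools: the kernel appends " (deleted)" to
# the link target of a descriptor whose file has been unlinked.
scan_proc_deleted() {
  find /proc/[0-9]*/fd -type l -lname '* (deleted)' \
    -printf '%p -> %l\n' 2>/dev/null
}

# Prefer lsof when it is installed; +L1 selects open files whose link
# count is below 1, which means they have been unlinked.
list_deleted_open() {
  if command -v lsof >/dev/null 2>&1; then
    lsof -nP +L1 2>/dev/null
  else
    echo "[WARNING] lsof not found; scanning /proc instead" >&2
    scan_proc_deleted
  fi
}

# Example: list_deleted_open
```

The process ID appears in each /proc path or lsof row, which helps identify the exact service to restart.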
Exit Codes
- 0 - OK
- 1 - operational issue detected or still critical
- 2 - invalid input
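A wrapper or incident log can translate these codes into text. The sketch below uses a hypothetical helper named describe_exit; only the three code meanings come from the list above.

```bash
#!/usr/bin/env bash
# Sketch: turn a toolkit exit code into an incident-note phrase.
# describe_exit is a hypothetical helper.

describe_exit() {
  case "$1" in
    0) echo "OK" ;;
    1) echo "operational issue detected or still critical" ;;
    2) echo "invalid input" ;;
    *) echo "unexpected exit code $1" ;;
  esac
}

# Example:
#   ./07_postcheck.sh; rc=$?
#   echo "postcheck: $(describe_exit "$rc")"
```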
Production Warning
Use this toolkit as an incident aid, not an autopilot. Confirm the affected filesystem, application ownership, retention requirements, backup expectations, and change approval before cleanup. In enterprise environments, coordinate service restarts and file truncation with application owners because both can destroy evidence or interrupt production workloads.