Files
portfolio/infra-run/scripts/bash/disk-full/README.md
T
2026-05-06 06:46:27 +00:00

93 lines
3.4 KiB
Markdown

# Linux Disk Full Incident Toolkit
Production-style Bash toolkit for diagnosing and handling a disk full incident on Linux systems. It is intentionally conservative: default mode is safe, cleanup actions require `--execute` and an operator confirmation prompt, and the scripts do not assume root access.
## Diagram
```mermaid
flowchart TD
A["disk-full"]
click A href "./" "disk-full"
```
## Why Disk Full Incidents Happen
- **Logs** - application, audit, system, or middleware logs can grow faster than rotation policy expects.
- **Temporary files** - failed jobs, installers, archives, and batch workloads often leave large files in `/tmp`, `/var/tmp`, or application work directories.
- **Deleted open files** - a process can keep writing to a file after it has been deleted, hiding disk usage from normal directory listings until the process closes the file.
- **Inode exhaustion** - a filesystem can fail writes even when space is available if it has too many small files and no free inodes.
## Safety Model
- Safe dry-run behavior is the default.
- No script blindly deletes files.
- Cleanup operations require `--execute` and confirmation.
- Missing optional commands are reported as `WARNING`.
- Output is formatted with `OK`, `WARNING`, and `CRITICAL` for incident notes.
- The scripts are designed to work without root, while warning when permissions may limit visibility.
## Scripts
- `00_env.sh` - shared configuration and helper functions.
- `01_disk_overview.sh` - `df -h`, `df -i`, sorted mount usage, and threshold highlights.
- `02_find_big_files.sh` - read-only largest-file discovery.
- `03_deleted_open_files.sh` - deleted but open file detection with `lsof` when available.
- `04_top_dirs.sh` - largest directory discovery with `du`.
- `05_log_cleanup.sh` - safe log cleanup analysis and optional old rotated log removal.
- `06_quick_fix.sh` - defensive emergency actions for verified truncation or service restart.
- `07_postcheck.sh` - validation after cleanup, with optional before/after comparison.
- `disk_full_runbook.sh` - guided incident workflow.
## Example Usage
```bash
cd infra-run/scripts/bash/disk-full
./01_disk_overview.sh
./02_find_big_files.sh --path /var --top 20
./03_deleted_open_files.sh
./04_top_dirs.sh --path /var --depth 2
./05_log_cleanup.sh
./07_postcheck.sh
```
Run the guided read-only workflow:
```bash
./disk_full_runbook.sh --path /var --top 20 --depth 2
```
Review old rotated logs without deleting them:
```bash
./05_log_cleanup.sh --path /var/log --days-old 14
```
Remove old rotated logs only after approval:
```bash
./05_log_cleanup.sh --path /var/log --days-old 14 --execute
```
Emergency truncation of a verified active log:
```bash
./06_quick_fix.sh --truncate-file /var/log/app/verified-large.log --execute
```
Restart a specific service after confirming it is holding deleted files open:
```bash
./06_quick_fix.sh --restart-service app.service --execute
```
## Exit Codes
- `0` - OK
- `1` - operational issue detected or still critical
- `2` - invalid input
## Production Warning
Use this toolkit as an incident aid, not an autopilot. Confirm the affected filesystem, application ownership, retention requirements, backup expectations, and change approval before cleanup. In enterprise environments, coordinate service restarts and file truncation with application owners because both can destroy evidence or interrupt production workloads.