Compare commits
5 Commits
| Author | SHA1 | Date |
|---|---|---|
| | 65c7c82f0f | |
| | 76e24796bb | |
| | 5dd8c34952 | |
| | c42d8bfb8f | |
| | 9fb291f834 | |
@@ -0,0 +1,39 @@
# Changelog

## [Initial Version]

### Added

- Repository structure:
  - `infra-run`
  - `platform-projects`
  - `labs`
- Linux operations Bash toolkit:
  - healthcheck
  - disk usage checks
  - service checks
  - system reporting
- Disk full incident toolkit:
  - disk analysis
  - large files detection
  - deleted open files detection
  - safe cleanup suggestions
- Network troubleshooting script:
  - interface, routing, DNS, connectivity checks
- Veritas storage toolkit:
  - VxVM disk detection
  - diskgroup extension
  - volume/filesystem resize
  - VCS freeze/unfreeze workflow
- GPFS storage toolkit:
  - cluster validation
  - NSD planning
  - filesystem expansion
  - rebalance
- Runbook-style structure and step-based execution.

### Notes

- All scripts default to dry-run where change actions are present.
- Designed for safety and readability.
- No destructive actions without explicit confirmation.
@@ -1,10 +1,59 @@
|
||||
# Portfolio
|
||||
|
||||
Personal infrastructure engineering portfolio focused on Linux operations, automation, monitoring and lab-based infrastructure projects.
|
||||
This repository demonstrates real-world Linux infrastructure and operations experience through sanitized scripts, runbooks, and project structure. It focuses on production operations, incident response, troubleshooting, automation, and enterprise infrastructure patterns.
|
||||
|
||||
Main areas:
|
||||
- Linux operations automation
|
||||
- Infrastructure troubleshooting and runbooks
|
||||
- Monitoring and observability
|
||||
- Virtualization and clustering
|
||||
- Kubernetes, Terraform and lab environments
|
||||
## Core Project
|
||||
|
||||
### infra-run
|
||||
|
||||
`infra-run` is the core operational project in this repository. It contains Linux operations automation, incident response tooling, Bash-based operational scripts, and runbook-style workflows for pre-checks, controlled changes, troubleshooting, and post-change validation.
|
||||
|
||||
## Toolkits
|
||||
|
||||
### Linux Operations Toolkit
|
||||
|
||||
[infra-run/scripts/bash/](./infra-run/scripts/bash/)
|
||||
|
||||
General Linux operations scripts for host health checks, disk usage checks, service validation, and system reporting. The toolkit is written for practical operations checks on RHEL, Oracle Linux, and Ubuntu-style systems.
|
||||
|
||||
### Disk Full Incident Toolkit
|
||||
|
||||
[infra-run/scripts/bash/disk-full/](./infra-run/scripts/bash/disk-full/)
|
||||
|
||||
Production-style disk full incident workflow covering filesystem usage, inode pressure, large file discovery, deleted open files, top directory analysis, log cleanup review, and safe cleanup suggestions. The scenario reflects common incidents involving logs, temporary files, deleted files held open by processes, and inode exhaustion.
|
||||
|
||||
### Network Troubleshooting
|
||||
|
||||
[infra-run/scripts/bash/](./infra-run/scripts/bash/)
|
||||
|
||||
OS-level network diagnostics for interfaces, routes, DNS resolution, gateway reachability, listening sockets, and optional remote connectivity checks. The script is designed for first-pass troubleshooting during Linux operations incidents.
|
||||
|
||||
### Veritas Storage Toolkit
|
||||
|
||||
[infra-run/scripts/bash/veritas/](./infra-run/scripts/bash/veritas/)
|
||||
|
||||
Veritas VxVM and VCS storage expansion workflow covering new LUN detection, VxVM disk initialization, diskgroup extension, volume and filesystem resize, and VCS service group freeze/unfreeze handling. The approach is cluster-safe, dry-run by default, and organized around pre-check, change, and post-check steps.
|
||||
|
||||
### GPFS Storage Toolkit
|
||||
|
||||
[infra-run/scripts/bash/gpfs/](./infra-run/scripts/bash/gpfs/)
|
||||
|
||||
GPFS / IBM Spectrum Scale filesystem expansion workflow covering cluster validation, candidate disk discovery, NSD stanza planning, NSD creation, filesystem expansion, optional rebalance, post-checks, and change reporting.
|
||||
|
||||
## Repository Structure
|
||||
|
||||
- `infra-run` - core operational automation, scripts, runbooks, and infrastructure operations examples.
|
||||
- `platform-projects` - larger infrastructure topics including storage, clustering, monitoring, virtualization, and log analysis.
|
||||
- `labs` - experimentation and lab work for Kubernetes, Terraform, Docker, networking, and CI/CD.
|
||||
|
||||
## Design Principles
|
||||
|
||||
- Safety first, with dry-run behavior by default.
|
||||
- Pre-check, change, and post-check workflow.
|
||||
- Real-world scenarios, not tutorials.
|
||||
- Minimal but practical tooling.
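
The first two principles can be sketched roughly as follows. This is a minimal illustration, not the repository's code: `run_step`, `precheck`, `change`, and `postcheck` are hypothetical names, and the `echo` call stands in for a real change command.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of dry-run-by-default plus pre-check, change, post-check.
DRY_RUN="${DRY_RUN:-true}"

# Print the command in dry-run mode; execute it only when DRY_RUN=false.
run_step() {
  if [ "$DRY_RUN" = "true" ]; then
    printf 'DRY-RUN: %s\n' "$*"
    return 0
  fi
  printf 'RUN: %s\n' "$*"
  "$@"
}

precheck()  { printf 'PRE-CHECK: %s\n' "collect current state"; }
change()    { run_step echo "apply change"; }   # placeholder change action
postcheck() { printf 'POST-CHECK: %s\n' "verify the result"; }

# Each phase runs only if the previous one succeeded.
precheck && change && postcheck
```

With the default settings the change phase only prints what it would do. Setting `DRY_RUN=false` in the environment makes `run_step` execute the command.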
|
||||
|
||||
## Notes
|
||||
|
||||
- Scripts are simplified and sanitized for portfolio use.
|
||||
- Examples are based on real production operations patterns.
|
||||
|
||||
@@ -0,0 +1,51 @@
|
||||
# Linux Operations Bash Toolkit
|
||||
|
||||
Small, practical Bash scripts for Linux operations checks and incident triage. The scripts are sanitized examples inspired by production Linux operations work and avoid destructive actions or root-only assumptions.
|
||||
|
||||
## Scripts
|
||||
|
||||
- `healthcheck.sh` - general host health overview.
|
||||
- `disk_check.sh` - filesystem usage threshold check.
|
||||
- `service_check.sh` - critical service status check.
|
||||
- `system_report.sh` - writes a timestamped system report to `/tmp`.
|
||||
- `network_troubleshoot.sh` - local and optional remote network diagnostics.
|
||||
|
||||
## Usage
|
||||
|
||||
```bash
./healthcheck.sh
./disk_check.sh
./disk_check.sh 90
./service_check.sh
./service_check.sh sshd nginx zabbix-agent
./system_report.sh
./network_troubleshoot.sh
./network_troubleshoot.sh google.com
```
|
||||
|
||||
## Exit Codes
|
||||
|
||||
`disk_check.sh`:
|
||||
|
||||
- `0` - all filesystems are below the threshold.
|
||||
- `1` - one or more filesystems are at or above the threshold.
|
||||
- `2` - invalid threshold input.
|
||||
|
||||
`service_check.sh`:
|
||||
|
||||
- `0` - all checked services are active.
|
||||
- `1` - at least one service is inactive, failed, missing, or cannot be checked.
|
||||
|
||||
`network_troubleshoot.sh`:
|
||||
|
||||
- `0` - no obvious local, DNS, or connectivity issue detected.
|
||||
- `1` - DNS, interface, gateway, or target connectivity problems detected.
|
||||
|
||||
`healthcheck.sh` and `system_report.sh` are informational. They print warnings for missing tools where possible.
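
A calling script can branch on these codes. The sketch below covers only the documented `disk_check.sh` codes; `describe_disk_check_status` is a hypothetical helper and its message texts are assumptions, not output of the toolkit.

```bash
#!/usr/bin/env bash
# Hypothetical helper: maps the documented disk_check.sh exit codes to a message.
describe_disk_check_status() {
  case "$1" in
    0) printf 'OK: all filesystems below threshold\n' ;;
    1) printf 'CRITICAL: at least one filesystem at or above threshold\n' ;;
    2) printf 'UNKNOWN: invalid threshold input\n' ;;
    *) printf 'UNKNOWN: unexpected exit code %s\n' "$1" ;;
  esac
}

# Example call site, commented out so the sketch runs without the toolkit.
# Capturing the status with "|| status=$?" keeps a non-zero exit from
# terminating a caller that uses "set -o errexit".
# status=0
# ./disk_check.sh 90 || status=$?
# describe_disk_check_status "$status"
```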
|
||||
|
||||
## Notes
|
||||
|
||||
- Requires Bash.
|
||||
- Designed for RHEL, Oracle Linux, and Ubuntu-style systems.
|
||||
- Handles missing tools such as `ss`, `traceroute`, `nc`, and `journalctl` gracefully.
|
||||
- Does not require root and does not make system changes.
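
Handling a missing optional tool might look like the sketch below. It assumes a fallback from `ss` to `netstat`, which is an illustration only; `have_cmd` and `list_listening_sockets` are hypothetical names, not functions from these scripts.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of graceful degradation when an optional tool is absent.

# Succeeds when the named command is available in PATH.
have_cmd() {
  command -v "$1" >/dev/null 2>&1
}

# Prefer ss, fall back to netstat, otherwise warn and continue.
list_listening_sockets() {
  if have_cmd ss; then
    ss -tuln
  elif have_cmd netstat; then
    netstat -tuln
  else
    printf 'WARNING: neither ss nor netstat is available; skipping socket listing\n'
  fi
}
```

The function does not return an error for a missing tool, so an informational report can keep going and record the gap as a warning.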
|
||||
Executable
@@ -0,0 +1,124 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
TIMESTAMP="${TIMESTAMP:-$(date +%Y%m%d_%H%M%S)}"
|
||||
DRY_RUN="${DRY_RUN:-true}"
|
||||
LOG_FILE="${LOG_FILE:-/tmp/disk_full_${TIMESTAMP}.log}"
|
||||
WARN_THRESHOLD="${WARN_THRESHOLD:-80}"
|
||||
CRIT_THRESHOLD="${CRIT_THRESHOLD:-90}"
|
||||
EMERGENCY_THRESHOLD="${EMERGENCY_THRESHOLD:-95}"
|
||||
|
||||
log() {
|
||||
local level="$1"
|
||||
shift
|
||||
local message="$*"
|
||||
|
||||
printf '%s: %s\n' "$level" "$message" | tee -a "$LOG_FILE"
|
||||
}
|
||||
|
||||
ok() {
|
||||
log "OK" "$@"
|
||||
}
|
||||
|
||||
warning() {
|
||||
log "WARNING" "$@"
|
||||
}
|
||||
|
||||
critical() {
|
||||
log "CRITICAL" "$@"
|
||||
}
|
||||
|
||||
section() {
|
||||
printf '\n== %s ==\n' "$1" | tee -a "$LOG_FILE"
|
||||
}
|
||||
|
||||
require_cmd() {
|
||||
local cmd="$1"
|
||||
|
||||
if command -v "$cmd" >/dev/null 2>&1; then
|
||||
return 0
|
||||
fi
|
||||
|
||||
warning "Command not available: $cmd"
|
||||
return 1
|
||||
}
|
||||
|
||||
run_cmd() {
|
||||
if [[ "$#" -eq 0 ]]; then
|
||||
critical "run_cmd called without a command"
|
||||
return 2
|
||||
fi
|
||||
|
||||
if [[ "$DRY_RUN" == "true" ]]; then
|
||||
ok "DRY-RUN: $*"
|
||||
return 0
|
||||
fi
|
||||
|
||||
ok "RUN: $*"
|
||||
"$@" 2>&1 | tee -a "$LOG_FILE"
|
||||
}
|
||||
|
||||
confirm_execute() {
|
||||
local target="${1:-disk-full remediation}"
|
||||
|
||||
if [[ "$DRY_RUN" == "true" ]]; then
|
||||
ok "Safe mode enabled. No destructive actions will be taken."
|
||||
return 0
|
||||
fi
|
||||
|
||||
warning "Execution mode requested for: $target"
|
||||
warning "Confirm the affected filesystem, application impact, backups, and change approval before continuing."
|
||||
printf 'Type EXECUTE to continue: '
|
||||
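local confirmation=""
# Treat EOF or a closed stdin as a failed confirmation instead of exiting silently under errexit.
read -r confirmation || true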
|
||||
|
||||
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||
critical "Confirmation failed. Aborting."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
ok "Execution confirmed by operator."
|
||||
}
|
||||
|
||||
validate_path() {
|
||||
local path="$1"
|
||||
|
||||
if [[ -z "$path" ]]; then
|
||||
critical "Path cannot be empty"
|
||||
return 2
|
||||
fi
|
||||
|
||||
if [[ ! -e "$path" ]]; then
|
||||
critical "Path does not exist: $path"
|
||||
return 2
|
||||
fi
|
||||
}
|
||||
|
||||
usage_percent_number() {
|
||||
local value="$1"
|
||||
printf '%s\n' "${value%\%}"
|
||||
}
|
||||
|
||||
status_for_percent() {
|
||||
local percent="$1"
|
||||
|
||||
if (( percent >= EMERGENCY_THRESHOLD )); then
|
||||
printf 'CRITICAL'
|
||||
elif (( percent >= CRIT_THRESHOLD )); then
|
||||
printf 'WARNING'
|
||||
elif (( percent >= WARN_THRESHOLD )); then
|
||||
printf 'WARNING'
|
||||
else
|
||||
printf 'OK'
|
||||
fi
|
||||
}
|
||||
|
||||
safe_find_prune_args() {
|
||||
printf '%s\n' \
|
||||
-path /proc -o \
|
||||
-path /sys -o \
|
||||
-path /dev -o \
|
||||
-path /run -o \
|
||||
-path /tmp/systemd-private-\*
|
||||
}
|
||||
@@ -0,0 +1,47 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
exit_code=0
|
||||
|
||||
section "Disk Space Overview"
|
||||
if require_cmd df; then
|
||||
df -h 2>&1 | tee -a "$LOG_FILE"
|
||||
else
|
||||
critical "df is required for disk overview"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
section "Inode Overview"
|
||||
df -i 2>&1 | tee -a "$LOG_FILE" || warning "Unable to collect inode usage"
|
||||
|
||||
section "Filesystems Sorted By Usage"
|
||||
df -P -h | awk 'NR == 1 { next } { print $5, $6, $1, $2, $3, $4 }' | sort -rn | while read -r used mount fs size used_space avail; do
|
||||
percent="$(usage_percent_number "$used")"
|
||||
level="$(status_for_percent "$percent")"
|
||||
printf '%s: %s used on %s (%s, size=%s used=%s avail=%s)\n' "$level" "$used" "$mount" "$fs" "$size" "$used_space" "$avail" | tee -a "$LOG_FILE"
|
||||
done
|
||||
|
||||
section "Threshold Summary"
|
||||
while read -r fs size used avail pct mount; do
|
||||
percent="$(usage_percent_number "$pct")"
|
||||
level="$(status_for_percent "$percent")"
|
||||
|
||||
if (( percent >= EMERGENCY_THRESHOLD )); then
|
||||
critical "$mount is ${pct} full on $fs (size=$size used=$used avail=$avail)"
|
||||
exit_code=1
|
||||
elif (( percent >= CRIT_THRESHOLD )); then
|
||||
warning "$mount is ${pct} full on $fs (size=$size used=$used avail=$avail)"
|
||||
elif (( percent >= WARN_THRESHOLD )); then
|
||||
warning "$mount is ${pct} full on $fs (size=$size used=$used avail=$avail)"
|
||||
else
|
||||
ok "$mount is ${pct} full on $fs"
|
||||
fi
|
||||
done < <(df -P -h | awk 'NR > 1 { print $1, $2, $3, $4, $5, $6 }')
|
||||
|
||||
exit "$exit_code"
|
||||
@@ -0,0 +1,63 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
SEARCH_PATH="/"
|
||||
TOP_N=20
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--path <path>] [--top <N>]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--path) SEARCH_PATH="${2:-}"; shift 2 ;;
|
||||
--top) TOP_N="${2:-}"; shift 2 ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if ! [[ "$TOP_N" =~ ^[0-9]+$ ]] || (( TOP_N < 1 )); then
|
||||
critical "--top must be a positive integer"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
validate_path "$SEARCH_PATH" || exit 2
|
||||
require_cmd find || exit 1
|
||||
require_cmd sort || exit 1
|
||||
require_cmd head || exit 1
|
||||
|
||||
section "Largest Files Under $SEARCH_PATH"
|
||||
warning "Read-only scan. Permission errors can be normal without root access."
|
||||
|
||||
find "$SEARCH_PATH" -xdev \
|
||||
\( -path /proc -o -path /sys -o -path /dev -o -path /run \) -prune -o \
|
||||
-type f -printf '%s\t%p\n' 2>/dev/null |
|
||||
sort -rn |
|
||||
head -n "$TOP_N" |
|
||||
awk '
|
||||
function human(bytes) {
|
||||
split("B KB MB GB TB PB", unit)
|
||||
size = bytes
|
||||
idx = 1
|
||||
while (size >= 1024 && idx < 6) {
|
||||
size = size / 1024
|
||||
idx++
|
||||
}
|
||||
return sprintf("%.1f%s", size, unit[idx])
|
||||
}
|
||||
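{
  # Strip only the size field and its tab so spaces in the path survive;
  # assigning $1 = "" would rebuild $0 with single spaces and no leading tab.
  path = $0
  sub(/^[^\t]*\t/, "", path)
  printf "%10s %s\n", human($1), path
}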
|
||||
' | tee -a "$LOG_FILE"
|
||||
|
||||
ok "No files were modified."
|
||||
@@ -0,0 +1,41 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
section "Deleted But Open Files"
|
||||
|
||||
if ! require_cmd lsof; then
|
||||
warning "lsof is not installed or not in PATH. Install lsof or run equivalent tooling with appropriate privileges."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
warning "Read-only check. Full results may require elevated privileges."
|
||||
|
||||
deleted_output="$(lsof -nP +L1 2>/dev/null || true)"
|
||||
|
||||
if [[ -z "$deleted_output" ]]; then
|
||||
ok "No deleted open files detected by lsof."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
printf '%s\n' "$deleted_output" |
|
||||
awk '
|
||||
NR == 1 {
|
||||
printf "%-20s %-10s %-12s %s\n", "PROCESS", "PID", "SIZE", "PATH"
|
||||
next
|
||||
}
|
||||
{
|
||||
# With +L1, lsof adds an NLINK column, so NAME starts at field 10 (field 9 is NODE).
path = $10
for (i = 11; i <= NF; i++) {
|
||||
path = path " " $i
|
||||
}
|
||||
printf "%-20s %-10s %-12s %s\n", $1, $2, $7, path
|
||||
}
|
||||
' | tee -a "$LOG_FILE"
|
||||
|
||||
warning "Space from deleted files is released when the owning process closes the file or is safely restarted."
|
||||
@@ -0,0 +1,51 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
SEARCH_PATH="/"
|
||||
DEPTH=2
|
||||
TOP_N=25
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--path <path>] [--depth <N>] [--top <N>]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--path) SEARCH_PATH="${2:-}"; shift 2 ;;
|
||||
--depth) DEPTH="${2:-}"; shift 2 ;;
|
||||
--top) TOP_N="${2:-}"; shift 2 ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if ! [[ "$DEPTH" =~ ^[0-9]+$ ]]; then
|
||||
critical "--depth must be a non-negative integer"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
if ! [[ "$TOP_N" =~ ^[0-9]+$ ]] || (( TOP_N < 1 )); then
|
||||
critical "--top must be a positive integer"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
validate_path "$SEARCH_PATH" || exit 2
|
||||
require_cmd du || exit 1
|
||||
require_cmd sort || exit 1
|
||||
require_cmd head || exit 1
|
||||
|
||||
section "Top Directories Under $SEARCH_PATH"
|
||||
warning "Read-only scan. Permission errors can be normal without root access."
|
||||
|
||||
du -x -h --max-depth="$DEPTH" "$SEARCH_PATH" 2>/dev/null |
|
||||
sort -hr |
|
||||
head -n "$TOP_N" |
|
||||
tee -a "$LOG_FILE"
|
||||
|
||||
ok "No directories were modified."
|
||||
@@ -0,0 +1,97 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
EXECUTE=false
|
||||
LOG_PATH="/var/log"
|
||||
DAYS_OLD=14
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--path <path>] [--days-old <N>] [--execute]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--path) LOG_PATH="${2:-}"; shift 2 ;;
|
||||
--days-old) DAYS_OLD="${2:-}"; shift 2 ;;
|
||||
--execute) EXECUTE=true; DRY_RUN=false; shift ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if ! [[ "$DAYS_OLD" =~ ^[0-9]+$ ]] || (( DAYS_OLD < 1 )); then
|
||||
critical "--days-old must be a positive integer"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
validate_path "$LOG_PATH" || exit 2
|
||||
require_cmd find || exit 1
|
||||
require_cmd sort || exit 1
|
||||
require_cmd xargs || true
|
||||
|
||||
section "Large Log Files In $LOG_PATH"
|
||||
find "$LOG_PATH" -xdev -type f \( -name '*.log' -o -name '*log' -o -name 'messages*' -o -name 'syslog*' \) -size +100M -printf '%s\t%p\n' 2>/dev/null |
|
||||
sort -rn |
|
||||
awk '
|
||||
function human(bytes) {
|
||||
split("B KB MB GB TB", unit)
|
||||
size = bytes
|
||||
idx = 1
|
||||
while (size >= 1024 && idx < 5) {
|
||||
size = size / 1024
|
||||
idx++
|
||||
}
|
||||
return sprintf("%.1f%s", size, unit[idx])
|
||||
}
|
||||
{ size = $1; path = $0; sub(/^[^\t]*\t/, "", path); printf "%10s %s\n", human(size), path }
|
||||
' | tee -a "$LOG_FILE"
|
||||
|
||||
section "Old Rotated Logs Eligible For Review"
|
||||
mapfile -t rotated_logs < <(
|
||||
find "$LOG_PATH" -xdev -type f \
|
||||
\( -name '*.gz' -o -name '*.1' -o -name '*.old' -o -name '*.bz2' -o -name '*.xz' \) \
|
||||
-mtime +"$DAYS_OLD" -print 2>/dev/null | sort
|
||||
)
|
||||
|
||||
if [[ "${#rotated_logs[@]}" -eq 0 ]]; then
|
||||
ok "No old rotated logs found under $LOG_PATH with age greater than $DAYS_OLD days."
|
||||
else
|
||||
printf '%s\n' "${rotated_logs[@]}" | tee -a "$LOG_FILE"
|
||||
fi
|
||||
|
||||
section "Suggested Cleanup Commands"
|
||||
cat <<SUGGESTIONS | tee -a "$LOG_FILE"
|
||||
# Review large active logs before truncating. Prefer application-aware log rotation:
|
||||
logrotate -d /etc/logrotate.conf
|
||||
|
||||
# Remove old rotated logs only after retention approval:
|
||||
$(basename "$0") --path "$LOG_PATH" --days-old "$DAYS_OLD" --execute
|
||||
SUGGESTIONS
|
||||
|
||||
if [[ "$EXECUTE" != "true" ]]; then
|
||||
ok "Safe mode. No logs were removed."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ "${#rotated_logs[@]}" -eq 0 ]]; then
|
||||
ok "Execution requested, but there are no eligible old rotated logs to remove."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
confirm_execute "remove old rotated logs from $LOG_PATH"
|
||||
|
||||
for file in "${rotated_logs[@]}"; do
|
||||
if [[ -f "$file" && ! -L "$file" ]]; then
|
||||
run_cmd rm -f -- "$file"
|
||||
else
|
||||
warning "Skipped non-regular file or symlink: $file"
|
||||
fi
|
||||
done
|
||||
|
||||
ok "Old rotated log cleanup completed."
|
||||
@@ -0,0 +1,78 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
EXECUTE=false
|
||||
TRUNCATE_FILE=""
|
||||
RESTART_SERVICE=""
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--truncate-file <path>] [--restart-service <name>] [--execute]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--truncate-file) TRUNCATE_FILE="${2:-}"; shift 2 ;;
|
||||
--restart-service) RESTART_SERVICE="${2:-}"; shift 2 ;;
|
||||
--execute) EXECUTE=true; DRY_RUN=false; shift ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
section "Emergency Disk Full Quick Fix Options"
|
||||
cat <<OPTIONS | tee -a "$LOG_FILE"
|
||||
Possible actions after incident commander approval:
|
||||
1. Truncate a verified active log file:
|
||||
$0 --truncate-file /path/to/large.log --execute
|
||||
|
||||
2. Restart a specific service holding deleted files open:
|
||||
$0 --restart-service service-name --execute
|
||||
|
||||
Review application impact before either action. Truncation preserves the file inode but destroys file contents.
|
||||
OPTIONS
|
||||
|
||||
if [[ -z "$TRUNCATE_FILE" && -z "$RESTART_SERVICE" ]]; then
|
||||
ok "No quick fix requested. Printed options only."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ "$EXECUTE" != "true" ]]; then
|
||||
warning "Quick fix arguments supplied without --execute. No changes made."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
confirm_execute "emergency disk-full quick fix"
|
||||
|
||||
if [[ -n "$TRUNCATE_FILE" ]]; then
|
||||
validate_path "$TRUNCATE_FILE" || exit 2
|
||||
|
||||
if [[ ! -f "$TRUNCATE_FILE" || -L "$TRUNCATE_FILE" ]]; then
|
||||
critical "Refusing to truncate non-regular file or symlink: $TRUNCATE_FILE"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
warning "Truncating file contents: $TRUNCATE_FILE"
|
||||
: > "$TRUNCATE_FILE"
|
||||
ok "Truncated $TRUNCATE_FILE"
|
||||
fi
|
||||
|
||||
if [[ -n "$RESTART_SERVICE" ]]; then
|
||||
if [[ "$RESTART_SERVICE" == *"/"* || "$RESTART_SERVICE" == *".."* ]]; then
|
||||
critical "Invalid service name: $RESTART_SERVICE"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
if require_cmd systemctl; then
|
||||
run_cmd systemctl restart "$RESTART_SERVICE"
|
||||
ok "Restart requested for service: $RESTART_SERVICE"
|
||||
else
|
||||
critical "systemctl is required to restart services"
|
||||
exit 1
|
||||
fi
|
||||
fi
|
||||
@@ -0,0 +1,90 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
BEFORE_FILE=""
|
||||
exit_code=0
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--before-file <df_output_file>]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--before-file) BEFORE_FILE="${2:-}"; shift 2 ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
require_cmd df || exit 1
|
||||
|
||||
section "Post-Cleanup Disk Space"
|
||||
df -h 2>&1 | tee -a "$LOG_FILE"
|
||||
|
||||
section "Post-Cleanup Inodes"
|
||||
df -i 2>&1 | tee -a "$LOG_FILE" || warning "Unable to collect inode usage"
|
||||
|
||||
section "Critical Filesystem Check"
|
||||
while read -r fs size used avail pct mount; do
|
||||
percent="$(usage_percent_number "$pct")"
|
||||
if (( percent >= EMERGENCY_THRESHOLD )); then
|
||||
critical "$mount is still ${pct} full on $fs (size=$size used=$used avail=$avail)"
|
||||
exit_code=1
|
||||
elif (( percent >= CRIT_THRESHOLD )); then
|
||||
warning "$mount remains high at ${pct} on $fs"
|
||||
else
|
||||
ok "$mount is ${pct} full on $fs"
|
||||
fi
|
||||
done < <(df -P -h | awk 'NR > 1 { print $1, $2, $3, $4, $5, $6 }')
|
||||
|
||||
if [[ -n "$BEFORE_FILE" ]]; then
|
||||
section "Before And After Comparison"
|
||||
if [[ ! -f "$BEFORE_FILE" ]]; then
|
||||
warning "Before file not found: $BEFORE_FILE"
|
||||
else
|
||||
awk '
|
||||
NR == FNR && FNR > 1 {
|
||||
before[$6] = $5
|
||||
next
|
||||
}
|
||||
FNR > 1 {
|
||||
mount = $6
|
||||
if (mount in before) {
|
||||
before_pct = before[mount]
|
||||
after_pct = $5
|
||||
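gsub(/%/, "", before_pct)
gsub(/%/, "", after_pct)
# gsub leaves strings; add 0 so the comparison below is numeric ("9" < "85" is false as text).
before_pct += 0
after_pct += 0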
|
||||
|
||||
if (after_pct < before_pct) {
|
||||
status = "OK"
|
||||
result = "improved"
|
||||
} else if (after_pct == before_pct) {
|
||||
status = "WARNING"
|
||||
result = "unchanged"
|
||||
} else {
|
||||
status = "WARNING"
|
||||
result = "increased"
|
||||
}
|
||||
|
||||
printf "%s: %s before=%s after=%s (%s)\n", status, mount, before[mount], $5, result
|
||||
}
|
||||
}
|
||||
' "$BEFORE_FILE" <(df -P -h) | tee -a "$LOG_FILE"
|
||||
fi
|
||||
else
|
||||
warning "No --before-file supplied. Improvement comparison skipped."
|
||||
fi
|
||||
|
||||
if [[ "$exit_code" -eq 0 ]]; then
|
||||
ok "Post-check completed without emergency-threshold filesystems."
|
||||
else
|
||||
critical "One or more filesystems remain at or above ${EMERGENCY_THRESHOLD}%."
|
||||
fi
|
||||
|
||||
exit "$exit_code"
|
||||
@@ -0,0 +1,84 @@
|
||||
# Linux Disk Full Incident Toolkit
|
||||
|
||||
Production-style Bash toolkit for diagnosing and handling a disk full incident on Linux systems. It is intentionally conservative: default mode is safe, cleanup actions require `--execute` and an operator confirmation prompt, and the scripts do not assume root access.
|
||||
|
||||
## Why Disk Full Incidents Happen
|
||||
|
||||
- **Logs** - application, audit, system, or middleware logs can grow faster than rotation policy expects.
|
||||
- **Temporary files** - failed jobs, installers, archives, and batch workloads often leave large files in `/tmp`, `/var/tmp`, or application work directories.
|
||||
- **Deleted open files** - a process can keep writing to a file after it has been deleted, hiding disk usage from normal directory listings until the process closes the file.
|
||||
- **Inode exhaustion** - a filesystem can fail writes even when space is available if it has too many small files and no free inodes.
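
Two of these causes are easy to miss with `df -h` and `ls` alone. The sketch below illustrates them under stated assumptions: `inode_pressure` and `demo_deleted_open_file` are hypothetical helpers that are not part of the toolkit, and the default limit of 90 is an arbitrary choice.

```bash
#!/usr/bin/env bash
# Illustrative helpers only; the function names are hypothetical.

# Reads `df -P -i` output on stdin and prints mounts whose IUse% is at or
# above the limit (default 90, an assumed value).
inode_pressure() {
  local limit="${1:-90}"
  awk -v limit="$limit" '
    NR > 1 && $5 ~ /^[0-9]+%$/ {
      pct = $5
      sub(/%/, "", pct)
      if (pct + 0 >= limit + 0) printf "%s %s%%\n", $6, pct
    }
  '
}

# Shows that unlinking a file does not release it while a descriptor is open.
demo_deleted_open_file() {
  local tmp
  tmp="$(mktemp)"
  exec 9>"$tmp"              # keep file descriptor 9 open on the file
  printf 'still writing\n' >&9
  rm -f -- "$tmp"            # the directory entry is gone, the inode is not
  if [ -e "$tmp" ]; then
    printf 'path still exists\n'
  else
    printf 'path removed, descriptor 9 still open\n'
  fi
  exec 9>&-                  # closing the descriptor releases the space
}
```

On a live host the first helper would typically be fed with `df -P -i | inode_pressure 90`. The second helper only touches a temporary file that it creates itself.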
|
||||
|
||||
## Safety Model
|
||||
|
||||
- Safe dry-run behavior is the default.
|
||||
- No script blindly deletes files.
|
||||
- Cleanup operations require `--execute` and confirmation.
|
||||
- Missing optional commands are reported as `WARNING`.
|
||||
- Output is formatted with `OK`, `WARNING`, and `CRITICAL` for incident notes.
|
||||
- The scripts are designed to work without root, while warning when permissions may limit visibility.
|
||||
|
||||
## Scripts
|
||||
|
||||
- `00_env.sh` - shared configuration and helper functions.
|
||||
- `01_disk_overview.sh` - `df -h`, `df -i`, sorted mount usage, and threshold highlights.
|
||||
- `02_find_big_files.sh` - read-only largest-file discovery.
|
||||
- `03_deleted_open_files.sh` - deleted but open file detection with `lsof` when available.
|
||||
- `04_top_dirs.sh` - largest directory discovery with `du`.
|
||||
- `05_log_cleanup.sh` - safe log cleanup analysis and optional old rotated log removal.
|
||||
- `06_quick_fix.sh` - defensive emergency actions for verified truncation or service restart.
|
||||
- `07_postcheck.sh` - validation after cleanup, with optional before/after comparison.
|
||||
- `disk_full_runbook.sh` - guided incident workflow.
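
The before/after comparison in `07_postcheck.sh` reads the mount point and `Use%` columns from a saved file and compares them with live `df -P -h` output, so a baseline in the same format is needed. A minimal sketch for capturing one, assuming a hypothetical helper name and default file location:

```bash
#!/usr/bin/env bash
# Hypothetical helper: save a `df -P -h` snapshot for a later comparison.
# The default path under /tmp is an assumption for illustration.
capture_df_baseline() {
  local out="${1:-/tmp/df_before_$(date +%Y%m%d_%H%M%S).txt}"
  df -P -h > "$out"
  printf '%s\n' "$out"       # print the path so the caller can store it
}

# Before cleanup:
#   before_file="$(capture_df_baseline)"
# After cleanup:
#   ./07_postcheck.sh --before-file "$before_file"
```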
|
||||
|
||||
## Example Usage
|
||||
|
||||
```bash
cd infra-run/scripts/bash/disk-full

./01_disk_overview.sh
./02_find_big_files.sh --path /var --top 20
./03_deleted_open_files.sh
./04_top_dirs.sh --path /var --depth 2
./05_log_cleanup.sh
./07_postcheck.sh
```
|
||||
|
||||
Run the guided read-only workflow:
|
||||
|
||||
```bash
./disk_full_runbook.sh --path /var --top 20 --depth 2
```
|
||||
|
||||
Review old rotated logs without deleting them:
|
||||
|
||||
```bash
./05_log_cleanup.sh --path /var/log --days-old 14
```
|
||||
|
||||
Remove old rotated logs only after approval:
|
||||
|
||||
```bash
./05_log_cleanup.sh --path /var/log --days-old 14 --execute
```
|
||||
|
||||
Emergency truncation of a verified active log:
|
||||
|
||||
```bash
./06_quick_fix.sh --truncate-file /var/log/app/verified-large.log --execute
```
|
||||
|
||||
Restart a specific service after confirming it is holding deleted files open:
|
||||
|
||||
```bash
./06_quick_fix.sh --restart-service app.service --execute
```
|
||||
|
||||
## Exit Codes
|
||||
|
||||
- `0` - OK
|
||||
- `1` - operational issue detected or still critical
|
||||
- `2` - invalid input
|
||||
|
||||
## Production Warning
|
||||
|
||||
Use this toolkit as an incident aid, not an autopilot. Confirm the affected filesystem, application ownership, retention requirements, backup expectations, and change approval before cleanup. In enterprise environments, coordinate service restarts and file truncation with application owners because both can destroy evidence or interrupt production workloads.
|
||||
@@ -0,0 +1,67 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
SEARCH_PATH="/"
|
||||
TOP_N=20
|
||||
DEPTH=2
|
||||
EXECUTE=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--path <path>] [--top <N>] [--depth <N>] [--execute]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--path) SEARCH_PATH="${2:-}"; shift 2 ;;
|
||||
--top) TOP_N="${2:-}"; shift 2 ;;
|
||||
--depth) DEPTH="${2:-}"; shift 2 ;;
|
||||
--execute) EXECUTE=true; DRY_RUN=false; shift ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
section "Disk Full Incident Workflow"
|
||||
cat <<FLOW | tee -a "$LOG_FILE"
|
||||
Step 1. Disk overview
|
||||
$SCRIPT_DIR/01_disk_overview.sh
|
||||
|
||||
Step 2. Find largest files
|
||||
$SCRIPT_DIR/02_find_big_files.sh --path "$SEARCH_PATH" --top "$TOP_N"
|
||||
|
||||
Step 3. Check deleted but open files
|
||||
$SCRIPT_DIR/03_deleted_open_files.sh
|
||||
|
||||
Step 4. Identify top directories
|
||||
$SCRIPT_DIR/04_top_dirs.sh --path "$SEARCH_PATH" --depth "$DEPTH"
|
||||
|
||||
Step 5. Review safe log cleanup suggestions
|
||||
$SCRIPT_DIR/05_log_cleanup.sh
|
||||
|
||||
Step 6. Optional emergency quick fix, only after approval
|
||||
$SCRIPT_DIR/06_quick_fix.sh --truncate-file /path/to/verified.log --execute
|
||||
$SCRIPT_DIR/06_quick_fix.sh --restart-service service-name --execute
|
||||
|
||||
Step 7. Post-check
|
||||
$SCRIPT_DIR/07_postcheck.sh
|
||||
FLOW
|
||||
|
||||
if [[ "$EXECUTE" == "true" ]]; then
|
||||
warning "--execute was supplied to the runbook. Destructive actions are still not run automatically."
|
||||
fi
|
||||
|
||||
section "Running Read-Only Incident Checks"
|
||||
"$SCRIPT_DIR/01_disk_overview.sh" || warning "Disk overview reported critical usage"
|
||||
"$SCRIPT_DIR/02_find_big_files.sh" --path "$SEARCH_PATH" --top "$TOP_N" || warning "Large-file scan reported an issue"
|
||||
"$SCRIPT_DIR/03_deleted_open_files.sh" || warning "Deleted-open-file check reported an issue"
|
||||
"$SCRIPT_DIR/04_top_dirs.sh" --path "$SEARCH_PATH" --depth "$DEPTH" || warning "Top-directory scan reported an issue"
|
||||
"$SCRIPT_DIR/05_log_cleanup.sh" || warning "Log cleanup suggestion step reported an issue"
|
||||
|
||||
section "Next Manual Decision"
|
||||
ok "Review findings, identify owner and retention requirements, then run a targeted cleanup script with --execute only if approved."
|
||||
Executable
@@ -0,0 +1,29 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
threshold="${1:-80}"
|
||||
|
||||
if [[ ! "$threshold" =~ ^[1-9][0-9]{0,2}$ ]] || (( threshold > 100 )); then
|
||||
printf 'CRITICAL: invalid threshold "%s"; provide an integer from 1 to 100\n' "$threshold" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
status=0
|
||||
warning_threshold=$(( threshold > 5 ? threshold - 5 : threshold ))
|
||||
|
||||
while read -r filesystem size used avail use_percent mountpoint; do
|
||||
usage="${use_percent%\%}"
|
||||
|
||||
if (( usage >= threshold )); then
|
||||
printf 'CRITICAL: %s mounted on %s is %s used; threshold is %s%% (%s free)\n' "$filesystem" "$mountpoint" "$use_percent" "$threshold" "$avail"
|
||||
status=1
|
||||
elif (( usage >= warning_threshold )); then
|
||||
printf 'WARNING: %s mounted on %s is %s used; threshold is %s%%\n' "$filesystem" "$mountpoint" "$use_percent" "$threshold"
|
||||
else
|
||||
printf 'OK: %s mounted on %s is %s used\n' "$filesystem" "$mountpoint" "$use_percent"
|
||||
fi
|
||||
done < <(df -P -x tmpfs -x devtmpfs | awk 'NR > 1 {print $1, $2, $3, $4, $5, $6}')
|
||||
|
||||
exit "$status"
|
||||
@@ -0,0 +1,114 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
TIMESTAMP="${TIMESTAMP:-$(date +%Y%m%d_%H%M%S)}"
|
||||
DRY_RUN="${DRY_RUN:-true}"
|
||||
LOG_FILE="${LOG_FILE:-/tmp/gpfs_extend_${TIMESTAMP}.log}"
|
||||
|
||||
FILESYSTEM="${FILESYSTEM:-}"
|
||||
NSD_STANZA="${NSD_STANZA:-}"
|
||||
FAILURE_GROUP="${FAILURE_GROUP:-}"
|
||||
STORAGE_POOL="${STORAGE_POOL:-system}"
|
||||
USAGE="${USAGE:-dataAndMetadata}"
|
||||
|
||||
log() {
|
||||
local level="$1"
|
||||
shift
|
||||
local message="$*"
|
||||
|
||||
printf '%s: %s\n' "$level" "$message" | tee -a "$LOG_FILE"
|
||||
}
|
||||
|
||||
ok() {
|
||||
log "OK" "$@"
|
||||
}
|
||||
|
||||
warning() {
|
||||
log "WARNING" "$@"
|
||||
}
|
||||
|
||||
critical() {
|
||||
log "CRITICAL" "$@"
|
||||
}
|
||||
|
||||
require_cmd() {
|
||||
local cmd="$1"
|
||||
|
||||
if command -v "$cmd" >/dev/null 2>&1; then
|
||||
ok "Command available: $cmd"
|
||||
return 0
|
||||
fi
|
||||
|
||||
critical "Required command not found: $cmd"
|
||||
return 1
|
||||
}
|
||||
|
||||
validate_gpfs_command() {
|
||||
local cmd="$1"
|
||||
|
||||
if command -v "$cmd" >/dev/null 2>&1; then
|
||||
return 0
|
||||
fi
|
||||
|
||||
warning "GPFS command not available, skipping: $cmd"
|
||||
return 1
|
||||
}
|
||||
|
||||
run_cmd() {
|
||||
if [[ "$#" -eq 0 ]]; then
|
||||
critical "run_cmd called without a command"
|
||||
return 2
|
||||
fi
|
||||
|
||||
if [[ "$DRY_RUN" == "true" ]]; then
|
||||
log "OK" "DRY-RUN: $*"
|
||||
return 0
|
||||
fi
|
||||
|
||||
log "OK" "RUN: $*"
|
||||
"$@" 2>&1 | tee -a "$LOG_FILE"
|
||||
}
|
||||
|
||||
run_readonly() {
|
||||
if [[ "$#" -eq 0 ]]; then
|
||||
critical "run_readonly called without a command"
|
||||
return 2
|
||||
fi
|
||||
|
||||
log "OK" "READ-ONLY: $*"
|
||||
"$@" 2>&1 | tee -a "$LOG_FILE"
|
||||
}
|
||||
|
||||
confirm_execute() {
|
||||
local target="${1:-GPFS change}"
|
||||
|
||||
if [[ "$DRY_RUN" == "true" ]]; then
|
||||
ok "Dry-run mode enabled. No changes will be made."
|
||||
return 0
|
||||
fi
|
||||
|
||||
warning "Execution mode requested for: $target"
|
||||
warning "Coordinate this change with storage, GPFS, application, and change-management teams."
|
||||
printf 'Type EXECUTE to continue: '
|
||||
read -r confirmation
|
||||
|
||||
if [[ "$confirmation" != "EXECUTE" ]]; then
|
||||
critical "Confirmation failed. Aborting."
|
||||
exit 1
|
||||
fi
|
||||
|
||||
ok "Execution confirmed by operator."
|
||||
}
|
||||
|
||||
usage_value_valid() {
|
||||
case "$1" in
|
||||
dataOnly|metadataOnly|dataAndMetadata) return 0 ;;
|
||||
*) return 1 ;;
|
||||
esac
|
||||
}
|
||||
|
||||
section() {
|
||||
printf '\n== %s ==\n' "$1" | tee -a "$LOG_FILE"
|
||||
}
|
||||
@@ -0,0 +1,37 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
run_optional() {
|
||||
local description="$1"
|
||||
shift
|
||||
|
||||
section "$description"
|
||||
if validate_gpfs_command "$1"; then
|
||||
run_readonly "$@" || warning "$description command failed"
|
||||
fi
|
||||
}
|
||||
|
||||
section "GPFS / Spectrum Scale Cluster Overview"
|
||||
ok "Log file: $LOG_FILE"
|
||||
|
||||
run_optional "GPFS daemon state on all nodes" mmgetstate -a
|
||||
run_optional "Cluster definition" mmlscluster
|
||||
run_optional "Cluster configuration" mmlsconfig
|
||||
run_optional "Managers and quorum information" mmlsmgr
|
||||
run_optional "NSD inventory" mmlsnsd
|
||||
run_optional "Disk inventory for all filesystems" mmlsdisk all
|
||||
run_optional "Filesystem definitions" mmlsfs all
|
||||
run_optional "Mount state for all filesystems" mmlsmount all
|
||||
|
||||
section "Mounted GPFS filesystems from df"
|
||||
if command -v df >/dev/null 2>&1; then
|
||||
df -h -t gpfs 2>/dev/null | tee -a "$LOG_FILE" || df -h | awk 'NR == 1 || /gpfs|mmfs/' | tee -a "$LOG_FILE"
|
||||
else
|
||||
warning "df command not available"
|
||||
fi
|
||||
@@ -0,0 +1,103 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s --fs <filesystem>\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--fs)
|
||||
FILESYSTEM="${2:-}"
|
||||
shift 2
|
||||
;;
|
||||
-h|--help)
|
||||
usage
|
||||
exit 0
|
||||
;;
|
||||
*)
|
||||
critical "Unknown argument: $1"
|
||||
usage
|
||||
exit 2
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [[ -z "$FILESYSTEM" ]]; then
|
||||
critical "Missing required --fs <filesystem>"
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
|
||||
missing=0
|
||||
for cmd in mmgetstate mmlscluster mmlsfs mmlsdisk mmlsmount mmlsmgr df; do
|
||||
require_cmd "$cmd" || missing=1
|
||||
done
|
||||
|
||||
if [[ "$missing" -ne 0 ]]; then
|
||||
exit 2
|
||||
fi
|
||||
|
||||
issues=0
|
||||
|
||||
section "GPFS daemon state"
|
||||
state_output="$(mmgetstate -a 2>&1 || true)"
|
||||
printf '%s\n' "$state_output" | tee -a "$LOG_FILE"
|
||||
if printf '%s\n' "$state_output" | awk '$1 ~ /^[0-9]+$/ { rows++; if ($NF != "active") found=1 } END { exit ((found || rows == 0) ? 0 : 1) }'; then
|
||||
warning "Not all GPFS nodes appear active"
|
||||
fi
|
||||
|
||||
section "Target filesystem definition"
|
||||
if mmlsfs "$FILESYSTEM" 2>&1 | tee -a "$LOG_FILE"; then
|
||||
ok "Filesystem exists: $FILESYSTEM"
|
||||
else
|
||||
critical "Filesystem does not exist or cannot be queried: $FILESYSTEM"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
section "Target filesystem mount state"
|
||||
mount_output="$(mmlsmount "$FILESYSTEM" 2>&1 || true)"
|
||||
printf '%s\n' "$mount_output" | tee -a "$LOG_FILE"
|
||||
if printf '%s\n' "$mount_output" | grep -Eiq 'not mounted|no file systems were found|not found'; then
|
||||
warning "Filesystem may not be mounted anywhere: $FILESYSTEM"
|
||||
fi
|
||||
|
||||
section "Existing disks"
|
||||
if ! mmlsdisk "$FILESYSTEM" 2>&1 | tee -a "$LOG_FILE"; then
|
||||
critical "Unable to list disks for filesystem: $FILESYSTEM"
|
||||
issues=1
|
||||
fi
|
||||
|
||||
section "Filesystem capacity"
|
||||
df -h 2>&1 | awk -v fs="$FILESYSTEM" 'NR == 1 || $0 ~ fs || $0 ~ /gpfs|mmfs/' | tee -a "$LOG_FILE"
|
||||
|
||||
section "Cluster health"
|
||||
if command -v mmhealth >/dev/null 2>&1; then
|
||||
health_output="$(mmhealth cluster show 2>&1 || true)"
|
||||
printf '%s\n' "$health_output" | tee -a "$LOG_FILE"
|
||||
if printf '%s\n' "$health_output" | grep -Eiq 'degraded|failed|down|error|unhealthy'; then
|
||||
warning "Cluster health output indicates a degraded condition"
|
||||
fi
|
||||
else
|
||||
warning "mmhealth command not available, skipping health check"
|
||||
fi
|
||||
|
||||
section "Managers and quorum"
|
||||
mmlsmgr 2>&1 | tee -a "$LOG_FILE" || {
|
||||
critical "Unable to query GPFS manager/quorum information"
|
||||
issues=1
|
||||
}
|
||||
|
||||
if [[ "$issues" -eq 0 ]]; then
|
||||
ok "Precheck completed for filesystem: $FILESYSTEM"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
critical "Precheck found operational validation failures"
|
||||
exit 1
|
||||
@@ -0,0 +1,83 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
EXCLUDE_MOUNTED=false
|
||||
EXCLUDE_EXISTING_NSD=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s [--exclude-mounted] [--exclude-existing-nsd]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--exclude-mounted)
|
||||
EXCLUDE_MOUNTED=true
|
||||
shift
|
||||
;;
|
||||
--exclude-existing-nsd)
|
||||
EXCLUDE_EXISTING_NSD=true
|
||||
shift
|
||||
;;
|
||||
-h|--help)
|
||||
usage
|
||||
exit 0
|
||||
;;
|
||||
*)
|
||||
critical "Unknown argument: $1"
|
||||
usage
|
||||
exit 2
|
||||
;;
|
||||
esac
|
||||
done
|
||||
|
||||
for cmd in lsblk findmnt; do
|
||||
require_cmd "$cmd" || exit 2
|
||||
done
|
||||
|
||||
warning "Candidate devices are not automatically safe. Confirm every device with the storage and cluster teams before use."
|
||||
|
||||
existing_gpfs_devices=""
|
||||
if [[ "$EXCLUDE_EXISTING_NSD" == "true" ]]; then
|
||||
if command -v mmlsnsd >/dev/null 2>&1; then
|
||||
existing_gpfs_devices="$(mmlsnsd -m 2>/dev/null || true)"
|
||||
elif command -v mmlsdisk >/dev/null 2>&1; then
|
||||
existing_gpfs_devices="$(mmlsdisk all 2>/dev/null || true)"
|
||||
else
|
||||
warning "mmlsnsd and mmlsdisk are unavailable; cannot exclude existing GPFS devices"
|
||||
fi
|
||||
fi
|
||||
|
||||
section "Block device inventory"
|
||||
lsblk -dpno NAME,TYPE,SIZE,MODEL,SERIAL,MOUNTPOINT 2>&1 | tee -a "$LOG_FILE"
|
||||
|
||||
section "Candidate devices"
|
||||
found=0
|
||||
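lsblk_field() {
    # Print one lsblk column for one device, with padding whitespace collapsed.
    lsblk -dno "$1" "$2" 2>/dev/null | awk '{ $1 = $1; print }' || true
}

while read -r name type; do
    [[ "$type" == "disk" ]] || continue

    # MODEL may be empty or contain spaces, so each column is queried on its
    # own instead of splitting one combined line on whitespace.
    size="$(lsblk_field SIZE "$name")"
    model="$(lsblk_field MODEL "$name")"
    serial="$(lsblk_field SERIAL "$name")"
    # Without -d, lsblk also reports mountpoints of partitions on the disk.
    mountpoint="$(lsblk -nro MOUNTPOINT "$name" 2>/dev/null | awk 'NF' | paste -sd, - || true)"

    if [[ "$EXCLUDE_MOUNTED" == "true" ]]; then
        if [[ -n "$mountpoint" ]] || findmnt -rn --source "$name" >/dev/null 2>&1; then
            continue
        fi
    fi

    if [[ "$EXCLUDE_EXISTING_NSD" == "true" ]] && [[ -n "$existing_gpfs_devices" ]]; then
        # -w avoids /dev/sdb matching /dev/sdbc or /dev/sdb1.
        if printf '%s\n' "$existing_gpfs_devices" | grep -Fqw "$name"; then
            continue
        fi
    fi

    printf 'OK: candidate=%s size=%s model=%s serial=%s mountpoint=%s\n' \
        "$name" "${size:-unknown}" "${model:-unknown}" "${serial:-unknown}" "${mountpoint:-none}" | tee -a "$LOG_FILE"
    found=1
done < <(lsblk -dpno NAME,TYPE)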
|
||||
|
||||
if [[ "$found" -eq 0 ]]; then
|
||||
warning "No candidate devices found with the selected filters"
|
||||
fi
|
||||
@@ -0,0 +1,76 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
DEVICES=""
|
||||
SERVERS=""
|
||||
OUTPUT=""
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s --fs <filesystem> --devices "/dev/sdb /dev/sdc" --servers "node1,node2" --failure-group <number> --pool <storage_pool> --usage <dataOnly|metadataOnly|dataAndMetadata> [--output <path>]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--fs) FILESYSTEM="${2:-}"; shift 2 ;;
|
||||
--devices) DEVICES="${2:-}"; shift 2 ;;
|
||||
--servers) SERVERS="${2:-}"; shift 2 ;;
|
||||
--failure-group) FAILURE_GROUP="${2:-}"; shift 2 ;;
|
||||
--pool) STORAGE_POOL="${2:-}"; shift 2 ;;
|
||||
--usage) USAGE="${2:-}"; shift 2 ;;
|
||||
--output) OUTPUT="${2:-}"; shift 2 ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [[ -z "$FILESYSTEM" || -z "$DEVICES" || -z "$SERVERS" || -z "$FAILURE_GROUP" || -z "$STORAGE_POOL" || -z "$USAGE" ]]; then
|
||||
critical "Missing required input"
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
|
||||
if ! [[ "$FAILURE_GROUP" =~ ^-?[0-9]+$ ]]; then
|
||||
critical "--failure-group must be an integer"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
if ! usage_value_valid "$USAGE"; then
|
||||
critical "--usage must be one of: dataOnly, metadataOnly, dataAndMetadata"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
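safe_fs="$(printf '%s' "$FILESYSTEM" | tr -c '[:alnum:]_' '_')"

# Build the default path from the sanitized name so a filesystem given as a
# device path (for example /dev/gpfs01) cannot inject directory separators.
if [[ -z "$OUTPUT" ]]; then
    OUTPUT="/tmp/gpfs_nsd_${safe_fs}_${TIMESTAMP}.stanza"
fi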
|
||||
|
||||
{
|
||||
printf '# Generated GPFS NSD stanza for filesystem %s\n' "$FILESYSTEM"
|
||||
printf '# Review with storage and cluster teams before use.\n\n'
|
||||
for device in $DEVICES; do
|
||||
if [[ "$device" != /dev/* ]]; then
|
||||
# Send the message to stderr: stdout is redirected into the stanza file here.
critical "Device must be an absolute /dev path: $device" >&2
|
||||
exit 2
|
||||
fi
|
||||
|
||||
device_base="$(basename "$device" | tr -c '[:alnum:]_' '_')"
|
||||
nsd_name="nsd_${safe_fs}_${device_base}"
|
||||
printf '%%nsd:\n'
|
||||
printf ' device=%s\n' "$device"
|
||||
printf ' nsd=%s\n' "$nsd_name"
|
||||
printf ' servers=%s\n' "$SERVERS"
|
||||
printf ' usage=%s\n' "$USAGE"
|
||||
printf ' failureGroup=%s\n' "$FAILURE_GROUP"
|
||||
printf ' pool=%s\n\n' "$STORAGE_POOL"
|
||||
done
|
||||
} > "$OUTPUT"
|
||||
|
||||
ok "Generated NSD stanza: $OUTPUT"
|
||||
warning "This script only writes a stanza file. It does not create NSDs or modify GPFS."
|
||||
@@ -0,0 +1,59 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s --fs <filesystem> --stanza <stanza_file> [--execute]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--fs) FILESYSTEM="${2:-}"; shift 2 ;;
|
||||
--stanza) NSD_STANZA="${2:-}"; shift 2 ;;
|
||||
--execute) DRY_RUN=false; shift ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [[ -z "$FILESYSTEM" || -z "$NSD_STANZA" ]]; then
|
||||
critical "Missing required --fs or --stanza"
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
|
||||
if [[ ! -r "$NSD_STANZA" ]]; then
|
||||
critical "Stanza file does not exist or is not readable: $NSD_STANZA"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
for cmd in mmlsfs mmcrnsd mmadddisk; do
|
||||
require_cmd "$cmd" || exit 2
|
||||
done
|
||||
|
||||
if ! mmlsfs "$FILESYSTEM" >/dev/null 2>&1; then
|
||||
critical "Filesystem does not exist or cannot be queried: $FILESYSTEM"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
warning "Adding NSDs must be coordinated with storage, GPFS, application, and change-management teams."
|
||||
section "Planned GPFS changes"
|
||||
ok "PLANNED: mmcrnsd -F $NSD_STANZA"
|
||||
ok "PLANNED: mmadddisk $FILESYSTEM -F $NSD_STANZA"
|
||||
|
||||
confirm_execute "create NSDs and add disks to $FILESYSTEM"
|
||||
|
||||
if [[ "$DRY_RUN" == "false" ]]; then
|
||||
run_cmd mmcrnsd -F "$NSD_STANZA"
|
||||
run_cmd mmadddisk "$FILESYSTEM" -F "$NSD_STANZA"
|
||||
|
||||
section "Post-add NSD inventory"
|
||||
mmlsnsd 2>&1 | tee -a "$LOG_FILE" || warning "mmlsnsd command failed after execution"
|
||||
section "Post-add filesystem disks"
|
||||
mmlsdisk "$FILESYSTEM" 2>&1 | tee -a "$LOG_FILE" || warning "mmlsdisk command failed after execution"
|
||||
fi
|
||||
@@ -0,0 +1,56 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
BACKGROUND=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s --fs <filesystem> [--execute] [--background]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--fs) FILESYSTEM="${2:-}"; shift 2 ;;
|
||||
--execute) DRY_RUN=false; shift ;;
|
||||
--background) BACKGROUND=true; shift ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [[ -z "$FILESYSTEM" ]]; then
|
||||
critical "Missing required --fs <filesystem>"
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
|
||||
for cmd in mmlsdisk mmrestripefs; do
|
||||
require_cmd "$cmd" || exit 2
|
||||
done
|
||||
|
||||
warning "Restripe/rebalance can be I/O intensive. Run only in an approved change window."
|
||||
|
||||
section "Current disk balance"
|
||||
mmlsdisk "$FILESYSTEM" 2>&1 | tee -a "$LOG_FILE" || warning "Unable to show current disk state"
|
||||
|
||||
section "Planned rebalance"
|
||||
if [[ "$BACKGROUND" == "true" ]]; then
|
||||
if [[ "$DRY_RUN" == "true" ]]; then
|
||||
ok "DRY-RUN: mmrestripefs $FILESYSTEM -b &"
|
||||
else
|
||||
confirm_execute "background restripe for $FILESYSTEM"
|
||||
ok "RUN: mmrestripefs $FILESYSTEM -b &"
|
||||
# nohup keeps the restripe running if the operator's terminal closes after this script exits.
nohup mmrestripefs "$FILESYSTEM" -b >>"$LOG_FILE" 2>&1 &
|
||||
fi
|
||||
else
|
||||
ok "PLANNED: mmrestripefs $FILESYSTEM -b"
|
||||
confirm_execute "restripe for $FILESYSTEM"
|
||||
if [[ "$DRY_RUN" == "false" ]]; then
|
||||
run_cmd mmrestripefs "$FILESYSTEM" -b
|
||||
fi
|
||||
fi
|
||||
@@ -0,0 +1,89 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s --fs <filesystem>\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--fs) FILESYSTEM="${2:-}"; shift 2 ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [[ -z "$FILESYSTEM" ]]; then
|
||||
critical "Missing required --fs <filesystem>"
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
|
||||
issues=0
|
||||
|
||||
run_check() {
|
||||
local description="$1"
|
||||
shift
|
||||
|
||||
section "$description"
|
||||
if command -v "$1" >/dev/null 2>&1; then
|
||||
"$@" 2>&1 | tee -a "$LOG_FILE" || {
|
||||
critical "$description failed"
|
||||
issues=1
|
||||
}
|
||||
else
|
||||
warning "$1 command not available, skipping"
|
||||
fi
|
||||
}
|
||||
|
||||
run_check "GPFS daemon state" mmgetstate -a
|
||||
run_check "Target filesystem mount state" mmlsmount "$FILESYSTEM"
|
||||
run_check "Target filesystem disks" mmlsdisk "$FILESYSTEM"
|
||||
run_check "NSD inventory" mmlsnsd
|
||||
|
||||
section "Filesystem capacity"
|
||||
if command -v df >/dev/null 2>&1; then
|
||||
df -h 2>&1 | awk -v fs="$FILESYSTEM" 'NR == 1 || $0 ~ fs || $0 ~ /gpfs|mmfs/' | tee -a "$LOG_FILE"
|
||||
else
|
||||
warning "df command not available, skipping"
|
||||
fi
|
||||
|
||||
section "Cluster health"
|
||||
if command -v mmhealth >/dev/null 2>&1; then
|
||||
health_output="$(mmhealth cluster show 2>&1 || true)"
|
||||
printf '%s\n' "$health_output" | tee -a "$LOG_FILE"
|
||||
if printf '%s\n' "$health_output" | grep -Eiq 'degraded|failed|down|error|unhealthy'; then
|
||||
critical "Cluster health output indicates an issue"
|
||||
issues=1
|
||||
fi
|
||||
else
|
||||
warning "mmhealth command not available, skipping"
|
||||
fi
|
||||
|
||||
section "Recent GPFS journal entries"
|
||||
if command -v journalctl >/dev/null 2>&1; then
|
||||
journalctl -u 'gpfs*' -n 50 --no-pager 2>&1 | tee -a "$LOG_FILE" || warning "journalctl GPFS query failed"
|
||||
else
|
||||
warning "journalctl command not available, skipping"
|
||||
fi
|
||||
|
||||
section "Recent kernel messages"
|
||||
if command -v dmesg >/dev/null 2>&1; then
|
||||
dmesg -T 2>/dev/null | tail -50 | tee -a "$LOG_FILE" || warning "dmesg query failed"
|
||||
else
|
||||
warning "dmesg command not available, skipping"
|
||||
fi
|
||||
|
||||
if [[ "$issues" -eq 0 ]]; then
|
||||
ok "Post-check completed without detected operational failures"
|
||||
exit 0
|
||||
fi
|
||||
|
||||
critical "Post-check detected one or more issues"
|
||||
exit 1
|
||||
@@ -0,0 +1,78 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
REPORT_FILE=""
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s --fs <filesystem>\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--fs) FILESYSTEM="${2:-}"; shift 2 ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if [[ -z "$FILESYSTEM" ]]; then
|
||||
critical "Missing required --fs <filesystem>"
|
||||
usage
|
||||
exit 2
|
||||
fi
|
||||
|
||||
REPORT_FILE="/tmp/gpfs_extend_report_$(printf '%s' "$FILESYSTEM" | tr -c '[:alnum:]_' '_')_${TIMESTAMP}.txt"
|
||||
|
||||
append_section() {
|
||||
local title="$1"
|
||||
shift
|
||||
|
||||
{
|
||||
printf '\n== %s ==\n' "$title"
|
||||
if command -v "$1" >/dev/null 2>&1; then
|
||||
"$@" 2>&1 || printf 'WARNING: command failed: %s\n' "$*"
|
||||
else
|
||||
printf 'WARNING: command not available: %s\n' "$1"
|
||||
fi
|
||||
} >> "$REPORT_FILE"
|
||||
}
|
||||
|
||||
{
|
||||
printf 'GPFS / Spectrum Scale Filesystem Expansion Report\n'
|
||||
printf 'Hostname: %s\n' "$(hostname 2>/dev/null || printf 'unknown')"
|
||||
printf 'Date: %s\n' "$(date)"
|
||||
printf 'Target filesystem: %s\n' "$FILESYSTEM"
|
||||
} > "$REPORT_FILE"
|
||||
|
||||
append_section "GPFS daemon state" mmgetstate -a
|
||||
append_section "Cluster definition" mmlscluster
|
||||
append_section "Managers and quorum" mmlsmgr
|
||||
append_section "Target filesystem mount state" mmlsmount "$FILESYSTEM"
|
||||
append_section "Target filesystem disks" mmlsdisk "$FILESYSTEM"
|
||||
append_section "NSD inventory" mmlsnsd
|
||||
append_section "Filesystem capacity" df -h
|
||||
|
||||
if command -v mmhealth >/dev/null 2>&1; then
|
||||
append_section "Cluster health" mmhealth cluster show
|
||||
else
|
||||
printf '\n== Cluster health ==\nWARNING: mmhealth command not available\n' >> "$REPORT_FILE"
|
||||
fi
|
||||
|
||||
if command -v journalctl >/dev/null 2>&1; then
|
||||
append_section "Recent GPFS journal entries" journalctl -u 'gpfs*' -n 50 --no-pager
|
||||
fi
|
||||
|
||||
if command -v dmesg >/dev/null 2>&1; then
|
||||
{
|
||||
printf '\n== Recent kernel messages ==\n'
|
||||
dmesg -T 2>/dev/null | tail -50 || printf 'WARNING: dmesg query failed\n'
|
||||
} >> "$REPORT_FILE"
|
||||
fi
|
||||
|
||||
ok "Generated report: $REPORT_FILE"
|
||||
@@ -0,0 +1,136 @@
|
||||
# GPFS / IBM Spectrum Scale Filesystem Expansion Toolkit
|
||||
|
||||
Safe, sanitized Bash examples for planning and executing a GPFS / IBM Spectrum Scale filesystem expansion. The scripts are written as portfolio-grade operational tooling for a Linux Infrastructure Engineer: conservative defaults, clear validation, dry-run behavior, and explicit operator confirmation before changes.
|
||||
|
||||
These scripts are examples. Exact GPFS commands, flags, quorum practices, failure-group design, and storage naming standards vary by Spectrum Scale version and site policy.
|
||||
|
||||
## Concepts
|
||||
|
||||
- **Cluster** - the Spectrum Scale administrative domain containing the nodes, daemon configuration, quorum policy, filesystems, and NSDs.
|
||||
- **Node** - a server participating in the GPFS cluster. Nodes may be clients, NSD servers, quorum nodes, manager-capable nodes, or a mix of roles.
|
||||
- **Quorum** - the voting mechanism that protects the cluster from split-brain conditions. Expansion work should not proceed during quorum instability.
|
||||
- **Filesystem** - the GPFS namespace and data layout presented to clients, backed by one or more NSDs.
|
||||
- **NSD** - Network Shared Disk, the GPFS abstraction for a disk or LUN that is served to the cluster.
|
||||
- **Failure group** - a placement hint that tells GPFS which disks share a failure domain, such as an enclosure, rack, site, controller pair, or storage array.
|
||||
- **Storage pool** - a named pool of NSDs used for placement and lifecycle policy, commonly `system` plus optional data pools.
|
||||
- **Restripe/rebalance** - the operation that redistributes data after disks are added. It can be I/O intensive and should run only in an approved change window.
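
To make the NSD, failure group, and storage pool terms concrete, the sketch below prints one stanza entry with the same fields that `04_create_nsd_stanza.sh` writes. The function name `print_nsd_stanza` and the sample values (`/dev/sdb`, `gpfsnsd01`, failure group `10`) are illustrative assumptions, not site standards.

```bash
# Hypothetical helper (not part of the toolkit): print one NSD stanza entry.
# Arguments: device, NSD name, server list, usage, failure group, pool.
print_nsd_stanza() {
    local device="$1" nsd="$2" servers="$3" usage="$4" failure_group="$5" pool="$6"

    printf '%%nsd:\n'
    printf '  device=%s\n' "$device"
    printf '  nsd=%s\n' "$nsd"
    printf '  servers=%s\n' "$servers"
    printf '  usage=%s\n' "$usage"
    # Disks that share an enclosure, rack, or array get the same failure group.
    printf '  failureGroup=%s\n' "$failure_group"
    printf '  pool=%s\n' "$pool"
}

# Placeholder values for illustration only.
print_nsd_stanza /dev/sdb nsd_gpfs01_sdb "gpfsnsd01,gpfsnsd02" dataAndMetadata 10 system
```

The function only prints text, so it is safe to run on any host.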
|
||||
|
||||
## Required Tools
|
||||
|
||||
Common GPFS / Spectrum Scale tools expected in production include:
|
||||
|
||||
- `mmgetstate`
|
||||
- `mmlscluster`
|
||||
- `mmlsfs`
|
||||
- `mmlsdisk`
- `mmlsmount`
- `mmlsmgr`
|
||||
- `mmlsnsd`
|
||||
- `mmcrnsd`
|
||||
- `mmadddisk`
|
||||
- `mmrestripefs`
|
||||
|
||||
The toolkit also uses common Linux tools such as `df`, `lsblk`, `findmnt`, `journalctl`, and `dmesg` where available. Missing optional commands are reported as `WARNING` and skipped.
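
A quick way to see which of these tools a host provides is a `command -v` loop. The sketch below is a minimal illustration; the function name `check_tools` is hypothetical, and the toolkit's own helper for this is `require_cmd` in `00_env.sh`.

```bash
# Hypothetical helper (not part of the toolkit): report which commands are on PATH.
# Prints one OK/WARNING line per command and returns 1 if any command is missing.
check_tools() {
    local missing=0 cmd

    for cmd in "$@"; do
        if command -v "$cmd" >/dev/null 2>&1; then
            printf 'OK: Command available: %s\n' "$cmd"
        else
            printf 'WARNING: Command not found: %s\n' "$cmd"
            missing=1
        fi
    done

    return "$missing"
}

# Read-only check; nothing is executed beyond the PATH lookup.
check_tools mmgetstate mmlsfs mmlsdisk df lsblk \
    || printf 'WARNING: some tools are missing on this host\n'
```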
|
||||
|
||||
## Safety Model
|
||||
|
||||
- Default mode is dry-run.
|
||||
- Real GPFS modifications require `--execute`.
|
||||
- Destructive or high-impact steps additionally require the operator to type `EXECUTE` at a confirmation prompt.
|
||||
- Disk detection is read-only and never partitions, formats, wipes, or modifies devices.
|
||||
- Device selection must always be confirmed with the storage team and cluster owners.
|
||||
- The scripts do not assume production disk names.
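
The two-layer gate described above (dry-run by default, then a typed confirmation) can be sketched in a few lines. This is a simplified illustration under assumed names: `gate_change` is hypothetical and folds together what the toolkit splits between `run_cmd` and `confirm_execute`.

```bash
# Hypothetical helper (not part of the toolkit): run a command only when
# dry-run is disabled AND the operator types EXECUTE on standard input.
# Usage: gate_change <true|false> <command> [args...]
gate_change() {
    local dry_run="$1"
    shift

    if [[ "$dry_run" != "false" ]]; then
        printf 'OK: DRY-RUN: %s\n' "$*"
        return 0
    fi

    local answer=""
    printf 'Type EXECUTE to continue: ' >&2
    read -r answer || true

    if [[ "$answer" != "EXECUTE" ]]; then
        printf 'CRITICAL: Confirmation failed. Not running: %s\n' "$*"
        return 1
    fi

    printf 'OK: RUN: %s\n' "$*"
    "$@"
}

# Dry-run is the safe default: this only prints the command.
gate_change true echo "simulated GPFS change"
```

Anything other than the literal string `false` is treated as dry-run, so a typo in the flag value fails safe.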
|
||||
|
||||
Output uses a consistent status format:
|
||||
|
||||
- `OK`
|
||||
- `WARNING`
|
||||
- `CRITICAL`
|
||||
|
||||
Exit codes:
|
||||
|
||||
- `0` - OK
|
||||
- `1` - operational validation failure
|
||||
- `2` - invalid input or missing requirement
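
A caller can use these exit codes to tell a failed validation apart from a usage error. The sketch below assumes a hypothetical helper name, `describe_exit`, and stands in a subshell for a real toolkit script.

```bash
# Hypothetical helper (not part of the toolkit): map an exit code to a status line.
describe_exit() {
    case "$1" in
        0) printf 'OK: step passed\n' ;;
        1) printf 'CRITICAL: operational validation failure\n' ;;
        2) printf 'CRITICAL: invalid input or missing requirement\n' ;;
        *) printf 'WARNING: unexpected exit code %s\n' "$1" ;;
    esac
}

# Capture the code with "|| rc=$?" so a caller running under errexit is not aborted.
# The subshell stands in for a call such as ./02_precheck_gpfs.sh --fs <filesystem>.
rc=0
( exit 2 ) || rc=$?
describe_exit "$rc"
```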
|
||||
|
||||
## Scripts
|
||||
|
||||
- `00_env.sh` - shared configuration and helper functions.
|
||||
- `01_cluster_overview.sh` - read-only cluster overview.
|
||||
- `02_precheck_gpfs.sh` - pre-expansion validation for a target filesystem.
|
||||
- `03_detect_new_disks.sh` - read-only candidate block-device discovery.
|
||||
- `04_create_nsd_stanza.sh` - generate an NSD stanza file.
|
||||
- `05_add_nsd_to_filesystem.sh` - create NSDs and add disks to a filesystem, dry-run by default.
|
||||
- `06_rebalance_filesystem.sh` - optional restripe/rebalance, dry-run by default.
|
||||
- `07_postcheck_gpfs.sh` - post-change validation.
|
||||
- `08_generate_report.sh` - text report for the change record.
|
||||
- `gpfs_extend_runbook.sh` - guided order of operations plus safe read-only checks.
|
||||
|
||||
## Example Workflow
|
||||
|
||||
```bash
|
||||
cd infra-run/scripts/bash/gpfs
|
||||
|
||||
./01_cluster_overview.sh
|
||||
./02_precheck_gpfs.sh --fs gpfs01
|
||||
./03_detect_new_disks.sh --exclude-mounted --exclude-existing-nsd
|
||||
|
||||
./04_create_nsd_stanza.sh \
|
||||
--fs gpfs01 \
|
||||
--devices "/dev/sdb /dev/sdc" \
|
||||
--servers "gpfsnsd01,gpfsnsd02" \
|
||||
--failure-group 10 \
|
||||
--pool system \
|
||||
--usage dataAndMetadata
|
||||
```
|
||||
|
||||
Review the generated stanza with the storage and cluster teams. Confirm device identity, LUN masking, multipath naming, failure group placement, and site standards before continuing.
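
Part of that review can be mechanical. The sketch below checks that every `%nsd:` entry carries the six fields that `04_create_nsd_stanza.sh` writes. The function name `check_stanza_fields` and the sample file are assumptions for illustration; the check says nothing about whether the values are correct for your site.

```bash
# Hypothetical reviewer aid (not part of the toolkit): verify that each %nsd
# entry in a stanza file has device, nsd, servers, usage, failureGroup, and pool.
check_stanza_fields() {
    awk '
        function report(   i) {
            if (entry == 0) return
            for (i = 1; i <= n; i++) {
                if (!(req[i] in seen)) {
                    printf "CRITICAL: entry %d is missing %s\n", entry, req[i]
                    bad = 1
                }
            }
        }
        BEGIN { n = split("device nsd servers usage failureGroup pool", req, " ") }
        /^[ \t]*#/ || /^[ \t]*$/ { next }
        /^%nsd:/ {
            report()            # close the previous entry, if any
            split("", seen)     # portable way to clear the array
            entry++
            sub(/^%nsd:[ \t]*/, "")
        }
        {
            for (f = 1; f <= NF; f++) {
                split($f, kv, "=")
                seen[kv[1]] = 1
            }
        }
        END {
            report()
            if (entry == 0) { print "CRITICAL: no %nsd entries found"; bad = 1 }
            exit (bad ? 1 : 0)
        }
    ' "$1"
}

# Illustrative stanza with placeholder values; pool is omitted on purpose.
stanza_file="$(mktemp)"
cat > "$stanza_file" <<'EOF'
%nsd:
  device=/dev/sdb
  nsd=nsd_gpfs01_sdb
  servers=gpfsnsd01,gpfsnsd02
  usage=dataAndMetadata
  failureGroup=10
EOF
check_stanza_fields "$stanza_file" || printf 'WARNING: stanza needs review\n'
rm -f "$stanza_file"
```

Because the sample entry omits `pool`, the check reports it and returns non-zero.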
|
||||
|
||||
Dry-run the add step:
|
||||
|
||||
```bash
|
||||
./05_add_nsd_to_filesystem.sh \
|
||||
--fs gpfs01 \
|
||||
--stanza /tmp/gpfs_nsd_gpfs01_YYYYmmdd_HHMMSS.stanza
|
||||
```
|
||||
|
||||
Execute only in an approved change window:
|
||||
|
||||
```bash
|
||||
./05_add_nsd_to_filesystem.sh \
|
||||
--fs gpfs01 \
|
||||
--stanza /tmp/gpfs_nsd_gpfs01_YYYYmmdd_HHMMSS.stanza \
|
||||
--execute
|
||||
```
|
||||
|
||||
Optional rebalance:
|
||||
|
||||
```bash
|
||||
# Dry run (default): only prints the planned restripe command.
./06_rebalance_filesystem.sh --fs gpfs01
|
||||
# Approved change window: confirm at the prompt, then restripe in the background.
./06_rebalance_filesystem.sh --fs gpfs01 --execute --background
|
||||
```
|
||||
|
||||
Post-check and report:
|
||||
|
||||
```bash
|
||||
./07_postcheck_gpfs.sh --fs gpfs01
|
||||
./08_generate_report.sh --fs gpfs01
|
||||
```
|
||||
|
||||
Runbook helper:
|
||||
|
||||
```bash
|
||||
./gpfs_extend_runbook.sh \
|
||||
--fs gpfs01 \
|
||||
--devices "/dev/sdb /dev/sdc" \
|
||||
--servers "gpfsnsd01,gpfsnsd02" \
|
||||
--failure-group 10 \
|
||||
--pool system \
|
||||
--usage dataAndMetadata
|
||||
```
|
||||
|
||||
## Operational Notes
|
||||
|
||||
- Do not run these scripts blindly on production clusters.
|
||||
- Confirm disk and multipath identity with the storage team before creating NSDs.
|
||||
- Validate quorum and manager health before expansion.
|
||||
- Confirm application I/O risk and rollback procedures before `mmadddisk` or `mmrestripefs`.
|
||||
- Confirm the Spectrum Scale version and local standards for stanza fields before executing changes.
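
As one way to validate node state before expansion, the sketch below lists nodes whose state column is not `active`. It assumes `mmgetstate -a` data rows start with a numeric node number and end with the state; the function name `inactive_nodes` is hypothetical, and the sample text is written for illustration, not captured from a real cluster. Verify the column layout against your Spectrum Scale version.

```bash
# Hypothetical helper (not part of the toolkit): read mmgetstate-style text on
# stdin and print every node whose last column is not "active".
# Header, separator, and blank lines are skipped because they do not start
# with a node number.
inactive_nodes() {
    awk '$1 ~ /^[0-9]+$/ && $NF != "active" { print $2 " (" $NF ")" }'
}

# Illustrative sample only; real output may differ by version.
sample='
 Node number  Node name   GPFS state
-------------------------------------
       1      gpfsnsd01   active
       2      gpfsnsd02   down
'
printf '%s\n' "$sample" | inactive_nodes
```

On a cluster node the same filter could be fed with `mmgetstate -a | inactive_nodes`.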
|
||||
@@ -0,0 +1,94 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
. "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
DEVICES=""
|
||||
SERVERS=""
|
||||
EXECUTE=false
|
||||
|
||||
usage() {
|
||||
printf 'Usage: %s --fs <filesystem> --devices "/dev/sdb /dev/sdc" --servers "node1,node2" --failure-group <number> --pool <storage_pool> --usage <dataOnly|metadataOnly|dataAndMetadata> [--execute]\n' "$(basename "$0")"
|
||||
}
|
||||
|
||||
while [[ "$#" -gt 0 ]]; do
|
||||
case "$1" in
|
||||
--fs) FILESYSTEM="${2:-}"; shift 2 ;;
|
||||
--devices) DEVICES="${2:-}"; shift 2 ;;
|
||||
--servers) SERVERS="${2:-}"; shift 2 ;;
|
||||
--failure-group) FAILURE_GROUP="${2:-}"; shift 2 ;;
|
||||
--pool) STORAGE_POOL="${2:-}"; shift 2 ;;
|
||||
--usage) USAGE="${2:-}"; shift 2 ;;
|
||||
--execute) EXECUTE=true; DRY_RUN=false; shift ;;
|
||||
-h|--help) usage; exit 0 ;;
|
||||
*) critical "Unknown argument: $1"; usage; exit 2 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
section "Recommended GPFS Expansion Flow"
|
||||
cat <<FLOW
|
||||
Step 1: Cluster overview
|
||||
$SCRIPT_DIR/01_cluster_overview.sh
|
||||
|
||||
Step 2: GPFS precheck
|
||||
$SCRIPT_DIR/02_precheck_gpfs.sh --fs <filesystem>
|
||||
|
||||
Step 3: Detect candidate disks
|
||||
$SCRIPT_DIR/03_detect_new_disks.sh --exclude-mounted --exclude-existing-nsd
|
||||
|
||||
Step 4: Generate NSD stanza
|
||||
$SCRIPT_DIR/04_create_nsd_stanza.sh --fs <filesystem> --devices "/dev/sdb /dev/sdc" --servers "node1,node2" --failure-group <number> --pool <storage_pool> --usage <usage>
|
||||
|
||||
Step 5: Create NSDs and add disks to filesystem
|
||||
$SCRIPT_DIR/05_add_nsd_to_filesystem.sh --fs <filesystem> --stanza <stanza_file> [--execute]
|
||||
|
||||
Step 6: Optional restripe/rebalance
|
||||
$SCRIPT_DIR/06_rebalance_filesystem.sh --fs <filesystem> [--execute] [--background]
|
||||
|
||||
Step 7: Post-check
|
||||
$SCRIPT_DIR/07_postcheck_gpfs.sh --fs <filesystem>
|
||||
|
||||
Step 8: Generate report
|
||||
$SCRIPT_DIR/08_generate_report.sh --fs <filesystem>
|
||||
FLOW
|
||||
|
||||
if [[ -z "$FILESYSTEM" ]]; then
|
||||
warning "No --fs supplied. Printed runbook only."
|
||||
exit 0
|
||||
fi
|
||||
|
||||
if [[ "$EXECUTE" == "true" ]]; then
|
||||
warning "--execute was supplied. Destructive steps still require the individual script confirmation prompt."
|
||||
else
|
||||
DRY_RUN=true
|
||||
fi
|
||||
|
||||
section "Running Safe Read-Only Steps"
|
||||
"$SCRIPT_DIR/01_cluster_overview.sh" || warning "Cluster overview reported warnings or failures"
|
||||
"$SCRIPT_DIR/02_precheck_gpfs.sh" --fs "$FILESYSTEM" || warning "Precheck reported warnings or failures"
|
||||
"$SCRIPT_DIR/03_detect_new_disks.sh" --exclude-mounted --exclude-existing-nsd || warning "Disk detection reported warnings or failures"
|
||||
|
||||
if [[ -n "$DEVICES" || -n "$SERVERS" || -n "$FAILURE_GROUP" ]]; then
|
||||
if [[ -z "$DEVICES" || -z "$SERVERS" || -z "$FAILURE_GROUP" ]]; then
|
||||
warning "NSD stanza generation requires --devices, --servers, --failure-group, --pool, and --usage"
|
||||
else
|
||||
"$SCRIPT_DIR/04_create_nsd_stanza.sh" \
|
||||
--fs "$FILESYSTEM" \
|
||||
--devices "$DEVICES" \
|
||||
--servers "$SERVERS" \
|
||||
--failure-group "$FAILURE_GROUP" \
|
||||
--pool "$STORAGE_POOL" \
|
||||
--usage "$USAGE"
|
||||
fi
|
||||
fi
|
||||
|
||||
section "Next Manual Step"
|
||||
if [[ "$EXECUTE" == "true" ]]; then
|
||||
warning "Run 05_add_nsd_to_filesystem.sh manually with --execute after reviewing the generated stanza."
|
||||
else
|
||||
ok "Review outputs and generated stanza. Add disks only through 05_add_nsd_to_filesystem.sh with --execute."
|
||||
fi
|
||||
Executable
@@ -0,0 +1,68 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
section() {
|
||||
printf '\n== %s ==\n' "$1"
|
||||
}
|
||||
|
||||
run_or_warn() {
|
||||
local description="$1"
|
||||
shift
|
||||
|
||||
if command -v "$1" >/dev/null 2>&1; then
|
||||
"$@" || printf 'WARNING: %s command failed\n' "$description"
|
||||
else
|
||||
printf 'WARNING: %s command not available\n' "$1"
|
||||
fi
|
||||
}
|
||||
|
||||
top_processes() {
|
||||
local sort_key="$1"
|
||||
|
||||
if command -v ps >/dev/null 2>&1; then
|
||||
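# head may close the pipe early; under pipefail the resulting SIGPIPE would abort the script
ps -eo pid,ppid,comm,%cpu,%mem --sort="$sort_key" | head -n 11 || true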
|
||||
else
|
||||
printf 'WARNING: ps command not available\n'
|
||||
fi
|
||||
}
|
||||
|
||||
section "Host"
|
||||
hostname
|
||||
uptime
|
||||
|
||||
section "OS"
|
||||
if [[ -r /etc/os-release ]]; then
|
||||
. /etc/os-release
|
||||
printf '%s\n' "${PRETTY_NAME:-Unknown Linux}"
|
||||
else
|
||||
printf 'WARNING: /etc/os-release not readable\n'
|
||||
fi
|
||||
uname -r
|
||||
|
||||
section "CPU Load"
|
||||
if [[ -r /proc/loadavg ]]; then
|
||||
awk '{print "1m="$1, "5m="$2, "15m="$3}' /proc/loadavg
|
||||
else
|
||||
uptime
|
||||
fi
|
||||
|
||||
section "Memory"
|
||||
run_or_warn "memory usage" free -h
|
||||
|
||||
section "Disk"
|
||||
run_or_warn "disk usage" df -h -x tmpfs -x devtmpfs
|
||||
|
||||
section "Failed systemd Services"
|
||||
if command -v systemctl >/dev/null 2>&1; then
|
||||
systemctl --failed --no-pager || true
|
||||
else
|
||||
printf 'WARNING: systemctl command not available\n'
|
||||
fi
|
||||
|
||||
section "Top CPU Processes"
|
||||
top_processes "-%cpu"
|
||||
|
||||
section "Top Memory Processes"
|
||||
top_processes "-%mem"
|
||||
@@ -0,0 +1,148 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
target="${1:-}"
|
||||
status=0
|
||||
warnings=()
|
||||
criticals=()
|
||||
|
||||
section() {
|
||||
printf '\n[%s]\n' "$1"
|
||||
}
|
||||
|
||||
warn() {
|
||||
warnings+=("$1")
|
||||
printf 'WARNING: %s\n' "$1"
|
||||
}
|
||||
|
||||
critical() {
|
||||
criticals+=("$1")
|
||||
status=1
|
||||
printf 'CRITICAL: %s\n' "$1"
|
||||
}
|
||||
|
||||
have() {
|
||||
command -v "$1" >/dev/null 2>&1
|
||||
}
|
||||
|
||||
run_if_available() {
|
||||
local command_name="$1"
|
||||
shift
|
||||
|
||||
if have "$command_name"; then
|
||||
"$@" || warn "$command_name command failed"
|
||||
else
|
||||
warn "$command_name command not available"
|
||||
fi
|
||||
}
|
||||
|
||||
section "LOCAL NETWORK"
|
||||
if have ip; then
|
||||
ip addr || warn "ip addr command failed"
|
||||
printf '\nRouting table:\n'
|
||||
ip route || warn "ip route command failed"
|
||||
printf '\nDefault gateway:\n'
|
||||
if ! ip route show default; then
|
||||
critical "default route query failed"
|
||||
elif ! ip route show default | grep -q '^default '; then
|
||||
critical "default gateway not configured"
|
||||
fi
|
||||
else
|
||||
warn "ip command not available"
|
||||
fi
|
||||
|
||||
section "INTERFACES"
|
||||
active_interfaces=0
|
||||
if have ip; then
|
||||
ip -br link || warn "interface state query failed"
|
||||
active_interfaces="$(ip -br link 2>/dev/null | awk '$2 == "UP" && $1 != "lo" {count++} END {print count+0}')"
|
||||
if (( active_interfaces == 0 )); then
|
||||
critical "no active non-loopback interface detected"
|
||||
else
|
||||
printf 'OK: %s active non-loopback interface(s) detected\n' "$active_interfaces"
|
||||
fi
|
||||
else
|
||||
warn "cannot inspect interface state without ip command"
|
||||
fi
|
||||
|
||||
section "DNS"
|
||||
if [[ -r /etc/resolv.conf ]]; then
|
||||
cat /etc/resolv.conf
|
||||
else
|
||||
warn "/etc/resolv.conf not readable"
|
||||
fi
|
||||
|
||||
dns_target="${target:-localhost}"
|
||||
if have getent; then
|
||||
if getent hosts "$dns_target" >/dev/null 2>&1; then
|
||||
printf 'OK: DNS resolution succeeded for %s\n' "$dns_target"
|
||||
getent hosts "$dns_target"
|
||||
else
|
||||
critical "DNS resolution failed for ${dns_target}"
|
||||
fi
|
||||
elif have nslookup; then
|
||||
if nslookup "$dns_target"; then
|
||||
printf 'OK: DNS resolution succeeded for %s\n' "$dns_target"
|
||||
else
|
||||
critical "DNS resolution failed for ${dns_target}"
|
||||
fi
|
||||
else
|
||||
warn "no DNS lookup tool available"
|
||||
fi
|
||||
|
||||
section "CONNECTIVITY"
|
||||
if [[ -n "$target" ]]; then
|
||||
if have ping; then
|
||||
if ping -c 3 -W 2 "$target"; then
|
||||
printf 'OK: ping succeeded for %s\n' "$target"
|
||||
else
|
||||
critical "ping failed for ${target}"
|
||||
fi
|
||||
else
|
||||
warn "ping command not available"
|
||||
fi
|
||||
|
||||
run_if_available traceroute traceroute "$target"
|
||||
|
||||
if have nc; then
|
||||
if nc -vz -w 3 "$target" 443; then
|
||||
printf 'OK: TCP 443 reachable on %s\n' "$target"
|
||||
else
|
||||
critical "TCP 443 connectivity failed for ${target}"
|
||||
fi
|
||||
elif have curl; then
|
||||
if curl --head --silent --show-error --connect-timeout 5 "https://${target}" >/dev/null; then
|
||||
printf 'OK: HTTPS connectivity succeeded for %s\n' "$target"
|
||||
else
|
||||
critical "HTTPS connectivity failed for ${target}"
|
||||
fi
|
||||
else
|
||||
warn "no TCP connectivity test tool available (nc or curl)"
|
||||
fi
|
||||
else
|
||||
printf 'OK: no target provided; skipped remote connectivity checks\n'
|
||||
fi
|
||||
|
||||
section "PORTS"
|
||||
if have ss; then
|
||||
ss -tuln || warn "ss command failed"
|
||||
else
|
||||
warn "ss command not available"
|
||||
fi
|
||||
|
||||
section "SUMMARY"
|
||||
if (( ${#criticals[@]} > 0 )); then
|
||||
printf 'CRITICAL: %s issue(s) detected\n' "${#criticals[@]}"
|
||||
fi
|
||||
|
||||
if (( ${#warnings[@]} > 0 )); then
|
||||
printf 'WARNING: %s warning(s) detected\n' "${#warnings[@]}"
|
||||
fi
|
||||
|
||||
if (( status == 0 )); then
|
||||
printf 'OK: no obvious DNS or connectivity problems detected\n'
|
||||
fi
|
||||
|
||||
exit "$status"
|
||||
Executable
@@ -0,0 +1,60 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
services=("$@")
|
||||
|
||||
service_exists() {
|
||||
local service="$1"
|
||||
systemctl list-unit-files "${service}.service" --no-legend 2>/dev/null | awk '{print $1}' | grep -qx "${service}.service"
|
||||
}
|
||||
|
||||
pick_default_scheduler() {
|
||||
if service_exists cron; then
|
||||
printf 'cron'
|
||||
elif service_exists crond; then
|
||||
printf 'crond'
|
||||
else
|
||||
printf 'cron'
|
||||
fi
|
||||
}
|
||||
|
||||
pick_default_ssh() {
|
||||
if service_exists sshd; then
|
||||
printf 'sshd'
|
||||
elif service_exists ssh; then
|
||||
printf 'ssh'
|
||||
else
|
||||
printf 'sshd'
|
||||
fi
|
||||
}
|
||||
|
||||
if ! command -v systemctl >/dev/null 2>&1; then
|
||||
printf 'CRITICAL: systemctl command not available; cannot check services\n' >&2
|
||||
exit 1
|
||||
fi
|
||||
|
||||
if (( ${#services[@]} == 0 )); then
|
||||
services=("$(pick_default_ssh)" "$(pick_default_scheduler)")
|
||||
fi
|
||||
|
||||
status=0
|
||||
|
||||
for service in "${services[@]}"; do
|
||||
if ! service_exists "$service"; then
|
||||
printf 'CRITICAL: %s service not found\n' "$service"
|
||||
status=1
|
||||
continue
|
||||
fi
|
||||
|
||||
if systemctl is-active --quiet "$service"; then
|
||||
printf 'OK: %s is active\n' "$service"
|
||||
else
|
||||
state="$(systemctl is-active "$service" 2>/dev/null || true)"
|
||||
printf 'CRITICAL: %s is %s\n' "$service" "${state:-unknown}"
|
||||
status=1
|
||||
fi
|
||||
done
|
||||
|
||||
exit "$status"
|
||||
Executable
@@ -0,0 +1,81 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
host="$(hostname)"
|
||||
timestamp="$(date '+%Y-%m-%d_%H%M%S')"
|
||||
report="/tmp/system_report_${host}_${timestamp}.txt"
|
||||
|
||||
section() {
|
||||
printf '\n== %s ==\n' "$1"
|
||||
}
|
||||
|
||||
run_or_warn() {
|
||||
local description="$1"
|
||||
shift
|
||||
|
||||
if command -v "$1" >/dev/null 2>&1; then
|
||||
"$@" || printf 'WARNING: %s command failed\n' "$description"
|
||||
else
|
||||
printf 'WARNING: %s command not available\n' "$1"
|
||||
fi
|
||||
}
|
||||
|
||||
{
|
||||
section "Host"
|
||||
hostname
|
||||
|
||||
section "Date"
|
||||
date
|
||||
|
||||
section "Uptime"
|
||||
uptime
|
||||
|
||||
section "OS"
|
||||
if [[ -r /etc/os-release ]]; then
|
||||
. /etc/os-release
|
||||
printf '%s\n' "${PRETTY_NAME:-Unknown Linux}"
|
||||
else
|
||||
printf 'WARNING: /etc/os-release not readable\n'
|
||||
fi
|
||||
|
||||
section "Kernel"
|
||||
uname -r
|
||||
|
||||
section "CPU Load"
|
||||
if [[ -r /proc/loadavg ]]; then
|
||||
awk '{print "1m="$1, "5m="$2, "15m="$3}' /proc/loadavg
|
||||
else
|
||||
uptime
|
||||
fi
|
||||
|
||||
section "Memory"
|
||||
run_or_warn "memory usage" free -h
|
||||
|
||||
section "Disk"
|
||||
run_or_warn "disk usage" df -h -x tmpfs -x devtmpfs
|
||||
|
||||
section "Failed systemd Services"
|
||||
if command -v systemctl >/dev/null 2>&1; then
|
||||
systemctl --failed --no-pager || true
|
||||
else
|
||||
printf 'WARNING: systemctl command not available\n'
|
||||
fi
|
||||
|
||||
section "Listening Ports"
|
||||
if command -v ss >/dev/null 2>&1; then
|
||||
ss -tuln || printf 'WARNING: ss command failed\n'
|
||||
else
|
||||
printf 'WARNING: ss command not available\n'
|
||||
fi
|
||||
|
||||
section "Recent Kernel Messages"
|
||||
if command -v journalctl >/dev/null 2>&1; then
|
||||
journalctl -k -n 50 --no-pager || printf 'WARNING: journalctl kernel log query failed\n'
|
||||
else
|
||||
printf 'WARNING: journalctl command not available\n'
|
||||
fi
|
||||
} > "$report"
|
||||
|
||||
printf 'System report written to: %s\n' "$report"
|
||||
Executable
@@ -0,0 +1,238 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
DRY_RUN=true
|
||||
TIMESTAMP="$(date +%Y%m%d_%H%M%S)"
|
||||
LOG_FILE="${LOG_FILE:-/tmp/veritas_extend_${TIMESTAMP}.log}"
|
||||
|
||||
SERVICE_GROUP=""
|
||||
DISKGROUP=""
|
||||
VOLUME=""
|
||||
MOUNTPOINT=""
|
||||
SIZE=""
|
||||
DISKS=""
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
|
||||
log() {
|
||||
local level="${1:-INFO}"
|
||||
shift || true
|
||||
local message="$*"
|
||||
local line
|
||||
|
||||
line="$(printf '%s [%s] %s' "$(date '+%Y-%m-%d %H:%M:%S')" "$level" "$message")"
|
||||
printf '%s\n' "$line"
|
||||
printf '%s\n' "$line" >> "$LOG_FILE"
|
||||
}
|
||||
|
||||
ok() {
|
||||
log "OK" "$*"
|
||||
}
|
||||
|
||||
warning() {
|
||||
log "WARNING" "$*"
|
||||
}
|
||||
|
||||
critical() {
|
||||
log "CRITICAL" "$*"
|
||||
}
|
||||
|
||||
require_cmd() {
|
||||
local cmd="$1"
|
||||
|
||||
if ! command -v "$cmd" >/dev/null 2>&1; then
|
||||
critical "required command not found: $cmd"
|
||||
return 1
|
||||
fi
|
||||
|
||||
return 0
|
||||
}
|
||||
|
||||
run_cmd() {
|
||||
local description="$1"
|
||||
shift
|
||||
|
||||
if (( "$#" == 0 )); then
|
||||
critical "run_cmd called without a command"
|
||||
return 2
|
||||
fi
|
||||
|
||||
log "INFO" "$description"
|
||||
log "INFO" "command: $*"
|
||||
|
||||
if [[ "$DRY_RUN" == "true" ]]; then
|
||||
log "INFO" "DRY-RUN: command not executed"
|
||||
return 0
|
||||
fi
|
||||
|
||||
"$@" 2>&1 | tee -a "$LOG_FILE"
|
||||
}
|
||||
|
||||
confirm_execute() {
|
||||
local prompt="${1:-Type EXECUTE to continue with real changes}"
|
||||
local answer=""
|
||||
|
||||
if [[ "$DRY_RUN" == "true" ]]; then
|
||||
ok "dry-run mode active; confirmation not required"
|
||||
return 0
|
||||
fi
|
||||
|
||||
warning "real execution mode requested with --execute"
|
||||
warning "$prompt"
|
||||
printf 'Type EXECUTE to continue: '
|
||||
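# Treat EOF or a closed stdin as a failed confirmation instead of letting errexit abort silently
read -r answer || answer=""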
|
||||
|
||||
if [[ "$answer" != "EXECUTE" ]]; then
|
||||
critical "confirmation failed; no changes made"
|
||||
exit 2
|
||||
fi
|
||||
}
|
||||
|
||||
usage_common() {
|
||||
cat <<'USAGE'
|
||||
Common options:
|
||||
--sg <service_group>
|
||||
--dg <diskgroup>
|
||||
--vol <volume>
|
||||
--mount <mountpoint>
|
||||
--size <+SIZE>
|
||||
--disks "disk1 disk2"
|
||||
--execute
|
||||
--help
|
||||
USAGE
|
||||
}
|
||||
|
||||
parse_common_args() {
|
||||
while (( "$#" > 0 )); do
|
||||
case "$1" in
|
||||
--sg)
|
||||
if [[ -z "${2:-}" ]]; then
|
||||
critical "missing value for --sg"
|
||||
exit 2
|
||||
fi
|
||||
SERVICE_GROUP="${2:-}"
|
||||
shift 2
|
||||
;;
|
||||
--dg)
|
||||
if [[ -z "${2:-}" ]]; then
|
||||
critical "missing value for --dg"
|
||||
exit 2
|
||||
fi
|
||||
DISKGROUP="${2:-}"
|
||||
shift 2
|
||||
;;
|
||||
--vol)
|
||||
if [[ -z "${2:-}" ]]; then
|
||||
critical "missing value for --vol"
|
||||
exit 2
|
||||
fi
|
||||
VOLUME="${2:-}"
|
||||
shift 2
|
||||
;;
|
||||
--mount)
|
||||
if [[ -z "${2:-}" ]]; then
|
||||
critical "missing value for --mount"
|
||||
exit 2
|
||||
fi
|
||||
MOUNTPOINT="${2:-}"
|
||||
shift 2
|
||||
;;
|
||||
--size)
|
||||
if [[ -z "${2:-}" ]]; then
|
||||
critical "missing value for --size"
|
||||
exit 2
|
||||
fi
|
||||
SIZE="${2:-}"
|
||||
shift 2
|
||||
;;
|
||||
--disks)
|
||||
if [[ -z "${2:-}" ]]; then
|
||||
critical "missing value for --disks"
|
||||
exit 2
|
||||
fi
|
||||
DISKS="${2:-}"
|
||||
shift 2
|
||||
;;
|
||||
--execute)
|
||||
DRY_RUN=false
|
||||
shift
|
||||
;;
|
||||
--help|-h)
|
||||
usage_common
|
||||
exit 0
|
||||
;;
|
||||
*)
|
||||
critical "unknown argument: $1"
|
||||
usage_common
|
||||
exit 2
|
||||
;;
|
||||
esac
|
||||
done
|
||||
}
|
||||
|
||||
require_nonempty() {
|
||||
local value="$1"
|
||||
local name="$2"
|
||||
|
||||
if [[ -z "$value" ]]; then
|
||||
critical "missing required argument: $name"
|
||||
return 1
|
||||
fi
|
||||
|
||||
return 0
|
||||
}
|
||||
|
||||
require_inputs() {
|
||||
local failed=0
|
||||
local name
|
||||
|
||||
for name in "$@"; do
|
||||
case "$name" in
|
||||
sg) require_nonempty "$SERVICE_GROUP" "--sg" || failed=1 ;;
|
||||
dg) require_nonempty "$DISKGROUP" "--dg" || failed=1 ;;
|
||||
vol) require_nonempty "$VOLUME" "--vol" || failed=1 ;;
|
||||
mount) require_nonempty "$MOUNTPOINT" "--mount" || failed=1 ;;
|
||||
size) require_nonempty "$SIZE" "--size" || failed=1 ;;
|
||||
disks) require_nonempty "$DISKS" "--disks" || failed=1 ;;
|
||||
*) critical "internal error: unknown required input '$name'"; failed=1 ;;
|
||||
esac
|
||||
done
|
||||
|
||||
if (( failed != 0 )); then
|
||||
usage_common
|
||||
exit 2
|
||||
fi
|
||||
}
|
||||
|
||||
has_cmd() {
|
||||
command -v "$1" >/dev/null 2>&1
|
||||
}
|
||||
|
||||
capture_cmd() {
|
||||
local description="$1"
|
||||
shift
|
||||
|
||||
log "INFO" "$description"
|
||||
log "INFO" "command: $*"
|
||||
"$@" 2>&1 | tee -a "$LOG_FILE"
|
||||
}
|
||||
|
||||
disk_status_line() {
|
||||
local disk="$1"
|
||||
vxdisk list "$disk" 2>/dev/null | awk -F': *' '
|
||||
/device:/ {device=$2}
|
||||
/status:/ {status=$2}
|
||||
END {
|
||||
if (device != "" || status != "") {
|
||||
print device "|" status
|
||||
}
|
||||
}'
|
||||
}
|
||||
|
||||
vxprint_volume_device() {
|
||||
local dg="$1"
|
||||
local vol="$2"
|
||||
vxprint -g "$dg" -F '%device' "$vol" 2>/dev/null || true
|
||||
}
|
||||
@@ -0,0 +1,42 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
source "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
parse_common_args "$@"
|
||||
|
||||
missing=0
|
||||
for cmd in lsblk vxdisk vxdctl; do
|
||||
require_cmd "$cmd" || missing=1
|
||||
done
|
||||
|
||||
if (( missing != 0 )); then
|
||||
exit 2
|
||||
fi
|
||||
|
||||
ok "Veritas LUN discovery started"
|
||||
log "INFO" "log file: $LOG_FILE"
|
||||
capture_cmd "Current Linux block devices" lsblk
|
||||
capture_cmd "Current VxVM disks" vxdisk list
|
||||
|
||||
run_cmd "Refresh VxVM device discovery" vxdctl enable
|
||||
run_cmd "Scan disks known to VxVM" vxdisk scandisks
|
||||
|
||||
ok "Candidate VxVM disks with status 'online invalid'"
|
||||
candidate_count=0
|
||||
while read -r disk status rest; do
|
||||
if [[ "$status $rest" == *"online invalid"* ]]; then
|
||||
printf ' %s %s %s\n' "$disk" "$status" "$rest" | tee -a "$LOG_FILE"
|
||||
candidate_count=$(( candidate_count + 1 ))
|
||||
fi
|
||||
done < <(vxdisk list 2>/dev/null | awk 'NR > 1 {print $1, $4, $5, $6, $7, $8}')
|
||||
|
||||
if (( candidate_count == 0 )); then
|
||||
warning "no candidate disks detected with VxVM status 'online invalid'"
|
||||
else
|
||||
ok "detected $candidate_count candidate disk(s); review before initialization"
|
||||
fi
|
||||
@@ -0,0 +1,94 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
source "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
parse_common_args "$@"
|
||||
require_inputs sg dg vol mount
|
||||
|
||||
missing=0
|
||||
for cmd in hastatus hagrp hares vxdisk vxdg vxprint df findmnt; do
|
||||
require_cmd "$cmd" || missing=1
|
||||
done
|
||||
|
||||
if (( missing != 0 )); then
|
||||
exit 2
|
||||
fi
|
||||
|
||||
status=0
|
||||
ok "Precheck started for service group '$SERVICE_GROUP', diskgroup '$DISKGROUP', volume '$VOLUME'"
|
||||
log "INFO" "log file: $LOG_FILE"
|
||||
|
||||
if hastatus -sum >/dev/null 2>&1; then
|
||||
ok "VCS status is available"
|
||||
else
|
||||
critical "VCS does not appear to be running or hastatus failed"
|
||||
status=1
|
||||
fi
|
||||
|
||||
if hagrp -display "$SERVICE_GROUP" >/dev/null 2>&1; then
|
||||
ok "service group exists: $SERVICE_GROUP"
|
||||
else
|
||||
critical "service group not found: $SERVICE_GROUP"
|
||||
status=1
|
||||
fi
|
||||
|
||||
group_state="$(hagrp -state "$SERVICE_GROUP" 2>/dev/null || true)"
|
||||
printf '%s\n' "$group_state" | tee -a "$LOG_FILE"
|
||||
if printf '%s\n' "$group_state" | grep -qi "ONLINE"; then
|
||||
ok "service group is online"
|
||||
else
|
||||
critical "service group is not online"
|
||||
status=1
|
||||
fi
|
||||
|
||||
# hagrp -state rows are "<group> State <system> <value>"; the system name is column 3
online_node="$(printf '%s\n' "$group_state" | awk '/ONLINE/ {print $3; exit}')"
|
||||
if [[ -n "$online_node" ]]; then
|
||||
ok "possible online node: $online_node"
|
||||
else
|
||||
warning "unable to identify online node from hagrp output"
|
||||
fi
|
||||
|
||||
if vxdg list "$DISKGROUP" >/dev/null 2>&1; then
|
||||
ok "diskgroup exists: $DISKGROUP"
|
||||
else
|
||||
critical "diskgroup not found: $DISKGROUP"
|
||||
status=1
|
||||
fi
|
||||
|
||||
if vxprint -g "$DISKGROUP" "$VOLUME" >/dev/null 2>&1; then
|
||||
ok "volume exists: $VOLUME"
|
||||
else
|
||||
critical "volume not found in diskgroup: $VOLUME"
|
||||
status=1
|
||||
fi
|
||||
|
||||
if findmnt --target "$MOUNTPOINT" >/dev/null 2>&1; then
|
||||
ok "mountpoint is mounted: $MOUNTPOINT"
|
||||
fs_type="$(findmnt --noheadings --output FSTYPE --target "$MOUNTPOINT" | awk 'NR == 1 {print $1}')"
|
||||
ok "filesystem type: ${fs_type:-unknown}"
|
||||
else
|
||||
critical "mountpoint is not mounted: $MOUNTPOINT"
|
||||
status=1
|
||||
fi
|
||||
|
||||
capture_cmd "Current filesystem usage" df -h "$MOUNTPOINT" || status=1
|
||||
capture_cmd "Current VxVM layout" vxprint -g "$DISKGROUP" -ht || status=1
|
||||
capture_cmd "Current VCS service group display" hagrp -display "$SERVICE_GROUP" || status=1
|
||||
if hares -display 2>/dev/null | grep -F "$SERVICE_GROUP" | tee -a "$LOG_FILE"; then
|
||||
ok "displayed VCS resources related to service group: $SERVICE_GROUP"
|
||||
else
|
||||
warning "no VCS resource display rows matched service group: $SERVICE_GROUP"
|
||||
fi
|
||||
|
||||
if (( status == 0 )); then
|
||||
ok "precheck completed successfully"
|
||||
else
|
||||
critical "precheck found one or more issues"
|
||||
fi
|
||||
|
||||
exit "$status"
|
||||
@@ -0,0 +1,31 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
source "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
parse_common_args "$@"
|
||||
require_inputs sg
|
||||
|
||||
missing=0
|
||||
for cmd in hagrp grep; do
|
||||
require_cmd "$cmd" || missing=1
|
||||
done
|
||||
|
||||
if (( missing != 0 )); then
|
||||
exit 2
|
||||
fi
|
||||
|
||||
ok "Current service group state"
|
||||
capture_cmd "hagrp state for $SERVICE_GROUP" hagrp -state "$SERVICE_GROUP"
|
||||
warning "Freezing a VCS service group prevents automatic failover actions while the freeze is active"
|
||||
|
||||
confirm_execute "This will persistently freeze VCS service group '$SERVICE_GROUP'."
|
||||
run_cmd "Freeze VCS service group persistently" hagrp -freeze "$SERVICE_GROUP" -persistent
|
||||
|
||||
ok "Freeze state check"
|
||||
hagrp -display "$SERVICE_GROUP" 2>&1 | tee -a "$LOG_FILE" | grep -i "Frozen" || true
|
||||
ok "freeze step completed"
|
||||
@@ -0,0 +1,56 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
source "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
parse_common_args "$@"
|
||||
require_inputs disks
|
||||
|
||||
missing=0
|
||||
for cmd in vxdisk vxdisksetup awk; do
|
||||
require_cmd "$cmd" || missing=1
|
||||
done
|
||||
|
||||
if (( missing != 0 )); then
|
||||
exit 2
|
||||
fi
|
||||
|
||||
status=0
|
||||
for disk in $DISKS; do
|
||||
if ! vxdisk list "$disk" >/dev/null 2>&1; then
|
||||
critical "disk not found in vxdisk list: $disk"
|
||||
status=1
|
||||
continue
|
||||
fi
|
||||
|
||||
info="$(disk_status_line "$disk")"
|
||||
disk_status="${info#*|}"
|
||||
if [[ "$disk_status" != *"online invalid"* ]]; then
|
||||
critical "disk '$disk' is not safe to initialize; status is '${disk_status:-unknown}', expected 'online invalid'"
|
||||
status=1
|
||||
continue
|
||||
fi
|
||||
|
||||
ok "disk '$disk' validated as online invalid"
|
||||
done
|
||||
|
||||
if (( status != 0 )); then
|
||||
critical "one or more disks failed validation; no initialization attempted"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
confirm_execute "This will initialize VxVM metadata on disk(s): $DISKS"
|
||||
|
||||
for disk in $DISKS; do
|
||||
run_cmd "Initialize VxVM disk $disk" vxdisksetup -i "$disk"
|
||||
done
|
||||
|
||||
for disk in $DISKS; do
|
||||
capture_cmd "Post-initialization VxVM disk state for $disk" vxdisk list "$disk"
|
||||
done
|
||||
|
||||
ok "disk initialization step completed"
|
||||
@@ -0,0 +1,70 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
source "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
parse_common_args "$@"
|
||||
require_inputs dg disks
|
||||
|
||||
missing=0
|
||||
for cmd in vxdg vxdisk vxprint tr; do
|
||||
require_cmd "$cmd" || missing=1
|
||||
done
|
||||
|
||||
if (( missing != 0 )); then
|
||||
exit 2
|
||||
fi
|
||||
|
||||
if ! vxdg list "$DISKGROUP" >/dev/null 2>&1; then
|
||||
critical "diskgroup not found: $DISKGROUP"
|
||||
exit 1
|
||||
fi
|
||||
ok "diskgroup exists: $DISKGROUP"
|
||||
|
||||
status=0
|
||||
for disk in $DISKS; do
|
||||
if ! vxdisk list "$disk" >/dev/null 2>&1; then
|
||||
critical "disk not found in vxdisk list: $disk"
|
||||
status=1
|
||||
continue
|
||||
fi
|
||||
|
||||
summary="$(vxdisk list 2>/dev/null | awk -v disk="$disk" '$1 == disk {print $0}')"
|
||||
if [[ -z "$summary" ]]; then
|
||||
warning "unable to find summary row for $disk; using detailed status only"
|
||||
elif printf '%s\n' "$summary" | awk '{print $3}' | grep -qv '^-'; then
|
||||
critical "disk '$disk' appears to belong to a diskgroup: $summary"
|
||||
status=1
|
||||
continue
|
||||
fi
|
||||
|
||||
info="$(disk_status_line "$disk")"
|
||||
disk_status="${info#*|}"
|
||||
if [[ "$disk_status" == *"online invalid"* ]]; then
|
||||
critical "disk '$disk' is still online invalid; initialize it before adding to a diskgroup"
|
||||
status=1
|
||||
continue
|
||||
fi
|
||||
|
||||
ok "disk '$disk' appears initialized and unassigned"
|
||||
done
|
||||
|
||||
if (( status != 0 )); then
|
||||
critical "one or more disks failed validation; diskgroup extension not attempted"
|
||||
exit 1
|
||||
fi
|
||||
|
||||
confirm_execute "This will add disk(s) '$DISKS' to VxVM diskgroup '$DISKGROUP'."
|
||||
|
||||
for disk in $DISKS; do
|
||||
alias_base="$(printf '%s_%s' "$DISKGROUP" "$disk" | tr -c 'A-Za-z0-9_' '_')"
|
||||
run_cmd "Add disk $disk to diskgroup $DISKGROUP as $alias_base" vxdg -g "$DISKGROUP" adddisk "${alias_base}=${disk}"
|
||||
done
|
||||
|
||||
capture_cmd "Diskgroup details after extension" vxdg list "$DISKGROUP"
|
||||
capture_cmd "VxVM layout after diskgroup extension" vxprint -g "$DISKGROUP" -ht
|
||||
ok "diskgroup extension step completed"
|
||||
@@ -0,0 +1,81 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
source "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
parse_common_args "$@"
|
||||
require_inputs dg vol mount size
|
||||
|
||||
missing=0
|
||||
for cmd in vxdg vxprint vxassist df findmnt; do
|
||||
require_cmd "$cmd" || missing=1
|
||||
done
|
||||
|
||||
if (( missing != 0 )); then
|
||||
exit 2
|
||||
fi
|
||||
|
||||
if [[ ! "$SIZE" =~ ^\+[0-9]+[KkMmGgTtPp]?$ ]]; then
|
||||
critical "invalid --size '$SIZE'; use a grow-by value such as +10G"
|
||||
exit 2
|
||||
fi
|
||||
|
||||
status=0
|
||||
vxdg list "$DISKGROUP" >/dev/null 2>&1 || { critical "diskgroup not found: $DISKGROUP"; status=1; }
|
||||
vxprint -g "$DISKGROUP" "$VOLUME" >/dev/null 2>&1 || { critical "volume not found: $VOLUME"; status=1; }
|
||||
findmnt --target "$MOUNTPOINT" >/dev/null 2>&1 || { critical "mountpoint is not mounted: $MOUNTPOINT"; status=1; }
|
||||
|
||||
if (( status != 0 )); then
|
||||
exit 1
|
||||
fi
|
||||
|
||||
fs_type="$(findmnt --noheadings --output FSTYPE --target "$MOUNTPOINT" | awk 'NR == 1 {print $1}')"
|
||||
device="$(findmnt --noheadings --output SOURCE --target "$MOUNTPOINT" | awk 'NR == 1 {print $1}')"
|
||||
ok "filesystem type: ${fs_type:-unknown}"
|
||||
ok "mounted device: ${device:-unknown}"
|
||||
|
||||
capture_cmd "Filesystem usage before expansion" df -h "$MOUNTPOINT"
|
||||
capture_cmd "VxVM layout before volume expansion" vxprint -g "$DISKGROUP" -ht
|
||||
|
||||
confirm_execute "This will grow VxVM volume '$VOLUME' in diskgroup '$DISKGROUP' by '$SIZE'."
|
||||
# growby already implies an increment, so pass the length without the leading '+'
run_cmd "Grow VxVM volume by requested size" vxassist -g "$DISKGROUP" growby "$VOLUME" "${SIZE#+}"
|
||||
|
||||
case "$fs_type" in
|
||||
vxfs)
|
||||
warning "VxFS fsadm syntax can vary by Veritas release and site standard"
|
||||
warning "manual filesystem resize recommended after volume growth; review a command such as: fsadm -F vxfs -b <new_size_or_supported_option> $MOUNTPOINT"
|
||||
;;
|
||||
xfs)
|
||||
if has_cmd xfs_growfs; then
|
||||
run_cmd "Resize XFS filesystem online" xfs_growfs "$MOUNTPOINT"
|
||||
else
|
||||
critical "xfs_growfs not found; cannot resize XFS safely"
|
||||
exit 1
|
||||
fi
|
||||
;;
|
||||
ext3|ext4)
|
||||
if has_cmd resize2fs; then
|
||||
if [[ -n "$device" ]]; then
|
||||
run_cmd "Resize ext filesystem" resize2fs "$device"
|
||||
else
|
||||
critical "unable to detect mounted device for resize2fs"
|
||||
exit 1
|
||||
fi
|
||||
else
|
||||
critical "resize2fs not found; cannot resize ext filesystem safely"
|
||||
exit 1
|
||||
fi
|
||||
;;
|
||||
*)
|
||||
warning "unsupported or unknown filesystem type '$fs_type'; volume growth command was handled according to dry-run/execute mode"
|
||||
warning "manual filesystem resize required after confirming platform-specific procedure"
|
||||
;;
|
||||
esac
|
||||
|
||||
capture_cmd "Filesystem usage after expansion attempt" df -h "$MOUNTPOINT"
|
||||
capture_cmd "VxVM layout after volume expansion attempt" vxprint -g "$DISKGROUP" -ht
|
||||
ok "volume and filesystem expansion step completed"
|
||||
@@ -0,0 +1,94 @@
|
||||
#!/usr/bin/env bash
|
||||
set -o errexit
|
||||
set -o nounset
|
||||
set -o pipefail
|
||||
|
||||
SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
|
||||
# shellcheck source=00_env.sh
|
||||
source "$SCRIPT_DIR/00_env.sh"
|
||||
|
||||
parse_common_args "$@"
|
||||
require_inputs sg dg vol mount
|
||||
|
||||
missing=0
|
||||
for cmd in hagrp vxdisk vxdg vxprint df findmnt; do
|
||||
require_cmd "$cmd" || missing=1
|
||||
done
|
||||
|
||||
if (( missing != 0 )); then
|
||||
exit 2
|
||||
fi
|
||||
|
||||
status=0
|
||||
ok "Post-check started"
|
||||
log "INFO" "log file: $LOG_FILE"
|
||||
|
||||
group_state="$(hagrp -state "$SERVICE_GROUP" 2>/dev/null || true)"
|
||||
printf '%s\n' "$group_state" | tee -a "$LOG_FILE"
|
||||
if printf '%s\n' "$group_state" | grep -qi "ONLINE"; then
|
||||
ok "service group is online"
|
||||
else
|
||||
critical "service group is not online"
|
||||
status=1
|
||||
fi
|
||||
|
||||
freeze_display="$(hagrp -display "$SERVICE_GROUP" 2>/dev/null | grep -i "Frozen" || true)"
|
||||
printf '%s\n' "$freeze_display" | tee -a "$LOG_FILE"
|
||||
if printf '%s\n' "$freeze_display" | grep -Eqi "(1|true|yes|persistent)"; then
|
||||
ok "service group still appears frozen before unfreeze"
|
||||
else
|
||||
warning "unable to confirm service group freeze state; review before unfreezing"
|
||||
fi
|
||||
|
||||
if vxdg list "$DISKGROUP" >/dev/null 2>&1; then
|
||||
ok "diskgroup imported and visible: $DISKGROUP"
|
||||
else
|
||||
critical "diskgroup not visible: $DISKGROUP"
|
||||
status=1
|
||||
fi
|
||||
|
||||
volume_line="$(vxprint -g "$DISKGROUP" -v "$VOLUME" 2>/dev/null || true)"
|
||||
printf '%s\n' "$volume_line" | tee -a "$LOG_FILE"
|
||||
if printf '%s\n' "$volume_line" | grep -Eqi "(ENABLED|ACTIVE|started|fsgen)"; then
|
||||
ok "volume appears enabled or active"
|
||||
else
|
||||
critical "unable to confirm volume is enabled or active"
|
||||
status=1
|
||||
fi
|
||||
|
||||
if findmnt --target "$MOUNTPOINT" >/dev/null 2>&1; then
    ok "mountpoint is mounted: $MOUNTPOINT"
else
    critical "mountpoint is not mounted: $MOUNTPOINT"
    status=1
fi

capture_cmd "Filesystem usage after expansion" df -h "$MOUNTPOINT" || status=1
capture_cmd "VxVM layout after expansion" vxprint -g "$DISKGROUP" -ht || status=1
capture_cmd "VxVM disk list after expansion" vxdisk list || status=1

if has_cmd journalctl; then
    capture_cmd "Recent kernel journal messages" journalctl -k -n 50 || warning "journalctl check failed; review permissions or system logging"
else
    warning "journalctl not found; skipping kernel journal check"
fi

if has_cmd dmesg; then
    log "INFO" "Recent dmesg messages"
    log "INFO" "command: dmesg -T | tail -50"
    if dmesg -T 2>&1 | tail -50 | tee -a "$LOG_FILE"; then
        ok "captured recent dmesg messages"
    else
        warning "dmesg check failed; review permissions or kernel logging"
    fi
else
    warning "dmesg not found; skipping dmesg check"
fi

if (( status == 0 )); then
    ok "post-check completed successfully; compare df output with precheck baseline for expected size increase"
else
    critical "post-check found one or more issues"
fi

exit "$status"
+31
@@ -0,0 +1,31 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh
source "$SCRIPT_DIR/00_env.sh"

parse_common_args "$@"
require_inputs sg

missing=0
for cmd in hagrp grep; do
    require_cmd "$cmd" || missing=1
done

if (( missing != 0 )); then
    exit 2
fi

ok "Current service group freeze state"
hagrp -display "$SERVICE_GROUP" 2>&1 | tee -a "$LOG_FILE" | grep -i "Frozen" || true

confirm_execute "This will persistently unfreeze VCS service group '$SERVICE_GROUP'."
run_cmd "Unfreeze VCS service group persistently" hagrp -unfreeze "$SERVICE_GROUP" -persistent

ok "Verify service group freeze state"
hagrp -display "$SERVICE_GROUP" 2>&1 | tee -a "$LOG_FILE" | grep -i "Frozen" || true
capture_cmd "Current service group state" hagrp -state "$SERVICE_GROUP"
ok "unfreeze step completed"
@@ -0,0 +1,106 @@
# Veritas VxVM/VCS Storage Expansion Toolkit

Production-style Bash examples for expanding storage in a Veritas environment. These scripts are sanitized operational tooling for a Linux Infrastructure Engineer portfolio: they show the flow, guardrails, logging, and validation patterns used in enterprise change work.

## VxVM vs VCS

Veritas Volume Manager (VxVM) manages disks, disk groups, volumes, plexes, and subdisks. It is the storage virtualization layer used to initialize SAN LUNs, add capacity to disk groups, and grow volumes.

Veritas Cluster Server (VCS) manages application availability through service groups and resources. During storage changes, freezing the relevant service group can prevent unexpected automated failover actions while operators perform controlled work.
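
The freeze and unfreeze steps in this toolkit use the `-persistent` option. A persistent freeze changes the cluster configuration, so the usual VCS sequence opens the configuration for writing first and dumps it back to read-only afterwards. Whether `00_env.sh` or site tooling already does this is not shown in this README, so treat the following as a sketch under that assumption. `plan_persistent_freeze` is a hypothetical helper that only prints the sequence and runs nothing:

```bash
#!/usr/bin/env bash
# Sketch only: prints the command sequence instead of running it.
# plan_persistent_freeze is a hypothetical helper, not part of the toolkit.
plan_persistent_freeze() {
    local action="$1" group="$2"   # action: freeze | unfreeze
    case "$action" in
        freeze|unfreeze) ;;
        *) printf 'unknown action: %s\n' "$action" >&2; return 2 ;;
    esac
    printf '%s\n' \
        "haconf -makerw" \
        "hagrp -$action $group -persistent" \
        "haconf -dump -makero"
}

plan_persistent_freeze freeze app_sg
```

Printing the plan first keeps the example in line with the toolkit's dry-run default; check the exact commands against the installed Veritas release before use.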

## Safety Notes

- Default mode is always dry-run.
- Real execution requires `--execute`.
- Mutating scripts require an interactive `EXECUTE` confirmation after `--execute`.
- Disk names are never assumed. Candidate disks must be supplied explicitly.
- Disks are initialized only when VxVM reports the expected `online invalid` state.
- Filesystem growth is conservative and depends on detected filesystem type.
- Exact Veritas and filesystem commands can differ by product version, OS, and site standards.
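
The first three notes describe a two-stage gate: dry-run unless `--execute` is given, then a typed `EXECUTE` confirmation. A minimal sketch of how such a gate could look is below. The scripts call helpers named `run_cmd` and `confirm_execute`, but their real implementation lives in `00_env.sh`, which is not shown here, so these bodies are assumptions:

```bash
#!/usr/bin/env bash
# Hypothetical stand-ins for the toolkit's helpers; the real 00_env.sh may differ.
DRY_RUN=true   # assumed to be set to false by --execute

# Ask the operator to type EXECUTE; skipped entirely in dry-run mode.
confirm_execute() {
    local prompt="$1" answer=""
    [[ "$DRY_RUN" == "true" ]] && return 0
    printf '%s\nType EXECUTE to continue: ' "$prompt" >&2
    read -r answer
    [[ "$answer" == "EXECUTE" ]]
}

# Print the command in dry-run mode, run it otherwise.
run_cmd() {
    local description="$1"
    shift
    if [[ "$DRY_RUN" == "true" ]]; then
        printf 'DRY-RUN: %s: %s\n' "$description" "$*"
    else
        "$@"
    fi
}

run_cmd "Freeze service group" hagrp -freeze app_sg -persistent
```

In dry-run mode the last line only prints the `hagrp` command, so the sketch is safe to run on a host without VCS installed.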

## Required Tools

Common commands used by the toolkit:

- Linux: `bash`, `lsblk`, `df`, `findmnt`, `awk`, `grep`, `tee`
- VCS: `hastatus`, `hagrp`, `hares`
- VxVM: `vxdctl`, `vxdisk`, `vxdisksetup`, `vxdg`, `vxprint`, `vxassist`
- Filesystem resize tools as applicable: `fsadm`, `xfs_growfs`, `resize2fs`
- Optional log checks: `journalctl`, `dmesg`
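
Which resize tool applies depends on the filesystem type detected at the mountpoint. The sketch below shows one way to map a type to a tool from the list above. `resize_tool_for_fstype` is a hypothetical helper, not the logic in `06_extend_volume_fs.sh`, and it leaves out command flags because those vary by product version:

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a filesystem type to a resize tool from the list above.
# Flags are deliberately left out because they vary by version and site standard.
resize_tool_for_fstype() {
    case "$1" in
        vxfs)           printf 'fsadm\n' ;;
        xfs)            printf 'xfs_growfs\n' ;;
        ext2|ext3|ext4) printf 'resize2fs\n' ;;
        *)              return 1 ;;   # unknown type: refuse rather than guess
    esac
}

# On a real host the type could come from findmnt, for example:
#   fstype="$(findmnt -n -o FSTYPE --target /app)"
for fstype in vxfs xfs ext4 btrfs; do
    if tool="$(resize_tool_for_fstype "$fstype")"; then
        printf '%s -> %s\n' "$fstype" "$tool"
    else
        printf '%s -> no resize tool selected\n' "$fstype"
    fi
done
```

Returning a failure for unknown types keeps the behavior conservative: the caller has to stop rather than fall back to a default tool.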

## Scripts

- `00_env.sh` - shared configuration, logging, dry-run handling, argument helpers.
- `01_detect_new_luns.sh` - discovers Linux block devices and VxVM `online invalid` candidates.
- `02_precheck_vcs_vxvm.sh` - validates cluster, diskgroup, volume, and filesystem state.
- `03_freeze_vcs_group.sh` - freezes a VCS service group.
- `04_init_vxvm_disks.sh` - initializes candidate VxVM disks.
- `05_extend_diskgroup.sh` - adds initialized disks to a diskgroup.
- `06_extend_volume_fs.sh` - grows a VxVM volume and resizes the filesystem where safe.
- `07_postcheck_vcs_vxvm.sh` - validates final state and gathers post-change evidence.
- `08_unfreeze_vcs_group.sh` - unfreezes the VCS service group.
- `veritas_extend_runbook.sh` - prints the recommended order and can optionally run the steps.
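
As an illustration of the `online invalid` candidate detection mentioned for `01_detect_new_luns.sh`, the sketch below filters `vxdisk list` style output with `awk`. It assumes the default column layout (DEVICE, TYPE, DISK, GROUP, STATUS), and the sample rows are made up for the example. `filter_online_invalid` is a hypothetical name, not a function from the toolkit:

```bash
#!/usr/bin/env bash
# Print device names whose STATUS column is exactly "online invalid".
# Assumes the default `vxdisk list` layout: DEVICE TYPE DISK GROUP STATUS.
filter_online_invalid() {
    awk 'NR > 1 && NF == 6 && $5 == "online" && $6 == "invalid" { print $1 }'
}

# Made-up sample output for illustration; on a real host, pipe `vxdisk list` instead.
sample_vxdisk_list() {
    cat <<'EOF'
DEVICE       TYPE            DISK         GROUP        STATUS
emc0_0100    auto:cdsdisk    appdg01      appdg        online
emc0_1234    auto:none       -            -            online invalid
emc0_1235    auto:none       -            -            online invalid
EOF
}

sample_vxdisk_list | filter_online_invalid
```

Requiring an exact two-word status means disks with extra status flags are skipped and need manual review, which matches the toolkit's rule that disk names are never assumed.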

## Example Workflow

Print the runbook only:

```bash
./veritas_extend_runbook.sh \
  --sg app_sg \
  --dg appdg \
  --vol appvol \
  --mount /app \
  --size +100G \
  --disks "emc0_1234 emc0_1235"
```

Run all steps in dry-run mode:

```bash
./veritas_extend_runbook.sh \
  --run \
  --sg app_sg \
  --dg appdg \
  --vol appvol \
  --mount /app \
  --size +100G \
  --disks "emc0_1234 emc0_1235"
```

Run a controlled execution. Each mutating step still asks for `EXECUTE`:

```bash
./veritas_extend_runbook.sh \
  --run \
  --execute \
  --sg app_sg \
  --dg appdg \
  --vol appvol \
  --mount /app \
  --size +100G \
  --disks "emc0_1234 emc0_1235"
```

Run individual steps:

```bash
./01_detect_new_luns.sh
./02_precheck_vcs_vxvm.sh --sg app_sg --dg appdg --vol appvol --mount /app
./03_freeze_vcs_group.sh --sg app_sg
./04_init_vxvm_disks.sh --disks "emc0_1234 emc0_1235"
./05_extend_diskgroup.sh --dg appdg --disks "emc0_1234 emc0_1235"
./06_extend_volume_fs.sh --dg appdg --vol appvol --mount /app --size +100G
./07_postcheck_vcs_vxvm.sh --sg app_sg --dg appdg --vol appvol --mount /app
./08_unfreeze_vcs_group.sh --sg app_sg
```

## Exit Codes

- `0` - OK.
- `1` - operational validation or execution failure.
- `2` - invalid input or missing required command.
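
Because the scripts set `errexit`, a wrapper that calls them stops at the first step that returns non-zero. When steps are driven from other automation, the exit code can be captured and reported instead. The helpers below, `describe_exit_code` and `run_step`, are hypothetical and only illustrate the mapping listed above:

```bash
#!/usr/bin/env bash
# Hypothetical helpers: translate a step's exit code into an operator message.
describe_exit_code() {
    case "$1" in
        0) printf 'OK\n' ;;
        1) printf 'operational validation or execution failure\n' ;;
        2) printf 'invalid input or missing required command\n' ;;
        *) printf 'unexpected exit code %s\n' "$1" ;;
    esac
}

# Run a step, capture its exit code without aborting the caller, then report it.
run_step() {
    local rc=0
    "$@" || rc=$?
    printf '%s: %s\n' "$1" "$(describe_exit_code "$rc")"
    return "$rc"
}

# `true` and `false` stand in for real step scripts here.
run_step true
run_step false || true
```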

## Operational Reminder

Use these scripts as examples and adapt them to local runbooks, naming standards, multipath stack, Veritas release, filesystem type, and change-control policy before production use.
@@ -0,0 +1,94 @@
#!/usr/bin/env bash
set -o errexit
set -o nounset
set -o pipefail

SCRIPT_DIR="$(cd "$(dirname "${BASH_SOURCE[0]}")" && pwd)"
# shellcheck source=00_env.sh
source "$SCRIPT_DIR/00_env.sh"

RUN_STEPS=false

usage() {
    cat <<'USAGE'
Usage:
  ./veritas_extend_runbook.sh --sg <service_group> --dg <diskgroup> --vol <volume> --mount <mountpoint> --size <+SIZE> --disks "disk1 disk2" [--execute] [--run]

Options:
  --run      Run each step in the recommended order. Without --run, only print the runbook.
  --execute  Pass --execute to change steps. Dry-run remains the default.
USAGE
    usage_common
}

args=()
while (( "$#" > 0 )); do
    case "$1" in
        --run)
            RUN_STEPS=true
            shift
            ;;
        --help|-h)
            usage
            exit 0
            ;;
        *)
            args+=("$1")
            shift
            ;;
    esac
done

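# The ${args[@]+...} form avoids an "unbound variable" error for an empty array
# under nounset on Bash releases older than 4.4.
parse_common_args ${args[@]+"${args[@]}"}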

cat <<FLOW
Veritas VxVM/VCS storage expansion runbook
Mode: $(if [[ "$DRY_RUN" == "true" ]]; then printf 'DRY-RUN'; else printf 'EXECUTE'; fi)
Log file: $LOG_FILE

Step 1: Detect new LUNs
  $SCRIPT_DIR/01_detect_new_luns.sh

Step 2: Run VCS/VxVM precheck
  $SCRIPT_DIR/02_precheck_vcs_vxvm.sh --sg "$SERVICE_GROUP" --dg "$DISKGROUP" --vol "$VOLUME" --mount "$MOUNTPOINT"

Step 3: Freeze VCS service group
  $SCRIPT_DIR/03_freeze_vcs_group.sh --sg "$SERVICE_GROUP"

Step 4: Initialize new VxVM disks
  $SCRIPT_DIR/04_init_vxvm_disks.sh --disks "$DISKS"

Step 5: Add disks to diskgroup
  $SCRIPT_DIR/05_extend_diskgroup.sh --dg "$DISKGROUP" --disks "$DISKS"

Step 6: Grow volume and filesystem
  $SCRIPT_DIR/06_extend_volume_fs.sh --dg "$DISKGROUP" --vol "$VOLUME" --mount "$MOUNTPOINT" --size "$SIZE"

Step 7: Run post-check
  $SCRIPT_DIR/07_postcheck_vcs_vxvm.sh --sg "$SERVICE_GROUP" --dg "$DISKGROUP" --vol "$VOLUME" --mount "$MOUNTPOINT"

Step 8: Unfreeze VCS service group
  $SCRIPT_DIR/08_unfreeze_vcs_group.sh --sg "$SERVICE_GROUP"
FLOW

if [[ "$RUN_STEPS" != "true" ]]; then
    warning "runbook printed only; add --run to invoke steps"
    exit 0
fi

require_inputs sg dg vol mount size disks

execute_arg=()
if [[ "$DRY_RUN" == "false" ]]; then
    warning "--execute supplied to wrapper; destructive steps will request confirmation individually"
    execute_arg=(--execute)
fi

"$SCRIPT_DIR/01_detect_new_luns.sh"
"$SCRIPT_DIR/02_precheck_vcs_vxvm.sh" --sg "$SERVICE_GROUP" --dg "$DISKGROUP" --vol "$VOLUME" --mount "$MOUNTPOINT"
# execute_arg is empty in dry-run mode; the ${execute_arg[@]+...} form keeps that
# safe under nounset on Bash releases older than 4.4.
"$SCRIPT_DIR/03_freeze_vcs_group.sh" --sg "$SERVICE_GROUP" ${execute_arg[@]+"${execute_arg[@]}"}
"$SCRIPT_DIR/04_init_vxvm_disks.sh" --disks "$DISKS" ${execute_arg[@]+"${execute_arg[@]}"}
"$SCRIPT_DIR/05_extend_diskgroup.sh" --dg "$DISKGROUP" --disks "$DISKS" ${execute_arg[@]+"${execute_arg[@]}"}
"$SCRIPT_DIR/06_extend_volume_fs.sh" --dg "$DISKGROUP" --vol "$VOLUME" --mount "$MOUNTPOINT" --size "$SIZE" ${execute_arg[@]+"${execute_arg[@]}"}
"$SCRIPT_DIR/07_postcheck_vcs_vxvm.sh" --sg "$SERVICE_GROUP" --dg "$DISKGROUP" --vol "$VOLUME" --mount "$MOUNTPOINT"
"$SCRIPT_DIR/08_unfreeze_vcs_group.sh" --sg "$SERVICE_GROUP" ${execute_arg[@]+"${execute_arg[@]}"}