# Production Operations Cheatsheet

Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.

## Linux / Unix Daily Operations

### Uptime and Host State

Check host age, kernel, clock, and recent reboot history before touching anything:

```bash
uptime
uname -r
hostnamectl
timedatectl
who -b
last -x | head -20
```

Pre-check pattern:

```bash
date -u
uptime
df -h
free -m
systemctl --failed
```

### Process Management

```bash
ps -ef | head
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pgrep -a java
pstree -ap | less
pidof sshd
renice +5 -p <pid>
kill -TERM <pid>
kill -9 <pid>  # DANGEROUS: last resort only
```

Validation:

```bash
ps -p <pid> -o pid,stat,etime,cmd
journalctl -u <service> -n 50 --no-pager
```

### systemctl

```bash
systemctl status <service> --no-pager -l
systemctl is-active <service>
systemctl is-enabled <service>
systemctl list-units --type=service --state=running
systemctl list-units --failed
systemctl daemon-reload
systemctl restart <service>  # impact: confirm service interruption policy first
```

### journalctl

```bash
journalctl -u <service> -n 100 --no-pager
journalctl -u <service> --since '30 min ago'
journalctl -p err -S today
journalctl -k -b
journalctl --disk-usage
```

### Service Troubleshooting Flow

1. Confirm service state and recent restart count.
2. Read the last 100-200 journal lines.
3. Validate config syntax before restart if the daemon supports it.
4. Check dependent ports, mounts, credentials, and name resolution.
5. Restart only after cause is understood or rollback exists.

Example:

```bash
systemctl status nginx --no-pager -l
journalctl -u nginx -n 100 --no-pager
nginx -t
ss -ltnp | grep -E ':(80|443)\b'  # \b keeps :8080 and :4430 out of the match
curl -kI https://127.0.0.1/
```

### CPU and Memory Diagnostics

```bash
uptime
top -H -b -n 1 | head -40
pidstat 1 5
pidstat -ru -p ALL 1 3
vmstat 1 5
iostat -xz 1 5
free -m
sar -q 1 5
```

Quick interpretation:

- high `%wa`: storage path or NFS issue
- high run queue with low CPU idle: CPU contention
- swap growth plus page scans: memory pressure
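
To put a number on the `%wa` reading above, a minimal sketch that averages one named column of `vmstat` output. The helper name `avg_column` is hypothetical; it finds the column from the header row rather than assuming a fixed position. Note that the first data row `vmstat` prints is the since-boot average, so longer sample runs dilute it.

```bash
# avg_column NAME: average a named vmstat column read from stdin.
# (Hypothetical helper; the column is located via the header row.)
avg_column() {
  awk -v name="$1" '
    # The header row is the one whose first field is "r".
    $1 == "r" { for (i = 1; i <= NF; i++) if ($i == name) col = i; next }
    # Data rows start with a number; the "procs ---memory---" banner does not.
    col && $1 ~ /^[0-9]+$/ { sum += $col; n++ }
    END { if (n) printf "%.1f\n", sum / n; else exit 1 }
  '
}

# Usage:
#   vmstat 1 5 | avg_column wa
#   vmstat 1 5 | avg_column r
```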

### Disk Usage

```bash
df -hT
du -xhd1 /var | sort -h
find /var/log -type f -size +500M -ls | sort -k7,7n
lsof +L1
```

### Inode Exhaustion

```bash
df -ih
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
find /tmp -xdev -type f | wc -l
```

### Mounts

```bash
mount | column -t
findmnt
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
cat /etc/fstab
mount -a  # can expose bad fstab entries; use in change window
```

### Permissions

```bash
namei -l /path/to/file
stat /path/to/file
getfacl /path/to/file
chmod 640 /path/to/file
chown root:app /path/to/file
```

### SELinux

State and mode:

```bash
getenforce
sestatus
cat /etc/selinux/config
```

Check file, process, and port context:

```bash
ls -Zd /var/www/html
ls -lZ /var/www/html/index.html
ps -eZ | grep nginx
id -Z
semanage port -l | grep http
```

Audit and denial review:

```bash
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts today | audit2why
journalctl -t setroubleshoot --since '1 hour ago'
sealert -a /var/log/audit/audit.log
```

Typical flow:

1. Confirm SELinux mode is `Enforcing` or `Permissive`.
2. Identify the failing path, process domain, and target context.
3. Read AVC denials before changing labels or booleans.
4. Prefer persistent policy-aligned fixes over `chcon`.
5. Restore default labels and retest service path.

Modify and restore context:

```bash
chcon -t httpd_sys_content_t /srv/app/index.html  # temporary until relabel/restore
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads  # temporary until relabel/restore
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
restorecon -Rv /srv/app
matchpathcon /srv/app/uploads/file.txt
```

Booleans and validation:

```bash
getsebool -a | grep httpd
getsebool httpd_can_network_connect
setsebool -P httpd_can_network_connect on
runcon -t httpd_t -- id -Z
```

Notes:

- prefer `semanage fcontext` plus `restorecon` for persistent fixes
- use `chcon` only as a short-lived diagnostic or emergency workaround
- avoid generating local policy modules from `audit2allow` until root cause is understood
- after context changes, validate service startup, AVC silence, and application path access

### Archives

```bash
tar tf backup.tar | head
tar czf logs-$(date +%F).tgz /var/log/app
tar xzf bundle.tgz -C /restore/path
gzip -t file.gz
```

### File Operations

```bash
cp -a source/ target/
rsync -aHAXvn /src/ /dst/
rsync -aHAX --delete --info=progress2 /src/ /dst/  # impact: verify source/destination twice
mv file file.$(date +%F-%H%M%S).bak
sha256sum file
```

## Text Processing & Regex

### Core Tools

```bash
grep -n 'ERROR' app.log
grep -E 'ERROR|WARN' app.log
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
awk '{print $1,$4,$5}' access.log
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
sed -n '1,20p' file
sed -E 's/[[:space:]]+/ /g' file
cut -d: -f1,7 /etc/passwd
sort file | uniq -c | sort -nr
xargs -r -n1 systemctl status < service-list.txt
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
```

### Regex Reference

```text
IPv4                  \b(?:\d{1,3}\.){3}\d{1,3}\b
ISO timestamp         \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
UUID                  \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
Log level             \b(?:ERROR|WARN|INFO)\b
Failed SSH            Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
Ansible changed/fail  ^(changed|fatal|failed):\s+\[[^]]+\]
```
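
The IPv4 pattern above is deliberately loose: it also matches strings such as `999.1.1.1`. Where that matters, a minimal sketch of a stricter check in bash, using a hypothetical helper `is_ipv4` that adds a per-octet range test on top of the shape match:

```bash
# is_ipv4 ADDR: succeed only if ADDR is four dot-separated decimal octets, each 0-255.
# (Hypothetical helper name; bash-specific because of [[ =~ ]] and (( )).)
is_ipv4() {
  local ip=$1 octet
  # Shape check first: the same idea as the loose pattern, but anchored.
  [[ $ip =~ ^([0-9]{1,3}\.){3}[0-9]{1,3}$ ]] || return 1
  local IFS=.
  for octet in $ip; do
    # 10# forces base 10 so a leading zero is not read as octal.
    (( 10#$octet <= 255 )) || return 1
  done
}

# Usage:
#   is_ipv4 "$candidate" && echo "valid"
```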

### Log Parsing Examples

IP extraction:

```bash
grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
```

Timestamp filter:

```bash
grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
```

UUID extraction:

```bash
grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
```

ERROR/WARN/INFO parsing:

```bash
grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
```

Failed SSH login parsing:

```bash
# Line tail is "... for <user> from <ip> port <n> ssh2": user is NF-5, IP is NF-3
grep 'Failed password' /var/log/secure \
  | awk '{print $(NF-5),$(NF-3)}' \
  | sort | uniq -c | sort -nr | head
```
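
Field counting breaks if the log line format shifts. As an alternative sketch, the "Failed SSH" pattern from the regex reference can be applied with `sed -E`, which prints only the captured user and source address. The function name `failed_ssh_pairs` is hypothetical, and the pattern handles IPv4 sources only.

```bash
# failed_ssh_pairs: read sshd log lines on stdin, print "user ip" per failed password.
# (Hypothetical helper; lines that do not match are dropped by -n ... /p.)
failed_ssh_pairs() {
  sed -nE 's/.*Failed password for (invalid user )?([^ ]+) from (([0-9]{1,3}\.){3}[0-9]{1,3}) port .*/\2 \3/p'
}

# Usage:
#   failed_ssh_pairs < /var/log/secure | sort | uniq -c | sort -nr | head
```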

Extract fields from logs:

```bash
awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
```

Filter Ansible output:

```bash
grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
grep -E '^fatal:|^failed:' ansible.log
```

## Incident Response

### Disk Full

Workflow:

```bash
df -hT
df -ih
findmnt
du -xhd1 /var | sort -h
find /var -xdev -type f -size +1G -ls | sort -k7,7n
lsof +L1
journalctl --disk-usage
```

Typical branches:

- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
- inode full: remove file storms, spool buildup, temp-file leaks
- deleted open files: restart offender only after sizing impact

Post-check:

```bash
df -hT
df -ih
systemctl --failed
```
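
For the filesystem-full and inode-full branches, a minimal sketch that lists mount points at or above a usage threshold. The helper name `fs_over_threshold` and the 85% figure are assumptions; it reads the POSIX-format output of `df -P`, so mount points containing spaces are not handled.

```bash
# fs_over_threshold PCT: read `df -P` (or `df -Pi`) output on stdin and
# print mount points whose usage is at or above PCT percent.
# (Hypothetical helper; assumes the 6-column POSIX layout.)
fs_over_threshold() {
  awk -v limit="$1" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)                      # "87%" -> "87"
      if (use + 0 >= limit + 0) print $6, use "%"
    }
  '
}

# Usage:
#   df -P  | fs_over_threshold 85   # block usage
#   df -Pi | fs_over_threshold 85   # inode usage
```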

### High CPU

```bash
uptime
mpstat -P ALL 1 5
pidstat -u -p ALL 1 5
top -H -b -n 1 | head -40
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
```

Flow:

1. Confirm sustained load, not a short spike.
2. Separate user CPU vs system CPU vs I/O wait.
3. Identify hot process and hot threads.
4. Correlate with deploys, cron, backups, or JVM GC.
5. Throttle, stop, or fail over only with service impact understood.
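
Step 1 can be roughed out by comparing the 1-minute and 15-minute load averages against the core count. This is a sketch only: the helper name `load_state` is hypothetical, and "load above core count" is an assumed threshold that will not suit every workload.

```bash
# load_state CORES "L1 L5 L15 ...": classify load from /proc/loadavg-style text.
# (Hypothetical helper; threshold = core count is an assumption.)
load_state() {
  local cores=$1 l1 l5 l15
  read -r l1 l5 l15 _ <<<"$2"
  awk -v c="$cores" -v a="$l1" -v b="$l15" 'BEGIN {
    if (a + 0 > c + 0 && b + 0 > c + 0) print "sustained"  # both windows high
    else if (a + 0 > c + 0)             print "spike"      # only the 1-minute window
    else                                print "normal"
  }'
}

# Usage:
#   load_state "$(nproc)" "$(cat /proc/loadavg)"
```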

### Memory Pressure

```bash
free -m
vmstat 1 5
sar -r 1 5
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
dmesg -T | grep -Ei 'oom|killed process'
```

Flow:

1. Check swap growth and page scan rates.
2. Identify top RSS owners.
3. Check kernel logs for OOM.
4. Validate cache vs real process growth.
5. Restart leaking service only after capturing evidence.
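
For step 4, `MemAvailable` in `/proc/meminfo` already discounts reclaimable cache, so it separates "cache is large" from "processes are really growing" better than the free column. A minimal sketch with a hypothetical helper `mem_available_pct`:

```bash
# mem_available_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a whole-number percentage of MemTotal.
# (Hypothetical helper; exits non-zero if MemTotal is missing.)
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END { if (total > 0) printf "%d\n", avail * 100 / total; else exit 1 }
  '
}

# Usage:
#   mem_available_pct < /proc/meminfo
```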

### Failed Service

```bash
systemctl status <service> --no-pager -l
journalctl -u <service> -b --no-pager | tail -100
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
```

Flow:

1. Validate config.
2. Validate credentials, ports, mounts, permissions.
3. Confirm dependency availability.
4. Restart and recheck logs immediately.

### SELinux Denials

Typical case: service works in `Permissive`, fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct.

Triage:

```bash
getenforce
sestatus
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts recent | audit2why
journalctl -t setroubleshoot --since '30 min ago'
systemctl status <service> --no-pager -l
ps -eZ | grep <service>
ls -lZ /path/to/app /path/to/app/*
```

Flow:

1. Confirm the failure is current and reproducible.
2. Identify the denied process domain, target path, and requested access from AVC logs.
3. Validate expected default context with `matchpathcon`.
4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
5. Apply the smallest persistent fix, then retest in `Enforcing`.

Common fixes:

```bash
matchpathcon /srv/app/config.yml
restorecon -Rv /srv/app
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
semanage port -l | grep http
getsebool -a | grep httpd
setsebool -P httpd_can_network_connect on
```

Validation:

```bash
getenforce
systemctl restart <service>
systemctl status <service> --no-pager -l
ausearch -m AVC -ts recent
curl -fsS http://127.0.0.1:<port>/health
```

Operational notes:

- do not leave systems in `Permissive` as the fix
- prefer `restorecon` and `semanage fcontext` over repeated `chcon`
- treat `audit2allow` output as investigation material, not automatic remediation
- if policy changes are unavoidable, document exact AVC evidence and rollback path

### SSL Issues

```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
curl -vkI https://host/
```

Check for:

- expired certificate
- missing SAN
- incomplete chain
- hostname mismatch
- TLS version or cipher mismatch

### DNS Issues

```bash
dig +short app.example.com
dig @<resolver> app.example.com
dig +trace app.example.com
getent hosts app.example.com
resolvectl status
```

Flow:

1. Compare resolver result with authoritative result.
2. Check TTL and stale cache.
3. Validate `/etc/resolv.conf`, local resolver, and search domains.
4. Test from affected host and unaffected host.
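
Step 1 is easier to script when record order is ignored, since resolvers may rotate answers. A minimal sketch; the helper name `same_answers` is hypothetical, and `<resolver>` and `<authoritative-ns>` remain placeholders:

```bash
# same_answers A B: succeed if two newline-separated answer lists hold the
# same records, ignoring order and duplicates. (Hypothetical helper.)
same_answers() {
  [ "$(printf '%s\n' "$1" | sort -u)" = "$(printf '%s\n' "$2" | sort -u)" ]
}

# Usage:
#   same_answers "$(dig +short @<resolver> app.example.com)" \
#                "$(dig +short @<authoritative-ns> app.example.com)" \
#     || echo "resolver and authoritative answers differ"
```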

### Network Issues

```bash
ip addr
ip route
ss -tulpen
tcpdump -ni any host <peer> and port <port>
curl -sv http://host:port/health
mtr -rwzc 20 host
```

Flow:

1. Interface/link state.
2. Route and source IP selection.
3. Listening socket on target.
4. Firewall and security controls.
5. Packet capture if app logs are inconclusive.

### JVM / Tomcat Issues

```bash
ps -ef | grep -i tomcat
jcmd <pid> VM.flags
jstat -gcutil <pid> 1000 10
jstack <pid> | head -100
ss -ltnp | grep java
tail -100 /opt/tomcat/logs/catalina.out
```

Focus:

- stuck threads
- full GC loops
- heap exhaustion
- connector bind failures
- slow backend dependency

### Certificate Expiration

```bash
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
  | openssl x509 -noout -enddate

openssl x509 -checkend 2592000 -noout -in cert.pem
```
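
`-checkend` gives a yes/no answer for one window (2592000 seconds is 30 days). When a day count is more useful, for example in a report, the `notAfter` string can be converted with GNU `date`. A minimal sketch; `days_left` is a hypothetical helper and relies on GNU `date -d`, which BSD and macOS `date` do not support.

```bash
# days_left "ENDDATE" [NOW_EPOCH]: whole days from now until an openssl-style
# end date such as "Jan  1 00:00:00 2030 GMT". Negative means already expired.
# (Hypothetical helper; NOW_EPOCH is optional and mainly useful for testing.)
days_left() {
  local end_epoch now_epoch
  end_epoch=$(date -u -d "$1" +%s) || return 1
  now_epoch=${2:-$(date -u +%s)}
  echo $(( (end_epoch - now_epoch) / 86400 ))
}

# Usage:
#   end=$(openssl x509 -enddate -noout -in cert.pem | cut -d= -f2)
#   days_left "$end"
```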

### Suspicious Login Attempts

```bash
last -ai | head -30
lastb -ai | head -30
grep 'Failed password' /var/log/secure | tail -50
grep 'Accepted ' /var/log/secure | tail -50
ausearch -m USER_LOGIN -ts recent
```

Workflow:

1. Identify source IPs and usernames.
2. Validate whether attempts are expected from bastions/scanners.
3. Check successful logins from same sources.
4. Review sudo usage and persistence changes.
5. Preserve logs before cleanup or rotation.

## Networking Operations

```bash
ip -br addr
ip route get 8.8.8.8
ss -ltnp
ss -tn state established '( sport = :443 or dport = :443 )'
tcpdump -ni eth0 port 53
dig +short mx example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
mtr -rwzc 10 host
traceroute -T -p 443 host
openssl s_client -connect host:443 -servername host </dev/null
```

## Storage Operations

### Block and Filesystem Discovery

```bash
lsblk -f
blkid
findmnt
cat /proc/partitions
multipath -ll
```

### LVM

```bash
pvs
vgs
lvs -a -o +devices
pvdisplay /dev/sdX
vgdisplay <vg>
lvdisplay /dev/<vg>/<lv>
```

Growth example:

```bash
pvcreate /dev/mapper/mpatha  # impact: write metadata
vgextend vgdata /dev/mapper/mpatha  # impact: changes VG layout
lvextend -L +100G -r /dev/vgdata/lvapp
```

### XFS

```bash
xfs_info /mountpoint
xfs_repair -n /dev/mapper/vg-lv
xfs_growfs /mountpoint
```

### ext4

```bash
tune2fs -l /dev/mapper/vg-lv | head -40
e2fsck -fn /dev/mapper/vg-lv
resize2fs /dev/mapper/vg-lv
```

### Multipath

```bash
multipath -ll
lsblk -S
udevadm info --query=all --name=/dev/mapper/mpatha | head -40
```

### NFS

```bash
showmount -e nfs-server
nfsstat -m
mount | grep nfs
rpcinfo -p nfs-server
```

### iSCSI

```bash
iscsiadm -m session
iscsiadm -m node
iscsiadm -m discovery -t sendtargets -p <target-ip>
```

### Mount Troubleshooting

```bash
findmnt /mountpoint
mount -v /mountpoint
dmesg -T | tail -50
journalctl -k -n 100 --no-pager
```

Check:

- device path stable
- UUID correct
- filesystem type correct
- multipath settled
- network and RPC available for NFS

### Filesystem Validation

```bash
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
df -hT /data
touch /data/.write-test && rm -f /data/.write-test
```

### Migration Validation Example

```bash
findmnt /data
df -hT /data
rsync -aHAXvn /olddata/ /data/
rsync -aHAXc --delete --dry-run /olddata/ /data/
sha256sum /olddata/keyfile /data/keyfile
```

## AIX Operations

```bash
oslevel -s
errpt | head
errpt -a | more
topas
lsvg -o
lsvg rootvg
lslpp -L | grep -i openssl
svmon -G
svmon -P <pid>
netstat -rn
```

## SSL/TLS Operations

### OpenSSL Checks

```bash
openssl version -a
openssl x509 -in cert.pem -noout -text | less
openssl rsa -in key.pem -check -noout  # -noout keeps the private key off the terminal
openssl verify -CAfile chain.pem cert.pem
```

### Expiration Validation

```bash
openssl x509 -enddate -noout -in cert.pem
openssl x509 -checkend 604800 -noout -in cert.pem
```

### keytool Basics

```bash
keytool -list -v -keystore keystore.jks
keytool -list -cacerts | grep -i <alias>
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
```

### Chain Validation

```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem
```

## Automation Operations

### Bash Safety Patterns

```bash
set -euo pipefail
IFS=$'\n\t'
trap 'echo "line ${LINENO}: command failed" >&2' ERR
trap 'rm -f "${tmpfile:-}"' EXIT
```

Safe loop examples:

```bash
while IFS= read -r host; do
  ssh "$host" uptime
done < hostlist.txt

# gzip -t verifies compressed files, so match rotated *.gz logs
find /var/log -type f -name '*.gz' -print0 \
  | while IFS= read -r -d '' file; do
      gzip -t "$file"
    done
```

Operational scripting patterns:

- default to read-only mode
- require explicit `--execute` for changes
- log actions with timestamps
- validate dependencies with `command -v`
- use temp files with `mktemp`
- guard destructive paths and empty variables
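
The patterns above can be sketched as a small set of helpers. All function and variable names here (`EXECUTE`, `log`, `require`, `run`, `safe_rm_dir`, `parse_args`) are hypothetical, and the sketch covers the read-only default, `--execute`, timestamped logging, `command -v`, and path guarding; it is a starting point, not a complete script.

```bash
EXECUTE=0   # read-only unless --execute is passed

# log MSG...: print a message prefixed with a UTC timestamp.
log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"; }

# require CMD...: fail early if a dependency is missing.
require() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { log "missing dependency: $cmd" >&2; return 1; }
  done
}

# run CMD...: always log the command; execute it only when EXECUTE=1.
run() {
  if [ "$EXECUTE" -eq 1 ]; then
    log "RUN: $*"
    "$@"
  else
    log "DRY-RUN: $*"
  fi
}

# safe_rm_dir DIR: refuse empty, relative, or root paths before deleting.
safe_rm_dir() {
  local dir=${1:-}
  case "$dir" in
    ''|/|[!/]*) log "refusing to remove '$dir'" >&2; return 1 ;;
  esac
  run rm -rf -- "$dir"
}

# parse_args ARGS...: switch to execute mode only on an explicit flag.
parse_args() {
  local arg
  for arg in "$@"; do
    if [ "$arg" = "--execute" ]; then EXECUTE=1; fi
  done
}

# Usage inside a script:
#   parse_args "$@"
#   require rsync gzip
#   run systemctl reload nginx
```

Routing every change through `run` means a forgotten flag produces a log of intended actions rather than a change.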

## Ansible Operations

### Execution

```bash
ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --list | jq '.'
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
```

### Safe Rollout Workflow

1. Validate inventory and variable targeting.
2. Run syntax-check.
3. Run `--check --diff` on a single host.
4. Execute against one host or one tier.
5. Validate service health, logs, and config.
6. Expand rollout only after post-check passes.

Rollback mindset:

- keep before/after config copies
- know which tasks restart services
- define manual backout if package/config changes fail
- avoid broad `--limit` mistakes by reviewing resolved host list first

## Monitoring & Observability

### Zabbix Checks

```bash
systemctl status zabbix-agent2 --no-pager
zabbix_agent2 -t 'vfs.fs.size[/,free]'  # quoted so the shell does not glob the brackets
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
```

### ELK Log Workflows

```bash
grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
journalctl -u filebeat -n 100 --no-pager
curl -s 'http://localhost:9200/_cluster/health?pretty'
```

### Grafana Checks

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error' /var/log/grafana/grafana.log | tail -50
```

### Health Endpoints and Alert Validation

```bash
curl -fsS http://app:8080/health
curl -fsS http://app:8080/metrics | head
```

False positive validation:

1. Compare alert timestamp with deploy/change window.
2. Confirm on-host evidence, not only dashboard data.
3. Check collector lag, scrape failures, and stale metrics.
4. Validate from a second source before escalating.

## Operational Habits

### Pre-checks

- capture time, hostname, and operator
- capture current config and service state
- check recent alerts, maintenance windows, and dependencies
- confirm backup or rollback path exists

### Post-checks

- validate service state
- validate logs for fresh errors
- validate client path, ports, and name resolution
- compare metrics before/after

### Rollback Thinking

- define exact backout trigger before change
- prefer reversible steps
- keep config backups with timestamps
- avoid bundling unrelated changes

### Change Validation

```bash
systemctl is-active <service>
curl -fsS http://127.0.0.1:<port>/health
ss -ltnp | grep :<port>
journalctl -u <service> -S '5 min ago' --no-pager
```
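
Services often need a few seconds after a restart before these checks pass, so a single immediate probe can report a false failure. A minimal retry sketch; the helper name `wait_until` and the attempt and delay values in the usage lines are assumptions:

```bash
# wait_until ATTEMPTS DELAY CMD...: retry CMD until it succeeds or ATTEMPTS
# run out. Returns 0 on the first success, 1 if every attempt fails.
# (Hypothetical helper.)
wait_until() {
  local attempts=$1 delay=$2 i
  shift 2
  for (( i = 1; i <= attempts; i++ )); do
    if "$@"; then
      return 0
    fi
    # Sleep between attempts, but not after the last one.
    if (( i < attempts )); then sleep "$delay"; fi
  done
  return 1
}

# Usage (<service> and <port> are placeholders):
#   wait_until 10 3 systemctl is-active --quiet <service>
#   wait_until 10 3 curl -fsS -o /dev/null http://127.0.0.1:<port>/health
```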

### Operational Communication

- state scope, risk, and expected impact before action
- record start and stop times in UTC
- document what changed, what was checked, and remaining risk
- escalate with evidence, not assumptions

### Evidence Collection During Incidents

```bash
# Capture the directory name once; a glob in a redirect fails if several incident dirs exist
dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "$dir"
journalctl -b > "$dir/journal.txt"
ss -tulpen > "$dir/sockets.txt"
df -hT > "$dir/df.txt"
free -m > "$dir/free.txt"
```