# Production Operations Cheatsheet
Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.
## Linux / Unix Daily Operations
### Uptime and Host State
Check host age, kernel, clock, and recent reboot history before touching anything:
```bash
uptime
uname -r
hostnamectl
timedatectl
who -b
last -x | head -20
```
Pre-check pattern:
```bash
date -u
uptime
df -h
free -m
systemctl --failed
```
### Process Management
```bash
ps -ef | head
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pgrep -a java
pstree -ap | less
pidof sshd
renice +5 -p <pid>
kill -TERM <pid>
kill -9 <pid> # DANGEROUS: last resort only
```
Validation:
```bash
ps -p <pid> -o pid,stat,etime,cmd
journalctl -u <service> -n 50 --no-pager
```
### systemctl
```bash
systemctl status <service> --no-pager -l
systemctl is-active <service>
systemctl is-enabled <service>
systemctl list-units --type=service --state=running
systemctl list-units --failed
systemctl daemon-reload
systemctl restart <service> # impact: confirm service interruption policy first
```
### journalctl
```bash
journalctl -u <service> -n 100 --no-pager
journalctl -u <service> --since '30 min ago'
journalctl -p err -S today
journalctl -k -b
journalctl --disk-usage
```
### Service Troubleshooting Flow
1. Confirm service state and recent restart count.
2. Read the last 100-200 journal lines.
3. Validate config syntax before restart if the daemon supports it.
4. Check dependent ports, mounts, credentials, and name resolution.
5. Restart only after cause is understood or rollback exists.
Example:
```bash
systemctl status nginx --no-pager -l
journalctl -u nginx -n 100 --no-pager
nginx -t
ss -ltnp | grep -E ':(80|443)[[:space:]]'
curl -kI https://127.0.0.1/
```
### CPU and Memory Diagnostics
```bash
uptime
top -H -b -n 1 | head -40
pidstat 1 5
pidstat -ru -p ALL 1 3
vmstat 1 5
iostat -xz 1 5
free -m
sar -q 1 5
```
Quick interpretation:
- high `%wa`: storage path or NFS issue
- high run queue with low CPU idle: CPU contention
- swap growth plus page scans: memory pressure
### Disk Usage
```bash
df -hT
du -xhd1 /var | sort -h
find /var/log -type f -size +500M -ls | sort -k7,7n
lsof +L1
```
### Inode Exhaustion
```bash
df -ih
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
find /tmp -xdev -type f | wc -l
```
### Mounts
```bash
mount | column -t
findmnt
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
cat /etc/fstab
mount -a # can expose bad fstab entries; use in change window
```
### Permissions
```bash
namei -l /path/to/file
stat /path/to/file
getfacl /path/to/file
chmod 640 /path/to/file
chown root:app /path/to/file
```
### SELinux
State and mode:
```bash
getenforce
sestatus
cat /etc/selinux/config
```
Check file, process, and port context:
```bash
ls -Zd /var/www/html
ls -lZ /var/www/html/index.html
ps -eZ | grep nginx
id -Z
semanage port -l | grep http
```
Audit and denial review:
```bash
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts today | audit2why
journalctl -t setroubleshoot --since '1 hour ago'
sealert -a /var/log/audit/audit.log
```
Typical flow:
1. Confirm the current SELinux mode (`Enforcing`, `Permissive`, or `Disabled`).
2. Identify the failing path, process domain, and target context.
3. Read AVC denials before changing labels or booleans.
4. Prefer persistent policy-aligned fixes over `chcon`.
5. Restore default labels and retest service path.
Modify and restore context:
```bash
chcon -t httpd_sys_content_t /srv/app/index.html # temporary until relabel/restore
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads # temporary until relabel/restore
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
restorecon -Rv /srv/app
matchpathcon /srv/app/uploads/file.txt
```
Booleans and validation:
```bash
getsebool -a | grep httpd
getsebool httpd_can_network_connect
setsebool -P httpd_can_network_connect on
runcon -t httpd_t -- id -Z
```
Notes:
- prefer `semanage fcontext` plus `restorecon` for persistent fixes
- use `chcon` only as a short-lived diagnostic or emergency workaround
- avoid generating local policy modules from `audit2allow` until root cause is understood
- after context changes, validate service startup, AVC silence, and application path access
### Archives
```bash
tar tf backup.tar | head
tar czf logs-$(date +%F).tgz /var/log/app
tar xzf bundle.tgz -C /restore/path
gzip -t file.gz
```
### File Operations
```bash
cp -a source/ target/
rsync -aHAXvn /src/ /dst/
rsync -aHAX --delete --info=progress2 /src/ /dst/ # impact: verify source/destination twice
mv file file.$(date +%F-%H%M%S).bak
sha256sum file
```
## Text Processing & Regex
### Core Tools
```bash
grep -n 'ERROR' app.log
grep -E 'ERROR|WARN' app.log
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
awk '{print $1,$4,$5}' access.log
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
sed -n '1,20p' file
sed -E 's/[[:space:]]+/ /g' file
cut -d: -f1,7 /etc/passwd
sort file | uniq -c | sort -nr
xargs -r -n1 systemctl status < service-list.txt
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
```
### Regex Reference
```text
IPv4                  \b(?:\d{1,3}\.){3}\d{1,3}\b
ISO timestamp         \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
UUID                  \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
Log level             \b(?:ERROR|WARN|INFO)\b
Failed SSH            Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
Ansible changed/fail  ^(changed|fatal|failed):\s+\[[^]]+\]
```
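The IPv4 pattern above only checks shape, so it also matches strings such as `999.1.1.1`. When extracted addresses feed a firewall rule or report, a range check may be worth adding. This is a minimal sketch; `valid_ipv4` is a hypothetical helper name.

```bash
# Hypothetical helper: succeeds only for a dotted quad with every octet in 0-255.
valid_ipv4() {
  # Shape check: four dot-separated groups of one to three digits.
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  # Range check: awk reads each octet as a decimal number.
  printf '%s\n' "$1" | awk -F. '{ for (i = 1; i <= 4; i++) if ($i + 0 > 255) exit 1 }'
}
```

Usage: `valid_ipv4 "$ip" && echo "$ip"` inside a loop over extracted candidates.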
### Log Parsing Examples
IP extraction:
```bash
grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
```
Timestamp filter:
```bash
grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
```
UUID extraction:
```bash
grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
```
ERROR/WARN/INFO parsing:
```bash
grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
```
Failed SSH login parsing:
```bash
grep 'Failed password' /var/log/secure \
| awk '{print $(NF-5),$(NF-3)}' \
| sort | uniq -c | sort -nr | head
```
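Matching on the message text rather than field positions is an alternative that follows the "Failed SSH" pattern from the regex reference. This is a minimal sketch, assuming the usual OpenSSH `Failed password for ... from ... port ...` wording. `failed_ssh_pairs` is a hypothetical helper name.

```bash
# Hypothetical helper: reads auth log lines on stdin and prints a ranked
# count of "user source-ip" pairs for failed password attempts.
failed_ssh_pairs() {
  # Capture group 2 is the user name, group 3 the IPv4 source address.
  sed -nE 's/.*Failed password for (invalid user )?([^ ]+) from ([0-9.]+) port.*/\2 \3/p' \
    | sort | uniq -c | sort -nr
}
```

Usage: `failed_ssh_pairs < /var/log/secure | head`. IPv6 sources are not matched by this sketch.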
Extract fields from logs:
```bash
awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
```
Filter Ansible output:
```bash
grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
grep -E '^fatal:|^failed:' ansible.log
```
## Incident Response
### Disk Full
Workflow:
```bash
df -hT
df -ih
findmnt
du -xhd1 /var | sort -h
find /var -xdev -type f -size +1G -ls | sort -k7,7n
lsof +L1
journalctl --disk-usage
```
Typical branches:
- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
- inode full: remove file storms, spool buildup, temp-file leaks
- deleted open files: restart offender only after sizing impact
Post-check:
```bash
df -hT
df -ih
systemctl --failed
```
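The block and inode checks can share one threshold filter. This is a minimal sketch, assuming POSIX-format `df -P` output on stdin and mount points without spaces. `flag_full` and the 90% default are hypothetical.

```bash
# Hypothetical helper: reads `df -P` (or `df -iP`) output on stdin and prints
# mount points whose use percentage is at or above the threshold.
flag_full() {
  awk -v t="${1:-90}" '
    NR > 1 {
      pct = $5
      sub(/%/, "", pct)              # "95%" -> "95"; "-" evaluates to 0
      if (pct + 0 >= t) print $6, $5
    }
  '
}
```

Usage: `df -P | flag_full 90` for blocks and `df -iP | flag_full 90` for inodes, run before and after the cleanup.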
### High CPU
```bash
uptime
mpstat -P ALL 1 5
pidstat -u -p ALL 1 5
top -H -b -n 1 | head -40
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
```
Flow:
1. Confirm sustained load, not a short spike.
2. Separate user CPU vs system CPU vs I/O wait.
3. Identify hot process and hot threads.
4. Correlate with deploys, cron, backups, or JVM GC.
5. Throttle, stop, or fail over only with service impact understood.
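
Step 1 of the flow can be approximated by comparing the 1-minute and 15-minute load averages against the core count. This is a minimal sketch; `load_state` and the cutoff of 1.0 runnable task per core are assumptions to tune per host.

```bash
# Hypothetical helper: classify load from a /proc/loadavg line and a core count.
# Prints "sustained", "spike", or "normal".
load_state() {
  printf '%s\n' "$1" | awk -v n="$2" '{
    one = $1 / n       # 1-minute load per core
    fifteen = $3 / n   # 15-minute load per core
    if (one >= 1 && fifteen >= 1) print "sustained"
    else if (one >= 1)            print "spike"
    else                          print "normal"
  }'
}
```

Usage: `load_state "$(cat /proc/loadavg)" "$(nproc)"`. Linux load averages also count tasks in uninterruptible sleep, so a "sustained" result can reflect I/O wait rather than CPU; the `mpstat` and `pidstat` output separates the two.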
### Memory Pressure
```bash
free -m
vmstat 1 5
sar -r 1 5
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
dmesg -T | grep -Ei 'oom|killed process'
```
Flow:
1. Check swap growth and page scan rates.
2. Identify top RSS owners.
3. Check kernel logs for OOM.
4. Validate cache vs real process growth.
5. Restart leaking service only after capturing evidence.
### Failed Service
```bash
systemctl status <service> --no-pager -l
journalctl -u <service> -b --no-pager | tail -100
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
```
Flow:
1. Validate config.
2. Validate credentials, ports, mounts, permissions.
3. Confirm dependency availability.
4. Restart and recheck logs immediately.
### SELinux Denials
Typical case: service works in `Permissive`, fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct.
Triage:
```bash
getenforce
sestatus
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts recent | audit2why
journalctl -t setroubleshoot --since '30 min ago'
systemctl status <service> --no-pager -l
ps -eZ | grep <service>
ls -lZ /path/to/app /path/to/app/*
```
Flow:
1. Confirm the failure is current and reproducible.
2. Identify the denied process domain, target path, and requested access from AVC logs.
3. Validate expected default context with `matchpathcon`.
4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
5. Apply the smallest persistent fix, then retest in `Enforcing`.
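
Step 2 of the flow can be sped up by condensing raw AVC records into one line per distinct denial. This is a minimal sketch, assuming the common audit record layout with `comm=`, `scontext=`, `tcontext=`, and `tclass=` fields in that order. `avc_summary` is a hypothetical helper name.

```bash
# Hypothetical helper: reads audit records on stdin and prints a ranked count of
# "command source-context -> target-context [class: permissions]".
avc_summary() {
  sed -nE 's/.*avc:[[:space:]]+denied[[:space:]]+\{ ([^}]*) \}.*comm="([^"]*)".*scontext=([^ ]+) tcontext=([^ ]+) tclass=([^ ]+).*/\2 \3 -> \4 [\5: \1]/p' \
    | sort | uniq -c | sort -nr
}
```

Usage: `ausearch -m AVC -ts recent | avc_summary`. Records that do not follow the assumed layout are silently skipped, so read the raw output when the summary looks incomplete.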
Common fixes:
```bash
matchpathcon /srv/app/config.yml
restorecon -Rv /srv/app
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
semanage port -l | grep http
getsebool -a | grep httpd
setsebool -P httpd_can_network_connect on
```
Validation:
```bash
getenforce
systemctl restart <service>
systemctl status <service> --no-pager -l
ausearch -m AVC -ts recent
curl -fsS http://127.0.0.1:<port>/health
```
Operational notes:
- do not leave systems in `Permissive` as the fix
- prefer `restorecon` and `semanage fcontext` over repeated `chcon`
- treat `audit2allow` output as investigation material, not automatic remediation
- if policy changes are unavoidable, document exact AVC evidence and rollback path
### SSL Issues
```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
curl -vkI https://host/
```
Check for:
- expired certificate
- missing SAN
- incomplete chain
- hostname mismatch
- TLS version or cipher mismatch
### DNS Issues
```bash
dig +short app.example.com
dig @<resolver> app.example.com
dig +trace app.example.com
getent hosts app.example.com
resolvectl status
```
Flow:
1. Compare resolver result with authoritative result.
2. Check TTL and stale cache.
3. Validate `/etc/resolv.conf`, local resolver, and search domains.
4. Test from affected host and unaffected host.
### Network Issues
```bash
ip addr
ip route
ss -tulpen
tcpdump -ni any host <peer> and port <port>
curl -sv http://host:port/health
mtr -rwzc 20 host
```
Flow:
1. Interface/link state.
2. Route and source IP selection.
3. Listening socket on target.
4. Firewall and security controls.
5. Packet capture if app logs are inconclusive.
### JVM / Tomcat Issues
```bash
ps -ef | grep -i tomcat
jcmd <pid> VM.flags
jstat -gcutil <pid> 1000 10
jstack <pid> | head -100
ss -ltnp | grep java
tail -100 /opt/tomcat/logs/catalina.out
```
Focus:
- stuck threads
- full GC loops
- heap exhaustion
- connector bind failures
- slow backend dependency
### Certificate Expiration
```bash
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -enddate
openssl x509 -checkend 2592000 -noout -in cert.pem # 30 days; exit 1 means it expires within the window
```
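For reporting, the remaining lifetime in days is often more useful than a pass/fail result. This is a minimal sketch, assuming GNU `date` for `-d` parsing. `days_until_expiry` is a hypothetical helper name, and the optional second argument exists only so the reference time can be pinned.

```bash
# Hypothetical helper: whole days from a reference epoch (default: now) until
# the date in an OpenSSL "notAfter=..." line. Requires GNU date.
days_until_expiry() {
  local enddate="${1#notAfter=}"          # strip the "notAfter=" prefix if present
  local now="${2:-$(date -u +%s)}"
  local end
  end=$(date -u -d "$enddate" +%s) || return 1
  echo $(( (end - now) / 86400 ))         # integer division truncates toward zero
}
```

Usage: `days_until_expiry "$(openssl x509 -noout -enddate -in cert.pem)"`. A negative result means the certificate has already expired.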
### Suspicious Login Attempts
```bash
last -ai | head -30
lastb -ai | head -30
grep 'Failed password' /var/log/secure | tail -50
grep 'Accepted ' /var/log/secure | tail -50
ausearch -m USER_LOGIN -ts recent
```
Workflow:
1. Identify source IPs and usernames.
2. Validate whether attempts are expected from bastions/scanners.
3. Check successful logins from same sources.
4. Review sudo usage and persistence changes.
5. Preserve logs before cleanup or rotation.
## Networking Operations
```bash
ip -br addr
ip route get 8.8.8.8
ss -ltnp
ss -tn state established '( sport = :443 or dport = :443 )'
tcpdump -ni eth0 port 53
dig +short mx example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
mtr -rwzc 10 host
traceroute -T -p 443 host
openssl s_client -connect host:443 -servername host </dev/null
```
## Storage Operations
### Block and Filesystem Discovery
```bash
lsblk -f
blkid
findmnt
cat /proc/partitions
multipath -ll
```
### LVM
```bash
pvs
vgs
lvs -a -o +devices
pvdisplay /dev/sdX
vgdisplay <vg>
lvdisplay /dev/<vg>/<lv>
```
Growth example:
```bash
pvcreate /dev/mapper/mpatha # impact: writes LVM metadata to the device
vgextend vgdata /dev/mapper/mpatha # impact: changes VG layout
lvextend -L +100G -r /dev/vgdata/lvapp
```
### XFS
```bash
xfs_info /mountpoint
xfs_repair -n /dev/mapper/vg-lv # no-modify check; run against an unmounted filesystem
xfs_growfs /mountpoint
```
### ext4
```bash
tune2fs -l /dev/mapper/vg-lv | head -40
e2fsck -fn /dev/mapper/vg-lv # read-only check; results are unreliable on a mounted filesystem
resize2fs /dev/mapper/vg-lv
```
### Multipath
```bash
multipath -ll
lsblk -S
udevadm info --query=all --name=/dev/mapper/mpatha | head -40
```
### NFS
```bash
showmount -e nfs-server
nfsstat -m
mount | grep nfs
rpcinfo -p nfs-server
```
### iSCSI
```bash
iscsiadm -m session
iscsiadm -m node
iscsiadm -m discovery -t sendtargets -p <target-ip>
```
### Mount Troubleshooting
```bash
findmnt /mountpoint
mount -v /mountpoint
dmesg -T | tail -50
journalctl -k -n 100 --no-pager
```
Check:
- device path stable
- UUID correct
- filesystem type correct
- multipath settled
- network and RPC available for NFS
### Filesystem Validation
```bash
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
df -hT /data
touch /data/.write-test && rm -f /data/.write-test
```
### Migration Validation Example
```bash
findmnt /data
df -hT /data
rsync -aHAXvn /olddata/ /data/
rsync -aHAXc --delete --dry-run /olddata/ /data/
sha256sum /olddata/keyfile /data/keyfile
```
## AIX Operations
```bash
oslevel -s
errpt | head
errpt -a | more
topas
lsvg -o
lsvg rootvg
lslpp -L | grep -i openssl
svmon -G
svmon -P <pid>
netstat -rn
```
## SSL/TLS Operations
### OpenSSL Checks
```bash
openssl version -a
openssl x509 -in cert.pem -noout -text | less
openssl rsa -in key.pem -check -noout # -noout keeps the private key off the terminal
openssl verify -CAfile chain.pem cert.pem
```
### Expiration Validation
```bash
openssl x509 -enddate -noout -in cert.pem
openssl x509 -checkend 604800 -noout -in cert.pem # 7 days
```
### keytool Basics
```bash
keytool -list -v -keystore keystore.jks
keytool -list -cacerts | grep -i <alias>
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
```
### Chain Validation
```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem
```
## Automation Operations
### Bash Safety Patterns
```bash
set -euo pipefail
IFS=$'\n\t'
trap 'echo "line ${LINENO}: command failed" >&2' ERR
trap 'rm -f "${tmpfile:-}"' EXIT
```
Safe loop examples:
```bash
while IFS= read -r host; do
  ssh -n "$host" uptime # -n stops ssh from consuming the rest of hostlist.txt
done < hostlist.txt
find /var/log -type f -name '*.gz' -print0 \
  | while IFS= read -r -d '' file; do
      gzip -t "$file" || echo "corrupt: $file" >&2
    done
```
Operational scripting patterns:
- default to read-only mode
- require explicit `--execute` for changes
- log actions with timestamps
- validate dependencies with `command -v`
- use temp files with `mktemp`
- guard destructive paths and empty variables
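
Several of these patterns fit into one small skeleton. This is a minimal sketch meant to sit below the `set -euo pipefail` header from the safety patterns. It covers the read-only default, the `--execute` switch, timestamped logging, dependency checks, and path guards. All function names (`log`, `require`, `parse_args`, `run`, `safe_rm_dir`) and the guarded path list are hypothetical.

```bash
#!/usr/bin/env bash
EXECUTE=0   # read-only unless --execute is passed

# Timestamped log line on stderr, in UTC.
log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2; }

# Fail early if a required command is missing.
require() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { log "missing dependency: $cmd"; return 1; }
  done
}

# Only --execute is recognized; anything else is rejected.
parse_args() {
  local arg
  for arg in "$@"; do
    case "$arg" in
      --execute) EXECUTE=1 ;;
      *) log "unknown argument: $arg"; return 2 ;;
    esac
  done
}

# Run the command only in execute mode; otherwise just log what would run.
run() {
  if [ "$EXECUTE" -eq 1 ]; then
    log "RUN: $*"
    "$@"
  else
    log "DRY-RUN: $*"
  fi
}

# Refuse empty variables and a short, assumed list of critical paths.
safe_rm_dir() {
  local target="${1:-}"
  case "$target" in
    ''|/|/etc|/var|/usr|/home) log "refusing to remove '${target}'"; return 1 ;;
  esac
  run rm -rf -- "$target"
}
```

Usage: call `parse_args "$@"` and `require rsync tar` at the top of the script, then wrap every state-changing command in `run`. Without `--execute`, the script only logs what it would do.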
## Ansible Operations
### Execution
```bash
ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --list | jq '.'
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
```
### Safe Rollout Workflow
1. Validate inventory and variable targeting.
2. Run syntax-check.
3. Run `--check --diff` on a single host.
4. Execute against one host or one tier.
5. Validate service health, logs, and config.
6. Expand rollout only after post-check passes.
Rollback mindset:
- keep before/after config copies
- know which tasks restart services
- define manual backout if package/config changes fail
- avoid broad `--limit` mistakes by reviewing resolved host list first
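
The resolved host list can be reviewed with `--list-hosts` before any run. A size guard on that output is one way to catch an overly broad `--limit`. This is a minimal sketch, assuming the `hosts (N):` line that `ansible-playbook --list-hosts` prints per play. `check_host_count` and the default ceiling of 5 are hypothetical.

```bash
# Hypothetical guard: reads `ansible-playbook --list-hosts` output on stdin and
# exits non-zero if any play resolves to more hosts than the allowed maximum.
check_host_count() {
  awk -v max="${1:-5}" '
    /^[[:space:]]*hosts \([0-9]+\):/ {
      n = $0
      sub(/^[^(]*\(/, "", n)   # drop everything up to "("
      sub(/\).*/, "", n)       # drop ")" and the rest
      if (n + 0 > max) { printf "play resolves to %d hosts (max %d)\n", n, max; bad = 1 }
    }
    END { exit bad }
  '
}
```

Usage: `ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01 --list-hosts | check_host_count 1`.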
## Monitoring & Observability
### Zabbix Checks
```bash
systemctl status zabbix-agent2 --no-pager
zabbix_agent2 -t 'vfs.fs.size[/,free]'
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
```
### ELK Log Workflows
```bash
grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
journalctl -u filebeat -n 100 --no-pager
curl -s 'http://localhost:9200/_cluster/health?pretty'
```
### Grafana Checks
```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error' /var/log/grafana/grafana.log | tail -50
```
### Health Endpoints and Alert Validation
```bash
curl -fsS http://app:8080/health
curl -fsS http://app:8080/metrics | head
```
False positive validation:
1. Compare alert timestamp with deploy/change window.
2. Confirm on-host evidence, not only dashboard data.
3. Check collector lag, scrape failures, and stale metrics.
4. Validate from a second source before escalating.
## Operational Habits
### Pre-checks
- capture time, hostname, and operator
- capture current config and service state
- check recent alerts, maintenance windows, and dependencies
- confirm backup or rollback path exists
### Post-checks
- validate service state
- validate logs for fresh errors
- validate client path, ports, and name resolution
- compare metrics before/after
### Rollback Thinking
- define exact backout trigger before change
- prefer reversible steps
- keep config backups with timestamps
- avoid bundling unrelated changes
### Change Validation
```bash
systemctl is-active <service>
curl -fsS http://127.0.0.1:<port>/health
ss -ltnp | grep :<port>
journalctl -u <service> -S '5 min ago' --no-pager
```
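A service can take a few seconds to become healthy after a restart, so a single check immediately after the change can give a false failure. This is a minimal sketch of a bounded retry wrapper. `retry` is a hypothetical helper, and the attempt count and delay are assumptions to tune per service.

```bash
# Hypothetical helper: retry <attempts> <delay-seconds> <command...>
# Succeeds as soon as the command succeeds; fails after the last attempt.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    "$@" && return 0
    [ "$i" -lt "$attempts" ] && sleep "$delay"   # no sleep after the final try
    i=$((i + 1))
  done
  return 1
}
```

Usage: `retry 10 3 curl -fsS http://127.0.0.1:<port>/health`. Keep the total wait short enough that a real failure still reaches the backout trigger in time.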
### Operational Communication
- state scope, risk, and expected impact before action
- record start and stop times in UTC
- document what changed, what was checked, and remaining risk
- escalate with evidence, not assumptions
### Evidence Collection During Incidents
```bash
incident_dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "$incident_dir"
journalctl -b > "$incident_dir/journal.txt"
ss -tulpen > "$incident_dir/sockets.txt"
df -hT > "$incident_dir/df.txt"
free -m > "$incident_dir/free.txt"
```