Production Operations Cheatsheet
Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.
Linux / Unix Daily Operations
Uptime and Host State
Check host age, kernel, clock, and recent reboot history before touching anything:
uptime
uname -r
hostnamectl
timedatectl
who -b
last -x | head -20
Pre-check pattern:
date -u
uptime
df -h
free -m
systemctl --failed
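A minimal sketch of recording pre-change state, assuming the pre-check commands above are what you want captured. The helper name `snapshot_state` and the one-file-per-command layout are illustrative choices, not an established convention.

```shell
# snapshot_state DIR: save the pre-check command outputs under DIR.
# Hypothetical helper; the command list mirrors the pre-check pattern.
snapshot_state() {
    dir=$1
    mkdir -p "$dir" || return 1
    for cmd in "date -u" "uptime" "df -h" "free -m" "systemctl --failed"; do
        # Build a file name from the command, e.g. "df -h" -> df__h.txt
        name=$(printf '%s' "$cmd" | tr -c 'A-Za-z0-9' '_')
        # Skip tools that are not installed instead of aborting the snapshot.
        if command -v "${cmd%% *}" >/dev/null 2>&1; then
            $cmd > "$dir/$name.txt" 2>&1 || true
        fi
    done
}
```

Run it once before and once after the change (for example into `pre/` and `post/`), then compare with `diff -ru pre post`.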
Process Management
ps -ef | head
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pgrep -a java
pstree -ap | less
pidof sshd
renice +5 -p <pid>
kill -TERM <pid>
kill -9 <pid> # DANGEROUS: last resort only
Validation:
ps -p <pid> -o pid,stat,etime,cmd
journalctl -u <service> -n 50 --no-pager
systemctl
systemctl status <service> --no-pager -l
systemctl is-active <service>
systemctl is-enabled <service>
systemctl list-units --type=service --state=running
systemctl list-units --failed
systemctl daemon-reload
systemctl restart <service> # impact: service interruption; confirm restart policy first
journalctl
journalctl -u <service> -n 100 --no-pager
journalctl -u <service> --since '30 min ago'
journalctl -p err -S today
journalctl -k -b
journalctl --disk-usage
Service Troubleshooting Flow
- Confirm service state and recent restart count.
- Read the last 100-200 journal lines.
- Validate config syntax before restart if the daemon supports it.
- Check dependent ports, mounts, credentials, and name resolution.
- Restart only after cause is understood or rollback exists.
Example:
systemctl status nginx --no-pager -l
journalctl -u nginx -n 100 --no-pager
nginx -t
ss -ltnp | grep -E ':(80|443)\b'
curl -kI https://127.0.0.1/
CPU and Memory Diagnostics
uptime
top -H -b -n 1 | head -40
pidstat 1 5
pidstat -ru -p ALL 1 3
vmstat 1 5
iostat -xz 1 5
free -m
sar -q 1 5
Quick interpretation:
- high %wa: storage path or NFS issue
- high run queue with low CPU idle: CPU contention
- swap growth plus page scans: memory pressure
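The interpretation rules above can be sketched as a filter over one `vmstat` data line. The function name `classify_vmstat` and the thresholds (wa above 20, idle below 10, any swap I/O) are assumptions for illustration; swap in/out is used here as a rough stand-in for page-scan activity, which `vmstat` does not report directly.

```shell
# classify_vmstat NCPU: read one vmstat data line on stdin and print a hint.
# Assumes the classic column order:
#   r b swpd free buff cache si so bi bo in cs us sy id wa st
# Thresholds are illustrative, not tuned values.
classify_vmstat() {
    awk -v ncpu="$1" '{
        r = $1; si = $7; so = $8; id = $15; wa = $16
        if (wa > 20)                  print "io-wait: check storage path or NFS"
        else if (r > ncpu && id < 10) print "cpu-contention: run queue exceeds CPU count"
        else if (si + so > 0)         print "memory-pressure: active swap in/out"
        else                          print "no obvious bottleneck"
    }'
}
```

Example: `vmstat 1 2 | tail -1 | classify_vmstat "$(nproc)"`. The second sample is used because the first `vmstat` line reports averages since boot.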
Disk Usage
df -hT
du -xhd1 /var | sort -h
find /var/log -type f -size +500M -ls | sort -k7,7n
lsof +L1
Inode Exhaustion
df -ih
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
find /tmp -xdev -type f | wc -l
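When `df -ih` shows inode exhaustion, the next question is usually which directory holds the file storm. A small sketch, with `inode_hotspots` as a hypothetical helper name; it counts regular files per immediate parent directory and will miscount file names that contain newlines.

```shell
# inode_hotspots ROOT [N]: print the N directories (default 10) under ROOT
# that directly contain the most regular files, staying on one filesystem.
inode_hotspots() {
    find "$1" -xdev -type f 2>/dev/null \
        | sed 's|/[^/]*$||' \
        | sort | uniq -c | sort -rn | head -n "${2:-10}"
}
```

Example: `inode_hotspots /var 20`.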
Mounts
mount | column -t
findmnt
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
cat /etc/fstab
mount -a # can expose bad fstab entries; use in change window
Permissions
namei -l /path/to/file
stat /path/to/file
getfacl /path/to/file
chmod 640 /path/to/file
chown root:app /path/to/file
SELinux
State and mode:
getenforce
sestatus
cat /etc/selinux/config
Check file, process, and port context:
ls -Zd /var/www/html
ls -lZ /var/www/html/index.html
ps -eZ | grep nginx
id -Z
semanage port -l | grep http
Audit and denial review:
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts today | audit2why
journalctl -t setroubleshoot --since '1 hour ago'
sealert -a /var/log/audit/audit.log
Typical flow:
- Confirm SELinux mode is Enforcing or Permissive.
- Identify the failing path, process domain, and target context.
- Read AVC denials before changing labels or booleans.
- Prefer persistent policy-aligned fixes over chcon.
- Restore default labels and retest service path.
Modify and restore context:
chcon -t httpd_sys_content_t /srv/app/index.html # temporary until relabel/restore
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads # temporary until relabel/restore
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
restorecon -Rv /srv/app
matchpathcon /srv/app/uploads/file.txt
Booleans and validation:
getsebool -a | grep httpd
getsebool httpd_can_network_connect
setsebool -P httpd_can_network_connect on
runcon -t httpd_t -- id -Z
Notes:
- prefer semanage fcontext plus restorecon for persistent fixes
- use chcon only as a short-lived diagnostic or emergency workaround
- avoid generating local policy modules from audit2allow until root cause is understood
- after context changes, validate service startup, AVC silence, and application path access
Archives
tar tf backup.tar | head
tar czf logs-$(date +%F).tgz /var/log/app
tar xzf bundle.tgz -C /restore/path
gzip -t file.gz
File Operations
cp -a source/ target/
rsync -aHAXvn /src/ /dst/
rsync -aHAX --delete --info=progress2 /src/ /dst/ # impact: verify source/destination twice
mv file file.$(date +%F-%H%M%S).bak
sha256sum file
Text Processing & Regex
Core Tools
grep -n 'ERROR' app.log
grep -E 'ERROR|WARN' app.log
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
awk '{print $1,$4,$5}' access.log
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
sed -n '1,20p' file
sed -E 's/[[:space:]]+/ /g' file
cut -d: -f1,7 /etc/passwd
sort file | uniq -c | sort -nr
xargs -r -n1 systemctl status < service-list.txt
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
Regex Reference
IPv4 \b(?:\d{1,3}\.){3}\d{1,3}\b
ISO timestamp \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
UUID \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
Log level \b(?:ERROR|WARN|INFO)\b
Failed SSH Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
Ansible changed/fail ^(changed|fatal|failed):\s+\[[^]]+\]
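Note that the IPv4 pattern matches any four groups of one to three digits, so it also accepts out-of-range values such as 999.1.1.1. If extracted addresses feed a firewall rule or a report, a range check helps. A sketch, with `is_ipv4` as a hypothetical helper name:

```shell
# is_ipv4 ADDR: succeed only for four dot-separated decimal octets in 0-255.
is_ipv4() {
    printf '%s\n' "$1" | awk -F. '
        NF != 4 { exit 1 }
        {
            for (i = 1; i <= 4; i++)
                if ($i !~ /^[0-9]+$/ || length($i) > 3 || $i + 0 > 255) exit 1
        }'
}
```

Example: `is_ipv4 "$ip" && echo "$ip"` inside a loop over the extracted candidates.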
Log Parsing Examples
IP extraction:
grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
Timestamp filter:
grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
UUID extraction:
grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
ERROR/WARN/INFO parsing:
grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
Failed SSH login parsing:
# prints user and source IP; the line ends "... <user> from <ip> port <n> ssh2"
grep 'Failed password' /var/log/secure \
| awk '{print $(NF-5),$(NF-3)}' \
| sort | uniq -c | sort -nr | head
Extract fields from logs:
awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
Filter Ansible output:
grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
grep -E '^fatal:|^failed:' ansible.log
Incident Response
Disk Full
Workflow:
df -hT
df -ih
findmnt
du -xhd1 /var | sort -h
find /var -xdev -type f -size +1G -ls | sort -k7,7n
lsof +L1
journalctl --disk-usage
Typical branches:
- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
- inode full: remove file storms, spool buildup, temp-file leaks
- deleted open files: restart offender only after sizing impact
Post-check:
df -hT
df -ih
systemctl --failed
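For the post-check, a threshold filter over `df -P` output makes the result easy to read and to script. A sketch, assuming the POSIX output format (`-P`) and mount points without spaces; `over_threshold` is a hypothetical helper name.

```shell
# over_threshold PCT: read `df -P` output on stdin and print mount points
# whose Capacity column is at or above PCT percent.
over_threshold() {
    awk -v limit="$1" 'NR > 1 {
        use = $5; sub(/%$/, "", use)
        if (use + 0 >= limit + 0) print $6, use "%"
    }'
}
```

Example: `df -P | over_threshold 90`. On GNU coreutils, `df -Pi` uses the same column positions, so the same filter should also work for inode usage.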
High CPU
uptime
mpstat -P ALL 1 5
pidstat -u -p ALL 1 5
top -H -b -n 1 | head -40
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
Flow:
- Confirm sustained load, not a short spike.
- Separate user CPU vs system CPU vs I/O wait.
- Identify hot process and hot threads.
- Correlate with deploys, cron, backups, or JVM GC.
- Throttle, stop, or fail over only with service impact understood.
Memory Pressure
free -m
vmstat 1 5
sar -r 1 5
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
dmesg -T | grep -Ei 'oom|killed process'
Flow:
- Check swap growth and page scan rates.
- Identify top RSS owners.
- Check kernel logs for OOM.
- Validate cache vs real process growth.
- Restart leaking service only after capturing evidence.
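To separate page cache from real process growth, `MemAvailable` in `/proc/meminfo` is often more useful than the `free` column, because the kernel's estimate includes reclaimable cache. A sketch, with `mem_available_pct` as a hypothetical helper name:

```shell
# mem_available_pct [FILE]: print MemAvailable as a whole percentage of MemTotal.
# FILE defaults to /proc/meminfo; MemAvailable needs Linux 3.14 or newer.
mem_available_pct() {
    awk '$1 == "MemTotal:"     { total = $2 }
         $1 == "MemAvailable:" { avail = $2 }
         END {
             if (total > 0) printf "%d\n", avail * 100 / total
             else exit 1
         }' "${1:-/proc/meminfo}"
}
```

A low percentage while `free -m` shows large buff/cache suggests the cache is not easily reclaimable or that process RSS is genuinely growing.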
Failed Service
systemctl status <service> --no-pager -l
journalctl -u <service> -b --no-pager | tail -100
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
Flow:
- Validate config.
- Validate credentials, ports, mounts, permissions.
- Confirm dependency availability.
- Restart and recheck logs immediately.
SELinux Denials
Typical case: service works in Permissive, fails in Enforcing, or logs show permission denied while UNIX permissions look correct.
Triage:
getenforce
sestatus
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts recent | audit2why
journalctl -t setroubleshoot --since '30 min ago'
systemctl status <service> --no-pager -l
ps -eZ | grep <service>
ls -lZ /path/to/app /path/to/app/*
Flow:
- Confirm the failure is current and reproducible.
- Identify the denied process domain, target path, and requested access from AVC logs.
- Validate expected default context with matchpathcon.
- Check for mislabeled files, wrong port types, or missing SELinux booleans.
- Apply the smallest persistent fix, then retest in Enforcing.
Common fixes:
matchpathcon /srv/app/config.yml
restorecon -Rv /srv/app
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
semanage port -l | grep http
getsebool -a | grep httpd
setsebool -P httpd_can_network_connect on
Validation:
getenforce
systemctl restart <service>
systemctl status <service> --no-pager -l
ausearch -m AVC -ts recent
curl -fsS http://127.0.0.1:<port>/health
Operational notes:
- do not leave systems in Permissive as the fix
- prefer restorecon and semanage fcontext over repeated chcon
- treat audit2allow output as investigation material, not automatic remediation
- if policy changes are unavoidable, document exact AVC evidence and rollback path
SSL Issues
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
curl -vkI https://host/
Check for:
- expired certificate
- missing SAN
- incomplete chain
- hostname mismatch
- TLS version or cipher mismatch
DNS Issues
dig +short app.example.com
dig @<resolver> app.example.com
dig +trace app.example.com
getent hosts app.example.com
resolvectl status
Flow:
- Compare resolver result with authoritative result.
- Check TTL and stale cache.
- Validate /etc/resolv.conf, local resolver, and search domains.
- Test from affected host and unaffected host.
Network Issues
ip addr
ip route
ss -tulpen
tcpdump -ni any host <peer> and port <port>
curl -sv http://host:port/health
mtr -rwzc 20 host
Flow:
- Interface/link state.
- Route and source IP selection.
- Listening socket on target.
- Firewall and security controls.
- Packet capture if app logs are inconclusive.
JVM / Tomcat Issues
ps -ef | grep -i tomcat
jcmd <pid> VM.flags
jstat -gcutil <pid> 1000 10
jstack <pid> | head -100
ss -ltnp | grep java
tail -100 /opt/tomcat/logs/catalina.out
Focus:
- stuck threads
- full GC loops
- heap exhaustion
- connector bind failures
- slow backend dependency
Certificate Expiration
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -enddate
openssl x509 -checkend 2592000 -noout -in cert.pem
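`-checkend` only answers yes or no. When a report needs the number of days remaining, the `notAfter=` line can be converted directly. A sketch, assuming GNU `date` (the `-d` option); `days_left` is a hypothetical helper name, and the optional second argument exists only to make the calculation reproducible.

```shell
# days_left "notAfter=<date>" [NOW_EPOCH]: print whole days until the date.
# Input is the line printed by `openssl x509 -noout -enddate`.
# Assumes GNU date (-d); BSD and AIX date need a different invocation.
days_left() {
    end=$(date -u -d "${1#notAfter=}" +%s) || return 1
    now=${2:-$(date -u +%s)}
    echo $(( (end - now) / 86400 ))
}
```

Example: `days_left "$(openssl x509 -noout -enddate -in cert.pem)"`. A zero or negative result means the certificate expires within a day or has already expired.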
Suspicious Login Attempts
last -ai | head -30
lastb -ai | head -30
grep 'Failed password' /var/log/secure | tail -50
grep 'Accepted ' /var/log/secure | tail -50
ausearch -m USER_LOGIN -ts recent
Workflow:
- Identify source IPs and usernames.
- Validate whether attempts are expected from bastions/scanners.
- Check successful logins from same sources.
- Review sudo usage and persistence changes.
- Preserve logs before cleanup or rotation.
Networking Operations
ip -br addr
ip route get 8.8.8.8
ss -ltnp
ss -tn state established '( sport = :443 or dport = :443 )'
tcpdump -ni eth0 port 53
dig +short mx example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
mtr -rwzc 10 host
traceroute -T -p 443 host
openssl s_client -connect host:443 -servername host </dev/null
Storage Operations
Block and Filesystem Discovery
lsblk -f
blkid
findmnt
cat /proc/partitions
multipath -ll
LVM
pvs
vgs
lvs -a -o +devices
pvdisplay /dev/sdX
vgdisplay <vg>
lvdisplay /dev/<vg>/<lv>
Growth example:
pvcreate /dev/mapper/mpatha # impact: writes LVM metadata to the device
vgextend vgdata /dev/mapper/mpatha # impact: changes VG layout
lvextend -L +100G -r /dev/vgdata/lvapp
XFS
xfs_info /mountpoint
xfs_repair -n /dev/mapper/vg-lv
xfs_growfs /mountpoint
ext4
tune2fs -l /dev/mapper/vg-lv | head -40
e2fsck -fn /dev/mapper/vg-lv
resize2fs /dev/mapper/vg-lv
Multipath
multipath -ll
lsblk -S
udevadm info --query=all --name=/dev/mapper/mpatha | head -40
NFS
showmount -e nfs-server
nfsstat -m
mount | grep nfs
rpcinfo -p nfs-server
iSCSI
iscsiadm -m session
iscsiadm -m node
iscsiadm -m discovery -t sendtargets -p <target-ip>
Mount Troubleshooting
findmnt /mountpoint
mount -v /mountpoint
dmesg -T | tail -50
journalctl -k -n 100 --no-pager
Check:
- device path stable
- UUID correct
- filesystem type correct
- multipath settled
- network and RPC available for NFS
Filesystem Validation
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
df -hT /data
touch /data/.write-test && rm -f /data/.write-test
Migration Validation Example
findmnt /data
df -hT /data
rsync -aHAXvn /olddata/ /data/
rsync -aHAXc --delete --dry-run /olddata/ /data/
sha256sum /olddata/keyfile /data/keyfile
AIX Operations
oslevel -s
errpt | head
errpt -a | more
topas
lsvg -o
lsvg rootvg
lslpp -L | grep -i openssl
svmon -G
svmon -P <pid>
netstat -rn
SSL/TLS Operations
OpenSSL Checks
openssl version -a
openssl x509 -in cert.pem -noout -text | less
openssl rsa -in key.pem -check -noout # -noout keeps the private key off the terminal
openssl verify -CAfile chain.pem cert.pem
Expiration Validation
openssl x509 -enddate -noout -in cert.pem
openssl x509 -checkend 604800 -noout -in cert.pem
keytool Basics
keytool -list -v -keystore keystore.jks
keytool -list -cacerts | grep -i <alias>
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
Chain Validation
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem
Automation Operations
Bash Safety Patterns
set -euo pipefail
IFS=$'\n\t'
trap 'echo "line ${LINENO}: command failed" >&2' ERR
trap 'rm -f "${tmpfile:-}"' EXIT
Safe loop examples:
while IFS= read -r host; do
ssh -n "$host" uptime # -n stops ssh from consuming the rest of hostlist.txt on stdin
done < hostlist.txt
find /var/log -type f -name '*.gz' -print0 \
| while IFS= read -r -d '' file; do
gzip -t "$file"
done
Operational scripting patterns:
- default to read-only mode
- require explicit --execute for changes
- log actions with timestamps
- validate dependencies with command -v
- use temp files with mktemp
- guard destructive paths and empty variables
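The patterns above can be combined into a small script skeleton. This is a sketch under stated assumptions: the names `EXECUTE`, `log`, `require`, `run`, and `safe_rm_tree` are hypothetical, and the path guard is deliberately conservative rather than complete.

```shell
# Dry-run by default; changes happen only when --execute is passed.
EXECUTE=0
for arg in "$@"; do
    case $arg in --execute) EXECUTE=1 ;; esac
done

# log MESSAGE...: prefix every message with a UTC timestamp.
log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*"; }

# require CMD...: fail early when a dependency is missing.
require() {
    for c in "$@"; do
        command -v "$c" >/dev/null 2>&1 || { log "missing dependency: $c" >&2; return 1; }
    done
}

# run CMD ARGS...: log the command, and execute it only in --execute mode.
run() {
    if [ "$EXECUTE" -eq 1 ]; then
        log "RUN $*"
        "$@"
    else
        log "DRY-RUN $*"
    fi
}

# safe_rm_tree DIR: refuse empty, root, relative, or ..-containing paths.
safe_rm_tree() {
    case ${1:-} in
        ''|/|[!/]*|*..*) log "refusing path: '${1:-}'" >&2; return 1 ;;
    esac
    run rm -rf -- "$1"
}
```

Typical use: call `require rsync jq` at the top of the script, wrap every state-changing command in `run`, and review the DRY-RUN log before repeating the invocation with `--execute`.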
Ansible Operations
Execution
ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --list | jq '.'
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
Safe Rollout Workflow
- Validate inventory and variable targeting.
- Run syntax-check.
- Run --check --diff on a single host.
- Execute against one host or one tier.
- Validate service health, logs, and config.
- Expand rollout only after post-check passes.
Rollback mindset:
- keep before/after config copies
- know which tasks restart services
- define manual backout if package/config changes fail
- avoid broad --limit mistakes by reviewing resolved host list first
Monitoring & Observability
Zabbix Checks
systemctl status zabbix-agent2 --no-pager
zabbix_agent2 -t 'vfs.fs.size[/,free]'
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
ELK Log Workflows
grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
journalctl -u filebeat -n 100 --no-pager
curl -s 'http://localhost:9200/_cluster/health?pretty'
Grafana Checks
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error' /var/log/grafana/grafana.log | tail -50
Health Endpoints and Alert Validation
curl -fsS http://app:8080/health
curl -fsS http://app:8080/metrics | head
False positive validation:
- Compare alert timestamp with deploy/change window.
- Confirm on-host evidence, not only dashboard data.
- Check collector lag, scrape failures, and stale metrics.
- Validate from a second source before escalating.
Operational Habits
Pre-checks
- capture time, hostname, and operator
- capture current config and service state
- check recent alerts, maintenance windows, and dependencies
- confirm backup or rollback path exists
Post-checks
- validate service state
- validate logs for fresh errors
- validate client path, ports, and name resolution
- compare metrics before/after
Rollback Thinking
- define exact backout trigger before change
- prefer reversible steps
- keep config backups with timestamps
- avoid bundling unrelated changes
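Keeping timestamped config backups is easy to make routine. A sketch, with `backup_file` as a hypothetical helper name; it uses a UTC timestamp so backup names line up with change records.

```shell
# backup_file PATH: copy PATH to PATH.<UTC timestamp>.bak, preserving mode
# and timestamps (and ownership where permitted), then print the backup name.
backup_file() {
    [ -f "$1" ] || { echo "not a regular file: $1" >&2; return 1; }
    bak="$1.$(date -u +%Y%m%dT%H%M%SZ).bak"
    cp -p -- "$1" "$bak" && printf '%s\n' "$bak"
}
```

Example: `bak=$(backup_file /etc/nginx/nginx.conf)` before editing; the backout step is then `cp -p "$bak" /etc/nginx/nginx.conf` followed by a config test and reload.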
Change Validation
systemctl is-active <service>
curl -fsS http://127.0.0.1:<port>/health
ss -ltnp | grep :<port>
journalctl -u <service> -S '5 min ago' --no-pager
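Services often need a few seconds after a restart before the health endpoint answers, so a single check can report a false failure. A sketch of a bounded retry wrapper, with `retry` as a hypothetical helper name:

```shell
# retry TRIES DELAY CMD ARGS...: run CMD until it succeeds, at most TRIES times,
# sleeping DELAY seconds between attempts. Returns the last exit status.
retry() {
    tries=$1 delay=$2
    shift 2
    n=1
    while :; do
        "$@" && return 0
        rc=$?
        [ "$n" -ge "$tries" ] && return "$rc"
        n=$((n + 1))
        sleep "$delay"
    done
}
```

Example: `retry 10 3 curl -fsS http://127.0.0.1:<port>/health`. Keep the bound small; a check that still fails after the retries is a backout trigger, not a reason to wait longer.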
Operational Communication
- state scope, risk, and expected impact before action
- record start and stop times in UTC
- document what changed, what was checked, and remaining risk
- escalate with evidence, not assumptions
Evidence Collection During Incidents
dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ) # fixed path; a glob in a redirect fails when several incident dirs exist
mkdir -p "$dir"
journalctl -b > "$dir/journal.txt"
ss -tulpen > "$dir/sockets.txt"
df -hT > "$dir/df.txt"
free -m > "$dir/free.txt"