portfolio/infra-run/docs/operations-cheatsheet.md
Mateusz Suski 0d3905b8a1
Add operational cheatsheets across repository
2026-05-09 09:41:55 +00:00

Production Operations Cheatsheet

Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.

Linux / Unix Daily Operations

Uptime and Host State

Check host age, kernel, clock, and recent reboot history before touching anything:

uptime
uname -r
hostnamectl
timedatectl
who -b
last -x | head -20

Pre-check pattern:

date -u
uptime
df -h
free -m
systemctl --failed

Process Management

ps -ef | head
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pgrep -a java
pstree -ap | less
pidof sshd
renice +5 -p <pid>
kill -TERM <pid>
kill -9 <pid>   # DANGEROUS: last resort only

Validation:

ps -p <pid> -o pid,stat,etime,cmd
journalctl -u <service> -n 50 --no-pager

systemctl

systemctl status <service> --no-pager -l
systemctl is-active <service>
systemctl is-enabled <service>
systemctl list-units --type=service --state=running
systemctl list-units --failed
systemctl daemon-reload
systemctl restart <service>   # impact: confirm service interruption policy first

journalctl

journalctl -u <service> -n 100 --no-pager
journalctl -u <service> --since '30 min ago'
journalctl -p err -S today
journalctl -k -b
journalctl --disk-usage

Service Troubleshooting Flow

  1. Confirm service state and recent restart count.
  2. Read the last 100-200 journal lines.
  3. Validate config syntax before restart if the daemon supports it.
  4. Check dependent ports, mounts, credentials, and name resolution.
  5. Restart only after cause is understood or rollback exists.

Example:

systemctl status nginx --no-pager -l
journalctl -u nginx -n 100 --no-pager
nginx -t
ss -ltnp | grep -E ':(80|443)[[:space:]]'
curl -kI https://127.0.0.1/

CPU and Memory Diagnostics

uptime
top -H -b -n 1 | head -40
pidstat 1 5
pidstat -ru -p ALL 1 3
vmstat 1 5
iostat -xz 1 5
free -m
sar -q 1 5

Quick interpretation:

  • high %wa: storage path or NFS issue
  • high run queue with low CPU idle: CPU contention
  • swap growth plus page scans: memory pressure
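
The I/O wait rule above can be applied mechanically to captured vmstat output. A minimal sketch, assuming procps-style output whose column header line starts with r and contains a wa column; the function name flag_iowait and the threshold are illustrative, not part of any tool listed here. Columns are located by name because their positions differ between vmstat versions.

```shell
# flag_iowait LIMIT: read vmstat output on stdin and print every sample
# whose wa (I/O wait) column is at or above LIMIT percent.
# Hypothetical helper for illustration only.
flag_iowait() {
  awk -v limit="$1" '
    # Column header line: remember the position of each column by name.
    $1 == "r" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Data lines start with a number; compare the wa column with the limit.
    ("wa" in col) && $1 ~ /^[0-9]+$/ && $(col["wa"]) + 0 >= limit + 0 {
      print "high iowait: wa=" $(col["wa"]) " r=" $(col["r"])
    }'
}
```

Possible usage: vmstat 1 5 | flag_iowait 20. The first vmstat sample reports averages since boot, so later samples are the ones that reflect current load.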

Disk Usage

df -hT
du -xhd1 /var | sort -h
find /var/log -type f -size +500M -ls | sort -k7,7n
lsof +L1

Inode Exhaustion

df -ih
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
find /tmp -xdev -type f | wc -l

Mounts

mount | column -t
findmnt
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
cat /etc/fstab
mount -a   # can expose bad fstab entries; use in change window

Permissions

namei -l /path/to/file
stat /path/to/file
getfacl /path/to/file
chmod 640 /path/to/file
chown root:app /path/to/file

SELinux

State and mode:

getenforce
sestatus
cat /etc/selinux/config

Check file, process, and port context:

ls -Zd /var/www/html
ls -lZ /var/www/html/index.html
ps -eZ | grep nginx
id -Z
semanage port -l | grep http

Audit and denial review:

ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts today | audit2why
journalctl -t setroubleshoot --since '1 hour ago'
sealert -a /var/log/audit/audit.log

Typical flow:

  1. Confirm whether SELinux is in Enforcing or Permissive mode.
  2. Identify the failing path, process domain, and target context.
  3. Read AVC denials before changing labels or booleans.
  4. Prefer persistent policy-aligned fixes over chcon.
  5. Restore default labels and retest service path.

Modify and restore context:

chcon -t httpd_sys_content_t /srv/app/index.html              # temporary until relabel/restore
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads           # temporary until relabel/restore
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
restorecon -Rv /srv/app
matchpathcon /srv/app/uploads/file.txt

Booleans and validation:

getsebool -a | grep httpd
getsebool httpd_can_network_connect
setsebool -P httpd_can_network_connect on
runcon -t httpd_t -- id -Z

Notes:

  • prefer semanage fcontext plus restorecon for persistent fixes
  • use chcon only as a short-lived diagnostic or emergency workaround
  • avoid generating local policy modules from audit2allow until root cause is understood
  • after context changes, validate service startup, AVC silence, and application path access

Archives

tar tf backup.tar | head
tar czf logs-$(date +%F).tgz /var/log/app
tar xzf bundle.tgz -C /restore/path
gzip -t file.gz

File Operations

cp -a source/ target/
rsync -aHAXvn /src/ /dst/
rsync -aHAX --delete --info=progress2 /src/ /dst/   # impact: verify source/destination twice
mv file file.$(date +%F-%H%M%S).bak
sha256sum file

Text Processing & Regex

Core Tools

grep -n 'ERROR' app.log
grep -E 'ERROR|WARN' app.log
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
awk '{print $1,$4,$5}' access.log
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
sed -n '1,20p' file
sed -E 's/[[:space:]]+/ /g' file
cut -d: -f1,7 /etc/passwd
sort file | uniq -c | sort -nr
xargs -r -n1 systemctl status < service-list.txt
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json

Regex Reference

IPv4                  \b(?:\d{1,3}\.){3}\d{1,3}\b
ISO timestamp         \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
UUID                  \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
Log level             \b(?:ERROR|WARN|INFO)\b
Failed SSH            Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
Ansible changed/fail  ^(changed|fatal|failed):\s+\[[^]]+\]
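
The IPv4 pattern in this table only checks the shape of the address, so a string such as 999.999.999.999 also matches. Where the octet range matters, a stricter check can be layered on top. A minimal sketch; the function name is_ipv4 is hypothetical.

```shell
# is_ipv4 ADDRESS: succeed only for four dot-separated decimal octets in 0..255.
# Hypothetical helper for illustration only.
is_ipv4() {
  # Shape check first: exactly four groups of one to three digits.
  printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$' || return 1
  # Range check: split on dots, then reject any octet above 255.
  old_ifs=$IFS
  IFS=.
  set -- $1
  IFS=$old_ifs
  for octet in "$@"; do
    [ "$octet" -le 255 ] || return 1
  done
  return 0
}
```

Possible usage: is_ipv4 "$addr" || echo "not an IPv4 address: $addr" >&2.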

Log Parsing Examples

IP extraction:

grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head

Timestamp filter:

grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log

UUID extraction:

grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u

ERROR/WARN/INFO parsing:

grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c

Failed SSH login parsing:

grep 'Failed password' /var/log/secure \
| awk '{print $(NF-5),$(NF-3)}' \
| sort | uniq -c | sort -nr | head
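
Counting fields from the end of the line depends on the exact sshd message layout. An alternative is to anchor on the message text itself, as the Failed SSH pattern in the regex reference does. A minimal sketch, assuming IPv4 sources only (IPv6 lines are skipped); the function name failed_ssh_pairs is hypothetical.

```shell
# failed_ssh_pairs: read sshd log lines on stdin and print "user ip"
# for each failed password attempt from an IPv4 source.
# Hypothetical helper for illustration only.
failed_ssh_pairs() {
  # Group 2 is the username, group 3 the source address.
  sed -nE 's/.*Failed password for (invalid user )?([^ ]+) from (([0-9]{1,3}\.){3}[0-9]{1,3}) port .*/\2 \3/p'
}
```

Possible usage: failed_ssh_pairs < /var/log/secure | sort | uniq -c | sort -nr | head.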

Extract fields from logs:

awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log

Filter Ansible output:

grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
grep -E '^fatal:|^failed:' ansible.log

Incident Response

Disk Full

Workflow:

df -hT
df -ih
findmnt
du -xhd1 /var | sort -h
find /var -xdev -type f -size +1G -ls | sort -k7,7n
lsof +L1
journalctl --disk-usage

Typical branches:

  • filesystem full: identify growth path, compress/rotate/archive, validate app behavior
  • inode full: remove file storms, spool buildup, temp-file leaks
  • deleted open files: restart offender only after sizing impact

Post-check:

df -hT
df -ih
systemctl --failed

High CPU

uptime
mpstat -P ALL 1 5
pidstat -u -p ALL 1 5
top -H -b -n 1 | head -40
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20

Flow:

  1. Confirm sustained load, not a short spike.
  2. Separate user CPU vs system CPU vs I/O wait.
  3. Identify hot process and hot threads.
  4. Correlate with deploys, cron, backups, or JVM GC.
  5. Throttle, stop, or fail over only with service impact understood.

Memory Pressure

free -m
vmstat 1 5
sar -r 1 5
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
dmesg -T | grep -Ei 'oom|killed process'

Flow:

  1. Check swap growth and page scan rates.
  2. Identify top RSS owners.
  3. Check kernel logs for OOM.
  4. Validate cache vs real process growth.
  5. Restart leaking service only after capturing evidence.
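
For step 4, MemAvailable in /proc/meminfo already discounts reclaimable cache, so it is a better pressure signal than the free column alone. A minimal sketch that turns it into a percentage; the function name mem_available_pct is hypothetical.

```shell
# mem_available_pct: read /proc/meminfo-style text on stdin and print
# MemAvailable as a whole-number percentage of MemTotal.
# Hypothetical helper for illustration only.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%d\n", avail * 100 / total
      else exit 1   # no MemTotal line seen
    }'
}
```

Possible usage: mem_available_pct < /proc/meminfo.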

Failed Service

systemctl status <service> --no-pager -l
journalctl -u <service> -b --no-pager | tail -100
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp

Flow:

  1. Validate config.
  2. Validate credentials, ports, mounts, permissions.
  3. Confirm dependency availability.
  4. Restart and recheck logs immediately.

SELinux Denials

Typical case: service works in Permissive, fails in Enforcing, or logs show permission denied while UNIX permissions look correct.

Triage:

getenforce
sestatus
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts recent | audit2why
journalctl -t setroubleshoot --since '30 min ago'
systemctl status <service> --no-pager -l
ps -eZ | grep <service>
ls -lZ /path/to/app /path/to/app/*

Flow:

  1. Confirm the failure is current and reproducible.
  2. Identify the denied process domain, target path, and requested access from AVC logs.
  3. Validate expected default context with matchpathcon.
  4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
  5. Apply the smallest persistent fix, then retest in Enforcing.

Common fixes:

matchpathcon /srv/app/config.yml
restorecon -Rv /srv/app
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
semanage port -l | grep http
getsebool -a | grep httpd
setsebool -P httpd_can_network_connect on

Validation:

getenforce
systemctl restart <service>
systemctl status <service> --no-pager -l
ausearch -m AVC -ts recent
curl -fsS http://127.0.0.1:<port>/health

Operational notes:

  • do not leave systems in Permissive as the fix
  • prefer restorecon and semanage fcontext over repeated chcon
  • treat audit2allow output as investigation material, not automatic remediation
  • if policy changes are unavoidable, document exact AVC evidence and rollback path

SSL Issues

openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
curl -vkI https://host/

Check for:

  • expired certificate
  • missing SAN
  • incomplete chain
  • hostname mismatch
  • TLS version or cipher mismatch

DNS Issues

dig +short app.example.com
dig @<resolver> app.example.com
dig +trace app.example.com
getent hosts app.example.com
resolvectl status

Flow:

  1. Compare resolver result with authoritative result.
  2. Check TTL and stale cache.
  3. Validate /etc/resolv.conf, local resolver, and search domains.
  4. Test from affected host and unaffected host.

Network Issues

ip addr
ip route
ss -tulpen
tcpdump -ni any host <peer> and port <port>
curl -sv http://host:port/health
mtr -rwzc 20 host

Flow:

  1. Interface/link state.
  2. Route and source IP selection.
  3. Listening socket on target.
  4. Firewall and security controls.
  5. Packet capture if app logs are inconclusive.

JVM / Tomcat Issues

ps -ef | grep -i tomcat
jcmd <pid> VM.flags
jstat -gcutil <pid> 1000 10
jstack <pid> | head -100
ss -ltnp | grep java
tail -100 /opt/tomcat/logs/catalina.out

Focus:

  • stuck threads
  • full GC loops
  • heap exhaustion
  • connector bind failures
  • slow backend dependency

Certificate Expiration

echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -enddate

openssl x509 -checkend 2592000 -noout -in cert.pem
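
The -checkend argument is in seconds, so 2592000 above corresponds to 30 days (30 × 86400). A small conversion helper keeps that arithmetic visible; the function name days_to_seconds is hypothetical.

```shell
# days_to_seconds DAYS: print the number of seconds in DAYS whole days.
# Hypothetical helper for illustration only.
days_to_seconds() {
  echo $(( $1 * 86400 ))
}
```

Possible usage: openssl x509 -checkend "$(days_to_seconds 30)" -noout -in cert.pem, which exits non-zero if the certificate expires within that window.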

Suspicious Login Attempts

last -ai | head -30
lastb -ai | head -30
grep 'Failed password' /var/log/secure | tail -50
grep 'Accepted ' /var/log/secure | tail -50
ausearch -m USER_LOGIN -ts recent

Workflow:

  1. Identify source IPs and usernames.
  2. Validate whether attempts are expected from bastions/scanners.
  3. Check successful logins from same sources.
  4. Review sudo usage and persistence changes.
  5. Preserve logs before cleanup or rotation.

Networking Operations

ip -br addr
ip route get 8.8.8.8
ss -ltnp
ss -tn state established '( sport = :443 or dport = :443 )'
tcpdump -ni eth0 port 53
dig +short mx example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
mtr -rwzc 10 host
traceroute -T -p 443 host
openssl s_client -connect host:443 -servername host </dev/null

Storage Operations

Block and Filesystem Discovery

lsblk -f
blkid
findmnt
cat /proc/partitions
multipath -ll

LVM

pvs
vgs
lvs -a -o +devices
pvdisplay /dev/sdX
vgdisplay <vg>
lvdisplay /dev/<vg>/<lv>

Growth example:

pvcreate /dev/mapper/mpatha          # impact: write metadata
vgextend vgdata /dev/mapper/mpatha   # impact: changes VG layout
lvextend -L +100G -r /dev/vgdata/lvapp

XFS

xfs_info /mountpoint
xfs_repair -n /dev/mapper/vg-lv   # read-only check; filesystem must be unmounted
xfs_growfs /mountpoint

ext4

tune2fs -l /dev/mapper/vg-lv | head -40
e2fsck -fn /dev/mapper/vg-lv
resize2fs /dev/mapper/vg-lv

Multipath

multipath -ll
lsblk -S
udevadm info --query=all --name=/dev/mapper/mpatha | head -40

NFS

showmount -e nfs-server
nfsstat -m
mount | grep nfs
rpcinfo -p nfs-server

iSCSI

iscsiadm -m session
iscsiadm -m node
iscsiadm -m discovery -t sendtargets -p <target-ip>

Mount Troubleshooting

findmnt /mountpoint
mount -v /mountpoint
dmesg -T | tail -50
journalctl -k -n 100 --no-pager

Check:

  • device path stable
  • UUID correct
  • filesystem type correct
  • multipath settled
  • network and RPC available for NFS

Filesystem Validation

findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
df -hT /data
touch /data/.write-test && rm -f /data/.write-test

Migration Validation Example

findmnt /data
df -hT /data
rsync -aHAXvn /olddata/ /data/
rsync -aHAXc --delete --dry-run /olddata/ /data/
sha256sum /olddata/keyfile /data/keyfile

AIX Operations

oslevel -s
errpt | head
errpt -a | more
topas
lsvg -o
lsvg rootvg
lslpp -L | grep -i openssl
svmon -G
svmon -P <pid>
netstat -rn

SSL/TLS Operations

OpenSSL Checks

openssl version -a
openssl x509 -in cert.pem -noout -text | less
openssl rsa -in key.pem -check -noout   # -noout: do not print the private key
openssl verify -CAfile chain.pem cert.pem

Expiration Validation

openssl x509 -enddate -noout -in cert.pem
openssl x509 -checkend 604800 -noout -in cert.pem

keytool Basics

keytool -list -v -keystore keystore.jks
keytool -list -cacerts | grep -i <alias>
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks

Chain Validation

openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem

Automation Operations

Bash Safety Patterns

set -euo pipefail
IFS=$'\n\t'
trap 'echo "line ${LINENO}: command failed" >&2' ERR
trap 'rm -f "${tmpfile:-}"' EXIT

Safe loop examples:

while IFS= read -r host; do
  ssh -n "$host" uptime   # -n: keep ssh from consuming the rest of hostlist.txt on stdin
done < hostlist.txt

find /var/log -type f -name '*.gz' -print0 \
| while IFS= read -r -d '' file; do
    gzip -t "$file" || echo "corrupt archive: $file" >&2
  done

Operational scripting patterns:

  • default to read-only mode
  • require explicit --execute for changes
  • log actions with timestamps
  • validate dependencies with command -v
  • use temp files with mktemp
  • guard destructive paths and empty variables
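
Most of these patterns fit in a small set of helper functions. A minimal sketch, not a finished framework; every name in it (EXECUTE, log, require, run, parse_args, safe_rm_dir) is illustrative, and the list of protected paths is an assumption to adapt per environment.

```shell
#!/usr/bin/env bash
# Sketch of a read-only-by-default operations script.
# All names below are hypothetical and exist only for illustration.

EXECUTE=0   # read-only unless --execute is passed explicitly

# log MESSAGE...: timestamped (UTC) line on stderr.
log() {
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2
}

# require CMD...: fail early if any dependency is missing.
require() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { log "missing dependency: $cmd"; return 1; }
  done
}

# run CMD ARGS...: execute only when EXECUTE=1, otherwise print what would run.
run() {
  if [ "$EXECUTE" -eq 1 ]; then
    log "EXEC: $*"
    "$@"
  else
    echo "DRY-RUN: $*"
  fi
}

# parse_args ARGS...: only --execute switches the script out of read-only mode.
parse_args() {
  for arg in "$@"; do
    case "$arg" in
      --execute) EXECUTE=1 ;;
      *) log "unknown argument: $arg"; return 2 ;;
    esac
  done
}

# safe_rm_dir DIR: refuse empty values and a few obviously destructive paths.
safe_rm_dir() {
  case "${1:-}" in
    ""|/|/etc|/var|/usr|/home) log "refusing to remove '${1:-}'"; return 1 ;;
  esac
  run rm -rf -- "$1"
}
```

A script built on this would call parse_args "$@" and require for its tools first, create scratch files with mktemp, and route every state-changing command through run so that a plain invocation only prints the plan.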

Ansible Operations

Execution

ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --list | jq '.'
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'

Safe Rollout Workflow

  1. Validate inventory and variable targeting.
  2. Run syntax-check.
  3. Run --check --diff on a single host.
  4. Execute against one host or one tier.
  5. Validate service health, logs, and config.
  6. Expand rollout only after post-check passes.
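
Steps 2 to 4 can be chained so that a failure at any gate stops the rollout. A minimal sketch using only the ansible-playbook flags shown under Execution plus --list-hosts; the function name staged_rollout and the ANSIBLE_PLAYBOOK override are hypothetical, and a real run would likely pause for review between the dry run and the live run.

```shell
# staged_rollout INVENTORY PLAYBOOK CANARY: run each gate in order against one
# host and stop at the first failure.
# ANSIBLE_PLAYBOOK is a hypothetical override so the sequence can be rehearsed
# with a stand-in command instead of the real binary.
staged_rollout() {
  ap=${ANSIBLE_PLAYBOOK:-ansible-playbook}
  "$ap" -i "$1" "$2" --syntax-check                || return 1   # step 2
  "$ap" -i "$1" "$2" --limit "$3" --list-hosts     || return 1   # confirm resolved targets
  "$ap" -i "$1" "$2" --limit "$3" --check --diff   || return 1   # step 3
  "$ap" -i "$1" "$2" --limit "$3"                  || return 1   # step 4
  echo "canary $3 done: validate service health and logs before widening --limit"
}
```

Possible usage: staged_rollout inventory/hosts.yml playbooks/site.yml web01.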

Rollback mindset:

  • keep before/after config copies
  • know which tasks restart services
  • define manual backout if package/config changes fail
  • avoid broad --limit mistakes by reviewing resolved host list first

Monitoring & Observability

Zabbix Checks

systemctl status zabbix-agent2 --no-pager
zabbix_agent2 -t 'vfs.fs.size[/,free]'
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log

ELK Log Workflows

grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
journalctl -u filebeat -n 100 --no-pager
curl -s 'http://localhost:9200/_cluster/health?pretty'

Grafana Checks

curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error' /var/log/grafana/grafana.log | tail -50

Health Endpoints and Alert Validation

curl -fsS http://app:8080/health
curl -fsS http://app:8080/metrics | head

False positive validation:

  1. Compare alert timestamp with deploy/change window.
  2. Confirm on-host evidence, not only dashboard data.
  3. Check collector lag, scrape failures, and stale metrics.
  4. Validate from a second source before escalating.

Operational Habits

Pre-checks

  • capture time, hostname, and operator
  • capture current config and service state
  • check recent alerts, maintenance windows, and dependencies
  • confirm backup or rollback path exists

Post-checks

  • validate service state
  • validate logs for fresh errors
  • validate client path, ports, and name resolution
  • compare metrics before/after

Rollback Thinking

  • define exact backout trigger before change
  • prefer reversible steps
  • keep config backups with timestamps
  • avoid bundling unrelated changes
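
Timestamped config backups are easy to make routine with a one-function wrapper. A minimal sketch, assuming a cp that supports -a as used elsewhere in this cheatsheet; the function name backup_config is hypothetical.

```shell
# backup_config FILE: copy FILE to FILE.<UTC timestamp>.bak, preserving mode,
# ownership, and timestamps, and print the backup path.
# Hypothetical helper for illustration only.
backup_config() {
  [ -f "$1" ] || { echo "not a regular file: $1" >&2; return 1; }
  backup="$1.$(date -u +%Y%m%dT%H%M%SZ).bak"
  cp -a -- "$1" "$backup" && printf '%s\n' "$backup"
}
```

Possible usage: backup_config /etc/nginx/nginx.conf before editing; the printed path is the file to restore from if the backout trigger fires.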

Change Validation

systemctl is-active <service>
curl -fsS http://127.0.0.1:<port>/health
ss -ltnp | grep :<port>
journalctl -u <service> -S '5 min ago' --no-pager

Operational Communication

  • state scope, risk, and expected impact before action
  • record start and stop times in UTC
  • document what changed, what was checked, and remaining risk
  • escalate with evidence, not assumptions

Evidence Collection During Incidents

incident_dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "$incident_dir"
journalctl -b > "$incident_dir/journal.txt"
ss -tulpen > "$incident_dir/sockets.txt"
df -hT > "$incident_dir/df.txt"
free -m > "$incident_dir/free.txt"