# Production Operations Cheatsheet

Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.

## Linux / Unix Daily Operations

### Uptime and Host State

Check host age, kernel, clock, and recent reboot history before touching anything:

```bash
uptime
uname -r
hostnamectl
timedatectl
who -b
last -x | head -20
```

Pre-check pattern:

```bash
date -u
uptime
df -h
free -m
systemctl --failed
```

### Process Management

```bash
ps -ef | head
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pgrep -a java
pstree -ap | less
pidof sshd
renice +5 -p <pid>
kill -TERM <pid>
kill -9 <pid>   # DANGEROUS: last resort only
```

Validation:

```bash
ps -p <pid> -o pid,stat,etime,cmd
journalctl -u <service> -n 50 --no-pager
```

### systemctl

```bash
systemctl status <service> --no-pager -l
systemctl is-active <service>
systemctl is-enabled <service>
systemctl list-units --type=service --state=running
systemctl list-units --failed
systemctl daemon-reload
systemctl restart <service>   # impact: confirm service interruption policy first
```

### journalctl

```bash
journalctl -u <service> -n 100 --no-pager
journalctl -u <service> --since '30 min ago'
journalctl -p err -S today
journalctl -k -b
journalctl --disk-usage
```

### Service Troubleshooting Flow

1. Confirm service state and recent restart count.
2. Read the last 100-200 journal lines.
3. Validate config syntax before restart if the daemon supports it.
4. Check dependent ports, mounts, credentials, and name resolution.
5. Restart only after cause is understood or rollback exists.

Example:

```bash
systemctl status nginx --no-pager -l
journalctl -u nginx -n 100 --no-pager
nginx -t
ss -ltnp | grep ':80\|:443'
curl -kI https://127.0.0.1/
```

### CPU and Memory Diagnostics

```bash
uptime
top -H -b -n 1 | head -40
pidstat 1 5
pidstat -ru -p ALL 1 3
vmstat 1 5
iostat -xz 1 5
free -m
sar -q 1 5
```

Quick interpretation:

- high `%wa`: storage path or NFS issue
- high run queue with low CPU idle: CPU contention
- swap growth plus page scans: memory pressure

### Disk Usage

```bash
df -hT
du -xhd1 /var | sort -h
find /var/log -type f -size +500M -ls | sort -k7,7n
lsof +L1
```

### Inode Exhaustion

```bash
df -ih
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
find /tmp -xdev -type f | wc -l
```

### Mounts

```bash
mount | column -t
findmnt
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
cat /etc/fstab
mount -a   # can expose bad fstab entries; use in change window
```

### Permissions

```bash
namei -l /path/to/file
stat /path/to/file
getfacl /path/to/file
chmod 640 /path/to/file
chown root:app /path/to/file
```

### SELinux

State and mode:

```bash
getenforce
sestatus
cat /etc/selinux/config
```

Check file, process, and port context:

```bash
ls -Zd /var/www/html
ls -lZ /var/www/html/index.html
ps -eZ | grep nginx
id -Z
semanage port -l | grep http
```

Audit and denial review:

```bash
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts today | audit2why
journalctl -t setroubleshoot --since '1 hour ago'
sealert -a /var/log/audit/audit.log
```

Typical flow:

1. Confirm SELinux mode is `Enforcing` or `Permissive`.
2. Identify the failing path, process domain, and target context.
3. Read AVC denials before changing labels or booleans.
4. Prefer persistent policy-aligned fixes over `chcon`.
5. Restore default labels and retest service path.

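
Steps 2 and 3 of the flow above come down to reading a few fields out of each AVC record. A minimal sketch, assuming the usual auditd key=value layout (`comm=`, `scontext=`, `tcontext=`, `tclass=`); the helper name `avc_summary` is hypothetical and not part of any SELinux tooling:

```bash
# avc_summary: read audit records on stdin and print
# "comm scontext tcontext tclass" for every AVC denial.
# Hypothetical helper; assumes auditd-style key=value fields.
avc_summary() {
  awk '/avc: +denied/ {
    comm = sctx = tctx = tcls = "-"
    for (i = 1; i <= NF; i++) {
      n = index($i, "=")
      if (n == 0) continue
      key = substr($i, 1, n - 1)
      val = substr($i, n + 1)
      gsub(/"/, "", val)              # comm="nginx" -> nginx
      if (key == "comm") comm = val
      else if (key == "scontext") sctx = val
      else if (key == "tcontext") tctx = val
      else if (key == "tclass") tcls = val
    }
    print comm, sctx, tctx, tcls
  }'
}

# Usage: count distinct denials before touching labels or booleans
# ausearch -m AVC -ts recent 2>/dev/null | avc_summary | sort | uniq -c | sort -nr
```

Grouping identical denials first makes it easier to tell one mislabeled directory from several unrelated policy problems.
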
Modify and restore context:

```bash
chcon -t httpd_sys_content_t /srv/app/index.html       # temporary until relabel/restore
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads    # temporary until relabel/restore
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
restorecon -Rv /srv/app
matchpathcon /srv/app/uploads/file.txt
```

Booleans and validation:

```bash
getsebool -a | grep httpd
getsebool httpd_can_network_connect
setsebool -P httpd_can_network_connect on
runcon -t httpd_t -- id -Z
```

Notes:

- prefer `semanage fcontext` plus `restorecon` for persistent fixes
- use `chcon` only as a short-lived diagnostic or emergency workaround
- avoid generating local policy modules from `audit2allow` until root cause is understood
- after context changes, validate service startup, AVC silence, and application path access

### Archives

```bash
tar tf backup.tar | head
tar czf logs-$(date +%F).tgz /var/log/app
tar xzf bundle.tgz -C /restore/path
gzip -t file.gz
```

### File Operations

```bash
cp -a source/ target/
rsync -aHAXvn /src/ /dst/
rsync -aHAX --delete --info=progress2 /src/ /dst/   # impact: verify source/destination twice
mv file file.$(date +%F-%H%M%S).bak
sha256sum file
```

## Text Processing & Regex

### Core Tools

```bash
grep -n 'ERROR' app.log
grep -E 'ERROR|WARN' app.log
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
awk '{print $1,$4,$5}' access.log
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
sed -n '1,20p' file
sed -E 's/[[:space:]]+/ /g' file
cut -d: -f1,7 /etc/passwd
sort file | uniq -c | sort -nr
xargs -r -n1 systemctl status < service-list.txt
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
```

### Regex Reference

```text
IPv4                  \b(?:\d{1,3}\.){3}\d{1,3}\b
ISO timestamp         \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
UUID                  \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
Log level             \b(?:ERROR|WARN|INFO)\b
Failed SSH            Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
Ansible changed/fail  ^(changed|fatal|failed):\s+\[[^]]+\]
```

### Log Parsing Examples

IP extraction:

```bash
grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
```

Timestamp filter:

```bash
grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
```

UUID extraction:

```bash
grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
```

ERROR/WARN/INFO parsing:

```bash
grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
```

Failed SSH login parsing:

```bash
# Counting from the end of the line: $(NF-5) is the user, $(NF-3) the source IP
grep 'Failed password' /var/log/secure \
  | awk '{print $(NF-5),$(NF-3)}' \
  | sort | uniq -c | sort -nr | head
```

Extract fields from logs:

```bash
awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
```

Filter Ansible output:

```bash
grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
grep -E '^fatal:|^failed:' ansible.log
```

## Incident Response

### Disk Full

Workflow:

```bash
df -hT
df -ih
findmnt
du -xhd1 /var | sort -h
find /var -xdev -type f -size +1G -ls | sort -k7,7n
lsof +L1
journalctl --disk-usage
```

Typical branches:

- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
- inode full: remove file storms, spool buildup, temp-file leaks
- deleted open files: restart offender only after sizing impact

Post-check:

```bash
df -hT
df -ih
systemctl --failed
```

### High CPU

```bash
uptime
mpstat -P ALL 1 5
pidstat -u -p ALL 1 5
top -H -b -n 1 | head -40
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
```

Flow:

1. Confirm sustained load, not a short spike.
2. Separate user CPU vs system CPU vs I/O wait.
3. Identify hot process and hot threads.
4. Correlate with deploys, cron, backups, or JVM GC.
5. Throttle, stop, or fail over only with service impact understood.

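
Step 1 of the High CPU flow ("sustained load, not a short spike") can be approximated from the three load averages. A minimal sketch: the function name `load_state` and the threshold (per-core load above 1.0 on all three averages means sustained) are assumptions chosen for illustration, not a universal standard:

```bash
# load_state LOAD1 LOAD5 LOAD15 CORES
# Prints "sustained", "spike", or "normal".
# Illustrative rule of thumb only: per-core load > 1.0 counts as busy.
load_state() {
  awk -v l1="$1" -v l5="$2" -v l15="$3" -v n="$4" 'BEGIN {
    if (l1 / n > 1 && l5 / n > 1 && l15 / n > 1) print "sustained"
    else if (l1 / n > 1)                         print "spike"
    else                                         print "normal"
  }'
}

# Usage on a live Linux host:
# read -r l1 l5 l15 _ < /proc/loadavg
# load_state "$l1" "$l5" "$l15" "$(nproc)"
```

A "spike" result is a cue to re-sample with `mpstat` or `pidstat` before acting. Load average also counts tasks in uninterruptible I/O wait, so it does not replace step 2.
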
### Memory Pressure

```bash
free -m
vmstat 1 5
sar -r 1 5
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
dmesg -T | grep -Ei 'oom|killed process'
```

Flow:

1. Check swap growth and page scan rates.
2. Identify top RSS owners.
3. Check kernel logs for OOM.
4. Validate cache vs real process growth.
5. Restart leaking service only after capturing evidence.

### Failed Service

```bash
systemctl status <service> --no-pager -l
journalctl -u <service> -b --no-pager | tail -100
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
```

Flow:

1. Validate config.
2. Validate credentials, ports, mounts, permissions.
3. Confirm dependency availability.
4. Restart and recheck logs immediately.

### SELinux Denials

Typical case: service works in `Permissive`, fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct.

Triage:

```bash
getenforce
sestatus
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts recent | audit2why
journalctl -t setroubleshoot --since '30 min ago'
systemctl status <service> --no-pager -l
ps -eZ | grep <process>
ls -lZ /path/to/app /path/to/app/*
```

Flow:

1. Confirm the failure is current and reproducible.
2. Identify the denied process domain, target path, and requested access from AVC logs.
3. Validate expected default context with `matchpathcon`.
4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
5. Apply the smallest persistent fix, then retest in `Enforcing`.

Common fixes:

```bash
matchpathcon /srv/app/config.yml
restorecon -Rv /srv/app
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
semanage port -l | grep http
getsebool -a | grep httpd
setsebool -P httpd_can_network_connect on
```

Validation:

```bash
getenforce
systemctl restart <service>
systemctl status <service> --no-pager -l
ausearch -m AVC -ts recent
curl -fsS http://127.0.0.1:<port>/health
```

Operational notes:

- do not leave systems in `Permissive` as the fix
- prefer `restorecon` and `semanage fcontext` over repeated `chcon`
- treat `audit2allow` output as investigation material, not automatic remediation
- if policy changes are unavoidable, document exact AVC evidence and rollback path

### SSL Issues

```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
```

### DNS Issues

```bash
dig app.example.com
dig +trace app.example.com
getent hosts app.example.com
resolvectl status
```

Flow:

1. Compare resolver result with authoritative result.
2. Check TTL and stale cache.
3. Validate `/etc/resolv.conf`, local resolver, and search domains.
4. Test from affected host and unaffected host.

### Network Issues

```bash
ip addr
ip route
ss -tulpen
tcpdump -ni any host <ip> and port <port>
curl -sv http://host:port/health
mtr -rwzc 20 host
```

Flow:

1. Interface/link state.
2. Route and source IP selection.
3. Listening socket on target.
4. Firewall and security controls.
5. Packet capture if app logs are inconclusive.

### JVM / Tomcat Issues

```bash
ps -ef | grep -i tomcat
jcmd <pid> VM.flags
jstat -gcutil <pid> 1000 10
jstack <pid> | head -100
ss -ltnp | grep java
tail -100 /opt/tomcat/logs/catalina.out
```

Focus:

- stuck threads
- full GC loops
- heap exhaustion
- connector bind failures
- slow backend dependency

### Certificate Expiration

```bash
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
  | openssl x509 -noout -enddate
openssl x509 -checkend 2592000 -noout -in cert.pem
```

### Suspicious Login Attempts

```bash
last -ai | head -30
lastb -ai | head -30
grep 'Failed password' /var/log/secure | tail -50
grep 'Accepted ' /var/log/secure | tail -50
ausearch -m USER_LOGIN -ts recent
```

Workflow:

1. Identify source IPs and usernames.
2. Validate whether attempts are expected from bastions/scanners.
3. Check successful logins from same sources.
4. Review sudo usage and persistence changes.
5. Preserve logs before cleanup or rotation.

## Networking Operations

```bash
ip -br addr
ip route get 8.8.8.8
ss -ltnp
ss -tn state established '( sport = :443 or dport = :443 )'
tcpdump -ni eth0 port 53
dig +short mx example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
mtr -rwzc 10 host
traceroute -T -p 443 host
openssl s_client -connect host:443 -servername host </dev/null
```

## Storage Operations

### LVM

```bash
lvdisplay /dev/<vg>/<lv>
```

Growth example:

```bash
pvcreate /dev/mapper/mpatha             # impact: writes metadata
vgextend vgdata /dev/mapper/mpatha      # impact: changes VG layout
lvextend -L +100G -r /dev/vgdata/lvapp
```

### XFS

```bash
xfs_info /mountpoint
xfs_repair -n /dev/mapper/vg-lv
xfs_growfs /mountpoint
```

### ext4

```bash
tune2fs -l /dev/mapper/vg-lv | head -40
e2fsck -fn /dev/mapper/vg-lv
resize2fs /dev/mapper/vg-lv
```

### Multipath

```bash
multipath -ll
lsblk -S
udevadm info --query=all --name=/dev/mapper/mpatha | head -40
```

### NFS

```bash
showmount -e nfs-server
nfsstat -m
mount | grep nfs
rpcinfo -p nfs-server
```

### iSCSI

```bash
iscsiadm -m session
iscsiadm -m node
iscsiadm -m discovery -t sendtargets -p <portal>
```

### Mount Troubleshooting

```bash
findmnt /mountpoint
mount -v /mountpoint
dmesg -T | tail -50
journalctl -k -n 100 --no-pager
```

Check:

- device path stable
- UUID correct
- filesystem type correct
- multipath settled
- network and RPC available for NFS

### Filesystem Validation

```bash
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
df -hT /data
touch /data/.write-test && rm -f /data/.write-test
```

### Migration Validation Example

```bash
findmnt /data
df -hT /data
rsync -aHAXvn /olddata/ /data/
rsync -aHAXc --delete --dry-run /olddata/ /data/
sha256sum /olddata/keyfile /data/keyfile
```

## AIX Operations

```bash
oslevel -s
errpt | head
errpt -a | more
topas
lsvg -o
lsvg rootvg
lslpp -L | grep -i openssl
svmon -G
svmon -P
netstat -rn
```

## SSL/TLS Operations

### OpenSSL Checks

```bash
openssl version -a
openssl x509 -in cert.pem -noout -text | less
openssl rsa -in key.pem -check -noout
openssl verify -CAfile chain.pem cert.pem
```

### Expiration Validation

```bash
openssl x509 -enddate -noout -in cert.pem
openssl x509 -checkend 604800 -noout -in cert.pem
```

### keytool Basics

```bash
keytool -list -v -keystore keystore.jks
keytool -list -cacerts | grep -i <alias>
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
```

### Chain Validation

```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
```

## Bash Scripting

```bash
trap 'echo "error at line $LINENO" >&2' ERR
trap 'rm -f "${tmpfile:-}"' EXIT
```

Safe loop examples:

```bash
# ssh -n keeps ssh from consuming the rest of hostlist.txt on stdin
while IFS= read -r host; do
  ssh -n "$host" uptime
done < hostlist.txt

find /var/log -type f -name '*.gz' -print0 \
  | while IFS= read -r -d '' file; do
      gzip -t "$file"
    done
```

Operational scripting patterns:

- default to read-only mode
- require explicit `--execute` for changes
- log actions with timestamps
- validate dependencies with `command -v`
- use temp files with `mktemp`
- guard destructive paths and empty variables

## Ansible Operations

### Execution

```bash
ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --list | jq '.'
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
```

### Safe Rollout Workflow

1. Validate inventory and variable targeting.
2. Run syntax-check.
3. Run `--check --diff` on a single host.
4. Execute against one host or one tier.
5. Validate service health, logs, and config.
6. Expand rollout only after post-check passes.

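
Steps 2 to 4 of the rollout workflow can be chained in a small wrapper that stays read-only unless told otherwise. A minimal sketch, assuming the inventory and playbook paths from the Execution block; the function name `rollout_plan`, the canary-host argument, and the `--execute` switch are hypothetical conventions of this sketch, not Ansible features:

```bash
# rollout_plan CANARY_HOST [--execute]
# Prints the syntax-check, check-mode, and canary commands in order.
# Runs them only when --execute is passed; stops at the first failure.
rollout_plan() {
  local canary="$1" mode="${2:-}"
  local base="ansible-playbook -i inventory/hosts.yml playbooks/site.yml"
  printf '%s\n' \
    "$base --syntax-check" \
    "$base --check --diff --limit $canary" \
    "$base --limit $canary" |
  while IFS= read -r step; do
    if [ "$mode" = "--execute" ]; then
      printf '%s RUN %s\n' "$(date -u +%FT%TZ)" "$step"
      # Word splitting is intentional: steps contain no quoted arguments.
      $step </dev/null || exit 1
    else
      printf 'DRY %s\n' "$step"
    fi
  done
}

# Usage:
# rollout_plan web01             # review the plan
# rollout_plan web01 --execute   # run it against the canary only
```

Service health validation and wider expansion (steps 5 and 6) are left to the operator on purpose, so the wrapper never goes beyond a single host.
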
Rollback mindset:

- keep before/after config copies
- know which tasks restart services
- define manual backout if package/config changes fail
- avoid broad `--limit` mistakes by reviewing resolved host list first

## Monitoring & Observability

### Zabbix Checks

```bash
systemctl status zabbix-agent2 --no-pager
zabbix_agent2 -t 'vfs.fs.size[/,free]'   # quoted so the shell does not glob the brackets
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
```

### ELK Log Workflows

```bash
grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
journalctl -u filebeat -n 100 --no-pager
curl -s 'http://localhost:9200/_cluster/health?pretty'
```

### Grafana Checks

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error' /var/log/grafana/grafana.log | tail -50
```

### Health Endpoints and Alert Validation

```bash
curl -fsS http://app:8080/health
curl -fsS http://app:8080/metrics | head
```

False positive validation:

1. Compare alert timestamp with deploy/change window.
2. Confirm on-host evidence, not only dashboard data.
3. Check collector lag, scrape failures, and stale metrics.
4. Validate from a second source before escalating.

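
Step 1 of the false positive check is a plain interval comparison once the timestamps are in epoch seconds. A minimal sketch; the function name `alert_in_window` and the optional grace period are assumptions made for illustration:

```bash
# alert_in_window ALERT_EPOCH START_EPOCH END_EPOCH [GRACE_SECONDS]
# Exit 0 when the alert fired inside the change window, extended by an
# optional grace period after the window closes (default 0); exit 1 otherwise.
alert_in_window() {
  local alert="$1" start="$2" end="$3" grace="${4:-0}"
  [ "$alert" -ge "$start" ] && [ "$alert" -le $((end + grace)) ]
}

# Usage, assuming GNU date is available for the conversion:
# alert=$(date -u -d '2024-05-01T10:07:00Z' +%s)
# start=$(date -u -d '2024-05-01T10:00:00Z' +%s)
# end=$(date -u -d '2024-05-01T10:30:00Z' +%s)
# alert_in_window "$alert" "$start" "$end" 300 && echo "inside change window"
```

An alert inside the window is only a hint that the change caused it; steps 2 to 4 still apply before closing it as a false positive.
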
## Operational Habits

### Pre-checks

- capture time, hostname, and operator
- capture current config and service state
- check recent alerts, maintenance windows, and dependencies
- confirm backup or rollback path exists

### Post-checks

- validate service state
- validate logs for fresh errors
- validate client path, ports, and name resolution
- compare metrics before/after

### Rollback Thinking

- define exact backout trigger before change
- prefer reversible steps
- keep config backups with timestamps
- avoid bundling unrelated changes

### Change Validation

```bash
systemctl is-active <service>
curl -fsS http://127.0.0.1:<port>/health
ss -ltnp | grep :<port>
journalctl -u <service> -S '5 min ago' --no-pager
```

### Operational Communication

- state scope, risk, and expected impact before action
- record start and stop times in UTC
- document what changed, what was checked, and remaining risk
- escalate with evidence, not assumptions

### Evidence Collection During Incidents

```bash
# Keep the directory in a variable: a glob in a redirect target is ambiguous
# as soon as more than one incident directory exists.
dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "$dir"
journalctl -b > "$dir/journal.txt"
ss -tulpen > "$dir/sockets.txt"
df -hT > "$dir/df.txt"
free -m > "$dir/free.txt"
```
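
The "keep config backups with timestamps" habit from Rollback Thinking can be wrapped in a helper so the backup name is never typed by hand. A minimal sketch; the function name `backup_config` and the `.bak` naming layout are illustrative choices, not an established convention:

```bash
# backup_config FILE
# Copies FILE to FILE.<UTC timestamp>.bak, preserving mode and mtime,
# and prints the backup path. Refuses empty or missing paths.
backup_config() {
  local src="${1:-}"
  if [ -z "$src" ] || [ ! -f "$src" ]; then
    echo "backup_config: no such file: '$src'" >&2
    return 1
  fi
  local dst
  dst="$src.$(date -u +%Y%m%dT%H%M%SZ).bak"
  cp -p -- "$src" "$dst" && printf '%s\n' "$dst"
}

# Usage:
# bak=$(backup_config /etc/nginx/nginx.conf) || exit 1
# ... edit, validate, and if needed: cp -p -- "$bak" /etc/nginx/nginx.conf
```

Printing the backup path makes it easy to record the exact rollback source in the change notes.
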