mateusz/portfolio

Mateusz Suski 0d3905b8a1

Add operational cheatsheets across repository

2026-05-09 09:41:55 +00:00

Production Operations Cheatsheet

Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.

Linux / Unix Daily Operations

Uptime and Host State

Check host age, kernel, clock, and recent reboot history before touching anything:

uptime
uname -r
hostnamectl
timedatectl
who -b
last -x | head -20

Pre-check pattern:

date -u
uptime
df -h
free -m
systemctl --failed

Process Management

ps -ef | head
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pgrep -a java
pstree -ap | less
pidof sshd
renice +5 -p <pid>
kill -TERM <pid>
kill -9 <pid>   # DANGEROUS: last resort only

Validation:

ps -p <pid> -o pid,stat,etime,cmd
journalctl -u <service> -n 50 --no-pager

systemctl

systemctl status <service> --no-pager -l
systemctl is-active <service>
systemctl is-enabled <service>
systemctl list-units --type=service --state=running
systemctl list-units --failed
systemctl daemon-reload
systemctl restart <service>   # impact: confirm service interruption policy first

journalctl

journalctl -u <service> -n 100 --no-pager
journalctl -u <service> --since '30 min ago'
journalctl -p err -S today
journalctl -k -b
journalctl --disk-usage

Service Troubleshooting Flow

Confirm service state and recent restart count.
Read the last 100-200 journal lines.
Validate config syntax before restart if the daemon supports it.
Check dependent ports, mounts, credentials, and name resolution.
Restart only after cause is understood or rollback exists.

Example:

systemctl status nginx --no-pager -l
journalctl -u nginx -n 100 --no-pager
nginx -t
ss -ltnp | grep -E ':(80|443)[[:space:]]'   # anchored so :8080 or :4430 do not match
curl -kI https://127.0.0.1/
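
The "validate before restart" steps of the flow can be wrapped in a small guard. This is a minimal sketch, not part of the original cheatsheet: `restart_if_valid` is a hypothetical helper name, and it assumes the daemon ships a syntax check such as `nginx -t`.

```shell
# restart_if_valid SERVICE VALIDATE_CMD [ARGS...]
# Hypothetical helper: run the validation command first and restart
# the unit only when validation exits 0.
restart_if_valid() {
  local service=$1
  shift
  if "$@"; then
    systemctl restart "$service"
  else
    echo "validation failed; $service left untouched" >&2
    return 1
  fi
}
```

Usage: `restart_if_valid nginx nginx -t`. A failed syntax check returns non-zero and the running service is not touched.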

CPU and Memory Diagnostics

uptime
top -H -b -n 1 | head -40
pidstat 1 5
pidstat -ru -p ALL 1 3
vmstat 1 5
iostat -xz 1 5
free -m
sar -q 1 5

Quick interpretation:

high %wa: storage path or NFS issue
high run queue with low CPU idle: CPU contention
swap growth plus page scans: memory pressure
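
To put a number on these signals, a vmstat column can be averaged over the sample window. A minimal sketch, assuming the usual procps `vmstat` layout with a header row starting with `r`; `vmstat_avg` is a hypothetical helper name. The column is located by header name because newer versions append extra columns.

```shell
# vmstat_avg COLUMN
# Reads `vmstat` output on stdin and prints the average of one column
# (for example wa, r, si, so). Exits 1 if the column is not found.
vmstat_avg() {
  awk -v col="$1" '
    $1 == "r" { for (i = 1; i <= NF; i++) if ($i == col) idx = i; next }
    idx && $1 ~ /^[0-9]+$/ { sum += $idx; n++ }
    END { if (n) printf "%.1f\n", sum / n; else exit 1 }
  '
}
```

Usage: `vmstat 1 5 | vmstat_avg wa`. The first vmstat sample is an average since boot, so a longer window gives a more representative figure.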

Disk Usage

df -hT
du -xhd1 /var | sort -h
find /var/log -type f -size +500M -ls | sort -k7,7n
lsof +L1

Inode Exhaustion

df -ih
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
find /tmp -xdev -type f | wc -l

Mounts

mount | column -t
findmnt
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
cat /etc/fstab
mount -a   # can expose bad fstab entries; use in change window

Permissions

namei -l /path/to/file
stat /path/to/file
getfacl /path/to/file
chmod 640 /path/to/file
chown root:app /path/to/file

SELinux

State and mode:

getenforce
sestatus
cat /etc/selinux/config

Check file, process, and port context:

ls -Zd /var/www/html
ls -lZ /var/www/html/index.html
ps -eZ | grep nginx
id -Z
semanage port -l | grep http

Audit and denial review:

ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts today | audit2why
journalctl -t setroubleshoot --since '1 hour ago'
sealert -a /var/log/audit/audit.log

Typical flow:

Confirm the current SELinux mode (Enforcing, Permissive, or Disabled).
Identify the failing path, process domain, and target context.
Read AVC denials before changing labels or booleans.
Prefer persistent policy-aligned fixes over chcon.
Restore default labels and retest service path.

Modify and restore context:

chcon -t httpd_sys_content_t /srv/app/index.html              # temporary until relabel/restore
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads           # temporary until relabel/restore
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
restorecon -Rv /srv/app
matchpathcon /srv/app/uploads/file.txt

Booleans and validation:

getsebool -a | grep httpd
getsebool httpd_can_network_connect
setsebool -P httpd_can_network_connect on
runcon -t httpd_t -- id -Z

Notes:

prefer semanage fcontext plus restorecon for persistent fixes
use chcon only as a short-lived diagnostic or emergency workaround
avoid generating local policy modules from audit2allow until root cause is understood
after context changes, validate service startup, AVC silence, and application path access

File Operations

cp -a source/ target/
rsync -aHAXvn /src/ /dst/
rsync -aHAX --delete --info=progress2 /src/ /dst/   # impact: verify source/destination twice
mv file file.$(date +%F-%H%M%S).bak
sha256sum file

Text Processing & Regex

Core Tools

grep -n 'ERROR' app.log
grep -E 'ERROR|WARN' app.log
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
awk '{print $1,$4,$5}' access.log
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
sed -n '1,20p' file
sed -E 's/[[:space:]]+/ /g' file
cut -d: -f1,7 /etc/passwd
sort file | uniq -c | sort -nr
xargs -r -n1 systemctl status < service-list.txt
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json

Regex Reference

IPv4                  \b(?:\d{1,3}\.){3}\d{1,3}\b
ISO timestamp         \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
UUID                  \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
Log level             \b(?:ERROR|WARN|INFO)\b
Failed SSH            Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
Ansible changed/fail  ^(changed|fatal|failed):\s+\[[^]]+\]

Log Parsing Examples

IP extraction:

grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
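
The IPv4 pattern is deliberately loose and also matches strings such as `999.1.1.1`. A minimal sketch of a post-filter that keeps only candidates whose octets are in range; `valid_ipv4` is a hypothetical helper name.

```shell
# valid_ipv4
# Reads one candidate address per line and prints only those with
# exactly four numeric octets, each between 0 and 255.
valid_ipv4() {
  awk -F. 'NF == 4 {
    ok = 1
    for (i = 1; i <= 4; i++)
      if ($i !~ /^[0-9]+$/ || $i > 255) ok = 0
    if (ok) print
  }'
}
```

Usage: `grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | valid_ipv4 | sort | uniq -c | sort -nr | head`.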

Timestamp filter:

grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log

UUID extraction:

grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u

ERROR/WARN/INFO parsing:

grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c

Failed SSH login parsing:

grep 'Failed password' /var/log/secure \
| awk '{print $(NF-5),$(NF-3)}' \
| sort | uniq -c | sort -nr | head

Extract fields from logs:

awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log

Filter Ansible output:

grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
grep -E '^fatal:|^failed:' ansible.log

Incident Response

Disk Full

Workflow:

df -hT
df -ih
findmnt
du -xhd1 /var | sort -h
find /var -xdev -type f -size +1G -ls | sort -k7,7n
lsof +L1
journalctl --disk-usage

Typical branches:

filesystem full: identify growth path, compress/rotate/archive, validate app behavior
inode full: remove file storms, spool buildup, temp-file leaks
deleted open files: restart offender only after sizing impact

Post-check:

df -hT
df -ih
systemctl --failed
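
The post-check can be turned into a pass/fail gate by filtering `df` output against a threshold. A minimal sketch, assuming POSIX-format output (`df -P`) and mount points without spaces; `fs_over` is a hypothetical helper name.

```shell
# fs_over PERCENT
# Reads `df -P` (or `df -Pi`) output on stdin and prints the mount point
# and usage of every filesystem at or above PERCENT.
fs_over() {
  awk -v limit="$1" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit + 0) print $6, $5
  }'
}
```

Usage: `df -P | fs_over 90` for space, `df -Pi | fs_over 90` for inodes. Empty output means nothing is above the threshold.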

High CPU

uptime
mpstat -P ALL 1 5
pidstat -u -p ALL 1 5
top -H -b -n 1 | head -40
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20

Flow:

Confirm sustained load, not a short spike.
Separate user CPU vs system CPU vs I/O wait.
Identify hot process and hot threads.
Correlate with deploys, cron, backups, or JVM GC.
Throttle, stop, or fail over only with service impact understood.
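
For the first step of the flow, comparing the 1-minute and 15-minute load averages per core helps separate a spike from sustained load. A minimal sketch that reads a line in `/proc/loadavg` format; `load_trend` is a hypothetical helper name.

```shell
# load_trend CORES
# Reads one /proc/loadavg-style line on stdin, prints per-core load for
# the 1, 5 and 15 minute windows, and says whether load is rising.
load_trend() {
  awk -v cores="$1" '
    cores + 0 <= 0 { exit 1 }
    {
      printf "per-core 1m=%.2f 5m=%.2f 15m=%.2f ", $1 / cores, $2 / cores, $3 / cores
      if ($1 > $3) print "rising"
      else if ($1 < $3) print "falling"
      else print "steady"
    }'
}
```

Usage: `load_trend "$(nproc)" < /proc/loadavg`. A per-core value well above 1 across all three windows points to sustained contention rather than a burst.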

Memory Pressure

free -m
vmstat 1 5
sar -r 1 5
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
dmesg -T | grep -Ei 'oom|killed process'

Flow:

Check swap growth and page scan rates.
Identify top RSS owners.
Check kernel logs for OOM.
Validate cache vs real process growth.
Restart leaking service only after capturing evidence.
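
For the cache-versus-process-growth check, `MemAvailable` is usually a better signal than the free column because it accounts for reclaimable cache. A minimal sketch that reads `/proc/meminfo` format; `mem_available_pct` is a hypothetical helper name.

```shell
# mem_available_pct
# Reads /proc/meminfo-style input on stdin and prints MemAvailable as a
# percentage of MemTotal. Exits 1 if MemTotal is missing.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END { if (total) printf "%.1f\n", avail * 100 / total; else exit 1 }
  '
}
```

Usage: `mem_available_pct < /proc/meminfo`.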

Failed Service

systemctl status <service> --no-pager -l
journalctl -u <service> -b --no-pager | tail -100
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp

Flow:

Validate config.
Validate credentials, ports, mounts, permissions.
Confirm dependency availability.
Restart and recheck logs immediately.

SELinux Denials

Typical case: service works in Permissive, fails in Enforcing, or logs show permission denied while UNIX permissions look correct.

Triage:

getenforce
sestatus
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts recent | audit2why
journalctl -t setroubleshoot --since '30 min ago'
systemctl status <service> --no-pager -l
ps -eZ | grep <service>
ls -lZ /path/to/app /path/to/app/*

Flow:

Confirm the failure is current and reproducible.
Identify the denied process domain, target path, and requested access from AVC logs.
Validate expected default context with matchpathcon.
Check for mislabeled files, wrong port types, or missing SELinux booleans.
Apply the smallest persistent fix, then retest in Enforcing.
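
Step two of the flow (process domain, target type, requested access) can be pulled out of raw AVC records in one pass. A minimal sketch, assuming the common `avc: denied { ... } for ... comm="..." ... scontext=... tcontext=... tclass=...` record layout; `avc_summary` is a hypothetical helper name, and records with a different field order are skipped.

```shell
# avc_summary
# Reads raw audit records on stdin and prints a count per
# "command source_type -> target_type class { permissions }" tuple.
avc_summary() {
  sed -nE 's/.*avc: +denied +\{ ([^}]*) \} .* comm="([^"]*)".* scontext=[^:]*:[^:]*:([^:]*):.* tcontext=[^:]*:[^:]*:([^:]*):.* tclass=([^ ]*).*/\2 \3 -> \4 \5 { \1 }/p' \
    | sort | uniq -c | sort -nr
}
```

Usage: `ausearch -m AVC -ts recent --raw | avc_summary`.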

Common fixes:

matchpathcon /srv/app/config.yml
restorecon -Rv /srv/app
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
semanage port -l | grep http
getsebool -a | grep httpd
setsebool -P httpd_can_network_connect on

Validation:

getenforce
systemctl restart <service>
systemctl status <service> --no-pager -l
ausearch -m AVC -ts recent
curl -fsS http://127.0.0.1:<port>/health

Operational notes:

do not leave systems in Permissive as the fix
prefer restorecon and semanage fcontext over repeated chcon
treat audit2allow output as investigation material, not automatic remediation
if policy changes are unavoidable, document exact AVC evidence and rollback path

SSL Issues

openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
curl -vkI https://host/

Check for:

expired certificate
missing SAN
incomplete chain
hostname mismatch
TLS version or cipher mismatch

DNS Issues

dig +short app.example.com
dig @<resolver> app.example.com
dig +trace app.example.com
getent hosts app.example.com
resolvectl status

Flow:

Compare resolver result with authoritative result.
Check TTL and stale cache.
Validate /etc/resolv.conf, local resolver, and search domains.
Test from affected host and unaffected host.
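
Comparing two resolvers can be scripted so the result is a single match or mismatch line. A minimal sketch built on `dig +short`; `dns_compare` is a hypothetical helper name and the server arguments are placeholders.

```shell
# dns_compare NAME SERVER_A SERVER_B
# Queries both servers for NAME and compares the sorted answer sets.
# Returns 1 and prints both answers when they differ.
dns_compare() {
  local name=$1 first=$2 second=$3
  local a b
  a=$(dig +short "$name" "@$first" | sort)
  b=$(dig +short "$name" "@$second" | sort)
  if [ "$a" = "$b" ]; then
    echo "match $name"
  else
    echo "MISMATCH $name"
    printf '  %s: %s\n' "$first" "$(printf '%s' "$a" | tr '\n' ' ')"
    printf '  %s: %s\n' "$second" "$(printf '%s' "$b" | tr '\n' ' ')"
    return 1
  fi
}
```

Usage: `dns_compare app.example.com <resolver> <authoritative-ns>`. Round-robin records still compare equal because the answers are sorted first.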

Network Issues

ip addr
ip route
ss -tulpen
tcpdump -ni any host <peer> and port <port>
curl -sv http://host:port/health
mtr -rwzc 20 host

Flow:

Interface/link state.
Route and source IP selection.
Listening socket on target.
Firewall and security controls.
Packet capture if app logs are inconclusive.

JVM / Tomcat Issues

ps -ef | grep -i tomcat
jcmd <pid> VM.flags
jstat -gcutil <pid> 1000 10
jstack <pid> | head -100
ss -ltnp | grep java
tail -100 /opt/tomcat/logs/catalina.out

Focus:

stuck threads
full GC loops
heap exhaustion
connector bind failures
slow backend dependency
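
A full GC loop shows up as the FGC counter climbing between `jstat` samples. A minimal sketch that reports the increase over the sampled window; `fgc_delta` is a hypothetical helper name, and the column is found by header name because the layout differs between JDK versions.

```shell
# fgc_delta
# Reads `jstat -gcutil` output on stdin and prints how many full GCs
# happened between the first and last sample. Exits 1 without data.
fgc_delta() {
  awk '
    NR == 1 { for (i = 1; i <= NF; i++) if ($i == "FGC") col = i; next }
    col && NF >= col {
      if (!seen) { first = $col; seen = 1 }
      last = $col
    }
    END { if (!seen) exit 1; print last - first }
  '
}
```

Usage: `jstat -gcutil <pid> 1000 10 | fgc_delta`. Several full GCs within ten seconds, with old-generation usage staying high, is worth a heap dump before any restart.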

Certificate Expiration

echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -enddate

openssl x509 -checkend 2592000 -noout -in cert.pem
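
`-checkend` only answers yes or no. For reporting, the remaining days can be computed from the `notAfter` string. A minimal sketch, assuming GNU `date` (for `-d`); `cert_days_left` is a hypothetical helper name, and the optional second argument exists only to make the calculation reproducible.

```shell
# cert_days_left "notAfter=<date>" [NOW_EPOCH]
# Converts the openssl end date to epoch seconds and prints the whole
# days remaining (negative once the certificate has expired).
cert_days_left() {
  local not_after now expiry
  not_after=${1#notAfter=}
  now=${2:-$(date +%s)}
  expiry=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (expiry - now) / 86400 ))
}
```

Usage: `cert_days_left "$(openssl x509 -enddate -noout -in cert.pem)"`.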

Suspicious Login Attempts

last -ai | head -30
lastb -ai | head -30
grep 'Failed password' /var/log/secure | tail -50
grep 'Accepted ' /var/log/secure | tail -50
ausearch -m USER_LOGIN -ts recent

Workflow:

Identify source IPs and usernames.
Validate whether attempts are expected from bastions/scanners.
Check successful logins from same sources.
Review sudo usage and persistence changes.
Preserve logs before cleanup or rotation.
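
Step three of the workflow (successful logins from sources that also failed) can be answered in one pass over the auth log. A minimal sketch, assuming standard sshd messages where the source address follows the word `from`; `failed_then_accepted` is a hypothetical helper name.

```shell
# failed_then_accepted
# Reads sshd log lines on stdin and prints "ip failed_count accepted_count"
# for every source address that has both failed and accepted logins.
failed_then_accepted() {
  awk '
    function src(   i) {
      for (i = 1; i < NF; i++) if ($i == "from") return $(i + 1)
      return ""
    }
    /Failed password/ { ip = src(); if (ip != "") failed[ip]++ }
    /Accepted /       { ip = src(); if (ip != "") ok[ip]++ }
    END { for (ip in ok) if (ip in failed) print ip, failed[ip], ok[ip] }
  '
}
```

Usage: `failed_then_accepted < /var/log/secure`. Any output deserves a closer look unless the source is a known bastion.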

Networking Operations

ip -br addr
ip route get 8.8.8.8
ss -ltnp
ss -tn state established '( sport = :443 or dport = :443 )'
tcpdump -ni eth0 port 53
dig +short mx example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
mtr -rwzc 10 host
traceroute -T -p 443 host
openssl s_client -connect host:443 -servername host </dev/null

Storage Operations

Block and Filesystem Discovery

lsblk -f
blkid
findmnt
cat /proc/partitions
multipath -ll

LVM

pvs
vgs
lvs -a -o +devices
pvdisplay /dev/sdX
vgdisplay <vg>
lvdisplay /dev/<vg>/<lv>

Growth example:

pvcreate /dev/mapper/mpatha          # impact: writes LVM metadata to the device
vgextend vgdata /dev/mapper/mpatha   # impact: changes VG layout
lvextend -L +100G -r /dev/vgdata/lvapp

XFS

xfs_info /mountpoint
xfs_repair -n /dev/mapper/vg-lv   # read-only check; filesystem must be unmounted
xfs_growfs /mountpoint

ext4

tune2fs -l /dev/mapper/vg-lv | head -40
e2fsck -fn /dev/mapper/vg-lv   # read-only check; unreliable on a mounted filesystem
resize2fs /dev/mapper/vg-lv

Multipath

multipath -ll
lsblk -S
udevadm info --query=all --name=/dev/mapper/mpatha | head -40

NFS

showmount -e nfs-server
nfsstat -m
mount | grep nfs
rpcinfo -p nfs-server

iSCSI

iscsiadm -m session
iscsiadm -m node
iscsiadm -m discovery -t sendtargets -p <target-ip>

Mount Troubleshooting

findmnt /mountpoint
mount -v /mountpoint
dmesg -T | tail -50
journalctl -k -n 100 --no-pager

Check:

device path stable
UUID correct
filesystem type correct
multipath settled
network and RPC available for NFS

Filesystem Validation

findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
df -hT /data
touch /data/.write-test && rm -f /data/.write-test

Migration Validation Example

findmnt /data
df -hT /data
rsync -aHAXvn /olddata/ /data/
rsync -aHAXc --delete --dry-run /olddata/ /data/
sha256sum /olddata/keyfile /data/keyfile
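
Spot-checking one key file can be extended to a full-tree comparison. A minimal sketch, assuming GNU `find`, `sort -z`, `xargs -0` and `sha256sum`; `tree_checksums` and `trees_match` are hypothetical helper names. It compares file content and relative paths only, not ownership, modes, or ACLs.

```shell
# tree_checksums DIR
# Prints a sorted sha256 listing of every regular file under DIR,
# using paths relative to DIR so two trees can be compared.
tree_checksums() {
  (cd "$1" && find . -type f -print0 | sort -z | xargs -0 -r sha256sum)
}

# trees_match DIR_A DIR_B
# Succeeds only when both trees hold the same files with the same content.
trees_match() {
  [ "$(tree_checksums "$1")" = "$(tree_checksums "$2")" ]
}
```

Usage: `trees_match /olddata /data && echo identical`. On large trees this reads every byte on both sides, so schedule it accordingly.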

AIX Operations

oslevel -s
errpt | head
errpt -a | more
topas
lsvg -o
lsvg rootvg
lslpp -L | grep -i openssl
svmon -G
svmon -P <pid>
netstat -rn

SSL/TLS Operations

OpenSSL Checks

openssl version -a
openssl x509 -in cert.pem -noout -text | less
openssl rsa -in key.pem -check -noout   # -noout: do not print the private key
openssl verify -CAfile chain.pem cert.pem

Expiration Validation

openssl x509 -enddate -noout -in cert.pem
openssl x509 -checkend 604800 -noout -in cert.pem

keytool Basics

keytool -list -v -keystore keystore.jks
keytool -list -cacerts | grep -i <alias>
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks

Chain Validation

openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem

Automation Operations

Bash Safety Patterns

set -euo pipefail
IFS=$'\n\t'
trap 'echo "line ${LINENO}: command failed" >&2' ERR
trap 'rm -f "${tmpfile:-}"' EXIT

Safe loop examples:

while IFS= read -r host; do
  ssh -n "$host" uptime   # -n: stop ssh from consuming the rest of hostlist.txt on stdin
done < hostlist.txt

find /var/log -type f -name '*.gz' -print0 \
| while IFS= read -r -d '' file; do
    gzip -t "$file"
  done

Operational scripting patterns:

default to read-only mode
require explicit --execute for changes
log actions with timestamps
validate dependencies with command -v
use temp files with mktemp
guard destructive paths and empty variables
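
Several of these patterns fit into a small script preamble. A minimal sketch with illustrative names (`EXECUTE`, `log`, `require`, `run` are not from the original): changes are only logged unless `--execute` is passed on the command line.

```shell
# Dry-run by default; --execute switches to real execution.
EXECUTE=0
for arg in "$@"; do
  if [ "$arg" = "--execute" ]; then EXECUTE=1; fi
done

# log MESSAGE...: timestamped (UTC) message on stderr.
log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2; }

# require CMD...: fail early when a dependency is missing.
require() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { log "missing dependency: $cmd"; return 1; }
  done
}

# run CMD [ARGS...]: execute only in --execute mode, otherwise just log.
run() {
  if [ "$EXECUTE" -eq 1 ]; then
    log "RUN $*"
    "$@"
  else
    log "DRY-RUN $*"
  fi
}
```

Usage: `require rsync gzip || exit 1`, then wrap every state-changing command as `run systemctl restart nginx`. Read-only checks can be called directly.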

Ansible Operations

Execution

ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --list | jq '.'
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'

Safe Rollout Workflow

Validate inventory and variable targeting.
Run syntax-check.
Run --check --diff on a single host.
Execute against one host or one tier.
Validate service health, logs, and config.
Expand rollout only after post-check passes.

Rollback mindset:

keep before/after config copies
know which tasks restart services
define manual backout if package/config changes fail
avoid broad --limit mistakes by reviewing resolved host list first
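
The rollout workflow can be chained so each stage runs only if the previous one succeeded. A minimal sketch using standard `ansible-playbook` flags; `staged_rollout` and the `ANSIBLE_PLAYBOOK` override are hypothetical, and post-checks on the canary are still a manual step before widening the limit.

```shell
# staged_rollout INVENTORY PLAYBOOK CANARY_HOST
# 1. show the resolved host list, 2. syntax-check,
# 3. check/diff on the canary, 4. apply to the canary only.
staged_rollout() {
  local inv=$1 play=$2 canary=$3
  local ap=${ANSIBLE_PLAYBOOK:-ansible-playbook}
  "$ap" -i "$inv" "$play" --limit "$canary" --list-hosts &&
  "$ap" -i "$inv" "$play" --syntax-check &&
  "$ap" -i "$inv" "$play" --limit "$canary" --check --diff &&
  "$ap" -i "$inv" "$play" --limit "$canary"
}
```

Usage: `staged_rollout inventory/hosts.yml playbooks/site.yml web01`. Any failing stage stops the chain before changes are applied.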

Monitoring & Observability

Zabbix Checks

systemctl status zabbix-agent2 --no-pager
zabbix_agent2 -t 'vfs.fs.size[/,free]'   # quoted so the shell does not glob the brackets
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log

ELK Log Workflows

grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
journalctl -u filebeat -n 100 --no-pager
curl -s 'http://localhost:9200/_cluster/health?pretty'

Grafana Checks

curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error' /var/log/grafana/grafana.log | tail -50

Health Endpoints and Alert Validation

curl -fsS http://app:8080/health
curl -fsS http://app:8080/metrics | head

False positive validation:

Compare alert timestamp with deploy/change window.
Confirm on-host evidence, not only dashboard data.
Check collector lag, scrape failures, and stale metrics.
Validate from a second source before escalating.

Operational Habits

Pre-checks

capture time, hostname, and operator
capture current config and service state
check recent alerts, maintenance windows, and dependencies
confirm backup or rollback path exists

Post-checks

validate service state
validate logs for fresh errors
validate client path, ports, and name resolution
compare metrics before/after

Rollback Thinking

define exact backout trigger before change
prefer reversible steps
keep config backups with timestamps
avoid bundling unrelated changes

Change Validation

systemctl is-active <service>
curl -fsS http://127.0.0.1:<port>/health
ss -ltnp | grep :<port>
journalctl -u <service> -S '5 min ago' --no-pager
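
Services often need a few seconds after a restart before the health endpoint answers, so a single `curl` can give a false failure. A minimal sketch of a bounded retry; `wait_healthy` is a hypothetical helper name and the defaults (10 tries, 3 seconds) are arbitrary.

```shell
# wait_healthy URL [TRIES] [DELAY_SECONDS]
# Polls URL until curl succeeds or the attempts run out.
wait_healthy() {
  local url=$1 tries=${2:-10} delay=${3:-3} i=1
  while [ "$i" -le "$tries" ]; do
    if curl -fsS -o /dev/null "$url"; then
      return 0
    fi
    sleep "$delay"
    i=$((i + 1))
  done
  echo "not healthy after $tries attempts: $url" >&2
  return 1
}
```

Usage: `wait_healthy http://127.0.0.1:<port>/health 20 2 || journalctl -u <service> -n 50 --no-pager`.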

Operational Communication

state scope, risk, and expected impact before action
record start and stop times in UTC
document what changed, what was checked, and remaining risk
escalate with evidence, not assumptions

Evidence Collection During Incidents

dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)   # fixed path; a glob in a redirect target is unreliable
mkdir -p "$dir"
journalctl -b > "$dir/journal.txt"
ss -tulpen > "$dir/sockets.txt"
df -hT > "$dir/df.txt"
free -m > "$dir/free.txt"
