Add operational cheatsheets across repository

2026-05-09 09:41:55 +00:00
parent ca5a876d03
commit 0d3905b8a1
6 changed files with 1394 additions and 0 deletions
@@ -4,6 +4,12 @@

 ### Added

+- Cross-repository operational documentation structure:
+  - `infra-run/docs/operations-cheatsheet.md`
+  - `platform-projects/docs/platform-cheatsheet.md`
+  - `labs/docs/lab-cheatsheet.md`
+- Production-oriented Linux/Unix operations reference with incident workflows, storage and networking checks, SSL/TLS notes, AIX commands, automation safety patterns, Ansible operational usage, and observability quick-reference.
+- SELinux operational coverage for mode checks, context inspection, AVC audit review, persistent relabel workflow, booleans, and SELinux-specific incident response.
 - Selected baseline Ansible hardening automation:
  - RHEL 9 role and playbook.
  - Debian 13 / Ubuntu 26.04 role and playbook.
@@ -13,6 +19,7 @@

 ### Changed

+- Updated repository and `infra-run` README files to surface the new documentation structure and operational cheatsheets.
 - Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure.

 ### Notes
@@ -17,6 +17,20 @@ It is a technical portfolio, not a production toolkit. The examples are meant to

 The `labs` and `platform-projects` trees are intentionally thin. They are kept as planning areas for future lab notes and case studies, not as completed projects. Current planned topics are tracked in [ROADMAP.md](./ROADMAP.md).

+## Documentation
+
+### Production Operations
+
+- [infra-run/docs/operations-cheatsheet.md](./infra-run/docs/operations-cheatsheet.md) - production-focused Linux/Unix operations reference for incident handling, validation, storage, networking, Ansible, observability, and safety-first change execution.
+
+### Platform Engineering
+
+- [platform-projects/docs/platform-cheatsheet.md](./platform-projects/docs/platform-cheatsheet.md) - platform operations reference for Kubernetes, Helm, containers, Terraform, CI/CD, observability, and GPU-backed infrastructure troubleshooting.
+
+### Labs & Experiments
+
+- [labs/docs/lab-cheatsheet.md](./labs/docs/lab-cheatsheet.md) - quick-reference scratchpad for K3s, Proxmox, Terraform, Docker, networking, and short-lived lab troubleshooting work.
+
 ## What This Repo Is Not

 - It is not a compliance benchmark implementation.
@@ -13,6 +13,10 @@ The goal is to show operational judgment, not to ship a universal automation pro
 - [ansible](./ansible/) - selected baseline hardening examples for RHEL-like Linux, Debian/Ubuntu, and AIX.
 - [examples](./examples/) - sanitized sample command outputs and incident notes.

+## Documentation
+
+- [docs/operations-cheatsheet.md](./docs/operations-cheatsheet.md) - production operations quick reference covering Linux/Unix triage, text processing, incident workflows, networking, storage, AIX, SSL/TLS, automation safety, Ansible execution, observability, and operational habits.
+
 ## What This Is

 - A portfolio project for Linux and infrastructure operations roles.
@@ -0,0 +1,857 @@
+# Production Operations Cheatsheet
+
+Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.
+
+## Linux / Unix Daily Operations
+
+### Uptime and Host State
+
+Check host age, kernel, clock, and recent reboot history before touching anything:
+
+```bash
+uptime
+uname -r
+hostnamectl
+timedatectl
+who -b
+last -x | head -20
+```
+
+Pre-check pattern:
+
+```bash
+date -u
+uptime
+df -h
+free -m
+systemctl --failed
+```
+
+### Process Management
+
+```bash
+ps -ef | head
+ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
+pgrep -a java
+pstree -ap | less
+pidof sshd
+renice +5 -p <pid>
+kill -TERM <pid>
+kill -9 <pid>   # DANGEROUS: last resort only
+```
+
+Validation:
+
+```bash
+ps -p <pid> -o pid,stat,etime,cmd
+journalctl -u <service> -n 50 --no-pager
+```
+
+### systemctl
+
+```bash
+systemctl status <service> --no-pager -l
+systemctl is-active <service>
+systemctl is-enabled <service>
+systemctl list-units --type=service --state=running
+systemctl list-units --failed
+systemctl daemon-reload
+systemctl restart <service>   # impact: confirm service interruption policy first
+```
+
+### journalctl
+
+```bash
+journalctl -u <service> -n 100 --no-pager
+journalctl -u <service> --since '30 min ago'
+journalctl -p err -S today
+journalctl -k -b
+journalctl --disk-usage
+```
+
+### Service Troubleshooting Flow
+
+1. Confirm service state and recent restart count.
+2. Read the last 100-200 journal lines.
+3. Validate config syntax before restart if the daemon supports it.
+4. Check dependent ports, mounts, credentials, and name resolution.
+5. Restart only after cause is understood or rollback exists.
+
+Example:
+
+```bash
+systemctl status nginx --no-pager -l
+journalctl -u nginx -n 100 --no-pager
+nginx -t
+ss -ltnp | grep ':80\|:443'
+curl -kI https://127.0.0.1/
+```
+
+### CPU and Memory Diagnostics
+
+```bash
+uptime
+top -H -b -n 1 | head -40
+pidstat 1 5
+pidstat -ru -p ALL 1 3
+vmstat 1 5
+iostat -xz 1 5
+free -m
+sar -q 1 5
+```
+
+Quick interpretation:
+
+- high `%wa`: storage path or NFS issue
+- high run queue with low CPU idle: CPU contention
+- swap growth plus page scans: memory pressure
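
The rules of thumb above can be sketched as a small filter over `vmstat` output. This is an illustration, not a tuned detector: the function name `vmstat_flags` and the thresholds (20% I/O wait, under 10% idle) are assumptions chosen for the example. Note that the first `vmstat` sample reports averages since boot.

```bash
# Count vmstat samples that match each rule of thumb.
# $1 = CPU count, $2 = I/O wait threshold in percent (both assumptions, see above).
vmstat_flags() {
  awk -v ncpu="${1:-1}" -v wa_max="${2:-20}" '
    # Header row: remember the position of every column by name.
    $1 == "r" && $2 == "b" { for (i = 1; i <= NF; i++) col[$i] = i; next }
    # Skip banner lines and anything seen before the header.
    !("wa" in col) || $1 !~ /^[0-9]+$/ { next }
    {
      n++
      if ($(col["wa"]) > wa_max) io++
      if ($(col["r"]) > ncpu && $(col["id"]) < 10) cpu++
      if ($(col["si"]) > 0 || $(col["so"]) > 0) swap++
    }
    END {
      if (n == 0) { print "no samples"; exit 1 }
      printf "samples=%d iowait=%d cpu_contention=%d swapping=%d\n", n, io, cpu, swap
    }'
}

# Usage: vmstat 1 5 | vmstat_flags "$(nproc)"
```

Columns are looked up by header name rather than fixed position, since the `vmstat` column set differs slightly between procps versions.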
+
+### Disk Usage
+
+```bash
+df -hT
+du -xhd1 /var | sort -h
+find /var/log -type f -size +500M -ls | sort -k7,7n
+lsof +L1
+```
+
+### Inode Exhaustion
+
+```bash
+df -ih
+find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
+find /tmp -xdev -type f | wc -l
+```
+
+### Mounts
+
+```bash
+mount | column -t
+findmnt
+findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
+cat /etc/fstab
+mount -a   # can expose bad fstab entries; use in change window
+```
+
+### Permissions
+
+```bash
+namei -l /path/to/file
+stat /path/to/file
+getfacl /path/to/file
+chmod 640 /path/to/file
+chown root:app /path/to/file
+```
+
+### SELinux
+
+State and mode:
+
+```bash
+getenforce
+sestatus
+cat /etc/selinux/config
+```
+
+Check file, process, and port context:
+
+```bash
+ls -Zd /var/www/html
+ls -lZ /var/www/html/index.html
+ps -eZ | grep nginx
+id -Z
+semanage port -l | grep http
+```
+
+Audit and denial review:
+
+```bash
+ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
+ausearch -m AVC -ts today | audit2why
+journalctl -t setroubleshoot --since '1 hour ago'
+sealert -a /var/log/audit/audit.log
+```
+
+Typical flow:
+
+1. Confirm whether SELinux is `Enforcing`, `Permissive`, or `Disabled`.
+2. Identify the failing path, process domain, and target context.
+3. Read AVC denials before changing labels or booleans.
+4. Prefer persistent policy-aligned fixes over `chcon`.
+5. Restore default labels and retest service path.
+
+Modify and restore context:
+
+```bash
+chcon -t httpd_sys_content_t /srv/app/index.html              # temporary until relabel/restore
+chcon -R -t httpd_sys_rw_content_t /srv/app/uploads           # temporary until relabel/restore
+semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
+semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
+restorecon -Rv /srv/app
+matchpathcon /srv/app/uploads/file.txt
+```
+
+Booleans and validation:
+
+```bash
+getsebool -a | grep httpd
+getsebool httpd_can_network_connect
+setsebool -P httpd_can_network_connect on
+runcon -t httpd_t -- id -Z
+```
+
+Notes:
+
+- prefer `semanage fcontext` plus `restorecon` for persistent fixes
+- use `chcon` only as a short-lived diagnostic or emergency workaround
+- avoid generating local policy modules from `audit2allow` until root cause is understood
+- after context changes, validate service startup, AVC silence, and application path access
+
+### Archives
+
+```bash
+tar tf backup.tar | head
+tar czf logs-$(date +%F).tgz /var/log/app
+tar xzf bundle.tgz -C /restore/path
+gzip -t file.gz
+```
+
+### File Operations
+
+```bash
+cp -a source/ target/
+rsync -aHAXvn /src/ /dst/
+rsync -aHAX --delete --info=progress2 /src/ /dst/   # impact: verify source/destination twice
+mv file file.$(date +%F-%H%M%S).bak
+sha256sum file
+```
+
+## Text Processing & Regex
+
+### Core Tools
+
+```bash
+grep -n 'ERROR' app.log
+grep -E 'ERROR|WARN' app.log
+grep -P '^\d{4}-\d{2}-\d{2}T' app.log
+awk '{print $1,$4,$5}' access.log
+awk -F, 'NR==1 || $3 ~ /failed/' report.csv
+sed -n '1,20p' file
+sed -E 's/[[:space:]]+/ /g' file
+cut -d: -f1,7 /etc/passwd
+sort file | uniq -c | sort -nr
+xargs -r -n1 systemctl status < service-list.txt
+jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
+```
+
+### Regex Reference
+
+```text
+IPv4                  \b(?:\d{1,3}\.){3}\d{1,3}\b
+ISO timestamp         \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
+UUID                  \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
+Log level             \b(?:ERROR|WARN|INFO)\b
+Failed SSH            Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
+Ansible changed/fail  ^(changed|fatal|failed):\s+\[[^]]+\]
+```
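
The IPv4 pattern above also matches out-of-range strings such as `999.1.1.1`, because `\d{1,3}` only counts digits. One hedged way to tighten extraction is to post-filter each octet numerically. In this sketch `extract_ipv4` is a hypothetical helper name, and `grep -o` and `-w` are assumed to be available (GNU or BSD grep).

```bash
# Print IPv4-looking tokens from stdin, one per line, keeping only 0-255 octets.
extract_ipv4() {
  grep -woE '([0-9]{1,3}\.){3}[0-9]{1,3}' \
    | awk -F. '$1 <= 255 && $2 <= 255 && $3 <= 255 && $4 <= 255'
}

# Usage: extract_ipv4 < access.log | sort | uniq -c | sort -nr | head
```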
+
+### Log Parsing Examples
+
+IP extraction:
+
+```bash
+grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
+```
+
+Timestamp filter:
+
+```bash
+grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
+```
+
+UUID extraction:
+
+```bash
+grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
+```
+
+ERROR/WARN/INFO parsing:
+
+```bash
+grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
+```
+
+Failed SSH login parsing:
+
+```bash
+grep 'Failed password' /var/log/secure \
+| awk '{print $(NF-5),$(NF-3)}' \
+| sort | uniq -c | sort -nr | head
+```
+
+Extract fields from logs:
+
+```bash
+awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
+```
+
+Filter Ansible output:
+
+```bash
+grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
+grep -E '^fatal:|^failed:' ansible.log
+```
+
+## Incident Response
+
+### Disk Full
+
+Workflow:
+
+```bash
+df -hT
+df -ih
+findmnt
+du -xhd1 /var | sort -h
+find /var -xdev -type f -size +1G -ls | sort -k7,7n
+lsof +L1
+journalctl --disk-usage
+```
+
+Typical branches:
+
+- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
+- inode full: remove file storms, spool buildup, temp-file leaks
+- deleted open files: restart offender only after sizing impact
+
+Post-check:
+
+```bash
+df -hT
+df -ih
+systemctl --failed
+```
+
+### High CPU
+
+```bash
+uptime
+mpstat -P ALL 1 5
+pidstat -u -p ALL 1 5
+top -H -b -n 1 | head -40
+ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
+```
+
+Flow:
+
+1. Confirm sustained load, not a short spike.
+2. Separate user CPU vs system CPU vs I/O wait.
+3. Identify hot process and hot threads.
+4. Correlate with deploys, cron, backups, or JVM GC.
+5. Throttle, stop, or fail over only with service impact understood.
+
+### Memory Pressure
+
+```bash
+free -m
+vmstat 1 5
+sar -r 1 5
+ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
+dmesg -T | grep -Ei 'oom|killed process'
+```
+
+Flow:
+
+1. Check swap growth and page scan rates.
+2. Identify top RSS owners.
+3. Check kernel logs for OOM.
+4. Validate cache vs real process growth.
+5. Restart leaking service only after capturing evidence.
+
+### Failed Service
+
+```bash
+systemctl status <service> --no-pager -l
+journalctl -u <service> -b --no-pager | tail -100
+systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
+```
+
+Flow:
+
+1. Validate config.
+2. Validate credentials, ports, mounts, permissions.
+3. Confirm dependency availability.
+4. Restart and recheck logs immediately.
+
+### SELinux Denials
+
+Typical case: service works in `Permissive`, fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct.
+
+Triage:
+
+```bash
+getenforce
+sestatus
+ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
+ausearch -m AVC -ts recent | audit2why
+journalctl -t setroubleshoot --since '30 min ago'
+systemctl status <service> --no-pager -l
+ps -eZ | grep <service>
+ls -lZ /path/to/app /path/to/app/*
+```
+
+Flow:
+
+1. Confirm the failure is current and reproducible.
+2. Identify the denied process domain, target path, and requested access from AVC logs.
+3. Validate expected default context with `matchpathcon`.
+4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
+5. Apply the smallest persistent fix, then retest in `Enforcing`.
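
Step 2 of the flow can be sketched as a filter that groups raw AVC records by process, source domain, target type, object class, and permission. This is an illustration only: `avc_summary` is a hypothetical helper, it assumes the usual `key=value` audit record layout, keeps only the first permission inside the braces, and does not handle `comm` values that contain spaces.

```bash
# Summarize AVC denials read from stdin, most frequent first.
avc_summary() {
  awk '
    /avc:/ && /denied/ {
      perm = comm = src = tgt = cls = "?"
      for (i = 1; i <= NF; i++) {
        if ($i == "{") perm = $(i + 1)
        else if ($i ~ /^comm=/) { comm = $i; sub(/^comm=/, "", comm); gsub(/"/, "", comm) }
        else if ($i ~ /^scontext=/) { split($i, s, ":"); src = s[3] }   # third field is the type
        else if ($i ~ /^tcontext=/) { split($i, t, ":"); tgt = t[3] }
        else if ($i ~ /^tclass=/) { cls = $i; sub(/^tclass=/, "", cls) }
      }
      count[comm " " src " -> " tgt " " cls " { " perm " }"]++
    }
    END { for (k in count) print count[k], k }' | sort -rn
}

# Usage: ausearch -m AVC -ts recent | avc_summary
```

Seeing the same domain and target type repeated many times usually suggests a labeling problem on one path rather than many unrelated denials.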
+
+Common fixes:
+
+```bash
+matchpathcon /srv/app/config.yml
+restorecon -Rv /srv/app
+semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
+semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
+semanage port -l | grep http
+getsebool -a | grep httpd
+setsebool -P httpd_can_network_connect on
+```
+
+Validation:
+
+```bash
+getenforce
+systemctl restart <service>
+systemctl status <service> --no-pager -l
+ausearch -m AVC -ts recent
+curl -fsS http://127.0.0.1:<port>/health
+```
+
+Operational notes:
+
+- do not leave systems in `Permissive` as the fix
+- prefer `restorecon` and `semanage fcontext` over repeated `chcon`
+- treat `audit2allow` output as investigation material, not automatic remediation
+- if policy changes are unavoidable, document exact AVC evidence and rollback path
+
+### SSL Issues
+
+```bash
+openssl s_client -connect host:443 -servername host -showcerts </dev/null
+openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
+curl -vkI https://host/
+```
+
+Check for:
+
+- expired certificate
+- missing SAN
+- incomplete chain
+- hostname mismatch
+- TLS version or cipher mismatch
+
+### DNS Issues
+
+```bash
+dig +short app.example.com
+dig @<resolver> app.example.com
+dig +trace app.example.com
+getent hosts app.example.com
+resolvectl status
+```
+
+Flow:
+
+1. Compare resolver result with authoritative result.
+2. Check TTL and stale cache.
+3. Validate `/etc/resolv.conf`, local resolver, and search domains.
+4. Test from affected host and unaffected host.
+
+### Network Issues
+
+```bash
+ip addr
+ip route
+ss -tulpen
+tcpdump -ni any host <peer> and port <port>
+curl -sv http://host:port/health
+mtr -rwzc 20 host
+```
+
+Flow:
+
+1. Interface/link state.
+2. Route and source IP selection.
+3. Listening socket on target.
+4. Firewall and security controls.
+5. Packet capture if app logs are inconclusive.
+
+### JVM / Tomcat Issues
+
+```bash
+ps -ef | grep -i tomcat
+jcmd <pid> VM.flags
+jstat -gcutil <pid> 1000 10
+jstack <pid> | head -100
+ss -ltnp | grep java
+tail -100 /opt/tomcat/logs/catalina.out
+```
+
+Focus:
+
+- stuck threads
+- full GC loops
+- heap exhaustion
+- connector bind failures
+- slow backend dependency
+
+### Certificate Expiration
+
+```bash
+echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
+| openssl x509 -noout -enddate
+
+openssl x509 -checkend 2592000 -noout -in cert.pem
+```
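
`-checkend` answers a yes/no question for a fixed window (2592000 seconds is 30 days). When the remaining lifetime itself is needed, for example for alert thresholds, the `notAfter` line can be converted to days. A minimal sketch, assuming GNU `date` (for `-d`) and a hypothetical helper `cert_days_left`:

```bash
# Convert an `openssl x509 -enddate` line ("notAfter=...") to whole days remaining.
# Optional second argument overrides "now" as epoch seconds (useful for testing).
cert_days_left() {
  not_after=${1#notAfter=}
  now=${2:-$(date -u +%s)}
  end=$(date -u -d "$not_after" +%s) || return 1
  echo $(( (end - now) / 86400 ))
}

# Usage:
#   cert_days_left "$(openssl x509 -enddate -noout -in cert.pem)"
```

A negative result means the certificate has already expired.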
+
+### Suspicious Login Attempts
+
+```bash
+last -ai | head -30
+lastb -ai | head -30
+grep 'Failed password' /var/log/secure | tail -50
+grep 'Accepted ' /var/log/secure | tail -50
+ausearch -m USER_LOGIN -ts recent
+```
+
+Workflow:
+
+1. Identify source IPs and usernames.
+2. Validate whether attempts are expected from bastions/scanners.
+3. Check successful logins from same sources.
+4. Review sudo usage and persistence changes.
+5. Preserve logs before cleanup or rotation.
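
Steps 1 and 3 start from the same data: which source addresses tried which usernames. The following hedged sketch summarizes `sshd` failure lines from stdin. `failed_ssh_summary` is a hypothetical helper, it handles IPv4 sources only, and it assumes the standard OpenSSH `Failed password for ... from ... port ...` message.

```bash
# Count failed SSH password attempts per (source IP, username).
failed_ssh_summary() {
  sed -nE 's/.*Failed password for (invalid user )?([^ ]+) from ([0-9.]+) port.*/\3 \2/p' \
    | sort | uniq -c | sort -rn \
    | awk '{ print $1, $2, $3 }'   # normalize uniq -c padding
}

# Usage: failed_ssh_summary < /var/log/secure | head
```

The top source addresses can then be checked against `Accepted` lines to see whether any of them also logged in successfully.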
+
+## Networking Operations
+
+```bash
+ip -br addr
+ip route get 8.8.8.8
+ss -ltnp
+ss -tn state established '( sport = :443 or dport = :443 )'
+tcpdump -ni eth0 port 53
+dig +short mx example.com
+curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
+mtr -rwzc 10 host
+traceroute -T -p 443 host
+openssl s_client -connect host:443 -servername host </dev/null
+```
+
+## Storage Operations
+
+### Block and Filesystem Discovery
+
+```bash
+lsblk -f
+blkid
+findmnt
+cat /proc/partitions
+multipath -ll
+```
+
+### LVM
+
+```bash
+pvs
+vgs
+lvs -a -o +devices
+pvdisplay /dev/sdX
+vgdisplay <vg>
+lvdisplay /dev/<vg>/<lv>
+```
+
+Growth example:
+
+```bash
+pvcreate /dev/mapper/mpatha          # impact: writes LVM metadata to the device
+vgextend vgdata /dev/mapper/mpatha   # impact: changes VG layout
+lvextend -L +100G -r /dev/vgdata/lvapp
+```
+
+### XFS
+
+```bash
+xfs_info /mountpoint
+xfs_repair -n /dev/mapper/vg-lv
+xfs_growfs /mountpoint
+```
+
+### ext4
+
+```bash
+tune2fs -l /dev/mapper/vg-lv | head -40
+e2fsck -fn /dev/mapper/vg-lv
+resize2fs /dev/mapper/vg-lv
+```
+
+### Multipath
+
+```bash
+multipath -ll
+lsblk -S
+udevadm info --query=all --name=/dev/mapper/mpatha | head -40
+```
+
+### NFS
+
+```bash
+showmount -e nfs-server
+nfsstat -m
+mount | grep nfs
+rpcinfo -p nfs-server
+```
+
+### iSCSI
+
+```bash
+iscsiadm -m session
+iscsiadm -m node
+iscsiadm -m discovery -t sendtargets -p <target-ip>
+```
+
+### Mount Troubleshooting
+
+```bash
+findmnt /mountpoint
+mount -v /mountpoint
+dmesg -T | tail -50
+journalctl -k -n 100 --no-pager
+```
+
+Check:
+
+- device path stable
+- UUID correct
+- filesystem type correct
+- multipath settled
+- network and RPC available for NFS
+
+### Filesystem Validation
+
+```bash
+findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
+df -hT /data
+touch /data/.write-test && rm -f /data/.write-test
+```
+
+### Migration Validation Example
+
+```bash
+findmnt /data
+df -hT /data
+rsync -aHAXvn /olddata/ /data/
+rsync -aHAXc --delete --dry-run /olddata/ /data/
+sha256sum /olddata/keyfile /data/keyfile
+```
+
+## AIX Operations
+
+```bash
+oslevel -s
+errpt | head
+errpt -a | more
+topas
+lsvg -o
+lsvg rootvg
+lslpp -L | grep -i openssl
+svmon -G
+svmon -P <pid>
+netstat -rn
+```
+
+## SSL/TLS Operations
+
+### OpenSSL Checks
+
+```bash
+openssl version -a
+openssl x509 -in cert.pem -noout -text | less
+openssl rsa -in key.pem -check
+openssl verify -CAfile chain.pem cert.pem
+```
+
+### Expiration Validation
+
+```bash
+openssl x509 -enddate -noout -in cert.pem
+openssl x509 -checkend 604800 -noout -in cert.pem
+```
+
+### keytool Basics
+
+```bash
+keytool -list -v -keystore keystore.jks
+keytool -list -cacerts | grep -i <alias>
+keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
+```
+
+### Chain Validation
+
+```bash
+openssl s_client -connect host:443 -servername host -showcerts </dev/null
+openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem
+```
+
+## Automation Operations
+
+### Bash Safety Patterns
+
+```bash
+set -euo pipefail
+IFS=$'\n\t'
+trap 'echo "line ${LINENO}: command failed" >&2' ERR
+trap 'rm -f "${tmpfile:-}"' EXIT
+```
+
+Safe loop examples:
+
+```bash
+while IFS= read -r host; do
+  ssh "$host" uptime
+done < hostlist.txt
+
+find /var/log -type f -name '*.gz' -print0 \
+| while IFS= read -r -d '' file; do
+    gzip -t "$file"
+  done
+```
+
+Operational scripting patterns:
+
+- default to read-only mode
+- require explicit `--execute` for changes
+- log actions with timestamps
+- validate dependencies with `command -v`
+- use temp files with `mktemp`
+- guard destructive paths and empty variables
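
The patterns above can be combined into a small skeleton. This is a sketch under stated assumptions rather than a reference implementation: the helper names (`log`, `run`, `require`, `safe_rm_dir`) and the `EXECUTE` variable are hypothetical, and the path guard only rejects empty, relative, and root paths.

```bash
EXECUTE=0                      # read-only unless --execute is passed
for arg in "$@"; do
  case "$arg" in --execute) EXECUTE=1 ;; esac
done

# Timestamped log line on stderr (UTC).
log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2; }

# Run a command only in execute mode; otherwise just record what would run.
run() {
  if [ "$EXECUTE" -eq 1 ]; then
    log "exec: $*"
    "$@"
  else
    log "dry-run: $*"
  fi
}

# Fail early when a required tool is missing.
require() {
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { log "missing dependency: $cmd"; return 1; }
  done
}

# Guard a destructive path: refuse empty, relative, or root arguments.
safe_rm_dir() {
  case "${1:-}" in
    '' | / | [!/]*) log "refusing to remove '${1:-}'"; return 1 ;;
  esac
  run rm -rf -- "$1"
}
```

Routing every state-changing command through `run` means a plain invocation only logs what it would do, and the same script performs the change when called with `--execute`.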
+
+## Ansible Operations
+
+### Execution
+
+```bash
+ansible-inventory -i inventory/hosts.yml --graph
+ansible-inventory -i inventory/hosts.yml --list | jq '.'
+ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
+ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
+ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
+ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
+ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
+```
+
+### Safe Rollout Workflow
+
+1. Validate inventory and variable targeting.
+2. Run syntax-check.
+3. Run `--check --diff` on a single host.
+4. Execute against one host or one tier.
+5. Validate service health, logs, and config.
+6. Expand rollout only after post-check passes.
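
The workflow above can be written down as an ordered command plan before anything runs. This minimal sketch only prints the commands for review. `rollout_plan` and its three arguments (inventory, playbook, canary host) are hypothetical, and `--list-hosts` is used to show which hosts the `--limit` pattern resolves to.

```bash
# Print the staged rollout commands for one canary host; nothing is executed.
rollout_plan() {
  inv=$1 play=$2 canary=$3
  printf '%s\n' \
    "ansible-inventory -i $inv --graph" \
    "ansible-playbook -i $inv $play --syntax-check" \
    "ansible-playbook -i $inv $play --limit $canary --list-hosts" \
    "ansible-playbook -i $inv $play --limit $canary --check --diff" \
    "ansible-playbook -i $inv $play --limit $canary"
}

# Usage: rollout_plan inventory/hosts.yml playbooks/site.yml web01
```

Pasting the printed plan into the change record gives reviewers the exact scope before the first real run.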
+
+Rollback mindset:
+
+- keep before/after config copies
+- know which tasks restart services
+- define manual backout if package/config changes fail
+- avoid broad `--limit` mistakes by reviewing resolved host list first
+
+## Monitoring & Observability
+
+### Zabbix Checks
+
+```bash
+systemctl status zabbix-agent2 --no-pager
+zabbix_agent2 -t 'vfs.fs.size[/,free]'
+grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
+```
+
+### ELK Log Workflows
+
+```bash
+grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
+journalctl -u filebeat -n 100 --no-pager
+curl -s 'http://localhost:9200/_cluster/health?pretty'
+```
+
+### Grafana Checks
+
+```bash
+curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
+grep -i 'error' /var/log/grafana/grafana.log | tail -50
+```
+
+### Health Endpoints and Alert Validation
+
+```bash
+curl -fsS http://app:8080/health
+curl -fsS http://app:8080/metrics | head
+```
+
+False positive validation:
+
+1. Compare alert timestamp with deploy/change window.
+2. Confirm on-host evidence, not only dashboard data.
+3. Check collector lag, scrape failures, and stale metrics.
+4. Validate from a second source before escalating.
+
+## Operational Habits
+
+### Pre-checks
+
+- capture time, hostname, and operator
+- capture current config and service state
+- check recent alerts, maintenance windows, and dependencies
+- confirm backup or rollback path exists
+
+### Post-checks
+
+- validate service state
+- validate logs for fresh errors
+- validate client path, ports, and name resolution
+- compare metrics before/after
+
+### Rollback Thinking
+
+- define exact backout trigger before change
+- prefer reversible steps
+- keep config backups with timestamps
+- avoid bundling unrelated changes
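
Timestamped backups are easy to skip under pressure, so a tiny helper can make them the default step before an edit. A minimal sketch, assuming a hypothetical function name `backup_file` and a UTC timestamp suffix. It copies with `cp -p` to preserve mode and timestamps, then prints the backup path.

```bash
# Copy a file to <file>.<UTC timestamp>.bak and print the backup path.
backup_file() {
  src=$1
  [ -f "$src" ] || { echo "not a regular file: $src" >&2; return 1; }
  dst="$src.$(date -u +%Y%m%dT%H%M%SZ).bak"
  cp -p -- "$src" "$dst" && printf '%s\n' "$dst"
}

# Usage: backup_file /etc/nginx/nginx.conf
```

The timestamp has one-second resolution, so two backups of the same file within the same second would overwrite each other.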
+
+### Change Validation
+
+```bash
+systemctl is-active <service>
+curl -fsS http://127.0.0.1:<port>/health
+ss -ltnp | grep :<port>
+journalctl -u <service> -S '5 min ago' --no-pager
+```
+
+### Operational Communication
+
+- state scope, risk, and expected impact before action
+- record start and stop times in UTC
+- document what changed, what was checked, and remaining risk
+- escalate with evidence, not assumptions
+
+### Evidence Collection During Incidents
+
+```bash
+incident_dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)
+mkdir -p "$incident_dir"
+journalctl -b > "$incident_dir/journal.txt"
+ss -tulpen > "$incident_dir/sockets.txt"
+df -hT > "$incident_dir/df.txt"
+free -m > "$incident_dir/free.txt"
+```
@@ -0,0 +1,144 @@
+# Lab Cheatsheet
+
+Quick-reference notes for experiments, rebuilds, and short-lived troubleshooting. Expect rough edges. Capture what worked, what broke, and what should not be repeated in production.
+
+## K3s Lab
+
+```bash
+sudo systemctl status k3s --no-pager
+sudo journalctl -u k3s -n 100 --no-pager
+kubectl get nodes -o wide
+kubectl get pods -A
+kubectl get events -A --sort-by=.lastTimestamp | tail -30
+sudo k3s kubectl get pods -A
+```
+
+Quick reset:
+
+```bash
+sudo /usr/local/bin/k3s-uninstall.sh   # destructive lab reset
+```
+
+## Proxmox Lab
+
+```bash
+pvesh get /nodes
+pvesh get /cluster/resources
+qm list
+qm config <vmid>
+pct list
+ha-manager status
+```
+
+Checks before changes:
+
+```bash
+zpool status
+pvesm status
+ip -br addr
+```
+
+## GPU Passthrough
+
+```bash
+lspci -nn | grep -Ei 'vga|3d|nvidia'
+nvidia-smi
+dmesg -T | grep -Ei 'vfio|iommu|nvidia'
+find /sys/kernel/iommu_groups/ -type l | sort
+```
+
+Good sanity check:
+
+```bash
+lsmod | grep -E 'vfio|kvm'
+```
+
+## Terraform Experiments
+
+```bash
+terraform fmt -recursive
+terraform init
+terraform validate
+terraform plan
+terraform state list
+```
+
+Scratch workflow:
+
+```bash
+terraform plan -out=tfplan
+terraform show tfplan
+```
+
+## Networking Labs
+
+```bash
+ip -br addr
+ip route
+bridge link
+ss -ltnp
+tcpdump -ni any port 53
+dig +short example.com
+mtr -rwzc 10 1.1.1.1
+```
+
+## Ansible Testing
+
+```bash
+ansible-inventory -i inventory/hosts.yml --graph
+ansible-playbook -i inventory/hosts.yml playbook.yml --syntax-check
+ansible-playbook -i inventory/hosts.yml playbook.yml --check --diff
+ansible all -i inventory/hosts.yml -m ping
+```
+
+## Docker Testing
+
+```bash
+docker ps -a
+docker logs --tail 100 <container>
+docker exec -it <container> sh
+docker inspect <container> | jq '.[0].NetworkSettings'
+docker system df
+```
+
+## Useful Temporary Commands
+
+```bash
+watch -n2 'kubectl get pods -A'
+watch -n2 'nvidia-smi'
+watch -n2 'ip -br addr'
+while true; do date -u; curl -fsS http://127.0.0.1:8080/health; sleep 2; done
+```
+
+## Quick PoC Commands
+
+```bash
+python3 -m http.server 8080
+openssl req -x509 -newkey rsa:2048 -nodes -days 3 -keyout key.pem -out cert.pem
+curl -vk https://127.0.0.1:8443/
+nc -lvkp 9000
+```
+
+## Troubleshooting Notes
+
+- If K3s pods fail after host reboot, check time sync before chasing cert or API errors.
+- If PVCs stay pending in lab clusters, inspect the default storage class first.
+- If Docker networking looks broken, compare bridge subnet overlaps with the host route table.
+- If GPU pods see no devices, validate driver, toolkit, and device plugin in that order.
+
+## Useful One-liners
+
+```bash
+kubectl get pods -A -o wide | grep -E 'CrashLoopBackOff|Error|Pending'
+journalctl -p err -S today
+find /var/log -type f -mtime -1 -ls | sort -k7,7n
+ps -eo pid,%cpu,%mem,cmd --sort=-%cpu | head
+grep -RniE 'error|failed|timeout' .
+```
+
+## Things Worth Remembering
+
+- Pre-checks still matter in labs. Capture state before trying the risky thing.
+- Keep a copy of working configs before rapid iteration.
+- Short-lived labs still produce useful evidence; save command output when a fix works.
+- If a PoC needs repeated manual repair, turn the repair steps into a script or note.
@@ -0,0 +1,368 @@
+# Platform Engineering Cheatsheet
+
+Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.
+
+## Kubernetes / K3s
+
+### Contexts, Namespaces, and Basic Workflows
+
+```bash
+kubectl config get-contexts
+kubectl config use-context <context>
+kubectl get ns
+kubectl -n <ns> get pods -o wide
+kubectl -n <ns> get deploy,sts,ds,svc,ingress
+kubectl get nodes -o wide
+```
+
+### Describe, Logs, Exec, Events
+
+```bash
+kubectl -n <ns> describe pod <pod>
+kubectl -n <ns> logs <pod> --tail=100
+kubectl -n <ns> logs <pod> -c <container> --previous
+kubectl -n <ns> exec -it <pod> -- sh
+kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
+```
+
+### Rollout Troubleshooting
+
+```bash
+kubectl -n <ns> rollout status deploy/<name>
+kubectl -n <ns> rollout history deploy/<name>
+kubectl -n <ns> rollout undo deploy/<name>
+kubectl -n <ns> get rs -l app=<name>
+```
+
+Safe pattern:
+
+1. `kubectl diff -f <manifest>`
+2. apply to non-prod or canary namespace
+3. watch rollout and events
+4. validate service and logs
+5. expand scope only after post-check
+
+### Node Validation
+
+```bash
+kubectl get nodes
+kubectl describe node <node>
+kubectl top nodes
+kubectl top pods -A --sort-by=cpu
+kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
+```
+
+### Pending / CrashLoopBackOff Flow
+
+Pending:
+
+```bash
+kubectl -n <ns> describe pod <pod>
+kubectl get events -A --sort-by=.lastTimestamp | tail -50
+```
+
+Check for:
+
+- unsatisfied CPU/memory requests
+- missing PVC
+- taints/tolerations mismatch
+- image pull secret issues
+- node selectors or affinity mismatch
+
+CrashLoopBackOff:
+
+```bash
+kubectl -n <ns> logs <pod> --previous
+kubectl -n <ns> describe pod <pod>
+kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
+```
+
+Check for:
+
+- bad config or missing env vars
+- probe failures
+- dependency timeouts
+- permission or filesystem errors
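
Before drilling into a single pod, it can help to list every pod that is either not running or running but not ready. A hedged sketch that filters `kubectl get pods -A` output from stdin. `unhealthy_pods` is a hypothetical helper, and it assumes the default column order (NAMESPACE, NAME, READY, STATUS).

```bash
# Print namespace/name, status, and readiness for pods that need attention.
unhealthy_pods() {
  awk 'NR > 1 {
    split($3, ready, "/")
    not_done  = ($4 != "Running" && $4 != "Completed")
    not_ready = ($4 == "Running" && ready[1] != ready[2])
    if (not_done || not_ready) print $1 "/" $2, $4, $3
  }'
}

# Usage: kubectl get pods -A | unhealthy_pods
```

Pods reported as `Running` with an incomplete READY count usually point at probe failures rather than scheduling problems.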
+
+## Helm
+
+```bash
+helm repo list
+helm repo update
+helm list -A
+helm -n <ns> get values <release> -a
+helm -n <ns> get manifest <release>
+helm upgrade --install <release> <chart> -n <ns> -f values.yaml
+helm rollback -n <ns> <release> <revision>
+helm template <release> <chart> -f values.yaml | less
+```
+
+Validation:
+
+```bash
+helm lint <chart>
+kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20
+```
+
+## Docker / Podman
+
+```bash
+docker images
+docker ps -a
+docker logs --tail 100 <container>
+docker exec -it <container> sh
+docker inspect <container>
+docker volume ls
+docker network ls
+docker system df
+docker image prune -f         # cleanup: review first
+docker container prune -f     # cleanup: review first
+podman ps -a
+podman inspect <container>
+```
+
+Container validation:
+
+```bash
+docker exec <container> env | sort
+docker exec <container> ss -ltnp
+docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
+```
+
+## Terraform
+
+### Core Commands
+
+```bash
+terraform fmt -check -recursive
+terraform init
+terraform validate
+terraform plan -out=tfplan
+terraform apply tfplan
+terraform destroy -target=<resource>   # impact: targeted destruction needs review
+terraform state list
+terraform state show <resource>
+terraform import <resource> <id>
+```
+
+### Safe Workflow
+
+1. `terraform fmt -check -recursive`
+2. `terraform validate`
+3. refresh provider auth and backend access
+4. review `plan` output for replacements and destroys
+5. save plan artifact
+6. apply reviewed plan only
+7. validate resource state outside Terraform
+
+Plan review focus:
+
+- unexpected replacement
+- drift on security groups, routes, storage, or instance identity
+- provider alias mistakes
+- wrong workspace or backend
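
The first review item can be partly automated by pulling replacement and destroy actions out of the plan text. A minimal sketch, assuming a hypothetical helper `risky_changes` and the `# <address> must be replaced` / `# <address> will be destroyed` comment lines that Terraform prints in its human-readable plan:

```bash
# List resources that a plan would replace or destroy.
# Input: text from `terraform plan -no-color` or `terraform show -no-color tfplan`.
risky_changes() {
  sed -nE 's/^[[:space:]]*# (.*) (must be replaced|will be destroyed)$/\2: \1/p'
}

# Usage: terraform show -no-color tfplan | risky_changes
```

Empty output only means no such lines were found. It does not replace reading the plan, since in-place updates can be risky too.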
+
+## CI/CD Operations
+
+### GitLab CI
+
+```bash
+gitlab-runner verify
+grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
+curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects/<id>/pipelines
+```
+
+### Jenkins
+
+```bash
+systemctl status jenkins --no-pager
+journalctl -u jenkins -n 100 --no-pager
+java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
+```
+
+### Runners, Artifacts, Pipeline Failures
+
+```bash
+docker logs --tail 100 gitlab-runner
+kubectl -n ci get pods
+kubectl -n ci logs deploy/runner-controller --tail=100
+```
+
+Troubleshooting flow:
+
+1. validate YAML or Jenkinsfile syntax
+2. confirm runner/agent availability
+3. inspect job logs for auth, cache, DNS, or registry failures
+4. verify artifacts were uploaded and not expired
+5. correlate with platform outages, image changes, or secret rotation
+
+YAML validation:
+
+```bash
+yamllint .
+python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
+```
+
+## Observability
+
+### Prometheus
+
+```bash
+curl -s http://prometheus:9090/-/ready
+curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
+curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
+```
+
+### Loki
+
+```bash
+curl -s http://loki:3100/ready
+curl -Gs http://loki:3100/loki/api/v1/query --data-urlencode 'query={app="nginx"} |= "error"'
+```
+
+### Grafana
+
+```bash
+curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
+grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
+```
+
+### Metrics Validation and Log Correlation
+
+```bash
+kubectl -n <ns> port-forward svc/<svc> 9090:9090
+curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
+```
+
+Correlation flow:
+
+1. confirm alert time and impacted objects
+2. inspect deployment events in same window
+3. compare Prometheus series, Loki logs, and app logs
+4. rule out scrape lag or stale dashboards
+
+## GPU / AI Infrastructure
+
+### GPU Discovery and CUDA Validation
+
+```bash
+nvidia-smi
+nvidia-smi -L
+nvidia-smi topo -m
+nvidia-smi dmon -s pucm
+nvcc --version
+python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
+```
+
+### MIG Basics
+
+```bash
+nvidia-smi -i 0 -q | grep -i mig -A4
+nvidia-smi mig -lgip
+nvidia-smi mig -lgi
+```
+
+### GPU Operator and DCGM
+
+```bash
+kubectl get pods -A | grep -E 'nvidia|gpu'
+kubectl -n gpu-operator describe pod <pod>
+kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
+kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
+```
+
+### Container GPU Validation
+
+```bash
+docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
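+# GPU limit is set via --overrides; newer kubectl run has no --limits flag
+kubectl run gpu-check --rm -it --restart=Never \
+  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
+  --overrides='{"spec":{"containers":[{"name":"gpu-check","image":"nvidia/cuda:12.3.2-base-ubuntu22.04","command":["nvidia-smi"],"stdin":true,"tty":true,"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'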
+```
+
+### Kubernetes GPU Troubleshooting
+
+Check for:
+
+- device plugin not running
+- driver/container toolkit mismatch
+- node missing `nvidia.com/gpu` allocatable resources
+- MIG profile mismatch
+- taints or tolerations blocking placement
+
+Useful checks:
+
+```bash
+kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
+kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
+kubectl -n <ns> describe pod <gpu-pod>
+```
+
+## Platform Troubleshooting Flows
+
+### Pod Not Starting
+
+```bash
+kubectl -n <ns> get pod <pod> -o wide
+kubectl -n <ns> describe pod <pod>
+kubectl -n <ns> logs <pod> --previous
+kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
+```
+
+### Image Pull Errors
+
+```bash
+kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
+crictl images | grep <image>
+ctr -n k8s.io images ls | grep <image>
+```
+
+Check:
+
+- image tag exists
+- registry reachable
+- pull secret valid
+- node clock sane for token-based auth
+
+### Failing Deployment
+
+```bash
+kubectl -n <ns> rollout status deploy/<name>
+kubectl -n <ns> describe deploy/<name>
+kubectl -n <ns> get rs,pods -l app=<name> -o wide
+```
+
+### Node Not Ready
+
+```bash
+kubectl describe node <node>
+journalctl -u k3s -n 100 --no-pager
+systemctl status kubelet --no-pager
+df -h
+free -m
+```
+
+Check:
+
+- kubelet or k3s service state
+- disk pressure
+- cert expiry
+- CNI failure
+- API reachability
+
+### Storage Provisioning Issues
+
+```bash
+kubectl get pvc,pv -A
+kubectl -n <ns> describe pvc <pvc>
+kubectl get sc
+kubectl -n kube-system logs deploy/<csi-controller> --tail=100
+```
+
+Check:
+
+- storage class defaulting
+- access mode mismatch
+- CSI controller errors
+- backend quota or LUN exhaustion
+- node attachment failures