diff --git a/CHANGELOG.md b/CHANGELOG.md index ad8cebf..5bb9d5c 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -4,6 +4,12 @@ ### Added +- Cross-repository operational documentation structure: + - `infra-run/docs/operations-cheatsheet.md` + - `platform-projects/docs/platform-cheatsheet.md` + - `labs/docs/lab-cheatsheet.md` +- Production-oriented Linux/Unix operations reference with incident workflows, storage and networking checks, SSL/TLS notes, AIX commands, automation safety patterns, Ansible operational usage, and observability quick-reference. +- SELinux operational coverage for mode checks, context inspection, AVC audit review, persistent relabel workflow, booleans, and SELinux-specific incident response. - Selected baseline Ansible hardening automation: - RHEL 9 role and playbook. - Debian 13 / Ubuntu 26.04 role and playbook. @@ -13,6 +19,7 @@ ### Changed +- Updated repository and `infra-run` README files to surface the new documentation structure and operational cheatsheets. - Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure. ### Notes diff --git a/README.md b/README.md index 6cb0853..6c7c737 100644 --- a/README.md +++ b/README.md @@ -17,6 +17,20 @@ It is a technical portfolio, not a production toolkit. The examples are meant to The `labs` and `platform-projects` trees are intentionally thin. They are kept as planning areas for future lab notes and case studies, not as completed projects. Current planned topics are tracked in [ROADMAP.md](./ROADMAP.md). +## Documentation + +### Production Operations + +- [infra-run/docs/operations-cheatsheet.md](./infra-run/docs/operations-cheatsheet.md) - production-focused Linux/Unix operations reference for incident handling, validation, storage, networking, Ansible, observability, and safety-first change execution. 
+ +### Platform Engineering + +- [platform-projects/docs/platform-cheatsheet.md](./platform-projects/docs/platform-cheatsheet.md) - platform operations reference for Kubernetes, Helm, containers, Terraform, CI/CD, observability, and GPU-backed infrastructure troubleshooting. + +### Labs & Experiments + +- [labs/docs/lab-cheatsheet.md](./labs/docs/lab-cheatsheet.md) - quick-reference scratchpad for K3s, Proxmox, Terraform, Docker, networking, and short-lived lab troubleshooting work. + ## What This Repo Is Not - It is not a compliance benchmark implementation. diff --git a/infra-run/README.md b/infra-run/README.md index aa45101..403e4a6 100644 --- a/infra-run/README.md +++ b/infra-run/README.md @@ -13,6 +13,10 @@ The goal is to show operational judgment, not to ship a universal automation pro - [ansible](./ansible/) - selected baseline hardening examples for RHEL-like Linux, Debian/Ubuntu, and AIX. - [examples](./examples/) - sanitized sample command outputs and incident notes. +## Documentation + +- [docs/operations-cheatsheet.md](./docs/operations-cheatsheet.md) - production operations quick reference covering Linux/Unix triage, text processing, incident workflows, networking, storage, AIX, SSL/TLS, automation safety, Ansible execution, observability, and operational habits. + ## What This Is - A portfolio project for Linux and infrastructure operations roles. diff --git a/infra-run/docs/operations-cheatsheet.md b/infra-run/docs/operations-cheatsheet.md new file mode 100644 index 0000000..9f8555b --- /dev/null +++ b/infra-run/docs/operations-cheatsheet.md @@ -0,0 +1,857 @@ +# Production Operations Cheatsheet + +Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change. 
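The "record pre-change state" habit described above could be sketched as a small capture helper. This is a minimal sketch under stated assumptions, not part of the repository: the function name `capture_state`, the `prechange-` directory naming, and the chosen command list are all hypothetical. Every command it runs is read-only, and commands missing on the host are skipped rather than treated as errors.

```bash
# Hypothetical helper (not from the repository): record read-only host
# state into a timestamped directory and print the directory path.
capture_state() {
  local base="${1:-/tmp}"
  local dir entry name cmd
  dir="$base/prechange-$(date -u +%Y%m%dT%H%M%SZ)"
  mkdir -p "$dir"

  # "name:command" pairs; the name becomes the output file name.
  for entry in \
    "date:date -u" \
    "uptime:uptime" \
    "df:df -h" \
    "free:free -m" \
    "failed-units:systemctl --failed --no-pager"; do
    name="${entry%%:*}"
    cmd="${entry#*:}"
    if command -v "${cmd%% *}" >/dev/null 2>&1; then
      # Keep whatever the command printed, even if it exited non-zero.
      $cmd >"$dir/$name.txt" 2>&1 || true
    else
      echo "skipped: ${cmd%% *} not found" >"$dir/$name.txt"
    fi
  done
  echo "$dir"
}
```

A possible use is `before=$(capture_state /var/tmp)` ahead of a change and a second call afterwards, followed by `diff -r` on the two directories.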
+ +## Linux / Unix Daily Operations + +### Uptime and Host State + +Check host age, kernel, clock, and recent reboot history before touching anything: + +```bash +uptime +uname -r +hostnamectl +timedatectl +who -b +last -x | head -20 +``` + +Pre-check pattern: + +```bash +date -u +uptime +df -h +free -m +systemctl --failed +``` + +### Process Management + +```bash +ps -ef | head +ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20 +pgrep -a java +pstree -ap | less +pidof sshd +renice +5 -p +kill -TERM +kill -9 # DANGEROUS: last resort only +``` + +Validation: + +```bash +ps -p -o pid,stat,etime,cmd +journalctl -u -n 50 --no-pager +``` + +### systemctl + +```bash +systemctl status --no-pager -l +systemctl is-active +systemctl is-enabled +systemctl list-units --type=service --state=running +systemctl list-units --failed +systemctl daemon-reload +systemctl restart # impact: confirm service interruption policy first +``` + +### journalctl + +```bash +journalctl -u -n 100 --no-pager +journalctl -u --since '30 min ago' +journalctl -p err -S today +journalctl -k -b +journalctl --disk-usage +``` + +### Service Troubleshooting Flow + +1. Confirm service state and recent restart count. +2. Read the last 100-200 journal lines. +3. Validate config syntax before restart if the daemon supports it. +4. Check dependent ports, mounts, credentials, and name resolution. +5. Restart only after cause is understood or rollback exists. 
+ +Example: + +```bash +systemctl status nginx --no-pager -l +journalctl -u nginx -n 100 --no-pager +nginx -t +ss -ltnp | grep ':80\|:443' +curl -kI https://127.0.0.1/ +``` + +### CPU and Memory Diagnostics + +```bash +uptime +top -H -b -n 1 | head -40 +pidstat 1 5 +pidstat -ru -p ALL 1 3 +vmstat 1 5 +iostat -xz 1 5 +free -m +sar -q 1 5 +``` + +Quick interpretation: + +- high `%wa`: storage path or NFS issue +- high run queue with low CPU idle: CPU contention +- swap growth plus page scans: memory pressure + +### Disk Usage + +```bash +df -hT +du -xhd1 /var | sort -h +find /var/log -type f -size +500M -ls | sort -k7,7n +lsof +L1 +``` + +### Inode Exhaustion + +```bash +df -ih +find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n +find /tmp -xdev -type f | wc -l +``` + +### Mounts + +```bash +mount | column -t +findmnt +findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data +cat /etc/fstab +mount -a # can expose bad fstab entries; use in change window +``` + +### Permissions + +```bash +namei -l /path/to/file +stat /path/to/file +getfacl /path/to/file +chmod 640 /path/to/file +chown root:app /path/to/file +``` + +### SELinux + +State and mode: + +```bash +getenforce +sestatus +cat /etc/selinux/config +``` + +Check file, process, and port context: + +```bash +ls -Zd /var/www/html +ls -lZ /var/www/html/index.html +ps -eZ | grep nginx +id -Z +semanage port -l | grep http +``` + +Audit and denial review: + +```bash +ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent +ausearch -m AVC -ts today | audit2why +journalctl -t setroubleshoot --since '1 hour ago' +sealert -a /var/log/audit/audit.log +``` + +Typical flow: + +1. Confirm SELinux mode is `Enforcing` or `Permissive`. +2. Identify the failing path, process domain, and target context. +3. Read AVC denials before changing labels or booleans. +4. Prefer persistent policy-aligned fixes over `chcon`. +5. Restore default labels and retest service path. 
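Step 2 of the flow above (identify the process, its domain, and the target context) can be sped up by condensing raw AVC records. The following is a minimal sketch, assuming the common kernel AVC record layout with `comm=`, `scontext=`, `tcontext=`, and `tclass=` fields in that order; the function name `avc_summary` is hypothetical, and records without a `comm=` field (for example some `USER_AVC` entries) are silently skipped.

```bash
# Hypothetical helper: read audit records on stdin and print one line per
# AVC denial in the form "<comm> <source context> <target context> <class>".
avc_summary() {
  sed -n 's/.*avc: *denied.* comm="\([^"]*\)".* scontext=\([^ ]*\) tcontext=\([^ ]*\) tclass=\([^ ]*\).*/\1 \2 \3 \4/p'
}
```

One way to use it is `ausearch -m AVC -ts recent | avc_summary | sort | uniq -c | sort -nr`, which groups repeated denials so the dominant domain and target label stand out before any label or boolean is changed.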
+ +Modify and restore context: + +```bash +chcon -t httpd_sys_content_t /srv/app/index.html # temporary until relabel/restore +chcon -R -t httpd_sys_rw_content_t /srv/app/uploads # temporary until relabel/restore +semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?' +semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?' +restorecon -Rv /srv/app +matchpathcon /srv/app/uploads/file.txt +``` + +Booleans and validation: + +```bash +getsebool -a | grep httpd +getsebool httpd_can_network_connect +setsebool -P httpd_can_network_connect on +runcon -t httpd_t -- id -Z +``` + +Notes: + +- prefer `semanage fcontext` plus `restorecon` for persistent fixes +- use `chcon` only as a short-lived diagnostic or emergency workaround +- avoid generating local policy modules from `audit2allow` until root cause is understood +- after context changes, validate service startup, AVC silence, and application path access + +### Archives + +```bash +tar tf backup.tar | head +tar czf logs-$(date +%F).tgz /var/log/app +tar xzf bundle.tgz -C /restore/path +gzip -t file.gz +``` + +### File Operations + +```bash +cp -a source/ target/ +rsync -aHAXvn /src/ /dst/ +rsync -aHAX --delete --info=progress2 /src/ /dst/ # impact: verify source/destination twice +mv file file.$(date +%F-%H%M%S).bak +sha256sum file +``` + +## Text Processing & Regex + +### Core Tools + +```bash +grep -n 'ERROR' app.log +grep -E 'ERROR|WARN' app.log +grep -P '^\d{4}-\d{2}-\d{2}T' app.log +awk '{print $1,$4,$5}' access.log +awk -F, 'NR==1 || $3 ~ /failed/' report.csv +sed -n '1,20p' file +sed -E 's/[[:space:]]+/ /g' file +cut -d: -f1,7 /etc/passwd +sort file | uniq -c | sort -nr +xargs -r -n1 systemctl status < service-list.txt +jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json +``` + +### Regex Reference + +```text +IPv4 \b(?:\d{1,3}\.){3}\d{1,3}\b +ISO timestamp \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b +UUID 
\b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b +Log level \b(?:ERROR|WARN|INFO)\b +Failed SSH Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3}) +Ansible changed/fail ^(changed|fatal|failed):\s+\[[^]]+\] +``` + +### Log Parsing Examples + +IP extraction: + +```bash +grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head +``` + +Timestamp filter: + +```bash +grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log +``` + +UUID extraction: + +```bash +grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u +``` + +ERROR/WARN/INFO parsing: + +```bash +grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c +``` + +Failed SSH login parsing (user and source IP): + +```bash +grep 'Failed password' /var/log/secure \ +| awk '{print $(NF-5),$(NF-3)}' \ +| sort | uniq -c | sort -nr | head +``` + +Extract fields from logs: + +```bash +awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log +``` + +Filter Ansible output: + +```bash +grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log +grep -E '^fatal:|^failed:' ansible.log +``` + +## Incident Response + +### Disk Full + +Workflow: + +```bash +df -hT +df -ih +findmnt +du -xhd1 /var | sort -h +find /var -xdev -type f -size +1G -ls | sort -k7,7n +lsof +L1 +journalctl --disk-usage +``` + +Typical branches: + +- filesystem full: identify growth path, compress/rotate/archive, validate app behavior +- inode full: remove file storms, spool buildup, temp-file leaks +- deleted open files: restart offender only after sizing impact + +Post-check: + +```bash +df -hT +df -ih +systemctl --failed +``` + +### High CPU + +```bash +uptime +mpstat -P ALL 1 5 +pidstat -u -p ALL 1 5 +top -H -b -n 1 | head -40 +ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20 +``` + +Flow: + +1. Confirm sustained load, not a short spike. +2. Separate user CPU vs system CPU vs I/O wait. +3. Identify hot process and hot threads. +4. 
Correlate with deploys, cron, backups, or JVM GC. +5. Throttle, stop, or fail over only with service impact understood. + +### Memory Pressure + +```bash +free -m +vmstat 1 5 +sar -r 1 5 +ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20 +dmesg -T | grep -Ei 'oom|killed process' +``` + +Flow: + +1. Check swap growth and page scan rates. +2. Identify top RSS owners. +3. Check kernel logs for OOM. +4. Validate cache vs real process growth. +5. Restart leaking service only after capturing evidence. + +### Failed Service + +```bash +systemctl status --no-pager -l +journalctl -u -b --no-pager | tail -100 +systemctl show -p ExecStart -p FragmentPath -p ActiveEnterTimestamp +``` + +Flow: + +1. Validate config. +2. Validate credentials, ports, mounts, permissions. +3. Confirm dependency availability. +4. Restart and recheck logs immediately. + +### SELinux Denials + +Typical case: a service works in `Permissive` but fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct. + +Triage: + +```bash +getenforce +sestatus +ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent +ausearch -m AVC -ts recent | audit2why +journalctl -t setroubleshoot --since '30 min ago' +systemctl status --no-pager -l +ps -eZ | grep +ls -lZ /path/to/app /path/to/app/* +``` + +Flow: + +1. Confirm the failure is current and reproducible. +2. Identify the denied process domain, target path, and requested access from AVC logs. +3. Validate expected default context with `matchpathcon`. +4. Check for mislabeled files, wrong port types, or missing SELinux booleans. +5. Apply the smallest persistent fix, then retest in `Enforcing`. + +Common fixes: + +```bash +matchpathcon /srv/app/config.yml +restorecon -Rv /srv/app +semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?' +semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?' 
+semanage port -l | grep http +getsebool -a | grep httpd +setsebool -P httpd_can_network_connect on +``` + +Validation: + +```bash +getenforce +systemctl restart +systemctl status --no-pager -l +ausearch -m AVC -ts recent +curl -fsS http://127.0.0.1:/health +``` + +Operational notes: + +- do not leave systems in `Permissive` as the fix +- prefer `restorecon` and `semanage fcontext` over repeated `chcon` +- treat `audit2allow` output as investigation material, not automatic remediation +- if policy changes are unavoidable, document exact AVC evidence and rollback path + +### SSL Issues + +```bash +openssl s_client -connect host:443 -servername host -showcerts app.example.com +dig +trace app.example.com +getent hosts app.example.com +resolvectl status +``` + +Flow: + +1. Compare resolver result with authoritative result. +2. Check TTL and stale cache. +3. Validate `/etc/resolv.conf`, local resolver, and search domains. +4. Test from affected host and unaffected host. + +### Network Issues + +```bash +ip addr +ip route +ss -tulpen +tcpdump -ni any host and port +curl -sv http://host:port/health +mtr -rwzc 20 host +``` + +Flow: + +1. Interface/link state. +2. Route and source IP selection. +3. Listening socket on target. +4. Firewall and security controls. +5. Packet capture if app logs are inconclusive. 
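Step 3 of the flow above (listening socket on target) lends itself to a scripted check. The following is a minimal sketch, assuming the usual `ss -ltn` column layout in which the fourth field is the local `address:port`; the function name `is_listening` is hypothetical. It reads `ss` output on stdin so it can be run against live output or a saved capture.

```bash
# Hypothetical helper: exit 0 if `ss -ltn` output on stdin shows a TCP
# listener on the given port, exit 1 otherwise.
is_listening() {
  local port="$1"
  awk -v port="$port" '
    $1 == "LISTEN" {
      # The port is whatever follows the last colon (works for [::]:443 too).
      n = split($4, parts, ":")
      if (parts[n] == port) found = 1
    }
    END { exit !found }
  '
}
```

For example, `ss -ltn | is_listening 443 || echo "nothing listening on 443"` keeps the check read-only and usable inside a pre-check or post-check script.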
+ +### JVM / Tomcat Issues + +```bash +ps -ef | grep -i tomcat +jcmd VM.flags +jstat -gcutil 1000 10 +jstack | head -100 +ss -ltnp | grep java +tail -100 /opt/tomcat/logs/catalina.out +``` + +Focus: + +- stuck threads +- full GC loops +- heap exhaustion +- connector bind failures +- slow backend dependency + +### Certificate Expiration + +```bash +echo | openssl s_client -connect host:443 -servername host 2>/dev/null \ +| openssl x509 -noout -enddate + +openssl x509 -checkend 2592000 -noout -in cert.pem +``` + +### Suspicious Login Attempts + +```bash +last -ai | head -30 +lastb -ai | head -30 +grep 'Failed password' /var/log/secure | tail -50 +grep 'Accepted ' /var/log/secure | tail -50 +ausearch -m USER_LOGIN -ts recent +``` + +Workflow: + +1. Identify source IPs and usernames. +2. Validate whether attempts are expected from bastions/scanners. +3. Check successful logins from same sources. +4. Review sudo usage and persistence changes. +5. Preserve logs before cleanup or rotation. + +## Networking Operations + +```bash +ip -br addr +ip route get 8.8.8.8 +ss -ltnp +ss -tn state established '( sport = :443 or dport = :443 )' +tcpdump -ni eth0 port 53 +dig +short mx example.com +curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health +mtr -rwzc 10 host +traceroute -T -p 443 host +openssl s_client -connect host:443 -servername host +lvdisplay /dev// +``` + +Growth example: + +```bash +pvcreate /dev/mapper/mpatha # impact: write metadata +vgextend vgdata /dev/mapper/mpatha # impact: changes VG layout +lvextend -L +100G -r /dev/vgdata/lvapp +``` + +### XFS + +```bash +xfs_info /mountpoint +xfs_repair -n /dev/mapper/vg-lv +xfs_growfs /mountpoint +``` + +### ext4 + +```bash +tune2fs -l /dev/mapper/vg-lv | head -40 +e2fsck -fn /dev/mapper/vg-lv +resize2fs /dev/mapper/vg-lv +``` + +### Multipath + +```bash +multipath -ll +lsblk -S +udevadm info --query=all --name=/dev/mapper/mpatha | head -40 +``` + +### NFS + +```bash +showmount -e nfs-server +nfsstat 
-m +mount | grep nfs +rpcinfo -p nfs-server +``` + +### iSCSI + +```bash +iscsiadm -m session +iscsiadm -m node +iscsiadm -m discovery -t sendtargets -p +``` + +### Mount Troubleshooting + +```bash +findmnt /mountpoint +mount -v /mountpoint +dmesg -T | tail -50 +journalctl -k -n 100 --no-pager +``` + +Check: + +- device path stable +- UUID correct +- filesystem type correct +- multipath settled +- network and RPC available for NFS + +### Filesystem Validation + +```bash +findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data +df -hT /data +touch /data/.write-test && rm -f /data/.write-test +``` + +### Migration Validation Example + +```bash +findmnt /data +df -hT /data +rsync -aHAXvn /olddata/ /data/ +rsync -aHAXc --delete --dry-run /olddata/ /data/ +sha256sum /olddata/keyfile /data/keyfile +``` + +## AIX Operations + +```bash +oslevel -s +errpt | head +errpt -a | more +topas +lsvg -o +lsvg rootvg +lslpp -L | grep -i openssl +svmon -G +svmon -P +netstat -rn +``` + +## SSL/TLS Operations + +### OpenSSL Checks + +```bash +openssl version -a +openssl x509 -in cert.pem -noout -text | less +openssl rsa -in key.pem -check +openssl verify -CAfile chain.pem cert.pem +``` + +### Expiration Validation + +```bash +openssl x509 -enddate -noout -in cert.pem +openssl x509 -checkend 604800 -noout -in cert.pem +``` + +### keytool Basics + +```bash +keytool -list -v -keystore keystore.jks +keytool -list -cacerts | grep -i +keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks +``` + +### Chain Validation + +```bash +openssl s_client -connect host:443 -servername host -showcerts &2' ERR +trap 'rm -f "${tmpfile:-}"' EXIT +``` + +Safe loop examples: + +```bash +while IFS= read -r host; do + ssh "$host" uptime +done < hostlist.txt + +find /var/log -type f -name '*.log' -print0 \ +| while IFS= read -r -d '' file; do + gzip -t "$file" + done +``` + +Operational scripting patterns: + +- default to read-only mode +- require explicit `--execute` for changes +- log actions with 
timestamps +- validate dependencies with `command -v` +- use temp files with `mktemp` +- guard destructive paths and empty variables + +## Ansible Operations + +### Execution + +```bash +ansible-inventory -i inventory/hosts.yml --graph +ansible-inventory -i inventory/hosts.yml --list | jq '.' +ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check +ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff +ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01 +ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages +ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx' +``` + +### Safe Rollout Workflow + +1. Validate inventory and variable targeting. +2. Run syntax-check. +3. Run `--check --diff` on a single host. +4. Execute against one host or one tier. +5. Validate service health, logs, and config. +6. Expand rollout only after post-check passes. + +Rollback mindset: + +- keep before/after config copies +- know which tasks restart services +- define manual backout if package/config changes fail +- avoid broad `--limit` mistakes by reviewing resolved host list first + +## Monitoring & Observability + +### Zabbix Checks + +```bash +systemctl status zabbix-agent2 --no-pager +zabbix_agent2 -t vfs.fs.size[/,free] +grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log +``` + +### ELK Log Workflows + +```bash +grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50 +journalctl -u filebeat -n 100 --no-pager +curl -s http://localhost:9200/_cluster/health?pretty +``` + +### Grafana Checks + +```bash +curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login +grep -i 'error' /var/log/grafana/grafana.log | tail -50 +``` + +### Health Endpoints and Alert Validation + +```bash +curl -fsS http://app:8080/health +curl -fsS http://app:8080/metrics | head +``` + +False positive validation: + +1. Compare alert timestamp with deploy/change window. +2. 
Confirm on-host evidence, not only dashboard data. +3. Check collector lag, scrape failures, and stale metrics. +4. Validate from a second source before escalating. + +## Operational Habits + +### Pre-checks + +- capture time, hostname, and operator +- capture current config and service state +- check recent alerts, maintenance windows, and dependencies +- confirm backup or rollback path exists + +### Post-checks + +- validate service state +- validate logs for fresh errors +- validate client path, ports, and name resolution +- compare metrics before/after + +### Rollback Thinking + +- define exact backout trigger before change +- prefer reversible steps +- keep config backups with timestamps +- avoid bundling unrelated changes + +### Change Validation + +```bash +systemctl is-active +curl -fsS http://127.0.0.1:/health +ss -ltnp | grep : +journalctl -u -S '5 min ago' --no-pager +``` + +### Operational Communication + +- state scope, risk, and expected impact before action +- record start and stop times in UTC +- document what changed, what was checked, and remaining risk +- escalate with evidence, not assumptions + +### Evidence Collection During Incidents + +```bash +dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ); mkdir -p "$dir" # one directory per capture; avoids ambiguous glob redirects +journalctl -b > "$dir/journal.txt" +ss -tulpen > "$dir/sockets.txt" +df -hT > "$dir/df.txt" +free -m > "$dir/free.txt" +``` diff --git a/labs/docs/lab-cheatsheet.md b/labs/docs/lab-cheatsheet.md new file mode 100644 index 0000000..8bd9f44 --- /dev/null +++ b/labs/docs/lab-cheatsheet.md @@ -0,0 +1,144 @@ +# Lab Cheatsheet + +Quick-reference notes for experiments, rebuilds, and short-lived troubleshooting. Expect rough edges. Capture what worked, what broke, and what should not be repeated in production. 
+ +## K3s Lab + +```bash +sudo systemctl status k3s --no-pager +sudo journalctl -u k3s -n 100 --no-pager +kubectl get nodes -o wide +kubectl get pods -A +kubectl get events -A --sort-by=.lastTimestamp | tail -30 +sudo k3s kubectl get pods -A +``` + +Quick reset: + +```bash +sudo /usr/local/bin/k3s-uninstall.sh # destructive lab reset +``` + +## Proxmox Lab + +```bash +pvesh get /nodes +pvesh get /cluster/resources +qm list +qm config +pct list +ha-manager status +``` + +Checks before changes: + +```bash +zpool status +pvesm status +ip -br addr +``` + +## GPU Passthrough + +```bash +lspci -nn | grep -Ei 'vga|3d|nvidia' +nvidia-smi +dmesg -T | grep -Ei 'vfio|iommu|nvidia' +find /sys/kernel/iommu_groups/ -type l | sort +``` + +Good sanity check: + +```bash +lsmod | grep -E 'vfio|kvm' +``` + +## Terraform Experiments + +```bash +terraform fmt -recursive +terraform init +terraform validate +terraform plan +terraform state list +``` + +Scratch workflow: + +```bash +terraform plan -out=tfplan +terraform show tfplan +``` + +## Networking Labs + +```bash +ip -br addr +ip route +bridge link +ss -ltnp +tcpdump -ni any port 53 +dig +short example.com +mtr -rwzc 10 1.1.1.1 +``` + +## Ansible Testing + +```bash +ansible-inventory -i inventory/hosts.yml --graph +ansible-playbook -i inventory/hosts.yml playbook.yml --syntax-check +ansible-playbook -i inventory/hosts.yml playbook.yml --check --diff +ansible all -i inventory/hosts.yml -m ping +``` + +## Docker Testing + +```bash +docker ps -a +docker logs --tail 100 +docker exec -it sh +docker inspect | jq '.[0].NetworkSettings' +docker system df +``` + +## Useful Temporary Commands + +```bash +watch -n2 'kubectl get pods -A' +watch -n2 'nvidia-smi' +watch -n2 'ip -br addr' +while true; do date -u; curl -fsS http://127.0.0.1:8080/health; sleep 2; done +``` + +## Quick PoC Commands + +```bash +python3 -m http.server 8080 +openssl req -x509 -newkey rsa:2048 -nodes -days 3 -keyout key.pem -out cert.pem +curl -vk https://127.0.0.1:8443/ 
+nc -lvkp 9000 +``` + +## Troubleshooting Notes + +- If K3s pods fail after host reboot, check time sync before chasing cert or API errors. +- If PVCs stay pending in lab clusters, inspect the default storage class first. +- If Docker networking looks broken, compare bridge subnet overlaps with the host route table. +- If GPU pods see no devices, validate driver, toolkit, and device plugin in that order. + +## Useful One-liners + +```bash +kubectl get pods -A -o wide | grep -E 'CrashLoopBackOff|Error|Pending' +journalctl -p err -S today +find /var/log -type f -mtime -1 -ls | sort -k7,7n +ps -eo pid,%cpu,%mem,cmd --sort=-%cpu | head +grep -RniE 'error|failed|timeout' . +``` + +## Things Worth Remembering + +- Pre-checks still matter in labs. Capture state before trying the risky thing. +- Keep a copy of working configs before rapid iteration. +- Short-lived labs still produce useful evidence; save command output when a fix works. +- If a PoC needs repeated manual repair, turn the repair steps into a script or note. diff --git a/platform-projects/docs/platform-cheatsheet.md b/platform-projects/docs/platform-cheatsheet.md new file mode 100644 index 0000000..6a023c3 --- /dev/null +++ b/platform-projects/docs/platform-cheatsheet.md @@ -0,0 +1,368 @@ +# Platform Engineering Cheatsheet + +Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts. 
+ +## Kubernetes / K3s + +### Contexts, Namespaces, and Basic Workflows + +```bash +kubectl config get-contexts +kubectl config use-context +kubectl get ns +kubectl -n get pods -o wide +kubectl -n get deploy,sts,ds,svc,ingress +kubectl get nodes -o wide +``` + +### Describe, Logs, Exec, Events + +```bash +kubectl -n describe pod +kubectl -n logs --tail=100 +kubectl -n logs -c --previous +kubectl -n exec -it -- sh +kubectl -n get events --sort-by=.lastTimestamp | tail -30 +``` + +### Rollout Troubleshooting + +```bash +kubectl -n rollout status deploy/ +kubectl -n rollout history deploy/ +kubectl -n rollout undo deploy/ +kubectl -n get rs -l app= +``` + +Safe pattern: + +1. `kubectl diff -f ` +2. apply to non-prod or canary namespace +3. watch rollout and events +4. validate service and logs +5. expand scope only after post-check + +### Node Validation + +```bash +kubectl get nodes +kubectl describe node +kubectl top nodes +kubectl top pods -A --sort-by=cpu +kubectl get pods -A -o wide --field-selector spec.nodeName= +``` + +### Pending / CrashLoopBackOff Flow + +Pending: + +```bash +kubectl -n describe pod +kubectl get events -A --sort-by=.lastTimestamp | tail -50 +``` + +Check for: + +- unsatisfied CPU/memory requests +- missing PVC +- taints/tolerations mismatch +- image pull secret issues +- node selectors or affinity mismatch + +CrashLoopBackOff: + +```bash +kubectl -n logs --previous +kubectl -n describe pod +kubectl -n get pod -o jsonpath='{.status.containerStatuses[*].lastState}' +``` + +Check for: + +- bad config or missing env vars +- probe failures +- dependency timeouts +- permission or filesystem errors + +## Helm + +```bash +helm repo list +helm repo update +helm list -A +helm -n get values -a +helm -n get manifest +helm upgrade --install -n -f values.yaml +helm rollback -n +helm template -f values.yaml | less +``` + +Validation: + +```bash +helm lint +kubectl -n get events --sort-by=.lastTimestamp | tail -20 +``` + +## Docker / Podman + +```bash 
+docker images +docker ps -a +docker logs --tail 100 +docker exec -it sh +docker inspect +docker volume ls +docker network ls +docker system df +docker image prune -f # cleanup: review first +docker container prune -f # cleanup: review first +podman ps -a +podman inspect +``` + +Container validation: + +```bash +docker exec env | sort +docker exec ss -ltnp +docker inspect -f '{{.State.Status}} {{.RestartCount}}' +``` + +## Terraform + +### Core Commands + +```bash +terraform fmt -check -recursive +terraform init +terraform validate +terraform plan -out=tfplan +terraform apply tfplan +terraform destroy -target= # impact: targeted destruction needs review +terraform state list +terraform state show +terraform import +``` + +### Safe Workflow + +1. `terraform fmt -check -recursive` +2. `terraform validate` +3. refresh provider auth and backend access +4. review `plan` output for replacements and destroys +5. save plan artifact +6. apply reviewed plan only +7. validate resource state outside Terraform + +Plan review focus: + +- unexpected replacement +- drift on security groups, routes, storage, or instance identity +- provider alias mistakes +- wrong workspace or backend + +## CI/CD Operations + +### GitLab CI + +```bash +gitlab-runner verify +grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml +curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects//pipelines +``` + +### Jenkins + +```bash +systemctl status jenkins --no-pager +journalctl -u jenkins -n 100 --no-pager +java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs +``` + +### Runners, Artifacts, Pipeline Failures + +```bash +docker logs --tail 100 gitlab-runner +kubectl -n ci get pods +kubectl -n ci logs deploy/runner-controller --tail=100 +``` + +Troubleshooting flow: + +1. validate YAML or Jenkinsfile syntax +2. confirm runner/agent availability +3. inspect job logs for auth, cache, DNS, or registry failures +4. verify artifacts were uploaded and not expired +5. 
correlate with platform outages, image changes, or secret rotation + +YAML validation: + +```bash +yamllint . +python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml +``` + +## Observability + +### Prometheus + +```bash +curl -s http://prometheus:9090/-/ready +curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}' +curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}' +``` + +### Loki + +```bash +curl -s http://loki:3100/ready +curl -Gs http://loki:3100/loki/api/v1/query --data-urlencode 'query={app="nginx"} |= "error"' +``` + +### Grafana + +```bash +curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login +grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50 +``` + +### Metrics Validation and Log Correlation + +```bash +kubectl -n port-forward svc/ 9090:9090 +curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_' +``` + +Correlation flow: + +1. confirm alert time and impacted objects +2. inspect deployment events in same window +3. compare Prometheus series, Loki logs, and app logs +4. 
rule out scrape lag or stale dashboards + +## GPU / AI Infrastructure + +### GPU Discovery and CUDA Validation + +```bash +nvidia-smi +nvidia-smi -L +nvidia-smi topo -m +nvidia-smi dmon -s pucm +nvcc --version +python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())' +``` + +### MIG Basics + +```bash +nvidia-smi -i 0 -q | grep -i mig -A4 +nvidia-smi mig -lgip +nvidia-smi mig -lgi +``` + +### GPU Operator and DCGM + +```bash +kubectl get pods -A | grep -E 'nvidia|gpu' +kubectl -n gpu-operator describe pod +kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100 +kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100 +``` + +### Container GPU Validation + +```bash +docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi +kubectl run gpu-check --rm -it --restart=Never \ + --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \ + --limits='nvidia.com/gpu=1' -- nvidia-smi +``` + +### Kubernetes GPU Troubleshooting + +Check for: + +- device plugin not running +- driver/container toolkit mismatch +- node missing `nvidia.com/gpu` allocatable resources +- MIG profile mismatch +- taints or tolerations blocking placement + +Useful checks: + +```bash +kubectl describe node | grep -A5 -B2 -i nvidia +kubectl get node -o jsonpath='{.status.allocatable}' +kubectl -n describe pod +``` + +## Platform Troubleshooting Flows + +### Pod Not Starting + +```bash +kubectl -n get pod -o wide +kubectl -n describe pod +kubectl -n logs --previous +kubectl -n get events --sort-by=.lastTimestamp | tail -30 +``` + +### Image Pull Errors + +```bash +kubectl -n describe pod | grep -A5 -i 'image' +crictl images | grep +ctr -n k8s.io images ls | grep +``` + +Check: + +- image tag exists +- registry reachable +- pull secret valid +- node clock sane for token-based auth + +### Failing Deployment + +```bash +kubectl -n rollout status deploy/ +kubectl -n describe deploy/ +kubectl -n get rs,pods -l app= -o wide +``` + +### Node Not Ready + 
+```bash +kubectl describe node +journalctl -u k3s -n 100 --no-pager +systemctl status kubelet --no-pager +df -h +free -m +``` + +Check: + +- kubelet or k3s service state +- disk pressure +- cert expiry +- CNI failure +- API reachability + +### Storage Provisioning Issues + +```bash +kubectl get pvc,pv -A +kubectl -n describe pvc +kubectl get sc +kubectl -n kube-system logs deploy/ --tail=100 +``` + +Check: + +- storage class defaulting +- access mode mismatch +- CSI controller errors +- backend quota or LUN exhaustion +- node attachment failures
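
The storage checks above usually start from the claims that are stuck. The following is a minimal sketch, assuming the column order printed by `kubectl get pvc -A --no-headers` (namespace, name, status first); the function name `pending_pvcs` is hypothetical. It only filters text on stdin and does not talk to the cluster itself.

```bash
# Hypothetical helper: read `kubectl get pvc -A --no-headers` output on
# stdin and print each Pending claim as "<namespace>/<name>".
pending_pvcs() {
  awk '$3 == "Pending" { print $1 "/" $2 }'
}
```

A possible use is `kubectl get pvc -A --no-headers | pending_pvcs`, then feeding each result into `kubectl describe pvc` to read the provisioning events.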