Add operational cheatsheets across repository
lint / shell-yaml-ansible (push) Failing after 17s

This commit is contained in:
Mateusz Suski
2026-05-09 09:41:55 +00:00
parent ca5a876d03
commit 0d3905b8a1
6 changed files with 1394 additions and 0 deletions
+7
@@ -4,6 +4,12 @@
### Added
- Cross-repository operational documentation structure:
- `infra-run/docs/operations-cheatsheet.md`
- `platform-projects/docs/platform-cheatsheet.md`
- `labs/docs/lab-cheatsheet.md`
- Production-oriented Linux/Unix operations reference with incident workflows, storage and networking checks, SSL/TLS notes, AIX commands, automation safety patterns, Ansible operational usage, and observability quick-reference.
- SELinux operational coverage for mode checks, context inspection, AVC audit review, persistent relabel workflow, booleans, and SELinux-specific incident response.
- Selected baseline Ansible hardening automation:
- RHEL 9 role and playbook.
- Debian 13 / Ubuntu 26.04 role and playbook.
@@ -13,6 +19,7 @@
### Changed
- Updated repository and `infra-run` README files to surface the new documentation structure and operational cheatsheets.
- Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure.
### Notes
+14
@@ -17,6 +17,20 @@ It is a technical portfolio, not a production toolkit. The examples are meant to
The `labs` and `platform-projects` trees are intentionally thin. They are kept as planning areas for future lab notes and case studies, not as completed projects. Current planned topics are tracked in [ROADMAP.md](./ROADMAP.md).
## Documentation
### Production Operations
- [infra-run/docs/operations-cheatsheet.md](./infra-run/docs/operations-cheatsheet.md) - production-focused Linux/Unix operations reference for incident handling, validation, storage, networking, Ansible, observability, and safety-first change execution.
### Platform Engineering
- [platform-projects/docs/platform-cheatsheet.md](./platform-projects/docs/platform-cheatsheet.md) - platform operations reference for Kubernetes, Helm, containers, Terraform, CI/CD, observability, and GPU-backed infrastructure troubleshooting.
### Labs & Experiments
- [labs/docs/lab-cheatsheet.md](./labs/docs/lab-cheatsheet.md) - quick-reference scratchpad for K3s, Proxmox, Terraform, Docker, networking, and short-lived lab troubleshooting work.
## What This Repo Is Not
- It is not a compliance benchmark implementation.
+4
@@ -13,6 +13,10 @@ The goal is to show operational judgment, not to ship a universal automation pro
- [ansible](./ansible/) - selected baseline hardening examples for RHEL-like Linux, Debian/Ubuntu, and AIX.
- [examples](./examples/) - sanitized sample command outputs and incident notes.
## Documentation
- [docs/operations-cheatsheet.md](./docs/operations-cheatsheet.md) - production operations quick reference covering Linux/Unix triage, text processing, incident workflows, networking, storage, AIX, SSL/TLS, automation safety, Ansible execution, observability, and operational habits.
## What This Is
- A portfolio project for Linux and infrastructure operations roles.
+857
@@ -0,0 +1,857 @@
# Production Operations Cheatsheet
Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.
## Linux / Unix Daily Operations
### Uptime and Host State
Check host age, kernel, clock, and recent reboot history before touching anything:
```bash
uptime
uname -r
hostnamectl
timedatectl
who -b
last -x | head -20
```
Pre-check pattern:
```bash
date -u
uptime
df -h
free -m
systemctl --failed
```
### Process Management
```bash
ps -ef | head
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pgrep -a java
pstree -ap | less
pidof sshd
renice +5 -p <pid>
kill -TERM <pid>
kill -9 <pid> # DANGEROUS: last resort only
```
Validation:
```bash
ps -p <pid> -o pid,stat,etime,cmd
journalctl -u <service> -n 50 --no-pager
```
### systemctl
```bash
systemctl status <service> --no-pager -l
systemctl is-active <service>
systemctl is-enabled <service>
systemctl list-units --type=service --state=running
systemctl list-units --failed
systemctl daemon-reload
systemctl restart <service> # impact: confirm service interruption policy first
```
### journalctl
```bash
journalctl -u <service> -n 100 --no-pager
journalctl -u <service> --since '30 min ago'
journalctl -p err -S today
journalctl -k -b
journalctl --disk-usage
```
### Service Troubleshooting Flow
1. Confirm service state and recent restart count.
2. Read the last 100-200 journal lines.
3. Validate config syntax before restart if the daemon supports it.
4. Check dependent ports, mounts, credentials, and name resolution.
5. Restart only after cause is understood or rollback exists.
Example:
```bash
systemctl status nginx --no-pager -l
journalctl -u nginx -n 100 --no-pager
nginx -t
ss -ltnp | grep -E ':(80|443)\b'   # word boundary avoids matching :8080 or :4430
curl -kI https://127.0.0.1/
```
### CPU and Memory Diagnostics
```bash
uptime
top -H -b -n 1 | head -40
pidstat 1 5
pidstat -ru -p ALL 1 3
vmstat 1 5
iostat -xz 1 5
free -m
sar -q 1 5
```
Quick interpretation:
- high `%wa`: storage path or NFS issue
- high run queue with low CPU idle: CPU contention
- swap growth plus page scans: memory pressure
### Disk Usage
```bash
df -hT
du -xhd1 /var | sort -h
find /var/log -type f -size +500M -ls | sort -k7,7n
lsof +L1
```
### Inode Exhaustion
```bash
df -ih
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
find /tmp -xdev -type f | wc -l
```
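The per-directory count above can be wrapped in a small helper when the same check is repeated on several trees. This is a minimal sketch, assuming GNU `find` (for `-printf`); the function name `count_files_by_dir` is hypothetical, and directory names containing newlines are not handled.

```bash
# Hypothetical helper: count regular files under each immediate
# subdirectory of a root, largest count last, staying on one filesystem.
count_files_by_dir() {
  local root=${1:?usage: count_files_by_dir <dir>}
  # %P prints the path relative to the root; -mindepth 2 skips files
  # that sit directly in the root itself.
  find "$root" -xdev -mindepth 2 -type f -printf '%P\n' 2>/dev/null \
    | cut -d/ -f1 \
    | sort | uniq -c | sort -n
}
```

Usage might look like `count_files_by_dir /var | tail -5` to see the five subdirectories holding the most files.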
### Mounts
```bash
mount | column -t
findmnt
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
cat /etc/fstab
mount -a # can expose bad fstab entries; use in change window
```
### Permissions
```bash
namei -l /path/to/file
stat /path/to/file
getfacl /path/to/file
chmod 640 /path/to/file
chown root:app /path/to/file
```
### SELinux
State and mode:
```bash
getenforce
sestatus
cat /etc/selinux/config
```
Check file, process, and port context:
```bash
ls -Zd /var/www/html
ls -lZ /var/www/html/index.html
ps -eZ | grep nginx
id -Z
semanage port -l | grep http
```
Audit and denial review:
```bash
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts today | audit2why
journalctl -t setroubleshoot --since '1 hour ago'
sealert -a /var/log/audit/audit.log
```
Typical flow:
1. Confirm SELinux mode is `Enforcing` or `Permissive`.
2. Identify the failing path, process domain, and target context.
3. Read AVC denials before changing labels or booleans.
4. Prefer persistent policy-aligned fixes over `chcon`.
5. Restore default labels and retest service path.
Modify and restore context:
```bash
chcon -t httpd_sys_content_t /srv/app/index.html # temporary until relabel/restore
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads # temporary until relabel/restore
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
restorecon -Rv /srv/app
matchpathcon /srv/app/uploads/file.txt
```
Booleans and validation:
```bash
getsebool -a | grep httpd
getsebool httpd_can_network_connect
setsebool -P httpd_can_network_connect on
runcon -t httpd_t -- id -Z
```
Notes:
- prefer `semanage fcontext` plus `restorecon` for persistent fixes
- use `chcon` only as a short-lived diagnostic or emergency workaround
- avoid generating local policy modules from `audit2allow` until root cause is understood
- after context changes, validate service startup, AVC silence, and application path access
### Archives
```bash
tar tf backup.tar | head
tar czf logs-$(date +%F).tgz /var/log/app
tar xzf bundle.tgz -C /restore/path
gzip -t file.gz
```
### File Operations
```bash
cp -a source/ target/
rsync -aHAXvn /src/ /dst/
rsync -aHAX --delete --info=progress2 /src/ /dst/ # impact: verify source/destination twice
mv file file.$(date +%F-%H%M%S).bak
sha256sum file
```
## Text Processing & Regex
### Core Tools
```bash
grep -n 'ERROR' app.log
grep -E 'ERROR|WARN' app.log
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
awk '{print $1,$4,$5}' access.log
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
sed -n '1,20p' file
sed -E 's/[[:space:]]+/ /g' file
cut -d: -f1,7 /etc/passwd
sort file | uniq -c | sort -nr
xargs -r -n1 systemctl status < service-list.txt
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
```
### Regex Reference
```text
IPv4 \b(?:\d{1,3}\.){3}\d{1,3}\b
ISO timestamp \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
UUID \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
Log level \b(?:ERROR|WARN|INFO)\b
Failed SSH Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
Ansible changed/fail ^(changed|fatal|failed):\s+\[[^]]+\]
```
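The IPv4 pattern above checks only the shape of the address, so it also accepts strings such as `999.999.999.999`. When a real range check matters, something along these lines could be used. This is a sketch in plain bash; the function name `is_ipv4` is hypothetical.

```bash
# Hypothetical helper: return 0 only for a dotted quad whose octets are 0-255.
is_ipv4() {
  local ip=$1 octet
  [[ $ip =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ ]] || return 1
  for octet in "${BASH_REMATCH[@]:1}"; do
    # 10# forces base 10 so a leading zero is not read as octal
    (( 10#$octet <= 255 )) || return 1
  done
}
```

It can be used as a filter, for example `while read -r ip; do is_ipv4 "$ip" && echo "$ip"; done < candidates.txt`, where `candidates.txt` is an illustrative file name.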
### Log Parsing Examples
IP extraction:
```bash
grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
```
Timestamp filter:
```bash
grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
```
UUID extraction:
```bash
grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
```
ERROR/WARN/INFO parsing:
```bash
grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
```
Failed SSH login parsing:
```bash
# fields counted from the end of the line: user is $(NF-5), source IP is $(NF-3)
grep 'Failed password' /var/log/secure \
  | awk '{print $(NF-5), $(NF-3)}' \
  | sort | uniq -c | sort -nr | head
```
Extract fields from logs:
```bash
awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
```
Filter Ansible output:
```bash
grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
grep -E '^fatal:|^failed:' ansible.log
```
## Incident Response
### Disk Full
Workflow:
```bash
df -hT
df -ih
findmnt
du -xhd1 /var | sort -h
find /var -xdev -type f -size +1G -ls | sort -k7,7n
lsof +L1
journalctl --disk-usage
```
Typical branches:
- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
- inode full: remove file storms, spool buildup, temp-file leaks
- deleted open files: restart offender only after sizing impact
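To decide quickly which branch applies, the `df` output can be filtered by a usage threshold. The sketch below assumes POSIX-format output (`df -P`), where the usage percentage is the fifth column and the mount point the sixth; the function name `over_threshold` is hypothetical, and mount points containing spaces are not handled.

```bash
# Hypothetical helper: read `df -P` style output on stdin and print
# mount points whose usage is at or above a threshold (default 90).
over_threshold() {
  local limit=${1:-90}
  awk -v limit="$limit" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)            # strip the percent sign before comparing
      if (use + 0 >= limit) print $6, use "%"
    }'
}
```

Typical use would be `df -P | over_threshold 85`. On GNU `df`, `df -Pi` appears to use the same column positions for inode usage, so the same filter should apply there, but verify the layout on the target platform first.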
Post-check:
```bash
df -hT
df -ih
systemctl --failed
```
### High CPU
```bash
uptime
mpstat -P ALL 1 5
pidstat -u -p ALL 1 5
top -H -b -n 1 | head -40
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
```
Flow:
1. Confirm sustained load, not a short spike.
2. Separate user CPU vs system CPU vs I/O wait.
3. Identify hot process and hot threads.
4. Correlate with deploys, cron, backups, or JVM GC.
5. Throttle, stop, or fail over only with service impact understood.
### Memory Pressure
```bash
free -m
vmstat 1 5
sar -r 1 5
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
dmesg -T | grep -Ei 'oom|killed process'
```
Flow:
1. Check swap growth and page scan rates.
2. Identify top RSS owners.
3. Check kernel logs for OOM.
4. Validate cache vs real process growth.
5. Restart leaking service only after capturing evidence.
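For step 4, `MemAvailable` in `/proc/meminfo` is the kernel's estimate of memory that can be used without swapping, including reclaimable cache. A low `free` column with a healthy `MemAvailable` usually points to cache rather than process growth. A minimal sketch, with a hypothetical function name:

```bash
# Hypothetical helper: read /proc/meminfo-style text on stdin and print
# MemAvailable as a whole percentage of MemTotal.
mem_available_pct() {
  awk '$1 == "MemTotal:"     { total = $2 }
       $1 == "MemAvailable:" { avail = $2 }
       END {
         if (total > 0) printf "%d\n", avail * 100 / total
         else exit 1               # input did not look like meminfo
       }'
}
```

It could be called as `mem_available_pct < /proc/meminfo` and sampled a few times during the incident to see whether the value keeps falling.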
### Failed Service
```bash
systemctl status <service> --no-pager -l
journalctl -u <service> -b --no-pager | tail -100
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
```
Flow:
1. Validate config.
2. Validate credentials, ports, mounts, permissions.
3. Confirm dependency availability.
4. Restart and recheck logs immediately.
### SELinux Denials
Typical case: service works in `Permissive`, fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct.
Triage:
```bash
getenforce
sestatus
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts recent | audit2why
journalctl -t setroubleshoot --since '30 min ago'
systemctl status <service> --no-pager -l
ps -eZ | grep <service>
ls -lZ /path/to/app /path/to/app/*
```
Flow:
1. Confirm the failure is current and reproducible.
2. Identify the denied process domain, target path, and requested access from AVC logs.
3. Validate expected default context with `matchpathcon`.
4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
5. Apply the smallest persistent fix, then retest in `Enforcing`.
Common fixes:
```bash
matchpathcon /srv/app/config.yml
restorecon -Rv /srv/app
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
semanage port -l | grep http
getsebool -a | grep httpd
setsebool -P httpd_can_network_connect on
```
Validation:
```bash
getenforce
systemctl restart <service>
systemctl status <service> --no-pager -l
ausearch -m AVC -ts recent
curl -fsS http://127.0.0.1:<port>/health
```
Operational notes:
- do not leave systems in `Permissive` as the fix
- prefer `restorecon` and `semanage fcontext` over repeated `chcon`
- treat `audit2allow` output as investigation material, not automatic remediation
- if policy changes are unavoidable, document exact AVC evidence and rollback path
### SSL Issues
```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
curl -vkI https://host/
```
Check for:
- expired certificate
- missing SAN
- incomplete chain
- hostname mismatch
- TLS version or cipher mismatch
### DNS Issues
```bash
dig +short app.example.com
dig @<resolver> app.example.com
dig +trace app.example.com
getent hosts app.example.com
resolvectl status
```
Flow:
1. Compare resolver result with authoritative result.
2. Check TTL and stale cache.
3. Validate `/etc/resolv.conf`, local resolver, and search domains.
4. Test from affected host and unaffected host.
### Network Issues
```bash
ip addr
ip route
ss -tulpen
tcpdump -ni any host <peer> and port <port>
curl -sv http://host:port/health
mtr -rwzc 20 host
```
Flow:
1. Interface/link state.
2. Route and source IP selection.
3. Listening socket on target.
4. Firewall and security controls.
5. Packet capture if app logs are inconclusive.
### JVM / Tomcat Issues
```bash
ps -ef | grep -i tomcat
jcmd <pid> VM.flags
jstat -gcutil <pid> 1000 10
jstack <pid> | head -100
ss -ltnp | grep java
tail -100 /opt/tomcat/logs/catalina.out
```
Focus:
- stuck threads
- full GC loops
- heap exhaustion
- connector bind failures
- slow backend dependency
### Certificate Expiration
```bash
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
| openssl x509 -noout -enddate
openssl x509 -checkend 2592000 -noout -in cert.pem
```
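The `-checkend` argument is a window in seconds, so `2592000` above means 30 days. A tiny helper makes the intent readable; the function name `days_to_seconds` is hypothetical.

```bash
# Hypothetical helper: convert a day count to the seconds value
# that `openssl x509 -checkend` expects.
days_to_seconds() {
  echo $(( ${1:?usage: days_to_seconds <days>} * 86400 ))
}
```

For example, `openssl x509 -checkend "$(days_to_seconds 30)" -noout -in cert.pem` exits 0 when the certificate is still valid 30 days from now and 1 when it will have expired by then.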
### Suspicious Login Attempts
```bash
last -ai | head -30
lastb -ai | head -30
grep 'Failed password' /var/log/secure | tail -50
grep 'Accepted ' /var/log/secure | tail -50
ausearch -m USER_LOGIN -ts recent
```
Workflow:
1. Identify source IPs and usernames.
2. Validate whether attempts are expected from bastions/scanners.
3. Check successful logins from same sources.
4. Review sudo usage and persistence changes.
5. Preserve logs before cleanup or rotation.
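Step 1 can be sketched as a filter that reads sshd log lines and counts failed attempts per user and source address. The function name `failed_ssh_summary` is hypothetical, the pattern assumes the common OpenSSH message format, and IPv6 sources are ignored.

```bash
# Hypothetical helper: read sshd log lines on stdin and print
# "count user ip" for failed password attempts, most frequent first.
failed_ssh_summary() {
  sed -nE 's/.*Failed password for (invalid user )?([^ ]+) from ([0-9.]+) port.*/\2 \3/p' \
    | sort | uniq -c | sort -nr
}
```

A typical call would be `failed_ssh_summary < /var/log/secure | head -20`; on Debian-family hosts the log path is usually `/var/log/auth.log` instead.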
## Networking Operations
```bash
ip -br addr
ip route get 8.8.8.8
ss -ltnp
ss -tn state established '( sport = :443 or dport = :443 )'
tcpdump -ni eth0 port 53
dig +short mx example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
mtr -rwzc 10 host
traceroute -T -p 443 host
openssl s_client -connect host:443 -servername host </dev/null
```
## Storage Operations
### Block and Filesystem Discovery
```bash
lsblk -f
blkid
findmnt
cat /proc/partitions
multipath -ll
```
### LVM
```bash
pvs
vgs
lvs -a -o +devices
pvdisplay /dev/sdX
vgdisplay <vg>
lvdisplay /dev/<vg>/<lv>
```
Growth example:
```bash
pvcreate /dev/mapper/mpatha # impact: writes LVM metadata to the device
vgextend vgdata /dev/mapper/mpatha # impact: changes VG layout
lvextend -L +100G -r /dev/vgdata/lvapp
```
### XFS
```bash
xfs_info /mountpoint
xfs_repair -n /dev/mapper/vg-lv # no-modify check; filesystem must be unmounted
xfs_growfs /mountpoint
```
### ext4
```bash
tune2fs -l /dev/mapper/vg-lv | head -40
e2fsck -fn /dev/mapper/vg-lv # read-only check; results are unreliable on a mounted filesystem
resize2fs /dev/mapper/vg-lv
```
### Multipath
```bash
multipath -ll
lsblk -S
udevadm info --query=all --name=/dev/mapper/mpatha | head -40
```
### NFS
```bash
showmount -e nfs-server
nfsstat -m
mount | grep nfs
rpcinfo -p nfs-server
```
### iSCSI
```bash
iscsiadm -m session
iscsiadm -m node
iscsiadm -m discovery -t sendtargets -p <target-ip>
```
### Mount Troubleshooting
```bash
findmnt /mountpoint
mount -v /mountpoint
dmesg -T | tail -50
journalctl -k -n 100 --no-pager
```
Check:
- device path stable
- UUID correct
- filesystem type correct
- multipath settled
- network and RPC available for NFS
### Filesystem Validation
```bash
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
df -hT /data
touch /data/.write-test && rm -f /data/.write-test
```
### Migration Validation Example
```bash
findmnt /data
df -hT /data
rsync -aHAXvn /olddata/ /data/
rsync -aHAXc --delete --dry-run /olddata/ /data/
sha256sum /olddata/keyfile /data/keyfile
```
## AIX Operations
```bash
oslevel -s
errpt | head
errpt -a | more
topas
lsvg -o
lsvg rootvg
lslpp -L | grep -i openssl
svmon -G
svmon -P <pid>
netstat -rn
```
## SSL/TLS Operations
### OpenSSL Checks
```bash
openssl version -a
openssl x509 -in cert.pem -noout -text | less
openssl rsa -in key.pem -check -noout # -noout avoids printing the private key
openssl verify -CAfile chain.pem cert.pem
```
### Expiration Validation
```bash
openssl x509 -enddate -noout -in cert.pem
openssl x509 -checkend 604800 -noout -in cert.pem
```
### keytool Basics
```bash
keytool -list -v -keystore keystore.jks
keytool -list -cacerts | grep -i <alias>
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
```
### Chain Validation
```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem
```
## Automation Operations
### Bash Safety Patterns
```bash
set -euo pipefail
IFS=$'\n\t'
trap 'echo "line ${LINENO}: command failed" >&2' ERR
trap 'rm -f "${tmpfile:-}"' EXIT
```
Safe loop examples:
```bash
while IFS= read -r host; do
  ssh -n "$host" uptime   # -n keeps ssh from consuming the host list on stdin
done < hostlist.txt

find /var/log -type f -name '*.gz' -print0 \
  | while IFS= read -r -d '' file; do
      gzip -t "$file" || echo "corrupt archive: $file" >&2
    done
```
Operational scripting patterns:
- default to read-only mode
- require explicit `--execute` for changes
- log actions with timestamps
- validate dependencies with `command -v`
- use temp files with `mktemp`
- guard destructive paths and empty variables
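The patterns above can be combined into a small skeleton. This is a sketch under stated assumptions, not a finished tool: every function and variable name here (`EXECUTE`, `log`, `require`, `run`, `safe_rm_dir`, `parse_args`) is hypothetical.

```bash
# Hypothetical skeleton: read-only by default, changes only with --execute.
EXECUTE=${EXECUTE:-0}

# Timestamped log line on stderr.
log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2; }

# Fail early if a dependency is missing.
require() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { log "missing dependency: $cmd"; return 1; }
  done
}

# Print the command in dry-run mode; run it only when EXECUTE=1.
run() {
  if [[ $EXECUTE -eq 1 ]]; then
    log "RUN: $*"
    "$@"
  else
    log "DRY-RUN: $*"
  fi
}

# Guard against empty variables and the root path before removing anything.
safe_rm_dir() {
  local target=${1:-}
  [[ -n $target && $target != / ]] || { log "refusing to remove '${target}'"; return 1; }
  run rm -rf -- "$target"
}

# Flip EXECUTE only when the flag is given explicitly.
parse_args() {
  local arg
  for arg in "$@"; do
    [[ $arg == --execute ]] && EXECUTE=1
  done
  return 0
}
```

A script built on this would call `parse_args "$@"` and `require rsync ssh` near the top, create scratch files with `mktemp`, and route every state-changing command through `run` so that a plain invocation only prints what it would do.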
## Ansible Operations
### Execution
```bash
ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --list | jq '.'
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
```
### Safe Rollout Workflow
1. Validate inventory and variable targeting.
2. Run syntax-check.
3. Run `--check --diff` on a single host.
4. Execute against one host or one tier.
5. Validate service health, logs, and config.
6. Expand rollout only after post-check passes.
Rollback mindset:
- keep before/after config copies
- know which tasks restart services
- define manual backout if package/config changes fail
- avoid broad `--limit` mistakes by reviewing resolved host list first
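Reviewing the resolved host list can be partly automated with `ansible-playbook --list-hosts`, which prints the hosts each play would target without running anything. The guard below is a sketch: the function name `check_host_count` is hypothetical, and it assumes the usual `hosts (N):` line in that output.

```bash
# Hypothetical guard: read `ansible-playbook ... --list-hosts` output on stdin
# and fail when any play resolves to more hosts than expected.
check_host_count() {
  local max=${1:?usage: check_host_count <max-hosts>}
  awk -v max="$max" '
    match($0, /hosts \([0-9]+\):/) {
      # "hosts (" is 7 characters; the trailing "):" is 2
      n = substr($0, RSTART + 7, RLENGTH - 9) + 0
      if (n > max) {
        printf "play resolves to %d hosts (limit %d)\n", n, max
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }'
}
```

Before a single-host rollout, a call such as `ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01 --list-hosts | check_host_count 1` would stop the run if the limit pattern matched more than intended.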
## Monitoring & Observability
### Zabbix Checks
```bash
systemctl status zabbix-agent2 --no-pager
zabbix_agent2 -t 'vfs.fs.size[/,free]'   # quoted so the shell does not glob the brackets
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
```
### ELK Log Workflows
```bash
grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
journalctl -u filebeat -n 100 --no-pager
curl -s 'http://localhost:9200/_cluster/health?pretty'
```
### Grafana Checks
```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error' /var/log/grafana/grafana.log | tail -50
```
### Health Endpoints and Alert Validation
```bash
curl -fsS http://app:8080/health
curl -fsS http://app:8080/metrics | head
```
False positive validation:
1. Compare alert timestamp with deploy/change window.
2. Confirm on-host evidence, not only dashboard data.
3. Check collector lag, scrape failures, and stale metrics.
4. Validate from a second source before escalating.
## Operational Habits
### Pre-checks
- capture time, hostname, and operator
- capture current config and service state
- check recent alerts, maintenance windows, and dependencies
- confirm backup or rollback path exists
### Post-checks
- validate service state
- validate logs for fresh errors
- validate client path, ports, and name resolution
- compare metrics before/after
### Rollback Thinking
- define exact backout trigger before change
- prefer reversible steps
- keep config backups with timestamps
- avoid bundling unrelated changes
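Timestamped config backups are easy to standardize. A minimal sketch, assuming the backup may live next to the original file; the function name `backup_file` is hypothetical.

```bash
# Hypothetical helper: copy a config file next to itself with a UTC
# timestamp suffix before editing, and print the backup path.
backup_file() {
  local src=${1:?usage: backup_file <path>} dest
  [[ -f $src ]] || { echo "not a regular file: $src" >&2; return 1; }
  dest="$src.$(date -u +%Y%m%dT%H%M%SZ).bak"
  # -a preserves mode, ownership, and timestamps of the original
  cp -a -- "$src" "$dest" && printf '%s\n' "$dest"
}
```

Capturing the printed path, for example `bak=$(backup_file /etc/nginx/nginx.conf)`, gives the backout step an exact file to restore from.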
### Change Validation
```bash
systemctl is-active <service>
curl -fsS http://127.0.0.1:<port>/health
ss -ltnp | grep :<port>
journalctl -u <service> -S '5 min ago' --no-pager
```
### Operational Communication
- state scope, risk, and expected impact before action
- record start and stop times in UTC
- document what changed, what was checked, and remaining risk
- escalate with evidence, not assumptions
### Evidence Collection During Incidents
```bash
dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)   # one directory, reused by every redirect
mkdir -p "$dir"
journalctl -b > "$dir/journal.txt"
ss -tulpen > "$dir/sockets.txt"
df -hT > "$dir/df.txt"
free -m > "$dir/free.txt"
```
+144
@@ -0,0 +1,144 @@
# Lab Cheatsheet
Quick-reference notes for experiments, rebuilds, and short-lived troubleshooting. Expect rough edges. Capture what worked, what broke, and what should not be repeated in production.
## K3s Lab
```bash
sudo systemctl status k3s --no-pager
sudo journalctl -u k3s -n 100 --no-pager
kubectl get nodes -o wide
kubectl get pods -A
kubectl get events -A --sort-by=.lastTimestamp | tail -30
sudo k3s kubectl get pods -A
```
Quick reset:
```bash
sudo /usr/local/bin/k3s-uninstall.sh # destructive lab reset
```
## Proxmox Lab
```bash
pvesh get /nodes
pvesh get /cluster/resources
qm list
qm config <vmid>
pct list
ha-manager status
```
Checks before changes:
```bash
zpool status
pvesm status
ip -br addr
```
## GPU Passthrough
```bash
lspci -nn | grep -Ei 'vga|3d|nvidia'
nvidia-smi
dmesg -T | grep -Ei 'vfio|iommu|nvidia'
find /sys/kernel/iommu_groups/ -type l | sort
```
Good sanity check:
```bash
lsmod | grep -E 'vfio|kvm'
```
## Terraform Experiments
```bash
terraform fmt -recursive
terraform init
terraform validate
terraform plan
terraform state list
```
Scratch workflow:
```bash
terraform plan -out=tfplan
terraform show tfplan
```
## Networking Labs
```bash
ip -br addr
ip route
bridge link
ss -ltnp
tcpdump -ni any port 53
dig +short example.com
mtr -rwzc 10 1.1.1.1
```
## Ansible Testing
```bash
ansible-inventory -i inventory/hosts.yml --graph
ansible-playbook -i inventory/hosts.yml playbook.yml --syntax-check
ansible-playbook -i inventory/hosts.yml playbook.yml --check --diff
ansible all -i inventory/hosts.yml -m ping
```
## Docker Testing
```bash
docker ps -a
docker logs --tail 100 <container>
docker exec -it <container> sh
docker inspect <container> | jq '.[0].NetworkSettings'
docker system df
```
## Useful Temporary Commands
```bash
watch -n2 'kubectl get pods -A'
watch -n2 'nvidia-smi'
watch -n2 'ip -br addr'
while true; do date -u; curl -fsS http://127.0.0.1:8080/health; sleep 2; done
```
## Quick PoC Commands
```bash
python3 -m http.server 8080
openssl req -x509 -newkey rsa:2048 -nodes -days 3 -keyout key.pem -out cert.pem
curl -vk https://127.0.0.1:8443/
nc -lvkp 9000
```
## Troubleshooting Notes
- If K3s pods fail after host reboot, check time sync before chasing cert or API errors.
- If PVCs stay pending in lab clusters, inspect the default storage class first.
- If Docker networking looks broken, compare bridge subnet overlaps with the host route table.
- If GPU pods see no devices, validate driver, toolkit, and device plugin in that order.
## Useful One-liners
```bash
kubectl get pods -A -o wide | grep -E 'CrashLoopBackOff|Error|Pending'
journalctl -p err -S today
find /var/log -type f -mtime -1 -ls | sort -k7,7n
ps -eo pid,%cpu,%mem,cmd --sort=-%cpu | head
grep -RniE 'error|failed|timeout' .
```
## Things Worth Remembering
- Pre-checks still matter in labs. Capture state before trying the risky thing.
- Keep a copy of working configs before rapid iteration.
- Short-lived labs still produce useful evidence; save command output when a fix works.
- If a PoC needs repeated manual repair, turn the repair steps into a script or note.
@@ -0,0 +1,368 @@
# Platform Engineering Cheatsheet
Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.
## Kubernetes / K3s
### Contexts, Namespaces, and Basic Workflows
```bash
kubectl config get-contexts
kubectl config use-context <context>
kubectl get ns
kubectl -n <ns> get pods -o wide
kubectl -n <ns> get deploy,sts,ds,svc,ingress
kubectl get nodes -o wide
```
### Describe, Logs, Exec, Events
```bash
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --tail=100
kubectl -n <ns> logs <pod> -c <container> --previous
kubectl -n <ns> exec -it <pod> -- sh
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
```
### Rollout Troubleshooting
```bash
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> rollout history deploy/<name>
kubectl -n <ns> rollout undo deploy/<name>
kubectl -n <ns> get rs -l app=<name>
```
Safe pattern:
1. `kubectl diff -f <manifest>`
2. apply to non-prod or canary namespace
3. watch rollout and events
4. validate service and logs
5. expand scope only after post-check
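In step 1, `kubectl diff` reports its result through the exit status: 0 means no differences, 1 means differences were found, and anything higher means the command itself failed. A wrapper can make that explicit in scripts. This is a sketch; the function name `diff_result` is hypothetical.

```bash
# Hypothetical helper: translate the exit status of `kubectl diff`
# (0 = no differences, 1 = differences found, >1 = error) into a word.
diff_result() {
  case ${1:?usage: diff_result <exit-status>} in
    0) echo "no-change" ;;
    1) echo "changes" ;;
    *) echo "error"; return 1 ;;
  esac
}
```

Usage could be `kubectl diff -f manifest.yaml; diff_result $?`, where `manifest.yaml` is an illustrative file name. Treating `error` as a hard stop keeps a failed diff from being mistaken for a clean one.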
### Node Validation
```bash
kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
```
### Pending / CrashLoopBackOff Flow
Pending:
```bash
kubectl -n <ns> describe pod <pod>
kubectl get events -A --sort-by=.lastTimestamp | tail -50
```
Check for:
- unsatisfied CPU/memory requests
- missing PVC
- taints/tolerations mismatch
- image pull secret issues
- node selectors or affinity mismatch
CrashLoopBackOff:
```bash
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
```
Check for:
- bad config or missing env vars
- probe failures
- dependency timeouts
- permission or filesystem errors
## Helm
```bash
helm repo list
helm repo update
helm list -A
helm -n <ns> get values <release> -a
helm -n <ns> get manifest <release>
helm upgrade --install <release> <chart> -n <ns> -f values.yaml
helm rollback -n <ns> <release> <revision>
helm template <release> <chart> -f values.yaml | less
```
Validation:
```bash
helm lint <chart>
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20
```
## Docker / Podman
```bash
docker images
docker ps -a
docker logs --tail 100 <container>
docker exec -it <container> sh
docker inspect <container>
docker volume ls
docker network ls
docker system df
docker image prune -f # cleanup: review first
docker container prune -f # cleanup: review first
podman ps -a
podman inspect <container>
```
Container validation:
```bash
docker exec <container> env | sort
docker exec <container> ss -ltnp
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
```
## Terraform
### Core Commands
```bash
terraform fmt -check -recursive
terraform init
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
terraform destroy -target=<resource> # impact: targeted destruction needs review
terraform state list
terraform state show <resource>
terraform import <resource> <id>
```
### Safe Workflow
1. `terraform fmt -check -recursive`
2. `terraform validate`
3. refresh provider auth and backend access
4. review `plan` output for replacements and destroys
5. save plan artifact
6. apply reviewed plan only
7. validate resource state outside Terraform
Plan review focus:
- unexpected replacement
- drift on security groups, routes, storage, or instance identity
- provider alias mistakes
- wrong workspace or backend
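Replacements and destroys can be pulled out of a saved plan before the full review. The filter below is a sketch that scans the human-readable output of `terraform show -no-color tfplan`; the function name `risky_changes` is hypothetical, and the matched phrases assume the usual plan wording, which may differ between Terraform versions.

```bash
# Hypothetical helper: read `terraform show -no-color tfplan` output on stdin
# and print resources the plan would replace or destroy.
risky_changes() {
  grep -E '^[[:space:]]*# .* (must be replaced|will be destroyed)$' \
    | sed -E 's/^[[:space:]]*# //'
}
```

For example, `terraform show -no-color tfplan | risky_changes` prints one line per affected resource. For anything automated, the JSON form from `terraform show -json` is likely a more stable input than the text output.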
## CI/CD Operations
### GitLab CI
```bash
gitlab-runner verify
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects/<id>/pipelines
```
### Jenkins
```bash
systemctl status jenkins --no-pager
journalctl -u jenkins -n 100 --no-pager
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
```
### Runners, Artifacts, Pipeline Failures
```bash
docker logs --tail 100 gitlab-runner
kubectl -n ci get pods
kubectl -n ci logs deploy/runner-controller --tail=100
```
Troubleshooting flow:
1. validate YAML or Jenkinsfile syntax
2. confirm runner/agent availability
3. inspect job logs for auth, cache, DNS, or registry failures
4. verify artifacts were uploaded and not expired
5. correlate with platform outages, image changes, or secret rotation
YAML validation:
```bash
yamllint .
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
```
## Observability
### Prometheus
```bash
curl -s http://prometheus:9090/-/ready
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
```
### Loki
```bash
curl -s http://loki:3100/ready
curl -Gs http://loki:3100/loki/api/v1/query --data-urlencode 'query={app="nginx"} |= "error"'
```
### Grafana
```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
```
### Metrics Validation and Log Correlation
```bash
kubectl -n <ns> port-forward svc/<svc> 9090:9090
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
```
Correlation flow:
1. confirm alert time and impacted objects
2. inspect deployment events in same window
3. compare Prometheus series, Loki logs, and app logs
4. rule out scrape lag or stale dashboards
## GPU / AI Infrastructure
### GPU Discovery and CUDA Validation
```bash
nvidia-smi
nvidia-smi -L
nvidia-smi topo -m
nvidia-smi dmon -s pucm
nvcc --version
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
```
### MIG Basics
```bash
nvidia-smi -i 0 -q | grep -i mig -A4
nvidia-smi mig -lgip
nvidia-smi mig -lgi
```
### GPU Operator and DCGM
```bash
kubectl get pods -A | grep -E 'nvidia|gpu'
kubectl -n gpu-operator describe pod <pod>
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
```
### Container GPU Validation
```bash
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
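# `kubectl run` no longer accepts --limits in recent releases; request the GPU through --overrides
kubectl run gpu-check --rm -it --restart=Never \
  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-check","image":"nvidia/cuda:12.3.2-base-ubuntu22.04","command":["nvidia-smi"],"stdin":true,"tty":true,"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'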
```
### Kubernetes GPU Troubleshooting
Check for:
- device plugin not running
- driver/container toolkit mismatch
- node missing `nvidia.com/gpu` allocatable resources
- MIG profile mismatch
- taints or tolerations blocking placement
Useful checks:
```bash
kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
kubectl -n <ns> describe pod <gpu-pod>
```
## Platform Troubleshooting Flows
### Pod Not Starting
```bash
kubectl -n <ns> get pod <pod> -o wide
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
```
### Image Pull Errors
```bash
kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
crictl images | grep <image>
ctr -n k8s.io images ls | grep <image>
```
Check:
- image tag exists
- registry reachable
- pull secret valid
- node clock sane for token-based auth
### Failing Deployment
```bash
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> describe deploy/<name>
kubectl -n <ns> get rs,pods -l app=<name> -o wide
```
### Node Not Ready
```bash
kubectl describe node <node>
journalctl -u k3s -n 100 --no-pager   # K3s server nodes; agents log under k3s-agent
systemctl status kubelet --no-pager   # nodes that run a standalone kubelet (not K3s)
df -h
free -m
```
Check:
- kubelet or k3s service state
- disk pressure
- cert expiry
- CNI failure
- API reachability
### Storage Provisioning Issues
```bash
kubectl get pvc,pv -A
kubectl -n <ns> describe pvc <pvc>
kubectl get sc
kubectl -n kube-system logs deploy/<csi-controller> --tail=100
```
Check:
- storage class defaulting
- access mode mismatch
- CSI controller errors
- backend quota or LUN exhaustion
- node attachment failures