@@ -4,6 +4,12 @@
|
||||
|
||||
### Added
|
||||
|
||||
- Cross-repository operational documentation structure:
|
||||
- `infra-run/docs/operations-cheatsheet.md`
|
||||
- `platform-projects/docs/platform-cheatsheet.md`
|
||||
- `labs/docs/lab-cheatsheet.md`
|
||||
- Production-oriented Linux/Unix operations reference with incident workflows, storage and networking checks, SSL/TLS notes, AIX commands, automation safety patterns, Ansible operational usage, and observability quick-reference.
|
||||
- SELinux operational coverage for mode checks, context inspection, AVC audit review, persistent relabel workflow, booleans, and SELinux-specific incident response.
|
||||
- Selected baseline Ansible hardening automation:
|
||||
- RHEL 9 role and playbook.
|
||||
- Debian 13 / Ubuntu 26.04 role and playbook.
|
||||
@@ -13,6 +19,7 @@
|
||||
|
||||
### Changed
|
||||
|
||||
- Updated repository and `infra-run` README files to surface the new documentation structure and operational cheatsheets.
|
||||
- Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure.
|
||||
|
||||
### Notes
|
||||
|
||||
@@ -17,6 +17,20 @@ It is a technical portfolio, not a production toolkit. The examples are meant to
|
||||
|
||||
The `labs` and `platform-projects` trees are intentionally thin. They are kept as planning areas for future lab notes and case studies, not as completed projects. Current planned topics are tracked in [ROADMAP.md](./ROADMAP.md).
|
||||
|
||||
## Documentation
|
||||
|
||||
### Production Operations
|
||||
|
||||
- [infra-run/docs/operations-cheatsheet.md](./infra-run/docs/operations-cheatsheet.md) - production-focused Linux/Unix operations reference for incident handling, validation, storage, networking, Ansible, observability, and safety-first change execution.
|
||||
|
||||
### Platform Engineering
|
||||
|
||||
- [platform-projects/docs/platform-cheatsheet.md](./platform-projects/docs/platform-cheatsheet.md) - platform operations reference for Kubernetes, Helm, containers, Terraform, CI/CD, observability, and GPU-backed infrastructure troubleshooting.
|
||||
|
||||
### Labs & Experiments
|
||||
|
||||
- [labs/docs/lab-cheatsheet.md](./labs/docs/lab-cheatsheet.md) - quick-reference scratchpad for K3s, Proxmox, Terraform, Docker, networking, and short-lived lab troubleshooting work.
|
||||
|
||||
## What This Repo Is Not
|
||||
|
||||
- It is not a compliance benchmark implementation.
|
||||
|
||||
@@ -13,6 +13,10 @@ The goal is to show operational judgment, not to ship a universal automation pro
|
||||
- [ansible](./ansible/) - selected baseline hardening examples for RHEL-like Linux, Debian/Ubuntu, and AIX.
|
||||
- [examples](./examples/) - sanitized sample command outputs and incident notes.
|
||||
|
||||
## Documentation
|
||||
|
||||
- [docs/operations-cheatsheet.md](./docs/operations-cheatsheet.md) - production operations quick reference covering Linux/Unix triage, text processing, incident workflows, networking, storage, AIX, SSL/TLS, automation safety, Ansible execution, observability, and operational habits.
|
||||
|
||||
## What This Is
|
||||
|
||||
- A portfolio project for Linux and infrastructure operations roles.
|
||||
|
||||
@@ -0,0 +1,857 @@
|
||||
# Production Operations Cheatsheet
|
||||
|
||||
Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.
|
||||
|
||||
## Linux / Unix Daily Operations
|
||||
|
||||
### Uptime and Host State
|
||||
|
||||
Check host age, kernel, clock, and recent reboot history before touching anything:
|
||||
|
||||
```bash
|
||||
uptime
|
||||
uname -r
|
||||
hostnamectl
|
||||
timedatectl
|
||||
who -b
|
||||
last -x | head -20
|
||||
```
|
||||
|
||||
Pre-check pattern:
|
||||
|
||||
```bash
|
||||
date -u
|
||||
uptime
|
||||
df -h
|
||||
free -m
|
||||
systemctl --failed
|
||||
```
|
||||
|
||||
### Process Management
|
||||
|
||||
```bash
|
||||
ps -ef | head
|
||||
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
|
||||
pgrep -a java
|
||||
pstree -ap | less
|
||||
pidof sshd
|
||||
renice +5 -p <pid>
|
||||
kill -TERM <pid>
|
||||
kill -9 <pid> # DANGEROUS: last resort only
|
||||
```
|
||||
|
||||
Validation:
|
||||
|
||||
```bash
|
||||
ps -p <pid> -o pid,stat,etime,cmd
|
||||
journalctl -u <service> -n 50 --no-pager
|
||||
```
|
||||
|
||||
### systemctl
|
||||
|
||||
```bash
|
||||
systemctl status <service> --no-pager -l
|
||||
systemctl is-active <service>
|
||||
systemctl is-enabled <service>
|
||||
systemctl list-units --type=service --state=running
|
||||
systemctl list-units --failed
|
||||
systemctl daemon-reload
|
||||
systemctl restart <service> # impact: confirm service interruption policy first
|
||||
```
|
||||
|
||||
### journalctl
|
||||
|
||||
```bash
|
||||
journalctl -u <service> -n 100 --no-pager
|
||||
journalctl -u <service> --since '30 min ago'
|
||||
journalctl -p err -S today
|
||||
journalctl -k -b
|
||||
journalctl --disk-usage
|
||||
```
|
||||
|
||||
### Service Troubleshooting Flow
|
||||
|
||||
1. Confirm service state and recent restart count.
|
||||
2. Read the last 100-200 journal lines.
|
||||
3. Validate config syntax before restart if the daemon supports it.
|
||||
4. Check dependent ports, mounts, credentials, and name resolution.
|
||||
5. Restart only after cause is understood or rollback exists.
|
||||
|
||||
Example:
|
||||
|
||||
```bash
|
||||
systemctl status nginx --no-pager -l
|
||||
journalctl -u nginx -n 100 --no-pager
|
||||
nginx -t
|
||||
ss -ltnp | grep ':80\|:443'
|
||||
curl -kI https://127.0.0.1/
|
||||
```
|
||||
|
||||
### CPU and Memory Diagnostics
|
||||
|
||||
```bash
|
||||
uptime
|
||||
top -H -b -n 1 | head -40
|
||||
pidstat 1 5
|
||||
pidstat -ru -p ALL 1 3
|
||||
vmstat 1 5
|
||||
iostat -xz 1 5
|
||||
free -m
|
||||
sar -q 1 5
|
||||
```
|
||||
|
||||
Quick interpretation:
|
||||
|
||||
- high `%wa`: storage path or NFS issue
|
||||
- high run queue with low CPU idle: CPU contention
|
||||
- swap growth plus page scans: memory pressure
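
The interpretation rules above can be turned into a rough first-pass filter. This is a minimal sketch, assuming the classic procps `vmstat` column order (`r` in column 1, `si`/`so` in columns 7 and 8, `id` in 15, `wa` in 16). The function name `vmstat_hints` and all thresholds are hypothetical and need tuning per host. It reports swap traffic rather than page scans, because `vmstat` does not expose scan rates in its default view.

```bash
# Hypothetical helper: reads `vmstat 1 5` output on stdin and prints hints.
# Assumes the classic procps column order: r=1, si=7, so=8, id=15, wa=16.
# Thresholds (wa > 20, r > ncpu with id < 10, any swap traffic) are illustrative only.
vmstat_hints() {
  local ncpu="${1:-1}"
  awk -v ncpu="$ncpu" '
    $1 ~ /^[0-9]+$/ {                      # skip the two header lines
      if ($16 > 20)               print "io-wait: wa=" $16
      if ($1 > ncpu && $15 < 10)  print "cpu-contention: r=" $1 " id=" $15
      if ($7 > 0 || $8 > 0)       print "swap-activity: si=" $7 " so=" $8
    }'
}

# Usage: vmstat 1 5 | vmstat_hints "$(nproc)"
```

Treat the output as a pointer to the next check, not as a diagnosis.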
|
||||
|
||||
### Disk Usage
|
||||
|
||||
```bash
|
||||
df -hT
|
||||
du -xhd1 /var | sort -h
|
||||
find /var/log -type f -size +500M -ls | sort -k7,7n
|
||||
lsof +L1
|
||||
```
|
||||
|
||||
### Inode Exhaustion
|
||||
|
||||
```bash
|
||||
df -ih
|
||||
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
|
||||
find /tmp -xdev -type f | wc -l
|
||||
```
|
||||
|
||||
### Mounts
|
||||
|
||||
```bash
|
||||
mount | column -t
|
||||
findmnt
|
||||
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
|
||||
cat /etc/fstab
|
||||
mount -a # can expose bad fstab entries; use in change window
|
||||
```
|
||||
|
||||
### Permissions
|
||||
|
||||
```bash
|
||||
namei -l /path/to/file
|
||||
stat /path/to/file
|
||||
getfacl /path/to/file
|
||||
chmod 640 /path/to/file
|
||||
chown root:app /path/to/file
|
||||
```
|
||||
|
||||
### SELinux
|
||||
|
||||
State and mode:
|
||||
|
||||
```bash
|
||||
getenforce
|
||||
sestatus
|
||||
cat /etc/selinux/config
|
||||
```
|
||||
|
||||
Check file, process, and port context:
|
||||
|
||||
```bash
|
||||
ls -Zd /var/www/html
|
||||
ls -lZ /var/www/html/index.html
|
||||
ps -eZ | grep nginx
|
||||
id -Z
|
||||
semanage port -l | grep http
|
||||
```
|
||||
|
||||
Audit and denial review:
|
||||
|
||||
```bash
|
||||
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
|
||||
ausearch -m AVC -ts today | audit2why
|
||||
journalctl -t setroubleshoot --since '1 hour ago'
|
||||
sealert -a /var/log/audit/audit.log
|
||||
```
|
||||
|
||||
Typical flow:
|
||||
|
||||
1. Confirm SELinux mode is `Enforcing` or `Permissive`.
|
||||
2. Identify the failing path, process domain, and target context.
|
||||
3. Read AVC denials before changing labels or booleans.
|
||||
4. Prefer persistent policy-aligned fixes over `chcon`.
|
||||
5. Restore default labels and retest service path.
|
||||
|
||||
Modify and restore context:
|
||||
|
||||
```bash
|
||||
chcon -t httpd_sys_content_t /srv/app/index.html # temporary until relabel/restore
|
||||
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads # temporary until relabel/restore
|
||||
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
|
||||
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
|
||||
restorecon -Rv /srv/app
|
||||
matchpathcon /srv/app/uploads/file.txt
|
||||
```
|
||||
|
||||
Booleans and validation:
|
||||
|
||||
```bash
|
||||
getsebool -a | grep httpd
|
||||
getsebool httpd_can_network_connect
|
||||
setsebool -P httpd_can_network_connect on
|
||||
runcon -t httpd_t -- id -Z
|
||||
```
|
||||
|
||||
Notes:
|
||||
|
||||
- prefer `semanage fcontext` plus `restorecon` for persistent fixes
|
||||
- use `chcon` only as a short-lived diagnostic or emergency workaround
|
||||
- avoid generating local policy modules from `audit2allow` until root cause is understood
|
||||
- after context changes, validate service startup, AVC silence, and application path access
|
||||
|
||||
### Archives
|
||||
|
||||
```bash
|
||||
tar tf backup.tar | head
|
||||
tar czf logs-$(date +%F).tgz /var/log/app
|
||||
tar xzf bundle.tgz -C /restore/path
|
||||
gzip -t file.gz
|
||||
```
|
||||
|
||||
### File Operations
|
||||
|
||||
```bash
|
||||
cp -a source/ target/
|
||||
rsync -aHAXvn /src/ /dst/
|
||||
rsync -aHAX --delete --info=progress2 /src/ /dst/ # impact: verify source/destination twice
|
||||
mv file file.$(date +%F-%H%M%S).bak
|
||||
sha256sum file
|
||||
```
|
||||
|
||||
## Text Processing & Regex
|
||||
|
||||
### Core Tools
|
||||
|
||||
```bash
|
||||
grep -n 'ERROR' app.log
|
||||
grep -E 'ERROR|WARN' app.log
|
||||
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
|
||||
awk '{print $1,$4,$5}' access.log
|
||||
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
|
||||
sed -n '1,20p' file
|
||||
sed -E 's/[[:space:]]+/ /g' file
|
||||
cut -d: -f1,7 /etc/passwd
|
||||
sort file | uniq -c | sort -nr
|
||||
xargs -r -n1 systemctl status < service-list.txt
|
||||
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
|
||||
```
|
||||
|
||||
### Regex Reference
|
||||
|
||||
```text
|
||||
IPv4 \b(?:\d{1,3}\.){3}\d{1,3}\b
|
||||
ISO timestamp \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
|
||||
UUID \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
|
||||
Log level \b(?:ERROR|WARN|INFO)\b
|
||||
Failed SSH Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
|
||||
Ansible changed/fail ^(changed|fatal|failed):\s+\[[^]]+\]
|
||||
```
|
||||
|
||||
### Log Parsing Examples
|
||||
|
||||
IP extraction:
|
||||
|
||||
```bash
|
||||
grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
|
||||
```
|
||||
|
||||
Timestamp filter:
|
||||
|
||||
```bash
|
||||
grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
|
||||
```
|
||||
|
||||
UUID extraction:
|
||||
|
||||
```bash
|
||||
grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
|
||||
```
|
||||
|
||||
ERROR/WARN/INFO parsing:
|
||||
|
||||
```bash
|
||||
grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
|
||||
```
|
||||
|
||||
Failed SSH login parsing:
|
||||
|
||||
```bash
|
||||
# prints user and source IP; assumes lines end in "from <ip> port <n> ssh2"
grep 'Failed password' /var/log/secure \
  | awk '{print $(NF-5),$(NF-3)}' \
  | sort | uniq -c | sort -nr | head
|
||||
```
|
||||
|
||||
Extract fields from logs:
|
||||
|
||||
```bash
|
||||
awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
|
||||
```
|
||||
|
||||
Filter Ansible output:
|
||||
|
||||
```bash
|
||||
grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
|
||||
grep -E '^fatal:|^failed:' ansible.log
|
||||
```
|
||||
|
||||
## Incident Response
|
||||
|
||||
### Disk Full
|
||||
|
||||
Workflow:
|
||||
|
||||
```bash
|
||||
df -hT
|
||||
df -ih
|
||||
findmnt
|
||||
du -xhd1 /var | sort -h
|
||||
find /var -xdev -type f -size +1G -ls | sort -k7,7n
|
||||
lsof +L1
|
||||
journalctl --disk-usage
|
||||
```
|
||||
|
||||
Typical branches:
|
||||
|
||||
- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
|
||||
- inode full: remove file storms, spool buildup, temp-file leaks
|
||||
- deleted open files: restart offender only after sizing impact
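
To decide which branch applies, it can help to list only the filesystems that are over a threshold for blocks and for inodes. This is a minimal sketch, assuming POSIX-format `df -P` output (use% in column 5, mount point in column 6). The function name `fs_over_threshold` and the default limit of 90 are hypothetical.

```bash
# Hypothetical helper: reads `df -P` (or `df -Pi`) output on stdin and prints
# mount points whose use% column is at or above the given threshold.
# Assumes POSIX output format: use% in column 5, mount point in column 6
# (mount points containing spaces are not handled).
fs_over_threshold() {
  local limit="${1:-90}"
  awk -v limit="$limit" '
    NR > 1 {
      pct = $5
      sub(/%$/, "", pct)                   # strip trailing percent sign
      if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6, pct
    }'
}

# Usage:
#   df -P  | fs_over_threshold 90   # block usage
#   df -Pi | fs_over_threshold 90   # inode usage
```

If neither call reports the filesystem but writes still fail, check `lsof +L1` for deleted files that are still open.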
|
||||
|
||||
Post-check:
|
||||
|
||||
```bash
|
||||
df -hT
|
||||
df -ih
|
||||
systemctl --failed
|
||||
```
|
||||
|
||||
### High CPU
|
||||
|
||||
```bash
|
||||
uptime
|
||||
mpstat -P ALL 1 5
|
||||
pidstat -u -p ALL 1 5
|
||||
top -H -b -n 1 | head -40
|
||||
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
|
||||
```
|
||||
|
||||
Flow:
|
||||
|
||||
1. Confirm sustained load, not a short spike.
|
||||
2. Separate user CPU vs system CPU vs I/O wait.
|
||||
3. Identify hot process and hot threads.
|
||||
4. Correlate with deploys, cron, backups, or JVM GC.
|
||||
5. Throttle, stop, or fail over only with service impact understood.
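
Step 2 can be approximated without extra tooling by diffing two samples of the aggregate `cpu` line in `/proc/stat`. This is a minimal sketch, assuming the Linux field order `user nice system idle iowait irq softirq steal`. The function name `cpu_split` is hypothetical, and `system` here excludes irq and softirq time.

```bash
# Hypothetical helper: takes two aggregate "cpu ..." lines from /proc/stat
# (an earlier and a later sample) and prints whole-number percentages.
# Assumes Linux field order: user nice system idle iowait irq softirq steal.
cpu_split() {
  printf '%s\n%s\n' "$1" "$2" | awk '
    NR == 1 { for (i = 2; i <= 9; i++) a[i] = $i }
    NR == 2 {
      for (i = 2; i <= 9; i++) { d[i] = $i - a[i]; total += d[i] }
      if (total <= 0) exit 1               # samples identical or out of order
      printf "user=%d system=%d iowait=%d idle=%d\n",
        (d[2] + d[3]) * 100 / total, d[4] * 100 / total,
        d[6] * 100 / total, d[5] * 100 / total
    }'
}

# Usage:
#   s1="$(grep '^cpu ' /proc/stat)"; sleep 5; s2="$(grep '^cpu ' /proc/stat)"
#   cpu_split "$s1" "$s2"
```

`mpstat` reports the same split with more detail; the sketch is only a fallback for hosts without sysstat.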
|
||||
|
||||
### Memory Pressure
|
||||
|
||||
```bash
|
||||
free -m
|
||||
vmstat 1 5
|
||||
sar -r 1 5
|
||||
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
|
||||
dmesg -T | grep -Ei 'oom|killed process'
|
||||
```
|
||||
|
||||
Flow:
|
||||
|
||||
1. Check swap growth and page scan rates.
|
||||
2. Identify top RSS owners.
|
||||
3. Check kernel logs for OOM.
|
||||
4. Validate cache vs real process growth.
|
||||
5. Restart leaking service only after capturing evidence.
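
For step 4, `MemAvailable` in `/proc/meminfo` already accounts for reclaimable cache, so it is a better signal than the `free` column alone. This is a minimal sketch; the function name `mem_available_pct` is hypothetical, and it assumes a kernel that exposes `MemAvailable`.

```bash
# Hypothetical helper: reads /proc/meminfo-style text on stdin and prints
# MemAvailable as a whole-number percentage of MemTotal.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END {
      if (total > 0) printf "%d\n", (avail * 100) / total
      else exit 1                          # no MemTotal line found
    }'
}

# Usage: mem_available_pct < /proc/meminfo
```

A low percentage alongside growing RSS for one process points at real growth rather than cache.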
|
||||
|
||||
### Failed Service
|
||||
|
||||
```bash
|
||||
systemctl status <service> --no-pager -l
|
||||
journalctl -u <service> -b --no-pager | tail -100
|
||||
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
|
||||
```
|
||||
|
||||
Flow:
|
||||
|
||||
1. Validate config.
|
||||
2. Validate credentials, ports, mounts, permissions.
|
||||
3. Confirm dependency availability.
|
||||
4. Restart and recheck logs immediately.
|
||||
|
||||
### SELinux Denials
|
||||
|
||||
Typical case: service works in `Permissive`, fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct.
|
||||
|
||||
Triage:
|
||||
|
||||
```bash
|
||||
getenforce
|
||||
sestatus
|
||||
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
|
||||
ausearch -m AVC -ts recent | audit2why
|
||||
journalctl -t setroubleshoot --since '30 min ago'
|
||||
systemctl status <service> --no-pager -l
|
||||
ps -eZ | grep <service>
|
||||
ls -lZ /path/to/app /path/to/app/*
|
||||
```
|
||||
|
||||
Flow:
|
||||
|
||||
1. Confirm the failure is current and reproducible.
|
||||
2. Identify the denied process domain, target path, and requested access from AVC logs.
|
||||
3. Validate expected default context with `matchpathcon`.
|
||||
4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
|
||||
5. Apply the smallest persistent fix, then retest in `Enforcing`.
|
||||
|
||||
Common fixes:
|
||||
|
||||
```bash
|
||||
matchpathcon /srv/app/config.yml
|
||||
restorecon -Rv /srv/app
|
||||
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
|
||||
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
|
||||
semanage port -l | grep http
|
||||
getsebool -a | grep httpd
|
||||
setsebool -P httpd_can_network_connect on
|
||||
```
|
||||
|
||||
Validation:
|
||||
|
||||
```bash
|
||||
getenforce
|
||||
systemctl restart <service>
|
||||
systemctl status <service> --no-pager -l
|
||||
ausearch -m AVC -ts recent
|
||||
curl -fsS http://127.0.0.1:<port>/health
|
||||
```
|
||||
|
||||
Operational notes:
|
||||
|
||||
- do not leave systems in `Permissive` as the fix
|
||||
- prefer `restorecon` and `semanage fcontext` over repeated `chcon`
|
||||
- treat `audit2allow` output as investigation material, not automatic remediation
|
||||
- if policy changes are unavoidable, document exact AVC evidence and rollback path
|
||||
|
||||
### SSL Issues
|
||||
|
||||
```bash
|
||||
openssl s_client -connect host:443 -servername host -showcerts </dev/null
|
||||
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
|
||||
curl -vkI https://host/
|
||||
```
|
||||
|
||||
Check for:
|
||||
|
||||
- expired certificate
|
||||
- missing SAN
|
||||
- incomplete chain
|
||||
- hostname mismatch
|
||||
- TLS version or cipher mismatch
|
||||
|
||||
### DNS Issues
|
||||
|
||||
```bash
|
||||
dig +short app.example.com
|
||||
dig @<resolver> app.example.com
|
||||
dig +trace app.example.com
|
||||
getent hosts app.example.com
|
||||
resolvectl status
|
||||
```
|
||||
|
||||
Flow:
|
||||
|
||||
1. Compare resolver result with authoritative result.
|
||||
2. Check TTL and stale cache.
|
||||
3. Validate `/etc/resolv.conf`, local resolver, and search domains.
|
||||
4. Test from affected host and unaffected host.
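
Step 1 amounts to comparing two answer sets while ignoring record order. This is a minimal sketch; the function name `same_answers` is hypothetical, and `<authoritative-ns>` in the usage comment is a placeholder, not a real server.

```bash
# Hypothetical helper: compares two whitespace-separated answer sets
# (for example the output of two `dig +short` runs), ignoring order.
# Returns 0 when the sets match, 1 otherwise.
same_answers() {
  local a b
  # Intentionally unquoted so each answer becomes its own line before sorting.
  a="$(printf '%s\n' $1 | sort -u)"
  b="$(printf '%s\n' $2 | sort -u)"
  [ "$a" = "$b" ]
}

# Usage:
#   same_answers "$(dig +short app.example.com)" \
#                "$(dig +short @<authoritative-ns> app.example.com)" \
#     || echo "resolver and authoritative answers differ"
```

A mismatch is not always a fault: round-robin or geo-aware records can legitimately differ between queries.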
|
||||
|
||||
### Network Issues
|
||||
|
||||
```bash
|
||||
ip addr
|
||||
ip route
|
||||
ss -tulpen
|
||||
tcpdump -ni any host <peer> and port <port>
|
||||
curl -sv http://host:port/health
|
||||
mtr -rwzc 20 host
|
||||
```
|
||||
|
||||
Flow:
|
||||
|
||||
1. Interface/link state.
|
||||
2. Route and source IP selection.
|
||||
3. Listening socket on target.
|
||||
4. Firewall and security controls.
|
||||
5. Packet capture if app logs are inconclusive.
|
||||
|
||||
### JVM / Tomcat Issues
|
||||
|
||||
```bash
|
||||
ps -ef | grep -i tomcat
|
||||
jcmd <pid> VM.flags
|
||||
jstat -gcutil <pid> 1000 10
|
||||
jstack <pid> | head -100
|
||||
ss -ltnp | grep java
|
||||
tail -100 /opt/tomcat/logs/catalina.out
|
||||
```
|
||||
|
||||
Focus:
|
||||
|
||||
- stuck threads
|
||||
- full GC loops
|
||||
- heap exhaustion
|
||||
- connector bind failures
|
||||
- slow backend dependency
|
||||
|
||||
### Certificate Expiration
|
||||
|
||||
```bash
|
||||
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
|
||||
| openssl x509 -noout -enddate
|
||||
|
||||
openssl x509 -checkend 2592000 -noout -in cert.pem
|
||||
```
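
`-checkend` answers a yes/no question; for reporting it is often more useful to know how many days remain. This is a minimal sketch, assuming GNU `date` (the `-d` option) and the `notAfter=...` line format printed by `openssl x509 -enddate`. The function name `cert_days_left` is hypothetical.

```bash
# Hypothetical helper: converts an `openssl x509 -enddate` line into whole days
# remaining relative to a reference time (default: now). Assumes GNU date.
cert_days_left() {
  local enddate_line="$1" now="${2:-now}"
  local not_after="${enddate_line#notAfter=}"   # strip the "notAfter=" prefix
  local end_s now_s
  end_s="$(date -u -d "$not_after" +%s)" || return 1
  now_s="$(date -u -d "$now" +%s)" || return 1
  echo $(( (end_s - now_s) / 86400 ))
}

# Usage:
#   cert_days_left "$(openssl x509 -enddate -noout -in cert.pem)"
```

A negative result means the certificate has already expired.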
|
||||
|
||||
### Suspicious Login Attempts
|
||||
|
||||
```bash
|
||||
last -ai | head -30
|
||||
lastb -ai | head -30
|
||||
grep 'Failed password' /var/log/secure | tail -50
|
||||
grep 'Accepted ' /var/log/secure | tail -50
|
||||
ausearch -m USER_LOGIN -ts recent
|
||||
```
|
||||
|
||||
Workflow:
|
||||
|
||||
1. Identify source IPs and usernames.
|
||||
2. Validate whether attempts are expected from bastions/scanners.
|
||||
3. Check successful logins from same sources.
|
||||
4. Review sudo usage and persistence changes.
|
||||
5. Preserve logs before cleanup or rotation.
|
||||
|
||||
## Networking Operations
|
||||
|
||||
```bash
|
||||
ip -br addr
|
||||
ip route get 8.8.8.8
|
||||
ss -ltnp
|
||||
ss -tn state established '( sport = :443 or dport = :443 )'
|
||||
tcpdump -ni eth0 port 53
|
||||
dig +short mx example.com
|
||||
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
|
||||
mtr -rwzc 10 host
|
||||
traceroute -T -p 443 host
|
||||
openssl s_client -connect host:443 -servername host </dev/null
|
||||
```
|
||||
|
||||
## Storage Operations
|
||||
|
||||
### Block and Filesystem Discovery
|
||||
|
||||
```bash
|
||||
lsblk -f
|
||||
blkid
|
||||
findmnt
|
||||
cat /proc/partitions
|
||||
multipath -ll
|
||||
```
|
||||
|
||||
### LVM
|
||||
|
||||
```bash
|
||||
pvs
|
||||
vgs
|
||||
lvs -a -o +devices
|
||||
pvdisplay /dev/sdX
|
||||
vgdisplay <vg>
|
||||
lvdisplay /dev/<vg>/<lv>
|
||||
```
|
||||
|
||||
Growth example:
|
||||
|
||||
```bash
|
||||
pvcreate /dev/mapper/mpatha # impact: write metadata
|
||||
vgextend vgdata /dev/mapper/mpatha # impact: changes VG layout
|
||||
lvextend -L +100G -r /dev/vgdata/lvapp
|
||||
```
|
||||
|
||||
### XFS
|
||||
|
||||
```bash
|
||||
xfs_info /mountpoint
|
||||
xfs_repair -n /dev/mapper/vg-lv
|
||||
xfs_growfs /mountpoint
|
||||
```
|
||||
|
||||
### ext4
|
||||
|
||||
```bash
|
||||
tune2fs -l /dev/mapper/vg-lv | head -40
|
||||
e2fsck -fn /dev/mapper/vg-lv
|
||||
resize2fs /dev/mapper/vg-lv
|
||||
```
|
||||
|
||||
### Multipath
|
||||
|
||||
```bash
|
||||
multipath -ll
|
||||
lsblk -S
|
||||
udevadm info --query=all --name=/dev/mapper/mpatha | head -40
|
||||
```
|
||||
|
||||
### NFS
|
||||
|
||||
```bash
|
||||
showmount -e nfs-server
|
||||
nfsstat -m
|
||||
mount | grep nfs
|
||||
rpcinfo -p nfs-server
|
||||
```
|
||||
|
||||
### iSCSI
|
||||
|
||||
```bash
|
||||
iscsiadm -m session
|
||||
iscsiadm -m node
|
||||
iscsiadm -m discovery -t sendtargets -p <target-ip>
|
||||
```
|
||||
|
||||
### Mount Troubleshooting
|
||||
|
||||
```bash
|
||||
findmnt /mountpoint
|
||||
mount -v /mountpoint
|
||||
dmesg -T | tail -50
|
||||
journalctl -k -n 100 --no-pager
|
||||
```
|
||||
|
||||
Check:
|
||||
|
||||
- device path stable
|
||||
- UUID correct
|
||||
- filesystem type correct
|
||||
- multipath settled
|
||||
- network and RPC available for NFS
|
||||
|
||||
### Filesystem Validation
|
||||
|
||||
```bash
|
||||
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
|
||||
df -hT /data
|
||||
touch /data/.write-test && rm -f /data/.write-test
|
||||
```
|
||||
|
||||
### Migration Validation Example
|
||||
|
||||
```bash
|
||||
findmnt /data
|
||||
df -hT /data
|
||||
rsync -aHAXvn /olddata/ /data/
|
||||
rsync -aHAXc --delete --dry-run /olddata/ /data/
|
||||
sha256sum /olddata/keyfile /data/keyfile
|
||||
```
|
||||
|
||||
## AIX Operations
|
||||
|
||||
```bash
|
||||
oslevel -s
|
||||
errpt | head
|
||||
errpt -a | more
|
||||
topas
|
||||
lsvg -o
|
||||
lsvg rootvg
|
||||
lslpp -L | grep -i openssl
|
||||
svmon -G
|
||||
svmon -P <pid>
|
||||
netstat -rn
|
||||
```
|
||||
|
||||
## SSL/TLS Operations
|
||||
|
||||
### OpenSSL Checks
|
||||
|
||||
```bash
|
||||
openssl version -a
|
||||
openssl x509 -in cert.pem -noout -text | less
|
||||
openssl rsa -in key.pem -check -noout # -noout avoids printing the private key
|
||||
openssl verify -CAfile chain.pem cert.pem
|
||||
```
|
||||
|
||||
### Expiration Validation
|
||||
|
||||
```bash
|
||||
openssl x509 -enddate -noout -in cert.pem
|
||||
openssl x509 -checkend 604800 -noout -in cert.pem
|
||||
```
|
||||
|
||||
### keytool Basics
|
||||
|
||||
```bash
|
||||
keytool -list -v -keystore keystore.jks
|
||||
keytool -list -cacerts | grep -i <alias>
|
||||
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
|
||||
```
|
||||
|
||||
### Chain Validation
|
||||
|
||||
```bash
|
||||
openssl s_client -connect host:443 -servername host -showcerts </dev/null
|
||||
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem
|
||||
```
|
||||
|
||||
## Automation Operations
|
||||
|
||||
### Bash Safety Patterns
|
||||
|
||||
```bash
|
||||
set -euo pipefail
|
||||
IFS=$'\n\t'
|
||||
trap 'echo "line ${LINENO}: command failed" >&2' ERR
|
||||
trap 'rm -f "${tmpfile:-}"' EXIT
|
||||
```
|
||||
|
||||
Safe loop examples:
|
||||
|
||||
```bash
|
||||
while IFS= read -r host; do
  ssh -n "$host" uptime # -n keeps ssh from consuming the rest of hostlist.txt
done < hostlist.txt
|
||||
|
||||
find /var/log -type f -name '*.gz' -print0 \
  | while IFS= read -r -d '' file; do
      gzip -t "$file" # integrity test only applies to compressed files
    done
|
||||
```
|
||||
|
||||
Operational scripting patterns:
|
||||
|
||||
- default to read-only mode
|
||||
- require explicit `--execute` for changes
|
||||
- log actions with timestamps
|
||||
- validate dependencies with `command -v`
|
||||
- use temp files with `mktemp`
|
||||
- guard destructive paths and empty variables
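
The patterns above can be combined into one small skeleton. This is a minimal sketch, not a finished tool: every function name (`log`, `require`, `run`, `safe_clean_dir`, `main`) and the list of refused paths are hypothetical. It is meant to be paired with the `set -euo pipefail` header from the Bash safety patterns.

```bash
# Hypothetical skeleton: read-only by default, changes only with --execute.
EXECUTE=0

log() {
  # Timestamped log line on stderr (UTC).
  printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2
}

require() {
  # Fail early when a dependency is missing.
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { log "missing dependency: $cmd"; return 1; }
  done
}

run() {
  # Print the command in dry-run mode; run it only when EXECUTE=1.
  if [ "$EXECUTE" -eq 1 ]; then
    log "EXEC: $*"
    "$@"
  else
    echo "DRY-RUN: $*"
  fi
}

safe_clean_dir() {
  # Guard against empty variables and dangerous paths before deleting.
  local dir="${1:-}"
  case "$dir" in
    ""|"/"|"/etc"|"/var"|"/usr") log "refusing to clean '$dir'"; return 1 ;;
  esac
  run find "$dir" -xdev -type f -name '*.tmp' -delete
}

main() {
  if [ "${1:-}" = "--execute" ]; then EXECUTE=1; fi
  require find mktemp || return 1
  workdir="$(mktemp -d)"                 # scratch space via mktemp
  trap 'rm -rf "${workdir:-}"' EXIT
  safe_clean_dir "$workdir"
}

# main "$@"   # uncomment when used as a standalone script
```

Without `--execute` the script only prints what it would do, which makes the dry-run output easy to attach to a change ticket.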
|
||||
|
||||
## Ansible Operations
|
||||
|
||||
### Execution
|
||||
|
||||
```bash
|
||||
ansible-inventory -i inventory/hosts.yml --graph
|
||||
ansible-inventory -i inventory/hosts.yml --list | jq '.'
|
||||
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
|
||||
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
|
||||
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
|
||||
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
|
||||
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
|
||||
```
|
||||
|
||||
### Safe Rollout Workflow
|
||||
|
||||
1. Validate inventory and variable targeting.
|
||||
2. Run syntax-check.
|
||||
3. Run `--check --diff` on a single host.
|
||||
4. Execute against one host or one tier.
|
||||
5. Validate service health, logs, and config.
|
||||
6. Expand rollout only after post-check passes.
|
||||
|
||||
Rollback mindset:
|
||||
|
||||
- keep before/after config copies
|
||||
- know which tasks restart services
|
||||
- define manual backout if package/config changes fail
|
||||
- avoid broad `--limit` mistakes by reviewing resolved host list first
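
The rollout steps and the host-list review can be written down as a fixed command sequence before anything runs. This is a minimal sketch; the function name `rollout_plan` is hypothetical, and it only prints commands rather than executing them. The inventory path, playbook path, and canary host are whatever the caller passes in.

```bash
# Hypothetical helper: prints the staged command sequence for one canary host.
# It only echoes commands; nothing is executed against the inventory.
rollout_plan() {
  local inventory="$1" playbook="$2" canary="$3"
  echo "ansible-playbook -i $inventory $playbook --syntax-check"
  # --list-hosts shows the resolved host list without running any task.
  echo "ansible-playbook -i $inventory $playbook --limit $canary --list-hosts"
  echo "ansible-playbook -i $inventory $playbook --limit $canary --check --diff"
  echo "ansible-playbook -i $inventory $playbook --limit $canary"
}

# Usage: rollout_plan inventory/hosts.yml playbooks/site.yml web01
```

Reviewing the `--list-hosts` output before the real run is the cheapest guard against an overly broad `--limit` pattern.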
|
||||
|
||||
## Monitoring & Observability
|
||||
|
||||
### Zabbix Checks
|
||||
|
||||
```bash
|
||||
systemctl status zabbix-agent2 --no-pager
|
||||
zabbix_agent2 -t 'vfs.fs.size[/,free]' # quoted so the shell does not glob the brackets
|
||||
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
|
||||
```
|
||||
|
||||
### ELK Log Workflows
|
||||
|
||||
```bash
|
||||
grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
|
||||
journalctl -u filebeat -n 100 --no-pager
|
||||
curl -s 'http://localhost:9200/_cluster/health?pretty'
|
||||
```
|
||||
|
||||
### Grafana Checks
|
||||
|
||||
```bash
|
||||
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
|
||||
grep -i 'error' /var/log/grafana/grafana.log | tail -50
|
||||
```
|
||||
|
||||
### Health Endpoints and Alert Validation
|
||||
|
||||
```bash
|
||||
curl -fsS http://app:8080/health
|
||||
curl -fsS http://app:8080/metrics | head
|
||||
```
|
||||
|
||||
False positive validation:
|
||||
|
||||
1. Compare alert timestamp with deploy/change window.
|
||||
2. Confirm on-host evidence, not only dashboard data.
|
||||
3. Check collector lag, scrape failures, and stale metrics.
|
||||
4. Validate from a second source before escalating.
|
||||
|
||||
## Operational Habits
|
||||
|
||||
### Pre-checks
|
||||
|
||||
- capture time, hostname, and operator
|
||||
- capture current config and service state
|
||||
- check recent alerts, maintenance windows, and dependencies
|
||||
- confirm backup or rollback path exists
|
||||
|
||||
### Post-checks
|
||||
|
||||
- validate service state
|
||||
- validate logs for fresh errors
|
||||
- validate client path, ports, and name resolution
|
||||
- compare metrics before/after
|
||||
|
||||
### Rollback Thinking
|
||||
|
||||
- define exact backout trigger before change
|
||||
- prefer reversible steps
|
||||
- keep config backups with timestamps
|
||||
- avoid bundling unrelated changes
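
Timestamped backups are easy to make consistent with a small helper. This is a minimal sketch; the function name `backup_file` and the `.bak` naming scheme are hypothetical choices.

```bash
# Hypothetical helper: copy a config file to <file>.<UTC timestamp>.bak before
# editing, and print the backup path so it can be recorded in the change notes.
backup_file() {
  local src="$1" stamp dest
  [ -f "$src" ] || { echo "not a regular file: $src" >&2; return 1; }
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  dest="${src}.${stamp}.bak"
  cp -p -- "$src" "$dest" && echo "$dest"   # -p preserves mode and timestamps
}

# Usage: bak="$(backup_file /etc/nginx/nginx.conf)"
```

Rolling back is then a single `cp -p "$bak" <original>` followed by the usual config test and reload.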
|
||||
|
||||
### Change Validation
|
||||
|
||||
```bash
|
||||
systemctl is-active <service>
|
||||
curl -fsS http://127.0.0.1:<port>/health
|
||||
ss -ltnp | grep :<port>
|
||||
journalctl -u <service> -S '5 min ago' --no-pager
|
||||
```
|
||||
|
||||
### Operational Communication
|
||||
|
||||
- state scope, risk, and expected impact before action
|
||||
- record start and stop times in UTC
|
||||
- document what changed, what was checked, and remaining risk
|
||||
- escalate with evidence, not assumptions
|
||||
|
||||
### Evidence Collection During Incidents
|
||||
|
||||
```bash
|
||||
# Reuse one path: a glob in a redirect target does not resolve to the new directory.
incident_dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "$incident_dir"
journalctl -b > "$incident_dir/journal.txt"
ss -tulpen > "$incident_dir/sockets.txt"
df -hT > "$incident_dir/df.txt"
free -m > "$incident_dir/free.txt"
|
||||
```
|
||||
@@ -0,0 +1,144 @@
|
||||
# Lab Cheatsheet
|
||||
|
||||
Quick-reference notes for experiments, rebuilds, and short-lived troubleshooting. Expect rough edges. Capture what worked, what broke, and what should not be repeated in production.
|
||||
|
||||
## K3s Lab
|
||||
|
||||
```bash
|
||||
sudo systemctl status k3s --no-pager
|
||||
sudo journalctl -u k3s -n 100 --no-pager
|
||||
kubectl get nodes -o wide
|
||||
kubectl get pods -A
|
||||
kubectl get events -A --sort-by=.lastTimestamp | tail -30
|
||||
sudo k3s kubectl get pods -A
|
||||
```
|
||||
|
||||
Quick reset:
|
||||
|
||||
```bash
|
||||
sudo /usr/local/bin/k3s-uninstall.sh # destructive lab reset
|
||||
```
|
||||
|
||||
## Proxmox Lab
|
||||
|
||||
```bash
|
||||
pvesh get /nodes
|
||||
pvesh get /cluster/resources
|
||||
qm list
|
||||
qm config <vmid>
|
||||
pct list
|
||||
ha-manager status
|
||||
```
|
||||
|
||||
Checks before changes:
|
||||
|
||||
```bash
|
||||
zpool status
|
||||
pvesm status
|
||||
ip -br addr
|
||||
```
|
||||
|
||||
## GPU Passthrough
|
||||
|
||||
```bash
|
||||
lspci -nn | grep -Ei 'vga|3d|nvidia'
|
||||
nvidia-smi
|
||||
dmesg -T | grep -Ei 'vfio|iommu|nvidia'
|
||||
find /sys/kernel/iommu_groups/ -type l | sort
|
||||
```
|
||||
|
||||
Good sanity check:
|
||||
|
||||
```bash
|
||||
lsmod | grep -E 'vfio|kvm'
|
||||
```
|
||||
|
||||
## Terraform Experiments
|
||||
|
||||
```bash
|
||||
terraform fmt -recursive
|
||||
terraform init
|
||||
terraform validate
|
||||
terraform plan
|
||||
terraform state list
|
||||
```
|
||||
|
||||
Scratch workflow:
|
||||
|
||||
```bash
|
||||
terraform plan -out=tfplan
|
||||
terraform show tfplan
|
||||
```
|
||||
|
||||
## Networking Labs
|
||||
|
||||
```bash
|
||||
ip -br addr
|
||||
ip route
|
||||
bridge link
|
||||
ss -ltnp
|
||||
tcpdump -ni any port 53
|
||||
dig +short example.com
|
||||
mtr -rwzc 10 1.1.1.1
|
||||
```
|
||||
|
||||
## Ansible Testing
|
||||
|
||||
```bash
|
||||
ansible-inventory -i inventory/hosts.yml --graph
|
||||
ansible-playbook -i inventory/hosts.yml playbook.yml --syntax-check
|
||||
ansible-playbook -i inventory/hosts.yml playbook.yml --check --diff
|
||||
ansible all -i inventory/hosts.yml -m ping
|
||||
```
|
||||
|
||||
## Docker Testing
|
||||
|
||||
```bash
|
||||
docker ps -a
|
||||
docker logs --tail 100 <container>
|
||||
docker exec -it <container> sh
|
||||
docker inspect <container> | jq '.[0].NetworkSettings'
|
||||
docker system df
|
||||
```
|
||||
|
||||
## Useful Temporary Commands
|
||||
|
||||
```bash
|
||||
watch -n2 'kubectl get pods -A'
|
||||
watch -n2 'nvidia-smi'
|
||||
watch -n2 'ip -br addr'
|
||||
while true; do date -u; curl -fsS http://127.0.0.1:8080/health; sleep 2; done
|
||||
```
|
||||
|
||||
## Quick PoC Commands
|
||||
|
||||
```bash
|
||||
python3 -m http.server 8080
|
||||
openssl req -x509 -newkey rsa:2048 -nodes -days 3 -keyout key.pem -out cert.pem
|
||||
curl -vk https://127.0.0.1:8443/
|
||||
nc -lvkp 9000
|
||||
```
|
||||
|
||||
## Troubleshooting Notes
|
||||
|
||||
- If K3s pods fail after host reboot, check time sync before chasing cert or API errors.
|
||||
- If PVCs stay pending in lab clusters, inspect the default storage class first.
|
||||
- If Docker networking looks broken, compare bridge subnet overlaps with the host route table.
|
||||
- If GPU pods see no devices, validate driver, toolkit, and device plugin in that order.
|
||||
|
||||
## Useful One-liners
|
||||
|
||||
```bash
|
||||
kubectl get pods -A -o wide | grep -E 'CrashLoopBackOff|Error|Pending'
|
||||
journalctl -p err -S today
|
||||
find /var/log -type f -mtime -1 -ls | sort -k7,7n
|
||||
ps -eo pid,%cpu,%mem,cmd --sort=-%cpu | head
|
||||
grep -RniE 'error|failed|timeout' .
|
||||
```
|
||||
|
||||
## Things Worth Remembering
|
||||
|
||||
- Pre-checks still matter in labs. Capture state before trying the risky thing.
|
||||
- Keep a copy of working configs before rapid iteration.
|
||||
- Short-lived labs still produce useful evidence; save command output when a fix works.
|
||||
- If a PoC needs repeated manual repair, turn the repair steps into a script or note.
|
||||
@@ -0,0 +1,368 @@
|
||||
# Platform Engineering Cheatsheet
|
||||
|
||||
Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.
|
||||
|
||||
## Kubernetes / K3s
|
||||
|
||||
### Contexts, Namespaces, and Basic Workflows
|
||||
|
||||
```bash
|
||||
kubectl config get-contexts
|
||||
kubectl config use-context <context>
|
||||
kubectl get ns
|
||||
kubectl -n <ns> get pods -o wide
|
||||
kubectl -n <ns> get deploy,sts,ds,svc,ingress
|
||||
kubectl get nodes -o wide
|
||||
```
|
||||
|
||||
### Describe, Logs, Exec, Events
|
||||
|
||||
```bash
|
||||
kubectl -n <ns> describe pod <pod>
|
||||
kubectl -n <ns> logs <pod> --tail=100
|
||||
kubectl -n <ns> logs <pod> -c <container> --previous
|
||||
kubectl -n <ns> exec -it <pod> -- sh
|
||||
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
|
||||
```
|
||||
|
||||
### Rollout Troubleshooting
|
||||
|
||||
```bash
|
||||
kubectl -n <ns> rollout status deploy/<name>
|
||||
kubectl -n <ns> rollout history deploy/<name>
|
||||
kubectl -n <ns> rollout undo deploy/<name>
|
||||
kubectl -n <ns> get rs -l app=<name>
|
||||
```
|
||||
|
||||
Safe pattern:
|
||||
|
||||
1. `kubectl diff -f <manifest>`
|
||||
2. apply to non-prod or canary namespace
|
||||
3. watch rollout and events
|
||||
4. validate service and logs
|
||||
5. expand scope only after post-check
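
The safe pattern above can be sketched as a single function. It relies on the documented `kubectl diff` exit codes (0 for no difference, 1 for a difference, greater than 1 for an error). `staged_apply`, its argument order, and the 120-second timeout are assumptions for illustration.

```bash
#!/usr/bin/env bash
# staged_apply <namespace> <manifest> <deployment>
staged_apply() {
  local ns="$1" manifest="$2" deploy="$3" rc=0
  # kubectl diff: 0 = no changes, 1 = changes found, >1 = kubectl or diff error.
  kubectl -n "$ns" diff -f "$manifest" || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "no changes for $manifest"
    return 0
  elif [ "$rc" -gt 1 ]; then
    echo "diff failed (exit $rc); not applying" >&2
    return "$rc"
  fi
  kubectl -n "$ns" apply -f "$manifest" || return
  # Wait for the rollout; on failure, show recent events for the post-check.
  if ! kubectl -n "$ns" rollout status "deploy/$deploy" --timeout=120s; then
    kubectl -n "$ns" get events --sort-by=.lastTimestamp | tail -20
    return 1
  fi
}

# Example (not executed here): run against a canary namespace first.
#   staged_apply canary app.yaml web && staged_apply prod app.yaml web
```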
|
||||
|
||||
### Node Validation
|
||||
|
||||
```bash
|
||||
kubectl get nodes
|
||||
kubectl describe node <node>
|
||||
kubectl top nodes
|
||||
kubectl top pods -A --sort-by=cpu
|
||||
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
|
||||
```
|
||||
|
||||
### Pending / CrashLoopBackOff Flow
|
||||
|
||||
Pending:
|
||||
|
||||
```bash
|
||||
kubectl -n <ns> describe pod <pod>
|
||||
kubectl get events -A --sort-by=.lastTimestamp | tail -50
|
||||
```
|
||||
|
||||
Check for:
|
||||
|
||||
- unsatisfied CPU/memory requests
|
||||
- missing PVC
|
||||
- taints/tolerations mismatch
|
||||
- image pull secret issues
|
||||
- node selectors or affinity mismatch
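
To triage many Pending pods at once, the event messages can be bucketed into the categories listed above. The sketch below matches on fragments of common scheduler and kubelet messages; exact wording varies between Kubernetes versions, so treat the patterns and the `classify_pending` name as assumptions.

```bash
#!/usr/bin/env bash
# classify_pending <event message>
# Prints a coarse category for a scheduling or image-pull event message.
classify_pending() {
  case "$1" in
    *Insufficient*)                        echo "resources" ;;
    *PersistentVolumeClaim*)               echo "pvc" ;;
    *taint*)                               echo "taints" ;;
    *"node affinity"*|*"node selector"*)   echo "affinity" ;;
    *ErrImagePull*|*ImagePullBackOff*)     echo "image-pull" ;;
    *)                                     echo "other" ;;
  esac
}

# Example (not executed here): count categories for one namespace.
#   kubectl -n <ns> get events --field-selector reason=FailedScheduling \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' |
#     while IFS= read -r msg; do classify_pending "$msg"; done | sort | uniq -c
```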
|
||||
|
||||
CrashLoopBackOff:
|
||||
|
||||
```bash
|
||||
kubectl -n <ns> logs <pod> --previous
|
||||
kubectl -n <ns> describe pod <pod>
|
||||
kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
|
||||
```
|
||||
|
||||
Check for:
|
||||
|
||||
- bad config or missing env vars
|
||||
- probe failures
|
||||
- dependency timeouts
|
||||
- permission or filesystem errors
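
The `lastState` output includes an exit code, which narrows the list above quickly. The following sketch maps conventional exit codes to a first guess; the wording of each hint is an assumption, and application-specific codes will not fit this table.

```bash
#!/usr/bin/env bash
# explain_exit_code <code>
# Prints a first-guess interpretation of a container exit code.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited cleanly; check restart policy and probes" ;;
    1)   echo "application error; read the previous container logs" ;;
    126) echo "command found but not executable; check permissions" ;;
    127) echo "command not found; check image entrypoint and PATH" ;;
    137) echo "SIGKILL (128+9); commonly OOMKilled, check memory limits" ;;
    143) echo "SIGTERM (128+15); container was asked to stop" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $(( code - 128 ))"
      else
        echo "application-specific exit code"
      fi ;;
  esac
}

# Example (not executed here):
#   code="$(kubectl -n <ns> get pod <pod> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
#   explain_exit_code "$code"
```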
|
||||
|
||||
## Helm
|
||||
|
||||
```bash
|
||||
helm repo list
|
||||
helm repo update
|
||||
helm list -A
|
||||
helm -n <ns> get values <release> -a
|
||||
helm -n <ns> get manifest <release>
|
||||
helm upgrade --install <release> <chart> -n <ns> -f values.yaml
|
||||
helm rollback -n <ns> <release> <revision>
|
||||
helm template <release> <chart> -f values.yaml | less
|
||||
```
|
||||
|
||||
Validation:
|
||||
|
||||
```bash
|
||||
helm lint <chart>
|
||||
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20
|
||||
```
|
||||
|
||||
## Docker / Podman
|
||||
|
||||
```bash
|
||||
docker images
|
||||
docker ps -a
|
||||
docker logs --tail 100 <container>
|
||||
docker exec -it <container> sh
|
||||
docker inspect <container>
|
||||
docker volume ls
|
||||
docker network ls
|
||||
docker system df
|
||||
docker image prune -f # cleanup: review first
|
||||
docker container prune -f # cleanup: review first
|
||||
podman ps -a
|
||||
podman inspect <container>
|
||||
```
|
||||
|
||||
Container validation:
|
||||
|
||||
```bash
|
||||
docker exec <container> env | sort
|
||||
docker exec <container> ss -ltnp
|
||||
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
|
||||
```
|
||||
|
||||
## Terraform
|
||||
|
||||
### Core Commands
|
||||
|
||||
```bash
|
||||
terraform fmt -check -recursive
|
||||
terraform init
|
||||
terraform validate
|
||||
terraform plan -out=tfplan
|
||||
terraform apply tfplan
|
||||
terraform destroy -target=<resource> # impact: targeted destruction needs review
|
||||
terraform state list
|
||||
terraform state show <resource>
|
||||
terraform import <resource> <id>
|
||||
```
|
||||
|
||||
### Safe Workflow
|
||||
|
||||
1. `terraform fmt -check -recursive`
|
||||
2. `terraform validate`
|
||||
3. refresh provider auth and backend access
|
||||
4. review `plan` output for replacements and destroys
|
||||
5. save plan artifact
|
||||
6. apply reviewed plan only
|
||||
7. validate resource state outside Terraform
|
||||
|
||||
Plan review focus:
|
||||
|
||||
- unexpected replacement
|
||||
- drift on security groups, routes, storage, or instance identity
|
||||
- provider alias mistakes
|
||||
- wrong workspace or backend
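
A simple gate for the plan review is to read the summary line of the plan output and stop when anything would be destroyed. This is a sketch that parses human-readable output from `terraform plan -no-color`; `plan_destroy_count` and `guard_plan` are hypothetical names. A replacement is counted in the summary as both an add and a destroy, so it also trips this gate.

```bash
#!/usr/bin/env bash
# Reads `terraform plan -no-color` output on stdin and prints the destroy count.
plan_destroy_count() {
  # Summary line looks like: "Plan: 1 to add, 0 to change, 2 to destroy."
  sed -n 's/^Plan: .* \([0-9][0-9]*\) to destroy\.$/\1/p' | tail -n 1
}

# Reads plan output on stdin; exits non-zero when the plan destroys anything.
guard_plan() {
  local n
  n="$(plan_destroy_count)"
  if [ -z "$n" ]; then
    echo "no plan summary found (no changes, or the plan failed)"
    return 0
  fi
  if [ "$n" -gt 0 ]; then
    echo "STOP: plan destroys $n resource(s); review before apply" >&2
    return 1
  fi
  echo "ok: plan destroys nothing"
}

# Example (not executed here):
#   terraform workspace show
#   terraform plan -no-color -out=tfplan | tee plan.txt
#   guard_plan < plan.txt && terraform apply tfplan
```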
|
||||
|
||||
## CI/CD Operations
|
||||
|
||||
### GitLab CI
|
||||
|
||||
```bash
|
||||
gitlab-runner verify
|
||||
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
|
||||
curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects/<id>/pipelines
|
||||
```
|
||||
|
||||
### Jenkins
|
||||
|
||||
```bash
|
||||
systemctl status jenkins --no-pager
|
||||
journalctl -u jenkins -n 100 --no-pager
|
||||
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
|
||||
```
|
||||
|
||||
### Runners, Artifacts, Pipeline Failures
|
||||
|
||||
```bash
|
||||
docker logs --tail 100 gitlab-runner
|
||||
kubectl -n ci get pods
|
||||
kubectl -n ci logs deploy/runner-controller --tail=100
|
||||
```
|
||||
|
||||
Troubleshooting flow:
|
||||
|
||||
1. validate YAML or Jenkinsfile syntax
|
||||
2. confirm runner/agent availability
|
||||
3. inspect job logs for auth, cache, DNS, or registry failures
|
||||
4. verify artifacts were uploaded and not expired
|
||||
5. correlate with platform outages, image changes, or secret rotation
|
||||
|
||||
YAML validation:
|
||||
|
||||
```bash
|
||||
yamllint .
|
||||
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
|
||||
```
|
||||
|
||||
## Observability
|
||||
|
||||
### Prometheus
|
||||
|
||||
```bash
|
||||
curl -s http://prometheus:9090/-/ready
|
||||
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
|
||||
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
|
||||
```
|
||||
|
||||
### Loki
|
||||
|
||||
```bash
|
||||
curl -s http://loki:3100/ready
|
||||
curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"' --data-urlencode 'limit=50'
|
||||
```
|
||||
|
||||
### Grafana
|
||||
|
||||
```bash
|
||||
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
|
||||
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
|
||||
```
|
||||
|
||||
### Metrics Validation and Log Correlation
|
||||
|
||||
```bash
|
||||
kubectl -n <ns> port-forward svc/<svc> 9090:9090   # blocks; run the curl from a second terminal
|
||||
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
|
||||
```
|
||||
|
||||
Correlation flow:
|
||||
|
||||
1. confirm alert time and impacted objects
|
||||
2. inspect deployment events in same window
|
||||
3. compare Prometheus series, Loki logs, and app logs
|
||||
4. rule out scrape lag or stale dashboards
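
Steps 1 to 3 depend on querying every source over the same time window. The sketch below computes a symmetric window around an alert timestamp and prints it as nanosecond Unix epochs, which the Loki `query_range` endpoint accepts for `start` and `end`. `loki_window` and the 10-minute default are assumptions for illustration.

```bash
#!/usr/bin/env bash
# loki_window <alert_epoch_seconds> [minutes]
# Prints "<start_ns> <end_ns>" for a window of +/- minutes around the alert.
loki_window() {
  local alert_epoch="$1" minutes="${2:-10}"
  local half=$(( minutes * 60 ))
  # Appending nine zeros converts whole seconds to nanoseconds.
  printf '%d000000000 %d000000000\n' \
    $(( alert_epoch - half )) $(( alert_epoch + half ))
}

# Example (not executed here):
#   read -r start end < <(loki_window "$(date -u -d '2024-01-01 12:00:00' +%s)" 5)
#   curl -Gs http://loki:3100/loki/api/v1/query_range \
#     --data-urlencode 'query={app="nginx"} |= "error"' \
#     --data-urlencode "start=$start" --data-urlencode "end=$end"
```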
|
||||
|
||||
## GPU / AI Infrastructure
|
||||
|
||||
### GPU Discovery and CUDA Validation
|
||||
|
||||
```bash
|
||||
nvidia-smi
|
||||
nvidia-smi -L
|
||||
nvidia-smi topo -m
|
||||
nvidia-smi dmon -s pucm
|
||||
nvcc --version
|
||||
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
|
||||
```
|
||||
|
||||
### MIG Basics
|
||||
|
||||
```bash
|
||||
nvidia-smi -i 0 -q | grep -i mig -A4
|
||||
nvidia-smi mig -lgip
|
||||
nvidia-smi mig -lgi
|
||||
```
|
||||
|
||||
### GPU Operator and DCGM
|
||||
|
||||
```bash
|
||||
kubectl get pods -A | grep -E 'nvidia|gpu'
|
||||
kubectl -n gpu-operator describe pod <pod>
|
||||
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
|
||||
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
|
||||
```
|
||||
|
||||
### Container GPU Validation
|
||||
|
||||
```bash
|
||||
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
|
||||
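# Newer kubectl releases no longer accept --limits on `kubectl run`; set the GPU limit through --overrides
|
||||
kubectl run gpu-check --rm -it --restart=Never \
|
||||
--image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
|
||||
--override-type=strategic \
|
||||
--overrides='{"spec":{"containers":[{"name":"gpu-check","resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}' \
|
||||
-- nvidia-smi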
|
||||
```
|
||||
|
||||
### Kubernetes GPU Troubleshooting
|
||||
|
||||
Check for:
|
||||
|
||||
- device plugin not running
|
||||
- driver/container toolkit mismatch
|
||||
- node missing `nvidia.com/gpu` allocatable resources
|
||||
- MIG profile mismatch
|
||||
- taints or tolerations blocking placement
|
||||
|
||||
Useful checks:
|
||||
|
||||
```bash
|
||||
kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
|
||||
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
|
||||
kubectl -n <ns> describe pod <gpu-pod>
|
||||
```
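
To check the "node missing `nvidia.com/gpu` allocatable resources" case across the whole cluster, the allocatable column can be listed for every node and filtered. This is a sketch; `nodes_without_gpu` is a hypothetical helper that only parses the two-column output shown in the example.

```bash
#!/usr/bin/env bash
# Reads "NODE GPU" custom-columns output on stdin and prints nodes that
# advertise no allocatable GPUs ("<none>" or 0).
nodes_without_gpu() {
  awk 'NR > 1 && ($2 == "<none>" || $2 == "0") { print $1 }'
}

# Example (not executed here):
#   kubectl get nodes \
#     -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
#     nodes_without_gpu
```

Nodes that are not expected to carry GPUs will also appear in the output, so compare the result with the intended GPU node pool.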
|
||||
|
||||
## Platform Troubleshooting Flows
|
||||
|
||||
### Pod Not Starting
|
||||
|
||||
```bash
|
||||
kubectl -n <ns> get pod <pod> -o wide
|
||||
kubectl -n <ns> describe pod <pod>
|
||||
kubectl -n <ns> logs <pod> --previous
|
||||
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
|
||||
```
|
||||
|
||||
### Image Pull Errors
|
||||
|
||||
```bash
|
||||
kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
|
||||
crictl images | grep <image>
|
||||
ctr -n k8s.io images ls | grep <image>
|
||||
```
|
||||
|
||||
Check:
|
||||
|
||||
- image tag exists
|
||||
- registry reachable
|
||||
- pull secret valid
|
||||
- node clock sane for token-based auth
|
||||
|
||||
### Failing Deployment
|
||||
|
||||
```bash
|
||||
kubectl -n <ns> rollout status deploy/<name>
|
||||
kubectl -n <ns> describe deploy/<name>
|
||||
kubectl -n <ns> get rs,pods -l app=<name> -o wide
|
||||
```
|
||||
|
||||
### Node Not Ready
|
||||
|
||||
```bash
|
||||
kubectl describe node <node>
|
||||
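journalctl -u k3s -n 100 --no-pager   # K3s server; the unit is k3s-agent on agent nodes
|
||||
systemctl status kubelet --no-pager   # non-K3s nodes; K3s runs the kubelet inside the k3s process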
|
||||
df -h
|
||||
free -m
|
||||
```
|
||||
|
||||
Check:
|
||||
|
||||
- kubelet or k3s service state
|
||||
- disk pressure
|
||||
- cert expiry
|
||||
- CNI failure
|
||||
- API reachability
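
For the disk pressure check, reading `df -h` by eye works on one node but not on many. The sketch below filters POSIX-format `df -P` output for filesystems at or above a usage threshold. `over_threshold` and the default of 85 percent are assumptions and do not reflect the kubelet's own eviction thresholds.

```bash
#!/usr/bin/env bash
# over_threshold [percent]
# Reads `df -P` output on stdin; prints "<mount> <use%>" for full filesystems.
over_threshold() {
  local limit="${1:-85}"
  # df -P columns: filesystem, blocks, used, available, capacity%, mount point.
  awk -v limit="$limit" 'NR > 1 {
    use = $5
    sub(/%/, "", use)
    if (use + 0 >= limit) print $6, $5
  }'
}

# Example (not executed here):
#   df -P | over_threshold 85
```

Mount points that contain spaces are not handled by this sketch.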
|
||||
|
||||
### Storage Provisioning Issues
|
||||
|
||||
```bash
|
||||
kubectl get pvc,pv -A
|
||||
kubectl -n <ns> describe pvc <pvc>
|
||||
kubectl get sc
|
||||
kubectl -n kube-system logs deploy/<csi-controller> --tail=100
|
||||
```
|
||||
|
||||
Check:
|
||||
|
||||
- storage class defaulting
|
||||
- access mode mismatch
|
||||
- CSI controller errors
|
||||
- backend quota or LUN exhaustion
|
||||
- node attachment failures
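
The storage class defaulting check can be scripted from the `kubectl get sc` output, which marks a default class with `(default)` after its name. This is a sketch; `count_default_sc` and `check_default_sc` are hypothetical helpers, and the handling of multiple defaults depends on the Kubernetes version.

```bash
#!/usr/bin/env bash
# Reads `kubectl get sc` output on stdin and prints how many classes are default.
count_default_sc() {
  # grep -c prints 0 but exits 1 when nothing matches; keep the exit status clean.
  grep -c '(default)' || true
}

# Reads `kubectl get sc` output on stdin and prints a one-line verdict.
check_default_sc() {
  local n
  n="$(count_default_sc)"
  case "$n" in
    0) echo "no default StorageClass: PVCs without storageClassName typically stay Pending" ;;
    1) echo "ok: exactly one default StorageClass" ;;
    *) echo "warning: $n default StorageClasses; defaulting is ambiguous" ;;
  esac
}

# Example (not executed here):
#   kubectl get sc | check_default_sc
```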
|
||||