@@ -4,6 +4,12 @@

### Added

- Cross-repository operational documentation structure:
  - `infra-run/docs/operations-cheatsheet.md`
  - `platform-projects/docs/platform-cheatsheet.md`
  - `labs/docs/lab-cheatsheet.md`
- Production-oriented Linux/Unix operations reference with incident workflows, storage and networking checks, SSL/TLS notes, AIX commands, automation safety patterns, Ansible operational usage, and observability quick-reference.
- SELinux operational coverage for mode checks, context inspection, AVC audit review, persistent relabel workflow, booleans, and SELinux-specific incident response.
- Selected baseline Ansible hardening automation:
  - RHEL 9 role and playbook.
  - Debian 13 / Ubuntu 26.04 role and playbook.
@@ -13,6 +19,7 @@

### Changed

- Updated repository and `infra-run` README files to surface the new documentation structure and operational cheatsheets.
- Updated repository, `infra-run`, and Ansible README files to describe the new hardening automation instead of placeholder-only Ansible structure.

### Notes
@@ -17,6 +17,20 @@ It is a technical portfolio, not a production toolkit. The examples are meant to

The `labs` and `platform-projects` trees are intentionally thin. They are kept as planning areas for future lab notes and case studies, not as completed projects. Current planned topics are tracked in [ROADMAP.md](./ROADMAP.md).

## Documentation

### Production Operations

- [infra-run/docs/operations-cheatsheet.md](./infra-run/docs/operations-cheatsheet.md) - production-focused Linux/Unix operations reference for incident handling, validation, storage, networking, Ansible, observability, and safety-first change execution.

### Platform Engineering

- [platform-projects/docs/platform-cheatsheet.md](./platform-projects/docs/platform-cheatsheet.md) - platform operations reference for Kubernetes, Helm, containers, Terraform, CI/CD, observability, and GPU-backed infrastructure troubleshooting.

### Labs & Experiments

- [labs/docs/lab-cheatsheet.md](./labs/docs/lab-cheatsheet.md) - quick-reference scratchpad for K3s, Proxmox, Terraform, Docker, networking, and short-lived lab troubleshooting work.

## What This Repo Is Not

- It is not a compliance benchmark implementation.
@@ -13,6 +13,10 @@ The goal is to show operational judgment, not to ship a universal automation pro
- [ansible](./ansible/) - selected baseline hardening examples for RHEL-like Linux, Debian/Ubuntu, and AIX.
- [examples](./examples/) - sanitized sample command outputs and incident notes.

## Documentation

- [docs/operations-cheatsheet.md](./docs/operations-cheatsheet.md) - production operations quick reference covering Linux/Unix triage, text processing, incident workflows, networking, storage, AIX, SSL/TLS, automation safety, Ansible execution, observability, and operational habits.

## What This Is

- A portfolio project for Linux and infrastructure operations roles.
@@ -0,0 +1,857 @@
|
|||||||
|
# Production Operations Cheatsheet
|
||||||
|
|
||||||
|
Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.
|
||||||
|
|
||||||
|
## Linux / Unix Daily Operations
|
||||||
|
|
||||||
|
### Uptime and Host State
|
||||||
|
|
||||||
|
Check host age, kernel, clock, and recent reboot history before touching anything:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
uptime
|
||||||
|
uname -r
|
||||||
|
hostnamectl
|
||||||
|
timedatectl
|
||||||
|
who -b
|
||||||
|
last -x | head -20
|
||||||
|
```
|
||||||
|
|
||||||
|
Pre-check pattern:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
date -u
|
||||||
|
uptime
|
||||||
|
df -h
|
||||||
|
free -m
|
||||||
|
systemctl --failed
|
||||||
|
```
|
||||||
|
|
||||||
|
### Process Management
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ps -ef | head
|
||||||
|
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
|
||||||
|
pgrep -a java
|
||||||
|
pstree -ap | less
|
||||||
|
pidof sshd
|
||||||
|
renice +5 -p <pid>
|
||||||
|
kill -TERM <pid>
|
||||||
|
kill -9 <pid> # DANGEROUS: last resort only
|
||||||
|
```
|
||||||
|
|
||||||
|
Validation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ps -p <pid> -o pid,stat,etime,cmd
|
||||||
|
journalctl -u <service> -n 50 --no-pager
|
||||||
|
```
|
||||||
|
|
||||||
|
### systemctl
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl status <service> --no-pager -l
|
||||||
|
systemctl is-active <service>
|
||||||
|
systemctl is-enabled <service>
|
||||||
|
systemctl list-units --type=service --state=running
|
||||||
|
systemctl list-units --failed
|
||||||
|
systemctl daemon-reload
|
||||||
|
systemctl restart <service> # impact: interrupts the service; confirm the interruption policy first
|
||||||
|
```
|
||||||
|
|
||||||
|
### journalctl
|
||||||
|
|
||||||
|
```bash
|
||||||
|
journalctl -u <service> -n 100 --no-pager
|
||||||
|
journalctl -u <service> --since '30 min ago'
|
||||||
|
journalctl -p err -S today
|
||||||
|
journalctl -k -b
|
||||||
|
journalctl --disk-usage
|
||||||
|
```
|
||||||
|
|
||||||
|
### Service Troubleshooting Flow
|
||||||
|
|
||||||
|
1. Confirm service state and recent restart count.
|
||||||
|
2. Read the last 100-200 journal lines.
|
||||||
|
3. Validate config syntax before restart if the daemon supports it.
|
||||||
|
4. Check dependent ports, mounts, credentials, and name resolution.
|
||||||
|
5. Restart only after cause is understood or rollback exists.
|
||||||
|
|
||||||
|
Example:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl status nginx --no-pager -l
|
||||||
|
journalctl -u nginx -n 100 --no-pager
|
||||||
|
nginx -t
|
||||||
|
ss -ltnp | grep ':80\|:443'
|
||||||
|
curl -kI https://127.0.0.1/
|
||||||
|
```
|
||||||
|
|
||||||
|
### CPU and Memory Diagnostics
|
||||||
|
|
||||||
|
```bash
|
||||||
|
uptime
|
||||||
|
top -H -b -n 1 | head -40
|
||||||
|
pidstat 1 5
|
||||||
|
pidstat -ru -p ALL 1 3
|
||||||
|
vmstat 1 5
|
||||||
|
iostat -xz 1 5
|
||||||
|
free -m
|
||||||
|
sar -q 1 5
|
||||||
|
```
|
||||||
|
|
||||||
|
Quick interpretation:
|
||||||
|
|
||||||
|
- high `%wa`: storage path or NFS issue
|
||||||
|
- high run queue with low CPU idle: CPU contention
|
||||||
|
- swap growth plus page scans: memory pressure
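
The rules of thumb above can be turned into a rough first-pass filter over `vmstat` output. This is a minimal sketch, not a tuned detector: the function name `classify_vmstat` and the thresholds (20% I/O wait, 10% idle) are illustrative assumptions, and it assumes the default procps column layout (`r` in column 1, `si`/`so` in 7/8, `id` in 15, `wa` in 16).

```bash
# classify_vmstat NCPU: read `vmstat 1 5` output on stdin, print one hint per sample.
# Assumes the default procps layout: r=$1, si=$7, so=$8, id=$15, wa=$16.
# Thresholds are illustrative assumptions, not tuned values.
classify_vmstat() {
  awk -v ncpu="${1:-1}" '
    $1 !~ /^[0-9]+$/ { next }                                    # skip the two header lines
    {
      hint = ""
      if ($16 >= 20)              hint = hint " iowait"          # storage path or NFS
      if ($1 > ncpu && $15 <= 10) hint = hint " cpu-contention"  # run queue > CPUs, little idle
      if ($7 + $8 > 0)            hint = hint " swapping"        # memory pressure
      if (hint == "")             hint = " ok"
      print "r=" $1, "id=" $15, "wa=" $16 ":" hint
    }'
}

# Usage (on a live host): vmstat 1 5 | classify_vmstat "$(nproc)"
```

The first data line from `vmstat` reports averages since boot, so read the later samples when judging current state.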
|
||||||
|
|
||||||
|
### Disk Usage
|
||||||
|
|
||||||
|
```bash
|
||||||
|
df -hT
|
||||||
|
du -xhd1 /var | sort -h
|
||||||
|
find /var/log -type f -size +500M -ls | sort -k7,7n
|
||||||
|
lsof +L1
|
||||||
|
```
|
||||||
|
|
||||||
|
### Inode Exhaustion
|
||||||
|
|
||||||
|
```bash
|
||||||
|
df -ih
|
||||||
|
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
|
||||||
|
find /tmp -xdev -type f | wc -l
|
||||||
|
```
|
||||||
|
|
||||||
|
### Mounts
|
||||||
|
|
||||||
|
```bash
|
||||||
|
mount | column -t
|
||||||
|
findmnt
|
||||||
|
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
|
||||||
|
cat /etc/fstab
|
||||||
|
mount -a # can expose bad fstab entries; use in change window
|
||||||
|
```
|
||||||
|
|
||||||
|
### Permissions
|
||||||
|
|
||||||
|
```bash
|
||||||
|
namei -l /path/to/file
|
||||||
|
stat /path/to/file
|
||||||
|
getfacl /path/to/file
|
||||||
|
chmod 640 /path/to/file
|
||||||
|
chown root:app /path/to/file
|
||||||
|
```
|
||||||
|
|
||||||
|
### SELinux
|
||||||
|
|
||||||
|
State and mode:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
getenforce
|
||||||
|
sestatus
|
||||||
|
cat /etc/selinux/config
|
||||||
|
```
|
||||||
|
|
||||||
|
Check file, process, and port context:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ls -Zd /var/www/html
|
||||||
|
ls -lZ /var/www/html/index.html
|
||||||
|
ps -eZ | grep nginx
|
||||||
|
id -Z
|
||||||
|
semanage port -l | grep http
|
||||||
|
```
|
||||||
|
|
||||||
|
Audit and denial review:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
|
||||||
|
ausearch -m AVC -ts today | audit2why
|
||||||
|
journalctl -t setroubleshoot --since '1 hour ago'
|
||||||
|
sealert -a /var/log/audit/audit.log
|
||||||
|
```
|
||||||
|
|
||||||
|
Typical flow:
|
||||||
|
|
||||||
|
1. Confirm SELinux mode is `Enforcing` or `Permissive`.
|
||||||
|
2. Identify the failing path, process domain, and target context.
|
||||||
|
3. Read AVC denials before changing labels or booleans.
|
||||||
|
4. Prefer persistent policy-aligned fixes over `chcon`.
|
||||||
|
5. Restore default labels and retest service path.
|
||||||
|
|
||||||
|
Modify and restore context:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
chcon -t httpd_sys_content_t /srv/app/index.html # temporary until relabel/restore
|
||||||
|
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads # temporary until relabel/restore
|
||||||
|
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
|
||||||
|
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
|
||||||
|
restorecon -Rv /srv/app
|
||||||
|
matchpathcon /srv/app/uploads/file.txt
|
||||||
|
```
|
||||||
|
|
||||||
|
Booleans and validation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
getsebool -a | grep httpd
|
||||||
|
getsebool httpd_can_network_connect
|
||||||
|
setsebool -P httpd_can_network_connect on
|
||||||
|
runcon -t httpd_t -- id -Z
|
||||||
|
```
|
||||||
|
|
||||||
|
Notes:
|
||||||
|
|
||||||
|
- prefer `semanage fcontext` plus `restorecon` for persistent fixes
|
||||||
|
- use `chcon` only as a short-lived diagnostic or emergency workaround
|
||||||
|
- avoid generating local policy modules from `audit2allow` until root cause is understood
|
||||||
|
- after context changes, validate service startup, AVC silence, and application path access
|
||||||
|
|
||||||
|
### Archives
|
||||||
|
|
||||||
|
```bash
|
||||||
|
tar tf backup.tar | head
|
||||||
|
tar czf logs-$(date +%F).tgz /var/log/app
|
||||||
|
tar xzf bundle.tgz -C /restore/path
|
||||||
|
gzip -t file.gz
|
||||||
|
```
|
||||||
|
|
||||||
|
### File Operations
|
||||||
|
|
||||||
|
```bash
|
||||||
|
cp -a source/ target/
|
||||||
|
rsync -aHAXvn /src/ /dst/
|
||||||
|
rsync -aHAX --delete --info=progress2 /src/ /dst/ # impact: verify source/destination twice
|
||||||
|
mv file file.$(date +%F-%H%M%S).bak
|
||||||
|
sha256sum file
|
||||||
|
```
|
||||||
|
|
||||||
|
## Text Processing & Regex
|
||||||
|
|
||||||
|
### Core Tools
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -n 'ERROR' app.log
|
||||||
|
grep -E 'ERROR|WARN' app.log
|
||||||
|
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
|
||||||
|
awk '{print $1,$4,$5}' access.log
|
||||||
|
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
|
||||||
|
sed -n '1,20p' file
|
||||||
|
sed -E 's/[[:space:]]+/ /g' file
|
||||||
|
cut -d: -f1,7 /etc/passwd
|
||||||
|
sort file | uniq -c | sort -nr
|
||||||
|
xargs -r -n1 systemctl status < service-list.txt
|
||||||
|
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
|
||||||
|
```
|
||||||
|
|
||||||
|
### Regex Reference
|
||||||
|
|
||||||
|
```text
|
||||||
|
IPv4 \b(?:\d{1,3}\.){3}\d{1,3}\b
|
||||||
|
ISO timestamp \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
|
||||||
|
UUID \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
|
||||||
|
Log level \b(?:ERROR|WARN|INFO)\b
|
||||||
|
Failed SSH Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
|
||||||
|
Ansible changed/fail ^(changed|fatal|failed):\s+\[[^]]+\]
|
||||||
|
```
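
The reference patterns above use PCRE syntax (`\d`, `(?:...)`), which `grep -P` understands but `sed -E` and `awk` do not. As a hedged sketch, the "Failed SSH" pattern can be translated to POSIX ERE as follows. The function name `failed_ssh_pairs` is hypothetical, and the pattern assumes OpenSSH's usual `Failed password for ... from ... port ...` message format.

```bash
# failed_ssh_pairs: read sshd log lines on stdin, print "user ip" for each failed password.
# POSIX ERE translation of the "Failed SSH" pattern: [0-9] replaces \d, and the
# non-capturing group (?:...) becomes a plain group, so the captures are \2 and \3.
failed_ssh_pairs() {
  sed -nE 's/.*Failed password for (invalid user )?([^ ]+) from (([0-9]{1,3}\.){3}[0-9]{1,3}).*/\2 \3/p'
}

# Usage: failed_ssh_pairs < /var/log/secure | sort | uniq -c | sort -nr | head
```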
|
||||||
|
|
||||||
|
### Log Parsing Examples
|
||||||
|
|
||||||
|
IP extraction:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
|
||||||
|
```
|
||||||
|
|
||||||
|
Timestamp filter:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
|
||||||
|
```
|
||||||
|
|
||||||
|
UUID extraction:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
|
||||||
|
```
|
||||||
|
|
||||||
|
ERROR/WARN/INFO parsing:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
|
||||||
|
```
|
||||||
|
|
||||||
|
Failed SSH login parsing:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep 'Failed password' /var/log/secure \
|
||||||
|
| awk '{print $(NF-5),$(NF-3)}' \
|
||||||
|
| sort | uniq -c | sort -nr | head
|
||||||
|
```
|
||||||
|
|
||||||
|
Extract fields from logs:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
|
||||||
|
```
|
||||||
|
|
||||||
|
Filter Ansible output:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
|
||||||
|
grep -E '^fatal:|^failed:' ansible.log
|
||||||
|
```
|
||||||
|
|
||||||
|
## Incident Response
|
||||||
|
|
||||||
|
### Disk Full
|
||||||
|
|
||||||
|
Workflow:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
df -hT
|
||||||
|
df -ih
|
||||||
|
findmnt
|
||||||
|
du -xhd1 /var | sort -h
|
||||||
|
find /var -xdev -type f -size +1G -ls | sort -k7,7n
|
||||||
|
lsof +L1
|
||||||
|
journalctl --disk-usage
|
||||||
|
```
|
||||||
|
|
||||||
|
Typical branches:
|
||||||
|
|
||||||
|
- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
|
||||||
|
- inode full: remove file storms, spool buildup, temp-file leaks
|
||||||
|
- deleted open files: restart offender only after sizing impact
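
For the deleted-open-files branch, "sizing impact" means knowing how much space each process would release on restart. A minimal sketch that totals `lsof +L1` output per process follows. The function name `deleted_open_by_pid` is hypothetical, and it assumes lsof's default ten-column layout (`SIZE/OFF` in column 7, `NODE` in column 9).

```bash
# deleted_open_by_pid: read `lsof +L1` output on stdin, print reclaimable MiB per process.
# Assumes the default 10-column layout (SIZE/OFF in $7, NODE in $9); regular files only.
deleted_open_by_pid() {
  awk '
    NR == 1 || $5 != "REG" { next }     # skip the header and non-regular files
    seen[$2, $9]++ { next }             # same inode open on several FDs: count once
    { bytes[$2 " " $1] += $7 }          # key: "PID COMMAND"
    END {
      for (p in bytes) printf "%d MiB %s\n", bytes[p] / 1048576, p
    }' | sort -nr
}

# Usage (as root): lsof +L1 | deleted_open_by_pid
```

The largest holder sorts first, which is usually the only process worth restarting.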
|
||||||
|
|
||||||
|
Post-check:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
df -hT
|
||||||
|
df -ih
|
||||||
|
systemctl --failed
|
||||||
|
```
|
||||||
|
|
||||||
|
### High CPU
|
||||||
|
|
||||||
|
```bash
|
||||||
|
uptime
|
||||||
|
mpstat -P ALL 1 5
|
||||||
|
pidstat -u -p ALL 1 5
|
||||||
|
top -H -b -n 1 | head -40
|
||||||
|
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Confirm sustained load, not a short spike.
|
||||||
|
2. Separate user CPU vs system CPU vs I/O wait.
|
||||||
|
3. Identify hot process and hot threads.
|
||||||
|
4. Correlate with deploys, cron, backups, or JVM GC.
|
||||||
|
5. Throttle, stop, or fail over only with service impact understood.
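
Step 1 can be approximated by comparing the 1-minute and 5-minute load averages against the CPU count. This is a sketch under stated assumptions: the function name `load_trend` and the cutoff of 1.0 load per CPU are illustrative, not a standard.

```bash
# load_trend NCPU "L1 L5 L15": classify load averages given in the format of the
# first three fields of /proc/loadavg. The per-CPU cutoff of 1.0 is an assumption.
load_trend() {
  ncpu=$1
  # shellcheck disable=SC2086  # intentional word splitting of the three averages
  set -- $2
  awk -v n="$ncpu" -v l1="$1" -v l5="$2" 'BEGIN {
    if (l1 / n < 1)        print "normal"     # 1-minute load below one task per CPU
    else if (l5 / n >= 1)  print "sustained"  # still elevated over five minutes
    else                   print "spike"      # only the 1-minute figure is elevated
  }'
}

# Usage: load_trend "$(nproc)" "$(cut -d' ' -f1-3 /proc/loadavg)"
```

On Linux the load average also counts tasks in uninterruptible sleep, so a "sustained" result still needs step 2 to tell CPU saturation from I/O wait.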
|
||||||
|
|
||||||
|
### Memory Pressure
|
||||||
|
|
||||||
|
```bash
|
||||||
|
free -m
|
||||||
|
vmstat 1 5
|
||||||
|
sar -r 1 5
|
||||||
|
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
|
||||||
|
dmesg -T | grep -Ei 'oom|killed process'
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Check swap growth and page scan rates.
|
||||||
|
2. Identify top RSS owners.
|
||||||
|
3. Check kernel logs for OOM.
|
||||||
|
4. Validate cache vs real process growth.
|
||||||
|
5. Restart leaking service only after capturing evidence.
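
For step 4, `MemAvailable` in `/proc/meminfo` is the kernel's estimate of memory usable without swapping, with reclaimable cache already discounted. A small sketch that reduces it to a percentage follows; the function name `mem_available_pct` is hypothetical.

```bash
# mem_available_pct: read /proc/meminfo-format text on stdin and print MemAvailable
# as a whole percentage of MemTotal. A low value points at real process growth
# rather than page cache, because reclaimable cache is already counted as available.
mem_available_pct() {
  awk '
    $1 == "MemTotal:"     { total = $2 }
    $1 == "MemAvailable:" { avail = $2 }
    END { if (total > 0) printf "%d\n", avail * 100 / total }'
}

# Usage: mem_available_pct < /proc/meminfo
```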
|
||||||
|
|
||||||
|
### Failed Service
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl status <service> --no-pager -l
|
||||||
|
journalctl -u <service> -b --no-pager | tail -100
|
||||||
|
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Validate config.
|
||||||
|
2. Validate credentials, ports, mounts, permissions.
|
||||||
|
3. Confirm dependency availability.
|
||||||
|
4. Restart and recheck logs immediately.
|
||||||
|
|
||||||
|
### SELinux Denials
|
||||||
|
|
||||||
|
Typical case: service works in `Permissive`, fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct.
|
||||||
|
|
||||||
|
Triage:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
getenforce
|
||||||
|
sestatus
|
||||||
|
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
|
||||||
|
ausearch -m AVC -ts recent | audit2why
|
||||||
|
journalctl -t setroubleshoot --since '30 min ago'
|
||||||
|
systemctl status <service> --no-pager -l
|
||||||
|
ps -eZ | grep <service>
|
||||||
|
ls -lZ /path/to/app /path/to/app/*
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Confirm the failure is current and reproducible.
|
||||||
|
2. Identify the denied process domain, target path, and requested access from AVC logs.
|
||||||
|
3. Validate expected default context with `matchpathcon`.
|
||||||
|
4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
|
||||||
|
5. Apply the smallest persistent fix, then retest in `Enforcing`.
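
Step 2 is easier when repeated denials are collapsed into unique tuples. The following is a rough sketch that assumes raw AVC records in the usual audit format (`avc: denied { ... } for ... comm="..." ... scontext=... tcontext=... tclass=...`); the function name `avc_summary` is hypothetical.

```bash
# avc_summary: read raw AVC records on stdin (for example from `ausearch -m AVC`),
# print "count comm {perms} source_context target_context class" per unique denial.
avc_summary() {
  sed -nE 's/.*avc: +denied +\{ ([^}]*) \} for .*comm="([^"]*)".*scontext=([^ ]+) tcontext=([^ ]+) tclass=([^ ]+).*/\2 {\1} \3 \4 \5/p' |
    sort | uniq -c | sort -nr
}

# Usage: ausearch -m AVC -ts recent | avc_summary
```

A target type such as `default_t` or `var_t` in the output usually indicates a mislabeled path rather than a missing policy rule.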
|
||||||
|
|
||||||
|
Common fixes:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
matchpathcon /srv/app/config.yml
|
||||||
|
restorecon -Rv /srv/app
|
||||||
|
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
|
||||||
|
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
|
||||||
|
semanage port -l | grep http
|
||||||
|
getsebool -a | grep httpd
|
||||||
|
setsebool -P httpd_can_network_connect on
|
||||||
|
```
|
||||||
|
|
||||||
|
Validation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
getenforce
|
||||||
|
systemctl restart <service>
|
||||||
|
systemctl status <service> --no-pager -l
|
||||||
|
ausearch -m AVC -ts recent
|
||||||
|
curl -fsS http://127.0.0.1:<port>/health
|
||||||
|
```
|
||||||
|
|
||||||
|
Operational notes:
|
||||||
|
|
||||||
|
- do not leave systems in `Permissive` as the fix
|
||||||
|
- prefer `restorecon` and `semanage fcontext` over repeated `chcon`
|
||||||
|
- treat `audit2allow` output as investigation material, not automatic remediation
|
||||||
|
- if policy changes are unavoidable, document exact AVC evidence and rollback path
|
||||||
|
|
||||||
|
### SSL Issues
|
||||||
|
|
||||||
|
```bash
|
||||||
|
openssl s_client -connect host:443 -servername host -showcerts </dev/null
|
||||||
|
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
|
||||||
|
curl -vkI https://host/
|
||||||
|
```
|
||||||
|
|
||||||
|
Check for:
|
||||||
|
|
||||||
|
- expired certificate
|
||||||
|
- missing SAN
|
||||||
|
- incomplete chain
|
||||||
|
- hostname mismatch
|
||||||
|
- TLS version or cipher mismatch
|
||||||
|
|
||||||
|
### DNS Issues
|
||||||
|
|
||||||
|
```bash
|
||||||
|
dig +short app.example.com
|
||||||
|
dig @<resolver> app.example.com
|
||||||
|
dig +trace app.example.com
|
||||||
|
getent hosts app.example.com
|
||||||
|
resolvectl status
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Compare resolver result with authoritative result.
|
||||||
|
2. Check TTL and stale cache.
|
||||||
|
3. Validate `/etc/resolv.conf`, local resolver, and search domains.
|
||||||
|
4. Test from affected host and unaffected host.
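
Step 1 comes down to comparing two answer sets regardless of record order. A minimal sketch, with a hypothetical helper name `same_answers`, is shown here; it only compares text and leaves the `dig` calls to the caller.

```bash
# same_answers A B: succeed when two newline-separated answer sets contain the
# same records, ignoring order and duplicates.
same_answers() {
  [ "$(printf '%s\n' "$1" | sort -u)" = "$(printf '%s\n' "$2" | sort -u)" ]
}

# Usage: same_answers "$(dig +short app.example.com @<resolver>)" \
#                     "$(dig +short app.example.com @<authoritative-ns>)" \
#          || echo "resolver and authoritative answers differ"
```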
|
||||||
|
|
||||||
|
### Network Issues
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ip addr
|
||||||
|
ip route
|
||||||
|
ss -tulpen
|
||||||
|
tcpdump -ni any host <peer> and port <port>
|
||||||
|
curl -sv http://host:port/health
|
||||||
|
mtr -rwzc 20 host
|
||||||
|
```
|
||||||
|
|
||||||
|
Flow:
|
||||||
|
|
||||||
|
1. Interface/link state.
|
||||||
|
2. Route and source IP selection.
|
||||||
|
3. Listening socket on target.
|
||||||
|
4. Firewall and security controls.
|
||||||
|
5. Packet capture if app logs are inconclusive.
|
||||||
|
|
||||||
|
### JVM / Tomcat Issues
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ps -ef | grep -i tomcat
|
||||||
|
jcmd <pid> VM.flags
|
||||||
|
jstat -gcutil <pid> 1000 10
|
||||||
|
jstack <pid> | head -100
|
||||||
|
ss -ltnp | grep java
|
||||||
|
tail -100 /opt/tomcat/logs/catalina.out
|
||||||
|
```
|
||||||
|
|
||||||
|
Focus:
|
||||||
|
|
||||||
|
- stuck threads
|
||||||
|
- full GC loops
|
||||||
|
- heap exhaustion
|
||||||
|
- connector bind failures
|
||||||
|
- slow backend dependency
|
||||||
|
|
||||||
|
### Certificate Expiration
|
||||||
|
|
||||||
|
```bash
|
||||||
|
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
|
||||||
|
| openssl x509 -noout -enddate
|
||||||
|
|
||||||
|
openssl x509 -checkend 2592000 -noout -in cert.pem # 2592000 s = 30 days; exit status 1 means it expires within that window
|
||||||
|
```
|
||||||
|
|
||||||
|
### Suspicious Login Attempts
|
||||||
|
|
||||||
|
```bash
|
||||||
|
last -ai | head -30
|
||||||
|
lastb -ai | head -30
|
||||||
|
grep 'Failed password' /var/log/secure | tail -50
|
||||||
|
grep 'Accepted ' /var/log/secure | tail -50
|
||||||
|
ausearch -m USER_LOGIN -ts recent
|
||||||
|
```
|
||||||
|
|
||||||
|
Workflow:
|
||||||
|
|
||||||
|
1. Identify source IPs and usernames.
|
||||||
|
2. Validate whether attempts are expected from bastions/scanners.
|
||||||
|
3. Check successful logins from same sources.
|
||||||
|
4. Review sudo usage and persistence changes.
|
||||||
|
5. Preserve logs before cleanup or rotation.
|
||||||
|
|
||||||
|
## Networking Operations
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ip -br addr
|
||||||
|
ip route get 8.8.8.8
|
||||||
|
ss -ltnp
|
||||||
|
ss -tn state established '( sport = :443 or dport = :443 )'
|
||||||
|
tcpdump -ni eth0 port 53
|
||||||
|
dig +short mx example.com
|
||||||
|
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
|
||||||
|
mtr -rwzc 10 host
|
||||||
|
traceroute -T -p 443 host
|
||||||
|
openssl s_client -connect host:443 -servername host </dev/null
|
||||||
|
```
|
||||||
|
|
||||||
|
## Storage Operations
|
||||||
|
|
||||||
|
### Block and Filesystem Discovery
|
||||||
|
|
||||||
|
```bash
|
||||||
|
lsblk -f
|
||||||
|
blkid
|
||||||
|
findmnt
|
||||||
|
cat /proc/partitions
|
||||||
|
multipath -ll
|
||||||
|
```
|
||||||
|
|
||||||
|
### LVM
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pvs
|
||||||
|
vgs
|
||||||
|
lvs -a -o +devices
|
||||||
|
pvdisplay /dev/sdX
|
||||||
|
vgdisplay <vg>
|
||||||
|
lvdisplay /dev/<vg>/<lv>
|
||||||
|
```
|
||||||
|
|
||||||
|
Growth example:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pvcreate /dev/mapper/mpatha # impact: writes LVM metadata to the device
|
||||||
|
vgextend vgdata /dev/mapper/mpatha # impact: changes VG layout
|
||||||
|
lvextend -L +100G -r /dev/vgdata/lvapp
|
||||||
|
```
|
||||||
|
|
||||||
|
### XFS
|
||||||
|
|
||||||
|
```bash
|
||||||
|
xfs_info /mountpoint
|
||||||
|
xfs_repair -n /dev/mapper/vg-lv
|
||||||
|
xfs_growfs /mountpoint
|
||||||
|
```
|
||||||
|
|
||||||
|
### ext4
|
||||||
|
|
||||||
|
```bash
|
||||||
|
tune2fs -l /dev/mapper/vg-lv | head -40
|
||||||
|
e2fsck -fn /dev/mapper/vg-lv
|
||||||
|
resize2fs /dev/mapper/vg-lv
|
||||||
|
```
|
||||||
|
|
||||||
|
### Multipath
|
||||||
|
|
||||||
|
```bash
|
||||||
|
multipath -ll
|
||||||
|
lsblk -S
|
||||||
|
udevadm info --query=all --name=/dev/mapper/mpatha | head -40
|
||||||
|
```
|
||||||
|
|
||||||
|
### NFS
|
||||||
|
|
||||||
|
```bash
|
||||||
|
showmount -e nfs-server
|
||||||
|
nfsstat -m
|
||||||
|
mount | grep nfs
|
||||||
|
rpcinfo -p nfs-server
|
||||||
|
```
|
||||||
|
|
||||||
|
### iSCSI
|
||||||
|
|
||||||
|
```bash
|
||||||
|
iscsiadm -m session
|
||||||
|
iscsiadm -m node
|
||||||
|
iscsiadm -m discovery -t sendtargets -p <target-ip>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Mount Troubleshooting
|
||||||
|
|
||||||
|
```bash
|
||||||
|
findmnt /mountpoint
|
||||||
|
mount -v /mountpoint
|
||||||
|
dmesg -T | tail -50
|
||||||
|
journalctl -k -n 100 --no-pager
|
||||||
|
```
|
||||||
|
|
||||||
|
Check:
|
||||||
|
|
||||||
|
- device path stable
|
||||||
|
- UUID correct
|
||||||
|
- filesystem type correct
|
||||||
|
- multipath settled
|
||||||
|
- network and RPC available for NFS
|
||||||
|
|
||||||
|
### Filesystem Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
|
||||||
|
df -hT /data
|
||||||
|
touch /data/.write-test && rm -f /data/.write-test
|
||||||
|
```
|
||||||
|
|
||||||
|
### Migration Validation Example
|
||||||
|
|
||||||
|
```bash
|
||||||
|
findmnt /data
|
||||||
|
df -hT /data
|
||||||
|
rsync -aHAXvn /olddata/ /data/
|
||||||
|
rsync -aHAXc --delete --dry-run /olddata/ /data/
|
||||||
|
sha256sum /olddata/keyfile /data/keyfile
|
||||||
|
```
|
||||||
|
|
||||||
|
## AIX Operations
|
||||||
|
|
||||||
|
```bash
|
||||||
|
oslevel -s
|
||||||
|
errpt | head
|
||||||
|
errpt -a | more
|
||||||
|
topas
|
||||||
|
lsvg -o
|
||||||
|
lsvg rootvg
|
||||||
|
lslpp -L | grep -i openssl
|
||||||
|
svmon -G
|
||||||
|
svmon -P <pid>
|
||||||
|
netstat -rn
|
||||||
|
```
|
||||||
|
|
||||||
|
## SSL/TLS Operations
|
||||||
|
|
||||||
|
### OpenSSL Checks
|
||||||
|
|
||||||
|
```bash
|
||||||
|
openssl version -a
|
||||||
|
openssl x509 -in cert.pem -noout -text | less
|
||||||
|
openssl rsa -in key.pem -check -noout # -noout keeps the private key off the terminal
|
||||||
|
openssl verify -CAfile chain.pem cert.pem
|
||||||
|
```
|
||||||
|
|
||||||
|
### Expiration Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
openssl x509 -enddate -noout -in cert.pem
|
||||||
|
openssl x509 -checkend 604800 -noout -in cert.pem # 604800 s = 7 days; exit status 1 means it expires within that window
|
||||||
|
```
|
||||||
|
|
||||||
|
### keytool Basics
|
||||||
|
|
||||||
|
```bash
|
||||||
|
keytool -list -v -keystore keystore.jks
|
||||||
|
keytool -list -cacerts | grep -i <alias>
|
||||||
|
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
|
||||||
|
```
|
||||||
|
|
||||||
|
### Chain Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
openssl s_client -connect host:443 -servername host -showcerts </dev/null
|
||||||
|
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem
|
||||||
|
```
|
||||||
|
|
||||||
|
## Automation Operations
|
||||||
|
|
||||||
|
### Bash Safety Patterns
|
||||||
|
|
||||||
|
```bash
|
||||||
|
set -euo pipefail
|
||||||
|
IFS=$'\n\t'
|
||||||
|
trap 'echo "line ${LINENO}: command failed" >&2' ERR
|
||||||
|
trap 'rm -f "${tmpfile:-}"' EXIT
|
||||||
|
```
|
||||||
|
|
||||||
|
Safe loop examples:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
while IFS= read -r host; do
|
||||||
|
  ssh -n "$host" uptime # -n stops ssh from consuming the rest of hostlist.txt on stdin
|
||||||
|
done < hostlist.txt
|
||||||
|
|
||||||
|
find /var/log -type f -name '*.gz' -print0 \
|
||||||
|
| while IFS= read -r -d '' file; do
|
||||||
|
gzip -t "$file"
|
||||||
|
done
|
||||||
|
```
|
||||||
|
|
||||||
|
Operational scripting patterns:
|
||||||
|
|
||||||
|
- default to read-only mode
|
||||||
|
- require explicit `--execute` for changes
|
||||||
|
- log actions with timestamps
|
||||||
|
- validate dependencies with `command -v`
|
||||||
|
- use temp files with `mktemp`
|
||||||
|
- guard destructive paths and empty variables
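
The patterns above can be combined into one script body. This is a sketch, not a finished tool: every name in it (`EXECUTE`, `TARGET_DIR`, `log`, `need`, `run`, `main`) is illustrative, the refused-path list is an arbitrary example, and it assumes the `set -euo pipefail` header from the safety patterns above sits at the top of the script.

```bash
# Body of a change script; assumes the `set -euo pipefail` header shown earlier.
# All names here (EXECUTE, TARGET_DIR, log, need, run, main) are illustrative.
EXECUTE=0

log()  { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2; }  # timestamped, on stderr
need() { command -v "$1" >/dev/null 2>&1 || { log "missing dependency: $1"; return 1; }; }

# run CMD...: log the command; execute it only when --execute was given.
run() {
  if [ "$EXECUTE" -eq 1 ]; then
    log "RUN $*"
    "$@"
  else
    log "DRY-RUN $*"
  fi
}

main() {
  if [ "${1:-}" = "--execute" ]; then EXECUTE=1; fi
  need find || return 1
  : "${TARGET_DIR:?TARGET_DIR must be set and non-empty}"   # guard empty variables
  case "$TARGET_DIR" in
    /|/etc|/usr|/var) log "refusing destructive path: $TARGET_DIR"; return 1 ;;
  esac
  tmpfile=$(mktemp)
  trap 'rm -f "${tmpfile:-}"' EXIT
  find "$TARGET_DIR" -xdev -type f -name '*.tmp' -mtime +7 > "$tmpfile"
  while IFS= read -r f; do
    run rm -f -- "$f"
  done < "$tmpfile"
}
```

End the script with `main "$@"`. Run without arguments, it only logs what would be removed; `--execute` performs the removal.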
|
||||||
|
|
||||||
|
## Ansible Operations
|
||||||
|
|
||||||
|
### Execution
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-inventory -i inventory/hosts.yml --graph
|
||||||
|
ansible-inventory -i inventory/hosts.yml --list | jq '.'
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Safe Rollout Workflow
|
||||||
|
|
||||||
|
1. Validate inventory and variable targeting.
|
||||||
|
2. Run syntax-check.
|
||||||
|
3. Run `--check --diff` on a single host.
|
||||||
|
4. Execute against one host or one tier.
|
||||||
|
5. Validate service health, logs, and config.
|
||||||
|
6. Expand rollout only after post-check passes.
|
||||||
|
|
||||||
|
Rollback mindset:
|
||||||
|
|
||||||
|
- keep before/after config copies
|
||||||
|
- know which tasks restart services
|
||||||
|
- define manual backout if package/config changes fail
|
||||||
|
- avoid broad `--limit` mistakes by reviewing resolved host list first
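
The read-only stages of the rollout workflow can be chained so that a failure stops the sequence before anything changes. A hedged sketch follows: the function name `staged_rollout` and the `ANSIBLE_PLAYBOOK` override variable are assumptions made for illustration, while `--syntax-check`, `--list-hosts`, `--limit`, `--check`, and `--diff` are standard `ansible-playbook` options.

```bash
# staged_rollout INVENTORY PLAYBOOK CANARY_HOST
# Runs the read-only stages in order and stops at the first failure. The real run
# against the canary is printed, not executed, so it stays an explicit operator step.
# ANSIBLE_PLAYBOOK is an illustrative override hook (a wrapper, or `echo` for a dry listing).
staged_rollout() {
  inv=$1 play=$2 canary=$3
  ap=${ANSIBLE_PLAYBOOK:-ansible-playbook}
  "$ap" -i "$inv" "$play" --syntax-check                   || return 1
  "$ap" -i "$inv" "$play" --limit "$canary" --list-hosts   || return 1  # review resolved hosts
  "$ap" -i "$inv" "$play" --limit "$canary" --check --diff || return 1
  echo "next: $ap -i $inv $play --limit $canary"
}

# Usage: staged_rollout inventory/hosts.yml playbooks/site.yml web01
```

Check mode is only as accurate as the modules in the play: tasks that do not support it, such as `command` and `shell`, are skipped, so a clean `--check` run is not proof of a clean real run.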
|
||||||
|
|
||||||
|
## Monitoring & Observability
|
||||||
|
|
||||||
|
### Zabbix Checks
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl status zabbix-agent2 --no-pager
|
||||||
|
zabbix_agent2 -t 'vfs.fs.size[/,free]' # quoted so the shell does not glob the brackets
|
||||||
|
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
|
||||||
|
```
|
||||||
|
|
||||||
|
### ELK Log Workflows
|
||||||
|
|
||||||
|
```bash
|
||||||
|
grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
|
||||||
|
journalctl -u filebeat -n 100 --no-pager
|
||||||
|
curl -s 'http://localhost:9200/_cluster/health?pretty' # quoted so the shell does not glob the ?
|
||||||
|
```
|
||||||
|
|
||||||
|
### Grafana Checks
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
|
||||||
|
grep -i 'error' /var/log/grafana/grafana.log | tail -50
|
||||||
|
```
|
||||||
|
|
||||||
|
### Health Endpoints and Alert Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -fsS http://app:8080/health
|
||||||
|
curl -fsS http://app:8080/metrics | head
|
||||||
|
```
|
||||||
|
|
||||||
|
False positive validation:
|
||||||
|
|
||||||
|
1. Compare alert timestamp with deploy/change window.
|
||||||
|
2. Confirm on-host evidence, not only dashboard data.
|
||||||
|
3. Check collector lag, scrape failures, and stale metrics.
|
||||||
|
4. Validate from a second source before escalating.
|
||||||
|
|
||||||
|
## Operational Habits
|
||||||
|
|
||||||
|
### Pre-checks
|
||||||
|
|
||||||
|
- capture time, hostname, and operator
|
||||||
|
- capture current config and service state
|
||||||
|
- check recent alerts, maintenance windows, and dependencies
|
||||||
|
- confirm backup or rollback path exists
|
||||||
|
|
||||||
|
### Post-checks
|
||||||
|
|
||||||
|
- validate service state
|
||||||
|
- validate logs for fresh errors
|
||||||
|
- validate client path, ports, and name resolution
|
||||||
|
- compare metrics before/after
|
||||||
|
|
||||||
|
### Rollback Thinking
|
||||||
|
|
||||||
|
- define exact backout trigger before change
|
||||||
|
- prefer reversible steps
|
||||||
|
- keep config backups with timestamps
|
||||||
|
- avoid bundling unrelated changes
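
Timestamped config backups are simple to standardize. A minimal sketch, with a hypothetical helper name `backup_config`, that prints the backup path so the backout step can reference an exact file:

```bash
# backup_config FILE: copy FILE next to itself with a UTC timestamp suffix,
# preserving mode and ownership, and print the path of the backup.
backup_config() {
  src=$1
  dst="$src.$(date -u +%Y%m%dT%H%M%SZ).bak"
  cp -a -- "$src" "$dst" && printf '%s\n' "$dst"
}

# Usage: bak=$(backup_config /etc/nginx/nginx.conf)   # backout: cp -a "$bak" /etc/nginx/nginx.conf
```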
|
||||||
|
|
||||||
|
### Change Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl is-active <service>
|
||||||
|
curl -fsS http://127.0.0.1:<port>/health
|
||||||
|
ss -ltnp | grep :<port>
|
||||||
|
journalctl -u <service> -S '5 min ago' --no-pager
|
||||||
|
```
|
||||||
|
|
||||||
|
### Operational Communication
|
||||||
|
|
||||||
|
- state scope, risk, and expected impact before action
|
||||||
|
- record start and stop times in UTC
|
||||||
|
- document what changed, what was checked, and remaining risk
|
||||||
|
- escalate with evidence, not assumptions
|
||||||
|
|
||||||
|
### Evidence Collection During Incidents
|
||||||
|
|
||||||
|
```bash
|
||||||
|
incident_dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ) # one named directory; a glob in a redirect is ambiguous once several exist
mkdir -p "$incident_dir"
journalctl -b --no-pager > "$incident_dir/journal.txt"
ss -tulpen > "$incident_dir/sockets.txt"
df -hT > "$incident_dir/df.txt"
free -m > "$incident_dir/free.txt"
|
||||||
|
```
|
||||||
@@ -0,0 +1,144 @@
|
|||||||
|
# Lab Cheatsheet
|
||||||
|
|
||||||
|
Quick-reference notes for experiments, rebuilds, and short-lived troubleshooting. Expect rough edges. Capture what worked, what broke, and what should not be repeated in production.
|
||||||
|
|
||||||
|
## K3s Lab
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo systemctl status k3s --no-pager
|
||||||
|
sudo journalctl -u k3s -n 100 --no-pager
|
||||||
|
kubectl get nodes -o wide
|
||||||
|
kubectl get pods -A
|
||||||
|
kubectl get events -A --sort-by=.lastTimestamp | tail -30
|
||||||
|
sudo k3s kubectl get pods -A
|
||||||
|
```
|
||||||
|
|
||||||
|
Quick reset:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
sudo /usr/local/bin/k3s-uninstall.sh # destructive lab reset
|
||||||
|
```
|
||||||
|
|
||||||
|
## Proxmox Lab
|
||||||
|
|
||||||
|
```bash
|
||||||
|
pvesh get /nodes
|
||||||
|
pvesh get /cluster/resources
|
||||||
|
qm list
|
||||||
|
qm config <vmid>
|
||||||
|
pct list
|
||||||
|
ha-manager status
|
||||||
|
```
|
||||||
|
|
||||||
|
Checks before changes:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
zpool status
|
||||||
|
pvesm status
|
||||||
|
ip -br addr
|
||||||
|
```
|
||||||
|
|
||||||
|
## GPU Passthrough
|
||||||
|
|
||||||
|
```bash
|
||||||
|
lspci -nn | grep -Ei 'vga|3d|nvidia'
|
||||||
|
nvidia-smi
|
||||||
|
dmesg -T | grep -Ei 'vfio|iommu|nvidia'
|
||||||
|
find /sys/kernel/iommu_groups/ -type l | sort
|
||||||
|
```
|
||||||
|
|
||||||
|
Good sanity check:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
lsmod | grep -E 'vfio|kvm'
|
||||||
|
```
|
||||||
|
|
||||||
|
## Terraform Experiments
|
||||||
|
|
||||||
|
```bash
|
||||||
|
terraform fmt -recursive
|
||||||
|
terraform init
|
||||||
|
terraform validate
|
||||||
|
terraform plan
|
||||||
|
terraform state list
|
||||||
|
```
|
||||||
|
|
||||||
|
Scratch workflow:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
terraform plan -out=tfplan
|
||||||
|
terraform show tfplan
|
||||||
|
```
|
||||||
|
|
||||||
|
## Networking Labs
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ip -br addr
|
||||||
|
ip route
|
||||||
|
bridge link
|
||||||
|
ss -ltnp
|
||||||
|
tcpdump -ni any port 53
|
||||||
|
dig +short example.com
|
||||||
|
mtr -rwzc 10 1.1.1.1
|
||||||
|
```
|
||||||
|
|
||||||
|
## Ansible Testing
|
||||||
|
|
||||||
|
```bash
|
||||||
|
ansible-inventory -i inventory/hosts.yml --graph
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbook.yml --syntax-check
|
||||||
|
ansible-playbook -i inventory/hosts.yml playbook.yml --check --diff
|
||||||
|
ansible all -i inventory/hosts.yml -m ping
|
||||||
|
```
|
||||||
|
|
||||||
|
## Docker Testing
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker ps -a
|
||||||
|
docker logs --tail 100 <container>
|
||||||
|
docker exec -it <container> sh
|
||||||
|
docker inspect <container> | jq '.[0].NetworkSettings'
|
||||||
|
docker system df
|
||||||
|
```
|
||||||
|
|
||||||
|
## Useful Temporary Commands
|
||||||
|
|
||||||
|
```bash
|
||||||
|
watch -n2 'kubectl get pods -A'
|
||||||
|
watch -n2 'nvidia-smi'
|
||||||
|
watch -n2 'ip -br addr'
|
||||||
|
while true; do date -u; curl -fsS http://127.0.0.1:8080/health; sleep 2; done
|
||||||
|
```
|
||||||
|
|
||||||
|
## Quick PoC Commands
|
||||||
|
|
||||||
|
```bash
|
||||||
|
python3 -m http.server 8080
|
||||||
|
openssl req -x509 -newkey rsa:2048 -nodes -days 3 -keyout key.pem -out cert.pem
|
||||||
|
curl -vk https://127.0.0.1:8443/
|
||||||
|
nc -lvkp 9000
|
||||||
|
```
|
||||||
|
|
||||||
|
## Troubleshooting Notes
|
||||||
|
|
||||||
|
- If K3s pods fail after host reboot, check time sync before chasing cert or API errors.
|
||||||
|
- If PVCs stay pending in lab clusters, inspect the default storage class first.
|
||||||
|
- If Docker networking looks broken, compare bridge subnet overlaps with the host route table.
|
||||||
|
- If GPU pods see no devices, validate driver, toolkit, and device plugin in that order.
|
||||||
|
|
||||||
|
## Useful One-liners
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl get pods -A -o wide | grep -E 'CrashLoopBackOff|Error|Pending'
|
||||||
|
journalctl -p err -S today
|
||||||
|
find /var/log -type f -mtime -1 -ls | sort -k7,7n
|
||||||
|
ps -eo pid,%cpu,%mem,cmd --sort=-%cpu | head
|
||||||
|
grep -RniE 'error|failed|timeout' .
|
||||||
|
```
|
||||||
|
|
||||||
|
## Things Worth Remembering
|
||||||
|
|
||||||
|
- Pre-checks still matter in labs. Capture state before trying the risky thing.
|
||||||
|
- Keep a copy of working configs before rapid iteration.
|
||||||
|
- Short-lived labs still produce useful evidence; save command output when a fix works.
|
||||||
|
- If a PoC needs repeated manual repair, turn the repair steps into a script or note.
|
||||||
@@ -0,0 +1,368 @@
|
|||||||
|
# Platform Engineering Cheatsheet
|
||||||
|
|
||||||
|
Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.
|
||||||
|
|
||||||
|
## Kubernetes / K3s
|
||||||
|
|
||||||
|
### Contexts, Namespaces, and Basic Workflows
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl config get-contexts
|
||||||
|
kubectl config use-context <context>
|
||||||
|
kubectl get ns
|
||||||
|
kubectl -n <ns> get pods -o wide
|
||||||
|
kubectl -n <ns> get deploy,sts,ds,svc,ingress
|
||||||
|
kubectl get nodes -o wide
|
||||||
|
```
|
||||||
|
|
||||||
|
### Describe, Logs, Exec, Events
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl -n <ns> describe pod <pod>
|
||||||
|
kubectl -n <ns> logs <pod> --tail=100
|
||||||
|
kubectl -n <ns> logs <pod> -c <container> --previous
|
||||||
|
kubectl -n <ns> exec -it <pod> -- sh
|
||||||
|
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
|
||||||
|
```
|
||||||
|
|
||||||
|
### Rollout Troubleshooting
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl -n <ns> rollout status deploy/<name>
|
||||||
|
kubectl -n <ns> rollout history deploy/<name>
|
||||||
|
kubectl -n <ns> rollout undo deploy/<name>
|
||||||
|
kubectl -n <ns> get rs -l app=<name>
|
||||||
|
```
|
||||||
|
|
||||||
|
Safe pattern:
|
||||||
|
|
||||||
|
1. `kubectl diff -f <manifest>`
|
||||||
|
2. apply to non-prod or canary namespace
|
||||||
|
3. watch rollout and events
|
||||||
|
4. validate service and logs
|
||||||
|
5. expand scope only after post-check
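
The safe pattern above could be scripted roughly as follows. This is a sketch under assumptions: the function name `staged_apply` is hypothetical, the 120-second timeout is an arbitrary example, and the target is assumed to be a Deployment.

```bash
# staged_apply NAMESPACE MANIFEST DEPLOYMENT
# Diff first, apply only when something would change, then wait for the rollout.
staged_apply() {
  ns=$1
  manifest=$2
  deploy=$3
  # kubectl diff exits 0 when nothing would change, 1 when differences
  # exist, and greater than 1 when kubectl or diff itself failed.
  rc=0
  kubectl -n "$ns" diff -f "$manifest" || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "no changes to apply"
    return 0
  elif [ "$rc" -gt 1 ]; then
    echo "kubectl diff failed (exit $rc)" >&2
    return "$rc"
  fi
  kubectl -n "$ns" apply -f "$manifest" || return 1
  # Block until the rollout completes or the timeout expires.
  kubectl -n "$ns" rollout status "deploy/$deploy" --timeout=120s || return 1
  # Show the most recent events for a quick post-check.
  kubectl -n "$ns" get events --sort-by=.lastTimestamp | tail -20
}
```

Running it against a canary namespace first, then repeating with the production namespace only after the post-check passes, matches steps 2 and 5 of the list.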
|
||||||
|
|
||||||
|
### Node Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl get nodes
|
||||||
|
kubectl describe node <node>
|
||||||
|
kubectl top nodes
|
||||||
|
kubectl top pods -A --sort-by=cpu
|
||||||
|
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Pending / CrashLoopBackOff Flow
|
||||||
|
|
||||||
|
Pending:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl -n <ns> describe pod <pod>
|
||||||
|
kubectl get events -A --sort-by=.lastTimestamp | tail -50
|
||||||
|
```
|
||||||
|
|
||||||
|
Check for:
|
||||||
|
|
||||||
|
- unsatisfied CPU/memory requests
|
||||||
|
- missing PVC
|
||||||
|
- taints/tolerations mismatch
|
||||||
|
- image pull secret issues
|
||||||
|
- node selectors or affinity mismatch
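
The checklist above can be turned into a rough triage filter over `kubectl describe pod` or event output. This is only a sketch: the function name `pending_hints` is hypothetical, and the matched phrases are common scheduler and kubelet messages whose exact wording varies between Kubernetes versions.

```bash
# pending_hints
# Reads `kubectl describe pod` or event text on stdin and prints likely causes.
# Apostrophes in messages are matched with "." to keep the awk program
# inside one single-quoted string.
pending_hints() {
  awk '
    /Insufficient (cpu|memory)/ {
      print "resources: requests exceed free node capacity" }
    /unbound immediate PersistentVolumeClaims|persistentvolumeclaim .* not found/ {
      print "storage: PVC missing or unbound" }
    /untolerated taint|didn.t tolerate/ {
      print "scheduling: taint without matching toleration" }
    /didn.t match Pod.s node affinity|didn.t match node selector/ {
      print "scheduling: node selector or affinity mismatch" }
    /ErrImagePull|ImagePullBackOff/ {
      print "image: pull failure, check tag and pull secret" }
  ' | sort -u
}
```

Usage would look like `kubectl -n <ns> describe pod <pod> | pending_hints`. Empty output means none of the known patterns matched, not that the pod is healthy.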
|
||||||
|
|
||||||
|
CrashLoopBackOff:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl -n <ns> logs <pod> --previous
|
||||||
|
kubectl -n <ns> describe pod <pod>
|
||||||
|
kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
|
||||||
|
```
|
||||||
|
|
||||||
|
Check for:
|
||||||
|
|
||||||
|
- bad config or missing env vars
|
||||||
|
- probe failures
|
||||||
|
- dependency timeouts
|
||||||
|
- permission or filesystem errors
|
||||||
|
|
||||||
|
## Helm
|
||||||
|
|
||||||
|
```bash
|
||||||
|
helm repo list
|
||||||
|
helm repo update
|
||||||
|
helm list -A
|
||||||
|
helm -n <ns> get values <release> -a
|
||||||
|
helm -n <ns> get manifest <release>
|
||||||
|
helm upgrade --install <release> <chart> -n <ns> -f values.yaml
|
||||||
|
helm rollback -n <ns> <release> <revision>
|
||||||
|
helm template <release> <chart> -f values.yaml | less
|
||||||
|
```
|
||||||
|
|
||||||
|
Validation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
helm lint <chart>
|
||||||
|
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20
|
||||||
|
```
|
||||||
|
|
||||||
|
## Docker / Podman
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker images
|
||||||
|
docker ps -a
|
||||||
|
docker logs --tail 100 <container>
|
||||||
|
docker exec -it <container> sh
|
||||||
|
docker inspect <container>
|
||||||
|
docker volume ls
|
||||||
|
docker network ls
|
||||||
|
docker system df
|
||||||
|
docker image prune -f # cleanup: review first
|
||||||
|
docker container prune -f # cleanup: review first
|
||||||
|
podman ps -a
|
||||||
|
podman inspect <container>
|
||||||
|
```
|
||||||
|
|
||||||
|
Container validation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker exec <container> env | sort
|
||||||
|
docker exec <container> ss -ltnp
|
||||||
|
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
|
||||||
|
```
|
||||||
|
|
||||||
|
## Terraform
|
||||||
|
|
||||||
|
### Core Commands
|
||||||
|
|
||||||
|
```bash
|
||||||
|
terraform fmt -check -recursive
|
||||||
|
terraform init
|
||||||
|
terraform validate
|
||||||
|
terraform plan -out=tfplan
|
||||||
|
terraform apply tfplan
|
||||||
|
terraform destroy -target=<resource> # impact: targeted destruction needs review
|
||||||
|
terraform state list
|
||||||
|
terraform state show <resource>
|
||||||
|
terraform import <resource> <id>
|
||||||
|
```
|
||||||
|
|
||||||
|
### Safe Workflow
|
||||||
|
|
||||||
|
1. `terraform fmt -check -recursive`
|
||||||
|
2. `terraform validate`
|
||||||
|
3. refresh provider auth and backend access
|
||||||
|
4. save the plan artifact with `terraform plan -out=tfplan`
5. review `terraform show tfplan` output for replacements and destroys
|
||||||
|
6. apply reviewed plan only
|
||||||
|
7. validate resource state outside Terraform
|
||||||
|
|
||||||
|
Plan review focus:
|
||||||
|
|
||||||
|
- unexpected replacement
|
||||||
|
- drift on security groups, routes, storage, or instance identity
|
||||||
|
- provider alias mistakes
|
||||||
|
- wrong workspace or backend
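
To make the "review for destroys" step harder to skip, the plan summary line can be checked before apply. The sketch below is an assumption-laden illustration: the function names and the `ALLOW_DESTROY` variable are hypothetical, and it relies on the human-readable `Plan: N to add, N to change, N to destroy.` summary, which is not a stable machine interface.

```bash
# plan_destroy_count
# Reads `terraform plan` or `terraform show` text on stdin and prints the
# destroy count from the summary line (prints nothing if there is none).
plan_destroy_count() {
  sed -n 's/.*Plan: .* \([0-9][0-9]*\) to destroy.*/\1/p' | tail -1
}

# gate_plan
# Fails when the plan on stdin destroys anything, unless ALLOW_DESTROY=1.
gate_plan() {
  n=$(plan_destroy_count)
  n=${n:-0}
  if [ "$n" -gt 0 ] && [ "${ALLOW_DESTROY:-0}" != 1 ]; then
    echo "plan destroys $n resource(s); review before apply" >&2
    return 1
  fi
}
```

A typical call would be `terraform show -no-color tfplan | gate_plan && terraform apply tfplan`. The `-no-color` flag matters because colour escape codes would otherwise break the pattern match. For stricter checks, `terraform show -json` is the more reliable source.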
|
||||||
|
|
||||||
|
## CI/CD Operations
|
||||||
|
|
||||||
|
### GitLab CI
|
||||||
|
|
||||||
|
```bash
|
||||||
|
gitlab-runner verify
|
||||||
|
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
|
||||||
|
curl -s --header "PRIVATE-TOKEN: $TOKEN" "https://gitlab.example/api/v4/projects/<id>/pipelines"
|
||||||
|
```
|
||||||
|
|
||||||
|
### Jenkins
|
||||||
|
|
||||||
|
```bash
|
||||||
|
systemctl status jenkins --no-pager
|
||||||
|
journalctl -u jenkins -n 100 --no-pager
|
||||||
|
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
|
||||||
|
```
|
||||||
|
|
||||||
|
### Runners, Artifacts, Pipeline Failures
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker logs --tail 100 gitlab-runner
|
||||||
|
kubectl -n ci get pods
|
||||||
|
kubectl -n ci logs deploy/runner-controller --tail=100
|
||||||
|
```
|
||||||
|
|
||||||
|
Troubleshooting flow:
|
||||||
|
|
||||||
|
1. validate YAML or Jenkinsfile syntax
|
||||||
|
2. confirm runner/agent availability
|
||||||
|
3. inspect job logs for auth, cache, DNS, or registry failures
|
||||||
|
4. verify artifacts were uploaded and not expired
|
||||||
|
5. correlate with platform outages, image changes, or secret rotation
|
||||||
|
|
||||||
|
YAML validation:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
yamllint .
|
||||||
|
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
|
||||||
|
```
|
||||||
|
|
||||||
|
## Observability
|
||||||
|
|
||||||
|
### Prometheus
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -s http://prometheus:9090/-/ready
|
||||||
|
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
|
||||||
|
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Loki
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -s http://loki:3100/ready
|
||||||
|
curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"'
|
||||||
|
```
|
||||||
|
|
||||||
|
### Grafana
|
||||||
|
|
||||||
|
```bash
|
||||||
|
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
|
||||||
|
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
|
||||||
|
```
|
||||||
|
|
||||||
|
### Metrics Validation and Log Correlation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl -n <ns> port-forward svc/<svc> 9090:9090   # blocks; run in a second terminal
|
||||||
|
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
|
||||||
|
```
|
||||||
|
|
||||||
|
Correlation flow:
|
||||||
|
|
||||||
|
1. confirm alert time and impacted objects
|
||||||
|
2. inspect deployment events in same window
|
||||||
|
3. compare Prometheus series, Loki logs, and app logs
|
||||||
|
4. rule out scrape lag or stale dashboards
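
Step 4 above (ruling out scrape lag) comes down to comparing a sample's timestamp with the current time. A minimal sketch, assuming a hypothetical helper name `sample_age_ok` and a freshness threshold you choose yourself:

```bash
# sample_age_ok SAMPLE_EPOCH MAX_AGE_SECONDS [NOW_EPOCH]
# Succeeds when the sample is no older than MAX_AGE_SECONDS.
# NOW_EPOCH defaults to the current time and exists mainly for testing.
sample_age_ok() {
  sample=${1%.*}   # drop fractional seconds, which the Prometheus API returns
  max_age=$2
  now=${3:-$(date +%s)}
  [ $((now - sample)) -le "$max_age" ]
}

# One way to obtain the last sample time (job name "node" is a placeholder):
#   ts=$(curl -Gs http://prometheus:9090/api/v1/query \
#          --data-urlencode 'query=timestamp(up{job="node"})' \
#        | jq -r '.data.result[0].value[1]')
#   sample_age_ok "$ts" 120 || echo "series looks stale"
```

The PromQL `timestamp()` function is used because the first element of an instant-query `value` pair is the evaluation time, not the time the sample was scraped.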
|
||||||
|
|
||||||
|
## GPU / AI Infrastructure
|
||||||
|
|
||||||
|
### GPU Discovery and CUDA Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
nvidia-smi
|
||||||
|
nvidia-smi -L
|
||||||
|
nvidia-smi topo -m
|
||||||
|
nvidia-smi dmon -s pucm
|
||||||
|
nvcc --version
|
||||||
|
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
|
||||||
|
```
|
||||||
|
|
||||||
|
### MIG Basics
|
||||||
|
|
||||||
|
```bash
|
||||||
|
nvidia-smi -i 0 -q | grep -i mig -A4
|
||||||
|
nvidia-smi mig -lgip
|
||||||
|
nvidia-smi mig -lgi
|
||||||
|
```
|
||||||
|
|
||||||
|
### GPU Operator and DCGM
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl get pods -A | grep -E 'nvidia|gpu'
|
||||||
|
kubectl -n gpu-operator describe pod <pod>
|
||||||
|
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
|
||||||
|
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
|
||||||
|
```
|
||||||
|
|
||||||
|
### Container GPU Validation
|
||||||
|
|
||||||
|
```bash
|
||||||
|
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
|
||||||
|
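# --limits was removed from `kubectl run` in newer releases; set the GPU limit via --overrides
kubectl run gpu-check --rm --attach --restart=Never \
  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-check","image":"nvidia/cuda:12.3.2-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'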
|
||||||
|
```
|
||||||
|
|
||||||
|
### Kubernetes GPU Troubleshooting
|
||||||
|
|
||||||
|
Check for:
|
||||||
|
|
||||||
|
- device plugin not running
|
||||||
|
- driver/container toolkit mismatch
|
||||||
|
- node missing `nvidia.com/gpu` allocatable resources
|
||||||
|
- MIG profile mismatch
|
||||||
|
- taints or tolerations blocking placement
|
||||||
|
|
||||||
|
Useful checks:
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
|
||||||
|
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
|
||||||
|
kubectl -n <ns> describe pod <gpu-pod>
|
||||||
|
```
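
When only the GPU count matters, the allocatable lookup can be narrowed to a single value. The helper below is a sketch with a hypothetical name, `gpu_allocatable`; it assumes the device plugin advertises the standard `nvidia.com/gpu` resource.

```bash
# gpu_allocatable NODE
# Prints the node's allocatable nvidia.com/gpu count, or 0 when the
# resource is not advertised (for example, device plugin not running).
gpu_allocatable() {
  # Dots inside a key name must be escaped in kubectl JSONPath.
  count=$(kubectl get node "$1" \
    -o jsonpath='{.status.allocatable.nvidia\.com/gpu}') || return 1
  echo "${count:-0}"
}
```

A result of `0` on a node that physically has GPUs points back at the first three items in the checklist: device plugin, driver/toolkit mismatch, or missing allocatable resources.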
|
||||||
|
|
||||||
|
## Platform Troubleshooting Flows
|
||||||
|
|
||||||
|
### Pod Not Starting
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl -n <ns> get pod <pod> -o wide
|
||||||
|
kubectl -n <ns> describe pod <pod>
|
||||||
|
kubectl -n <ns> logs <pod> --previous
|
||||||
|
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
|
||||||
|
```
|
||||||
|
|
||||||
|
### Image Pull Errors
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
|
||||||
|
crictl images | grep <image>
|
||||||
|
ctr -n k8s.io images ls | grep <image>
|
||||||
|
```
|
||||||
|
|
||||||
|
Check:
|
||||||
|
|
||||||
|
- image tag exists
|
||||||
|
- registry reachable
|
||||||
|
- pull secret valid
|
||||||
|
- node clock sane for token-based auth
|
||||||
|
|
||||||
|
### Failing Deployment
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl -n <ns> rollout status deploy/<name>
|
||||||
|
kubectl -n <ns> describe deploy/<name>
|
||||||
|
kubectl -n <ns> get rs,pods -l app=<name> -o wide
|
||||||
|
```
|
||||||
|
|
||||||
|
### Node Not Ready
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl describe node <node>
|
||||||
|
journalctl -u k3s -n 100 --no-pager   # K3s server; the unit is k3s-agent on agent nodes
|
||||||
|
systemctl status kubelet --no-pager   # non-K3s clusters; K3s embeds the kubelet
|
||||||
|
df -h
|
||||||
|
free -m
|
||||||
|
```
|
||||||
|
|
||||||
|
Check:
|
||||||
|
|
||||||
|
- kubelet or k3s service state
|
||||||
|
- disk pressure
|
||||||
|
- cert expiry
|
||||||
|
- CNI failure
|
||||||
|
- API reachability
|
||||||
|
|
||||||
|
### Storage Provisioning Issues
|
||||||
|
|
||||||
|
```bash
|
||||||
|
kubectl get pvc,pv -A
|
||||||
|
kubectl -n <ns> describe pvc <pvc>
|
||||||
|
kubectl get sc
|
||||||
|
kubectl -n kube-system logs deploy/<csi-controller> --tail=100
|
||||||
|
```
|
||||||
|
|
||||||
|
Check:
|
||||||
|
|
||||||
|
- storage class defaulting
|
||||||
|
- access mode mismatch
|
||||||
|
- CSI controller errors
|
||||||
|
- backend quota or LUN exhaustion
|
||||||
|
- node attachment failures
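
The first two checks above can be narrowed with small filters over `kubectl` output. This is a sketch with hypothetical function names; it assumes the default column layout of `kubectl get pvc -A --no-headers` (namespace, name, status first) and the `(default)` marker that `kubectl get sc` prints next to a default class.

```bash
# pending_pvcs
# Filters `kubectl get pvc -A --no-headers` output (stdin) to claims
# that are not Bound, printed as namespace/name status=<status>.
pending_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2 " status=" $3 }'
}

# default_sc_count
# Counts storage classes marked (default) in `kubectl get sc` output (stdin).
# Zero or more than one default class are both worth investigating.
default_sc_count() {
  grep -c '(default)' || true
}
```

Typical usage would be `kubectl get pvc -A --no-headers | pending_pvcs` and `kubectl get sc | default_sc_count`.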
|
||||||