Add operational cheatsheets across repository

2026-05-09 09:41:55 +00:00
parent ca5a876d03
commit 0d3905b8a1
6 changed files with 1394 additions and 0 deletions
@@ -0,0 +1,368 @@
+# Platform Engineering Cheatsheet
+
+Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.
+
+## Kubernetes / K3s
+
+### Contexts, Namespaces, and Basic Workflows
+
+```bash
+kubectl config get-contexts
+kubectl config use-context <context>
+kubectl get ns
+kubectl -n <ns> get pods -o wide
+kubectl -n <ns> get deploy,sts,ds,svc,ingress
+kubectl get nodes -o wide
+```
+
+### Describe, Logs, Exec, Events
+
+```bash
+kubectl -n <ns> describe pod <pod>
+kubectl -n <ns> logs <pod> --tail=100
+kubectl -n <ns> logs <pod> -c <container> --previous
+kubectl -n <ns> exec -it <pod> -- sh
+kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
+```
+
+### Rollout Troubleshooting
+
+```bash
+kubectl -n <ns> rollout status deploy/<name>
+kubectl -n <ns> rollout history deploy/<name>
+kubectl -n <ns> rollout undo deploy/<name>
+kubectl -n <ns> get rs -l app=<name>
+```
+
+Safe pattern:
+
+1. `kubectl diff -f <manifest>`
+2. apply to non-prod or canary namespace
+3. watch rollout and events
+4. validate service and logs
+5. expand scope only after post-check
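
The safe pattern above can be sketched as one shell function. This is a minimal sketch, not a hardened deploy script: the name `safe_apply`, its argument order, and the 120-second timeout are assumptions made for illustration.

```bash
# Hypothetical helper: safe_apply <ns> <manifest> <deployment>
safe_apply() {
  local ns="$1" manifest="$2" deploy="$3" rc=0

  # kubectl diff exits 0 when nothing would change, 1 when there are
  # differences, and >1 on error.
  kubectl -n "$ns" diff -f "$manifest" || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "no changes"
    return 0
  elif [ "$rc" -gt 1 ]; then
    echo "diff failed (exit $rc)" >&2
    return "$rc"
  fi

  kubectl -n "$ns" apply -f "$manifest" || return 1
  # Block until the rollout finishes or the (assumed) timeout expires.
  kubectl -n "$ns" rollout status "deploy/$deploy" --timeout=120s || return 1
  # Post-check: show the most recent events in the namespace.
  kubectl -n "$ns" get events --sort-by=.lastTimestamp | tail -10
}
```

Run it against a non-prod or canary namespace first, then repeat with the production namespace once the post-check looks clean.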
+
+### Node Validation
+
+```bash
+kubectl get nodes
+kubectl describe node <node>
+kubectl top nodes
+kubectl top pods -A --sort-by=cpu
+kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
+```
+
+### Pending / CrashLoopBackOff Flow
+
+Pending:
+
+```bash
+kubectl -n <ns> describe pod <pod>
+kubectl get events -A --sort-by=.lastTimestamp | tail -50
+```
+
+Check for:
+
+- unsatisfied CPU/memory requests
+- missing PVC
+- taints/tolerations mismatch
+- image pull secret issues
+- node selectors or affinity mismatch
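
As a rough aid, scheduler event text can be mapped onto the checklist above. The helper name `classify_pending` and the match strings are assumptions; exact event wording varies by Kubernetes version. Image pull problems are not covered because they surface as `ErrImagePull` or `ImagePullBackOff` on an already scheduled pod rather than as scheduling events.

```bash
# Hypothetical helper: reads event messages on stdin, prints one hint per cause.
# Example input:
#   kubectl -n <ns> get events \
#     --field-selector involvedObject.name=<pod>,reason=FailedScheduling \
#     -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' | classify_pending
classify_pending() {
  local msg
  msg="$(cat)"   # read all event messages from stdin
  case "$msg" in *Insufficient*)
    echo "resources: CPU/memory requests exceed free capacity" ;; esac
  case "$msg" in *taint*)
    echo "taints: pod lacks a matching toleration" ;; esac
  case "$msg" in *affinity*|*selector*)
    echo "placement: node selector or affinity mismatch" ;; esac
  case "$msg" in *[Pp]ersistent[Vv]olume[Cc]laim*)
    echo "storage: PVC missing or unbound" ;; esac
  return 0
}
```

Each cause is tested independently because a single scheduler message often lists several reasons at once.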
+
+CrashLoopBackOff:
+
+```bash
+kubectl -n <ns> logs <pod> --previous
+kubectl -n <ns> describe pod <pod>
+kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
+```
+
+Check for:
+
+- bad config or missing env vars
+- probe failures
+- dependency timeouts
+- permission or filesystem errors
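
The exit code of the last terminated container often narrows the list above quickly. The sketch below maps common codes to a first guess; the function name `explain_exit_code` and the hint wording are illustrative, and the hints are starting points rather than diagnoses.

```bash
# Hypothetical helper: explain_exit_code <code>
# Example input:
#   kubectl -n <ns> get pod <pod> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
explain_exit_code() {
  case "$1" in
    0)   echo "clean exit: the command finished; check the entrypoint and restartPolicy" ;;
    1)   echo "application error: check config and env vars in the previous logs" ;;
    126) echo "command found but not executable: check file permissions" ;;
    127) echo "command not found: check image contents and entrypoint" ;;
    137) echo "SIGKILL (128+9): often OOMKilled, or killed after the termination grace period" ;;
    143) echo "SIGTERM (128+15): container was asked to stop" ;;
    *)   echo "exit $1: see application logs" ;;
  esac
}
```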
+
+## Helm
+
+```bash
+helm repo list
+helm repo update
+helm list -A
+helm -n <ns> get values <release> -a
+helm -n <ns> get manifest <release>
+helm upgrade --install <release> <chart> -n <ns> -f values.yaml
+helm rollback -n <ns> <release> <revision>
+helm template <release> <chart> -f values.yaml | less
+```
+
+Validation:
+
+```bash
+helm lint <chart>
+kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20
+```
+
+## Docker / Podman
+
+```bash
+docker images
+docker ps -a
+docker logs --tail 100 <container>
+docker exec -it <container> sh
+docker inspect <container>
+docker volume ls
+docker network ls
+docker system df
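+docker image prune -f         # removes dangling images without prompting; review 'docker images -f dangling=true' first
+docker container prune -f     # removes all stopped containers without prompting; review 'docker ps -a' first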
+podman ps -a
+podman inspect <container>
+```
+
+Container validation:
+
+```bash
+docker exec <container> env | sort
+docker exec <container> ss -ltnp
+docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
+```
+
+## Terraform
+
+### Core Commands
+
+```bash
+terraform fmt -check -recursive
+terraform init
+terraform validate
+terraform plan -out=tfplan
+terraform apply tfplan
+terraform destroy -target=<resource>   # impact: targeted destruction needs review
+terraform state list
+terraform state show <resource>
+terraform import <resource> <id>
+```
+
+### Safe Workflow
+
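+1. `terraform fmt -check -recursive`
+2. confirm provider auth and backend access, then `terraform init`
+3. `terraform validate` (needs an initialized working directory)
+4. `terraform plan -out=tfplan` to save the plan artifact
+5. review the saved plan (`terraform show tfplan`) for replacements and destroys
+6. apply only the reviewed plan: `terraform apply tfplan`
+7. validate resource state outside Terraform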
+
+Plan review focus:
+
+- unexpected replacement
+- drift on security groups, routes, storage, or instance identity
+- provider alias mistakes
+- wrong workspace or backend
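
Part of the plan review can be automated by scanning the rendered plan for destructive actions. The sketch below assumes the human-readable plan format, where each resource is introduced by a comment line such as `# <address> must be replaced`; that text is not a stable interface, so `terraform show -json` is the sturdier source for anything long-lived. The name `plan_risks` is hypothetical.

```bash
# Hypothetical helper: flags replacements and destroys in plan text on stdin.
# Usage: terraform show -no-color tfplan | plan_risks
plan_risks() {
  local hits
  # grep reads the function's stdin; "|| true" keeps a no-match from
  # aborting scripts that run with "set -e".
  hits="$(grep -E '^[[:space:]]*# .+ (must be replaced|will be destroyed)' || true)"
  if [ -n "$hits" ]; then
    echo "$hits"
    return 1     # non-zero so a CI job can stop for manual review
  fi
  echo "no replacements or destroys found"
}
```

A non-zero return does not mean the plan is wrong, only that a person should look at the listed resources before applying.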
+
+## CI/CD Operations
+
+### GitLab CI
+
+```bash
+gitlab-runner verify
+grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
+curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects/<id>/pipelines
+```
+
+### Jenkins
+
+```bash
+systemctl status jenkins --no-pager
+journalctl -u jenkins -n 100 --no-pager
+java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
+```
+
+### Runners, Artifacts, Pipeline Failures
+
+```bash
+docker logs --tail 100 gitlab-runner
+kubectl -n ci get pods
+kubectl -n ci logs deploy/runner-controller --tail=100
+```
+
+Troubleshooting flow:
+
+1. validate YAML or Jenkinsfile syntax
+2. confirm runner/agent availability
+3. inspect job logs for auth, cache, DNS, or registry failures
+4. verify artifacts were uploaded and not expired
+5. correlate with platform outages, image changes, or secret rotation
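
Step 3 of the flow can be approximated by tagging job-log lines with a failure class. The patterns below are illustrative examples of common auth, DNS, registry, and cache messages, not an exhaustive list, and the name `triage_job_log` is hypothetical.

```bash
# Hypothetical helper: reads a job log on stdin, prints "<class>: <line>".
# Usage: triage_job_log < job.log
triage_job_log() {
  awk '
    /401 Unauthorized|403 Forbidden|[Aa]uthentication (failed|required)/ { print "auth: " $0; next }
    /Could not resolve host|Temporary failure in name resolution|no such host/ { print "dns: " $0; next }
    /manifest unknown|pull access denied|toomanyrequests/ { print "registry: " $0; next }
    /[Cc]ache/ && /[Ff]ail|not found|No URL provided/ { print "cache: " $0; next }
  '
}
```

Lines that match nothing are dropped, so an empty result means the log needs a manual read rather than that the job is healthy.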
+
+YAML validation:
+
+```bash
+yamllint .
+python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
+```
+
+## Observability
+
+### Prometheus
+
+```bash
+curl -s http://prometheus:9090/-/ready
+curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
+curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
+```
+
+### Loki
+
+```bash
+curl -s http://loki:3100/ready
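+# log queries need query_range; recent Loki versions reject them on the instant /query endpoint
+curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"' --data-urlencode 'limit=50'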
+```
+
+### Grafana
+
+```bash
+curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
+grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
+```
+
+### Metrics Validation and Log Correlation
+
+```bash
+kubectl -n <ns> port-forward svc/<svc> 9090:9090   # blocks; run in a separate terminal
+curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
+```
+
+Correlation flow:
+
+1. confirm alert time and impacted objects
+2. inspect deployment events in same window
+3. compare Prometheus series, Loki logs, and app logs
+4. rule out scrape lag or stale dashboards
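
To compare metrics and logs over the same window, it helps to derive one start/end pair from the alert time and pass it to both backends. This is a sketch under assumptions: the function names, the 15-minute default padding, and the 30s step are illustrative, and the hostnames are the ones used in the examples above.

```bash
# Hypothetical helper: alert_window <alert_epoch_seconds> [pad_minutes]
# Prints "<start> <end>" in Unix seconds, padded on both sides of the alert.
alert_window() {
  local alert="$1" pad_minutes="${2:-15}"
  local pad=$(( pad_minutes * 60 ))
  echo "$(( alert - pad )) $(( alert + pad ))"
}

# Hypothetical helper: correlate <alert_epoch_seconds> <promql> <logql>
correlate() {
  local window start end
  window="$(alert_window "$1")"
  start="${window% *}"
  end="${window#* }"
  # Prometheus accepts Unix seconds for start and end.
  curl -Gs http://prometheus:9090/api/v1/query_range \
    --data-urlencode "query=$2" \
    --data-urlencode "start=$start" --data-urlencode "end=$end" \
    --data-urlencode "step=30s"
  # Loki accepts nanosecond Unix epochs, hence the nine appended zeros.
  curl -Gs http://loki:3100/loki/api/v1/query_range \
    --data-urlencode "query=$3" \
    --data-urlencode "start=${start}000000000" \
    --data-urlencode "end=${end}000000000"
}
```

Using one window for both queries keeps a mismatch between dashboards and logs from being an artifact of different time ranges.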
+
+## GPU / AI Infrastructure
+
+### GPU Discovery and CUDA Validation
+
+```bash
+nvidia-smi
+nvidia-smi -L
+nvidia-smi topo -m
+nvidia-smi dmon -s pucm
+nvcc --version
+python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
+```
+
+### MIG Basics
+
+```bash
+nvidia-smi -i 0 -q | grep -i mig -A4
+nvidia-smi mig -lgip
+nvidia-smi mig -lgi
+```
+
+### GPU Operator and DCGM
+
+```bash
+kubectl get pods -A | grep -E 'nvidia|gpu'
+kubectl -n gpu-operator describe pod <pod>
+kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
+kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
+```
+
+### Container GPU Validation
+
+```bash
+docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
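+# --limits was removed from 'kubectl run' in v1.24; set the GPU limit through --overrides
+kubectl run gpu-check --rm -it --restart=Never \
+  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
+  --override-type=strategic \
+  --overrides='{"spec":{"containers":[{"name":"gpu-check","resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}' \
+  -- nvidia-smi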
+```
+
+### Kubernetes GPU Troubleshooting
+
+Check for:
+
+- device plugin not running
+- driver/container toolkit mismatch
+- node missing `nvidia.com/gpu` allocatable resources
+- MIG profile mismatch
+- taints or tolerations blocking placement
+
+Useful checks:
+
+```bash
+kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
+kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
+kubectl -n <ns> describe pod <gpu-pod>
+```
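
The allocatable check can be narrowed to the GPU resource alone. A minimal sketch, assuming the standard `nvidia.com/gpu` resource name; the function name `gpu_allocatable` is hypothetical.

```bash
# Hypothetical helper: gpu_allocatable <node>
# Prints the node's allocatable GPU count and returns non-zero when it is 0.
gpu_allocatable() {
  local node="$1" count
  # Dots inside the resource name must be escaped in jsonpath.
  count="$(kubectl get node "$node" \
    -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')"
  count="${count:-0}"   # an absent key prints nothing; treat it as zero
  echo "$node allocatable nvidia.com/gpu: $count"
  [ "$count" -gt 0 ]
}
```

A count of zero on a node that physically has GPUs usually points back to the device plugin or driver items in the list above.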
+
+## Platform Troubleshooting Flows
+
+### Pod Not Starting
+
+```bash
+kubectl -n <ns> get pod <pod> -o wide
+kubectl -n <ns> describe pod <pod>
+kubectl -n <ns> logs <pod> --previous
+kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
+```
+
+### Image Pull Errors
+
+```bash
+kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
+crictl images | grep <image>
+ctr -n k8s.io images ls | grep <image>
+```
+
+Check:
+
+- image tag exists
+- registry reachable
+- pull secret valid
+- node clock sane for token-based auth
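
The pull secret item can be partly checked from the pod spec. The sketch below only confirms that each referenced secret exists in the namespace; it does not prove the credentials inside are still valid. The function name `check_pull_secrets` is hypothetical.

```bash
# Hypothetical helper: check_pull_secrets <ns> <pod>
check_pull_secrets() {
  local ns="$1" pod="$2" names name missing=0
  names="$(kubectl -n "$ns" get pod "$pod" \
    -o jsonpath='{.spec.imagePullSecrets[*].name}')"
  if [ -z "$names" ]; then
    echo "pod references no imagePullSecrets"
    return 0
  fi
  # jsonpath prints the names space-separated, so word splitting is intended.
  for name in $names; do
    if kubectl -n "$ns" get secret "$name" >/dev/null 2>&1; then
      echo "ok: $name"
    else
      echo "missing: $name"
      missing=1
    fi
  done
  return "$missing"
}
```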
+
+### Failing Deployment
+
+```bash
+kubectl -n <ns> rollout status deploy/<name>
+kubectl -n <ns> describe deploy/<name>
+kubectl -n <ns> get rs,pods -l app=<name> -o wide
+```
+
+### Node Not Ready
+
+```bash
+kubectl describe node <node>
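+journalctl -u k3s -n 100 --no-pager     # K3s server; use -u k3s-agent on agent nodes
+systemctl status kubelet --no-pager     # non-K3s nodes; K3s runs the kubelet inside the k3s service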
+df -h
+free -m
+```
+
+Check:
+
+- kubelet or k3s service state
+- disk pressure
+- cert expiry
+- CNI failure
+- API reachability
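
Node conditions summarize several of these checks in one place. A minimal sketch that prints only the conditions outside their healthy state (`Ready` should be `True`, the pressure and availability conditions should be `False`); the function name `unhealthy_conditions` is hypothetical.

```bash
# Hypothetical helper: unhealthy_conditions <node>
# Prints "<type>=<status>" for each unhealthy condition; exits 1 if any exist.
unhealthy_conditions() {
  kubectl get node "$1" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
  awk -F= '
    NF == 0 { next }                                    # skip blank lines
    $1 == "Ready" && $2 != "True"  { print; bad = 1 }   # Ready should be True
    $1 != "Ready" && $2 != "False" { print; bad = 1 }   # other conditions should be False
    END { exit bad }
  '
}
```

`Ready=Unknown` is reported as well, which typically means the control plane has stopped hearing from the node.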
+
+### Storage Provisioning Issues
+
+```bash
+kubectl get pvc,pv -A
+kubectl -n <ns> describe pvc <pvc>
+kubectl get sc
+kubectl -n kube-system logs deploy/<csi-controller> --tail=100
+```
+
+Check:
+
+- storage class defaulting (exactly one default class when the PVC omits `storageClassName`)
+- access mode mismatch
+- CSI controller errors
+- backend quota or LUN exhaustion
+- node attachment failures
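
Two of these checks are easy to script against the default table output of `kubectl`. Both helper names are hypothetical, and parsing table output is a convenience for interactive use rather than a stable interface.

```bash
# Hypothetical helper: lists Pending claims as <namespace>/<name>.
pending_pvcs() {
  # With -A the columns are NAMESPACE NAME STATUS ...
  kubectl get pvc -A --no-headers | awk '$3 == "Pending" { print $1 "/" $2 }'
}

# Hypothetical helper: prints how many storage classes are marked default.
default_sc_count() {
  # kubectl marks a default class as "<name> (default)".
  # grep -c exits 1 on zero matches; "|| true" keeps the status clean.
  kubectl get sc --no-headers | grep -c '(default)' || true
}
```

A count other than 1 from `default_sc_count` explains many claims that stay Pending without a `storageClassName`.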