# Platform Engineering Cheatsheet

Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.

## Kubernetes / K3s

### Contexts, Namespaces, and Basic Workflows

```bash
kubectl config get-contexts
kubectl config use-context <context>
kubectl get ns
kubectl -n <namespace> get pods -o wide
kubectl -n <namespace> get deploy,sts,ds,svc,ingress
kubectl get nodes -o wide
```

### Describe, Logs, Exec, Events

```bash
kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> logs <pod> --tail=100
kubectl -n <namespace> logs <pod> -c <container> --previous
kubectl -n <namespace> exec -it <pod> -- sh
kubectl -n <namespace> get events --sort-by=.lastTimestamp | tail -30
```

### Rollout Troubleshooting

```bash
kubectl -n <namespace> rollout status deploy/<deployment>
kubectl -n <namespace> rollout history deploy/<deployment>
kubectl -n <namespace> rollout undo deploy/<deployment>
kubectl -n <namespace> get rs -l app=<app>
```

Safe pattern:

1. `kubectl diff -f <file>`
2. apply to non-prod or canary namespace
3. watch rollout and events
4. validate service and logs
5. expand scope only after post-check

### Node Validation

```bash
kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
```

### Pending / CrashLoopBackOff Flow

Pending:

```bash
kubectl -n <namespace> describe pod <pod>
kubectl get events -A --sort-by=.lastTimestamp | tail -50
```

Check for:

- unsatisfied CPU/memory requests
- missing PVC
- taints/tolerations mismatch
- image pull secret issues
- node selectors or affinity mismatch

CrashLoopBackOff:

```bash
kubectl -n <namespace> logs <pod> --previous
kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
```

Check for:

- bad config or missing env vars
- probe failures
- dependency timeouts
- permission or filesystem errors

## Helm

```bash
helm repo list
helm repo update
helm list -A
helm -n <namespace> get values <release> -a
helm -n <namespace> get manifest <release>
helm upgrade --install <release> <chart> -n <namespace> -f values.yaml
helm rollback <release> <revision> -n <namespace>
helm template <release> <chart> -f values.yaml | less
```

Validation:

```bash
helm lint <chart>
kubectl -n <namespace> get events --sort-by=.lastTimestamp | tail -20
```

## Docker / Podman

```bash
docker images
docker ps -a
docker logs --tail 100 <container>
docker exec -it <container> sh
docker inspect <container>
docker volume ls
docker network ls
docker system df
docker image prune -f        # cleanup: review first
docker container prune -f    # cleanup: review first
podman ps -a
podman inspect <container>
```

Container validation:

```bash
docker exec <container> env | sort
docker exec <container> ss -ltnp
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
```

## Terraform

### Core Commands

```bash
terraform fmt -check -recursive
terraform init
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
terraform destroy -target=<resource_address>  # impact: targeted destruction needs review
terraform state list
terraform state show <resource_address>
terraform import <resource_address> <id>
```

### Safe Workflow

1. `terraform fmt -check -recursive`
2. `terraform validate`
3. refresh provider auth and backend access
4. review `plan` output for replacements and destroys
5. save plan artifact
6. apply reviewed plan only
7. validate resource state outside Terraform

Plan review focus:

- unexpected replacement
- drift on security groups, routes, storage, or instance identity
- provider alias mistakes
- wrong workspace or backend

## CI/CD Operations

### GitLab CI

```bash
gitlab-runner verify
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
curl -s --header "PRIVATE-TOKEN: $TOKEN" "https://gitlab.example/api/v4/projects/<project_id>/pipelines"
```

### Jenkins

```bash
systemctl status jenkins --no-pager
journalctl -u jenkins -n 100 --no-pager
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
```

### Runners, Artifacts, Pipeline Failures

```bash
docker logs --tail 100 gitlab-runner
kubectl -n ci get pods
kubectl -n ci logs deploy/runner-controller --tail=100
```

Troubleshooting flow:

1. validate YAML or Jenkinsfile syntax
2. confirm runner/agent availability
3. inspect job logs for auth, cache, DNS, or registry failures
4. verify artifacts were uploaded and not expired
5. correlate with platform outages, image changes, or secret rotation

YAML validation:

```bash
yamllint .
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
```

## Observability

### Prometheus

```bash
curl -s http://prometheus:9090/-/ready
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
```

### Loki

```bash
curl -s http://loki:3100/ready
# log queries go to query_range; the instant /query endpoint is for metric queries
curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"'
```

### Grafana

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
```

### Metrics Validation and Log Correlation

```bash
kubectl -n <namespace> port-forward svc/<service> 9090:9090
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
```

Correlation flow:

1. confirm alert time and impacted objects
2. inspect deployment events in same window
3. compare Prometheus series, Loki logs, and app logs
4. rule out scrape lag or stale dashboards

## GPU / AI Infrastructure

### GPU Discovery and CUDA Validation

```bash
nvidia-smi
nvidia-smi -L
nvidia-smi topo -m
nvidia-smi dmon -s pucm
nvcc --version
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
```

### MIG Basics

```bash
nvidia-smi -i 0 -q | grep -i mig -A4
nvidia-smi mig -lgip
nvidia-smi mig -lgi
```

### GPU Operator and DCGM

```bash
kubectl get pods -A | grep -E 'nvidia|gpu'
kubectl -n gpu-operator describe pod <pod>
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
```

### Container GPU Validation

```bash
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
# note: --limits was removed from `kubectl run` in v1.24; on newer clients
# set the GPU limit through --overrides or a Pod manifest instead
kubectl run gpu-check --rm -it --restart=Never \
  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
  --limits='nvidia.com/gpu=1' -- nvidia-smi
```

### Kubernetes GPU Troubleshooting

Check for:

- device plugin not running
- driver/container toolkit mismatch
- node missing `nvidia.com/gpu` allocatable resources
- MIG profile mismatch
- taints or tolerations blocking placement

Useful checks:

```bash
kubectl describe node <node> | grep -A5 -B2 -i nvidia
kubectl get node <node> -o jsonpath='{.status.allocatable}'
kubectl -n <namespace> describe pod <pod>
```

## Platform Troubleshooting Flows

### Pod Not Starting

```bash
kubectl -n <namespace> get pod <pod> -o wide
kubectl -n <namespace> describe pod <pod>
kubectl -n <namespace> logs <pod> --previous
kubectl -n <namespace> get events --sort-by=.lastTimestamp | tail -30
```

### Image Pull Errors

```bash
kubectl -n <namespace> describe pod <pod> | grep -A5 -i 'image'
crictl images | grep <image>
ctr -n k8s.io images ls | grep <image>
```

Check:

- image tag exists
- registry reachable
- pull secret valid
- node clock sane for token-based auth

### Failing Deployment

```bash
kubectl -n <namespace> rollout status deploy/<deployment>
kubectl -n <namespace> describe deploy/<deployment>
kubectl -n <namespace> get rs,pods -l app=<app> -o wide
```

### Node Not Ready

```bash
kubectl describe node <node>
journalctl -u k3s -n 100 --no-pager
systemctl status kubelet --no-pager
df -h
free -m
```
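
Before describing a single node, it can help to narrow the node list to the ones that need attention. A minimal sketch, assuming the default `kubectl get nodes --no-headers` column order (NAME first, STATUS second); the function name `not_ready_nodes` is hypothetical and not part of any tool:

```bash
# Hypothetical helper: print nodes whose STATUS column is not exactly "Ready".
# Assumes stdin is in the default `kubectl get nodes --no-headers` layout,
# where column 1 is NAME and column 2 is STATUS.
not_ready_nodes() {
  awk '$2 != "Ready" { print $1, $2 }'
}

# Usage (read-only):
#   kubectl get nodes --no-headers | not_ready_nodes
```

Cordoned nodes report `Ready,SchedulingDisabled`, so they are listed as well; that is intentional here, since a cordoned node usually deserves a second look.
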
Check:

- kubelet or k3s service state
- disk pressure
- cert expiry
- CNI failure
- API reachability

### Storage Provisioning Issues

```bash
kubectl get pvc,pv -A
kubectl -n <namespace> describe pvc <pvc>
kubectl get sc
kubectl -n kube-system logs deploy/<csi-controller> --tail=100
```

Check:

- storage class defaulting
- access mode mismatch
- CSI controller errors
- backend quota or LUN exhaustion
- node attachment failures
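
To decide which claims to describe first, the PVC listing can be reduced to the claims that are not bound. A minimal sketch, assuming the default `kubectl get pvc -A --no-headers` column order (NAMESPACE, NAME, STATUS, ...); the function name `unbound_pvcs` is hypothetical:

```bash
# Hypothetical helper: print "<namespace>/<name> <status>" for every PVC
# whose STATUS column is not "Bound".
# Assumes stdin is in the default `kubectl get pvc -A --no-headers` layout,
# where columns 1-3 are NAMESPACE, NAME, STATUS.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}

# Usage (read-only):
#   kubectl get pvc -A --no-headers | unbound_pvcs
```

Each line of output names a claim worth passing to `kubectl -n <namespace> describe pvc <pvc>`, where the events usually point at one of the causes listed above.
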