Platform Engineering Cheatsheet
Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.
Kubernetes / K3s
Contexts, Namespaces, and Basic Workflows
kubectl config get-contexts
kubectl config use-context <context>
kubectl get ns
kubectl -n <ns> get pods -o wide
kubectl -n <ns> get deploy,sts,ds,svc,ingress
kubectl get nodes -o wide
Describe, Logs, Exec, Events
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --tail=100
kubectl -n <ns> logs <pod> -c <container> --previous
kubectl -n <ns> exec -it <pod> -- sh
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
Rollout Troubleshooting
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> rollout history deploy/<name>
kubectl -n <ns> rollout undo deploy/<name>
kubectl -n <ns> get rs -l app=<name>
Safe pattern:
- kubectl diff -f <manifest>
- apply to non-prod or canary namespace
- watch rollout and events
- validate service and logs
- expand scope only after post-check
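The safe pattern above can be sketched as a small wrapper. This is a minimal sketch rather than a finished rollout tool: the function name `staged_apply`, its argument order, and the 120s timeout are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Hypothetical helper: diff, apply to one namespace, wait, then show events.
staged_apply() {
  local manifest="$1" ns="$2" deploy="$3"
  local rc=0

  # kubectl diff exits 0 (no changes), 1 (changes found), >1 (error).
  kubectl -n "$ns" diff -f "$manifest" || rc=$?
  if [ "$rc" -gt 1 ]; then
    echo "diff failed (rc=$rc); not applying" >&2
    return "$rc"
  fi

  kubectl -n "$ns" apply -f "$manifest" || return 1
  # Timeout value is an assumption; tune per workload.
  kubectl -n "$ns" rollout status "deploy/$deploy" --timeout=120s || return 1
  kubectl -n "$ns" get events --sort-by=.lastTimestamp | tail -20
}

# Usage (namespace and names are placeholders):
#   staged_apply app.yaml canary web
```

Run it against the canary namespace first, validate service and logs by hand, and only then repeat for the next namespace.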
Node Validation
kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
Pending / CrashLoopBackOff Flow
Pending:
kubectl -n <ns> describe pod <pod>
kubectl get events -A --sort-by=.lastTimestamp | tail -50
Check for:
- unsatisfied CPU/memory requests
- missing PVC
- taints/tolerations mismatch
- image pull secret issues
- node selectors or affinity mismatch
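As a rough aid, the checklist above can be mapped onto event text. The sketch below reads `describe pod` or event output on stdin and prints the matching checklist items. The function name `classify_pending` is hypothetical, and the message fragments are assumptions that vary by Kubernetes version, so treat a match as a hint rather than a diagnosis.

```shell
#!/usr/bin/env bash
# Hypothetical helper: map pod event text (stdin) to the Pending checklist.
classify_pending() {
  local events label pattern
  events="$(cat)"
  # Each line below is "label=extended-regex"; fragments are best-effort.
  while IFS='=' read -r label pattern; do
    if printf '%s\n' "$events" | grep -Eqi "$pattern"; then
      echo "$label"
    fi
  done <<'EOF'
unsatisfied CPU/memory requests=Insufficient (cpu|memory)
missing PVC=persistentvolumeclaim .* not found|unbound immediate PersistentVolumeClaims
taints/tolerations mismatch=untolerated taint|didn.t tolerate
image pull issues=ErrImagePull|ImagePullBackOff
node selectors or affinity mismatch=node affinity|node selector
EOF
}

# Usage:
#   kubectl -n <ns> describe pod <pod> | classify_pending
```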
CrashLoopBackOff:
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
Check for:
- bad config or missing env vars
- probe failures
- dependency timeouts
- permission or filesystem errors
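The last exit code often narrows down which of the items above to check first. A minimal sketch, assuming the conventional shell and signal exit codes; the function name `explain_exit_code` and the wording of the hints are illustrative only.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a container exit code into a first guess.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to stay up" ;;
    1)   echo "application error; check config and env vars in previous logs" ;;
    126) echo "command not executable; check file permissions" ;;
    127) echo "command not found; check image contents and entrypoint" ;;
    137) echo "SIGKILL (128+9); often OOMKilled or killed after the grace period" ;;
    143) echo "SIGTERM (128+15); container was asked to stop" ;;
    *)   echo "unrecognized; inspect previous logs" ;;
  esac
}

# Usage (first container of the pod):
#   code=$(kubectl -n <ns> get pod <pod> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
```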
Helm
helm repo list
helm repo update
helm list -A
helm -n <ns> get values <release> -a
helm -n <ns> get manifest <release>
helm upgrade --install <release> <chart> -n <ns> -f values.yaml
helm rollback -n <ns> <release> <revision>
helm template <release> <chart> -f values.yaml | less
Validation:
helm lint <chart>
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20
Docker / Podman
docker images
docker ps -a
docker logs --tail 100 <container>
docker exec -it <container> sh
docker inspect <container>
docker volume ls
docker network ls
docker system df
docker image prune -f # cleanup: review first
docker container prune -f # cleanup: review first
podman ps -a
podman inspect <container>
Container validation:
docker exec <container> env | sort
docker exec <container> ss -ltnp
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
Terraform
Core Commands
terraform fmt -check -recursive
terraform init
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
terraform destroy -target=<resource> # impact: targeted destruction needs review
terraform state list
terraform state show <resource>
terraform import <resource> <id>
Safe Workflow
- terraform fmt -check -recursive
- terraform validate
- refresh provider auth and backend access
- review plan output for replacements and destroys
- save plan artifact
- apply reviewed plan only
- validate resource state outside Terraform
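The "review plan output" step can be backed by a simple gate. The sketch below reads plain-text plan output on stdin and fails when the summary line reports destroys, which also catches replacements because they count as one add plus one destroy. The function names `plan_destroy_count` and `gate_plan` are hypothetical, and the gate assumes uncolored output (`-no-color`).

```shell
#!/usr/bin/env bash
# Hypothetical gate: parse the plan summary line and block on destroys.
plan_destroy_count() {
  # Summary line looks like: "Plan: 2 to add, 0 to change, 1 to destroy."
  sed -n 's/^Plan: .* \([0-9][0-9]*\) to destroy\..*$/\1/p' | tail -n 1
}

gate_plan() {
  local destroys
  destroys="$(plan_destroy_count)"
  destroys="${destroys:-0}"   # no summary line (e.g. "No changes.") means 0
  if [ "$destroys" -gt 0 ]; then
    echo "plan destroys $destroys resource(s); review before apply" >&2
    return 1
  fi
  echo "no destroys in plan summary"
}

# Usage:
#   terraform plan -no-color -out=tfplan
#   terraform show -no-color tfplan | gate_plan && terraform apply tfplan
```

A gate like this complements human review; it does not replace reading the plan.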
Plan review focus:
- unexpected replacement
- drift on security groups, routes, storage, or instance identity
- provider alias mistakes
- wrong workspace or backend
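The "wrong workspace" item is cheap to guard against before planning. A minimal sketch; the function name `require_workspace` is hypothetical, and it only covers workspaces, not backend configuration.

```shell
#!/usr/bin/env bash
# Hypothetical guard: compare the active workspace with the expected one.
require_workspace() {
  local expected="$1" actual
  actual="$(terraform workspace show)" || return 1
  if [ "$actual" != "$expected" ]; then
    echo "active workspace is '$actual', expected '$expected'" >&2
    return 1
  fi
}

# Usage:
#   require_workspace staging && terraform plan -out=tfplan
```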
CI/CD Operations
GitLab CI
gitlab-runner verify
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects/<id>/pipelines
Jenkins
systemctl status jenkins --no-pager
journalctl -u jenkins -n 100 --no-pager
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
Runners, Artifacts, Pipeline Failures
docker logs --tail 100 gitlab-runner
kubectl -n ci get pods
kubectl -n ci logs deploy/runner-controller --tail=100
Troubleshooting flow:
- validate YAML or Jenkinsfile syntax
- confirm runner/agent availability
- inspect job logs for auth, cache, DNS, or registry failures
- verify artifacts were uploaded and not expired
- correlate with platform outages, image changes, or secret rotation
YAML validation:
yamllint .
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
Observability
Prometheus
curl -s http://prometheus:9090/-/ready
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
Loki
curl -s http://loki:3100/ready
curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"' --data-urlencode 'limit=50' # log queries need query_range on recent Loki
Grafana
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
Metrics Validation and Log Correlation
kubectl -n <ns> port-forward svc/<svc> 9090:9090
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
Correlation flow:
- confirm alert time and impacted objects
- inspect deployment events in same window
- compare Prometheus series, Loki logs, and app logs
- rule out scrape lag or stale dashboards
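To compare sources over the same window, it helps to derive one start/end pair from the alert time and reuse it everywhere. A small sketch: the function name `alert_window_ns` and the default of 5 minutes are assumptions, and the output is in Unix nanoseconds, a format Loki's `query_range` accepts for `start` and `end`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build a +/- N minute window around an alert time.
# Input: alert time as Unix epoch seconds, optional minutes (default 5).
alert_window_ns() {
  local alert_epoch="$1" minutes="${2:-5}"
  local start=$(( alert_epoch - minutes * 60 ))
  local end=$(( alert_epoch + minutes * 60 ))
  # Append nine zeros to convert seconds to nanoseconds.
  echo "${start}000000000 ${end}000000000"
}

# Usage:
#   read -r start end <<< "$(alert_window_ns 1700000000 10)"
#   curl -Gs http://loki:3100/loki/api/v1/query_range \
#     --data-urlencode 'query={app="nginx"} |= "error"' \
#     --data-urlencode "start=$start" --data-urlencode "end=$end"
```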
GPU / AI Infrastructure
GPU Discovery and CUDA Validation
nvidia-smi
nvidia-smi -L
nvidia-smi topo -m
nvidia-smi dmon -s pucm
nvcc --version
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
MIG Basics
nvidia-smi -i 0 -q | grep -i mig -A4
nvidia-smi mig -lgip
nvidia-smi mig -lgi
GPU Operator and DCGM
kubectl get pods -A | grep -E 'nvidia|gpu'
kubectl -n gpu-operator describe pod <pod>
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
Container GPU Validation
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
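# newer kubectl releases reject --limits on kubectl run; request the GPU via --overrides
kubectl run gpu-check --rm --attach --restart=Never \
--image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
--overrides='{"spec":{"containers":[{"name":"gpu-check","image":"nvidia/cuda:12.3.2-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'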
Kubernetes GPU Troubleshooting
Check for:
- device plugin not running
- driver/container toolkit mismatch
- node missing nvidia.com/gpu allocatable resources
- MIG profile mismatch
- taints or tolerations blocking placement
Useful checks:
kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
kubectl -n <ns> describe pod <gpu-pod>
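To check the allocatable-resources item directly, the GPU count can be read with a jsonpath query. A minimal sketch; the function name `node_gpu_count` is hypothetical. Note the escaped dot in the resource name, which kubectl's jsonpath needs for keys that contain dots.

```shell
#!/usr/bin/env bash
# Hypothetical check: print a node's allocatable nvidia.com/gpu count.
node_gpu_count() {
  local count
  count="$(kubectl get node "$1" \
    -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')" || return 1
  # An empty result means the node advertises no GPUs.
  echo "${count:-0}"
}

# Usage:
#   node_gpu_count <gpu-node>
```

A result of 0 on a node with physical GPUs usually points back at the device plugin or driver items in the list above.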
Platform Troubleshooting Flows
Pod Not Starting
kubectl -n <ns> get pod <pod> -o wide
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
Image Pull Errors
kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
crictl images | grep <image>
ctr -n k8s.io images ls | grep <image>
Check:
- image tag exists
- registry reachable
- pull secret valid
- node clock sane for token-based auth
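The clock item can be checked on the node itself. A minimal sketch, assuming a systemd host where `timedatectl` is available; the function name `clock_synced` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical check for the node clock item on systemd hosts.
clock_synced() {
  local synced
  synced="$(timedatectl show --property=NTPSynchronized --value)" || return 2
  if [ "$synced" = "yes" ]; then
    echo "clock synchronized"
  else
    echo "clock NOT synchronized; token or TLS validation may fail" >&2
    return 1
  fi
}

# Usage (run on the affected node):
#   clock_synced
```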
Failing Deployment
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> describe deploy/<name>
kubectl -n <ns> get rs,pods -l app=<name> -o wide
Node Not Ready
kubectl describe node <node>
journalctl -u k3s -n 100 --no-pager # K3s server; use k3s-agent on agent nodes
systemctl status kubelet --no-pager # non-K3s nodes; K3s embeds the kubelet
df -h
free -m
Check:
- kubelet or k3s service state
- disk pressure
- cert expiry
- CNI failure
- API reachability
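For the disk pressure item, `df` output can be filtered down to the mounts worth looking at. A minimal sketch: the function name `full_mounts` and the 85% default threshold are assumptions, and mount points containing spaces are not handled.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `df -P` output on stdin and list mounts at or
# above a usage threshold (percent). Default threshold is an assumption.
full_mounts() {
  local threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)          # strip the percent sign
    if (use + 0 >= t) print $6, $5       # mount point, then usage
  }'
}

# Usage:
#   df -P | full_mounts 85
```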
Storage Provisioning Issues
kubectl get pvc,pv -A
kubectl -n <ns> describe pvc <pvc>
kubectl get sc
kubectl -n kube-system logs deploy/<csi-controller> --tail=100
Check:
- default StorageClass missing or ambiguous
- access mode mismatch
- CSI controller errors
- backend quota or LUN exhaustion
- node attachment failures
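A quick way to find where to start is to list only the claims that are not bound. A minimal sketch, assuming the default column order of `kubectl get pvc -A --no-headers` (namespace, name, status); the function name `unbound_pvcs` is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: read `kubectl get pvc -A --no-headers` on stdin and
# print namespace/name (status) for every claim whose STATUS is not Bound.
unbound_pvcs() {
  awk 'NF >= 3 && $3 != "Bound" { print $1 "/" $2 " (" $3 ")" }'
}

# Usage:
#   kubectl get pvc -A --no-headers | unbound_pvcs
```

Each line it prints is a candidate for `kubectl -n <ns> describe pvc <pvc>`.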