Mateusz Suski 0d3905b8a1
Add operational cheatsheets across repository
2026-05-09 09:41:55 +00:00

Platform Engineering Cheatsheet

Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.

Kubernetes / K3s

Contexts, Namespaces, and Basic Workflows

kubectl config get-contexts
kubectl config use-context <context>
kubectl get ns
kubectl -n <ns> get pods -o wide
kubectl -n <ns> get deploy,sts,ds,svc,ingress
kubectl get nodes -o wide

Describe, Logs, Exec, Events

kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --tail=100
kubectl -n <ns> logs <pod> -c <container> --previous
kubectl -n <ns> exec -it <pod> -- sh
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30

Rollout Troubleshooting

kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> rollout history deploy/<name>
kubectl -n <ns> rollout undo deploy/<name>
kubectl -n <ns> get rs -l app=<name>

Safe pattern:

  1. kubectl diff -f <manifest>
  2. apply to non-prod or canary namespace
  3. watch rollout and events
  4. validate service and logs
  5. expand scope only after post-check
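
The safe pattern above can be sketched as a small wrapper. This is a minimal sketch, assuming a Deployment manifest and a canary namespace; the function name `staged_apply` and the `KUBECTL` override variable are illustrative and not part of kubectl.

```shell
# Hypothetical helper: diff, apply to one namespace, then wait for the rollout.
# KUBECTL is an assumed override hook; it defaults to plain kubectl.
staged_apply() {
  manifest=$1 ns=$2 deploy=$3
  kubectl_cmd=${KUBECTL:-kubectl}

  # kubectl diff exits 0 when nothing changes, 1 when there are differences,
  # and greater than 1 on errors; only the error case should stop the rollout.
  $kubectl_cmd -n "$ns" diff -f "$manifest"
  rc=$?
  if [ "$rc" -gt 1 ]; then
    echo "diff failed (exit $rc); not applying" >&2
    return "$rc"
  fi
  if [ "$rc" -eq 0 ]; then
    echo "no changes for $ns"
    return 0
  fi

  $kubectl_cmd -n "$ns" apply -f "$manifest" || return 1
  # Block until the rollout finishes or the timeout expires.
  $kubectl_cmd -n "$ns" rollout status "deploy/$deploy" --timeout=120s
}
```

Run it against the canary namespace first, check events and logs, then repeat for the wider scope.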

Node Validation

kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>

Pending / CrashLoopBackOff Flow

Pending:

kubectl -n <ns> describe pod <pod>
kubectl get events -A --sort-by=.lastTimestamp | tail -50

Check for:

  • unsatisfied CPU/memory requests
  • missing PVC
  • taints/tolerations mismatch
  • image pull secret issues
  • node selectors or affinity mismatch
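
Most items in this list surface as FailedScheduling event messages. The sketch below maps typical scheduler wording to the causes above; the function name `classify_pending` is hypothetical, and the matched substrings are assumptions about message text that can differ between Kubernetes versions.

```shell
# Hypothetical helper: read scheduler messages on stdin and name likely causes.
# The substrings are typical scheduler wording, not a stable interface.
classify_pending() {
  msg=$(cat)
  hit=0
  case "$msg" in *Insufficient*)
    echo "resources: requests exceed free node capacity"; hit=1 ;; esac
  case "$msg" in *PersistentVolumeClaim*)
    echo "storage: PVC missing or unbound"; hit=1 ;; esac
  case "$msg" in *taint*)
    echo "taints: toleration missing for a node taint"; hit=1 ;; esac
  case "$msg" in *affinity*|*selector*)
    echo "placement: node selector or affinity mismatch"; hit=1 ;; esac
  # Fall through to the raw message when nothing matched.
  [ "$hit" -eq 1 ] || echo "unclassified: $msg"
}
```

One way to feed it: kubectl -n <ns> get events --field-selector reason=FailedScheduling -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' | classify_pending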

CrashLoopBackOff:

kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'

Check for:

  • bad config or missing env vars
  • probe failures
  • dependency timeouts
  • permission or filesystem errors
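
The exit code of the last terminated container often narrows this list quickly. A minimal sketch, assuming the conventional 128 + signal encoding; the function name `explain_exit_code` and the wording of its hints are illustrative.

```shell
# Hypothetical helper: map a container exit code to a likely cause.
explain_exit_code() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly; check restartPolicy and probes" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found in image" ;;
    137) echo "SIGKILL (128+9); often OOMKilled" ;;
    139) echo "SIGSEGV (128+11)" ;;
    143) echo "SIGTERM (128+15)" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)".
      if [ "$code" -gt 128 ]; then
        echo "killed by signal $((code - 128))"
      else
        echo "application error; read previous logs"
      fi ;;
  esac
}
```

The code itself can be read with kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'.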

Helm

helm repo list
helm repo update
helm list -A
helm -n <ns> get values <release> -a
helm -n <ns> get manifest <release>
helm upgrade --install <release> <chart> -n <ns> -f values.yaml
helm rollback -n <ns> <release> <revision>
helm template <release> <chart> -f values.yaml | less

Validation:

helm lint <chart>
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20

Docker / Podman

docker images
docker ps -a
docker logs --tail 100 <container>
docker exec -it <container> sh
docker inspect <container>
docker volume ls
docker network ls
docker system df
docker image prune -f         # cleanup: -f skips the prompt, so review docker images first
docker container prune -f     # cleanup: -f skips the prompt, so review docker ps -a first
podman ps -a
podman inspect <container>

Container validation:

docker exec <container> env | sort
docker exec <container> ss -ltnp
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>

Terraform

Core Commands

terraform fmt -check -recursive
terraform init
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
terraform destroy -target=<resource>   # impact: targeted destruction needs review
terraform state list
terraform state show <resource>
terraform import <resource> <id>

Safe Workflow

  1. terraform fmt -check -recursive
  2. confirm provider auth and backend access, then terraform init
  3. terraform validate
  4. terraform plan -out=tfplan to save the plan artifact
  5. review the saved plan for replacements and destroys
  6. apply the reviewed plan only
  7. validate resource state outside Terraform

Plan review focus:

  • unexpected replacement
  • drift on security groups, routes, storage, or instance identity
  • provider alias mistakes
  • wrong workspace or backend
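
Replacements and destroys can be pulled out of a saved plan before anyone reads the full diff. A minimal sketch that filters the text rendering of a plan; the function name `risky_changes` is hypothetical, and the matched phrases assume Terraform's usual plan header wording.

```shell
# Hypothetical helper: read a plan's text rendering on stdin and print only
# the resource headers that announce a replacement or a destroy.
risky_changes() {
  grep -E '^[[:space:]]*# .+ (must be replaced|will be destroyed)' |
    sed -E 's/^[[:space:]]*# //'
}
```

Typical use: terraform show -no-color tfplan | risky_changes. Empty output means no header matched, which is a hint and not a substitute for reading the plan.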

CI/CD Operations

GitLab CI

gitlab-runner verify
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects/<id>/pipelines

Jenkins

systemctl status jenkins --no-pager
journalctl -u jenkins -n 100 --no-pager
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs

Runners, Artifacts, Pipeline Failures

docker logs --tail 100 gitlab-runner
kubectl -n ci get pods
kubectl -n ci logs deploy/runner-controller --tail=100

Troubleshooting flow:

  1. validate YAML or Jenkinsfile syntax
  2. confirm runner/agent availability
  3. inspect job logs for auth, cache, DNS, or registry failures
  4. verify artifacts were uploaded and not expired
  5. correlate with platform outages, image changes, or secret rotation

YAML validation:

yamllint .
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml

Observability

Prometheus

curl -s http://prometheus:9090/-/ready
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'

Loki

curl -s http://loki:3100/ready
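# log queries go to query_range; /query serves instant metric queries
curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"' --data-urlencode 'limit=50'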

Grafana

curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50

Metrics Validation and Log Correlation

kubectl -n <ns> port-forward svc/<svc> 9090:9090 &   # blocks otherwise; or run it in a second terminal
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'

Correlation flow:

  1. confirm alert time and impacted objects
  2. inspect deployment events in same window
  3. compare Prometheus series, Loki logs, and app logs
  4. rule out scrape lag or stale dashboards
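
Steps 1 to 3 depend on querying every source over the same time window. The sketch below derives that window from the alert time; the function name `alert_window` and the 300-second default are assumptions, and the output uses Unix nanoseconds, one of the formats Loki accepts for start and end.

```shell
# Hypothetical helper: print "start end" in Unix nanoseconds for a window of
# +/- half_width seconds around an alert time given in Unix seconds.
alert_window() {
  alert_ts=$1 half_width=${2:-300}
  # Appending nine zeros converts whole seconds to nanoseconds without
  # relying on 64-bit multiplication in the shell.
  echo "$((alert_ts - half_width))000000000 $((alert_ts + half_width))000000000"
}
```

The two values can be passed to a Loki query_range call as --data-urlencode "start=..." and --data-urlencode "end=...", and the same bounds reused when reading deployment events.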

GPU / AI Infrastructure

GPU Discovery and CUDA Validation

nvidia-smi
nvidia-smi -L
nvidia-smi topo -m
nvidia-smi dmon -s pucm
nvcc --version
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'

MIG Basics

nvidia-smi -i 0 -q | grep -i mig -A4
nvidia-smi mig -lgip
nvidia-smi mig -lgi

GPU Operator and DCGM

kubectl get pods -A | grep -E 'nvidia|gpu'
kubectl -n gpu-operator describe pod <pod>
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100

Container GPU Validation

docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
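# newer kubectl releases no longer accept --limits on run; set the GPU limit through --overrides
kubectl run gpu-check --rm -it --restart=Never \
  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-check","image":"nvidia/cuda:12.3.2-base-ubuntu22.04","command":["nvidia-smi"],"stdin":true,"tty":true,"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'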

Kubernetes GPU Troubleshooting

Check for:

  • device plugin not running
  • driver/container toolkit mismatch
  • node missing nvidia.com/gpu allocatable resources
  • MIG profile mismatch
  • taints or tolerations blocking placement

Useful checks:

kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
kubectl -n <ns> describe pod <gpu-pod>
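
The allocatable check can be turned into a pass/fail test for scripts. A minimal sketch; the function name `node_has_gpu` and the `KUBECTL` override variable are hypothetical.

```shell
# Hypothetical helper: succeed only if the node advertises at least one GPU.
node_has_gpu() {
  node=$1
  kubectl_cmd=${KUBECTL:-kubectl}
  # Dots inside the resource name must be escaped in jsonpath.
  count=$($kubectl_cmd get node "$node" \
    -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
  # An empty result means the device plugin has not registered the resource.
  [ "${count:-0}" -gt 0 ]
}
```

A failing result points at the first three items in the list above: device plugin, driver or toolkit, or missing allocatable resources.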

Platform Troubleshooting Flows

Pod Not Starting

kubectl -n <ns> get pod <pod> -o wide
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30

Image Pull Errors

kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
crictl images | grep <image>
ctr -n k8s.io images ls | grep <image>

Check:

  • image tag exists
  • registry reachable
  • pull secret valid
  • node clock in sync (skew breaks token-based auth)

Failing Deployment

kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> describe deploy/<name>
kubectl -n <ns> get rs,pods -l app=<name> -o wide

Node Not Ready

kubectl describe node <node>
journalctl -u k3s -n 100 --no-pager        # K3s server; use -u k3s-agent on agent nodes
systemctl status kubelet --no-pager        # non-K3s nodes only; K3s embeds the kubelet
df -h
free -m

Check:

  • kubelet or k3s service state
  • disk pressure
  • cert expiry
  • CNI failure
  • API reachability
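
Disk pressure is the quickest item here to confirm from the node itself. A minimal sketch that filters `df -P` output; the function name `mounts_over` and the 85 percent default are assumptions, not kubelet settings.

```shell
# Hypothetical helper: read `df -P` output on stdin and print mount points at
# or above a usage threshold (percent, default 85).
mounts_over() {
  awk -v limit="${1:-85}" 'NR > 1 {
    used = $5; sub(/%/, "", used)        # strip the trailing percent sign
    if (used + 0 >= limit + 0) print $6, $5
  }'
}
```

Typical use: df -P | mounts_over 90. Mount points containing spaces are not handled by this sketch.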

Storage Provisioning Issues

kubectl get pvc,pv -A
kubectl -n <ns> describe pvc <pvc>
kubectl get sc
kubectl -n kube-system logs deploy/<csi-controller> --tail=100

Check:

  • storage class defaulting
  • access mode mismatch
  • CSI controller errors
  • backend quota or LUN exhaustion
  • node attachment failures
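
To narrow the checks to the claims that actually need attention, the PVC listing can be filtered by status. A minimal sketch; the function name `unbound_pvcs` is hypothetical, and it assumes the column order printed by kubectl get pvc -A --no-headers (namespace, name, status).

```shell
# Hypothetical helper: read `kubectl get pvc -A --no-headers` output on stdin
# and print namespace/name plus status for every claim that is not Bound.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}
```

Typical use: kubectl get pvc -A --no-headers | unbound_pvcs, followed by kubectl -n <ns> describe pvc <pvc> for each line printed.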