# Platform Engineering Cheatsheet

Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.

## Kubernetes / K3s

### Contexts, Namespaces, and Basic Workflows

```bash
kubectl config get-contexts
kubectl config use-context <context>
kubectl get ns
kubectl -n <ns> get pods -o wide
kubectl -n <ns> get deploy,sts,ds,svc,ingress
kubectl get nodes -o wide
```

### Describe, Logs, Exec, Events

```bash
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --tail=100
kubectl -n <ns> logs <pod> -c <container> --previous
kubectl -n <ns> exec -it <pod> -- sh
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
```

### Rollout Troubleshooting

```bash
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> rollout history deploy/<name>
kubectl -n <ns> rollout undo deploy/<name>
kubectl -n <ns> get rs -l app=<name>   # assumes the workload is labeled app=<name>
```

Safe pattern:

1. `kubectl diff -f <manifest>`
2. apply to a non-prod or canary namespace
3. watch rollout and events
4. validate service and logs
5. expand scope only after the post-check passes

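The safe pattern above can be sketched as a small shell function. This is a minimal sketch, not a definitive implementation: the name `staged_apply`, its three arguments, and the 120s timeout are illustrative assumptions. It relies on `kubectl diff` exiting 0 when nothing would change, 1 when differences exist, and a higher code on error.

```bash
# Hypothetical helper: diff first, apply only on a real change, then wait for the rollout.
# Usage: staged_apply <ns> <manifest> <deployment-name>
staged_apply() {
  ns=$1 manifest=$2 deploy=$3

  # kubectl diff: 0 = no changes, 1 = differences found, >1 = error
  rc=0
  kubectl -n "$ns" diff -f "$manifest" || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "no changes to apply"
    return 0
  elif [ "$rc" -gt 1 ]; then
    echo "kubectl diff failed (rc=$rc)" >&2
    return "$rc"
  fi

  kubectl -n "$ns" apply -f "$manifest" || return 1
  # The timeout is an assumed value; tune it to the workload.
  kubectl -n "$ns" rollout status "deploy/$deploy" --timeout=120s
}
```

Run it against the non-prod or canary namespace first, then repeat against the wider scope once service and log checks pass.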
### Node Validation

```bash
kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
```

### Pending / CrashLoopBackOff Flow

Pending:

```bash
kubectl -n <ns> describe pod <pod>
kubectl get events -A --sort-by=.lastTimestamp | tail -50
```

Check for:

- unsatisfied CPU/memory requests
- missing PVC
- taints/tolerations mismatch
- image pull secret issues
- node selectors or affinity mismatch

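To tie a scheduler event back to the Pending checklist above, a rough classifier can help. This is a sketch under assumptions: `classify_pending` is a hypothetical helper, and the matched substrings are typical kube-scheduler wording that can vary between Kubernetes versions. Only the first matching cause is reported, so read the full message when several causes are listed.

```bash
# Hypothetical helper: map a FailedScheduling message to one checklist item.
classify_pending() {
  msg=$1
  case "$msg" in
    *"Insufficient cpu"*|*"Insufficient memory"*)
      echo "unsatisfied CPU/memory requests" ;;
    *"unbound immediate PersistentVolumeClaims"*|*persistentvolumeclaim*"not found"*)
      echo "missing PVC" ;;
    *taint*)
      echo "taints/tolerations mismatch" ;;
    *"node affinity"*|*"node selector"*)
      echo "node selectors or affinity mismatch" ;;
    *)
      echo "unclassified: read the full event message" ;;
  esac
}

# Example input source (placeholders as elsewhere in this sheet):
#   kubectl -n <ns> get events \
#     --field-selector involvedObject.name=<pod>,reason=FailedScheduling \
#     -o jsonpath='{.items[*].message}'
```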
CrashLoopBackOff:

```bash
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
```

Check for:

- bad config or missing env vars
- probe failures
- dependency timeouts
- permission or filesystem errors

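The `lastState` output above includes an exit code for the terminated container, and that code narrows the CrashLoopBackOff checklist. The mapping below is a hedged sketch: `explain_exit` is a hypothetical helper, and the hints follow common shell and signal conventions (128 + signal number), not guarantees about any particular workload.

```bash
# Hypothetical helper: turn a container exit code into a first guess.
explain_exit() {
  code=$1
  case "$code" in
    0)   echo "exited cleanly; check why the process ends instead of staying up" ;;
    1)   echo "application error; check config, env vars, and logs --previous" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check image entrypoint and PATH" ;;
    137) echo "SIGKILL (128+9); check for OOMKilled or a forced kill after probe failure" ;;
    143) echo "SIGTERM (128+15); check probe failures and shutdown handling" ;;
    *)
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "terminated by signal $((code - 128))"
      else
        echo "application-specific exit code $code; read logs --previous"
      fi ;;
  esac
}

# Example input source:
#   kubectl -n <ns> get pod <pod> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
```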
## Helm

```bash
helm repo list
helm repo update
helm list -A
helm -n <ns> get values <release> -a
helm -n <ns> get manifest <release>
helm upgrade --install <release> <chart> -n <ns> -f values.yaml
helm rollback -n <ns> <release> <revision>
helm template <release> <chart> -f values.yaml | less
```

Validation:

```bash
helm lint <chart>
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20
```

## Docker / Podman

```bash
docker images
docker ps -a
docker logs --tail 100 <container>
docker exec -it <container> sh
docker inspect <container>
docker volume ls
docker network ls
docker system df
docker image prune -f      # cleanup: review first
docker container prune -f  # cleanup: review first
podman ps -a
podman inspect <container>
```

Container validation:

```bash
docker exec <container> env | sort
docker exec <container> ss -ltnp
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
```

## Terraform

### Core Commands

```bash
terraform fmt -check -recursive
terraform init
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
terraform destroy -target=<resource>  # impact: targeted destruction needs review
terraform state list
terraform state show <resource>
terraform import <resource> <id>
```

### Safe Workflow

1. `terraform fmt -check -recursive`
2. `terraform validate` (needs an initialized directory; `terraform init -backend=false` is enough for this step)
3. refresh provider auth and backend access, then `terraform init`
4. run `terraform plan -out=tfplan` and review the output for replacements and destroys
5. keep the saved plan as the review artifact (`terraform show tfplan`)
6. apply the reviewed plan only: `terraform apply tfplan`
7. validate resource state outside Terraform

Plan review focus:

- unexpected replacement
- drift on security groups, routes, storage, or instance identity
- provider alias mistakes
- wrong workspace or backend

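The review step for destroys can be partly automated by reading the plan summary line. This is a minimal sketch that assumes the human-readable summary has the form `Plan: 1 to add, 0 to change, 2 to destroy.`. `plan_destroy_count` and `gate_plan` are hypothetical helper names. A replacement counts as both an add and a destroy in that summary, so replacements trip the gate too.

```bash
# Hypothetical helpers: read `terraform plan -no-color` output on stdin.

# Print the number from "... N to destroy." (prints nothing if no summary line exists).
plan_destroy_count() {
  sed -n 's/.*[^0-9]\([0-9][0-9]*\) to destroy.*/\1/p' | tail -n 1
}

# Fail when the plan destroys anything, so a human has to look at it.
gate_plan() {
  n=$(plan_destroy_count)
  if [ "${n:-0}" -gt 0 ]; then
    echo "plan destroys $n resource(s); needs explicit review" >&2
    return 1
  fi
  echo "no destroys in plan"
}

# Example:
#   terraform plan -no-color -out=tfplan | gate_plan
```

The gate only covers the destroy count; provider alias, workspace, and backend mistakes still need a manual read of the plan.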
## CI/CD Operations

### GitLab CI

```bash
gitlab-runner verify
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
# quote the URL so the shell does not treat the placeholder as a redirection
curl -s --header "PRIVATE-TOKEN: $TOKEN" "https://gitlab.example/api/v4/projects/<id>/pipelines"
```

### Jenkins

```bash
systemctl status jenkins --no-pager
journalctl -u jenkins -n 100 --no-pager
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
```

### Runners, Artifacts, Pipeline Failures

```bash
docker logs --tail 100 gitlab-runner
kubectl -n ci get pods
kubectl -n ci logs deploy/runner-controller --tail=100
```

Troubleshooting flow:

1. validate YAML or Jenkinsfile syntax
2. confirm runner/agent availability
3. inspect job logs for auth, cache, DNS, or registry failures
4. verify artifacts were uploaded and have not expired
5. correlate with platform outages, image changes, or secret rotation

YAML validation:

```bash
yamllint .
# requires PyYAML; checks syntax only, not GitLab CI semantics
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
```

## Observability

### Prometheus

```bash
curl -s http://prometheus:9090/-/ready
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
```

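The `up` query above returns every target. A small filter that keeps only the failing ones is often easier to scan. This is a sketch assuming `jq` is installed; `down_targets` is a hypothetical helper that reads the query response on stdin and prints job and instance for series whose value is `"0"`.

```bash
# Hypothetical helper: list targets whose `up` sample is 0 (tab-separated job and instance).
down_targets() {
  jq -r '.data.result[]
         | select(.value[1] == "0")
         | "\(.metric.job // "-")\t\(.metric.instance // "-")"'
}

# Example:
#   curl -s 'http://prometheus:9090/api/v1/query?query=up' | down_targets
```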
### Loki

```bash
curl -s http://loki:3100/ready
# log (stream) queries go through query_range; the instant /query endpoint is for metric queries
curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"' --data-urlencode 'limit=50'
```

### Grafana

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
```

### Metrics Validation and Log Correlation

```bash
# port-forward stays in the foreground; run the curl from a second terminal
kubectl -n <ns> port-forward svc/<svc> 9090:9090
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
```

Correlation flow:

1. confirm alert time and impacted objects
2. inspect deployment events in the same window
3. compare Prometheus series, Loki logs, and app logs
4. rule out scrape lag or stale dashboards

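For the correlation flow above, it helps to pin every query to the same time window around the alert. The sketch below assumes the alert time is available as Unix seconds. `window_ns` is a hypothetical helper that prints the start and end of a symmetric window as Unix nanoseconds, the format Loki accepts for the `start` and `end` parameters of `query_range`.

```bash
# Hypothetical helper: window_ns <alert_epoch_seconds> <minutes>
# Prints "<start_ns> <end_ns>" for a window of +/- <minutes> around the alert.
window_ns() {
  t=$1 m=$2
  start=$(( t - m * 60 ))
  end=$(( t + m * 60 ))
  # Append nine zeros to convert seconds to nanoseconds.
  echo "${start}000000000 ${end}000000000"
}

# Example (GNU date assumed for -d):
#   set -- $(window_ns "$(date -d '2024-05-01T12:00:00Z' +%s)" 10)
#   curl -Gs http://loki:3100/loki/api/v1/query_range \
#     --data-urlencode 'query={app="nginx"} |= "error"' \
#     --data-urlencode "start=$1" --data-urlencode "end=$2"
```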
## GPU / AI Infrastructure

### GPU Discovery and CUDA Validation

```bash
nvidia-smi
nvidia-smi -L
nvidia-smi topo -m
nvidia-smi dmon -s pucm
nvcc --version
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
```

### MIG Basics

```bash
nvidia-smi -i 0 -q | grep -i mig -A4
nvidia-smi mig -lgip
nvidia-smi mig -lgi
```

### GPU Operator and DCGM

```bash
kubectl get pods -A | grep -E 'nvidia|gpu'
kubectl -n gpu-operator describe pod <pod>
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
```

### Container GPU Validation

```bash
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
# `kubectl run --limits` was removed in kubectl 1.24; set the GPU limit through --overrides
kubectl run gpu-check --rm --attach --restart=Never \
  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-check","image":"nvidia/cuda:12.3.2-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'
```

### Kubernetes GPU Troubleshooting

Check for:

- device plugin not running
- driver/container toolkit mismatch
- node missing `nvidia.com/gpu` allocatable resources
- MIG profile mismatch
- taints or tolerations blocking placement

Useful checks:

```bash
kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
kubectl -n <ns> describe pod <gpu-pod>
```

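The allocatable check above can be narrowed to the GPU resource alone. This is a sketch under assumptions: `gpu_allocatable` is a hypothetical helper, and it only looks at the `nvidia.com/gpu` resource name. Nodes that expose MIG slices under separate resource names would report nothing here even when healthy.

```bash
# Hypothetical helper: fail when a node advertises no allocatable nvidia.com/gpu.
gpu_allocatable() {
  node=$1
  # The dots in the resource name must be escaped inside the JSONPath expression.
  n=$(kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')
  if [ -z "$n" ] || [ "$n" = "0" ]; then
    echo "$node: no allocatable nvidia.com/gpu; check device plugin and driver" >&2
    return 1
  fi
  echo "$node: $n allocatable GPU(s)"
}
```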
## Platform Troubleshooting Flows

### Pod Not Starting

```bash
kubectl -n <ns> get pod <pod> -o wide
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
```

### Image Pull Errors

```bash
kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
crictl images | grep <image>
ctr -n k8s.io images ls | grep <image>
```

Check:

- image tag exists
- registry reachable
- pull secret valid
- node clock sane for token-based auth

### Failing Deployment

```bash
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> describe deploy/<name>
kubectl -n <ns> get rs,pods -l app=<name> -o wide
```

### Node Not Ready

```bash
kubectl describe node <node>
journalctl -u k3s -n 100 --no-pager   # K3s server nodes; agent nodes log under k3s-agent
systemctl status kubelet --no-pager   # non-K3s nodes; K3s runs the kubelet inside its own service
df -h
free -m
```

Check:

- kubelet or k3s service state
- disk pressure
- cert expiry
- CNI failure
- API reachability

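For the disk pressure item above, `df` output can be filtered down to the filesystems that are nearly full. This is a minimal sketch: `full_filesystems` is a hypothetical helper, and the default threshold of 85% is an assumption, not the kubelet's eviction threshold. It reads `df -P` output on stdin and does not handle mount points that contain spaces.

```bash
# Hypothetical helper: full_filesystems [threshold_percent]
# Reads `df -P` output on stdin; prints "<mount> <use>%" at or above the threshold.
full_filesystems() {
  threshold=${1:-85}
  awk -v t="$threshold" '
    NR > 1 {
      use = $5
      sub(/%/, "", use)          # strip the percent sign from the Capacity column
      if (use + 0 >= t) print $6, use "%"
    }'
}

# Example:
#   df -P | full_filesystems 85
```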
### Storage Provisioning Issues

```bash
kubectl get pvc,pv -A
kubectl -n <ns> describe pvc <pvc>
kubectl get sc
kubectl -n kube-system logs deploy/<csi-controller> --tail=100
```

Check:

- storage class defaulting
- access mode mismatch
- CSI controller errors
- backend quota or LUN exhaustion
- node attachment failures
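
When many claims exist, listing only those that are not bound gives a shorter starting point for the checks above. This is a sketch: `unbound_pvcs` is a hypothetical helper that reads `kubectl get pvc -A --no-headers` output on stdin, where the third column is the claim status.

```bash
# Hypothetical helper: print "<namespace>/<name> <status>" for every PVC that is not Bound.
unbound_pvcs() {
  awk '$3 != "Bound" { print $1 "/" $2, $3 }'
}

# Example:
#   kubectl get pvc -A --no-headers | unbound_pvcs
```

Follow up on each result with `kubectl -n <ns> describe pvc <pvc>` to read the provisioning events.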