# Platform Engineering Cheatsheet

Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.

## Kubernetes / K3s

### Contexts, Namespaces, and Basic Workflows

```bash
kubectl config get-contexts
kubectl config use-context <context>
kubectl get ns
kubectl -n <ns> get pods -o wide
kubectl -n <ns> get deploy,sts,ds,svc,ingress
kubectl get nodes -o wide
```

### Describe, Logs, Exec, Events

```bash
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --tail=100
kubectl -n <ns> logs <pod> -c <container> --previous
kubectl -n <ns> exec -it <pod> -- sh
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
```

### Rollout Troubleshooting

```bash
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> rollout history deploy/<name>
kubectl -n <ns> rollout undo deploy/<name>
kubectl -n <ns> get rs -l app=<name>
```

Safe pattern:

1. `kubectl diff -f <manifest>`
2. apply to non-prod or canary namespace
3. watch rollout and events
4. validate service and logs
5. expand scope only after post-check
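
The safe pattern above can be sketched as a small wrapper. This is a minimal sketch, not a vetted rollout tool: the function name `safe_apply`, its argument order, and the 120-second timeout are assumptions made for illustration. It relies on `kubectl diff` exiting 0 when nothing would change, 1 when differences exist, and a higher code on errors.

```bash
# Hypothetical helper: diff, apply to one namespace, then wait for the rollout.
# Usage: safe_apply <namespace> <manifest> <deployment-name>
safe_apply() {
  local ns="$1" manifest="$2" deploy="$3" rc=0

  # kubectl diff exits 0 (no changes), 1 (changes found), or >1 (error).
  kubectl -n "$ns" diff -f "$manifest" || rc=$?
  if [ "$rc" -eq 0 ]; then
    echo "no changes to apply"
    return 0
  elif [ "$rc" -gt 1 ]; then
    echo "kubectl diff failed (exit $rc)" >&2
    return "$rc"
  fi

  kubectl -n "$ns" apply -f "$manifest" || return 1
  # Assumed timeout; tune it to the workload's normal startup time.
  kubectl -n "$ns" rollout status "deploy/$deploy" --timeout=120s || return 1
  # Recent events for the post-check step.
  kubectl -n "$ns" get events --sort-by=.lastTimestamp | tail -20
}
```

Run it against the canary namespace first, for example `safe_apply <canary-ns> <manifest> <name>`, and repeat for wider namespaces only after the service and log checks pass.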

### Node Validation

```bash
kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
```

### Pending / CrashLoopBackOff Flow

Pending:

```bash
kubectl -n <ns> describe pod <pod>
kubectl get events -A --sort-by=.lastTimestamp | tail -50
```

Check for:

- unsatisfied CPU/memory requests
- missing PVC
- taints/tolerations mismatch
- image pull secret issues
- node selectors or affinity mismatch
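
Most of the Pending causes listed above show up in the scheduler's `FailedScheduling` event message. The sketch below maps such messages to one of those causes. The function name `classify_pending` and the matched substrings are assumptions based on common kube-scheduler wording, which can differ between Kubernetes versions; when a message lists several reasons, only the first match is reported.

```bash
# Hypothetical helper: map scheduler messages (one per line on stdin) to a likely cause.
# The substrings are assumptions; adjust them to the wording your cluster emits.
classify_pending() {
  local msg
  while IFS= read -r msg; do
    case "$msg" in
      *Insufficient*)                      echo "resources: $msg" ;;
      *taint*)                             echo "taints: $msg" ;;
      *"node affinity"*|*"node selector"*) echo "affinity: $msg" ;;
      *[Pp]ersistent[Vv]olume[Cc]laim*)    echo "pvc: $msg" ;;
      *)                                   echo "other: $msg" ;;
    esac
  done
}

# Example (not run here): feed it the scheduling events of one namespace.
# kubectl -n <ns> get events --field-selector reason=FailedScheduling \
#   -o jsonpath='{range .items[*]}{.message}{"\n"}{end}' | classify_pending
```

Image pull secret problems do not appear in scheduler messages; they surface in the pod's container status and events instead.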

CrashLoopBackOff:

```bash
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
```

Check for:

- bad config or missing env vars
- probe failures
- dependency timeouts
- permission or filesystem errors
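
The `lastState` output includes the exit code of the previous container run, which often narrows the list above. The sketch below translates common exit codes into a likely cause. The function name `explain_exit_code` and the hint wording are assumptions; the signal arithmetic (128 plus the signal number) is standard shell convention.

```bash
# Hypothetical helper: translate a container exit code into a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to stay running" ;;
    1)   echo "application error; read the previous logs" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check image contents and entrypoint" ;;
    137) echo "SIGKILL (128+9); often OOMKilled or killed after the grace period" ;;
    139) echo "SIGSEGV (128+11); crash in native code" ;;
    143) echo "SIGTERM (128+15); terminated, for example during a rollout or eviction" ;;
    *)   echo "unrecognized exit code $1; check the application's documentation" ;;
  esac
}

# Example (not run here): explain the last exit code of a pod's first container.
# explain_exit_code "$(kubectl -n <ns> get pod <pod> \
#   -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```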

## Helm

```bash
helm repo list
helm repo update
helm list -A
helm -n <ns> get values <release> -a
helm -n <ns> get manifest <release>
helm upgrade --install <release> <chart> -n <ns> -f values.yaml
helm rollback -n <ns> <release> <revision>
helm template <release> <chart> -f values.yaml | less
```

Validation:

```bash
helm lint <chart>
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20
```

## Docker / Podman

```bash
docker images
docker ps -a
docker logs --tail 100 <container>
docker exec -it <container> sh
docker inspect <container>
docker volume ls
docker network ls
docker system df
docker image prune -f        # cleanup: review first
docker container prune -f    # cleanup: review first
podman ps -a
podman inspect <container>
```

Container validation:

```bash
docker exec <container> env | sort
docker exec <container> ss -ltnp
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
```

## Terraform

### Core Commands

```bash
terraform fmt -check -recursive
terraform init
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
terraform destroy -target=<resource>    # impact: targeted destruction needs review
terraform state list
terraform state show <resource>
terraform import <resource> <id>
```

### Safe Workflow

1. `terraform fmt -check -recursive`
2. `terraform validate`
3. refresh provider auth and backend access
4. review `plan` output for replacements and destroys
5. save plan artifact
6. apply reviewed plan only
7. validate resource state outside Terraform

Plan review focus:

- unexpected replacement
- drift on security groups, routes, storage, or instance identity
- provider alias mistakes
- wrong workspace or backend
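
The review step for replacements and destroys can be partly automated by scanning the human-readable plan text. This is a minimal sketch under stated assumptions: the function names `risky_changes` and `plan_is_safe` are hypothetical, and the matched phrases ("must be replaced", "will be destroyed") reflect the resource header lines Terraform prints in plan output, which may change between versions. It supplements a human review and does not replace it.

```bash
# Hypothetical helper: list resources that a plan would replace or destroy.
# Reads plan text on stdin, for example from `terraform show -no-color tfplan`.
risky_changes() {
  grep -E '^ *# .+ (must be replaced|will be destroyed)' || true
}

# Hypothetical gate: return non-zero (and print the hits to stderr)
# when the plan text on stdin contains any replace or destroy action.
plan_is_safe() {
  local hits
  hits="$(risky_changes)"
  if [ -n "$hits" ]; then
    printf '%s\n' "$hits" >&2
    return 1
  fi
}

# Example (not run here): apply the saved plan only when nothing is replaced or destroyed.
# terraform show -no-color tfplan | plan_is_safe && terraform apply tfplan
```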

## CI/CD Operations

### GitLab CI

```bash
gitlab-runner verify
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects/<id>/pipelines
```

### Jenkins

```bash
systemctl status jenkins --no-pager
journalctl -u jenkins -n 100 --no-pager
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
```

### Runners, Artifacts, Pipeline Failures

```bash
docker logs --tail 100 gitlab-runner
kubectl -n ci get pods
kubectl -n ci logs deploy/runner-controller --tail=100
```

Troubleshooting flow:

1. validate YAML or Jenkinsfile syntax
2. confirm runner/agent availability
3. inspect job logs for auth, cache, DNS, or registry failures
4. verify artifacts were uploaded and not expired
5. correlate with platform outages, image changes, or secret rotation

YAML validation:

```bash
yamllint .
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
```

## Observability

### Prometheus

```bash
curl -s http://prometheus:9090/-/ready
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
```

### Loki

```bash
curl -s http://loki:3100/ready
# log queries go to query_range; the instant /query endpoint only accepts metric queries
curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"' --data-urlencode 'limit=50'
```

### Grafana

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
```

### Metrics Validation and Log Correlation

```bash
kubectl -n <ns> port-forward svc/<svc> 9090:9090
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
```

Correlation flow:

1. confirm alert time and impacted objects
2. inspect deployment events in same window
3. compare Prometheus series, Loki logs, and app logs
4. rule out scrape lag or stale dashboards
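
Comparing sources "in the same window" is easier when every query uses identical start and end times. The sketch below derives that window from the alert time. The function name `alert_window` and the 15-minute default padding are assumptions; the commented Prometheus call uses the `query_range` endpoint, which accepts Unix timestamps for `start` and `end`.

```bash
# Hypothetical helper: print "<start> <end>" Unix timestamps around an alert time.
# Usage: alert_window <alert-epoch-seconds> [padding-minutes, default 15]
alert_window() {
  local alert_epoch="$1" pad_min="${2:-15}"
  echo "$(( alert_epoch - pad_min * 60 )) $(( alert_epoch + pad_min * 60 ))"
}

# Example (not run here): pull the `up` series for the window from Prometheus.
# read -r start end < <(alert_window 1700000000 10)
# curl -Gs http://prometheus:9090/api/v1/query_range \
#   --data-urlencode 'query=up' --data-urlencode "start=$start" \
#   --data-urlencode "end=$end" --data-urlencode 'step=30s'
```

Reusing the same pair of timestamps for the Loki and event queries keeps the three views aligned and makes scrape lag easier to spot.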

## GPU / AI Infrastructure

### GPU Discovery and CUDA Validation

```bash
nvidia-smi
nvidia-smi -L
nvidia-smi topo -m
nvidia-smi dmon -s pucm
nvcc --version
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
```

### MIG Basics

```bash
nvidia-smi -i 0 -q | grep -i mig -A4
nvidia-smi mig -lgip
nvidia-smi mig -lgi
```

### GPU Operator and DCGM

```bash
kubectl get pods -A | grep -E 'nvidia|gpu'
kubectl -n gpu-operator describe pod <pod>
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
```
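
### Container GPU Validation

```bash
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
# kubectl 1.24+ no longer accepts --limits on `kubectl run`; set the GPU limit via --overrides
kubectl run gpu-check --rm --attach --restart=Never \
  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-check","image":"nvidia/cuda:12.3.2-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":1}}}]}}'
```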

### Kubernetes GPU Troubleshooting

Check for:

- device plugin not running
- driver/container toolkit mismatch
- node missing `nvidia.com/gpu` allocatable resources
- MIG profile mismatch
- taints or tolerations blocking placement

Useful checks:

```bash
kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
kubectl -n <ns> describe pod <gpu-pod>
```
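
To check the "node missing `nvidia.com/gpu` allocatable resources" case directly, the allocatable map can be narrowed to the GPU entry. The sketch below is an illustration only: the function name `gpu_allocatable` is hypothetical, and it treats an empty result as zero GPUs.

```bash
# Hypothetical helper: print a node's allocatable nvidia.com/gpu count (0 if absent).
gpu_allocatable() {
  local node="$1" count
  # Dots inside the resource name must be escaped in jsonpath.
  count="$(kubectl get node "$node" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}')" || return 1
  echo "${count:-0}"
}
```

A result of 0 on a node that physically has GPUs usually points back to the first two items in the list above: the device plugin has not registered the devices, or the driver and container toolkit are out of step.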

## Platform Troubleshooting Flows

### Pod Not Starting

```bash
kubectl -n <ns> get pod <pod> -o wide
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
```

### Image Pull Errors

```bash
kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
crictl images | grep <image>
ctr -n k8s.io images ls | grep <image>
```

Check:

- image tag exists
- registry reachable
- pull secret valid
- node clock sane for token-based auth
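
The pull error text in the pod events usually indicates which of the four checks above to start with. The sketch below maps a message to one of them. The function name `classify_pull_error` and the matched substrings are assumptions drawn from common container runtime and registry errors; actual wording depends on the runtime and registry in use.

```bash
# Hypothetical helper: map image pull error messages (one per line on stdin)
# to the check they most likely point at. Substrings are assumptions.
classify_pull_error() {
  local msg
  while IFS= read -r msg; do
    case "$msg" in
      *"manifest unknown"*|*"not found"*)                      echo "tag" ;;
      *unauthorized*|*"authentication required"*|*denied*)     echo "pull-secret" ;;
      *"no such host"*|*"i/o timeout"*|*"connection refused"*) echo "registry" ;;
      *x509*)                                                  echo "clock-or-cert" ;;
      *)                                                       echo "other" ;;
    esac
  done
}
```

An `x509` error covers both an expired registry certificate and a skewed node clock, so compare `date -u` on the node with a trusted time source before rotating anything.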

### Failing Deployment

```bash
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> describe deploy/<name>
kubectl -n <ns> get rs,pods -l app=<name> -o wide
```

### Node Not Ready

```bash
kubectl describe node <node>
journalctl -u k3s -n 100 --no-pager     # K3s server; agent nodes run the k3s-agent unit
systemctl status kubelet --no-pager     # non-K3s nodes; K3s embeds the kubelet in its own service
df -h
free -m
```

Check:

- kubelet or k3s service state
- disk pressure
- cert expiry
- CNI failure
- API reachability
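
Several of the checks above are reflected in the node's status conditions, so filtering them is a quick first pass. The sketch below prints only the conditions that are not in their healthy state. The function name `abnormal_conditions` and the `Type=Status` input format are assumptions chosen for this example; it treats `Ready` as healthy when `True` and every other condition as healthy when `False`.

```bash
# Hypothetical helper: read "Type=Status" lines on stdin and print the abnormal ones.
# Ready should be True; pressure and network conditions should be False.
abnormal_conditions() {
  local line type status
  while IFS= read -r line; do
    [ -n "$line" ] || continue
    type="${line%%=*}"
    status="${line#*=}"
    case "$type" in
      Ready) [ "$status" = "True" ] || echo "$line" ;;
      *)     [ "$status" = "False" ] || echo "$line" ;;
    esac
  done
}

# Example (not run here): produce the expected input format with jsonpath.
# kubectl get node <node> \
#   -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' | abnormal_conditions
```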

### Storage Provisioning Issues

```bash
kubectl get pvc,pv -A
kubectl -n <ns> describe pvc <pvc>
kubectl get sc
kubectl -n kube-system logs deploy/<csi-controller> --tail=100
```

Check:

- storage class defaulting
- access mode mismatch
- CSI controller errors
- backend quota or LUN exhaustion
- node attachment failures