# Platform Engineering Cheatsheet
Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.
## Kubernetes / K3s
### Contexts, Namespaces, and Basic Workflows
```bash
kubectl config get-contexts
kubectl config use-context <context>
kubectl get ns
kubectl -n <ns> get pods -o wide
kubectl -n <ns> get deploy,sts,ds,svc,ingress
kubectl get nodes -o wide
```
### Describe, Logs, Exec, Events
```bash
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --tail=100
kubectl -n <ns> logs <pod> -c <container> --previous
kubectl -n <ns> exec -it <pod> -- sh
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
```
### Rollout Troubleshooting
```bash
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> rollout history deploy/<name>
kubectl -n <ns> rollout undo deploy/<name>
kubectl -n <ns> get rs -l app=<name>
```
Safe pattern:
1. `kubectl diff -f <manifest>`
2. apply to non-prod or canary namespace
3. watch rollout and events
4. validate service and logs
5. expand scope only after post-check
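
The safe pattern above can be sketched as one shell function. This is a minimal sketch, not a definitive implementation: `staged_apply` is a hypothetical helper name, the 120s timeout is an arbitrary choice, and it assumes the manifest defines a Deployment whose name you pass in. Service and log validation (step 4) is left to the operator.

```bash
# Hypothetical helper: diff, apply, then watch one Deployment roll out.
# Usage: staged_apply <ns> <manifest> <deployment-name>
staged_apply() (
  ns=$1; manifest=$2; deploy=$3

  # kubectl diff exits 0 when nothing changes, 1 when a diff exists, >1 on error.
  rc=0
  kubectl -n "$ns" diff -f "$manifest" || rc=$?
  if [ "$rc" -gt 1 ]; then
    echo "diff failed (exit $rc); not applying" >&2
    return 1
  fi
  if [ "$rc" -eq 0 ]; then
    echo "no changes to apply"
    return 0
  fi

  kubectl -n "$ns" apply -f "$manifest" || return 1

  # Block until the rollout finishes or the timeout expires.
  if ! kubectl -n "$ns" rollout status "deploy/$deploy" --timeout=120s; then
    kubectl -n "$ns" get events --sort-by=.lastTimestamp | tail -30
    return 1
  fi
  echo "rollout ok: $deploy in $ns"
)
```

Run it against the canary namespace first, and repeat for the wider scope only after the post-checks pass.
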
### Node Validation
```bash
kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
```
### Pending / CrashLoopBackOff Flow
Pending:
```bash
kubectl -n <ns> describe pod <pod>
kubectl get events -A --sort-by=.lastTimestamp | tail -50
```
Check for:
- unsatisfied CPU/memory requests
- missing PVC
- taints/tolerations mismatch
- image pull secret issues
- node selectors or affinity mismatch
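
As a rough aid for the checklist above, scheduler messages can be mapped to likely causes. This is a sketch under assumptions: `scheduling_hints` is a hypothetical helper, and the matched phrases reflect typical scheduler wording that may differ between Kubernetes versions.

```bash
# Hypothetical helper: read scheduler event messages on stdin, print likely causes.
# Example feed (placeholders as elsewhere in this sheet):
#   kubectl -n <ns> get events \
#     --field-selector involvedObject.name=<pod>,reason=FailedScheduling \
#     -o jsonpath='{.items[*].message}' | scheduling_hints
scheduling_hints() (
  msg=$(cat)
  has() { printf '%s\n' "$msg" | grep -Eqi "$1"; }

  has 'Insufficient (cpu|memory)' &&
    echo "resources: requests exceed free node capacity"
  has 'unbound immediate PersistentVolumeClaims|persistentvolumeclaim .* not found' &&
    echo "storage: PVC missing or unbound"
  has 'taint' &&
    echo "taints: pod lacks a matching toleration"
  has 'node affinity|node selector' &&
    echo "placement: nodeSelector or affinity matches no node"
  true  # no match is not an error
)
```

An empty result means none of the patterns matched, so read the raw event message.
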
CrashLoopBackOff:
```bash
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
```
Check for:
- bad config or missing env vars
- probe failures
- dependency timeouts
- permission or filesystem errors
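
The exit code in `lastState.terminated` often narrows the list above. The sketch below maps common codes to a first guess. `explain_exit_code` is a hypothetical helper, and the hints are starting points rather than diagnoses.

```bash
# Hypothetical helper: translate a container exit code into a first guess.
# Example feed:
#   code=$(kubectl -n <ns> get pod <pod> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')
#   explain_exit_code "$code"
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check whether the process is meant to stay running" ;;
    1)   echo "application error; read the --previous logs for config or env problems" ;;
    126) echo "command found but not executable; check file permissions" ;;
    127) echo "command not found; check image contents and entrypoint" ;;
    137) echo "SIGKILL (128+9); often OOMKilled, check memory limits" ;;
    143) echo "SIGTERM (128+15); container was asked to stop, check probes and evictions" ;;
    *)   echo "exit code $1; consult the application's own documentation" ;;
  esac
}
```
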
## Helm
```bash
helm repo list
helm repo update
helm list -A
helm -n <ns> get values <release> -a
helm -n <ns> get manifest <release>
helm upgrade --install <release> <chart> -n <ns> -f values.yaml
helm rollback -n <ns> <release> <revision>
helm template <release> <chart> -f values.yaml | less
```
Validation:
```bash
helm lint <chart>
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20
```
## Docker / Podman
```bash
docker images
docker ps -a
docker logs --tail 100 <container>
docker exec -it <container> sh
docker inspect <container>
docker volume ls
docker network ls
docker system df
docker image prune -f # cleanup: review first
docker container prune -f # cleanup: review first
podman ps -a
podman inspect <container>
```
Container validation:
```bash
docker exec <container> env | sort
docker exec <container> ss -ltnp
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
```
## Terraform
### Core Commands
```bash
terraform fmt -check -recursive
terraform init
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
terraform destroy -target=<resource> # impact: targeted destruction needs review
terraform state list
terraform state show <resource>
terraform import <resource> <id>
```
### Safe Workflow
1. `terraform fmt -check -recursive`
2. `terraform validate`
3. refresh provider auth and backend access
4. run `terraform plan -out=tfplan` so the exact plan is saved as an artifact
5. review the saved plan (`terraform show tfplan`) for replacements and destroys
6. apply reviewed plan only
7. validate resource state outside Terraform
Plan review focus:
- unexpected replacement
- drift on security groups, routes, storage, or instance identity
- provider alias mistakes
- wrong workspace or backend
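
The workflow and review focus above can be partly scripted. A minimal sketch follows: `tf_review_plan` is a hypothetical helper, the return code 2 for "needs review" is an arbitrary convention, and the grep relies on the phrases Terraform prints in its human-readable plan output. It does not replace reading the plan.

```bash
# Hypothetical helper: format check, validate, save a plan, flag risky changes.
# Returns 0 when nothing is replaced or destroyed, 2 when review is needed.
tf_review_plan() (
  plan=${1:-tfplan}

  terraform fmt -check -recursive || return 1
  terraform validate              || return 1
  terraform plan -out="$plan"     || return 1

  # The rendered plan marks risky resources with these phrases.
  risky=$(terraform show -no-color "$plan" |
    grep -E 'must be replaced|will be destroyed' || true)

  if [ -n "$risky" ]; then
    echo "review before apply:"
    echo "$risky"
    return 2
  fi
  echo "no replacements or destroys in $plan"
)
```

Apply with `terraform apply tfplan` only after the flagged resources have been reviewed.
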
## CI/CD Operations
### GitLab CI
```bash
gitlab-runner verify
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects/<id>/pipelines
```
### Jenkins
```bash
systemctl status jenkins --no-pager
journalctl -u jenkins -n 100 --no-pager
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
```
### Runners, Artifacts, Pipeline Failures
```bash
docker logs --tail 100 gitlab-runner
kubectl -n ci get pods
kubectl -n ci logs deploy/runner-controller --tail=100
```
Troubleshooting flow:
1. validate YAML or Jenkinsfile syntax
2. confirm runner/agent availability
3. inspect job logs for auth, cache, DNS, or registry failures
4. verify artifacts were uploaded and not expired
5. correlate with platform outages, image changes, or secret rotation
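
For step 3 of the flow above, a saved job log can be triaged by counting lines per failure category. This is a sketch only: `ci_failure_summary` is a hypothetical helper and the patterns are illustrative, not exhaustive, so tune them to the messages your runners actually emit.

```bash
# Hypothetical helper: count log lines per failure category in a saved job log.
# Usage: ci_failure_summary job.log
ci_failure_summary() (
  log=$1
  # grep -c prints 0 and exits non-zero when nothing matches; keep the 0.
  count() { grep -Eic "$1" "$log" || true; }

  echo "auth:     $(count 'unauthorized|forbidden|permission denied|authentication failed')"
  echo "cache:    $(count 'cache.*(not found|failed|miss)')"
  echo "dns:      $(count 'could not resolve|no such host|temporary failure in name resolution')"
  echo "registry: $(count 'manifest unknown|pull access denied|toomanyrequests')"
)
```
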
YAML validation:
```bash
yamllint .
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
```
## Observability
### Prometheus
```bash
curl -s http://prometheus:9090/-/ready
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
```
### Loki
```bash
curl -s http://loki:3100/ready
curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"' --data-urlencode 'limit=50' # log queries need query_range; /query is for metric queries
```
### Grafana
```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
```
### Metrics Validation and Log Correlation
```bash
kubectl -n <ns> port-forward svc/<svc> 9090:9090 # blocks; run in a separate terminal
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
```
Correlation flow:
1. confirm alert time and impacted objects
2. inspect deployment events in same window
3. compare Prometheus series, Loki logs, and app logs
4. rule out scrape lag or stale dashboards
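
To compare sources over the same window (steps 1 to 3), it helps to derive one start/end pair from the alert time and reuse it everywhere. A minimal sketch, assuming epoch-second timestamps: `alert_window` and `query_up_window` are hypothetical helpers, and `prometheus:9090` and the `up` query are the examples used earlier in this section.

```bash
# Hypothetical helper: print "<start> <end>" in epoch seconds around an alert.
# Usage: alert_window <alert-epoch> <minutes-each-side>
alert_window() {
  echo "$(( $1 - $2 * 60 )) $(( $1 + $2 * 60 ))"
}

# Hypothetical helper: fetch the `up` series for that window from Prometheus.
query_up_window() (
  set -- $(alert_window "$1" "$2")   # $1=start, $2=end from here on
  curl -Gs http://prometheus:9090/api/v1/query_range \
    --data-urlencode 'query=up' \
    --data-urlencode "start=$1" \
    --data-urlencode "end=$2" \
    --data-urlencode 'step=30s'
)
```

The same start and end values can be passed to other range queries so every source covers an identical interval.
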
## GPU / AI Infrastructure
### GPU Discovery and CUDA Validation
```bash
nvidia-smi
nvidia-smi -L
nvidia-smi topo -m
nvidia-smi dmon -s pucm
nvcc --version
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
```
### MIG Basics
```bash
nvidia-smi -i 0 -q | grep -i mig -A4
nvidia-smi mig -lgip
nvidia-smi mig -lgi
```
### GPU Operator and DCGM
```bash
kubectl get pods -A | grep -E 'nvidia|gpu'
kubectl -n gpu-operator describe pod <pod>
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
```
### Container GPU Validation
```bash
docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
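# kubectl run no longer accepts --limits (removed in v1.24); set the GPU limit via --overrides
kubectl run gpu-check --rm --attach --restart=Never \
  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-check","image":"nvidia/cuda:12.3.2-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'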
```
### Kubernetes GPU Troubleshooting
Check for:
- device plugin not running
- driver/container toolkit mismatch
- node missing `nvidia.com/gpu` allocatable resources
- MIG profile mismatch
- taints or tolerations blocking placement
Useful checks:
```bash
kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
kubectl -n <ns> describe pod <gpu-pod>
```
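
To check every node at once for the "missing allocatable resources" case, the per-node check can be turned into a filter. This is a sketch: `nodes_without_gpu` is a hypothetical helper, and it relies on kubectl printing `<none>` for an absent field in custom-columns output.

```bash
# Hypothetical helper: list nodes that advertise no allocatable nvidia.com/gpu.
nodes_without_gpu() {
  kubectl get nodes --no-headers \
    -o custom-columns='NODE:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu' |
    awk '$2 == "<none>" || $2 == "0" { print $1 }'
}
```

A GPU node appearing in this list usually points at the device plugin or driver rather than at the workload.
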
## Platform Troubleshooting Flows
### Pod Not Starting
```bash
kubectl -n <ns> get pod <pod> -o wide
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
```
### Image Pull Errors
```bash
kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
crictl images | grep <image>
ctr -n k8s.io images ls | grep <image>
```
Check:
- image tag exists
- registry reachable
- pull secret valid
- node clock sane for token-based auth
### Failing Deployment
```bash
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> describe deploy/<name>
kubectl -n <ns> get rs,pods -l app=<name> -o wide
```
### Node Not Ready
```bash
kubectl describe node <node>
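journalctl -u k3s -n 100 --no-pager # K3s server; the unit is k3s-agent on agent nodes
systemctl status kubelet --no-pager # non-K3s nodes; K3s runs the kubelet inside the k3s process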
df -h
free -m
```
Check:
- kubelet or k3s service state
- disk pressure
- cert expiry
- CNI failure
- API reachability
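
Node conditions give a quick first cut before working through the checklist above. The sketch below prints only conditions that deviate from a healthy node (`Ready=True`, everything else `False`). `node_bad_conditions` is a hypothetical helper name.

```bash
# Hypothetical helper: print node conditions that are not in their healthy state.
# Usage: node_bad_conditions <node>
node_bad_conditions() (
  node=$1
  kubectl get node "$node" \
    -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
    awk -F= '
      $1 == "Ready" { if ($2 != "True")  print; next }
      NF == 2       { if ($2 != "False") print }
    '
)
```

Empty output means the API server sees the node as healthy, which shifts attention to workloads or networking.
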
### Storage Provisioning Issues
```bash
kubectl get pvc,pv -A
kubectl -n <ns> describe pvc <pvc>
kubectl get sc
kubectl -n kube-system logs deploy/<csi-controller> --tail=100
```
Check:
- storage class defaulting
- access mode mismatch
- CSI controller errors
- backend quota or LUN exhaustion
- node attachment failures
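
A quick way to find where to start is to list claims that are not bound. This is a sketch: `unbound_pvcs` is a hypothetical helper, and it assumes the default `kubectl get pvc -A` column order of namespace, name, status.

```bash
# Hypothetical helper: list PVCs whose status is anything other than Bound.
unbound_pvcs() {
  kubectl get pvc -A --no-headers |
    awk '$3 != "Bound" { print $1 "/" $2 ": " $3 }'
}
```

Each reported claim can then be inspected with `kubectl -n <ns> describe pvc <pvc>` to read its provisioning events.
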