platform-projects/docs/platform-cheatsheet.md

# Platform Engineering Cheatsheet

Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.

## Kubernetes / K3s

### Contexts, Namespaces, and Basic Workflows

```bash
kubectl config get-contexts
kubectl config use-context <context>
kubectl get ns
kubectl -n <ns> get pods -o wide
kubectl -n <ns> get deploy,sts,ds,svc,ingress
kubectl get nodes -o wide
```

### Describe, Logs, Exec, Events

```bash
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --tail=100
kubectl -n <ns> logs <pod> -c <container> --previous
kubectl -n <ns> exec -it <pod> -- sh
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
```

### Rollout Troubleshooting

```bash
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> rollout history deploy/<name>
kubectl -n <ns> rollout undo deploy/<name>
kubectl -n <ns> get rs -l app=<name>
```

Safe pattern:

1. `kubectl diff -f <manifest>`
2. apply to non-prod or canary namespace
3. watch rollout and events
4. validate service and logs
5. expand scope only after post-check
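
The safe pattern above can be sketched as a small shell function. This is a minimal sketch, not a definitive rollout tool: the function name `staged_apply`, the 120s timeout, and the argument order are assumptions made for illustration. It relies on `kubectl diff` exiting 0 for no changes, 1 when differences exist, and a higher code on error.

```bash
# Sketch only: staged apply of one manifest into a canary namespace.
# Usage: staged_apply <canary-ns> <deployment> <manifest>
staged_apply() {
  local ns=$1 deploy=$2 manifest=$3 rc=0
  # Step 1: show what would change; exit code >1 means the diff itself failed
  kubectl -n "$ns" diff -f "$manifest" || rc=$?
  if [ "$rc" -gt 1 ]; then
    echo "diff failed (exit $rc); aborting" >&2
    return "$rc"
  fi
  # Step 2: apply to the canary namespace only
  kubectl -n "$ns" apply -f "$manifest" || return 1
  # Step 3: fail if the rollout does not converge within the timeout
  kubectl -n "$ns" rollout status "deploy/$deploy" --timeout=120s || return 1
  # Step 4: recent events for a quick post-check
  kubectl -n "$ns" get events --sort-by=.lastTimestamp | tail -20
}
```

A call might look like `staged_apply canary web deploy.yaml`; widening the scope to other namespaces stays a manual decision after the post-check.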

### Node Validation

```bash
kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>
```

### Pending / CrashLoopBackOff Flow

Pending:

```bash
kubectl -n <ns> describe pod <pod>
kubectl get events -A --sort-by=.lastTimestamp | tail -50
```

Check for:

- unsatisfied CPU/memory requests
- missing PVC
- taints/tolerations mismatch
- image pull secret issues
- node selectors or affinity mismatch
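
Most of the causes above surface as Warning events on the pod (for example `FailedScheduling`). As a hedged sketch, the cluster-wide event dump can be narrowed to a single pod with an event field selector; the function name `pending_events` is an assumption for illustration.

```bash
# Sketch: show only Warning events for one pod, oldest first.
# Usage: pending_events <ns> <pod>
pending_events() {
  kubectl -n "$1" get events \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$2,type=Warning" \
    --sort-by=.lastTimestamp
}
```

For example, `pending_events prod web-0` keeps the output scoped to one object instead of scanning every namespace.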

CrashLoopBackOff:

```bash
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'
```

Check for:

- bad config or missing env vars
- probe failures
- dependency timeouts
- permission or filesystem errors
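
The last termination state often narrows the list above quickly. The sketch below prints the restart count, reason, and exit code per container, and maps a few common exit codes to a first guess. The function names and the hint wording are assumptions; the codes follow the usual shell convention (126 and 127 for exec problems, 128+N for signal N).

```bash
# Sketch: per-container restart count, last termination reason, and exit code.
# Usage: last_exit <ns> <pod>
last_exit() {
  kubectl -n "$1" get pod "$2" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Sketch: first-guess hint for a container exit code.
# Usage: explain_exit_code <code>
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; check the command and restart policy" ;;
    126) echo "command not executable; check permissions" ;;
    127) echo "command not found; check image and entrypoint" ;;
    137) echo "SIGKILL (128+9); often OOMKilled" ;;
    139) echo "SIGSEGV (128+11); the process crashed" ;;
    143) echo "SIGTERM (128+15); the container was asked to stop" ;;
    *)   echo "application error; read the --previous logs" ;;
  esac
}
```

A typical sequence would be `last_exit <ns> <pod>` followed by `explain_exit_code 137` for whatever code it reports.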

## Helm

```bash
helm repo list
helm repo update
helm list -A
helm -n <ns> get values <release> -a
helm -n <ns> get manifest <release>
helm upgrade --install <release> <chart> -n <ns> -f values.yaml
helm rollback -n <ns> <release> <revision>
helm template <release> <chart> -f values.yaml | less
```

Validation:

```bash
helm lint <chart>
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20
```

## Docker / Podman

```bash
docker images
docker ps -a
docker logs --tail 100 <container>
docker exec -it <container> sh
docker inspect <container>
docker volume ls
docker network ls
docker system df
docker image prune -f         # cleanup: review first
docker container prune -f     # cleanup: review first
podman ps -a
podman inspect <container>
```

Container validation:

```bash
docker exec <container> env | sort
docker exec <container> ss -ltnp
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>
```

## Terraform

### Core Commands

```bash
terraform fmt -check -recursive
terraform init
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
terraform destroy -target=<resource>   # impact: targeted destruction needs review
terraform state list
terraform state show <resource>
terraform import <resource> <id>
```

### Safe Workflow

1. `terraform fmt -check -recursive`
2. `terraform validate`
3. refresh provider auth and backend access
4. save the plan as an artifact: `terraform plan -out=tfplan`
5. review the saved plan (`terraform show tfplan`) for replacements and destroys
6. apply the reviewed plan only: `terraform apply tfplan`
7. validate resource state outside Terraform

Plan review focus:

- unexpected replacement
- drift on security groups, routes, storage, or instance identity
- provider alias mistakes
- wrong workspace or backend
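
The replacement and destroy checks can be partly automated against a saved plan file. This is a rough sketch that greps the human-readable plan for the phrases Terraform prints for those actions. The function name `plan_gate` is an assumption, and a text match is no substitute for reading the plan.

```bash
# Sketch: refuse to proceed when a saved plan replaces or destroys resources.
# Usage: plan_gate <planfile>
plan_gate() {
  local risky
  # "|| true" keeps an empty match from counting as a failure
  risky=$(terraform show -no-color "$1" |
    grep -E 'must be replaced|will be destroyed' || true)
  if [ -n "$risky" ]; then
    echo "$risky"
    echo "destructive changes found; get a second review before apply" >&2
    return 1
  fi
  echo "no replacements or destroys in $1"
}
```

One possible use is `plan_gate tfplan && terraform apply tfplan`, so the apply only runs when the gate passes.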

## CI/CD Operations

### GitLab CI

```bash
gitlab-runner verify
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects/<id>/pipelines
```

### Jenkins

```bash
systemctl status jenkins --no-pager
journalctl -u jenkins -n 100 --no-pager
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs
```

### Runners, Artifacts, Pipeline Failures

```bash
docker logs --tail 100 gitlab-runner
kubectl -n ci get pods
kubectl -n ci logs deploy/runner-controller --tail=100
```

Troubleshooting flow:

1. validate YAML or Jenkinsfile syntax
2. confirm runner/agent availability
3. inspect job logs for auth, cache, DNS, or registry failures
4. verify artifacts were uploaded and not expired
5. correlate with platform outages, image changes, or secret rotation
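
Step 3 of the flow can be given a first pass with a pattern scan over a downloaded job log. The sketch below is illustrative only: the function name `classify_ci_log` and every pattern in the table are assumptions, and real logs will need patterns tuned to the runners and registries in use.

```bash
# Sketch: tag a CI job log with likely failure classes.
# Usage: classify_ci_log <logfile>
classify_ci_log() {
  local log=$1 found=0 label pattern
  # Each table row is label=extended-regex; matching is case-insensitive
  while IFS='=' read -r label pattern; do
    if grep -Eiq -- "$pattern" "$log"; then
      echo "$label"
      found=1
    fi
  done <<'EOF'
auth=401 unauthorized|403 forbidden|authentication required
cache=failed to extract cache|cache.*not found
dns=could not resolve host|temporary failure in name resolution|no such host
registry=manifest unknown|pull access denied|toomanyrequests
EOF
  [ "$found" -eq 1 ] || echo "unclassified"
}
```

A log can match several classes at once; an `unclassified` result means the log still needs a manual read.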

YAML validation:

```bash
yamllint .
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml
```

## Observability

### Prometheus

```bash
curl -s http://prometheus:9090/-/ready
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'
```

### Loki

```bash
curl -s http://loki:3100/ready
curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"' --data-urlencode 'limit=50'
```

### Grafana

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50
```

### Metrics Validation and Log Correlation

```bash
kubectl -n <ns> port-forward svc/<svc> 9090:9090   # blocks; run the curl from a second shell
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'
```

Correlation flow:

1. confirm alert time and impacted objects
2. inspect deployment events in same window
3. compare Prometheus series, Loki logs, and app logs
4. rule out scrape lag or stale dashboards
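
Steps 1 to 3 depend on querying every source over the same time window. A minimal sketch, assuming the alert time is known as a Unix epoch in seconds: `alert_window` computes a symmetric window, and `loki_errors_near` passes it to Loki's `query_range` endpoint, which accepts nanosecond epochs for `start` and `end`. Both function names, the 300 second default, and the `loki:3100` host (reused from the Loki section) are assumptions.

```bash
# Sketch: print "<start> <end>" epoch seconds around an alert time.
# Usage: alert_window <alert-epoch-seconds> [half-width-seconds]
alert_window() {
  local alert=$1 half=${2:-300}
  echo "$((alert - half)) $((alert + half))"
}

# Sketch: fetch error lines for a log selector within that window.
# Usage: loki_errors_near '<selector>' <alert-epoch-seconds> [half-width-seconds]
loki_errors_near() {
  local selector=$1 window start end
  window=$(alert_window "$2" "${3:-300}")
  start=${window% *}
  end=${window#* }
  # Appending nine zeros converts seconds to nanoseconds
  curl -Gs http://loki:3100/loki/api/v1/query_range \
    --data-urlencode "query=$selector |= \"error\"" \
    --data-urlencode "start=${start}000000000" \
    --data-urlencode "end=${end}000000000"
}
```

For example, `loki_errors_near '{app="nginx"}' 1715247715 600` covers ten minutes either side of the alert; the same two epochs can bound a Prometheus range query.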

## GPU / AI Infrastructure

### GPU Discovery and CUDA Validation

```bash
nvidia-smi
nvidia-smi -L
nvidia-smi topo -m
nvidia-smi dmon -s pucm
nvcc --version
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'
```

### MIG Basics

```bash
nvidia-smi -i 0 -q | grep -i mig -A4
nvidia-smi mig -lgip
nvidia-smi mig -lgi
```

### GPU Operator and DCGM

```bash
kubectl get pods -A | grep -E 'nvidia|gpu'
kubectl -n gpu-operator describe pod <pod>
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100
```

### Container GPU Validation

```bash
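docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
# `kubectl run --limits` was removed from newer kubectl releases; request the GPU via --overrides
kubectl run gpu-check --rm --attach --restart=Never \
  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-check","image":"nvidia/cuda:12.3.2-base-ubuntu22.04","command":["nvidia-smi"],"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'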
```

### Kubernetes GPU Troubleshooting

Check for:

- device plugin not running
- driver/container toolkit mismatch
- node missing `nvidia.com/gpu` allocatable resources
- MIG profile mismatch
- taints or tolerations blocking placement
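
The allocatable-resource item can be checked directly, since a node with a healthy device plugin reports `nvidia.com/gpu` under `.status.allocatable`. The sketch below reads that field (the dot in the key is escaped for kubectl's jsonpath) and flags nodes that report nothing. The function names are assumptions for illustration.

```bash
# Sketch: print the allocatable nvidia.com/gpu count for a node (empty if absent).
# Usage: gpu_allocatable <node>
gpu_allocatable() {
  kubectl get node "$1" -o jsonpath='{.status.allocatable.nvidia\.com/gpu}'
}

# Sketch: fail when a node advertises no GPUs to the scheduler.
# Usage: check_gpu_node <node>
check_gpu_node() {
  local count
  count=$(gpu_allocatable "$1")
  if [ -z "$count" ] || [ "$count" = "0" ]; then
    echo "$1: no allocatable nvidia.com/gpu; check device plugin and driver"
    return 1
  fi
  echo "$1: $count allocatable GPU(s)"
}
```

A missing or zero count points at the first two items in the list; a non-zero count with pods still Pending points at MIG profiles or taints instead.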

Useful checks:

```bash
kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
kubectl -n <ns> describe pod <gpu-pod>
```

## Platform Troubleshooting Flows

### Pod Not Starting

```bash
kubectl -n <ns> get pod <pod> -o wide
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30
```

### Image Pull Errors

```bash
kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
crictl images | grep <image>
ctr -n k8s.io images ls | grep <image>
```

Check:

- image tag exists
- registry reachable
- pull secret valid
- node clock sane for token-based auth
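
Before working through the list, it helps to confirm which container and image reference is actually stuck. A hedged sketch: the function below prints containers whose waiting reason is one of the kubelet's image-related reasons. The name `pull_waiting` is an assumption, and init containers are not covered.

```bash
# Sketch: list containers stuck on image-related waiting reasons.
# Usage: pull_waiting <ns> <pod>
pull_waiting() {
  kubectl -n "$1" get pod "$2" -o \
    jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.image}{"\t"}{.state.waiting.reason}{"\n"}{end}' |
    awk -F'\t' '$3 ~ /^(ErrImagePull|ImagePullBackOff|InvalidImageName)$/ { print $1 ": " $3 " (" $2 ")" }'
}
```

Empty output means no regular container is waiting on an image, so the problem is probably elsewhere.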

### Failing Deployment

```bash
kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> describe deploy/<name>
kubectl -n <ns> get rs,pods -l app=<name> -o wide
```

### Node Not Ready

```bash
kubectl describe node <node>
journalctl -u k3s -n 100 --no-pager
systemctl status kubelet --no-pager
df -h
free -m
```

Check:

- kubelet or k3s service state
- disk pressure
- cert expiry
- CNI failure
- API reachability
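
Node conditions summarize several of these checks in one place. The sketch below prints each condition and then filters to the unhealthy ones, treating `Ready` as healthy only when `True` and every other condition (pressure, network unavailable) as healthy only when `False`. The function names are assumptions for illustration.

```bash
# Sketch: print "<type>=<status><TAB><reason>" for every node condition.
# Usage: node_conditions <node>
node_conditions() {
  kubectl get node "$1" -o \
    jsonpath='{range .status.conditions[*]}{.type}={.status}{"\t"}{.reason}{"\n"}{end}'
}

# Sketch: keep only conditions that are in an unhealthy state.
# Usage: node_problems <node>
node_problems() {
  node_conditions "$1" | awk -F'[=\t]' '
    $1 == "Ready" && $2 != "True"  { print; next }
    $1 != "Ready" && $2 != "False" { print }'
}
```

An `Unknown` status is reported as a problem too, which usually means the kubelet has stopped posting status.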

### Storage Provisioning Issues

```bash
kubectl get pvc,pv -A
kubectl -n <ns> describe pvc <pvc>
kubectl get sc
kubectl -n kube-system logs deploy/<csi-controller> --tail=100
```

Check:

- storage class defaulting
- access mode mismatch
- CSI controller errors
- backend quota or LUN exhaustion
- node attachment failures
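
Two of these checks are easy to script. As a rough sketch, the functions below list claims that are not `Bound` and count storage classes marked as default, where exactly one is the usual expectation. The function names are assumptions, and the column positions rely on the default `kubectl get` table output.

```bash
# Sketch: list PVCs in any namespace whose status is not Bound.
# With -A the columns are NAMESPACE NAME STATUS ...
unbound_pvcs() {
  kubectl get pvc -A --no-headers |
    awk '$3 != "Bound" { print $1 "/" $2 ": " $3 }'
}

# Sketch: count storage classes flagged "(default)" in the name column.
default_sc_count() {
  # grep -c exits non-zero on zero matches but still prints 0
  kubectl get sc --no-headers | grep -c '(default)' || true
}
```

A count of 0 explains claims without a `storageClassName` staying Pending; a count above 1 makes defaulting ambiguous.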