mateusz/portfolio, commit 0d3905b8a1 by Mateusz Suski (2026-05-09 09:41:55 +00:00): Add operational cheatsheets across repository

Platform Engineering Cheatsheet

Operational quick reference for Kubernetes, containers, IaC, CI/CD, observability, and GPU-backed platform work. Prefer scoped queries, read-only checks, and staged rollouts.

Kubernetes / K3s

Contexts, Namespaces, and Basic Workflows

kubectl config get-contexts
kubectl config use-context <context>
kubectl get ns
kubectl -n <ns> get pods -o wide
kubectl -n <ns> get deploy,sts,ds,svc,ingress
kubectl get nodes -o wide

Describe, Logs, Exec, Events

kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --tail=100
kubectl -n <ns> logs <pod> -c <container> --previous
kubectl -n <ns> exec -it <pod> -- sh
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30

Rollout Troubleshooting

kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> rollout history deploy/<name>
kubectl -n <ns> rollout undo deploy/<name>
kubectl -n <ns> get rs -l app=<name>

Safe pattern:

1. Review the change: kubectl diff -f <manifest>
2. Apply to a non-prod or canary namespace first.
3. Watch the rollout and events.
4. Validate the service and logs.
5. Expand scope only after the post-check passes.
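
The safe pattern above can be sketched as a small shell function. This is a minimal sketch, not a definitive implementation: `staged_rollout`, the `KUBECTL` override, and the 120s timeout are illustrative assumptions, and the target is assumed to be a Deployment.

```shell
# Illustrative names: staged_rollout and KUBECTL are not part of kubectl.
KUBECTL="${KUBECTL:-kubectl}"   # override to stub kubectl in a dry run

staged_rollout() {
  local ns="$1" name="$2" manifest="$3" rc=0

  # kubectl diff exits 0 for no changes, 1 for changes, >1 on error.
  "$KUBECTL" -n "$ns" diff -f "$manifest" || rc=$?
  if [ "$rc" -gt 1 ]; then
    echo "diff failed (exit $rc); not applying" >&2
    return "$rc"
  fi
  if [ "$rc" -eq 0 ]; then
    echo "no changes for $manifest"
    return 0
  fi

  "$KUBECTL" -n "$ns" apply -f "$manifest" || return 1

  # Watch the rollout; on failure, show recent events and stop.
  if ! "$KUBECTL" -n "$ns" rollout status "deploy/$name" --timeout=120s; then
    "$KUBECTL" -n "$ns" get events --sort-by=.lastTimestamp | tail -30
    return 1
  fi
  echo "rollout of $name in $ns completed; validate service and logs before widening scope"
}
```

Usage, assuming a canary namespace exists: `staged_rollout canary web web.yaml`. Run it against production only after the canary post-check passes.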

Node Validation

kubectl get nodes
kubectl describe node <node>
kubectl top nodes
kubectl top pods -A --sort-by=cpu
kubectl get pods -A -o wide --field-selector spec.nodeName=<node>

Pending / CrashLoopBackOff Flow

Pending:

kubectl -n <ns> describe pod <pod>
kubectl get events -A --sort-by=.lastTimestamp | tail -50

Check for:

unsatisfied CPU/memory requests
missing PVC
taints/tolerations mismatch
image pull secret issues
node selectors or affinity mismatch
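
As a rough aid for the checklist above, the following sketch maps common scheduler and kubelet messages to the listed causes. `classify_pending` is a hypothetical helper, and the matched strings are assumptions based on typical event text; exact wording varies between Kubernetes versions.

```shell
# Prints one likely cause per line for a Pending pod.
# Input: `kubectl describe pod` output (or event text) on stdin.
# The matched phrases are illustrative and version-dependent.
classify_pending() {
  local text
  text="$(cat)"
  case "$text" in *"Insufficient cpu"*|*"Insufficient memory"*)
    echo "unsatisfied CPU/memory requests" ;; esac
  case "$text" in *"unbound immediate PersistentVolumeClaims"*)
    echo "missing or unbound PVC" ;; esac
  case "$text" in *"untolerated taint"*|*"didn't tolerate"*)
    echo "taints/tolerations mismatch" ;; esac
  case "$text" in *ErrImagePull*|*ImagePullBackOff*)
    echo "image pull or pull secret issue" ;; esac
  case "$text" in *"didn't match Pod's node affinity/selector"*|*"didn't match node selector"*)
    echo "node selector or affinity mismatch" ;; esac
  return 0
}
```

Usage: `kubectl -n <ns> describe pod <pod> | classify_pending`. An empty result means none of the known phrases matched, not that the pod is healthy.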

CrashLoopBackOff:

kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[*].lastState}'

Check for:

bad config or missing env vars
probe failures
dependency timeouts
permission or filesystem errors
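
The exit code in `lastState.terminated.exitCode` often narrows the cause quickly. The sketch below decodes the common values using the shell convention that codes above 128 mean "killed by signal (code - 128)". `explain_exit_code` is a hypothetical helper and the hints are starting points, not diagnoses.

```shell
# Translate a container exit code into a first troubleshooting hint.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "exited cleanly; check whether the process should be long-running" ;;
    1)   echo "application error; read the previous container logs" ;;
    126) echo "command found but not executable; check permissions" ;;
    127) echo "command not found; check image entrypoint and PATH" ;;
    137) echo "SIGKILL (128+9); often OOMKilled, check memory limits" ;;
    139) echo "SIGSEGV (128+11); crash in native code" ;;
    143) echo "SIGTERM (128+15); stopped by the kubelet or runtime" ;;
    *)
      # Non-numeric input fails the test and falls through to the else branch.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $(($code - 128))"
      else
        echo "application-defined exit code $code"
      fi ;;
  esac
}
```

Usage: `explain_exit_code "$(kubectl -n <ns> get pod <pod> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"`.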

Helm

helm repo list
helm repo update
helm list -A
helm -n <ns> get values <release> -a
helm -n <ns> get manifest <release>
helm upgrade --install <release> <chart> -n <ns> -f values.yaml
helm rollback -n <ns> <release> <revision>
helm template <release> <chart> -f values.yaml | less

Validation:

helm lint <chart>
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -20

Docker / Podman

docker images
docker ps -a
docker logs --tail 100 <container>
docker exec -it <container> sh
docker inspect <container>
docker volume ls
docker network ls
docker system df
docker image prune -f         # cleanup: review first
docker container prune -f     # cleanup: review first
podman ps -a
podman inspect <container>

Container validation:

docker exec <container> env | sort
docker exec <container> ss -ltnp
docker inspect -f '{{.State.Status}} {{.RestartCount}}' <container>

Terraform

Core Commands

terraform fmt -check -recursive
terraform init
terraform validate
terraform plan -out=tfplan
terraform apply tfplan
terraform destroy -target=<resource>   # impact: targeted destruction needs review
terraform state list
terraform state show <resource>
terraform import <resource> <id>

Safe Workflow

1. terraform fmt -check -recursive
2. terraform validate
3. Refresh provider auth and backend access.
4. Review plan output for replacements and destroys.
5. Save the plan artifact.
6. Apply the reviewed plan only.
7. Validate resource state outside Terraform.

Plan review focus:

unexpected replacement
drift on security groups, routes, storage, or instance identity
provider alias mistakes
wrong workspace or backend
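
To make unexpected replacements and destroys hard to miss during plan review, a saved plan can be filtered for them. This is a sketch under stated assumptions: `risky_plan_changes` is a hypothetical helper, and it matches the human-readable plan text, which is not a stable interface. For automation, `terraform show -json` is the more robust input.

```shell
# Reads human-readable plan text on stdin, prints resources slated for
# replacement or destruction, and returns 1 if any were found.
risky_plan_changes() {
  local hits
  hits="$(grep -E '^[[:space:]]*# .* (must be replaced|will be destroyed)' || true)"
  if [ -n "$hits" ]; then
    # Strip the leading "# " marker so only the resource and action remain.
    printf '%s\n' "$hits" | sed -E 's/^[[:space:]]*# //'
    return 1
  fi
  return 0
}
```

Usage: `terraform show -no-color tfplan | risky_plan_changes || echo "replacements or destroys need review"`.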

CI/CD Operations

GitLab CI

gitlab-runner verify
grep -n 'stage:\|script:\|rules:' .gitlab-ci.yml
curl -s --header "PRIVATE-TOKEN: $TOKEN" https://gitlab.example/api/v4/projects/<id>/pipelines

Jenkins

systemctl status jenkins --no-pager
journalctl -u jenkins -n 100 --no-pager
java -jar jenkins-cli.jar -s https://jenkins.example/ list-jobs

Runners, Artifacts, Pipeline Failures

docker logs --tail 100 gitlab-runner
kubectl -n ci get pods
kubectl -n ci logs deploy/runner-controller --tail=100

Troubleshooting flow:

1. Validate YAML or Jenkinsfile syntax.
2. Confirm runner/agent availability.
3. Inspect job logs for auth, cache, DNS, or registry failures.
4. Verify artifacts were uploaded and have not expired.
5. Correlate with platform outages, image changes, or secret rotation.
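
The job-log inspection step can be given a first pass with a simple classifier. `classify_ci_failure` is a hypothetical helper, and the matched phrases are illustrative assumptions drawn from common curl, Docker, and runner messages; extend them with the strings your own pipelines emit.

```shell
# Prints the failure classes (auth, dns, registry, cache) found in a job log.
# Input: job log text on stdin. Patterns are examples, not an exhaustive list.
classify_ci_failure() {
  local log
  log="$(cat)"
  case "$log" in *"401 Unauthorized"*|*"403 Forbidden"*|*"authentication required"*)
    echo "auth" ;; esac
  case "$log" in *"Could not resolve host"*|*"Temporary failure in name resolution"*|*"no such host"*)
    echo "dns" ;; esac
  case "$log" in *"manifest unknown"*|*"pull access denied"*|*"toomanyrequests"*)
    echo "registry" ;; esac
  case "$log" in *"Failed to extract cache"*|*"cache.zip"*)
    echo "cache" ;; esac
  return 0
}
```

Usage: save the job log to a file and run `classify_ci_failure < job.log`.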

YAML validation:

yamllint .
python3 -c 'import yaml,sys; yaml.safe_load(open(sys.argv[1]))' .gitlab-ci.yml   # requires PyYAML

Observability

Prometheus

curl -s http://prometheus:9090/-/ready
curl -s 'http://prometheus:9090/api/v1/targets?state=active' | jq '.data.activeTargets[] | {job: .labels.job, health: .health}'
curl -s 'http://prometheus:9090/api/v1/query?query=up' | jq '.data.result[] | {instance: .metric.instance, value: .value[1]}'

Loki

curl -s http://loki:3100/ready
curl -Gs http://loki:3100/loki/api/v1/query_range --data-urlencode 'query={app="nginx"} |= "error"' --data-urlencode 'limit=50'   # log queries go to query_range; /query is for metric queries

Grafana

curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error\|failed' /var/log/grafana/grafana.log | tail -50

Metrics Validation and Log Correlation

kubectl -n <ns> port-forward svc/<svc> 9090:9090
curl -s http://127.0.0.1:9090/metrics | grep -E 'http_|process_|go_'

Correlation flow:

1. Confirm alert time and impacted objects.
2. Inspect deployment events in the same window.
3. Compare Prometheus series, Loki logs, and app logs.
4. Rule out scrape lag or stale dashboards.

GPU / AI Infrastructure

GPU Discovery and CUDA Validation

nvidia-smi
nvidia-smi -L
nvidia-smi topo -m
nvidia-smi dmon -s pucm
nvcc --version
python3 -c 'import torch; print(torch.cuda.is_available(), torch.cuda.device_count())'

MIG Basics

nvidia-smi -i 0 -q | grep -i mig -A4
nvidia-smi mig -lgip
nvidia-smi mig -lgi

GPU Operator and DCGM

kubectl get pods -A | grep -E 'nvidia|gpu'
kubectl -n gpu-operator describe pod <pod>
kubectl -n gpu-operator logs ds/nvidia-device-plugin-daemonset --tail=100
kubectl -n gpu-operator logs ds/nvidia-dcgm-exporter --tail=100

Container GPU Validation

docker run --rm --gpus all nvidia/cuda:12.3.2-base-ubuntu22.04 nvidia-smi
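# newer kubectl releases no longer accept --limits on kubectl run; set the GPU limit via --overrides
kubectl run gpu-check --rm -it --restart=Never \
  --image=nvidia/cuda:12.3.2-base-ubuntu22.04 \
  --overrides='{"spec":{"containers":[{"name":"gpu-check",
    "image":"nvidia/cuda:12.3.2-base-ubuntu22.04","command":["nvidia-smi"],
    "stdin":true,"tty":true,"resources":{"limits":{"nvidia.com/gpu":"1"}}}]}}'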

Kubernetes GPU Troubleshooting

Check for:

device plugin not running
driver/container toolkit mismatch
node missing nvidia.com/gpu allocatable resources
MIG profile mismatch
taints or tolerations blocking placement

Useful checks:

kubectl describe node <gpu-node> | grep -A5 -B2 -i nvidia
kubectl get node <gpu-node> -o jsonpath='{.status.allocatable}'
kubectl -n <ns> describe pod <gpu-pod>
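
The allocatable check above can be narrowed to the GPU resource alone. In this sketch, `gpu_allocatable` and the `KUBECTL` override are illustrative names; the jsonpath expression escapes the dot in `nvidia.com/gpu` so it is read as a single key.

```shell
KUBECTL="${KUBECTL:-kubectl}"   # override to stub kubectl when testing

# Prints the allocatable GPU count for a node.
# Returns 1 if the node advertises none, 2 if the kubectl call fails.
gpu_allocatable() {
  local node="$1" count
  count="$("$KUBECTL" get node "$node" -o 'jsonpath={.status.allocatable.nvidia\.com/gpu}')" || return 2
  if [ -z "$count" ] || [ "$count" = "0" ]; then
    echo "$node: no allocatable nvidia.com/gpu (check device plugin and driver)"
    return 1
  fi
  echo "$node: $count allocatable GPU(s)"
}
```

Usage: `gpu_allocatable <gpu-node>`. An empty or zero result usually points back to the device plugin or driver items in the checklist.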

Platform Troubleshooting Flows

Pod Not Starting

kubectl -n <ns> get pod <pod> -o wide
kubectl -n <ns> describe pod <pod>
kubectl -n <ns> logs <pod> --previous
kubectl -n <ns> get events --sort-by=.lastTimestamp | tail -30

Image Pull Errors

kubectl -n <ns> describe pod <pod> | grep -A5 -i 'image'
crictl images | grep <image>
ctr -n k8s.io images ls | grep <image>

Check:

image tag exists
registry reachable
pull secret valid
node clock sane for token-based auth

Failing Deployment

kubectl -n <ns> rollout status deploy/<name>
kubectl -n <ns> describe deploy/<name>
kubectl -n <ns> get rs,pods -l app=<name> -o wide

Node Not Ready

kubectl describe node <node>
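journalctl -u k3s -n 100 --no-pager     # K3s server; use k3s-agent on agent nodes
systemctl status kubelet --no-pager     # non-K3s nodes; K3s embeds the kubelet in the k3s service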
df -h
free -m

Check:

kubelet or k3s service state
disk pressure
cert expiry
CNI failure
API reachability
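
For the disk pressure item, `df` output can be filtered to the filesystems that are close to full. `disk_over` is a hypothetical helper and the 85% default is an illustrative assumption; kubelet eviction thresholds are configured separately and may differ.

```shell
# Reads `df -P` output on stdin and prints "mountpoint usage%" for every
# filesystem at or above the given usage percentage (default 85).
disk_over() {
  local threshold="${1:-85}"
  awk -v t="$threshold" 'NR > 1 {
    use = $5; sub(/%/, "", use)          # "90%" -> "90"
    if (use + 0 >= t + 0) print $6 " " $5
  }'
}
```

Usage: `df -P | disk_over 85`. Mount points containing spaces are not handled by this sketch.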

Storage Provisioning Issues

kubectl get pvc,pv -A
kubectl -n <ns> describe pvc <pvc>
kubectl get sc
kubectl -n kube-system logs deploy/<csi-controller> --tail=100

Check:

storage class defaulting
access mode mismatch
CSI controller errors
backend quota or LUN exhaustion
node attachment failures
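
A quick way to find the claims worth describing is to filter the PVC list for anything that is not Bound. `unbound_pvcs` is a hypothetical helper; it assumes the default column order of `kubectl get pvc -A --no-headers` (NAMESPACE, NAME, STATUS, ...).

```shell
# Reads `kubectl get pvc -A --no-headers` on stdin and prints
# "namespace/name status" for every claim that is not Bound.
unbound_pvcs() {
  awk 'NF >= 3 && $3 != "Bound" { print $1 "/" $2 " " $3 }'
}
```

Usage: `kubectl get pvc -A --no-headers | unbound_pvcs`, then run `kubectl -n <ns> describe pvc <pvc>` on each result.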
