mateusz/portfolio

Mateusz Suski 4e739c5c99

Linux/Unix Infrastructure Engineering Portfolio

This repository contains sanitized infrastructure automation examples based on Linux/Unix operations and infrastructure workflows. The focus is on incident response, troubleshooting, pre-checks, dry-run behavior, controlled execution, post-checks, and readable operational evidence.

It is a technical portfolio, not a production toolkit. The examples show how operational work is structured: understand the current state, make changes only with explicit controls, verify the result, and leave enough evidence for review.

What This Repo Is

Practical Linux/Unix operations examples.
Safe Bash and Ansible patterns for lab and review.
Runbook-driven examples for incident response, storage operations, hardening, and observability.
A place for platform and lab topics to grow without pretending unfinished areas are complete.

What This Repo Is Not

It is not a compliance benchmark implementation.
It is not a drop-in change automation framework.
It is not proof that these exact scripts ran in any production environment.
It does not replace change review, peer review, backups, monitoring, or platform-specific runbooks.

Repository Layout

infra-run - core operational tooling and automation.
platform-projects - larger platform topics and case-study areas.
labs - experimental/lab environments and notes.
docs/codex - guidance for future Codex-driven changes.
scripts - lightweight repository validation helpers.

Usable Now

infra-run - the main implemented project in this repository.
Linux healthcheck scripts - host, disk, service, network, and report helpers.
Bash incident checks - standalone read-only checks for common Linux incidents, plus an L2 Markdown triage report wrapper for repeatable handoff and ticket evidence.
Disk full workflow - triage scripts for usage, inode pressure, deleted open files, large files, log cleanup review, and postchecks.
Veritas examples - dry-run-first VxVM/VCS storage expansion workflow examples.
GPFS examples - dry-run-first IBM Spectrum Scale expansion workflow examples.
Incident log summary - read-only Python helper for local incident log pattern summaries.
Log diff checker - read-only Python helper for before/after change log comparison.
Auth log audit - read-only Python helper for local authentication log review.
JVM log analyzer - read-only Python helper for local JVM and Java application log review.
Journal analyzer - read-only Python helper for exported journalctl text review.
Known error matcher - read-only Python helper for matching logs against a JSON known-error catalog with runbook references.
Python operational log analysis tools - small standard-library helpers for local log summaries, before/after comparisons, and evidence reports.
Ansible hardening examples - selected Linux and AIX baseline hardening tasks organized as lab-safe roles.
Slurm AI/HPC cluster automation lab - Ansible-managed Slurm lab covering CPU/GPU scheduling, GRES, cgroups, accounting, QOS/fairshare, lifecycle workflows, rolling upgrades, and health remediation.
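To illustrate the known error matcher idea from the list above, here is a minimal sketch that matches log lines against a small JSON catalog. The catalog schema (id, pattern, runbook), the sample entries, the runbook paths, and the function name match_known_errors are all assumptions made for this example; the actual helper in infra-run may be organized differently.

```python
import json
import re

# Hypothetical catalog: field names, IDs, and runbook paths are illustrative only.
CATALOG_JSON = """
[
  {"id": "KE-001", "pattern": "No space left on device",
   "runbook": "runbooks/disk-full.md"},
  {"id": "KE-002", "pattern": "Out of memory: Killed process [0-9]+",
   "runbook": "runbooks/oom.md"}
]
"""


def match_known_errors(lines, catalog):
    """Return one record per (log line, catalog entry) match, in input order."""
    compiled = [(entry, re.compile(entry["pattern"])) for entry in catalog]
    matches = []
    for line_no, line in enumerate(lines, start=1):
        for entry, regex in compiled:
            if regex.search(line):
                matches.append(
                    {"line": line_no, "id": entry["id"], "runbook": entry["runbook"]}
                )
    return matches


if __name__ == "__main__":
    catalog = json.loads(CATALOG_JSON)
    sample_log = [
        "service started",
        "write failed: No space left on device",
        "Out of memory: Killed process 4242 (java)",
    ]
    # Read-only: the sketch only reports matches and never modifies the log.
    for match in match_known_errors(sample_log, catalog):
        print(f"line {match['line']}: {match['id']} -> {match['runbook']}")
```

Keeping the runbook reference in the catalog rather than in the code means new known errors can be added without changing the matcher.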

Planned Areas

The labs and platform-projects trees are intentionally thin. They are kept as planning areas for future lab notes and case studies, not as completed projects. Current planned topics are tracked in ROADMAP.md.

Documentation

Production Operations

infra-run/docs/operations-cheatsheet.md - production-focused Linux/Unix operations reference for incident handling, validation, storage, networking, Ansible, observability, and safety-first change execution.

Platform Engineering

platform-projects/docs/platform-cheatsheet.md - platform operations reference for Kubernetes, Helm, containers, Terraform, CI/CD, observability, and GPU-backed infrastructure troubleshooting.

Labs & Experiments

labs/docs/lab-cheatsheet.md - quick-reference scratchpad for K3s, Proxmox, Terraform, Docker, networking, and short-lived lab troubleshooting work.

Codex and Review Guidance

AGENTS.md - repository rules for automated and assisted changes.
docs/codex/README.md - Codex workflow and expected final response format.
docs/codex/review-checklist.md - safety, Bash, Ansible, docs, and validation review checklist.
docs/codex/task-template.md - reusable scoped task templates.

Safety-First Usage

Read scripts and playbooks before running them. Operational examples are sanitized and may need adaptation for a real system.

Prefer read-only commands first.
Use dry-run/check mode before execution.
Treat --execute as a change-control boundary.
Confirm backups, monitoring, application impact, and rollback steps before live use.
Do not run platform-specific storage commands without a matching Veritas, GPFS, or AIX lab.
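The dry-run-first rule and the --execute boundary above can be sketched as follows. This is an illustrative Python outline, not the repository's implementation (its scripts are Bash); the function names run_change and parse_args and the sample echo command are assumptions for the example.

```python
import argparse
import shlex
import subprocess


def run_change(cmd, execute=False):
    """Print the command in dry-run mode; run it only when execute is True."""
    rendered = shlex.join(cmd)
    if not execute:
        # Default path: nothing is changed, the operator only sees the plan.
        print(f"[DRY-RUN] would run: {rendered}")
        return None
    print(f"[EXECUTE] running: {rendered}")
    return subprocess.run(cmd, check=True)


def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Illustrative dry-run-first wrapper.")
    parser.add_argument(
        "--execute",
        action="store_true",
        help="apply the change; without this flag the command is only printed",
    )
    return parser.parse_args(argv)


if __name__ == "__main__":
    args = parse_args()
    # Hypothetical change step; a real workflow would run pre-checks before
    # this call and post-checks after it.
    run_change(["echo", "hypothetical change step"], execute=args.execute)
```

Making dry-run the default means that a forgotten flag results in no change rather than an unintended one.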

Validation

Basic local validation:

./scripts/validate-repo.sh
./scripts/check-bash.sh
./scripts/check-ansible.sh
./scripts/check-python.sh
./scripts/check-docs.sh

The validation helpers run required lightweight checks and use optional tools such as shellcheck, yamllint, ansible-playbook, ansible-lint, and markdownlint when available. Python checks use python3 -m py_compile and do not require external Python tooling. Set STRICT=1 to fail when optional tools are missing.
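The optional-tool behavior described above could look roughly like the sketch below. The repository's helpers are shell scripts, so this Python version is only an assumption about the gating logic: the function name check_optional_tools, the SKIP/FAIL labels, and the exit codes are hypothetical.

```python
import os
import shutil

# Tool names are taken from the paragraph above.
OPTIONAL_TOOLS = [
    "shellcheck",
    "yamllint",
    "ansible-playbook",
    "ansible-lint",
    "markdownlint",
]


def check_optional_tools(tools, strict, which=shutil.which):
    """Return (missing, exit_code); missing tools fail only in strict mode."""
    missing = [tool for tool in tools if which(tool) is None]
    exit_code = 1 if (strict and missing) else 0
    return missing, exit_code


if __name__ == "__main__":
    strict = os.environ.get("STRICT") == "1"
    missing, code = check_optional_tools(OPTIONAL_TOOLS, strict)
    for tool in missing:
        label = "FAIL" if strict else "SKIP"
        print(f"{label}: {tool} not found on PATH")
    raise SystemExit(code)
```

The which parameter exists only so the lookup can be replaced in tests; by default it uses shutil.which from the standard library.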

Some scripts depend on platform tools such as vxdisk, hagrp, mmcrnsd, and mmlscluster. Those commands are not expected to exist on a normal workstation, so functional testing against Veritas or GPFS requires a real lab environment.

See infra-run/TESTED.md and infra-run/KNOWN_LIMITATIONS.md for the current validation status.

Operational Areas Demonstrated

Linux operations triage and reporting.
Local operational log analysis with read-only Python helpers.
Disk pressure and deleted-file incident analysis.
Dry-run-first Bash automation.
Controlled storage change workflow design.
Veritas VxVM/VCS operational awareness.
GPFS / IBM Spectrum Scale operational awareness.
Ansible role organization for selected hardening controls.
Slurm AI/HPC cluster operations with GPU scheduling, accounting, lifecycle workflows, and remediation.
Clear documentation of what was tested and what still needs a real system.
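As a small illustration of the disk pressure analysis listed above, the following read-only sketch reports block and inode usage for a path using os.statvfs. The function names and the output shape are assumptions for this example and are not taken from the repository's scripts.

```python
import os


def usage_percent(used, total):
    """Percentage used, rounded to one decimal; 0.0 when total is zero."""
    return 0.0 if total == 0 else round(100.0 * used / total, 1)


def disk_pressure(path="/"):
    """Read-only snapshot of space and inode usage for the given path."""
    st = os.statvfs(path)
    used_blocks = st.f_blocks - st.f_bfree
    used_inodes = st.f_files - st.f_ffree
    return {
        "path": path,
        # May differ slightly from df, which accounts for reserved blocks.
        "space_used_pct": usage_percent(used_blocks, st.f_blocks),
        # Some filesystems report zero total inodes; that case yields 0.0.
        "inode_used_pct": usage_percent(used_inodes, st.f_files),
    }


if __name__ == "__main__":
    report = disk_pressure("/")
    print(
        f"{report['path']}: space {report['space_used_pct']}% used, "
        f"inodes {report['inode_used_pct']}% used"
    )
```

Checking inodes alongside space matters because a filesystem can refuse writes with free space remaining when its inodes are exhausted.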