Rework portfolio around Linux operations, Zabbix monitoring, migration validation, and ELK/Grafana log observability. Add AAP-style LVM resize workflow, Zabbix server/proxy/agent automation assets, Linux/AIX monitoring templates, and updated validation CI.
@@ -0,0 +1,8 @@
---
skip_list:
  - role-name
  - name[casing]
  - line-too-long

exclude_paths:
  - .git
@@ -0,0 +1,19 @@
.PHONY: help test lint syntax validate-assets

help:
	@echo "Zabbix Monitoring + Incident Response"
	@echo "  make test             Run syntax, lint, and asset validation"
	@echo "  make syntax           Run Ansible syntax checks"
	@echo "  make lint             Run ansible-lint"
	@echo "  make validate-assets  Validate template and sample JSON assets"

test: syntax lint validate-assets

syntax:
	ansible-playbook --syntax-check playbooks/*.yml

lint:
	ansible-lint

validate-assets:
	python3 scripts/validate_assets.py
@@ -0,0 +1,63 @@
# Zabbix Monitoring + Incident Response

## Problem

Large Linux/Unix environments need simple, reliable OS checks before more advanced observability becomes useful. Filesystems, CPU, memory, network, process status, proxy backlog, and agent availability must be monitored consistently across Linux and AIX estates.

## CV Relevance

This project maps to Zabbix monitoring platform work, proxy maintenance, custom checks, alert noise reduction, and incident response in enterprise environments. It shows operational design and automation without pretending to run AIX locally.

## What This Project Demonstrates

- Ansible-first Zabbix server, proxy, and agent/agent2 configuration structure.
- Proxy topology for active and passive checks.
- Linux and AIX OS monitoring templates as reviewable JSON assets.
- Sample Linux/AIX check data for filesystem, CPU, memory, network, and process monitoring.
- Runbooks for Zabbix maintenance and incident response.

## Architecture

```text
Linux/AIX hosts -> Zabbix agent/agent2 -> Zabbix proxy -> Zabbix server/web
       |                                        |
       v                                        v
OS simple checks                        proxy queue/cache

Incident -> Alert -> Operator triage -> Maintenance or remediation evidence
```

## Quickstart

```bash
cd professional-infra/zabbix-monitoring-incident-response
make test
```

`make test` performs Ansible syntax/lint checks and validates the Zabbix template/sample JSON assets.

## Validation

```bash
ansible-playbook --syntax-check playbooks/*.yml
ansible-lint
python3 scripts/validate_assets.py
```

## Example Output

Sample check payloads are available in `samples/linux-os-checks.json` and `samples/aix-os-checks.json`. These show what a reviewable `zabbix_sender` or API-driven evidence artifact could look like for Linux and AIX hosts.
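
As a rough illustration of how such a payload could feed `zabbix_sender`, the sketch below turns the `checks` map into whitespace-delimited `<hostname> <key> <value>` lines, the layout `zabbix_sender` reads from an input file. The helper name `to_sender_lines` and the inline sample are assumptions for illustration; they are not part of this repository.

```python
import json


def to_sender_lines(sample: dict) -> list[str]:
    """Render one '<host> <key> <value>' line per check in a sample payload."""
    host = sample["host"]
    lines = []
    for key, value in sample["checks"].items():
        # Quote keys that contain spaces so each line still splits into three fields.
        rendered_key = f'"{key}"' if " " in key else key
        lines.append(f"{host} {rendered_key} {value}")
    return lines


# Inline excerpt shaped like samples/linux-os-checks.json so the sketch runs standalone.
sample = json.loads("""
{
  "host": "linux-app01",
  "checks": {"vfs.fs.size[/,pused]": 72.4, "proc.num[sshd]": 2}
}
""")

for line in to_sender_lines(sample):
    print(line)
# prints:
# linux-app01 vfs.fs.size[/,pused] 72.4
# linux-app01 proc.num[sshd] 2
```

The resulting lines could be written to a file and passed to `zabbix_sender -z <server-or-proxy> -i <file>`. Sender values are normally accepted only by trapper-type items, so the agent-type items in the templates here would need trapper counterparts for this to work.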

## Interview Talking Points

- Why Zabbix is suitable for simple OS checks while ELK/Grafana is better for log analysis.
- How proxies reduce WAN dependency and support branch/client environments.
- Difference between active and passive checks.
- How to troubleshoot unsupported items, missing data, proxy backlog, and agent reachability.
- How Linux and AIX monitoring differ, without inventing a local AIX runtime.

## Roadmap

- Add API import helpers for templates.
- Add a Docker-based Zabbix server/proxy demo scaffold.
- Add Wazuh or security monitoring integration as a separate side lab.
@@ -0,0 +1,5 @@
[defaults]
roles_path = ./roles
inventory = ./inventory/hosts.ini
host_key_checking = False
retry_files_enabled = False
@@ -0,0 +1,30 @@
# Incident Response Runbook

## Filesystem Alert

1. Confirm current usage and growth trend.
2. Check whether the host is Linux or AIX and use the correct runbook.
3. Validate application ownership of the filesystem.
4. Clean known temporary paths or request LVM expansion when approved.
5. Attach before/after evidence to the incident ticket.
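
For steps 1 and 5, a small script can capture the usage figure in a form that is easy to attach to a ticket. This is a minimal sketch using only the Python standard library; the function names are hypothetical, and the 85 percent default mirrors the trigger threshold used in this repository's templates. It covers current usage only, not the growth trend, and the percentage may differ slightly from `df` because reserved blocks are counted differently.

```python
import shutil


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total, rounded to one decimal."""
    if total <= 0:
        raise ValueError("total must be positive")
    return round(used / total * 100, 1)


def filesystem_evidence(path: str, threshold: float = 85.0) -> dict:
    """Collect a before/after evidence record for one mount point."""
    usage = shutil.disk_usage(path)  # named tuple with total, used, free (bytes)
    pused = percent_used(usage.total, usage.used)
    return {"path": path, "pused": pused, "above_threshold": pused > threshold}


# Run once before cleanup and once after, then attach both records to the ticket.
print(filesystem_evidence("/"))
```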

## Agent Unreachable

1. Confirm whether data loss affects one host, one proxy, or one network segment.
2. Check proxy queue and last seen timestamp.
3. Validate agent service state and firewall path.
4. For active checks, confirm `ServerActive` and hostname match.
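
The firewall-path check in step 3 can be approximated with a plain TCP connect test. The sketch below assumes the default agent listen port 10050 configured in this repository; `tcp_reachable` is a hypothetical helper, and the local listener exists only so the example runs without a real agent.

```python
import socket


def tcp_reachable(host: str, port: int = 10050, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Demonstration against a throwaway local listener instead of a real agent.
listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
listener.listen(1)
demo_port = listener.getsockname()[1]
print(tcp_reachable("127.0.0.1", demo_port))
listener.close()
```

A connect test towards port 10050 is only meaningful for passive checks, where the server or proxy opens the connection. For active checks the agent initiates the connection, so the equivalent test would run from the monitored host towards the proxy's trapper port (10051 by default).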

## Proxy Backlog

1. Check server reachability from proxy.
2. Check proxy DB filesystem usage.
3. Confirm whether config sync recently changed.
4. Reduce noise by temporarily disabling non-critical discovery rules if required.

## Unsupported Items

1. Identify affected template and item key.
2. Check whether item is Linux-specific or AIX-specific.
3. Validate agent version and custom user parameters.
4. Roll back template change if canary host group is affected.
@@ -0,0 +1,29 @@
# Zabbix Maintenance Runbook

## Server Checks

- Confirm Zabbix server process and web frontend availability.
- Check database health, free space, and slow queries.
- Review cache usage, poller utilization, and housekeeper activity.
- Confirm recent values are arriving for representative Linux and AIX hosts.

## Proxy Checks

- Confirm proxy last seen timestamp.
- Check proxy queue and delayed values.
- Validate proxy database size and filesystem usage.
- Confirm active/passive connectivity based on proxy mode.

## Template Maintenance

- Import templates in a controlled window.
- Watch unsupported items after import.
- Validate a small canary host group before wider rollout.
- Document changed triggers and thresholds.
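
One check that can run before an import window is whether every trigger expression points at an item key the template actually defines. The sketch below assumes the simplified JSON layout used by this repository's template assets (`template`, `items`, `triggers`) and expressions of the form `func(/<template>/<key>...)`. The function names are hypothetical, and the parser deliberately ignores quoted key parameters.

```python
def referenced_keys(expression: str, template: str) -> list[str]:
    """Extract item keys that an expression references under '/<template>/'."""
    prefix = f"/{template}/"
    keys = []
    start = expression.find(prefix)
    while start != -1:
        pos = start + len(prefix)
        depth = 0
        end = pos
        while end < len(expression):
            char = expression[end]
            if char == "[":
                depth += 1
            elif char == "]":
                depth -= 1
            elif char in ",)" and depth == 0:
                break  # the key ends at the first ',' or ')' outside brackets
            end += 1
        keys.append(expression[pos:end])
        start = expression.find(prefix, end)
    return keys


def undefined_trigger_keys(template: dict) -> list[str]:
    """Return keys used in trigger expressions but missing from the item list."""
    defined = {item["key"] for item in template["items"]}
    missing = []
    for trigger in template["triggers"]:
        for key in referenced_keys(trigger["expression"], template["template"]):
            if key not in defined:
                missing.append(key)
    return missing


# Hypothetical template with one trigger pointing at an undefined item.
demo = {
    "template": "Demo",
    "items": [{"key": "proc.num[sshd]"}],
    "triggers": [
        {"expression": "last(/Demo/proc.num[sshd])=0"},
        {"expression": "last(/Demo/vfs.fs.size[/,pused])>85"},
    ],
}
print(undefined_trigger_keys(demo))  # prints ['vfs.fs.size[/,pused]']
```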

## Common Failure Modes

- Agent unreachable: check DNS, firewall, agent service, proxy route.
- Unsupported item: check key spelling, OS capability, agent version, user parameter.
- Proxy backlog: check WAN, DB size, proxy process, server availability.
- Alert noise: review trigger thresholds and dependency design.
@@ -0,0 +1,27 @@
# Zabbix Proxy Design

## Purpose

Zabbix proxies reduce dependency on direct connectivity between the central server and monitored hosts. They are useful for client networks, segmented environments, and remote sites, and they buffer collected data locally while the server is unreachable, for example during maintenance windows.

## Active Proxy

- Proxy connects to the Zabbix server.
- Good for restricted networks where inbound access to the proxy is not allowed.
- Hosts can use active agent checks against the proxy.
- Main operational checks: proxy last seen, delayed values, local DB size, config sync.

## Passive Proxy

- Zabbix server connects to the proxy.
- Useful when central server can reach the proxy network.
- Requires firewall rules from server to proxy.
- Main operational checks: proxy listener, network latency, poller load.
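
The two modes map onto the `ProxyMode` setting rendered by the proxy role, where `0` means active and `1` means passive. The following sketch makes that mapping and the connection direction explicit; the helper names are hypothetical and exist only for illustration.

```python
PROXY_MODE_VALUES = {"active": 0, "passive": 1}


def proxy_mode_value(mode: str) -> int:
    """Translate a mode name into the numeric ProxyMode configuration value."""
    try:
        return PROXY_MODE_VALUES[mode]
    except KeyError:
        raise ValueError(f"mode must be active or passive, got {mode!r}") from None


def connection_initiator(mode: str) -> str:
    """Name the side that opens the TCP connection, which drives firewall rules."""
    return "proxy" if proxy_mode_value(mode) == 0 else "server"


for mode in ("active", "passive"):
    print(mode, proxy_mode_value(mode), connection_initiator(mode))
```

Knowing the initiator tells you which direction the firewall rule has to allow: proxy to server for an active proxy, server to proxy for a passive one.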

## Operational Signals

- Proxy queue growth.
- Unsupported items after template changes.
- Agent unreachable or active checks delayed.
- Config sync failures after maintenance.
@@ -0,0 +1,4 @@
2026-05-04 10:21:14 WARN zbx-proxy-bank01 proxy queue above threshold: 420 delayed values
2026-05-04 10:22:01 HIGH linux-app01 Root filesystem above 85 percent
2026-05-04 10:25:33 INFO linux-app01 filesystem cleanup completed, usage back to 74 percent
2026-05-04 10:30:12 WARN aix-core01 active check delayed, proxy connectivity validated
@@ -0,0 +1,12 @@
[zabbix_server]
zbx-server01 ansible_connection=local

[zabbix_proxy]
zbx-proxy-bank01 ansible_connection=local zabbix_proxy_mode=active
zbx-proxy-bank02 ansible_connection=local zabbix_proxy_mode=passive

[zabbix_agents_linux]
linux-app01 ansible_connection=local zabbix_agent_mode=active

[zabbix_agents_aix]
aix-core01 ansible_connection=local zabbix_agent_mode=active
@@ -0,0 +1,8 @@
---
- name: Configure Zabbix agents
  hosts: zabbix_agents_linux:zabbix_agents_aix
  become: true
  gather_facts: false

  roles:
    - role: zabbix_agent
@@ -0,0 +1,8 @@
---
- name: Configure Zabbix proxy nodes
  hosts: zabbix_proxy
  become: true
  gather_facts: false

  roles:
    - role: zabbix_proxy
@@ -0,0 +1,8 @@
---
- name: Configure Zabbix server control plane
  hosts: zabbix_server
  become: true
  gather_facts: false

  roles:
    - role: zabbix_server
@@ -0,0 +1,7 @@
---
zabbix_agent_server: zbx-proxy-bank01
zabbix_agent_server_active: zbx-proxy-bank01
zabbix_agent_hostname: "{{ inventory_hostname }}"
zabbix_agent_mode: active
zabbix_agent_listen_port: 10050
zabbix_agent_include_dir: /etc/zabbix/zabbix_agentd.d
@@ -0,0 +1,38 @@
---
- name: Validate agent mode
  ansible.builtin.assert:
    that:
      - zabbix_agent_mode in ["active", "passive"]
    fail_msg: "zabbix_agent_mode must be active or passive"

- name: Create Zabbix agent include directory
  ansible.builtin.file:
    path: "{{ zabbix_agent_include_dir }}"
    state: directory
    owner: root
    group: root
    mode: "0755"

- name: Render Zabbix agent configuration example
  ansible.builtin.template:
    src: zabbix_agentd.conf.j2
    dest: /etc/zabbix/zabbix_agentd.conf
    owner: root
    group: root
    mode: "0644"

- name: Render custom OS check keys
  ansible.builtin.template:
    src: os_checks.conf.j2
    dest: "{{ zabbix_agent_include_dir }}/os_checks.conf"
    owner: root
    group: root
    mode: "0644"

- name: Report agent check model
  ansible.builtin.debug:
    msg:
      host: "{{ zabbix_agent_hostname }}"
      mode: "{{ zabbix_agent_mode }}"
      server: "{{ zabbix_agent_server }}"
      server_active: "{{ zabbix_agent_server_active }}"
@@ -0,0 +1,4 @@
UserParameter=os.fs.discovery,echo '{"data":[]}'
UserParameter=os.cpu.runqueue,uptime
UserParameter=os.net.tcp_established,ss -Htan state established | wc -l
UserParameter=os.process.count[*],pgrep -fc "$1"
@@ -0,0 +1,5 @@
Server={{ zabbix_agent_server }}
ServerActive={{ zabbix_agent_server_active }}
Hostname={{ zabbix_agent_hostname }}
ListenPort={{ zabbix_agent_listen_port }}
Include={{ zabbix_agent_include_dir }}/*.conf
@@ -0,0 +1,7 @@
---
zabbix_proxy_server: zbx-server01
zabbix_proxy_hostname: "{{ inventory_hostname }}"
zabbix_proxy_mode: active
zabbix_proxy_database: zabbix_proxy
zabbix_proxy_config_frequency: 60
zabbix_proxy_offline_buffer_hours: 24
@@ -0,0 +1,31 @@
---
- name: Validate proxy mode
  ansible.builtin.assert:
    that:
      - zabbix_proxy_mode in ["active", "passive"]
    fail_msg: "zabbix_proxy_mode must be active or passive"

- name: Create Zabbix proxy config directory
  ansible.builtin.file:
    path: /etc/zabbix
    state: directory
    owner: root
    group: root
    mode: "0755"

- name: Render Zabbix proxy configuration example
  ansible.builtin.template:
    src: zabbix_proxy.conf.j2
    dest: /etc/zabbix/zabbix_proxy.conf
    owner: root
    group: root
    mode: "0644"

- name: Report proxy operating model
  ansible.builtin.debug:
    msg:
      proxy: "{{ zabbix_proxy_hostname }}"
      server: "{{ zabbix_proxy_server }}"
      mode: "{{ zabbix_proxy_mode }}"
      active_checks: "{{ zabbix_proxy_mode == 'active' }}"
      offline_buffer_hours: "{{ zabbix_proxy_offline_buffer_hours }}"
@@ -0,0 +1,6 @@
Server={{ zabbix_proxy_server }}
Hostname={{ zabbix_proxy_hostname }}
ProxyMode={{ 0 if zabbix_proxy_mode == 'active' else 1 }}
DBName={{ zabbix_proxy_database }}
ConfigFrequency={{ zabbix_proxy_config_frequency }}
ProxyOfflineBuffer={{ zabbix_proxy_offline_buffer_hours }}
@@ -0,0 +1,7 @@
---
zabbix_server_listen_port: 10051
zabbix_server_database: zabbix
zabbix_server_housekeeping_frequency: 1
zabbix_server_cache_size: 256M
zabbix_server_trend_retention_days: 365
zabbix_server_history_retention_days: 90
@@ -0,0 +1,25 @@
---
- name: Create Zabbix server config directory
  ansible.builtin.file:
    path: /etc/zabbix
    state: directory
    owner: root
    group: root
    mode: "0755"

- name: Render Zabbix server configuration example
  ansible.builtin.template:
    src: zabbix_server.conf.j2
    dest: /etc/zabbix/zabbix_server.conf
    owner: root
    group: root
    mode: "0644"

- name: Report Zabbix server maintenance settings
  ansible.builtin.debug:
    msg:
      listen_port: "{{ zabbix_server_listen_port }}"
      cache_size: "{{ zabbix_server_cache_size }}"
      housekeeping_frequency: "{{ zabbix_server_housekeeping_frequency }}"
      history_retention_days: "{{ zabbix_server_history_retention_days }}"
      trend_retention_days: "{{ zabbix_server_trend_retention_days }}"
@@ -0,0 +1,5 @@
ListenPort={{ zabbix_server_listen_port }}
DBName={{ zabbix_server_database }}
CacheSize={{ zabbix_server_cache_size }}
HousekeepingFrequency={{ zabbix_server_housekeeping_frequency }}
HistoryStorageDateIndex=1
@@ -0,0 +1,13 @@
{
  "host": "aix-core01",
  "proxy": "zbx-proxy-bank01",
  "mode": "active",
  "checks": {
    "aix.fs.root.pused": 68.2,
    "aix.cpu.user": 17.5,
    "aix.memory.free_mb": 8192,
    "aix.net.errin": 0,
    "aix.process.count[cron]": 1
  },
  "note": "Sample payload for review; AIX runtime is not emulated locally."
}
@@ -0,0 +1,12 @@
{
  "host": "linux-app01",
  "proxy": "zbx-proxy-bank01",
  "mode": "active",
  "checks": {
    "vfs.fs.size[/,pused]": 72.4,
    "system.cpu.util[,idle]": 83.1,
    "vm.memory.size[pavailable]": 41.7,
    "net.if.in[eth0]": 184320,
    "proc.num[sshd]": 2
  }
}
@@ -0,0 +1,49 @@
#!/usr/bin/env python3
"""Validate Zabbix portfolio template and sample assets."""

import json
from pathlib import Path


ROOT = Path(__file__).resolve().parents[1]


def load_json(path: Path) -> dict:
    with path.open(encoding="utf-8") as handle:
        return json.load(handle)


def validate_template(path: Path) -> None:
    data = load_json(path)
    for field in ["template", "items", "triggers"]:
        if field not in data:
            raise ValueError(f"{path}: missing {field}")
    if not data["items"]:
        raise ValueError(f"{path}: template must define at least one item")
    for item in data["items"]:
        for field in ["key", "name", "type", "value_type"]:
            if field not in item:
                raise ValueError(f"{path}: item missing {field}")


def validate_sample(path: Path) -> None:
    data = load_json(path)
    for field in ["host", "proxy", "mode", "checks"]:
        if field not in data:
            raise ValueError(f"{path}: missing {field}")
    if data["mode"] not in ["active", "passive"]:
        raise ValueError(f"{path}: mode must be active or passive")
    if not data["checks"]:
        raise ValueError(f"{path}: checks cannot be empty")


def main() -> None:
    for path in sorted((ROOT / "templates").glob("*.json")):
        validate_template(path)
    for path in sorted((ROOT / "samples").glob("*.json")):
        validate_sample(path)
    print("Zabbix template and sample assets are valid")


if __name__ == "__main__":
    main()
@@ -0,0 +1,16 @@
{
  "template": "Template OS AIX - Portfolio Simple Checks",
  "groups": ["Templates/Operating systems"],
  "items": [
    {"key": "aix.fs.root.pused", "name": "AIX root filesystem usage percent", "type": "ZABBIX_AGENT_ACTIVE", "value_type": "FLOAT", "units": "%"},
    {"key": "aix.cpu.user", "name": "AIX CPU user percent", "type": "ZABBIX_AGENT_ACTIVE", "value_type": "FLOAT", "units": "%"},
    {"key": "aix.memory.free_mb", "name": "AIX free memory MB", "type": "ZABBIX_AGENT_ACTIVE", "value_type": "UNSIGNED", "units": "MB"},
    {"key": "aix.net.errin", "name": "AIX network input errors", "type": "ZABBIX_AGENT_ACTIVE", "value_type": "UNSIGNED"},
    {"key": "aix.process.count[cron]", "name": "AIX cron process count", "type": "ZABBIX_AGENT_ACTIVE", "value_type": "UNSIGNED"}
  ],
  "triggers": [
    {"name": "AIX root filesystem above 85 percent", "expression": "last(/Template OS AIX - Portfolio Simple Checks/aix.fs.root.pused)>85"},
    {"name": "AIX cron is not running", "expression": "last(/Template OS AIX - Portfolio Simple Checks/aix.process.count[cron])=0"}
  ],
  "notes": "AIX checks are represented as template keys and sample data. They are not executed locally in this repository."
}
@@ -0,0 +1,16 @@
{
  "template": "Template OS Linux - Portfolio Simple Checks",
  "groups": ["Templates/Operating systems"],
  "items": [
    {"key": "vfs.fs.size[/,pused]", "name": "Root filesystem usage percent", "type": "ZABBIX_AGENT", "value_type": "FLOAT", "units": "%"},
    {"key": "system.cpu.util[,idle]", "name": "CPU idle percent", "type": "ZABBIX_AGENT", "value_type": "FLOAT", "units": "%"},
    {"key": "vm.memory.size[pavailable]", "name": "Available memory percent", "type": "ZABBIX_AGENT", "value_type": "FLOAT", "units": "%"},
    {"key": "net.if.in[eth0]", "name": "Network inbound on eth0", "type": "ZABBIX_AGENT", "value_type": "UNSIGNED", "units": "bps"},
    {"key": "proc.num[sshd]", "name": "sshd process count", "type": "ZABBIX_AGENT", "value_type": "UNSIGNED"}
  ],
  "triggers": [
    {"name": "Root filesystem above 85 percent", "expression": "last(/Template OS Linux - Portfolio Simple Checks/vfs.fs.size[/,pused])>85"},
    {"name": "Low available memory", "expression": "last(/Template OS Linux - Portfolio Simple Checks/vm.memory.size[pavailable])<10"},
    {"name": "sshd is not running", "expression": "last(/Template OS Linux - Portfolio Simple Checks/proc.num[sshd])=0"}
  ]
}