# Production Operations Cheatsheet

Operational quick reference for Linux/Unix infrastructure work. Prefer read-only checks first. Record pre-change state, scope the blast radius, execute minimally, and validate after every change.

## Linux / Unix Daily Operations

### Uptime and Host State

Check host age, kernel, clock, and recent reboot history before touching anything:

```bash
uptime
uname -r
hostnamectl
timedatectl
who -b
last -x | head -20
```

Pre-check pattern:

```bash
date -u
uptime
df -h
free -m
systemctl --failed
```

### Process Management

```bash
ps -ef | head
ps -eo pid,ppid,user,%cpu,%mem,etime,cmd --sort=-%cpu | head -20
pgrep -a java
pstree -ap | less
pidof sshd
renice +5 -p <pid>
kill -TERM <pid>
kill -9 <pid>   # DANGEROUS: last resort only
```

Validation:

```bash
ps -p <pid> -o pid,stat,etime,cmd
journalctl -u <service> -n 50 --no-pager
```
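
The TERM-then-KILL escalation above can be wrapped in a small helper so that `kill -9` is only reached after a bounded wait. This is a minimal sketch, not part of the original cheatsheet: the name `stop_gracefully` and the 10-second default are assumptions chosen for illustration.

```bash
# Hypothetical helper: send SIGTERM, wait up to <timeout> seconds, then SIGKILL.
# Returns 0 if the process exited after SIGTERM (or was already gone),
# 1 if SIGKILL had to be sent.
stop_gracefully() {
  local pid=$1 timeout=${2:-10} waited=0

  # kill -0 sends no signal; it only tests whether the PID can be signalled.
  kill -0 "$pid" 2>/dev/null || return 0

  kill -TERM "$pid" 2>/dev/null
  while kill -0 "$pid" 2>/dev/null; do
    if (( waited >= timeout )); then
      echo "pid ${pid} ignored SIGTERM for ${timeout}s; sending SIGKILL" >&2
      kill -KILL "$pid" 2>/dev/null
      return 1
    fi
    sleep 1
    waited=$(( waited + 1 ))
  done
  return 0
}

# Example: stop_gracefully 12345 30
```

A return code of 1 is worth recording in the change log, since a process that needed SIGKILL had no chance to flush buffers or release locks.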

### systemctl

```bash
systemctl status <service> --no-pager -l
systemctl is-active <service>
systemctl is-enabled <service>
systemctl list-units --type=service --state=running
systemctl list-units --failed
systemctl daemon-reload
systemctl restart <service>   # impact: confirm service interruption policy first
```

### journalctl

```bash
journalctl -u <service> -n 100 --no-pager
journalctl -u <service> --since '30 min ago'
journalctl -p err -S today
journalctl -k -b
journalctl --disk-usage
```

### Service Troubleshooting Flow

1. Confirm service state and recent restart count.
2. Read the last 100-200 journal lines.
3. Validate config syntax before restart if the daemon supports it.
4. Check dependent ports, mounts, credentials, and name resolution.
5. Restart only after the cause is understood or a rollback exists.

Example:

```bash
systemctl status nginx --no-pager -l
journalctl -u nginx -n 100 --no-pager
nginx -t
ss -ltnp | grep -E ':(80|443)[[:space:]]'   # exact ports; avoids matching :8080 or :4430
curl -kI https://127.0.0.1/
```
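
Step 3 of the flow (validate config before restart) can be enforced rather than remembered. The sketch below is an illustration under assumptions: `restart_if_valid` is a hypothetical name, and the `SYSTEMCTL` variable is an invented override hook so the wrapper can be rehearsed with `echo` instead of touching a real unit.

```bash
# Hypothetical wrapper: restart a unit only if its config validator exits 0.
# SYSTEMCTL is an assumed override hook (for example SYSTEMCTL=echo) for rehearsal.
restart_if_valid() {
  local service=$1
  shift
  if (( $# == 0 )); then
    echo "usage: restart_if_valid <service> <validator> [args...]" >&2
    return 2
  fi
  # Run the validator exactly as given; any non-zero exit blocks the restart.
  if ! "$@"; then
    echo "validation failed; ${service} was not restarted" >&2
    return 1
  fi
  "${SYSTEMCTL:-systemctl}" restart "$service"
}

# Example: restart_if_valid nginx nginx -t
```

The validator is passed as separate arguments rather than a string, so no `eval` is needed and quoting survives intact.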

### CPU and Memory Diagnostics

```bash
uptime
top -H -b -n 1 | head -40
pidstat 1 5
pidstat -ru -p ALL 1 3
vmstat 1 5
iostat -xz 1 5
free -m
sar -q 1 5
```

Quick interpretation:

- high `%wa`: storage path or NFS issue
- high run queue with low CPU idle: CPU contention
- swap growth plus page scans: memory pressure
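
The three rules of thumb above can be applied mechanically to `vmstat` output. The filter below is a rough sketch, not a tuned detector: the function name `vmstat_flags`, the default iowait threshold of 20, and the "idle below 10" cutoff are all assumptions for illustration. It locates columns by header name because newer `vmstat` builds add columns.

```bash
# Hypothetical filter: reads `vmstat` output on stdin and prints one line per
# condition that a sample crosses. Thresholds are illustrative, not tuned values.
# Usage: vmstat 1 5 | vmstat_flags [wa_max] [cpu_count]
vmstat_flags() {
  awk -v wa_max="${1:-20}" -v ncpu="${2:-1}" '
    $1 == "r" && $2 == "b" {                     # header row: map column names to positions
      for (i = 1; i <= NF; i++) col[$i] = i
      next
    }
    !("wa" in col) || $1 !~ /^[0-9]+$/ { next }  # skip the banner and any non-data line
    {
      r = $(col["r"]); id = $(col["id"]); wa = $(col["wa"])
      si = $(col["si"]); so = $(col["so"])
      if (wa + 0 > wa_max + 0)             print "line " NR ": high iowait wa=" wa
      if (si + so > 0)                     print "line " NR ": swapping si=" si " so=" so
      if (r + 0 > ncpu + 0 && id + 0 < 10) print "line " NR ": cpu contention r=" r " id=" id
    }
  '
}

# Example: vmstat 1 5 | vmstat_flags 20 "$(nproc)"
```

The first data row `vmstat` prints is an average since boot, so a flag on that row alone says little about the current state.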

### Disk Usage

```bash
df -hT
du -xhd1 /var | sort -h
find /var/log -type f -size +500M -ls | sort -k7,7n
lsof +L1
```

### Inode Exhaustion

```bash
df -ih
find /var -xdev -type f | cut -d/ -f1-3 | sort | uniq -c | sort -n
find /tmp -xdev -type f | wc -l
```

### Mounts

```bash
mount | column -t
findmnt
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
cat /etc/fstab
mount -a   # can expose bad fstab entries; use in change window
```

### Permissions

```bash
namei -l /path/to/file
stat /path/to/file
getfacl /path/to/file
chmod 640 /path/to/file
chown root:app /path/to/file
```

### SELinux

State and mode:

```bash
getenforce
sestatus
cat /etc/selinux/config
```

Check file, process, and port context:

```bash
ls -Zd /var/www/html
ls -lZ /var/www/html/index.html
ps -eZ | grep nginx
id -Z
semanage port -l | grep http
```

Audit and denial review:

```bash
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts today | audit2why
journalctl -t setroubleshoot --since '1 hour ago'
sealert -a /var/log/audit/audit.log
```

Typical flow:

1. Confirm SELinux mode is `Enforcing` or `Permissive`.
2. Identify the failing path, process domain, and target context.
3. Read AVC denials before changing labels or booleans.
4. Prefer persistent policy-aligned fixes over `chcon`.
5. Restore default labels and retest service path.

Modify and restore context:

```bash
chcon -t httpd_sys_content_t /srv/app/index.html      # temporary until relabel/restore
chcon -R -t httpd_sys_rw_content_t /srv/app/uploads   # temporary until relabel/restore
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
restorecon -Rv /srv/app
matchpathcon /srv/app/uploads/file.txt
```

Booleans and validation:

```bash
getsebool -a | grep httpd
getsebool httpd_can_network_connect
setsebool -P httpd_can_network_connect on
runcon -t httpd_t -- id -Z
```

Notes:

- prefer `semanage fcontext` plus `restorecon` for persistent fixes
- use `chcon` only as a short-lived diagnostic or emergency workaround
- avoid generating local policy modules from `audit2allow` until root cause is understood
- after context changes, validate service startup, AVC silence, and application path access

### Archives

```bash
tar tf backup.tar | head
tar czf logs-$(date +%F).tgz /var/log/app
tar xzf bundle.tgz -C /restore/path
gzip -t file.gz
```

### File Operations

```bash
cp -a source/ target/
rsync -aHAXvn /src/ /dst/
rsync -aHAX --delete --info=progress2 /src/ /dst/   # impact: verify source/destination twice
mv file file.$(date +%F-%H%M%S).bak
sha256sum file
```

## Text Processing & Regex

### Core Tools

```bash
grep -n 'ERROR' app.log
grep -E 'ERROR|WARN' app.log
grep -P '^\d{4}-\d{2}-\d{2}T' app.log
awk '{print $1,$4,$5}' access.log
awk -F, 'NR==1 || $3 ~ /failed/' report.csv
sed -n '1,20p' file
sed -E 's/[[:space:]]+/ /g' file
cut -d: -f1,7 /etc/passwd
sort file | uniq -c | sort -nr
xargs -r -n1 systemctl status < service-list.txt
jq '.items[] | {name: .metadata.name, phase: .status.phase}' pods.json
```

### Regex Reference

```text
IPv4                  \b(?:\d{1,3}\.){3}\d{1,3}\b
ISO timestamp         \b\d{4}-\d{2}-\d{2}[T ][0-2]\d:[0-5]\d:[0-5]\d(?:Z|[+-][0-2]\d:?[0-5]\d)?\b
UUID                  \b[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[1-5][0-9a-fA-F]{3}-[89abAB][0-9a-fA-F]{3}-[0-9a-fA-F]{12}\b
Log level             \b(?:ERROR|WARN|INFO)\b
Failed SSH            Failed password for (?:invalid user )?(\S+) from ((?:\d{1,3}\.){3}\d{1,3})
Ansible changed/fail  ^(changed|fatal|failed):\s+\[[^]]+\]
```
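
The IPv4 pattern in the reference is a shape match only: it also accepts out-of-range octets such as `999.1.1.1`. Where a strict check matters, for example before feeding extracted addresses to a firewall rule, a range test can follow the regex. The helper below is a sketch; `is_ipv4` is a hypothetical name, not something defined elsewhere in this cheatsheet.

```bash
# Hypothetical validator: succeeds only for dotted-quad strings whose four
# octets are each in the range 0-255.
is_ipv4() {
  local ip=$1 octet
  [[ $ip =~ ^([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})\.([0-9]{1,3})$ ]] || return 1
  # BASH_REMATCH[1..4] hold the captured octets.
  for octet in "${BASH_REMATCH[@]:1}"; do
    # 10# forces base 10 so a leading zero (e.g. "08") is not parsed as octal.
    (( 10#$octet <= 255 )) || return 1
  done
}

# Example: is_ipv4 "$candidate" && echo "valid: $candidate"
```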

### Log Parsing Examples

IP extraction:

```bash
grep -oP '\b(?:\d{1,3}\.){3}\d{1,3}\b' access.log | sort | uniq -c | sort -nr | head
```

Timestamp filter:

```bash
grep -P '^\d{4}-\d{2}-\d{2}T\d{2}:' app.log
```

UUID extraction:

```bash
grep -oEi '[0-9a-f]{8}-[0-9a-f]{4}-[1-5][0-9a-f]{3}-[89ab][0-9a-f]{3}-[0-9a-f]{12}' app.log | sort -u
```

ERROR/WARN/INFO parsing:

```bash
grep -Eo '\b(ERROR|WARN|INFO)\b' app.log | sort | uniq -c
```

Failed SSH login parsing:

```bash
# sshd lines end in "... for [invalid user] <user> from <ip> port <n> ssh2",
# so counting from the end gives user = NF-5 and source IP = NF-3.
grep 'Failed password' /var/log/secure \
  | awk '{print $(NF-5), $(NF-3)}' \
  | sort | uniq -c | sort -nr | head
```

Extract fields from logs:

```bash
awk -F'|' '/ERROR/ {print $1,$3,$5}' app.log
```

Filter Ansible output:

```bash
grep -E '^(TASK|changed:|ok:|fatal:|failed:|skipping:)' ansible.log
grep -E '^fatal:|^failed:' ansible.log
```

## Incident Response

### Disk Full

Workflow:

```bash
df -hT
df -ih
findmnt
du -xhd1 /var | sort -h
find /var -xdev -type f -size +1G -ls | sort -k7,7n
lsof +L1
journalctl --disk-usage
```

Typical branches:

- filesystem full: identify growth path, compress/rotate/archive, validate app behavior
- inode full: remove file storms, spool buildup, temp-file leaks
- deleted open files: restart offender only after sizing impact

Post-check:

```bash
df -hT
df -ih
systemctl --failed
```
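
For the pre- and post-check, it can help to reduce `df` output to just the mounts that are over a limit. The filter below is a minimal sketch: `over_threshold` is a hypothetical name and the default of 90 percent is an assumption. It expects the POSIX layout from `df -P` (or `df -Pi` for inodes on GNU systems), where the percentage is the fifth column, and it does not handle mount points that contain spaces.

```bash
# Hypothetical filter: reads `df -P` (or `df -Pi`) output on stdin and prints
# "<mountpoint> <percent>" for every row at or above the limit (default 90).
over_threshold() {
  awk -v limit="${1:-90}" '
    NR == 1 { next }                # skip the header row
    {
      pct = $5
      sub(/%$/, "", pct)            # "95%" -> "95"
      # Rows that report "-" (no inode or capacity data) are ignored.
      if (pct ~ /^[0-9]+$/ && pct + 0 >= limit + 0) print $6, $5
    }
  '
}

# Example: df -P | over_threshold 85
# Example: df -Pi | over_threshold 85
```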

### High CPU

```bash
uptime
mpstat -P ALL 1 5
pidstat -u -p ALL 1 5
top -H -b -n 1 | head -40
ps -eo pid,ppid,ni,psr,%cpu,cmd --sort=-%cpu | head -20
```

Flow:

1. Confirm sustained load, not a short spike.
2. Separate user CPU vs system CPU vs I/O wait.
3. Identify hot process and hot threads.
4. Correlate with deploys, cron, backups, or JVM GC.
5. Throttle, stop, or fail over only with service impact understood.

### Memory Pressure

```bash
free -m
vmstat 1 5
sar -r 1 5
ps -eo pid,user,%mem,rss,vsz,cmd --sort=-rss | head -20
dmesg -T | grep -Ei 'oom|killed process'   # grep -E replaces the obsolescent egrep
```

Flow:

1. Check swap growth and page scan rates.
2. Identify top RSS owners.
3. Check kernel logs for OOM.
4. Validate cache vs real process growth.
5. Restart leaking service only after capturing evidence.

### Failed Service

```bash
systemctl status <service> --no-pager -l
journalctl -u <service> -b --no-pager | tail -100
systemctl show <service> -p ExecStart -p FragmentPath -p ActiveEnterTimestamp
```

Flow:

1. Validate config.
2. Validate credentials, ports, mounts, permissions.
3. Confirm dependency availability.
4. Restart and recheck logs immediately.

### SELinux Denials

Typical case: service works in `Permissive`, fails in `Enforcing`, or logs show `permission denied` while UNIX permissions look correct.

Triage:

```bash
getenforce
sestatus
ausearch -m AVC,USER_AVC,SELINUX_ERR -ts recent
ausearch -m AVC -ts recent | audit2why
journalctl -t setroubleshoot --since '30 min ago'
systemctl status <service> --no-pager -l
ps -eZ | grep <service>
ls -lZ /path/to/app /path/to/app/*
```

Flow:

1. Confirm the failure is current and reproducible.
2. Identify the denied process domain, target path, and requested access from AVC logs.
3. Validate expected default context with `matchpathcon`.
4. Check for mislabeled files, wrong port types, or missing SELinux booleans.
5. Apply the smallest persistent fix, then retest in `Enforcing`.
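
Step 2 of the flow is easier when each denial is reduced to the handful of fields that matter. The sketch below assumes raw (uninterpreted) audit records as printed by `ausearch` without `-i`; the function name `avc_summary` is hypothetical, and command names containing spaces are not handled.

```bash
# Hypothetical summarizer: reads raw audit records on stdin and prints
# "comm permissions source_type target_type class" for each AVC denial.
avc_summary() {
  awk '
    /avc: +denied/ {
      perms = $0
      sub(/^[^{]*[{] */, "", perms)   # drop everything up to "{ "
      sub(/ *[}].*$/, "", perms)      # drop " }" and the rest of the record
      gsub(/ +/, ",", perms)          # "read write" -> "read,write"

      comm = stype = ttype = class = "?"
      for (i = 1; i <= NF; i++) {
        if ($i ~ /^comm=/)     { comm = substr($i, 6); gsub(/"/, "", comm) }
        if ($i ~ /^scontext=/) { split(substr($i, 10), s, ":"); stype = s[3] }
        if ($i ~ /^tcontext=/) { split(substr($i, 10), t, ":"); ttype = t[3] }
        if ($i ~ /^tclass=/)   { class = substr($i, 8) }
      }
      print comm, perms, stype, ttype, class
    }
  '
}

# Example: ausearch -m AVC -ts recent | avc_summary | sort | uniq -c | sort -nr
```

A repeated row such as `nginx read httpd_t default_t file` points at a labeling problem on the target rather than a missing policy rule, which is the case `restorecon` is meant for.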

Common fixes:

```bash
matchpathcon /srv/app/config.yml
restorecon -Rv /srv/app
semanage fcontext -a -t httpd_sys_content_t '/srv/app(/.*)?'
semanage fcontext -a -t httpd_sys_rw_content_t '/srv/app/uploads(/.*)?'
semanage port -l | grep http
getsebool -a | grep httpd
setsebool -P httpd_can_network_connect on
```

Validation:

```bash
getenforce
systemctl restart <service>
systemctl status <service> --no-pager -l
ausearch -m AVC -ts recent
curl -fsS http://127.0.0.1:<port>/health
```

Operational notes:

- do not leave systems in `Permissive` as the fix
- prefer `restorecon` and `semanage fcontext` over repeated `chcon`
- treat `audit2allow` output as investigation material, not automatic remediation
- if policy changes are unavoidable, document exact AVC evidence and rollback path

### SSL Issues

```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl x509 -in cert.pem -noout -subject -issuer -dates -ext subjectAltName
curl -vkI https://host/
```

Check for:

- expired certificate
- missing SAN
- incomplete chain
- hostname mismatch
- TLS version or cipher mismatch

### DNS Issues

```bash
dig +short app.example.com
dig @<resolver> app.example.com
dig +trace app.example.com
getent hosts app.example.com
resolvectl status
```

Flow:

1. Compare resolver result with authoritative result.
2. Check TTL and stale cache.
3. Validate `/etc/resolv.conf`, local resolver, and search domains.
4. Test from affected host and unaffected host.

### Network Issues

```bash
ip addr
ip route
ss -tulpen
tcpdump -ni any host <peer> and port <port>
curl -sv http://host:port/health
mtr -rwzc 20 host
```

Flow:

1. Interface/link state.
2. Route and source IP selection.
3. Listening socket on target.
4. Firewall and security controls.
5. Packet capture if app logs are inconclusive.

### JVM / Tomcat Issues

```bash
ps -ef | grep -i tomcat
jcmd <pid> VM.flags
jstat -gcutil <pid> 1000 10
jstack <pid> | head -100
ss -ltnp | grep java
tail -100 /opt/tomcat/logs/catalina.out
```

Focus:

- stuck threads
- full GC loops
- heap exhaustion
- connector bind failures
- slow backend dependency

### Certificate Expiration

```bash
echo | openssl s_client -connect host:443 -servername host 2>/dev/null \
  | openssl x509 -noout -enddate

openssl x509 -checkend 2592000 -noout -in cert.pem
```
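
`-checkend` takes seconds (2592000 above is 30 days) and exits 1 when the certificate will expire inside that window, which reads backwards in an `if`. The wrapper below is a small sketch that takes days and inverts the exit code; `cert_expires_within` is a hypothetical name. Note the assumption that an unreadable or malformed file is also reported as "expiring", which errs on the side of raising attention.

```bash
# Hypothetical helper: succeeds (exit 0) when the certificate in <pem> expires
# within <days> days. `openssl x509 -checkend N` exits 1 in that case, so the
# status is inverted here. A file openssl cannot parse also yields exit 0.
cert_expires_within() {
  local pem=$1 days=$2
  ! openssl x509 -checkend $(( days * 86400 )) -noout -in "$pem" >/dev/null 2>&1
}

# Example:
#   if cert_expires_within cert.pem 30; then
#     echo "cert.pem needs renewal within 30 days"
#   fi
```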

### Suspicious Login Attempts

```bash
last -ai | head -30
lastb -ai | head -30
grep 'Failed password' /var/log/secure | tail -50
grep 'Accepted ' /var/log/secure | tail -50
ausearch -m USER_LOGIN -ts recent
```

Workflow:

1. Identify source IPs and usernames.
2. Validate whether attempts are expected from bastions/scanners.
3. Check successful logins from same sources.
4. Review sudo usage and persistence changes.
5. Preserve logs before cleanup or rotation.

## Networking Operations

```bash
ip -br addr
ip route get 8.8.8.8
ss -ltnp
ss -tn state established '( sport = :443 or dport = :443 )'
tcpdump -ni eth0 port 53
dig +short mx example.com
curl -sS -o /dev/null -w '%{http_code} %{time_total}\n' https://host/health
mtr -rwzc 10 host
traceroute -T -p 443 host
openssl s_client -connect host:443 -servername host </dev/null
```

## Storage Operations

### Block and Filesystem Discovery

```bash
lsblk -f
blkid
findmnt
cat /proc/partitions
multipath -ll
```

### LVM

```bash
pvs
vgs
lvs -a -o +devices
pvdisplay /dev/sdX
vgdisplay <vg>
lvdisplay /dev/<vg>/<lv>
```

Growth example:

```bash
pvcreate /dev/mapper/mpatha              # impact: writes LVM metadata to the device
vgextend vgdata /dev/mapper/mpatha       # impact: changes VG layout
lvextend -L +100G -r /dev/vgdata/lvapp   # -r also grows the filesystem
```

### XFS

```bash
xfs_info /mountpoint
xfs_repair -n /dev/mapper/vg-lv   # no-modify check; unmount the filesystem first
xfs_growfs /mountpoint
```

### ext4

```bash
tune2fs -l /dev/mapper/vg-lv | head -40
e2fsck -fn /dev/mapper/vg-lv      # read-only check; unreliable on a mounted filesystem
resize2fs /dev/mapper/vg-lv
```

### Multipath

```bash
multipath -ll
lsblk -S
udevadm info --query=all --name=/dev/mapper/mpatha | head -40
```

### NFS

```bash
showmount -e nfs-server
nfsstat -m
mount | grep nfs
rpcinfo -p nfs-server
```

### iSCSI

```bash
iscsiadm -m session
iscsiadm -m node
iscsiadm -m discovery -t sendtargets -p <target-ip>
```

### Mount Troubleshooting

```bash
findmnt /mountpoint
mount -v /mountpoint
dmesg -T | tail -50
journalctl -k -n 100 --no-pager
```

Check:

- device path stable
- UUID correct
- filesystem type correct
- multipath settled
- network and RPC available for NFS

### Filesystem Validation

```bash
findmnt -no SOURCE,TARGET,FSTYPE,OPTIONS /data
df -hT /data
touch /data/.write-test && rm -f /data/.write-test
```

### Migration Validation Example

```bash
findmnt /data
df -hT /data
rsync -aHAXvn /olddata/ /data/
rsync -aHAXc --delete --dry-run /olddata/ /data/
sha256sum /olddata/keyfile /data/keyfile
```
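
The `sha256sum` spot check above still relies on comparing two long hex strings by eye. For more than one or two files, a helper can do the comparison and return a usable exit code. This is a sketch under assumptions: `compare_checksums` is a hypothetical name, and the file list is whatever key paths the operator chooses to sample.

```bash
# Hypothetical helper: compare SHA-256 digests of the same relative paths under
# two trees. Prints one status line per path; returns 1 on any mismatch or
# missing file, 0 if every listed path matches.
# Usage: compare_checksums <src_root> <dst_root> <relative_path>...
compare_checksums() {
  local src=$1 dst=$2 rel a b rc=0
  shift 2
  for rel in "$@"; do
    if [[ ! -f $src/$rel || ! -f $dst/$rel ]]; then
      echo "MISSING  $rel"
      rc=1
      continue
    fi
    # Reading from stdin keeps the file name out of the digest line,
    # so the two outputs are directly comparable.
    a=$(sha256sum < "$src/$rel")
    b=$(sha256sum < "$dst/$rel")
    if [[ $a == "$b" ]]; then
      echo "OK       $rel"
    else
      echo "MISMATCH $rel"
      rc=1
    fi
  done
  return "$rc"
}

# Example: compare_checksums /olddata /data keyfile conf/app.yml
```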

## AIX Operations

```bash
oslevel -s
errpt | head
errpt -a | more
topas
lsvg -o
lsvg rootvg
lslpp -L | grep -i openssl
svmon -G
svmon -P <pid>
netstat -rn
```

## SSL/TLS Operations

### OpenSSL Checks

```bash
openssl version -a
openssl x509 -in cert.pem -noout -text | less
openssl rsa -in key.pem -check -noout   # -noout keeps the private key off the terminal
openssl verify -CAfile chain.pem cert.pem
```

### Expiration Validation

```bash
openssl x509 -enddate -noout -in cert.pem
openssl x509 -checkend 604800 -noout -in cert.pem
```

### keytool Basics

```bash
keytool -list -v -keystore keystore.jks
keytool -list -cacerts | grep -i <alias>
keytool -importcert -alias app-cert -file cert.pem -keystore keystore.jks
```

### Chain Validation

```bash
openssl s_client -connect host:443 -servername host -showcerts </dev/null
openssl verify -untrusted intermediate.pem -CAfile root.pem server.pem
```

## Automation Operations

### Bash Safety Patterns

```bash
set -euo pipefail
IFS=$'\n\t'
trap 'echo "line ${LINENO}: command failed" >&2' ERR
trap 'rm -f "${tmpfile:-}"' EXIT
```

Safe loop examples:

```bash
# ssh -n keeps ssh from consuming the rest of hostlist.txt on stdin,
# which would otherwise end the loop after the first host.
while IFS= read -r host; do
  ssh -n "$host" uptime
done < hostlist.txt

# gzip -t tests compressed archives, so match rotated *.gz files.
find /var/log -type f -name '*.gz' -print0 \
  | while IFS= read -r -d '' file; do
      gzip -t "$file"
    done
```

Operational scripting patterns:

- default to read-only mode
- require explicit `--execute` for changes
- log actions with timestamps
- validate dependencies with `command -v`
- use temp files with `mktemp`
- guard destructive paths and empty variables
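
Several of the patterns in this list fit together in one small skeleton. The following is a sketch, not a production script: every function name (`log`, `require`, `run`, `purge_dir`, `main`), the `TARGET_DIR` variable, and the `*.tmp` cleanup task are assumptions invented for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical skeleton: dry-run by default, changes only with --execute.

EXECUTE=0

# Timestamped log line on stderr, in UTC.
log() { printf '%s %s\n' "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$*" >&2; }

# Fail early if a dependency is missing.
require() {
  local cmd
  for cmd in "$@"; do
    command -v "$cmd" >/dev/null 2>&1 || { log "missing dependency: $cmd"; return 1; }
  done
}

# Run the given command only when --execute was passed; otherwise just log it.
run() {
  if (( EXECUTE )); then
    log "RUN $*"
    "$@"
  else
    log "DRY-RUN $*"
  fi
}

# Guard against an empty variable or the root directory before deleting.
purge_dir() {
  local dir=${1:-}
  if [[ -z $dir || $dir == / ]]; then
    log "refusing to purge '${dir}'"
    return 1
  fi
  run find "$dir" -xdev -type f -name '*.tmp' -delete
}

main() {
  local arg
  for arg in "$@"; do
    if [[ $arg == --execute ]]; then EXECUTE=1; fi
  done
  require find date || return 1
  purge_dir "${TARGET_DIR:-}"
}

# main "$@"   # uncomment when saved as a standalone script
```

Run without arguments, the script only logs what it would do; `TARGET_DIR=/var/tmp/app ./script.sh --execute` would perform the deletion. Keeping the mutation behind a single `run` function makes the dry-run path hard to bypass by accident.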

## Ansible Operations

### Execution

```bash
ansible-inventory -i inventory/hosts.yml --graph
ansible-inventory -i inventory/hosts.yml --list | jq '.'
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --syntax-check
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --check --diff
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --limit web01
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --tags packages
ansible-playbook -i inventory/hosts.yml playbooks/site.yml --start-at-task 'Restart nginx'
```

### Safe Rollout Workflow

1. Validate inventory and variable targeting.
2. Run syntax-check.
3. Run `--check --diff` on a single host.
4. Execute against one host or one tier.
5. Validate service health, logs, and config.
6. Expand rollout only after post-check passes.
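
The read-only part of this workflow (steps 1 to 3) can be chained so that a failure at any step stops the sequence before anything is applied. The wrapper below is a sketch: `rollout_preflight` is a hypothetical name and `ANSIBLE_PLAYBOOK` is an invented override hook for rehearsing with `echo`. It deliberately stops short of step 4 so that a person reviews the `--diff` output before applying.

```bash
# Hypothetical preflight: resolved host list, syntax check, then check-mode
# diff on a single canary host. Nothing here changes the target.
# ANSIBLE_PLAYBOOK is an assumed override hook (for example "echo").
# Usage: rollout_preflight <inventory> <playbook> <canary_host>
rollout_preflight() {
  local inventory=$1 playbook=$2 canary=$3
  local ap=${ANSIBLE_PLAYBOOK:-ansible-playbook}

  # 1. Show which hosts the limit actually resolves to.
  "$ap" -i "$inventory" "$playbook" --limit "$canary" --list-hosts || return 1
  # 2. Parse the playbook without contacting any host.
  "$ap" -i "$inventory" "$playbook" --syntax-check || return 1
  # 3. Check mode with diff, restricted to the canary.
  "$ap" -i "$inventory" "$playbook" --limit "$canary" --check --diff || return 1
}

# Example:
#   rollout_preflight inventory/hosts.yml playbooks/site.yml web01 \
#     && echo "review the diff, then apply with --limit web01"
```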

Rollback mindset:

- keep before/after config copies
- know which tasks restart services
- define manual backout if package/config changes fail
- avoid broad `--limit` mistakes by reviewing resolved host list first

## Monitoring & Observability

### Zabbix Checks

```bash
systemctl status zabbix-agent2 --no-pager
zabbix_agent2 -t 'vfs.fs.size[/,free]'   # quoted so the shell does not glob the brackets
grep -i 'failed\|error' /var/log/zabbix/zabbix_agent*.log
```

### ELK Log Workflows

```bash
grep -Ei 'error|warn|exception' /var/log/app/app.log | tail -50
journalctl -u filebeat -n 100 --no-pager
curl -s 'http://localhost:9200/_cluster/health?pretty'   # quoted so "?" is not a glob
```

### Grafana Checks

```bash
curl -s -o /dev/null -w '%{http_code}\n' http://grafana:3000/login
grep -i 'error' /var/log/grafana/grafana.log | tail -50
```

### Health Endpoints and Alert Validation

```bash
curl -fsS http://app:8080/health
curl -fsS http://app:8080/metrics | head
```

False positive validation:

1. Compare alert timestamp with deploy/change window.
2. Confirm on-host evidence, not only dashboard data.
3. Check collector lag, scrape failures, and stale metrics.
4. Validate from a second source before escalating.

## Operational Habits

### Pre-checks

- capture time, hostname, and operator
- capture current config and service state
- check recent alerts, maintenance windows, and dependencies
- confirm backup or rollback path exists

### Post-checks

- validate service state
- validate logs for fresh errors
- validate client path, ports, and name resolution
- compare metrics before/after

### Rollback Thinking

- define exact backout trigger before change
- prefer reversible steps
- keep config backups with timestamps
- avoid bundling unrelated changes

### Change Validation

```bash
systemctl is-active <service>
curl -fsS http://127.0.0.1:<port>/health
ss -ltnp | grep :<port>
journalctl -u <service> -S '5 min ago' --no-pager
```
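
A service that was just restarted may need a few seconds before its health endpoint answers, so a single `curl` right after the change can fail for the wrong reason. A bounded retry avoids both false alarms and open-ended waiting. The helper below is a generic sketch; `retry` is a hypothetical name and the attempt and delay values in the example are arbitrary.

```bash
# Hypothetical helper: run a command up to <attempts> times, sleeping <delay>
# seconds between failures. Returns 0 on the first success, 1 if every
# attempt fails.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  local attempts=$1 delay=$2 n
  shift 2
  for (( n = 1; n <= attempts; n++ )); do
    if "$@"; then
      return 0
    fi
    # No sleep after the final attempt.
    if (( n < attempts )); then
      sleep "$delay"
    fi
  done
  return 1
}

# Example: retry 10 3 curl -fsS -o /dev/null http://127.0.0.1:8080/health
```

The port in the example is a placeholder; substitute the service's real health URL.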

### Operational Communication

- state scope, risk, and expected impact before action
- record start and stop times in UTC
- document what changed, what was checked, and remaining risk
- escalate with evidence, not assumptions

### Evidence Collection During Incidents

```bash
# Keep the directory name in a variable: a glob such as /tmp/incident-*/journal.txt
# cannot be used as a redirect target for a file that does not exist yet.
incident_dir=/tmp/incident-$(date -u +%Y%m%dT%H%M%SZ)
mkdir -p "$incident_dir"
journalctl -b > "$incident_dir/journal.txt"
ss -tulpen > "$incident_dir/sockets.txt"
df -hT > "$incident_dir/df.txt"
free -m > "$incident_dir/free.txt"
```