kpi: auto-update metrics 2026-03-13

RoadChain-SHA2048: c645c1292ab1555e
RoadChain-Identity: alexa@sovereign
RoadChain-Full: c645c1292ab1555ebe6982915536d1c94701ff6bb16c20ed6ef4144eb50c9f984b4bfe5b9902109e8defd958d6be43ced8ec11cf95d6241536cd4da0b75f8fb48cbeb1b9f450c8f665b73d39e837d23e73e2ba4201af4dc40c02a34283efb04b39c612083465536f194f16adfadb1b56f714a65b918f40750f54eebf7724236861de173ec31963ff3b1b988d712be7e5acc3fe391eb804d3fdcfb9ccf77afc732660d23fff801f894318327eabf775eb4f4e67f7f22d07f23b0e17f6594cfe95b83b275fb7baaa97115e86562604fc5b47cc8024574b61396924e0ee2b7e454b0a1480c3076c7ad72408ceb4a75360d2d49c7d805c37ac5315af00e4a8ca2262
2026-03-13 23:16:12 -05:00
parent 0c714c106c
commit ec7b1445b5
25 changed files with 815 additions and 1112 deletions


@@ -8,66 +8,46 @@ amundsonalexa@gmail.com | [github.com/blackboxprogramming](https://github.com/bl
## Summary
SRE managing a 7-node distributed fleet with 256 systemd services, 52 automated tasks, and self-healing autonomy. Maintains 48+ production domains, 99 Cloudflare deployments, and a daily KPI system tracking 60+ reliability metrics across 9 data sources.
Running 256 services across distributed hardware with no on-call team. Built observability from scratch, resolved 10+ production incidents solo, and automated reliability into the infrastructure itself.
---
## Experience
### BlackRoad OS | Founder & SRE Lead | 2025–Present
### BlackRoad OS | Founder & Site Reliability Engineer | 2025–Present
**Reliability & Uptime**
- Operate 5 Raspberry Pi edge nodes + 2 cloud VMs with WireGuard mesh connectivity
- Implement self-healing cron automation: heartbeat every 1 minute, heal cycle every 5 minutes
- Monitor and resolve 12 failed systemd units across fleet with automated restart policies
- Manage 48 Nginx reverse proxy sites routing traffic to backend services
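The heartbeat/heal cadence above maps naturally onto cron. A minimal sketch — the script paths are illustrative (the source does not name them), and the heal pass simply uses systemd's own failed-unit list as its restart queue:

```shell
# Crontab entries — cadence from the bullets above; paths are hypothetical.
# * * * * *    /opt/autonomy/heartbeat.sh   # record which units are up
# */5 * * * *  /opt/autonomy/heal.sh        # restart anything marked down

# Minimal heal pass: restart every unit systemd currently reports as failed.
heal_failed_units() {
  # --plain --no-legend keep the output machine-readable
  systemctl list-units --state=failed --plain --no-legend \
    | awk '{print $1}' \
    | while read -r unit; do
        systemctl restart "$unit"
      done
}
```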
**The Reality: Solo On-Call for Everything**
- One person responsible for 256 services, 48 domains, 7 nodes, 283 databases — every incident is yours
- Built a 10-collector KPI system tracking 60+ metrics daily: fleet health, service status, temperatures, swap, processes, connections
- Day-over-day delta tracking catches regressions before they become outages — automated Slack notifications on anomalies
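The day-over-day delta check can be as small as this sketch. Metric names, thresholds, and the Slack hook are assumptions; the alert is reduced to an echo where the real system would POST to a webhook:

```shell
# Day-over-day delta check — a sketch; real collectors and the Slack
# webhook are not shown in the source, so names here are illustrative.
delta() {  # delta YESTERDAY TODAY -> signed change
  echo $(( $2 - $1 ))
}

check_metric() {  # check_metric NAME YESTERDAY TODAY THRESHOLD
  local d; d=$(delta "$2" "$3")
  # Alert when the absolute change crosses the threshold
  if [ "${d#-}" -ge "$4" ]; then
    echo "ALERT $1 changed by $d"
  else
    echo "OK $1 ($d)"
  fi
}
```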
**Incident Response**
- Identified and resolved thermal throttling (73.8°C → 57.9°C) caused by runaway Ollama loops
- Fixed undervoltage issues across Pi fleet via config.txt tuning (+95 mV recovery)
- Discovered and removed obfuscated cron dropper (security incident on Cecilia)
- Resolved swap exhaustion (100% on Cecilia) by identifying memory-hungry services
- Migrated leaked credentials from plaintext crontabs to secured env files (chmod 600)
**The Incidents: Real Problems, Real Fixes**
- Node at 73.8°C — identified runaway Ollama generation loop via power monitoring, killed and disabled the service, temp dropped to 57.9°C
- Swap at 100% on Cecilia — found 4 concurrent rclone instances syncing same Google Drive, consolidated to 1, freed 2 GB swap
- Obfuscated cron dropper discovered on Cecilia, exec'ing from /tmp/op.py — removed the malware, audited all nodes, and rotated credentials fleet-wide
- Leaked GitHub PAT found in systemd service file — removed from config, rotated token, migrated all secrets to chmod 600 env files
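The check behind the swap incident (four concurrent rclone instances syncing the same drive) comes down to a process count. A sketch — the consolidation policy itself is not described in the source:

```shell
# Count running copies of a command from `ps -eo comm=` style output on
# stdin — more than one copy of a sync job like rclone is suspect.
count_procs() {  # usage: ps -eo comm= | count_procs rclone
  awk -v name="$1" '$1 == name { n++ } END { print n+0 }'
}
```

If the count exceeds 1, the extra instances can be killed and the sync consolidated to a single cron entry, as in the incident above.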
**Monitoring & Observability**
- Built 9-collector KPI system: GitHub, Gitea, fleet, services, autonomy, LOC, local, Cloudflare, deep GitHub
- Track 60+ metrics daily: commits, fleet health, temperatures, swap, processes, connections
- Distributed tracing database with nanosecond-precision spans
- Per-node SSH health probes with Python-based remote execution
- Power monitoring deployed to all nodes (cron every 5 minutes, persistent logs)
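The resume says the remote execution behind these probes is Python-based; to keep one language across these sketches, here is the same probe shape in shell, with hypothetical hostnames:

```shell
# Per-node SSH health probe — assumes key-based auth; NODES is illustrative.
NODES="alpha beta gamma"

probe() {  # probe HOST -> "HOST online|offline"
  if ssh -o BatchMode=yes -o ConnectTimeout=5 "$1" true 2>/dev/null; then
    echo "$1 online"
  else
    echo "$1 offline"
  fi
}

summarize() {  # reads "host state" lines on stdin, prints "online/total"
  awk '{ t++ } $2 == "online" { up++ } END { printf "%d/%d\n", up+0, t+0 }'
}
```

Usage would be `for n in $NODES; do probe "$n"; done | summarize`, yielding the Nodes Online figure in the metrics table.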
**Infrastructure Management**
- 14 Docker containers via Docker Swarm with leader election
- 11 PostgreSQL databases with automated backup
- 9 Tailscale mesh peers for secure cross-network access
- 4 Cloudflare tunnels routing 48+ domains to fleet services
**Capacity Planning**
- Fleet: 20 GB RAM, 707 GB storage, 52 TOPS AI compute
- Identified and disabled 16 skeleton microservices, freeing 800 MB of RAM
- Cleaned 19 GB of stale GitHub Actions runner directories
- Power optimization: conservative CPU governors, WiFi power management, GPU memory reduction
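The governor change in the power-optimization bullet is a one-line write per core; a sketch with the sysfs root parameterised so it can be dry-run against a fake tree (on a real node it defaults to /sys and requires root):

```shell
# Capacity/power pass: switch every core to the given CPU governor.
set_governors() {  # set_governors GOVERNOR [SYSFS_ROOT]
  local gov=$1 root=${2:-/sys}
  for f in "$root"/devices/system/cpu/cpu[0-9]*/cpufreq/scaling_governor; do
    [ -w "$f" ] || continue
    echo "$gov" > "$f"
  done
}
```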
**The System: Reliability as Code**
- Self-healing autonomy: heartbeat every 60s detects down services, heal cycle every 5m auto-restarts them
- Power monitoring on every node (cron */5, persistent logs) — voltage, throttle state, temperature, governor all tracked
- Distributed tracing database with nanosecond-precision spans — can trace any request across any node
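On Raspberry Pi hardware, the throttle state tracked above comes from `vcgencmd get_throttled`, whose documented bitmask can be decoded like this (only four of the bits are shown):

```shell
# Decode the Pi throttle register, e.g. input "throttled=0x50005".
# Bit 0: undervoltage now; bit 2: throttled now; bits 16/18: since boot.
decode_throttled() {
  local hex=${1#throttled=}
  local v=$((hex))
  [ $((v & 0x1)) -ne 0 ] && echo "undervoltage-now"
  [ $((v & 0x4)) -ne 0 ] && echo "throttled-now"
  [ $((v & 0x10000)) -ne 0 ] && echo "undervoltage-occurred"
  [ $((v & 0x40000)) -ne 0 ] && echo "throttling-occurred"
  return 0
}
```

A persistent log of these flags plus `/sys/class/thermal` readings is enough to spot both the 73.8°C incident and the undervoltage episodes described above.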
---
## Technical Skills
**SRE:** systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, Cloudflare Tunnels
**Monitoring:** Custom KPI collection, distributed tracing, thermal/voltage monitoring, SSH probes
**Incident Response:** Root cause analysis, credential rotation, service isolation, capacity recovery
**Languages:** Bash (212 CLI tools), Python, JavaScript
**Cloud:** Cloudflare (99 Pages, 22 D1, 46 KV, 11 R2), DigitalOcean
systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, distributed tracing, Bash, Python
---
## Metrics
| Metric | Value |
|--------|-------|
| Services managed | 256 |
| Automated tasks | 52 |
| Domains served | 48+ |
| KPI metrics tracked | 60+ |
| Fleet nodes | 7 |
| Incident resolutions | 10+ |
| Docker containers | 14 |
| Metric | Value | Source |
|--------|-------|--------|
| Systemd Services | *live* | services.sh — systemctl list-units via SSH |
| Failed Units | *live* | services.sh — systemctl --failed via SSH |
| Fleet Nodes | *live* | fleet.sh — SSH probe to all nodes |
| Nodes Online | *live* | fleet.sh — SSH probe to all nodes |
| Avg Temp (°C) | *live* | fleet.sh — /sys/class/thermal via SSH |
| Docker Containers | *live* | services.sh — docker ps via SSH |
| Nginx Sites | *live* | services.sh — /etc/nginx/sites-enabled via SSH |
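A collector like fleet.sh or services.sh presumably reduces to small parse helpers over SSH output; the two below match the table's sources, though the exact formats are assumptions:

```shell
# Helpers a collector such as fleet.sh / services.sh could be built from.
read_temp() {  # /sys/class/thermal millidegrees on stdin -> degrees C
  awk '{ printf "%.1f\n", $1 / 1000 }'
}

count_units() {  # lines of `systemctl list-units --plain --no-legend` on stdin
  grep -c .
}
```

For example, `ssh "$node" cat /sys/class/thermal/thermal_zone0/temp | read_temp` would feed the Avg Temp row, and piping `systemctl list-units --type=service --plain --no-legend` into `count_units` would feed Systemd Services.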