20 role-specific resumes with verified KPIs — BlackRoad only, no prior experience

RoadChain-SHA2048: 428ab11c02ce78d6
RoadChain-Identity: alexa@sovereign
RoadChain-Full: 428ab11c02ce78d628aa30489d9f0f3251e709352f2deacf05882435ed9f5d114fe2a1c9e75b3c831688f47cd9032c22b388f821b1b29dcac9fc9a3ad4a1b39f1210d1275f9472df606b763bb551961d1eaebfe8f2a4b9c23d3f3da3f001d916e03ff920def04c8304d8544ac916e4c50c16da942dcc830388e298b7c016b991320b30f7d3fe153aaab71ab109aea3f9dca996ac6e14ca1c0969248c8ca2767ab631c17dc86c0c2a8edd1c8965ab3ba6c92ba7cc9aa4d74406058a39d8fdec53a200371b7d1e1214a860a7ff2c53b83b09f516cec69cbe00e3556caee7f813e4a09d3f430a3a3eab5d4763f8975999c31bd77f82972ab8d7c2d7c5aedcce9442
2026-03-18 05:34:08 -05:00 · 2026-03-13 00:01:11 -05:00
parent 58d4e18634
commit d5c8667284
22 changed files with 1584 additions and 215 deletions
--- a/roles/03-site-reliability-engineer.md
+++ b/roles/03-site-reliability-engineer.md
@@ -0,0 +1,73 @@
+# Alexa Amundson
+
+**Site Reliability Engineer**
+
+amundsonalexa@gmail.com | [github.com/blackboxprogramming](https://github.com/blackboxprogramming)
+
+---
+
+## Summary
+
+SRE managing a 7-node distributed fleet with 256 systemd services, 52 automated tasks, and self-healing automation. Maintains 48+ production domains, 99 Cloudflare deployments, and a daily KPI system that tracks 60+ reliability metrics across 9 data sources.
+
+---
+
+## Experience
+
+### BlackRoad OS | Founder & SRE Lead | 2024–Present
+
+**Reliability & Uptime**
+- Operate 5 Raspberry Pi edge nodes and 2 cloud VMs connected over a WireGuard mesh
+- Implement self-healing cron automation: a heartbeat every minute and a heal cycle every 5 minutes
+- Monitor and resolve 12 failed systemd units across the fleet with automated restart policies
+- Manage 48 Nginx reverse proxy sites routing traffic to backend services
+
+**Incident Response**
+- Identified and resolved thermal throttling (73.8°C → 57.9°C) caused by runaway Ollama loops
+- Fixed undervoltage issues across the Pi fleet via config.txt tuning (+95 mV recovery)
+- Discovered and removed obfuscated cron dropper (security incident on Cecilia)
+- Resolved swap exhaustion (100% on Cecilia) by identifying memory-hungry services
+- Migrated leaked credentials from plaintext crontabs to secured env files (chmod 600)
+
+**Monitoring & Observability**
+- Built 9-collector KPI system: GitHub, Gitea, fleet, services, autonomy, LOC, local, Cloudflare, deep GitHub
+- Track 60+ metrics daily: commits, fleet health, temperatures, swap, processes, connections
+- Maintain a distributed tracing database with nanosecond-precision spans
+- Run per-node SSH health probes using Python-based remote execution
+- Deployed power monitoring to all nodes (cron every 5 minutes, persistent logs)
+
+**Infrastructure Management**
+- Run 14 Docker containers on Docker Swarm with leader election
+- Maintain 11 PostgreSQL databases with automated backups
+- Operate 9 Tailscale mesh peers for secure cross-network access
+- Route 48+ domains to fleet services through 4 Cloudflare tunnels
+
+**Capacity Planning**
+- Fleet capacity: 20 GB RAM, 707 GB storage, 52 TOPS AI compute
+- Identified and disabled 16 skeleton microservices, freeing 800 MB of RAM
+- Cleaned up 19 GB of stale GitHub Actions runner directories
+- Reduced power draw with conservative CPU governors, WiFi power management, and GPU memory reduction
+
+---
+
+## Technical Skills
+
+- **SRE:** systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, Cloudflare Tunnels
+- **Monitoring:** Custom KPI collection, distributed tracing, thermal/voltage monitoring, SSH probes
+- **Incident Response:** Root cause analysis, credential rotation, service isolation, capacity recovery
+- **Languages:** Bash (212 CLI tools), Python, JavaScript
+- **Cloud:** Cloudflare (99 Pages, 22 D1, 46 KV, 11 R2), DigitalOcean
+
+---
+
+## Metrics
+
+| Metric | Value |
+|--------|-------|
+| Services managed | 256 |
+| Automated tasks | 52 |
+| Domains served | 48+ |
+| KPI metrics tracked | 60+ |
+| Fleet nodes | 7 |
+| Incident resolutions | 10+ |
+| Docker containers | 14 |