20 role-specific resumes with verified KPIs — BlackRoad only, no prior experience

RoadChain-SHA2048: 428ab11c02ce78d6
RoadChain-Identity: alexa@sovereign
RoadChain-Full: 428ab11c02ce78d628aa30489d9f0f3251e709352f2deacf05882435ed9f5d114fe2a1c9e75b3c831688f47cd9032c22b388f821b1b29dcac9fc9a3ad4a1b39f1210d1275f9472df606b763bb551961d1eaebfe8f2a4b9c23d3f3da3f001d916e03ff920def04c8304d8544ac916e4c50c16da942dcc830388e298b7c016b991320b30f7d3fe153aaab71ab109aea3f9dca996ac6e14ca1c0969248c8ca2767ab631c17dc86c0c2a8edd1c8965ab3ba6c92ba7cc9aa4d74406058a39d8fdec53a200371b7d1e1214a860a7ff2c53b83b09f516cec69cbe00e3556caee7f813e4a09d3f430a3a3eab5d4763f8975999c31bd77f82972ab8d7c2d7c5aedcce9442
2026-03-13 00:01:11 -05:00
parent 58d4e18634
commit d5c8667284
22 changed files with 1584 additions and 215 deletions

@@ -0,0 +1,73 @@
# Alexa Amundson
**Site Reliability Engineer**
amundsonalexa@gmail.com | [github.com/blackboxprogramming](https://github.com/blackboxprogramming)
---
## Summary
SRE managing a 7-node distributed fleet with 256 systemd services, 52 automated tasks, and self-healing autonomy. Maintains 48+ production domains, 99 Cloudflare deployments, and a daily KPI system tracking 60+ reliability metrics across 9 data sources.
---
## Experience
### BlackRoad OS | Founder & SRE Lead | 2024 - Present
**Reliability & Uptime**
- Operate 5 Raspberry Pi edge nodes + 2 cloud VMs with WireGuard mesh connectivity
- Implement self-healing cron automation: heartbeat every 1 minute, heal cycle every 5 minutes
- Monitor and resolve 12 failed systemd units across the fleet with automated restart policies
- Manage 48 Nginx reverse proxy sites routing traffic to backend services
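A minimal sketch of what one heal cycle could look like, assuming it simply lists failed systemd units and restarts them. The function names `parse_failed_units` and `heal_cycle` are hypothetical and are not taken from the BlackRoad codebase; only the `systemctl` flags shown are standard.

```python
import subprocess


def parse_failed_units(listing):
    """Extract unit names from `systemctl --failed --no-legend` output."""
    units = []
    for line in listing.splitlines():
        # Some systemd versions prefix failed units with a bullet; drop it.
        fields = line.replace("●", " ").split()
        if fields:
            units.append(fields[0])  # the first column is the unit name
    return units


def heal_cycle():
    """Restart every failed unit and return the ones that restarted cleanly.

    Hypothetical heal step: the real automation may do more (or less).
    """
    listing = subprocess.run(
        ["systemctl", "--failed", "--no-legend", "--plain"],
        capture_output=True, text=True, check=False,
    ).stdout
    restarted = []
    for unit in parse_failed_units(listing):
        result = subprocess.run(["systemctl", "restart", unit], check=False)
        if result.returncode == 0:
            restarted.append(unit)
    return restarted
```

Scheduled from cron at a five-minute interval, a script like this would match the heal cadence described above; restarting units requires sufficient privileges on the node.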
**Incident Response**
- Identified and resolved thermal throttling (73.8°C → 57.9°C) caused by runaway Ollama loops
- Fixed undervoltage issues across Pi fleet via `config.txt` tuning (+95 mV recovery)
- Discovered and removed an obfuscated cron dropper (security incident on the Cecilia node)
- Resolved swap exhaustion (100% on Cecilia) by identifying memory-hungry services
- Migrated leaked credentials from plaintext crontabs to secured env files (chmod 600)
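The credential migration in the last bullet can be illustrated with a short sketch, assuming the secrets end up as `KEY=value` lines in an env file restricted to mode 600. The helper names `write_secret_env` and `is_owner_only`, and the file layout, are assumptions for illustration rather than the actual migration script.

```python
import os
import stat


def write_secret_env(path, secrets):
    """Write KEY=value lines to `path`, accessible by the owner only."""
    # Create the file with mode 0o600 up front so it is never world-readable.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        for key, value in secrets.items():
            handle.write(f"{key}={value}\n")
    # Enforce the mode even if the file already existed with looser bits.
    os.chmod(path, 0o600)


def is_owner_only(path):
    """Return True when the file's permission bits are exactly 0o600."""
    return stat.S_IMODE(os.stat(path).st_mode) == 0o600
```

A crontab entry can then load the env file at run time instead of embedding the secret in the command line.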
**Monitoring & Observability**
- Built 9-collector KPI system: GitHub, Gitea, fleet, services, autonomy, LOC, local, Cloudflare, deep GitHub
- Track 60+ metrics daily: commits, fleet health, temperatures, swap, processes, connections
- Distributed tracing database with nanosecond-precision spans
- Per-node SSH health probes with Python-based remote execution
- Power monitoring deployed to all nodes (cron every 5 minutes, persistent logs)
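As an illustration of the per-node SSH probes, here is a hedged sketch that assumes each Raspberry Pi node exposes `vcgencmd` and that the probe collects temperature and throttling flags over SSH. The function names and the returned dictionary shape are hypothetical; the bit meanings follow the documented `vcgencmd get_throttled` flags (bit 0: undervoltage now, bit 2: throttled now).

```python
import re
import subprocess


def parse_temp(output):
    """Parse `vcgencmd measure_temp` output such as "temp=57.9'C"."""
    match = re.search(r"temp=([0-9.]+)", output)
    return float(match.group(1)) if match else None


def parse_throttled(output):
    """Decode the current-state bits of `vcgencmd get_throttled` output."""
    match = re.search(r"throttled=(0x[0-9a-fA-F]+)", output)
    if not match:
        return None
    flags = int(match.group(1), 16)
    return {
        "undervoltage_now": bool(flags & 0x1),
        "throttled_now": bool(flags & 0x4),
    }


def probe(host, timeout=10):
    """Run both checks on `host` over SSH (hypothetical probe layout)."""
    command = [
        "ssh", "-o", "BatchMode=yes", "-o", f"ConnectTimeout={timeout}",
        host, "vcgencmd measure_temp; vcgencmd get_throttled",
    ]
    try:
        result = subprocess.run(
            command, capture_output=True, text=True, timeout=timeout + 5
        )
    except subprocess.TimeoutExpired:
        return {"host": host, "reachable": False}
    if result.returncode != 0:
        return {"host": host, "reachable": False}
    return {
        "host": host,
        "reachable": True,
        "temp_c": parse_temp(result.stdout),
        "power": parse_throttled(result.stdout),
    }
```

Keeping the parsing separate from the SSH call lets the decoding logic be checked without network access to the fleet.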
**Infrastructure Management**
- 14 Docker containers via Docker Swarm with leader election
- 11 PostgreSQL databases with automated backup
- 9 Tailscale mesh peers for secure cross-network access
- 4 Cloudflare tunnels routing 48+ domains to fleet services
**Capacity Planning**
- Fleet: 20 GB RAM, 707 GB storage, 52 TOPS AI compute
- Identified and disabled 16 skeleton microservices freeing 800 MB RAM
- Cleaned 19 GB of stale GitHub Actions runner directories
- Power optimization: conservative CPU governors, WiFi power management, GPU memory reduction
---
## Technical Skills
**SRE:** systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, Cloudflare Tunnels
**Monitoring:** Custom KPI collection, distributed tracing, thermal/voltage monitoring, SSH probes
**Incident Response:** Root cause analysis, credential rotation, service isolation, capacity recovery
**Languages:** Bash (212 CLI tools), Python, JavaScript
**Cloud:** Cloudflare (99 Pages, 22 D1, 46 KV, 11 R2), DigitalOcean
---
## Metrics
| Metric | Value |
|--------|-------|
| Services managed | 256 |
| Automated tasks | 52 |
| Domains served | 48+ |
| KPI metrics tracked | 60+ |
| Fleet nodes | 7 |
| Incident resolutions | 10+ |
| Docker containers | 14 |