Files
alexa-amundson-resume/roles/03-site-reliability-engineer.md
Alexa Amundson 292fa97a8e kpi: auto-update metrics 2026-03-13
RoadChain-SHA2048: 9f948f149bd9f508
RoadChain-Identity: alexa@sovereign
RoadChain-Full: 9f948f149bd9f508d25792c617d1c4049cf814c3acbb3181886684f1d89e2ab84fdb0364ce227ef1c03c0b59335e5d1aad9434f983ad375d50eca597e7daea8f9bb2a3e40116fa13de0453865ff2665fb759fc63204fe222360becc3b8c447fb1fbe7e10a440e8107745b57c643682cb2e4f7cffbb9c8c0e1bc5b03623fcbd41d0ab39740c02f148d5309591013f3d65810692706da448cf7e04b4368ef3738898fcc0f2414377cf1ff1f5897a27cfd96289c1f1875a3a93ec732453686f07621952135ae7df10cce155ebc206d3d3a3a9931fc7683d635c74b67d080fc170a8b8238a9eda91ba9193aaeb17737276b9140330cf622d656efdb3e968f46d1a24
2026-03-13 01:07:28 -05:00

74 lines
2.9 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Alexa Amundson
**Site Reliability Engineer**
amundsonalexa@gmail.com | [github.com/blackboxprogramming](https://github.com/blackboxprogramming)
---
## Summary
SRE managing a 7-node distributed fleet with 256 systemd services, 52 automated tasks, and self-healing autonomy. Maintains 48+ production domains, 99 Cloudflare deployments, and a daily KPI system tracking 60+ reliability metrics across 9 data sources.
---
## Experience
### BlackRoad OS | Founder & SRE Lead | 2025Present
**Reliability & Uptime**
- Operate 5 Raspberry Pi edge nodes + 2 cloud VMs with WireGuard mesh connectivity
- Implement self-healing cron automation: heartbeat every 1 minute, heal cycle every 5 minutes
- Monitor and resolve 12 failed systemd units across fleet with automated restart policies
- Manage 48 Nginx reverse proxy sites routing traffic to backend services
**Incident Response**
- Identified and resolved thermal throttling (73.8°C → 57.9°C) caused by runaway Ollama loops
- Fixed undervoltage issues across Pi fleet via config.txt tuning (+95mV recovery)
- Discovered and removed obfuscated cron dropper (security incident on Cecilia)
- Resolved swap exhaustion (100% on Cecilia) by identifying memory-hungry services
- Migrated leaked credentials from plaintext crontabs to secured env files (chmod 600)
**Monitoring & Observability**
- Built 9-collector KPI system: GitHub, Gitea, fleet, services, autonomy, LOC, local, Cloudflare, deep GitHub
- Track 60+ metrics daily: commits, fleet health, temperatures, swap, processes, connections
- Distributed tracing database with nanosecond-precision spans
- Per-node SSH health probes with Python-based remote execution
- Power monitoring deployed to all nodes (cron every 5 minutes, persistent logs)
**Infrastructure Management**
- 14 Docker containers via Docker Swarm with leader election
- 11 PostgreSQL databases with automated backup
- 9 Tailscale mesh peers for secure cross-network access
- 4 Cloudflare tunnels routing 48+ domains to fleet services
**Capacity Planning**
- Fleet: 20 GB RAM, 707 GB storage, 52 TOPS AI compute
- Identified and disabled 16 skeleton microservices freeing 800 MB RAM
- Cleaned 19 GB of stale GitHub Actions runner directories
- Power optimization: conservative CPU governors, WiFi power management, GPU memory reduction
---
## Technical Skills
**SRE:** systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, Cloudflare Tunnels
**Monitoring:** Custom KPI collection, distributed tracing, thermal/voltage monitoring, SSH probes
**Incident Response:** Root cause analysis, credential rotation, service isolation, capacity recovery
**Languages:** Bash (212 CLI tools), Python, JavaScript
**Cloud:** Cloudflare (99 Pages, 22 D1, 46 KV, 11 R2), DigitalOcean
---
## Metrics
| Metric | Value |
|--------|-------|
| Services managed | 256 |
| Automated tasks | 52 |
| Domains served | 48+ |
| KPI metrics tracked | 60+ |
| Fleet nodes | 7 |
| Incident resolutions | 10+ |
| Docker containers | 14 |