Files
alexa-amundson-resume/roles/03-site-reliability-engineer.md
Alexa Amundson ec7b1445b5 kpi: auto-update metrics 2026-03-13
RoadChain-SHA2048: c645c1292ab1555e
RoadChain-Identity: alexa@sovereign
RoadChain-Full: c645c1292ab1555ebe6982915536d1c94701ff6bb16c20ed6ef4144eb50c9f984b4bfe5b9902109e8defd958d6be43ced8ec11cf95d6241536cd4da0b75f8fb48cbeb1b9f450c8f665b73d39e837d23e73e2ba4201af4dc40c02a34283efb04b39c612083465536f194f16adfadb1b56f714a65b918f40750f54eebf7724236861de173ec31963ff3b1b988d712be7e5acc3fe391eb804d3fdcfb9ccf77afc732660d23fff801f894318327eabf775eb4f4e67f7f22d07f23b0e17f6594cfe95b83b275fb7baaa97115e86562604fc5b47cc8024574b61396924e0ee2b7e454b0a1480c3076c7ad72408ceb4a75360d2d49c7d805c37ac5315af00e4a8ca2262
2026-03-13 23:16:12 -05:00

54 lines
2.5 KiB
Markdown
Raw Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Alexa Amundson
**Site Reliability Engineer**
amundsonalexa@gmail.com | [github.com/blackboxprogramming](https://github.com/blackboxprogramming)
---
## Summary
Running 256 services across distributed hardware with no on-call team. Built observability from scratch, resolved 10+ production incidents solo, and automated reliability into the infrastructure itself.
---
## Experience
### BlackRoad OS | Founder & Site Reliability Engineer | 2025Present
**The Reality: Solo On-Call for Everything**
- One person responsible for 256 services, 48 domains, 7 nodes, 283 databases — every incident is yours
- Built a 10-collector KPI system tracking 60+ metrics daily: fleet health, service status, temperatures, swap, processes, connections
- Day-over-day delta tracking catches regressions before they become outages — automated Slack notifications on anomalies
**The Incidents: Real Problems, Real Fixes**
- Node at 73.8°C — identified runaway Ollama generation loop via power monitoring, killed and disabled the service, temp dropped to 57.9°C
- Swap at 100% on Cecilia — found 4 concurrent rclone instances syncing same Google Drive, consolidated to 1, freed 2 GB swap
- Obfuscated cron dropper discovered on Cecilia — exec'ing from /tmp/op.py. Removed the malware, audited all nodes, rotated credentials fleet-wide
- Leaked GitHub PAT found in systemd service file — removed from config, rotated token, migrated all secrets to chmod 600 env files
**The System: Reliability as Code**
- Self-healing autonomy: heartbeat every 60s detects down services, heal cycle every 5m auto-restarts them
- Power monitoring on every node (cron */5, persistent logs) — voltage, throttle state, temperature, governor all tracked
- Distributed tracing database with nanosecond-precision spans — can trace any request across any node
---
## Technical Skills
systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, distributed tracing, Bash, Python
---
## Metrics
| Metric | Value | Source |
|--------|-------|--------|
| Systemd Services | *live* | services.sh — systemctl list-units via SSH |
| Failed Units | *live* | services.sh — systemctl --failed via SSH |
| Fleet Nodes | *live* | fleet.sh — SSH probe to all nodes |
| Nodes Online | *live* | fleet.sh — SSH probe to all nodes |
| Avg Temp | *live* | fleet.sh — /sys/class/thermal via SSH |
| Docker Containers | *live* | services.sh — docker ps via SSH |
| Nginx Sites | *live* | services.sh — /etc/nginx/sites-enabled via SSH |