alexa-amundson-resume/roles/03-site-reliability-engineer.md
Alexa Amundson 292fa97a8e kpi: auto-update metrics 2026-03-13
2026-03-13 01:07:28 -05:00


Alexa Amundson

Site Reliability Engineer

amundsonalexa@gmail.com | github.com/blackboxprogramming


Summary

SRE managing a 7-node distributed fleet with 256 systemd services, 52 automated tasks, and self-healing automation. Maintains 48+ production domains, 99 Cloudflare deployments, and a daily KPI system tracking 60+ reliability metrics across 9 data sources.


Experience

BlackRoad OS | Founder & SRE Lead | 2025-Present

Reliability & Uptime

  • Operate 5 Raspberry Pi edge nodes + 2 cloud VMs with WireGuard mesh connectivity
  • Implement self-healing cron automation: heartbeat every 1 minute, heal cycle every 5 minutes
  • Monitor and resolve 12 failed systemd units across fleet with automated restart policies
  • Manage 48 Nginx reverse proxy sites routing traffic to backend services

Incident Response

  • Identified and resolved thermal throttling (73.8°C → 57.9°C) caused by runaway Ollama loops
  • Fixed undervoltage issues across Pi fleet via config.txt tuning (+95mV recovery)
  • Discovered and removed obfuscated cron dropper (security incident on Cecilia)
  • Resolved swap exhaustion (100% on Cecilia) by identifying memory-hungry services
  • Migrated leaked credentials from plaintext crontabs to secured env files (chmod 600)
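As an illustration of the last item, here is a minimal sketch of writing secrets to an owner-only env file. The function name `write_env_file` and the example path are hypothetical and not taken from the actual fleet; the only detail from above is the `chmod 600` permission.

```python
import os


def write_env_file(path: str, secrets: dict[str, str]) -> None:
    """Write KEY=value lines to `path` with owner-only (0600) permissions."""
    # Create the file with mode 0600 so it is never world-readable, even briefly.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        # If the file already existed with looser permissions, tighten them.
        os.fchmod(handle.fileno(), 0o600)
        for key, value in secrets.items():
            handle.write(f"{key}={value}\n")


if __name__ == "__main__":
    # Hypothetical path and placeholder value, for illustration only.
    write_env_file("/tmp/example.env", {"API_TOKEN": "placeholder"})
```

A cron job would then read the file at run time instead of carrying the secret inline in the crontab.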

Monitoring & Observability

  • Built 9-collector KPI system: GitHub, Gitea, fleet, services, autonomy, LOC, local, Cloudflare, deep GitHub
  • Track 60+ metrics daily: commits, fleet health, temperatures, swap, processes, connections
  • Distributed tracing database with nanosecond-precision spans
  • Per-node SSH health probes with Python-based remote execution
  • Power monitoring deployed to all nodes (cron every 5 minutes, persistent logs)

Infrastructure Management

  • 14 Docker containers via Docker Swarm with leader election
  • 11 PostgreSQL databases with automated backup
  • 9 Tailscale mesh peers for secure cross-network access
  • 4 Cloudflare tunnels routing 48+ domains to fleet services

Capacity Planning

  • Fleet: 20 GB RAM, 707 GB storage, 52 TOPS AI compute
  • Identified and disabled 16 skeleton microservices freeing 800 MB RAM
  • Cleaned 19 GB of stale GitHub Actions runner directories
  • Power optimization: conservative CPU governors, WiFi power management, GPU memory reduction

Technical Skills

SRE: systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, Cloudflare Tunnels

Monitoring: Custom KPI collection, distributed tracing, thermal/voltage monitoring, SSH probes

Incident Response: Root cause analysis, credential rotation, service isolation, capacity recovery

Languages: Bash (212 CLI tools), Python, JavaScript

Cloud: Cloudflare (99 Pages, 22 D1, 46 KV, 11 R2), DigitalOcean


Metrics

| Metric | Value |
|---|---|
| Services managed | 256 |
| Automated tasks | 52 |
| Domains served | 48+ |
| KPI metrics tracked | 60+ |
| Fleet nodes | 7 |
| Incident resolutions | 10+ |
| Docker containers | 14 |