alexa-amundson-resume/roles/03-site-reliability-engineer.md
Alexa Amundson d5c8667284 20 role-specific resumes with verified KPIs — BlackRoad only, no prior experience
2026-03-13 00:01:11 -05:00


Alexa Amundson

Site Reliability Engineer

amundsonalexa@gmail.com | github.com/blackboxprogramming


Summary

SRE managing a 7-node distributed fleet with 256 systemd services, 52 automated tasks, and self-healing autonomy. Maintains 48+ production domains, 99 Cloudflare deployments, and a daily KPI system tracking 60+ reliability metrics across 9 data sources.


Experience

BlackRoad OS | Founder & SRE Lead | 2024 - Present

Reliability & Uptime

  • Operate 5 Raspberry Pi edge nodes + 2 cloud VMs with WireGuard mesh connectivity
  • Implement self-healing cron automation: heartbeat every 1 minute, heal cycle every 5 minutes
  • Monitor systemd units across the fleet and resolve failures (12 failed units) with automated restart policies
  • Manage 48 Nginx reverse proxy sites routing traffic to backend services

Incident Response

  • Identified and resolved thermal throttling (73.8°C → 57.9°C) caused by runaway Ollama loops
  • Fixed undervoltage issues across the Pi fleet via config.txt tuning (+95 mV recovery)
  • Discovered and removed obfuscated cron dropper (security incident on Cecilia)
  • Resolved swap exhaustion (100% on Cecilia) by identifying memory-hungry services
  • Migrated leaked credentials from plaintext crontabs to secured env files (chmod 600)

Monitoring & Observability

  • Built 9-collector KPI system: GitHub, Gitea, fleet, services, autonomy, LOC, local, Cloudflare, deep GitHub
  • Track 60+ metrics daily: commits, fleet health, temperatures, swap, processes, connections
  • Distributed tracing database with nanosecond-precision spans
  • Per-node SSH health probes with Python-based remote execution
  • Power monitoring deployed to all nodes (cron every 5 minutes, persistent logs)

Infrastructure Management

  • 14 Docker containers via Docker Swarm with leader election
  • 11 PostgreSQL databases with automated backup
  • 9 Tailscale mesh peers for secure cross-network access
  • 4 Cloudflare tunnels routing 48+ domains to fleet services

Capacity Planning

  • Fleet: 20 GB RAM, 707 GB storage, 52 TOPS AI compute
  • Identified and disabled 16 skeleton microservices freeing 800 MB RAM
  • Cleaned 19 GB of stale GitHub Actions runner directories
  • Power optimization: conservative CPU governors, WiFi power management, GPU memory reduction

Technical Skills

SRE: systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, Cloudflare Tunnels
Monitoring: Custom KPI collection, distributed tracing, thermal/voltage monitoring, SSH probes
Incident Response: Root cause analysis, credential rotation, service isolation, capacity recovery
Languages: Bash (212 CLI tools), Python, JavaScript
Cloud: Cloudflare (99 Pages, 22 D1, 46 KV, 11 R2), DigitalOcean


Metrics

| Metric | Value |
|---|---|
| Services managed | 256 |
| Automated tasks | 52 |
| Domains served | 48+ |
| KPI metrics tracked | 60+ |
| Fleet nodes | 7 |
| Incident resolutions | 10+ |
| Docker containers | 14 |