# Alexa Amundson

**Site Reliability Engineer**

amundsonalexa@gmail.com | [github.com/blackboxprogramming](https://github.com/blackboxprogramming)

---

## Summary

SRE managing a 7-node distributed fleet with 256 systemd services, 52 automated tasks, and self-healing autonomy. Maintains 48+ production domains, 99 Cloudflare deployments, and a daily KPI system tracking 60+ reliability metrics across 9 data sources.

---

## Experience

### BlackRoad OS | Founder & SRE Lead | 2025–Present

**Reliability & Uptime**

- Operate 5 Raspberry Pi edge nodes + 2 cloud VMs with WireGuard mesh connectivity
- Implement self-healing cron automation: heartbeat every 1 minute, heal cycle every 5 minutes
- Monitor and resolve 12 failed systemd units across fleet with automated restart policies
- Manage 48 Nginx reverse proxy sites routing traffic to backend services

**Incident Response**

- Identified and resolved thermal throttling (73.8°C → 57.9°C) caused by runaway Ollama loops
- Fixed undervoltage issues across Pi fleet via config.txt tuning (+95mV recovery)
- Discovered and removed obfuscated cron dropper (security incident on Cecilia)
- Resolved swap exhaustion (100% on Cecilia) by identifying memory-hungry services
- Migrated leaked credentials from plaintext crontabs to secured env files (chmod 600)

**Monitoring & Observability**

- Built 9-collector KPI system: GitHub, Gitea, fleet, services, autonomy, LOC, local, Cloudflare, deep GitHub
- Track 60+ metrics daily: commits, fleet health, temperatures, swap, processes, connections
- Distributed tracing database with nanosecond-precision spans
- Per-node SSH health probes with Python-based remote execution
- Power monitoring deployed to all nodes (cron every 5 minutes, persistent logs)

**Infrastructure Management**

- 14 Docker containers via Docker Swarm with leader election
- 11 PostgreSQL databases with automated backup
- 9 Tailscale mesh peers for secure cross-network access
- 4 Cloudflare tunnels routing 48+ domains to fleet services

**Capacity Planning**

- Fleet: 20 GB RAM, 707 GB storage, 52 TOPS AI compute
- Identified and disabled 16 skeleton microservices freeing 800 MB RAM
- Cleaned 19 GB of stale GitHub Actions runner directories
- Power optimization: conservative CPU governors, WiFi power management, GPU memory reduction

---

## Technical Skills

**SRE:** systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, Cloudflare Tunnels

**Monitoring:** Custom KPI collection, distributed tracing, thermal/voltage monitoring, SSH probes

**Incident Response:** Root cause analysis, credential rotation, service isolation, capacity recovery

**Languages:** Bash (212 CLI tools), Python, JavaScript

**Cloud:** Cloudflare (99 Pages, 22 D1, 46 KV, 11 R2), DigitalOcean

---

## Metrics

| Metric | Value |
|--------|-------|
| Services managed | 256 |
| Automated tasks | 52 |
| Domains served | 48+ |
| KPI metrics tracked | 60+ |
| Fleet nodes | 7 |
| Incident resolutions | 10+ |
| Docker containers | 14 |