mirror of https://github.com/blackboxprogramming/alexa-amundson-resume.git synced 2026-03-18 02:03:58 -05:00

Alexa Amundson ec7b1445b5 kpi: auto-update metrics 2026-03-13

2026-03-13 23:16:12 -05:00

Alexa Amundson

Site Reliability Engineer

amundsonalexa@gmail.com | github.com/blackboxprogramming

Summary

Running 256 services across distributed hardware with no on-call team. Built observability from scratch, resolved 10+ production incidents solo, and automated reliability into the infrastructure itself.

Experience

BlackRoad OS | Founder & Site Reliability Engineer | 2025–Present

The Reality: Solo On-Call for Everything

One person responsible for 256 services, 48 domains, 7 nodes, 283 databases — every incident is yours
Built a 10-collector KPI system tracking 60+ metrics daily: fleet health, service status, temperatures, swap, processes, connections
Day-over-day delta tracking catches regressions before they become outages — automated Slack notifications on anomalies
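
The day-over-day delta tracking described above could be sketched roughly as follows. This is a minimal illustration, not the actual collector code: the function name `day_over_day_deltas`, the metric names, and the 20% threshold are all assumptions made for the example.

```python
def day_over_day_deltas(today, yesterday, threshold_pct=20.0):
    """Return metrics that moved more than threshold_pct since yesterday.

    Both arguments are dicts of metric name -> numeric value (hypothetical shape).
    The result maps metric name -> (previous, current, change_pct).
    """
    anomalies = {}
    for name, current in today.items():
        previous = yesterday.get(name)
        if previous is None:
            continue  # new metric, no baseline to compare against yet
        if previous == 0:
            # Percentage change is undefined from zero; flag any move away from it.
            if current != 0:
                anomalies[name] = (previous, current, None)
            continue
        change_pct = (current - previous) / abs(previous) * 100.0
        if abs(change_pct) > threshold_pct:
            anomalies[name] = (previous, current, round(change_pct, 1))
    return anomalies
```

Anything returned by this function would be a candidate for a notification; how the alert is actually delivered is outside the scope of the sketch.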

The Incidents: Real Problems, Real Fixes

Node at 73.8°C: identified a runaway Ollama generation loop via power monitoring, killed and disabled the service, and the temperature dropped to 57.9°C
Swap at 100% on Cecilia: found 4 concurrent rclone instances syncing the same Google Drive, consolidated to 1, freed 2 GB of swap
Obfuscated cron dropper discovered on Cecilia, executing /tmp/op.py: removed the malware, audited all nodes, rotated credentials fleet-wide
Leaked GitHub PAT found in a systemd service file: removed it from the config, rotated the token, migrated all secrets to env files with mode 600 (chmod 600)
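
As an illustration of the last fix, a follow-up audit for env files that are not mode 600 might look like the sketch below. The function name `insecure_env_files` and the idea of passing in an explicit list of paths are assumptions for the example; the source does not say how the audit was actually done.

```python
import os
import stat


def insecure_env_files(paths):
    """Return (path, mode) pairs for files whose permission bits are not exactly 0o600."""
    findings = []
    for path in paths:
        # S_IMODE strips the file-type bits and keeps only the permission bits.
        mode = stat.S_IMODE(os.stat(path).st_mode)
        if mode != 0o600:
            findings.append((path, oct(mode)))
    return findings
```

An empty result means every listed file is readable and writable by its owner only.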

The System: Reliability as Code

Self-healing autonomy: heartbeat every 60s detects down services, heal cycle every 5m auto-restarts them
Power monitoring on every node (cron */5, persistent logs) — voltage, throttle state, temperature, governor all tracked
Distributed tracing database with nanosecond-precision spans — can trace any request across any node
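
The heal cycle above can be sketched in a few lines. This is a hedged illustration rather than the real implementation: `heal_cycle`, `is_active`, and `restart` are hypothetical names, and the sketch assumes services are systemd units checked with `systemctl is-active --quiet` and restarted with `systemctl restart`. Scheduling (60s heartbeat, 5m heal) is left to cron or a timer.

```python
import subprocess


def is_active(unit):
    """True when `systemctl is-active --quiet <unit>` exits with status 0."""
    return subprocess.run(["systemctl", "is-active", "--quiet", unit]).returncode == 0


def restart(unit):
    """Attempt a restart; True when systemctl reports success."""
    return subprocess.run(["systemctl", "restart", unit]).returncode == 0


def heal_cycle(units, check=is_active, fix=restart):
    """Restart every unit that fails its check.

    Returns {unit: restart_succeeded} for the units that were down.
    The check and fix callables are injectable so the logic can be tested
    without touching systemd.
    """
    return {unit: fix(unit) for unit in units if not check(unit)}
```

Keeping the check and the fix as parameters means the same loop could be pointed at an SSH-based probe instead of local `systemctl` calls.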

Technical Skills

systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, distributed tracing, Bash, Python

Metrics

Metric	Value	Source
Systemd Services	live	services.sh — systemctl list-units via SSH
Failed Units	live	services.sh — systemctl --failed via SSH
Fleet Nodes	live	fleet.sh — SSH probe to all nodes
Nodes Online	live	fleet.sh — SSH probe to all nodes
Avg Temp	live	fleet.sh — /sys/class/thermal via SSH
Docker Containers	live	services.sh — docker ps via SSH
Nginx Sites	live	services.sh — /etc/nginx/sites-enabled via SSH
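
For the Avg Temp row, one way to turn raw thermal readings into a fleet average is sketched below. It assumes each node returns the contents of a sysfs `temp` file, which Linux reports in millidegrees Celsius; the function name `average_temp_c` is hypothetical and the actual logic in fleet.sh is not shown in the source.

```python
def average_temp_c(raw_readings):
    """Convert millidegree-Celsius strings to °C and return their mean.

    Blank readings (for example from an unreachable node) are skipped.
    Returns None when no usable reading is available.
    """
    values = [int(r.strip()) / 1000.0 for r in raw_readings if r.strip()]
    if not values:
        return None
    return round(sum(values) / len(values), 1)
```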
