# Alexa Amundson

**Site Reliability Engineer**

amundsonalexa@gmail.com | [github.com/blackboxprogramming](https://github.com/blackboxprogramming)

---

## Summary

SRE managing a 7-node distributed fleet with 256 systemd services, 52 automated tasks, and self-healing automation. Maintains 48+ production domains, 99 Cloudflare deployments, and a daily KPI system that tracks 60+ reliability metrics across 9 data sources.

---

## Experience

### BlackRoad OS | Founder & SRE Lead | 2024–Present

**Reliability & Uptime**
- Operate 5 Raspberry Pi edge nodes and 2 cloud VMs connected by a WireGuard mesh
- Implement self-healing cron automation: a heartbeat every minute and a heal cycle every 5 minutes
- Monitor and resolve 12 failed systemd units across the fleet with automated restart policies
- Manage 48 Nginx reverse proxy sites that route traffic to backend services

**Incident Response**
- Identified and resolved thermal throttling (73.8°C → 57.9°C) caused by runaway Ollama loops
- Fixed undervoltage issues across the Pi fleet via config.txt tuning (+95 mV recovery)
- Discovered and removed an obfuscated cron dropper (security incident on Cecilia)
- Resolved swap exhaustion (100% on Cecilia) by identifying memory-hungry services
- Moved leaked credentials out of plaintext crontabs into secured env files (chmod 600)

**Monitoring & Observability**
- Built a 9-collector KPI system: GitHub, Gitea, fleet, services, autonomy, LOC, local, Cloudflare, deep GitHub
- Track 60+ metrics daily: commits, fleet health, temperatures, swap, processes, connections
- Distributed tracing database with nanosecond-precision spans
- Per-node SSH health probes with Python-based remote execution
- Power monitoring deployed to all nodes (cron every 5 minutes, persistent logs)

**Infrastructure Management**
- 14 Docker containers running on Docker Swarm with leader election
- 11 PostgreSQL databases with automated backups
- 9 Tailscale mesh peers for secure cross-network access
- 4 Cloudflare tunnels routing 48+ domains to fleet services

**Capacity Planning**
- Fleet: 20 GB RAM, 707 GB storage, 52 TOPS of AI compute
- Identified and disabled 16 skeleton microservices, freeing 800 MB of RAM
- Cleaned up 19 GB of stale GitHub Actions runner directories
- Power optimization: conservative CPU governors, WiFi power management, reduced GPU memory allocation

---

## Technical Skills
- **SRE:** systemd, cron, Nginx, Docker Swarm, WireGuard, Tailscale, Cloudflare Tunnels
- **Monitoring:** Custom KPI collection, distributed tracing, thermal/voltage monitoring, SSH probes
- **Incident Response:** Root cause analysis, credential rotation, service isolation, capacity recovery
- **Languages:** Bash (212 CLI tools), Python, JavaScript
- **Cloud:** Cloudflare (99 Pages, 22 D1 databases, 46 KV namespaces, 11 R2 buckets), DigitalOcean

---

## Metrics

| Metric | Value |
|--------|-------|
| Services managed | 256 |
| Automated tasks | 52 |
| Domains served | 48+ |
| KPI metrics tracked | 60+ |
| Fleet nodes | 7 |
| Incident resolutions | 10+ |
| Docker containers | 14 |