| id | title | slug | description | tags | status |
|---|---|---|---|---|---|
| runbooks-incident-playbook | Incident Response Playbook | /runbooks/incident-playbook | Step-by-step incident response procedures | | stable |
# Incident Response Playbook
This playbook provides step-by-step procedures for responding to incidents in BlackRoad OS.
## Severity Levels
| Level | Description | Response Time | Examples |
|---|---|---|---|
| SEV1 | Critical - System down | < 15 min | Complete outage, data loss |
| SEV2 | High - Major degradation | < 1 hour | API errors >50%, slow response |
| SEV3 | Medium - Partial impact | < 4 hours | Single service degraded |
| SEV4 | Low - Minor issue | < 24 hours | UI glitch, non-critical bug |
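For scripts that open tickets or page responders, the table above can be encoded as a lookup. This is an illustrative sketch; `response_sla` is not part of any existing tooling.

```bash
#!/usr/bin/env bash
# Map a severity level to its target response time from the table above.
response_sla() {
  case "$1" in
    SEV1) echo "15 minutes" ;;
    SEV2) echo "1 hour" ;;
    SEV3) echo "4 hours" ;;
    SEV4) echo "24 hours" ;;
    *)    echo "unknown severity: $1" >&2; return 1 ;;
  esac
}

response_sla SEV2   # prints "1 hour"
```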
## Incident Response Process

### 1. Detection and Alert
When an incident is detected:
- ✅ Acknowledge the alert
- ✅ Create incident tracking issue/ticket
- ✅ Determine severity level
- ✅ Notify relevant stakeholders
**Communication Channels:**
- GitHub Issues: For tracking
- Slack/Discord: For real-time coordination (if available)
- Status page: For user communication
### 2. Initial Assessment
**Gather information (5-10 minutes):**
- What is broken?
- Since when?
- What changed recently?
- What is the user impact?
- What services are affected?
**Check:**
- Prism Console - System health
- Railway logs - Service logs
- GitHub Actions - Recent deployments
### 3. Containment

For SEV1/SEV2 incidents:

**Option A: Rollback** (if caused by a recent deployment)

```bash
# Via Railway dashboard or CLI
railway rollback --service=api
```
**Option B: Disable the Failing Component**

```bash
# Scale down the problematic service temporarily
railway scale --service=operator --replicas=0
```
**Option C: Enable Maintenance Mode**
- Return 503 status from API
- Display maintenance page on Web
### 4. Investigation

Common investigation steps:

**Check Logs:**

```bash
# Via Railway CLI
railway logs --service=api --tail=100

# Check for errors
railway logs --service=api | grep ERROR
```
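It often helps to quantify what the logs show rather than eyeball them. A hedged sketch that computes an error rate from a captured log dump; the sample lines and the literal `ERROR` token are assumptions, so adjust the pattern to the actual log format:

```bash
#!/usr/bin/env bash
# Compute an error rate from a log dump, assuming one line per request
# and a literal ERROR token on failures.
logfile=$(mktemp)
cat > "$logfile" <<'EOF'
2024-01-01T00:00:01 INFO  GET /health 200
2024-01-01T00:00:02 ERROR GET /jobs 500
2024-01-01T00:00:03 INFO  GET /health 200
2024-01-01T00:00:04 ERROR POST /jobs 500
EOF

total=$(wc -l < "$logfile")
errors=$(grep -c ERROR "$logfile")
rate=$(awk -v e="$errors" -v t="$total" 'BEGIN { printf "%.1f", 100 * e / t }')
echo "error rate: ${rate}%"   # 2 of 4 lines -> 50.0%
rm -f "$logfile"
```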
**Check Database:**

```sql
-- Check database connection
SELECT 1;

-- Check recent errors
SELECT * FROM error_logs
ORDER BY created_at DESC
LIMIT 100;
```
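The generic checks above can be extended with engine-specific views. A hedged sketch assuming the database is Postgres (`pg_stat_activity` is Postgres-specific):

```sql
-- Hedged sketch, assuming Postgres.
-- How many connections are open?
SELECT count(*) AS connections FROM pg_stat_activity;

-- Which queries have been running longer than 30 seconds?
SELECT pid, now() - query_start AS runtime, state, left(query, 80) AS query
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '30 seconds'
ORDER BY runtime DESC;
```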
**Check Job Queue:**

```bash
# Connect to Redis
redis-cli

# Inside redis-cli, check queue depths
LLEN bullmq:jobs:waiting
LLEN bullmq:jobs:active
LLEN bullmq:jobs:failed
```
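The three `LLEN` calls above can be wrapped in one helper. `redis_cmd` and `queue_depths` are illustrative names, and the `bullmq:jobs:*` key pattern is taken from the commands above; verify it matches the actual queue prefix before relying on it.

```bash
#!/usr/bin/env bash
# Report the depth of each BullMQ queue. redis_cmd wraps redis-cli so
# the transport can be swapped (e.g. to point at a remote host).
redis_cmd() { redis-cli "$@"; }

queue_depths() {
  for q in waiting active failed; do
    echo "$q: $(redis_cmd LLEN "bullmq:jobs:$q")"
  done
}
```

Running `queue_depths` during an incident prints one `name: depth` line per queue.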
**Check Service Health:**

```bash
# Test health endpoints
curl https://api.blackroad.dev/health
curl https://api.blackroad.dev/ready
```
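During an incident a single probe can flap, so retries with a timeout are more informative. A sketch, with the endpoint URL taken from above and `HEALTH_CMD` injectable so the probe can be stubbed:

```bash
#!/usr/bin/env bash
# Poll a health endpoint until it responds or retries run out.
HEALTH_CMD=${HEALTH_CMD:-"curl -fsS --max-time 5 https://api.blackroad.dev/health"}

wait_healthy() {
  local retries=${1:-5} delay=${2:-3}
  for ((i = 1; i <= retries; i++)); do
    if $HEALTH_CMD > /dev/null 2>&1; then
      echo "healthy after $i attempt(s)"
      return 0
    fi
    sleep "$delay"
  done
  echo "still unhealthy after $retries attempts" >&2
  return 1
}
```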
### 5. Resolution

Apply a fix based on the investigation:

**Code Fix:**
- Create hotfix branch
- Make minimal fix
- Test locally
- Deploy to staging
- Deploy to production
- Verify fix
**Configuration Fix:**

```bash
# Update an environment variable via Railway
railway variables set KEY=value --service=api

# Restart the service
railway restart --service=api
```

**Database Fix:**

```sql
-- Apply a migration or data fix
-- Always back up first!
```
**Infrastructure Fix:**
- Adjust scaling
- Modify resource limits
- Update networking config
### 6. Verification

Confirm the resolution:
- ✅ Service health checks passing
- ✅ Error rates back to normal
- ✅ User reports confirm fix
- ✅ Metrics show recovery
- ✅ No new errors in logs
Monitor for 30+ minutes to ensure stability.
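The stability watch can be semi-automated with a probe loop. `monitor` is an illustrative helper, with `CHECK_CMD` injectable so it can be stubbed; the commented invocation shows roughly 30 minutes of coverage.

```bash
#!/usr/bin/env bash
# Probe the health endpoint repeatedly and count failures.
CHECK_CMD=${CHECK_CMD:-"curl -fsS --max-time 5 https://api.blackroad.dev/health"}

monitor() {
  local probes=${1:-60} interval=${2:-30} failures=0
  for ((i = 0; i < probes; i++)); do
    $CHECK_CMD > /dev/null 2>&1 || failures=$((failures + 1))
    sleep "$interval"
  done
  echo "$failures failures in $probes probes"
  [ "$failures" -eq 0 ]
}

# monitor 60 30   # one probe every 30s for ~30 minutes
```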
### 7. Communication

Update stakeholders:
- ✅ Post resolution update
- ✅ Close incident ticket
- ✅ Update status page
- ✅ Thank responders
### 8. Post-Incident Review

Within 48 hours, document:
- **Timeline:** When things happened
- **Root Cause:** Why it happened
- **Impact:** Who was affected
- **Resolution:** How it was fixed
- **Action Items:** How to prevent recurrence
Template:

```markdown
# Incident Post-Mortem: [YYYY-MM-DD] [Brief Title]

## Summary

Brief overview of what happened.

## Timeline (UTC)

- HH:MM - Incident began
- HH:MM - Alert triggered
- HH:MM - Investigation started
- HH:MM - Fix deployed
- HH:MM - Verified resolved

## Root Cause

Technical explanation of why it happened.

## Impact

- Users affected: X
- Duration: X minutes
- Services impacted: API, Operator

## Resolution

What we did to fix it.

## Action Items

- [ ] Add monitoring for X
- [ ] Improve Y process
- [ ] Document Z
```
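The template lends itself to scaffolding so responders don't start from a blank file. A sketch that stamps a dated file; the `new_postmortem` name and the `postmortem-DATE.md` filename convention are illustrative.

```bash
#!/usr/bin/env bash
# Scaffold a post-mortem file from the template above.
new_postmortem() {
  local title="$1" date file
  date=$(date -u +%Y-%m-%d)
  file="postmortem-${date}.md"
  cat > "$file" <<EOF
# Incident Post-Mortem: [$date] [$title]

## Summary
## Timeline (UTC)
## Root Cause
## Impact
## Resolution
## Action Items
EOF
  echo "$file"
}
```

For example, `new_postmortem "API outage"` creates the stub and prints its filename.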
## Common Incident Scenarios

### API Service Down

**Symptoms:**

- Health checks failing
- 500 errors
- Connection timeouts

**Quick Checks:**

- Database connectivity
- Environment variables
- Recent deployments
- Resource limits

**Common Fixes:**

- Restart service
- Rollback deployment
- Scale up resources
- Fix database connection

See Service: API for details.
### Operator Jobs Stuck

**Symptoms:**

- Jobs not processing
- Queue growing
- Workers idle

**Quick Checks:**

- Redis connectivity
- Worker processes running
- Job errors in logs
- Queue depths

**Common Fixes:**

- Restart Operator service
- Clear failed jobs
- Scale up workers
- Fix job timeouts

See Service: Operator for details.
### Database Issues

**Symptoms:**

- Query timeouts
- Connection pool exhausted
- Slow responses

**Quick Checks:**

- Active connections
- Slow queries
- Database size
- Resource usage

**Common Fixes:**

- Restart service connections
- Kill long-running queries
- Increase connection pool
- Optimize slow queries
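Killing long-running queries can be done from SQL directly. A hedged sketch, again assuming Postgres; this is destructive, so review the pid list with `pg_stat_activity` before running it:

```sql
-- Hedged sketch, assuming Postgres: terminate queries running over 5 minutes.
SELECT pg_terminate_backend(pid)
FROM pg_stat_activity
WHERE state <> 'idle'
  AND now() - query_start > interval '5 minutes'
  AND pid <> pg_backend_pid();
```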
### High Error Rates

**Symptoms:**

- Errors >5% of requests
- Multiple error types
- Degraded performance

**Quick Checks:**

- Error logs
- Recent changes
- External dependencies
- Resource usage

**Common Fixes:**

- Identify error source
- Fix or rollback code
- Add error handling
- Scale resources
## Escalation

When to escalate:
- Incident not resolved in expected time
- Need additional expertise
- SEV1 lasting >1 hour
- Unclear root cause
**Escalation Path:**

1. Team lead / senior engineer
2. Infrastructure team
3. External support (Railway, etc.)
## Tools and Resources

**Monitoring:**
- Prism Console
- Railway Dashboard
- Cloudflare Analytics
**Logs:**
- Railway Logs
- Application logs
- Database logs
**Runbooks:**
- Deploy API (planned)
- Debug Operator (planned)
- Rollback Procedures (planned)
**Documentation:**