---
id: runbooks-incident-playbook
title: "Incident Response Playbook"
slug: /runbooks/incident-playbook
description: "Step-by-step incident response procedures"
tags: ["runbooks", "incidents", "operations"]
status: stable
---

# Incident Response Playbook

This playbook provides step-by-step procedures for responding to incidents in BlackRoad OS.

## Severity Levels

| **Level** | **Description** | **Response Time** | **Examples** |
|-----------|-----------------|-------------------|--------------|
| **SEV1** | Critical - System down | < 15 min | Complete outage, data loss |
| **SEV2** | High - Major degradation | < 1 hour | API errors >50%, slow response |
| **SEV3** | Medium - Partial impact | < 4 hours | Single service degraded |
| **SEV4** | Low - Minor issue | < 24 hours | UI glitch, non-critical bug |

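For alert routing or reminder scripts, the response-time column above can be encoded as a lookup. This is a minimal sketch: the function name `response_deadline_minutes` is hypothetical and not part of BlackRoad OS, and the values are simply the table's response times converted to minutes.

```bash
# Map a severity level to its response-time target in minutes.
# Values mirror the Severity Levels table; the function name is illustrative.
response_deadline_minutes() {
  case "$1" in
    SEV1) echo 15 ;;     # < 15 min
    SEV2) echo 60 ;;     # < 1 hour
    SEV3) echo 240 ;;    # < 4 hours
    SEV4) echo 1440 ;;   # < 24 hours
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_deadline_minutes SEV2` prints `60`, and an unrecognized level returns a non-zero status so a calling script can stop early.
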
## Incident Response Process

### 1. Detection and Alert

**When an incident is detected:**

1. ✅ Acknowledge the alert
2. ✅ Create incident tracking issue/ticket
3. ✅ Determine severity level
4. ✅ Notify relevant stakeholders

**Communication Channels:**
- GitHub Issues: For tracking
- Slack/Discord: For real-time coordination (if available)
- Status page: For user communication

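Creating the tracking issue can be scripted so it is not skipped under pressure. The sketch below assumes an authenticated GitHub CLI (`gh`) run from inside the repository; the `incident` label, the title format, and both function names are hypothetical choices, not an established project convention.

```bash
# Build a consistent issue title such as "[SEV1] API outage".
incident_title() {
  # $1 = severity level, $2 = short description
  printf '[%s] %s\n' "$1" "$2"
}

# Open the tracking issue with the GitHub CLI.
# Assumption: a label named "incident" already exists in the repository.
open_incident_issue() {
  gh issue create \
    --title "$(incident_title "$1" "$2")" \
    --label "incident" \
    --body "Severity: $1
Detected (UTC): $(date -u '+%Y-%m-%d %H:%M')
Impact: TBD"
}
```

A call like `open_incident_issue SEV2 "API error rate above 50%"` would then cover steps 2 and 3 in a single command.
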
### 2. Initial Assessment

**Gather information (5-10 minutes):**

1. What is broken?
2. Since when?
3. What changed recently?
4. What is the user impact?
5. What services are affected?

**Check:**
- [Prism Console](../ops/PRISM_CONSOLE.md) - System health
- Railway logs - Service logs
- GitHub Actions - Recent deployments

### 3. Containment

**For SEV1/SEV2 incidents:**

**Option A: Rollback (if recent deployment)**
```bash
# Via Railway dashboard or CLI
# (subcommands and flags vary by Railway CLI version; confirm with `railway --help`)
railway rollback --service=api
```

**Option B: Disable Failing Component**
```bash
# Scale down problematic service temporarily
railway scale --service=operator --replicas=0
```

**Option C: Enable Maintenance Mode**
- Return 503 status from API
- Display maintenance page on Web

### 4. Investigation

**Common investigation steps:**

**Check Logs:**
```bash
# Via Railway CLI
railway logs --service=api --tail=100

# Check for errors
railway logs --service=api | grep ERROR
```

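When the raw `grep` output is too noisy, grouping identical error messages shows which failure dominates. This sketch assumes each log line has the shape `<timestamp> ERROR <message>`; that format and the function name `summarize_errors` are assumptions, so adjust the `sed` pattern to match the actual log layout.

```bash
# Read log lines on stdin and print the most frequent ERROR messages with counts.
# $1 = how many distinct messages to show (default 10).
summarize_errors() {
  sed -n 's/.*ERROR[: ]*//p' |   # keep only ERROR lines, drop everything up to the level
    sort |
    uniq -c |                    # count identical messages
    sort -rn |                   # most frequent first
    head -n "${1:-10}"
}
```

Because `sort` waits for end of input, feed it a bounded sample, for example a saved file: `summarize_errors 5 < api.log`.
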
**Check Database:**
```sql
-- Check database connection
SELECT 1;

-- Check recent errors
SELECT * FROM error_logs
ORDER BY created_at DESC
LIMIT 100;
```

**Check Job Queue:**
```bash
# Connect to Redis
redis-cli

# Check queue depth (run at the redis-cli prompt)
# BullMQ stores waiting and active jobs as lists and failed jobs as a sorted set
LLEN bullmq:jobs:wait
LLEN bullmq:jobs:active
ZCARD bullmq:jobs:failed
```

**Check Service Health:**
```bash
# Test health endpoints
curl https://api.blackroad.dev/health
curl https://api.blackroad.dev/ready
```

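A plain `curl` prints the response body but does not make the pass or fail result obvious. The sketch below wraps the same endpoints so the exit status reflects the HTTP status code; the function names `http_ok` and `check_endpoint` and the 10-second timeout are illustrative assumptions.

```bash
# Succeed only for 2xx status codes.
http_ok() {
  case "$1" in
    2??) return 0 ;;
    *) return 1 ;;
  esac
}

# Print "<url> -> <status>" and return non-zero unless the status is 2xx.
# curl reports 000 when the connection fails or times out.
check_endpoint() {
  code=$(curl --silent --output /dev/null --max-time 10 \
    --write-out '%{http_code}' "$1")
  echo "$1 -> $code"
  http_ok "$code"
}
```

Chaining the two checks, `check_endpoint https://api.blackroad.dev/health && check_endpoint https://api.blackroad.dev/ready`, stops at the first failing endpoint.
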
### 5. Resolution

**Apply fix based on investigation:**

**Code Fix:**
1. Create hotfix branch
2. Make minimal fix
3. Test locally
4. Deploy to staging
5. Deploy to production
6. Verify fix

**Configuration Fix:**
```bash
# Update environment variable via Railway
railway variables set KEY=value --service=api

# Restart service
railway restart --service=api
```

**Database Fix:**
```sql
-- Apply migration or data fix
-- Always backup first!
```

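The "backup first" step could look like the sketch below. It assumes a PostgreSQL database reachable through a `DATABASE_URL` environment variable and `pg_dump` installed locally; the playbook does not state which database engine is in use, so treat the engine, the variable name, and both function names as assumptions.

```bash
# Build a timestamped backup file name, e.g. backup-pre-fix-20240101T000000Z.sql
backup_name() {
  # $1 = label, $2 = UTC timestamp
  printf 'backup-%s-%s.sql\n' "$1" "$2"
}

# Dump the database to a file before applying any fix.
# Assumptions: PostgreSQL, pg_dump on PATH, DATABASE_URL set to a connection URI.
backup_database() {
  out=$(backup_name "$1" "$(date -u '+%Y%m%dT%H%M%SZ')")
  pg_dump "$DATABASE_URL" > "$out" && echo "wrote $out"
}
```

Running `backup_database pre-fix` before the migration leaves a dump that can be restored if the data fix goes wrong.
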
**Infrastructure Fix:**
- Adjust scaling
- Modify resource limits
- Update networking config

### 6. Verification

**Confirm resolution:**

1. ✅ Service health checks passing
2. ✅ Error rates back to normal
3. ✅ User reports confirm fix
4. ✅ Metrics show recovery
5. ✅ No new errors in logs

**Monitor for 30+ minutes** to ensure stability.

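The monitoring window can be automated with a small polling loop. This is a sketch under stated assumptions: `monitor_stable` is a hypothetical helper, and the check command passed to it is whatever signal the team trusts (a health endpoint, an error-rate query, and so on).

```bash
# Run a check command repeatedly and fail on the first failure.
# $1 = number of checks, $2 = seconds between checks, remaining args = check command
monitor_stable() {
  checks=$1
  interval=$2
  shift 2
  i=1
  while [ "$i" -le "$checks" ]; do
    if ! "$@"; then
      echo "check $i of $checks failed" >&2
      return 1
    fi
    i=$((i + 1))
    # Sleep between checks, but not after the last one
    if [ "$i" -le "$checks" ]; then
      sleep "$interval"
    fi
  done
  echo "stable for $checks consecutive checks"
}
```

For example, `monitor_stable 31 60 curl --fail --silent --output /dev/null https://api.blackroad.dev/health` runs 31 checks one minute apart, which spans the 30-minute window.
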
### 7. Communication

**Update stakeholders:**

1. ✅ Post resolution update
2. ✅ Close incident ticket
3. ✅ Update status page
4. ✅ Thank responders

### 8. Post-Incident Review

**Within 48 hours, document:**

1. **Timeline:** When things happened
2. **Root Cause:** Why it happened
3. **Impact:** Who was affected
4. **Resolution:** How it was fixed
5. **Action Items:** How to prevent recurrence

**Template:**
```md
# Incident Post-Mortem: [YYYY-MM-DD] [Brief Title]

## Summary
Brief overview of what happened.

## Timeline (UTC)
- HH:MM - Incident began
- HH:MM - Alert triggered
- HH:MM - Investigation started
- HH:MM - Fix deployed
- HH:MM - Verified resolved

## Root Cause
Technical explanation of why it happened.

## Impact
- Users affected: X
- Duration: X minutes
- Services impacted: API, Operator

## Resolution
What we did to fix it.

## Action Items
- [ ] Add monitoring for X
- [ ] Improve Y process
- [ ] Document Z
```

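The "Duration: X minutes" field in the template follows from the first and last timeline entries. The helper below is a sketch for incidents shorter than 24 hours; the function names are hypothetical, and times are expected in the template's `HH:MM` UTC format.

```bash
# Convert HH:MM to minutes since midnight.
to_minutes() {
  h=${1%%:*}
  m=${1##*:}
  # Strip one leading zero so values like 08 are not parsed as octal
  h=${h#0}
  m=${m#0}
  echo $(( ${h:-0} * 60 + ${m:-0} ))
}

# Minutes between two HH:MM times; an end time earlier than the start
# is treated as falling on the next day.
duration_minutes() {
  start=$(to_minutes "$1")
  end=$(to_minutes "$2")
  diff=$(( end - start ))
  if [ "$diff" -lt 0 ]; then
    diff=$(( diff + 1440 ))
  fi
  echo "$diff"
}
```

With "Incident began" at 14:05 and "Verified resolved" at 15:20, `duration_minutes 14:05 15:20` prints `75`.
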
## Common Incident Scenarios

### API Service Down

**Symptoms:**
- Health checks failing
- 500 errors
- Connection timeouts

**Quick Checks:**
1. Database connectivity
2. Environment variables
3. Recent deployments
4. Resource limits

**Common Fixes:**
- Restart service
- Rollback deployment
- Scale up resources
- Fix database connection

See [Service: API](../services/service-api.md) for details.

---

### Operator Jobs Stuck

**Symptoms:**
- Jobs not processing
- Queue growing
- Workers idle

**Quick Checks:**
1. Redis connectivity
2. Worker processes running
3. Job errors in logs
4. Queue depths

**Common Fixes:**
- Restart Operator service
- Clear failed jobs
- Scale up workers
- Fix job timeouts

See [Service: Operator](../services/service-operator.md) for details.

---

### Database Issues

**Symptoms:**
- Query timeouts
- Connection pool exhausted
- Slow responses

**Quick Checks:**
1. Active connections
2. Slow queries
3. Database size
4. Resource usage

**Common Fixes:**
- Restart service connections
- Kill long-running queries
- Increase connection pool
- Optimize slow queries

---

### High Error Rates

**Symptoms:**
- Errors >5% of requests
- Multiple error types
- Degraded performance

**Quick Checks:**
1. Error logs
2. Recent changes
3. External dependencies
4. Resource usage

**Common Fixes:**
- Identify error source
- Fix or rollback code
- Add error handling
- Scale resources

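The "errors >5% of requests" symptom can be checked from raw request counts without floating-point math. This is a minimal sketch; `error_rate_exceeds` is a hypothetical helper, and the counts would come from whatever metrics source is available.

```bash
# Succeed when failed/total is strictly greater than the threshold percentage.
# $1 = failed requests, $2 = total requests, $3 = threshold percent (default 5)
error_rate_exceeds() {
  errors=$1
  total=$2
  threshold=${3:-5}
  # No traffic means no measurable error rate
  [ "$total" -gt 0 ] || return 1
  # Compare errors/total > threshold/100 using integers only
  [ $(( errors * 100 )) -gt $(( total * threshold )) ]
}
```

For instance, `error_rate_exceeds 6 100` succeeds while `error_rate_exceeds 5 100` does not, and passing `50` as the third argument applies the SEV2 threshold from the severity table.
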
## Escalation

**When to escalate:**
- Incident not resolved in expected time
- Need additional expertise
- SEV1 lasting >1 hour
- Unclear root cause

**Escalation Path:**
1. Team lead / Senior engineer
2. Infrastructure team
3. External support (Railway, etc.)

## Tools and Resources

**Monitoring:**
- [Prism Console](../ops/PRISM_CONSOLE.md)
- Railway Dashboard
- Cloudflare Analytics

**Logs:**
- Railway Logs
- Application logs
- Database logs

**Runbooks:**
- [Deploy API](deploy-api.md) _(planned)_
- [Debug Operator](debug-operator.md) _(planned)_
- [Rollback Procedures](rollback-procedures.md) _(planned)_

**Documentation:**
- [Service: API](../services/service-api.md)
- [Service: Operator](../services/service-operator.md)
- [Infra Guide](../ops/INFRA_GUIDE.md)

## See Also

- [Incidents and Incident Response](../ops/incidents-and-incident-response.md)
- [Operator Runtime](../ops/OPERATOR_RUNTIME.md)
- [Prism Console](../ops/PRISM_CONSOLE.md)