blackroad-os-docs/docs/runbooks/incident-playbook.md

---
id: runbooks-incident-playbook
title: "Incident Response Playbook"
slug: /runbooks/incident-playbook
description: "Step-by-step incident response procedures"
tags: ["runbooks", "incidents", "operations"]
status: stable
---

# Incident Response Playbook

This playbook provides step-by-step procedures for responding to incidents in BlackRoad OS.

## Severity Levels

| **Level** | **Description** | **Response Time** | **Examples** |
|-----------|-----------------|-------------------|--------------|
| **SEV1** | Critical - System down | < 15 min | Complete outage, data loss |
| **SEV2** | High - Major degradation | < 1 hour | API errors >50%, slow response |
| **SEV3** | Medium - Partial impact | < 4 hours | Single service degraded |
| **SEV4** | Low - Minor issue | < 24 hours | UI glitch, non-critical bug |

## Incident Response Process

### 1. Detection and Alert

**When an incident is detected:**

1. ✅ Acknowledge the alert
2. ✅ Create incident tracking issue/ticket
3. ✅ Determine severity level
4. ✅ Notify relevant stakeholders

**Communication Channels:**
- GitHub Issues: For tracking
- Slack/Discord: For real-time coordination (if available)
- Status page: For user communication

### 2. Initial Assessment

**Gather information (5-10 minutes):**

1. What is broken?
2. Since when?
3. What changed recently?
4. What is the user impact?
5. What services are affected?

**Check:**
- [Prism Console](../ops/PRISM_CONSOLE.md) - System health
- Railway logs - Service logs
- GitHub Actions - Recent deployments

### 3. Containment

**For SEV1/SEV2 incidents:**

**Option A: Rollback (if recent deployment)**
```bash
# Via Railway dashboard or CLI
railway rollback --service=api
```

**Option B: Disable Failing Component**
```bash
# Scale down problematic service temporarily
railway scale --service=operator --replicas=0
```

**Option C: Enable Maintenance Mode**
- Return 503 status from API
- Display maintenance page on Web

### 4. Investigation

**Common investigation steps:**

**Check Logs:**
```bash
# Via Railway CLI
railway logs --service=api --tail=100

# Check for errors
railway logs --service=api | grep ERROR
```

**Check Database:**
```sql
-- Check database connection
SELECT 1;

-- Check recent errors
SELECT * FROM error_logs
ORDER BY created_at DESC
LIMIT 100;
```

**Check Job Queue:**
```bash
# Connect to Redis
redis-cli

# Check queue depth
LLEN bullmq:jobs:waiting
LLEN bullmq:jobs:active
LLEN bullmq:jobs:failed
```

**Check Service Health:**
```bash
# Test health endpoints
curl https://api.blackroad.dev/health
curl https://api.blackroad.dev/ready
```

### 5. Resolution

**Apply fix based on investigation:**

**Code Fix:**
1. Create hotfix branch
2. Make minimal fix
3. Test locally
4. Deploy to staging
5. Deploy to production
6. Verify fix

**Configuration Fix:**
```bash
# Update environment variable via Railway
railway variables set KEY=value --service=api

# Restart service
railway restart --service=api
```

**Database Fix:**
```sql
-- Apply migration or data fix
-- Always backup first!
```

**Infrastructure Fix:**
- Adjust scaling
- Modify resource limits
- Update networking config

### 6. Verification

**Confirm resolution:**

1. ✅ Service health checks passing
2. ✅ Error rates back to normal
3. ✅ User reports confirm fix
4. ✅ Metrics show recovery
5. ✅ No new errors in logs

**Monitor for 30+ minutes** to ensure stability.

### 7. Communication

**Update stakeholders:**

1. ✅ Post resolution update
2. ✅ Close incident ticket
3. ✅ Update status page
4. ✅ Thank responders

### 8. Post-Incident Review

**Within 48 hours, document:**

1. **Timeline:** When things happened
2. **Root Cause:** Why it happened
3. **Impact:** Who was affected
4. **Resolution:** How it was fixed
5. **Action Items:** How to prevent recurrence

**Template:**
```md
# Incident Post-Mortem: [YYYY-MM-DD] [Brief Title]

## Summary
Brief overview of what happened.

## Timeline (UTC)
- HH:MM - Incident began
- HH:MM - Alert triggered
- HH:MM - Investigation started
- HH:MM - Fix deployed
- HH:MM - Verified resolved

## Root Cause
Technical explanation of why it happened.

## Impact
- Users affected: X
- Duration: X minutes
- Services impacted: API, Operator

## Resolution
What we did to fix it.

## Action Items
- [ ] Add monitoring for X
- [ ] Improve Y process
- [ ] Document Z
```

## Common Incident Scenarios

### API Service Down

**Symptoms:**
- Health checks failing
- 500 errors
- Connection timeouts

**Quick Checks:**
1. Database connectivity
2. Environment variables
3. Recent deployments
4. Resource limits

**Common Fixes:**
- Restart service
- Rollback deployment
- Scale up resources
- Fix database connection

See [Service: API](../services/service-api.md) for details.

---

### Operator Jobs Stuck

**Symptoms:**
- Jobs not processing
- Queue growing
- Workers idle

**Quick Checks:**
1. Redis connectivity
2. Worker processes running
3. Job errors in logs
4. Queue depths

**Common Fixes:**
- Restart Operator service
- Clear failed jobs
- Scale up workers
- Fix job timeouts

See [Service: Operator](../services/service-operator.md) for details.

---

### Database Issues

**Symptoms:**
- Query timeouts
- Connection pool exhausted
- Slow responses

**Quick Checks:**
1. Active connections
2. Slow queries
3. Database size
4. Resource usage

**Common Fixes:**
- Restart service connections
- Kill long-running queries
- Increase connection pool
- Optimize slow queries

---

### High Error Rates

**Symptoms:**
- Errors >5% of requests
- Multiple error types
- Degraded performance

**Quick Checks:**
1. Error logs
2. Recent changes
3. External dependencies
4. Resource usage

**Common Fixes:**
- Identify error source
- Fix or rollback code
- Add error handling
- Scale resources

## Escalation

**When to escalate:**
- Incident not resolved in expected time
- Need additional expertise
- SEV1 lasting >1 hour
- Unclear root cause

**Escalation Path:**
1. Team lead / Senior engineer
2. Infrastructure team
3. External support (Railway, etc.)

## Tools and Resources

**Monitoring:**
- [Prism Console](../ops/PRISM_CONSOLE.md)
- Railway Dashboard
- Cloudflare Analytics

**Logs:**
- Railway Logs
- Application logs
- Database logs

**Runbooks:**
- [Deploy API](runbooks/deploy-api.md) _(planned)_
- [Debug Operator](runbooks/debug-operator.md) _(planned)_
- [Rollback Procedures](runbooks/rollback-procedures.md) _(planned)_

**Documentation:**
- [Service: API](../services/service-api.md)
- [Service: Operator](../services/service-operator.md)
- [Infra Guide](../ops/INFRA_GUIDE.md)

## See Also

- [Incidents and Incident Response](../ops/incidents-and-incident-response.md)
- [Operator Runtime](../ops/OPERATOR_RUNTIME.md)
- [Prism Console](../ops/PRISM_CONSOLE.md)