---
id: runbooks-incident-playbook
title: Incident Response Playbook
slug: /runbooks/incident-playbook
description: Step-by-step incident response procedures
tags:
  - runbooks
  - incidents
  - operations
status: stable
---

# Incident Response Playbook

This playbook provides step-by-step procedures for responding to incidents in BlackRoad OS.

## Severity Levels

| Level | Description | Response Time | Examples |
| ----- | ----------- | ------------- | -------- |
| SEV1 | Critical - system down | < 15 min | Complete outage, data loss |
| SEV2 | High - major degradation | < 1 hour | API errors >50%, slow responses |
| SEV3 | Medium - partial impact | < 4 hours | Single service degraded |
| SEV4 | Low - minor issue | < 24 hours | UI glitch, non-critical bug |
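For alerting or paging scripts, the targets in the table can be encoded in a small lookup. This is a minimal sketch; the function name is illustrative and the minute values simply mirror the table above.

```bash
#!/usr/bin/env bash
# Map a severity level to its response-time target in minutes.
# Values mirror the severity table; adjust if your SLAs change.
response_target_minutes() {
  case "$1" in
    SEV1) echo 15 ;;    # Critical - system down
    SEV2) echo 60 ;;    # High - major degradation
    SEV3) echo 240 ;;   # Medium - partial impact
    SEV4) echo 1440 ;;  # Low - minor issue
    *) echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```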

## Incident Response Process

### 1. Detection and Alert

When an incident is detected:

1. Acknowledge the alert
2. Create an incident tracking issue/ticket
3. Determine the severity level
4. Notify relevant stakeholders

**Communication channels:**

- GitHub Issues: for tracking
- Slack/Discord: for real-time coordination (if available)
- Status page: for user communication
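If GitHub Issues is the tracking channel, the first two steps can be partly scripted. A sketch, assuming an authenticated GitHub CLI (`gh`) and an `incident` label in the repository — both are assumptions, not requirements of this playbook:

```bash
# Compose a consistent incident title: "[SEV2] <summary> (YYYY-MM-DD)".
# The naming scheme is illustrative; adapt it to your team's conventions.
incident_title() {
  local severity="$1" summary="$2"
  echo "[${severity}] ${summary} ($(date -u +%Y-%m-%d))"
}

# Example (requires an authenticated GitHub CLI and an "incident" label):
# gh issue create --title "$(incident_title SEV2 'API error rate spike')" \
#   --label incident --body "Severity: SEV2. Investigation in progress."
```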

### 2. Initial Assessment

Gather information (5-10 minutes):

1. What is broken?
2. Since when?
3. What changed recently?
4. What is the user impact?
5. Which services are affected?

**Check:**

- Prism Console - system health
- Railway logs - service logs
- GitHub Actions - recent deployments

### 3. Containment

For SEV1/SEV2 incidents:

**Option A: Rollback (if recent deployment)**

```bash
# Via the Railway dashboard or CLI
railway rollback --service=api
```

**Option B: Disable the failing component**

```bash
# Scale down the problematic service temporarily
railway scale --service=operator --replicas=0
```

**Option C: Enable maintenance mode**

- Return a 503 status from the API
- Display a maintenance page on the Web frontend
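One lightweight way to implement the 503 behavior is an environment-variable gate the API consults on each request. The `MAINTENANCE_MODE` variable name is an assumption; a sketch of the decision logic, which could be toggled with `railway variables set MAINTENANCE_MODE=true --service=api`:

```bash
# Decide which HTTP status the API should serve based on MAINTENANCE_MODE.
# The variable name is illustrative; wire it into your API's request handling.
maintenance_status() {
  if [ "${MAINTENANCE_MODE:-false}" = "true" ]; then
    echo 503  # maintenance mode: reject requests with Service Unavailable
  else
    echo 200  # normal operation
  fi
}
```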

### 4. Investigation

Common investigation steps:

**Check logs:**

```bash
# Via the Railway CLI
railway logs --service=api --tail=100

# Filter for errors
railway logs --service=api | grep ERROR
```

**Check the database:**

```sql
-- Verify the database connection
SELECT 1;

-- Check recent errors
SELECT * FROM error_logs
ORDER BY created_at DESC
LIMIT 100;
```

**Check the job queue:**

```bash
# Connect to Redis
redis-cli

# Check queue depths
LLEN bullmq:jobs:waiting
LLEN bullmq:jobs:active
LLEN bullmq:jobs:failed
```
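The three `LLEN` calls can be wrapped in a single helper for quick one-shot checks. The queue key names mirror the block above; `REDIS_CMD` is an illustrative hook so the helper can point at `redis-cli`, a wrapper, or a test stub:

```bash
# Print the depth of each BullMQ queue key, one per line.
# REDIS_CMD defaults to redis-cli; override it to use a wrapper or stub.
REDIS_CMD="${REDIS_CMD:-redis-cli}"

queue_depths() {
  local key
  for key in bullmq:jobs:waiting bullmq:jobs:active bullmq:jobs:failed; do
    printf '%s %s\n' "$key" "$($REDIS_CMD LLEN "$key")"
  done
}
```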

**Check service health:**

```bash
# Test health endpoints
curl https://api.blackroad.dev/health
curl https://api.blackroad.dev/ready
```

### 5. Resolution

Apply the fix indicated by the investigation:

**Code fix:**

1. Create a hotfix branch
2. Make the minimal fix
3. Test locally
4. Deploy to staging
5. Deploy to production
6. Verify the fix
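The branch-creation step benefits from a consistent naming scheme. A sketch, where the `hotfix/<slug>` convention and the example branch name are assumptions rather than an established team standard:

```bash
# Derive a hotfix branch name from a short description, e.g.
# "API Timeout" -> "hotfix/api-timeout". The naming scheme is illustrative.
hotfix_branch() {
  echo "hotfix/$(echo "$1" | tr 'A-Z ' 'a-z-')"
}

# Typical flow (shown for reference, not executed here):
#   git checkout -b "$(hotfix_branch 'API Timeout')"
#   git commit -am "Fix API timeout handling"
#   git push origin "$(hotfix_branch 'API Timeout')"
```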

**Configuration fix:**

```bash
# Update an environment variable via Railway
railway variables set KEY=value --service=api

# Restart the service
railway restart --service=api
```

**Database fix:**

```sql
-- Apply a migration or data fix
-- Always back up first!
```
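One way to honor the "back up first" rule, assuming a Postgres database with `pg_dump` available and `DATABASE_URL` set in the service environment (both assumptions — adjust for your database):

```bash
# Generate a timestamped backup file name so snapshots never collide,
# e.g. backup-20250101T120000Z.sql. The naming pattern is illustrative.
backup_name() {
  echo "backup-$(date -u +%Y%m%dT%H%M%SZ).sql"
}

# Take the snapshot before applying any manual data fix:
# pg_dump "$DATABASE_URL" > "$(backup_name)"
```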

**Infrastructure fix:**

- Adjust scaling
- Modify resource limits
- Update networking configuration

### 6. Verification

Confirm the resolution:

1. Service health checks passing
2. Error rates back to normal
3. User reports confirm the fix
4. Metrics show recovery
5. No new errors in the logs

Monitor for at least 30 minutes to ensure stability.
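The stability watch can be automated as a loop that requires several consecutive healthy checks before declaring recovery. A sketch where the check command is injectable (so any endpoint or probe can be used) and the `POLL_INTERVAL` variable is an assumption:

```bash
# Poll a health check until it passes `required` consecutive times.
# Any failure resets the streak. Usage example:
#   watch_recovery 30 curl -fsS https://api.blackroad.dev/health
watch_recovery() {
  local required="$1"; shift
  local ok=0
  while [ "$ok" -lt "$required" ]; do
    if "$@" >/dev/null 2>&1; then
      ok=$((ok + 1))
    else
      ok=0  # a single failure resets the healthy streak
    fi
    sleep "${POLL_INTERVAL:-60}"
  done
  echo "stable: $required consecutive healthy checks"
}
```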

### 7. Communication

Update stakeholders:

  1. Post resolution update
  2. Close incident ticket
  3. Update status page
  4. Thank responders

### 8. Post-Incident Review

Within 48 hours, document:

  1. Timeline: When things happened
  2. Root Cause: Why it happened
  3. Impact: Who was affected
  4. Resolution: How it was fixed
  5. Action Items: How to prevent recurrence

**Template:**

```markdown
# Incident Post-Mortem: [YYYY-MM-DD] [Brief Title]

## Summary
Brief overview of what happened.

## Timeline (UTC)
- HH:MM - Incident began
- HH:MM - Alert triggered
- HH:MM - Investigation started
- HH:MM - Fix deployed
- HH:MM - Verified resolved

## Root Cause
Technical explanation of why it happened.

## Impact
- Users affected: X
- Duration: X minutes
- Services impacted: API, Operator

## Resolution
What we did to fix it.

## Action Items
- [ ] Add monitoring for X
- [ ] Improve Y process
- [ ] Document Z
```

## Common Incident Scenarios

### API Service Down

**Symptoms:**

- Health checks failing
- 500 errors
- Connection timeouts

**Quick checks:**

  1. Database connectivity
  2. Environment variables
  3. Recent deployments
  4. Resource limits

**Common fixes:**

- Restart the service
- Roll back the deployment
- Scale up resources
- Fix the database connection

See Service: API for details.


### Operator Jobs Stuck

**Symptoms:**

- Jobs not processing
- Queue growing
- Workers idle

**Quick checks:**

  1. Redis connectivity
  2. Worker processes running
  3. Job errors in logs
  4. Queue depths

**Common fixes:**

- Restart the Operator service
- Clear failed jobs
- Scale up workers
- Fix job timeouts

See Service: Operator for details.


### Database Issues

**Symptoms:**

- Query timeouts
- Connection pool exhausted
- Slow responses

**Quick checks:**

  1. Active connections
  2. Slow queries
  3. Database size
  4. Resource usage
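The first two checks can be run directly against the database. A sketch assuming Postgres (so `pg_stat_activity` exists), `psql` on the PATH, and `DATABASE_URL` in the environment — all assumptions:

```bash
# Diagnostic queries for the quick checks above (Postgres-specific).
ACTIVE_CONNECTIONS_SQL="SELECT count(*) FROM pg_stat_activity;"

SLOW_QUERIES_SQL="SELECT pid, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state = 'active'
ORDER BY runtime DESC
LIMIT 10;"

# Run them against the service database:
# psql "$DATABASE_URL" -c "$ACTIVE_CONNECTIONS_SQL"
# psql "$DATABASE_URL" -c "$SLOW_QUERIES_SQL"
```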

**Common fixes:**

- Restart service connections
- Kill long-running queries
- Increase the connection pool size
- Optimize slow queries

### High Error Rates

**Symptoms:**

- Errors on >5% of requests
- Multiple error types
- Degraded performance

**Quick checks:**

  1. Error logs
  2. Recent changes
  3. External dependencies
  4. Resource usage

**Common fixes:**

- Identify the error source
- Fix or roll back the code
- Add error handling
- Scale resources

## Escalation

**When to escalate:**

- Incident not resolved in the expected time
- Additional expertise is needed
- SEV1 lasting more than 1 hour
- Root cause is unclear

**Escalation path:**

  1. Team lead / Senior engineer
  2. Infrastructure team
  3. External support (Railway, etc.)

## Tools and Resources

**Monitoring:**

**Logs:**

- Railway logs
- Application logs
- Database logs

**Runbooks:**

**Documentation:**

## See Also