Files
blackroad-operating-system/services/aiops/README.md
Alexa Louise 9644737ba7 feat: Add domain architecture and extract core services from Prism Console
## Domain Architecture
- Complete domain-to-service mapping for 16 verified domains
- Subdomain architecture for blackroad.systems and blackroad.io
- GitHub organization mapping (BlackRoad-OS repos)
- Railway service-to-domain configuration
- DNS configuration templates for Cloudflare

## Extracted Services

### AIops Service (services/aiops/)
- Canary analysis for deployment validation
- Config drift detection
- Event correlation engine
- Auto-remediation with runbook mapping
- SLO budget management

### Analytics Service (services/analytics/)
- Rule-based anomaly detection with safe expression evaluation
- Cohort analysis with multi-metric aggregation
- Decision engine with credit budget constraints
- Narrative report generation

### Codex Governance (services/codex/)
- 82+ governance principles (entries)
- Codex Pantheon with 48+ agent archetypes
- Manifesto defining ethical framework

## Integration Points
- AIops → infra.blackroad.systems (blackroad-os-infra)
- Analytics → core.blackroad.systems (blackroad-os-core)
- Codex → operator.blackroad.systems (blackroad-os-operator)

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
2025-11-29 13:39:08 -06:00

95 lines
2.4 KiB
Markdown

# BlackRoad OS - AIops Service
> AI-powered operations and automated incident response
## Overview
The AIops service provides intelligent operational monitoring, automated remediation, and SLO budget management for BlackRoad OS infrastructure.
## Features
### Canary Analysis (`canary.py`)
- Compare metric snapshots between baseline and canary deployments
- Configurable thresholds for latency (p50, p95) and error rates
- Automatic pass/fail determination
- Artifact generation for audit trails
### Config Drift Detection (`config_drift.py`)
- Detect configuration changes across environments
- Severity-based alerting (critical/warning)
- Baseline comparison for compliance
### Event Correlation (`correlation.py`)
- Rule-based event correlation engine
- Multi-source event integration (incidents, healthchecks, changes, anomalies)
- Time-window based pattern matching
- Root cause identification
### Auto-Remediation (`remediation.py`)
- Runbook-based automated responses
- Maintenance window enforcement
- Dry-run execution support
- Execution blocking for safety
### SLO Budget Management (`slo_budget.py`)
- Error budget calculation and tracking
- Budget state management (ok/warn/burning)
- Alert generation for budget exhaustion
## Configuration
Config files are expected at:
```
configs/aiops/
├── canary.yaml # Canary analysis thresholds
├── correlation.yaml # Correlation rules
└── maintenance.yaml # Maintenance windows
```
## Usage
```python
from services.aiops import canary, remediation, slo_budget
# Run canary analysis
result = canary.analyze(base_path, canary_path)
# Check SLO budget
status = slo_budget.budget_status("api-gateway", "7d")
# Plan remediation
plan = remediation.plan(correlations)
```
## Integration
### Railway Deployment
- Service: `blackroad-os-infra`
- Domain: `infra.blackroad.systems`
- Health: `GET /health`
### Endpoints
- `POST /v1/aiops/canary` - Run canary analysis
- `POST /v1/aiops/correlate` - Correlate events
- `POST /v1/aiops/remediate` - Execute remediation
- `GET /v1/aiops/slo/:service` - Get SLO budget status
## Artifacts
All operations generate artifacts in:
```
artifacts/aiops/
├── canary_YYYYMMDDHHMMSS/
│ ├── diff.json
│ └── report.md
├── correlation_YYYYMMDDHHMMSS.json
├── plan.json
└── exec_YYYYMMDDHHMMSS/
├── log.jsonl
└── summary.md
```
## Source
Extracted from: `blackroad-prism-console/aiops/`