| id | title | slug | description | tags | status |
|---|---|---|---|---|---|
| services-service-operator | Service: Operator | /services/service-operator | Documentation for the BlackRoad OS Operator service | | stable |
Service: Operator
What it does
The BlackRoad OS Operator is the job orchestration and execution engine. It manages:
- Asynchronous job processing
- Agent task execution
- Workflow orchestration
- Queue management
- Retry logic and error handling
The Operator is the backbone of BlackRoad OS automation, turning high-level requests from the API into executed work.
Repository
- GitHub: BlackRoad-OS/blackroad-os-operator
- Primary Language: TypeScript (Node.js)
- Queue System: BullMQ / Redis
Key Features
- 📋 Job queue management with BullMQ
- 🔄 Automatic retry with exponential backoff
- 🎯 Priority-based job scheduling
- 🔐 Secure job execution contexts
- 📊 Real-time job status updates
- 🧠 Agent memory and state management
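To make "automatic retry with exponential backoff" concrete, here is a minimal sketch of a doubling retry schedule. It assumes the delay starts at a base value (standing in for RETRY_BACKOFF_MS) and doubles on each attempt up to a retry limit (standing in for MAX_RETRIES). The function names `retryDelayMs` and `retrySchedule` are hypothetical, and the Operator's actual formula may differ.

```typescript
// Hypothetical helper: delay before a given retry attempt (1-based).
// Assumes a plain doubling schedule; the real service may add jitter or a cap.
function retryDelayMs(attempt: number, baseDelayMs: number): number {
  if (Math.floor(attempt) !== attempt || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  // attempt 1 waits baseDelayMs, attempt 2 waits 2x, attempt 3 waits 4x, ...
  return baseDelayMs * Math.pow(2, attempt - 1);
}

// Hypothetical helper: the full list of delays for a job that keeps failing.
function retrySchedule(maxRetries: number, baseDelayMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    delays.push(retryDelayMs(attempt, baseDelayMs));
  }
  return delays;
}

// With 4 retries and a 1000 ms base delay: [1000, 2000, 4000, 8000]
const exampleSchedule = retrySchedule(4, 1000);
```

Doubling keeps early retries fast while backing off quickly when a dependency stays unavailable.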
Architecture
flowchart TD
API[API Service] -->|Submit Job| Operator[Operator Service]
Operator --> Queue[(Redis Queue)]
Queue --> Worker1[Worker 1]
Queue --> Worker2[Worker 2]
Queue --> WorkerN[Worker N]
Worker1 --> Packs[Pack Executors]
Worker2 --> Packs
WorkerN --> Packs
Packs --> Results[Job Results]
Results --> DB[(Database)]
Deployment
The Operator service is deployed using:
- Platform: Railway
- Scaling: Horizontal scaling via worker processes
- Environment Variables: See .env.example in repository
- Health Checks: /health, /ready, /queue-status
For deployment procedures, see:
Health Checks
Standard endpoints:
| Endpoint | Purpose | Expected Response |
|---|---|---|
| GET /health | Basic health check | 200 OK |
| GET /ready | Readiness check (Redis connected) | 200 OK when ready |
| GET /queue-status | Queue metrics | 200 OK with queue stats |
| GET /version | Service version info | 200 OK with version |
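The endpoint table can be sketched as a single routing function that is independent of any HTTP framework. Everything here is an assumption for illustration: the `HealthDeps` probes, the response body shapes, and the 503 status for a not-ready service (the table only specifies the ready case).

```typescript
// Hypothetical dependency probes; a real service would query Redis and the queue.
interface HealthDeps {
  redisConnected: () => boolean;
  queueStats: () => { waiting: number; active: number; failed: number };
  version: string;
}

interface HealthResponse {
  status: number;
  body: Record<string, unknown>;
}

// Maps a request path to a status code and JSON body.
function handleHealthRequest(path: string, deps: HealthDeps): HealthResponse {
  switch (path) {
    case "/health":
      // Liveness only: the process is up and able to answer.
      return { status: 200, body: { status: "ok" } };
    case "/ready": {
      const ready = deps.redisConnected();
      // 503 when Redis is unreachable is an assumption, not taken from the table.
      return { status: ready ? 200 : 503, body: { ready } };
    }
    case "/queue-status":
      return { status: 200, body: { ...deps.queueStats() } };
    case "/version":
      return { status: 200, body: { version: deps.version } };
    default:
      return { status: 404, body: { error: "not found" } };
  }
}
```

Keeping /health free of dependency checks lets a platform tell "process is down" apart from "process is up but cannot reach Redis".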
Job Types
The Operator handles various job types:
Agent Execution Jobs
Execute agent logic with specific contexts and memory.
Workflow Jobs
Multi-step workflows with conditional logic and branching.
Scheduled Jobs
Cron-style recurring tasks.
Event-Triggered Jobs
Jobs triggered by system events or webhooks.
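One way to model these four job types in TypeScript is a discriminated union, sketched below. The `type` tags and every payload field (`agentId`, `steps`, `cron`, `eventName`, and so on) are illustrative assumptions, not the Operator's actual schema.

```typescript
// Hypothetical payload shapes, one per job type described above.
type OperatorJob =
  | { type: "agent-execution"; agentId: string; context: Record<string, unknown> }
  | { type: "workflow"; steps: string[] }
  | { type: "scheduled"; cron: string }
  | { type: "event-triggered"; eventName: string; payload: unknown };

// Dispatches on the tag; TypeScript narrows `job` inside each case.
function describeJob(job: OperatorJob): string {
  switch (job.type) {
    case "agent-execution":
      return `run agent ${job.agentId}`;
    case "workflow":
      return `run workflow with ${job.steps.length} step(s)`;
    case "scheduled":
      return `run on schedule ${job.cron}`;
    case "event-triggered":
      return `run on event ${job.eventName}`;
    default: {
      // Exhaustiveness check: this fails to compile if a new type is not handled.
      const unreachable: never = job;
      throw new Error(`unknown job: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The `never` assignment in the default branch means adding a fifth job type forces every dispatcher to be updated before the code compiles.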
Job Lifecycle
Pending → Active → Completed
│ │ ↓
│ └──→ Failed → Retrying
│ ↓
└───────────→ Cancelled
- Pending: Job is queued, waiting for a worker
- Active: Worker is processing the job
- Completed: Job finished successfully
- Failed: Job encountered an error
- Retrying: Failed job is being retried
- Cancelled: Job was manually cancelled
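The lifecycle can be expressed as a small state machine with an explicit transition table. This is a sketch under assumptions: the lowercase state names are illustrative, and the Retrying → Active edge is inferred from the description of retries rather than drawn in the diagram.

```typescript
type JobState = "pending" | "active" | "completed" | "failed" | "retrying" | "cancelled";

// Allowed next states for each state, following the diagram above.
const allowedTransitions: Record<JobState, JobState[]> = {
  pending: ["active", "cancelled"],
  active: ["completed", "failed"],
  failed: ["retrying"],
  retrying: ["active"], // assumption: a retried job is picked up by a worker again
  completed: [],        // terminal
  cancelled: [],        // terminal
};

// Returns the new state, or throws if the move is not in the table.
function transition(from: JobState, to: JobState): JobState {
  if (allowedTransitions[from].indexOf(to) === -1) {
    throw new Error(`invalid transition: ${from} -> ${to}`);
  }
  return to;
}

// A job that fails once and then succeeds on retry:
let state: JobState = "pending";
state = transition(state, "active");
state = transition(state, "failed");
state = transition(state, "retrying");
state = transition(state, "active");
state = transition(state, "completed");
```

Rejecting unknown transitions up front makes bugs such as completing a cancelled job surface as errors instead of silent state corruption.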
Related Services
- Service: API - Submits jobs to Operator
- Service: Core (planned) - Core business logic
- Service: Prism Console (planned) - Job monitoring UI
- Packs: Various pack services that execute specific job types
Environment Configuration
Key environment variables:
- REDIS_URL - Redis connection for queue
- DATABASE_URL - PostgreSQL for job metadata
- WORKER_CONCURRENCY - Number of concurrent jobs per worker
- JOB_TIMEOUT_MS - Default job timeout
- MAX_RETRIES - Maximum retry attempts
- RETRY_BACKOFF_MS - Initial retry delay
⚠️ Security: Never commit actual values. Use Railway secrets or equivalent.
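A minimal sketch of loading and validating these variables at startup follows. The variable names come from the list above; the `OperatorConfig` shape, the helper names, and all default values are assumptions made for illustration.

```typescript
interface OperatorConfig {
  redisUrl: string;
  databaseUrl: string;
  workerConcurrency: number;
  jobTimeoutMs: number;
  maxRetries: number;
  retryBackoffMs: number;
}

type Env = Record<string, string | undefined>;

// Fails fast when a required variable is absent or empty.
function requireString(env: Env, key: string): string {
  const value = env[key];
  if (value === undefined || value === "") {
    throw new Error(`missing required environment variable: ${key}`);
  }
  return value;
}

// Parses a non-negative integer, falling back to a default when unset.
function readInt(env: Env, key: string, fallback: number): number {
  const raw = env[key];
  if (raw === undefined || raw === "") return fallback;
  const parsed = Number(raw);
  if (!isFinite(parsed) || Math.floor(parsed) !== parsed || parsed < 0) {
    throw new Error(`${key} must be a non-negative integer, got "${raw}"`);
  }
  return parsed;
}

// The defaults below are placeholders, not the service's documented values.
function loadConfig(env: Env): OperatorConfig {
  return {
    redisUrl: requireString(env, "REDIS_URL"),
    databaseUrl: requireString(env, "DATABASE_URL"),
    workerConcurrency: readInt(env, "WORKER_CONCURRENCY", 5),
    jobTimeoutMs: readInt(env, "JOB_TIMEOUT_MS", 30000),
    maxRetries: readInt(env, "MAX_RETRIES", 3),
    retryBackoffMs: readInt(env, "RETRY_BACKOFF_MS", 1000),
  };
}
```

In a Node.js process this would typically be called once as `loadConfig(process.env)`, so a misconfigured deployment fails at boot instead of on the first job.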
Development
Local development setup:
# Clone the repository
git clone https://github.com/BlackRoad-OS/blackroad-os-operator.git
cd blackroad-os-operator
# Install dependencies
npm install
# Set up environment
cp .env.example .env
# Edit .env with local values
# Start Redis (via Docker)
docker run -d -p 6379:6379 redis:latest
# Run in development mode
npm run dev
See Local Development Guide for more details.
Monitoring
- Queue Dashboard: BullBoard UI (if enabled)
- Metrics: Job completion rates, failure rates, latency
- Logs: Structured logging with context
- Alerts: Configure alerts for queue depth, failed jobs
- Prism Console: Real-time monitoring
Performance Tuning
Worker Concurrency
Adjust WORKER_CONCURRENCY based on:
- Available CPU/memory
- Job complexity
- External API rate limits
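As a rough starting point, these factors can be combined into an upper bound on concurrency. The sketch below is a hypothetical heuristic, not the Operator's tuning logic: it takes the smaller of a memory bound and a rate-limit bound, and leaves CPU to be verified by measurement. All input names are assumptions.

```typescript
interface ConcurrencyInputs {
  availableMemoryMb: number;  // memory the worker process may use
  memoryPerJobMb: number;     // estimated peak memory of one job
  apiRateLimitPerSec: number; // external API calls allowed per second
  apiCallsPerJob: number;     // external calls a single job makes
  avgJobSeconds: number;      // average job duration
}

function suggestConcurrency(i: ConcurrencyInputs): number {
  // How many jobs fit in memory at once.
  const memoryBound = Math.floor(i.availableMemoryMb / i.memoryPerJobMb);
  // N concurrent jobs issue roughly N * apiCallsPerJob / avgJobSeconds calls per
  // second, so N must stay at or below rateLimit * avgJobSeconds / apiCallsPerJob.
  const rateBound = Math.floor((i.apiRateLimitPerSec * i.avgJobSeconds) / i.apiCallsPerJob);
  // Never suggest less than one concurrent job.
  return Math.max(1, Math.min(memoryBound, rateBound));
}

// 2048 MB / 128 MB = 16 jobs by memory; 10 calls/s * 2 s / 4 calls = 5 jobs by rate.
const suggested = suggestConcurrency({
  availableMemoryMb: 2048,
  memoryPerJobMb: 128,
  apiRateLimitPerSec: 10,
  apiCallsPerJob: 4,
  avgJobSeconds: 2,
}); // 5
```

Here the external rate limit, not memory, is the binding constraint, so adding memory would not justify a higher WORKER_CONCURRENCY.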
Queue Priority
Set job priorities to ensure critical jobs execute first:
- High: User-facing operations
- Normal: Background tasks
- Low: Maintenance, cleanup jobs
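The three tiers can be mapped to numbers and used to order waiting jobs, as in this in-memory sketch. The numeric values are assumptions; they follow the BullMQ convention that a lower priority number is processed first. `QueuedJob` and `orderByPriority` are hypothetical names.

```typescript
type PriorityLabel = "high" | "normal" | "low";

// Assumed mapping: lower number means processed earlier.
const priorityValue: Record<PriorityLabel, number> = { high: 1, normal: 5, low: 10 };

interface QueuedJob {
  id: string;
  priority: PriorityLabel;
  enqueuedAt: number; // milliseconds since epoch
}

// Returns a new array ordered by priority, then by arrival time within a tier.
function orderByPriority(jobs: QueuedJob[]): QueuedJob[] {
  return jobs.slice().sort((a, b) =>
    priorityValue[a.priority] - priorityValue[b.priority] || a.enqueuedAt - b.enqueuedAt
  );
}

const ordered = orderByPriority([
  { id: "cleanup", priority: "low", enqueuedAt: 1 },
  { id: "report", priority: "normal", enqueuedAt: 2 },
  { id: "login", priority: "high", enqueuedAt: 3 },
  { id: "signup", priority: "high", enqueuedAt: 4 },
]);
// Order of ids: login, signup, report, cleanup
```

Breaking ties by arrival time keeps jobs within a tier first-in, first-out, so a steady stream of high-priority work cannot reorder its own backlog.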
Memory Management
Monitor worker memory usage:
- Restart workers periodically if memory leaks are detected
- Use separate queues for memory-intensive jobs
Troubleshooting
Common issues:
Jobs stuck in pending
- Check Redis connectivity
- Verify workers are running
- Review worker logs for errors
High failure rates
- Check job timeout settings
- Review error logs for patterns
- Verify external service availability
Queue growing indefinitely
- Increase worker count
- Reduce job creation rate
- Identify and fix failing jobs
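A queue grows without bound whenever jobs arrive faster than workers can finish them. The arithmetic below is a back-of-the-envelope sketch for sizing the worker count; the `QueueLoad` fields are assumptions, and the model ignores extra load from retried jobs.

```typescript
interface QueueLoad {
  arrivalsPerSec: number;       // jobs submitted per second
  workers: number;              // running worker processes
  concurrencyPerWorker: number; // WORKER_CONCURRENCY
  avgJobSeconds: number;        // average processing time per job
}

// Jobs per second the current fleet can complete.
function throughputPerSec(l: QueueLoad): number {
  return (l.workers * l.concurrencyPerWorker) / l.avgJobSeconds;
}

// Positive result means the backlog is growing by that many jobs per second.
function backlogGrowthPerSec(l: QueueLoad): number {
  return l.arrivalsPerSec - throughputPerSec(l);
}

// Smallest worker count whose throughput matches the arrival rate.
function workersNeeded(l: QueueLoad): number {
  return Math.ceil((l.arrivalsPerSec * l.avgJobSeconds) / l.concurrencyPerWorker);
}

const load: QueueLoad = { arrivalsPerSec: 12, workers: 2, concurrencyPerWorker: 5, avgJobSeconds: 2 };
// Throughput: 2 * 5 / 2 = 5 jobs/s, so the backlog grows by 7 jobs/s.
// Workers needed: ceil(12 * 2 / 5) = 5.
```

If the required worker count is unrealistic, the remaining options are the other two items above: reduce the job creation rate, or fix failing jobs so retries stop consuming capacity.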
For debugging procedures, see Debug Operator Runbook (planned).
Contributing
To contribute to the Operator service:
- Review Contributing Guide
- Follow Coding Standards (planned)
- Understand job lifecycle and queue patterns
- Submit PRs with tests
See Also
- Operator Runtime - Operational guide
- Core Primitives - Job data structures
- Events and RoadChain - Event-driven architecture
- Agents Atlas - Agent ecosystem