blackroad-os-docs/docs/services/service-operator.md
2025-11-24 16:44:52 +00:00

id: services-service-operator
title: "Service: Operator"
slug: /services/service-operator
description: Documentation for the BlackRoad OS Operator service
tags: services, operator, jobs
status: stable

Service: Operator

What it does

The BlackRoad OS Operator is the job orchestration and execution engine. It manages:

  • Asynchronous job processing
  • Agent task execution
  • Workflow orchestration
  • Queue management
  • Retry logic and error handling

The Operator is the backbone of BlackRoad OS automation, turning high-level requests from the API into executed work.

Repository

https://github.com/BlackRoad-OS/blackroad-os-operator

Key Features

  • 📋 Job queue management with BullMQ
  • 🔄 Automatic retry with exponential backoff
  • 🎯 Priority-based job scheduling
  • 🔐 Secure job execution contexts
  • 📊 Real-time job status updates
  • 🧠 Agent memory and state management
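
As a rough illustration of the "automatic retry with exponential backoff" feature, the sketch below computes retry delays. It assumes the delay starts at an initial value (such as RETRY_BACKOFF_MS) and doubles on each attempt, up to a retry limit (such as MAX_RETRIES). The function names are hypothetical and the Operator's actual backoff curve may differ.

```typescript
// Hypothetical helper, not taken from the Operator codebase.
// Delay before retry number `attempt` (1-based), assuming the delay
// doubles on every attempt: initial, 2x initial, 4x initial, ...
function retryDelayMs(attempt: number, initialDelayMs: number): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  return initialDelayMs * Math.pow(2, attempt - 1);
}

// Full list of delays for a job that may be retried `maxRetries` times.
function retrySchedule(maxRetries: number, initialDelayMs: number): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    delays.push(retryDelayMs(attempt, initialDelayMs));
  }
  return delays;
}
```

With these assumptions, `retrySchedule(4, 1000)` yields `[1000, 2000, 4000, 8000]`, so a job that keeps failing waits 15 seconds in total across four retries.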

Architecture

flowchart TD
    API[API Service] -->|Submit Job| Operator[Operator Service]
    Operator --> Queue[(Redis Queue)]
    Queue --> Worker1[Worker 1]
    Queue --> Worker2[Worker 2]
    Queue --> WorkerN[Worker N]
    Worker1 --> Packs[Pack Executors]
    Worker2 --> Packs
    WorkerN --> Packs
    Packs --> Results[Job Results]
    Results --> DB[(Database)]

Deployment

The Operator service is deployed using:

  • Platform: Railway
  • Scaling: Horizontal scaling via worker processes
  • Environment Variables: See .env.example in repository
  • Health Checks: /health, /ready, /queue-status

For deployment procedures, see:

Health Checks

Standard endpoints:

| Endpoint | Purpose | Expected Response |
| --- | --- | --- |
| GET /health | Basic health check | 200 OK |
| GET /ready | Readiness check (Redis connected) | 200 OK when ready |
| GET /queue-status | Queue metrics | 200 OK with queue stats |
| GET /version | Service version info | 200 OK with version |
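
The difference between /health and /ready can be sketched as a pure function that maps the Redis connection state to a response. This is a minimal sketch, not the service's handler: the `ProbeResult` shape and the 503 status for the not-ready case are assumptions, since the table documents only the 200 case.

```typescript
// Hypothetical response shape; the real endpoints may return a different body.
type ProbeResult = { status: number; body: { ok: boolean; reason?: string } };

// Liveness: the process is up and able to answer requests.
function healthProbe(): ProbeResult {
  return { status: 200, body: { ok: true } };
}

// Readiness: only report ready once Redis is connected.
// Returning 503 when not ready is an assumption, not documented behavior.
function readinessProbe(redisConnected: boolean): ProbeResult {
  if (redisConnected) {
    return { status: 200, body: { ok: true } };
  }
  return { status: 503, body: { ok: false, reason: "redis not connected" } };
}
```

Keeping the two probes separate lets a platform restart a dead process based on /health while holding traffic back from a live process that has not yet reached Redis.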

Job Types

The Operator handles various job types:

Agent Execution Jobs

Execute agent logic with specific contexts and memory.

Workflow Jobs

Multi-step workflows with conditional logic and branching.

Scheduled Jobs

Cron-style recurring tasks.

Event-Triggered Jobs

Jobs triggered by system events or webhooks.
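
One way to model the four job types above is a discriminated union, so that a worker can branch on the job kind and the compiler can check that every kind is handled. The field names (`agentId`, `steps`, `cron`, `eventName`) are hypothetical placeholders and are not taken from the Operator's actual schema.

```typescript
// Hypothetical payload shapes for the four documented job types.
type Job =
  | { kind: "agent"; agentId: string; context: Record<string, unknown> }
  | { kind: "workflow"; steps: string[] }
  | { kind: "scheduled"; cron: string }
  | { kind: "event"; eventName: string };

// Branch on the discriminant; the `never` check fails to compile
// if a new job kind is added without a matching case.
function describeJob(job: Job): string {
  switch (job.kind) {
    case "agent":
      return `run agent ${job.agentId}`;
    case "workflow":
      return `run workflow with ${job.steps.length} step(s)`;
    case "scheduled":
      return `run on schedule "${job.cron}"`;
    case "event":
      return `run when "${job.eventName}" fires`;
    default: {
      const unreachable: never = job;
      throw new Error(`unknown job: ${JSON.stringify(unreachable)}`);
    }
  }
}
```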

Job Lifecycle

Pending → Active → Completed
   │        │          ↓
   │        └──→ Failed → Retrying
   │                  ↓
   └───────────→ Cancelled

  1. Pending: Job is queued, waiting for worker
  2. Active: Worker is processing the job
  3. Completed: Job finished successfully
  4. Failed: Job encountered an error
  5. Retrying: Failed job is being retried
  6. Cancelled: Job was manually cancelled
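
The lifecycle above can be sketched as a transition table. This sketch encodes only the edges that are unambiguous in the diagram (Pending to Active or Cancelled, Active to Completed or Failed, Failed to Retrying). The Retrying to Active edge is an assumption about how a retry re-enters processing, and the names are hypothetical.

```typescript
type JobState =
  | "pending"
  | "active"
  | "completed"
  | "failed"
  | "retrying"
  | "cancelled";

// Allowed next states for each state.
const TRANSITIONS: Record<JobState, JobState[]> = {
  pending: ["active", "cancelled"],
  active: ["completed", "failed"],
  failed: ["retrying"],
  retrying: ["active"], // assumption: a retried job is picked up by a worker again
  completed: [],
  cancelled: [],
};

// True when moving a job from `from` to `to` is allowed by the table.
function canTransition(from: JobState, to: JobState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}
```

A guard like `canTransition` is useful wherever job status is written, because it rejects updates such as Completed to Active that would otherwise corrupt job history.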

Environment Configuration

Key environment variables:

  • REDIS_URL - Redis connection for queue
  • DATABASE_URL - PostgreSQL for job metadata
  • WORKER_CONCURRENCY - Number of concurrent jobs per worker
  • JOB_TIMEOUT_MS - Default job timeout
  • MAX_RETRIES - Maximum retry attempts
  • RETRY_BACKOFF_MS - Initial retry delay
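
A minimal sketch of loading and validating these variables at startup is shown below. The `OperatorConfig` interface, the helper names, and the validation bounds are assumptions made for illustration. The function takes the environment as a plain object, so in a Node process it could be called as `loadConfig(process.env)`.

```typescript
// Hypothetical config shape mirroring the variables listed above.
interface OperatorConfig {
  redisUrl: string;
  databaseUrl: string;
  workerConcurrency: number;
  jobTimeoutMs: number;
  maxRetries: number;
  retryBackoffMs: number;
}

type Env = Record<string, string | undefined>;

// Fail fast with the variable name (never its value) when something is missing.
function requireString(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

// Parse an integer setting and enforce a lower bound (bounds are assumptions).
function requireInt(env: Env, name: string, min: number): number {
  const parsed = Number(requireString(env, name));
  if (!Number.isInteger(parsed) || parsed < min) {
    throw new Error(`${name} must be an integer >= ${min}`);
  }
  return parsed;
}

function loadConfig(env: Env): OperatorConfig {
  return {
    redisUrl: requireString(env, "REDIS_URL"),
    databaseUrl: requireString(env, "DATABASE_URL"),
    workerConcurrency: requireInt(env, "WORKER_CONCURRENCY", 1),
    jobTimeoutMs: requireInt(env, "JOB_TIMEOUT_MS", 1),
    maxRetries: requireInt(env, "MAX_RETRIES", 0),
    retryBackoffMs: requireInt(env, "RETRY_BACKOFF_MS", 0),
  };
}
```

Validating everything once at startup means a misconfigured worker exits immediately with a clear message instead of failing on its first job.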

⚠️ Security: Never commit actual values. Use Railway secrets or equivalent.

Development

Local development setup:

# Clone the repository
git clone https://github.com/BlackRoad-OS/blackroad-os-operator.git
cd blackroad-os-operator

# Install dependencies
npm install

# Set up environment
cp .env.example .env
# Edit .env with local values

# Start Redis (via Docker)
docker run -d -p 6379:6379 redis:latest

# Run in development mode
npm run dev

See Local Development Guide for more details.

Monitoring

  • Queue Dashboard: BullBoard UI (if enabled)
  • Metrics: Job completion rates, failure rates, latency
  • Logs: Structured logging with context
  • Alerts: Configure alerts for queue depth, failed jobs
  • Prism Console: Real-time monitoring

Performance Tuning

Worker Concurrency

Adjust WORKER_CONCURRENCY based on:

  • Available CPU/memory
  • Job complexity
  • External API rate limits
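
For the rate-limit factor, a back-of-the-envelope bound can help pick a starting value. The sketch below assumes every job takes roughly the same time and makes a fixed number of calls to one rate-limited API. All parameter names are hypothetical, and CPU and memory limits still need to be checked separately.

```typescript
// Hypothetical inputs for estimating a rate-limit-safe concurrency.
interface RateLimitInputs {
  rateLimitPerSecond: number; // calls per second the external API allows
  avgJobSeconds: number;      // average wall-clock time of one job
  callsPerJob: number;        // external API calls made by one job
  workerCount: number;        // number of worker processes sharing the limit
}

// With total concurrency C, the call rate is C * callsPerJob / avgJobSeconds.
// Keeping that at or below the limit gives
//   C <= rateLimitPerSecond * avgJobSeconds / callsPerJob,
// which is then split evenly across workers (never below 1).
function maxConcurrencyPerWorker(inputs: RateLimitInputs): number {
  const total =
    (inputs.rateLimitPerSecond * inputs.avgJobSeconds) / inputs.callsPerJob;
  return Math.max(1, Math.floor(total / inputs.workerCount));
}
```

For example, a limit of 10 calls per second, 2-second jobs, and 4 calls per job allow about 5 jobs in flight in total, so two workers would each start with a WORKER_CONCURRENCY of 2.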

Queue Priority

Set job priorities to ensure critical jobs execute first:

  • High: User-facing operations
  • Normal: Background tasks
  • Low: Maintenance, cleanup jobs
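
The three priority levels can be sketched as a numeric rank plus a sort. The rank values and the `QueuedJob` shape are hypothetical. The sketch uses "lower number runs first", which matches how BullMQ's `priority` job option is ordered, and breaks ties by enqueue time so that jobs of equal priority stay first-in, first-out.

```typescript
type Priority = "high" | "normal" | "low";

// Hypothetical numeric ranks: a lower number is picked up earlier.
const PRIORITY_RANK: Record<Priority, number> = { high: 1, normal: 2, low: 3 };

interface QueuedJob {
  id: string;
  priority: Priority;
  enqueuedAt: number; // e.g. a millisecond timestamp
}

// Return a new array in execution order, leaving the input untouched.
function executionOrder(jobs: QueuedJob[]): QueuedJob[] {
  return jobs.slice().sort((a, b) => {
    const byPriority = PRIORITY_RANK[a.priority] - PRIORITY_RANK[b.priority];
    if (byPriority !== 0) {
      return byPriority;
    }
    return a.enqueuedAt - b.enqueuedAt; // older job first within a priority
  });
}
```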

Memory Management

Monitor worker memory usage:

  • Restart workers periodically if memory leaks are detected
  • Use separate queues for memory-intensive jobs

Troubleshooting

Common issues:

Jobs stuck in pending

  • Check Redis connectivity
  • Verify workers are running
  • Review worker logs for errors

High failure rates

  • Check job timeout settings
  • Review error logs for patterns
  • Verify external service availability

Queue growing indefinitely

  • Increase worker count
  • Reduce job creation rate
  • Identify and fix failing jobs
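
Whether a queue grows or drains comes down to arrival rate versus processing capacity. The sketch below estimates both under simplifying assumptions: steady arrival, uniform job duration, and no failures or retries. The function names are hypothetical.

```typescript
// Jobs per second the worker fleet can complete.
function capacityPerSecond(
  workerCount: number,
  concurrencyPerWorker: number,
  avgJobSeconds: number
): number {
  return (workerCount * concurrencyPerWorker) / avgJobSeconds;
}

// Positive result: the queue grows by that many jobs per second.
// Zero or negative: workers keep pace or drain the backlog.
function queueGrowthPerSecond(
  arrivalPerSecond: number,
  workerCount: number,
  concurrencyPerWorker: number,
  avgJobSeconds: number
): number {
  return (
    arrivalPerSecond -
    capacityPerSecond(workerCount, concurrencyPerWorker, avgJobSeconds)
  );
}

// Smallest worker count whose capacity at least matches the arrival rate.
function workersNeeded(
  arrivalPerSecond: number,
  concurrencyPerWorker: number,
  avgJobSeconds: number
): number {
  return Math.ceil((arrivalPerSecond * avgJobSeconds) / concurrencyPerWorker);
}
```

For example, with 12 jobs arriving per second, 1-second jobs, and 2 workers at a concurrency of 5, capacity is 10 jobs per second and the queue grows by 2 jobs per second. Under the same assumptions a third worker is enough to keep pace.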

For debugging procedures, see Debug Operator Runbook (planned).

Contributing

To contribute to the Operator service:

  1. Review Contributing Guide
  2. Follow Coding Standards (planned)
  3. Understand job lifecycle and queue patterns
  4. Submit PRs with tests

See Also