copilot-swe-agent[bot] 702ae7eaea Fix cross-directory link paths and remove incorrect status markers

- Fix relative paths for cross-directory links (../ops/, ../services/, etc.)
- Remove _(planned)_ markers from services that actually exist
- Remove confusing _(reference CONTRIBUTING.md)_ comments
- All links now properly reference correct paths
- Build still passes successfully

Co-authored-by: blackboxprogramming <118287761+blackboxprogramming@users.noreply.github.com>

2025-11-24 16:44:52 +00:00

Service: Operator

What it does

The BlackRoad OS Operator is the job orchestration and execution engine. It manages:

Asynchronous job processing
Agent task execution
Workflow orchestration
Queue management
Retry logic and error handling

The Operator is the backbone of BlackRoad OS automation, turning high-level requests from the API into executed work.

Repository

GitHub: BlackRoad-OS/blackroad-os-operator
Primary Language: TypeScript (Node.js)
Queue System: BullMQ / Redis

Key Features

📋 Job queue management with BullMQ
🔄 Automatic retry with exponential backoff
🎯 Priority-based job scheduling
🔐 Secure job execution contexts
📊 Real-time job status updates
🧠 Agent memory and state management
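The exponential backoff listed above can be illustrated with a small calculation. This is a minimal sketch, not the Operator's actual retry code: `retryDelayMs` and `retrySchedule` are hypothetical helpers, and the doubling factor and the 60-second cap are assumptions chosen for illustration.

```typescript
// Hypothetical helper: delay before retry number `attempt` (1-based).
// Assumption: the delay doubles on every attempt and is capped at maxDelayMs.
function retryDelayMs(attempt: number, baseDelayMs: number, maxDelayMs: number = 60000): number {
  if (attempt < 1 || Math.floor(attempt) !== attempt) {
    throw new RangeError("attempt must be a positive integer");
  }
  const delay = baseDelayMs * Math.pow(2, attempt - 1);
  return Math.min(delay, maxDelayMs);
}

// Hypothetical helper: the full list of delays for a job that may be
// retried up to `maxRetries` times.
function retrySchedule(maxRetries: number, baseDelayMs: number, maxDelayMs: number = 60000): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt <= maxRetries; attempt++) {
    delays.push(retryDelayMs(attempt, baseDelayMs, maxDelayMs));
  }
  return delays;
}
```

With a base delay of 500 ms and four retries, this sketch waits 500, 1000, 2000, and 4000 ms. The cap keeps a long retry chain from producing unbounded delays.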

Architecture

flowchart TD
    API[API Service] -->|Submit Job| Operator[Operator Service]
    Operator --> Queue[(Redis Queue)]
    Queue --> Worker1[Worker 1]
    Queue --> Worker2[Worker 2]
    Queue --> WorkerN[Worker N]
    Worker1 --> Packs[Pack Executors]
    Worker2 --> Packs
    WorkerN --> Packs
    Packs --> Results[Job Results]
    Results --> DB[(Database)]

Deployment

The Operator service is deployed using:

Platform: Railway
Scaling: Horizontal scaling via worker processes
Environment Variables: See .env.example in repository
Health Checks: /health, /ready, /queue-status, /version

For deployment procedures, see:

Operator Runtime Guide
Deploy Operator Runbook (planned)

Health Checks

Standard endpoints:

| Endpoint | Purpose | Expected Response |
| --- | --- | --- |
| `GET /health` | Basic health check | `200 OK` |
| `GET /ready` | Readiness check (Redis connected) | `200 OK` when ready |
| `GET /queue-status` | Queue metrics | `200 OK` with queue stats |
| `GET /version` | Service version info | `200 OK` with version |
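The endpoint table can be read as a small routing function. The sketch below is framework-agnostic and purely illustrative: `ServiceState`, `HealthResponse`, and `handleHealthRequest` are hypothetical names, the response bodies are invented, and returning `503` while Redis is disconnected is an assumption (the table only states that `/ready` returns `200 OK` when ready).

```typescript
// Hypothetical snapshot of what the service knows about itself.
interface ServiceState {
  redisConnected: boolean;
  version: string;
  queue: { waiting: number; active: number; failed: number };
}

interface HealthResponse {
  status: number;
  body: Record<string, unknown>;
}

// Maps a GET path to a status code and JSON body.
function handleHealthRequest(path: string, state: ServiceState): HealthResponse {
  switch (path) {
    case "/health":
      // Liveness only: the process is up and able to answer.
      return { status: 200, body: { status: "ok" } };
    case "/ready":
      // Assumption: 503 signals "not ready" until Redis is reachable.
      return state.redisConnected
        ? { status: 200, body: { ready: true } }
        : { status: 503, body: { ready: false } };
    case "/queue-status":
      return { status: 200, body: { ...state.queue } };
    case "/version":
      return { status: 200, body: { version: state.version } };
    default:
      return { status: 404, body: { error: "not found" } };
  }
}
```

Keeping the routing logic separate from the HTTP server makes it easy to unit test without opening a port.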

Job Types

The Operator handles various job types:

Agent Execution Jobs

Execute agent logic with specific contexts and memory.

Workflow Jobs

Multi-step workflows with conditional logic and branching.

Scheduled Jobs

Cron-style recurring tasks.

Event-Triggered Jobs

Jobs triggered by system events or webhooks.
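One way to model the four job types in TypeScript is a discriminated union, so each handler sees only the fields that belong to its type. The payload shapes below (`agentId`, `steps`, `cron`, `eventName`, and so on) are hypothetical and are not taken from the Operator codebase.

```typescript
// Hypothetical payload shapes, one per job type described above.
type OperatorJob =
  | { type: "agent"; agentId: string; context: Record<string, unknown> }
  | { type: "workflow"; steps: string[] }
  | { type: "scheduled"; cron: string; task: string }
  | { type: "event"; eventName: string; payload: unknown };

// Returns a short human-readable summary of a job.
function describeJob(job: OperatorJob): string {
  switch (job.type) {
    case "agent":
      return `agent job for ${job.agentId}`;
    case "workflow":
      return `workflow job with ${job.steps.length} step(s)`;
    case "scheduled":
      return `scheduled job "${job.task}" on cron "${job.cron}"`;
    case "event":
      return `event job triggered by ${job.eventName}`;
    default: {
      // Compile-time exhaustiveness check: adding a new job type
      // without handling it here becomes a type error.
      const unreachable: never = job;
      throw new Error(`Unhandled job: ${JSON.stringify(unreachable)}`);
    }
  }
}
```

The `never` check in the default branch is the main design benefit: the compiler flags every switch that has not been updated when a job type is added.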

Job Lifecycle

Pending → Active → Completed
   │        │          ↓
   │        └──→ Failed → Retrying
   │                  ↓
   └───────────→ Cancelled

Pending: Job is queued, waiting for worker
Active: Worker is processing the job
Completed: Job finished successfully
Failed: Job encountered an error
Retrying: Failed job is being retried
Cancelled: Job was manually cancelled
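The lifecycle above can be expressed as a transition table that rejects invalid state changes. This is a sketch under stated assumptions: the diagram does not spell out every edge, so the table assumes that only pending jobs can be cancelled and that a retrying job returns to active. `JobState`, `TRANSITIONS`, `canTransition`, and `transition` are hypothetical names.

```typescript
type JobState = "pending" | "active" | "completed" | "failed" | "retrying" | "cancelled";

// Allowed next states for each state (assumed edges, see the note above).
const TRANSITIONS: Record<JobState, readonly JobState[]> = {
  pending: ["active", "cancelled"],
  active: ["completed", "failed"],
  failed: ["retrying"], // a failed job with no retries left simply stays failed
  retrying: ["active"],
  completed: [],
  cancelled: [],
};

function canTransition(from: JobState, to: JobState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not allowed.
function transition(from: JobState, to: JobState): JobState {
  if (!canTransition(from, to)) {
    throw new Error(`Invalid job state transition: ${from} -> ${to}`);
  }
  return to;
}
```

For example, `transition("pending", "active")` returns `"active"`, while `transition("completed", "active")` throws because completed is a terminal state in this sketch.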

Related Services

Service: API - Submits jobs to Operator
Service: Core (planned) - Core business logic
Service: Prism Console (planned) - Job monitoring UI
Packs: Various pack services that execute specific job types

Environment Configuration

Key environment variables:

REDIS_URL - Redis connection for queue
DATABASE_URL - PostgreSQL for job metadata
WORKER_CONCURRENCY - Number of concurrent jobs per worker
JOB_TIMEOUT_MS - Default job timeout
MAX_RETRIES - Maximum retry attempts
RETRY_BACKOFF_MS - Initial retry delay

⚠️ Security: Never commit actual values. Use Railway secrets or equivalent.
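A typed loader can validate these variables once at startup instead of reading them ad hoc. The following is a minimal sketch: `OperatorConfig`, `loadConfig`, and the helper functions are hypothetical, and the numeric defaults are illustrative assumptions rather than the service's real defaults. Error messages name the variable but never echo its value.

```typescript
interface OperatorConfig {
  redisUrl: string;
  databaseUrl: string;
  workerConcurrency: number;
  jobTimeoutMs: number;
  maxRetries: number;
  retryBackoffMs: number;
}

type Env = Record<string, string | undefined>;

// Reads a required variable; fails fast when it is missing or empty.
function requireString(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`Missing required environment variable ${name}`);
  }
  return value;
}

// Reads an optional non-negative integer, falling back to a default.
function readInt(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw === "") return fallback;
  const parsed = Number(raw);
  if (!isFinite(parsed) || Math.floor(parsed) !== parsed || parsed < 0) {
    throw new Error(`Environment variable ${name} must be a non-negative integer`);
  }
  return parsed;
}

function loadConfig(env: Env): OperatorConfig {
  return {
    redisUrl: requireString(env, "REDIS_URL"),
    databaseUrl: requireString(env, "DATABASE_URL"),
    // The defaults below are assumptions for illustration only.
    workerConcurrency: readInt(env, "WORKER_CONCURRENCY", 5),
    jobTimeoutMs: readInt(env, "JOB_TIMEOUT_MS", 30000),
    maxRetries: readInt(env, "MAX_RETRIES", 3),
    retryBackoffMs: readInt(env, "RETRY_BACKOFF_MS", 1000),
  };
}
```

In a Node.js process this would typically be called once as `loadConfig(process.env)`. Passing the environment in as an argument keeps the loader easy to test.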

Development

Local development setup:

# Clone the repository
git clone https://github.com/BlackRoad-OS/blackroad-os-operator.git
cd blackroad-os-operator

# Install dependencies
npm install

# Set up environment
cp .env.example .env
# Edit .env with local values

# Start Redis (via Docker)
docker run -d -p 6379:6379 redis:latest

# Run in development mode
npm run dev

See Local Development Guide for more details.

Monitoring

Queue Dashboard: BullBoard UI (if enabled)
Metrics: Job completion rates, failure rates, latency
Logs: Structured logging with context
Alerts: Configure alerts for queue depth, failed jobs
Prism Console: Real-time monitoring

Performance Tuning

Worker Concurrency

Adjust WORKER_CONCURRENCY based on:

Available CPU/memory
Job complexity
External API rate limits
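For the rate-limit factor, a rough upper bound on concurrency can be estimated from the limit and the average job duration. This is a back-of-the-envelope sketch, not a tuning rule from the Operator project: `maxConcurrencyForRateLimit` is a hypothetical helper, and it assumes that jobs run at a steady rate and that each job makes a fixed number of calls to the external API.

```typescript
// Estimates how many jobs can run at once without exceeding an external
// API rate limit. With N concurrent jobs that each take `avgJobSeconds`,
// roughly N / avgJobSeconds jobs finish per second, and each one makes
// `callsPerJob` API calls.
function maxConcurrencyForRateLimit(
  limitPerSecond: number,
  avgJobSeconds: number,
  callsPerJob: number = 1
): number {
  if (limitPerSecond <= 0 || avgJobSeconds <= 0 || callsPerJob <= 0) {
    throw new RangeError("all arguments must be positive");
  }
  const sustainableJobsPerSecond = limitPerSecond / callsPerJob;
  // Never return less than 1, otherwise no job could run at all.
  return Math.max(1, Math.floor(sustainableJobsPerSecond * avgJobSeconds));
}
```

With a limit of 10 requests per second, 2-second jobs, and one call per job, the estimate is 20 concurrent jobs across all workers. Treat the result as a starting point and confirm it against CPU and memory headroom.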

Queue Priority

Set job priorities to ensure critical jobs execute first:

High: User-facing operations
Normal: Background tasks
Low: Maintenance, cleanup jobs
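The three levels can be mapped to numeric priorities. In BullMQ, a lower `priority` number means the job is processed sooner, so the sketch below gives High the smallest value. The specific numbers and the `PRIORITY`, `QueuedJob`, and `executionOrder` names are assumptions for illustration. The function only simulates the resulting order and does not talk to a queue.

```typescript
type PriorityLevel = "high" | "normal" | "low";

// Assumed numeric values; lower numbers run first.
const PRIORITY: Record<PriorityLevel, number> = {
  high: 1,
  normal: 5,
  low: 10,
};

interface QueuedJob {
  name: string;
  level: PriorityLevel;
}

// Returns job names in the order they would run: by priority first,
// then by submission order among jobs of equal priority.
function executionOrder(jobs: QueuedJob[]): string[] {
  return jobs
    .map((job, index) => ({ job, index }))
    .sort((a, b) => PRIORITY[a.job.level] - PRIORITY[b.job.level] || a.index - b.index)
    .map((entry) => entry.job.name);
}
```

Leaving gaps between the numeric values makes it possible to add intermediate levels later without renumbering existing jobs.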

Memory Management

Monitor worker memory usage:

Restart workers periodically if memory leaks are detected
Use separate queues for memory-intensive jobs

Troubleshooting

Common issues:

Jobs stuck in pending

Check Redis connectivity
Verify workers are running
Review worker logs for errors

High failure rates

Check job timeout settings
Review error logs for patterns
Verify external service availability

Queue growing indefinitely

Increase worker count
Reduce job creation rate
Identify and fix failing jobs

For debugging procedures, see Debug Operator Runbook (planned).

Contributing

To contribute to the Operator service:

Review Contributing Guide
Follow Coding Standards (planned)
Understand job lifecycle and queue patterns
Submit PRs with tests
