---
id: services-service-operator
title: "Service: Operator"
slug: /services/service-operator
description: "Documentation for the BlackRoad OS Operator service"
tags: ["services", "operator", "jobs"]
status: stable
---

# Service: Operator

## What it does

The **BlackRoad OS Operator** is the job orchestration and execution engine. It manages:

- Asynchronous job processing
- Agent task execution
- Workflow orchestration
- Queue management
- Retry logic and error handling

The Operator is the backbone of BlackRoad OS automation, turning high-level requests from the API into executed work.

## Repository

- **GitHub:** [BlackRoad-OS/blackroad-os-operator](https://github.com/BlackRoad-OS/blackroad-os-operator)
- **Primary Language:** TypeScript (Node.js)
- **Queue System:** BullMQ / Redis

## Key Features

- 📋 Job queue management with BullMQ
- 🔄 Automatic retry with exponential backoff
- 🎯 Priority-based job scheduling
- 🔐 Secure job execution contexts
- 📊 Real-time job status updates
- 🧠 Agent memory and state management
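
To illustrate the retry feature, here is a minimal sketch of an exponential backoff schedule. It assumes the delay doubles after every failed attempt, which is a common convention but is not confirmed here as the Operator's exact policy. The function names are hypothetical; the two parameters stand in for the `RETRY_BACKOFF_MS` and `MAX_RETRIES` settings.

```typescript
// Sketch only: assumes delay = RETRY_BACKOFF_MS * 2^(retry - 1).
// `retryDelayMs` and `retrySchedule` are illustrative names, not Operator APIs.
function retryDelayMs(retryNumber: number, retryBackoffMs: number): number {
  if (Math.floor(retryNumber) !== retryNumber || retryNumber < 1) {
    throw new RangeError("retryNumber must be a positive integer");
  }
  // First retry waits the base delay, each later retry waits twice as long.
  return retryBackoffMs * Math.pow(2, retryNumber - 1);
}

// Builds the full list of delays for a job allowed `maxRetries` retries.
function retrySchedule(maxRetries: number, retryBackoffMs: number): number[] {
  const delays: number[] = [];
  for (let retry = 1; retry <= maxRetries; retry++) {
    delays.push(retryDelayMs(retry, retryBackoffMs));
  }
  return delays;
}
```

With `MAX_RETRIES=4` and `RETRY_BACKOFF_MS=1000`, `retrySchedule(4, 1000)` gives delays of 1000, 2000, 4000, and 8000 ms, so a job that keeps failing is abandoned roughly 15 seconds after its first failure.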

## Architecture

```mermaid
flowchart TD
    API[API Service] -->|Submit Job| Operator[Operator Service]
    Operator --> Queue[(Redis Queue)]
    Queue --> Worker1[Worker 1]
    Queue --> Worker2[Worker 2]
    Queue --> WorkerN[Worker N]
    Worker1 --> Packs[Pack Executors]
    Worker2 --> Packs
    WorkerN --> Packs
    Packs --> Results[Job Results]
    Results --> DB[(Database)]
```

## Deployment

The Operator service is deployed using:

- **Platform:** Railway
- **Scaling:** Horizontal scaling via worker processes
- **Environment Variables:** See `.env.example` in repository
- **Health Checks:** `/health`, `/ready`, `/queue-status`, `/version`

For deployment procedures, see:
- [Operator Runtime Guide](../ops/OPERATOR_RUNTIME.md)
- [Deploy Operator Runbook](runbooks/deploy-operator.md) _(planned)_

## Health Checks

Standard endpoints:

| Endpoint | Purpose | Expected Response |
|----------|---------|-------------------|
| `GET /health` | Basic health check | `200 OK` |
| `GET /ready` | Readiness check (Redis connected) | `200 OK` when ready |
| `GET /queue-status` | Queue metrics | `200 OK` with queue stats |
| `GET /version` | Service version info | `200 OK` with version |
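
The table above can be read as a small routing function. The sketch below is framework-free and purely illustrative: the `OperatorState` fields, the response bodies, and the `503` returned while Redis is disconnected are assumptions, since only the `200 OK` cases are documented.

```typescript
// Hypothetical snapshot of what the probes report on.
interface OperatorState {
  redisConnected: boolean;
  version: string;
  queue: { waiting: number; active: number; failed: number };
}

interface ProbeResponse {
  status: number;
  body: Record<string, unknown>;
}

// Maps a probe path to a status code and JSON body.
function handleProbe(path: string, state: OperatorState): ProbeResponse {
  switch (path) {
    case "/health":
      // Liveness only: the process is up and able to answer.
      return { status: 200, body: { status: "ok" } };
    case "/ready":
      // Assumption: respond 503 until the Redis connection is established.
      return state.redisConnected
        ? { status: 200, body: { ready: true } }
        : { status: 503, body: { ready: false, reason: "redis disconnected" } };
    case "/queue-status":
      return { status: 200, body: { ...state.queue } };
    case "/version":
      return { status: 200, body: { version: state.version } };
    default:
      return { status: 404, body: { error: "not found" } };
  }
}
```

Keeping `/health` independent of Redis lets a platform distinguish "restart this process" from "stop routing traffic until dependencies recover".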

## Job Types

The Operator handles the following job types:

### Agent Execution Jobs
Execute agent logic with specific contexts and memory.

### Workflow Jobs
Multi-step workflows with conditional logic and branching.

### Scheduled Jobs
Cron-style recurring tasks.

### Event-Triggered Jobs
Jobs triggered by system events or webhooks.
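
One way to model the four job types above is a discriminated union, so that a worker can dispatch on a single `type` field. Every field name below is hypothetical; the actual payload shapes are defined in the Operator repository.

```typescript
// Hypothetical payload shapes, one per job type described above.
type OperatorJob =
  | { type: "agent-execution"; agentId: string; context: Record<string, unknown> }
  | { type: "workflow"; steps: string[] }
  | { type: "scheduled"; cron: string; task: string }
  | { type: "event-triggered"; event: string; payload: unknown };

// Dispatches on `type`; the compiler narrows `job` inside each case.
function describeJob(job: OperatorJob): string {
  switch (job.type) {
    case "agent-execution":
      return `run agent ${job.agentId}`;
    case "workflow":
      return `run workflow with ${job.steps.length} step(s)`;
    case "scheduled":
      return `run ${job.task} on schedule "${job.cron}"`;
    case "event-triggered":
      return `handle event ${job.event}`;
    default: {
      // Exhaustiveness check: adding a new job type without a case fails to compile.
      const unreachable: never = job;
      throw new Error("unknown job: " + JSON.stringify(unreachable));
    }
  }
}
```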

## Job Lifecycle

```
Pending → Active → Completed
   │        │          ↓
   │        └──→ Failed → Retrying
   │                  ↓
   └───────────→ Cancelled
```

1. **Pending:** Job is queued, waiting for a worker
2. **Active:** Worker is processing the job
3. **Completed:** Job finished successfully
4. **Failed:** Job encountered an error
5. **Retrying:** Failed job is being retried
6. **Cancelled:** Job was manually cancelled
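
The lifecycle can also be expressed as a transition table. This is a sketch of one reading of the diagram: the `retrying → active` edge is an assumption (a retried job has to be picked up again somehow), and cancellation is modelled only from `pending`. The names `TRANSITIONS`, `canTransition`, and `transition` are illustrative.

```typescript
type JobState = "pending" | "active" | "completed" | "failed" | "retrying" | "cancelled";

// Allowed next states for each state.
const TRANSITIONS: Record<JobState, JobState[]> = {
  pending: ["active", "cancelled"],
  active: ["completed", "failed"],
  failed: ["retrying"],  // only while retries remain; otherwise `failed` is final
  retrying: ["active"],  // assumption: a retried job is picked up by a worker again
  completed: [],
  cancelled: [],
};

function canTransition(from: JobState, to: JobState): boolean {
  return TRANSITIONS[from].indexOf(to) !== -1;
}

// Returns the new state, or throws if the move is not in the table.
function transition(from: JobState, to: JobState): JobState {
  if (!canTransition(from, to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

Rejecting illegal moves at a single choke point makes bugs such as reactivating a completed job surface immediately instead of corrupting job metadata.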

## Related Services

- [Service: API](./service-api.md) - Submits jobs to Operator
- [Service: Core](./service-core.md) _(planned)_ - Core business logic
- [Service: Prism Console](./service-prism-console.md) _(planned)_ - Job monitoring UI
- **Packs:** Various pack services that execute specific job types

## Environment Configuration

Key environment variables:

- `REDIS_URL` - Redis connection for queue
- `DATABASE_URL` - PostgreSQL for job metadata
- `WORKER_CONCURRENCY` - Number of concurrent jobs per worker
- `JOB_TIMEOUT_MS` - Default job timeout
- `MAX_RETRIES` - Maximum retry attempts
- `RETRY_BACKOFF_MS` - Initial retry delay

> ⚠️ **Security:** Never commit actual values. Use Railway secrets or equivalent.
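
A minimal sketch of validating these variables at startup, so that a misconfigured worker fails fast instead of failing on its first job. The `OperatorConfig` shape, the helper names, and the minimum values are assumptions made for illustration; the Operator's real loader may differ.

```typescript
interface OperatorConfig {
  redisUrl: string;
  databaseUrl: string;
  workerConcurrency: number;
  jobTimeoutMs: number;
  maxRetries: number;
  retryBackoffMs: number;
}

type Env = Record<string, string | undefined>;

function requireString(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing required environment variable: ${name}`);
  }
  return value;
}

function requireInt(env: Env, name: string, min: number): number {
  const parsed = Number(requireString(env, name));
  // Rejects NaN, Infinity, fractions, and values below the minimum.
  if (!isFinite(parsed) || Math.floor(parsed) !== parsed || parsed < min) {
    throw new Error(`${name} must be an integer >= ${min}`);
  }
  return parsed;
}

// Reads and validates every variable listed above. Minimums are assumed.
function loadOperatorConfig(env: Env): OperatorConfig {
  return {
    redisUrl: requireString(env, "REDIS_URL"),
    databaseUrl: requireString(env, "DATABASE_URL"),
    workerConcurrency: requireInt(env, "WORKER_CONCURRENCY", 1),
    jobTimeoutMs: requireInt(env, "JOB_TIMEOUT_MS", 1),
    maxRetries: requireInt(env, "MAX_RETRIES", 0),
    retryBackoffMs: requireInt(env, "RETRY_BACKOFF_MS", 0),
  };
}
```

In a Node.js process this would typically be called once as `loadOperatorConfig(process.env)`. Taking the environment as a parameter keeps the function easy to unit test.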

## Development

Local development setup:

```bash
# Clone the repository
git clone https://github.com/BlackRoad-OS/blackroad-os-operator.git
cd blackroad-os-operator

# Install dependencies
npm install

# Set up environment
cp .env.example .env
# Edit .env with local values

# Start Redis (via Docker)
docker run -d -p 6379:6379 redis:latest

# Run in development mode
npm run dev
```

See [Local Development Guide](dev/local-development.md) for more details.

## Monitoring

- **Queue Dashboard:** BullBoard UI (if enabled)
- **Metrics:** Job completion rates, failure rates, latency
- **Logs:** Structured logging with context
- **Alerts:** Configure alerts for queue depth and failed jobs
- **Prism Console:** [Real-time monitoring](../ops/PRISM_CONSOLE.md)

## Performance Tuning

### Worker Concurrency
Adjust `WORKER_CONCURRENCY` based on:
- Available CPU/memory
- Job complexity
- External API rate limits
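
As a rough starting point, the factors above can be turned into an upper bound on concurrency. The sketch below assumes I/O-bound jobs (so CPU is ignored), a known memory footprint per job, and a single external API with a per-second rate limit. All input names are hypothetical, and the result is an estimate to refine with real measurements.

```typescript
interface ConcurrencyInputs {
  availableMemoryMb: number;      // memory the worker process may use
  memoryPerJobMb: number;         // peak memory of one running job
  apiCallsPerSecondLimit: number; // external API rate limit
  apiCallsPerJob: number;         // API calls one job makes
  averageJobSeconds: number;      // average job duration
}

function suggestWorkerConcurrency(i: ConcurrencyInputs): number {
  // Memory bound: how many jobs fit in memory at once.
  const memoryBound = Math.floor(i.availableMemoryMb / i.memoryPerJobMb);
  // Rate bound: n concurrent jobs complete n / averageJobSeconds jobs per second,
  // each making apiCallsPerJob calls, and that call rate must stay under the limit.
  const rateBound = Math.floor(
    (i.apiCallsPerSecondLimit * i.averageJobSeconds) / i.apiCallsPerJob
  );
  // The tighter bound wins; never suggest fewer than one job at a time.
  return Math.max(1, Math.min(memoryBound, rateBound));
}
```

For example, with 2048 MB available, 256 MB per job, a limit of 10 calls per second, 5 calls per job, and 2-second jobs, memory allows 8 concurrent jobs but the rate limit allows only 4, so the suggestion is 4.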

### Queue Priority
Set job priorities to ensure critical jobs execute first:
- **High:** User-facing operations
- **Normal:** Background tasks
- **Low:** Maintenance, cleanup jobs
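
A sketch of how the three levels could map to numeric priorities. BullMQ treats lower numbers as higher priority, so the mapping follows that convention; the specific values (1, 5, 10), the `QueuedJob` shape, and the first-in-first-out tie-break are assumptions chosen for illustration.

```typescript
type PriorityLevel = "high" | "normal" | "low";

// Hypothetical numeric values: lower number = runs sooner.
const PRIORITY_VALUE: Record<PriorityLevel, number> = {
  high: 1,    // user-facing operations
  normal: 5,  // background tasks
  low: 10,    // maintenance, cleanup jobs
};

interface QueuedJob {
  id: string;
  level: PriorityLevel;
  enqueuedAt: number; // timestamp in ms
}

// Returns jobs in the order a priority queue would hand them to workers:
// by priority first, then oldest first within the same level.
function executionOrder(jobs: QueuedJob[]): QueuedJob[] {
  return jobs.slice().sort(
    (a, b) =>
      PRIORITY_VALUE[a.level] - PRIORITY_VALUE[b.level] ||
      a.enqueuedAt - b.enqueuedAt
  );
}
```

Leaving gaps between the numeric values makes it possible to add intermediate levels later without renumbering existing jobs.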

### Memory Management
Monitor worker memory usage:
- Restart workers periodically if memory leaks are detected
- Use separate queues for memory-intensive jobs

## Troubleshooting

Common issues:

### Jobs stuck in pending
- Check Redis connectivity
- Verify workers are running
- Review worker logs for errors

### High failure rates
- Check job timeout settings
- Review error logs for patterns
- Verify external service availability

### Queue growing indefinitely
- Increase worker count
- Reduce job creation rate
- Identify and fix failing jobs
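
A queue grows whenever jobs arrive faster than workers finish them. The following back-of-the-envelope sketch estimates how many workers a given load needs, assuming a steady arrival rate and a stable average job duration. The function names and parameters are illustrative.

```typescript
// Workers needed so that the service rate matches the arrival rate.
function requiredWorkers(
  jobsPerSecond: number,
  averageJobSeconds: number,
  workerConcurrency: number
): number {
  // Jobs that must be in flight at once to keep up with arrivals.
  const concurrentJobsNeeded = jobsPerSecond * averageJobSeconds;
  return Math.max(1, Math.ceil(concurrentJobsNeeded / workerConcurrency));
}

// Positive result: backlog grows by this many jobs each second.
// Zero or negative result: workers keep up and an existing backlog drains.
function backlogGrowthPerSecond(
  jobsPerSecond: number,
  averageJobSeconds: number,
  workerConcurrency: number,
  workerCount: number
): number {
  const serviceRate = (workerCount * workerConcurrency) / averageJobSeconds;
  return jobsPerSecond - serviceRate;
}
```

For example, at 50 jobs per second with 2-second jobs and `WORKER_CONCURRENCY=10`, about 100 jobs must run concurrently, which takes 10 workers. With only 8 workers the service rate is 40 jobs per second, so the backlog grows by 10 jobs every second. Failing jobs that are retried add to the effective arrival rate, which is why fixing them also shrinks the queue.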

For debugging procedures, see [Debug Operator Runbook](runbooks/debug-operator.md) _(planned)_.

## Contributing

To contribute to the Operator service:

1. Review [Contributing Guide](../guides/contributing.md)
2. Follow [Coding Standards](../guides/coding-standards.md) _(planned)_
3. Understand job lifecycle and queue patterns
4. Submit PRs with tests

## See Also

- [Operator Runtime](../ops/OPERATOR_RUNTIME.md) - Operational guide
- [Core Primitives](dev/CORE_PRIMITIVES.md) - Job data structures
- [Events and RoadChain](dev/EVENTS_AND_ROADCHAIN.md) - Event-driven architecture
- [Agents Atlas](dev/AGENTS_ATLAS_AND_FRIENDS.md) - Agent ecosystem