---
id: services-service-operator
title: "Service: Operator"
slug: /services/service-operator
description: "Documentation for the BlackRoad OS Operator service"
tags: ["services", "operator", "jobs"]
status: stable
---

# Service: Operator

## What it does

The **BlackRoad OS Operator** is the job orchestration and execution engine. It manages:

- Asynchronous job processing
- Agent task execution
- Workflow orchestration
- Queue management
- Retry logic and error handling

The Operator is the backbone of BlackRoad OS automation, turning high-level requests from the API into executed work.

## Repository

- **GitHub:** [BlackRoad-OS/blackroad-os-operator](https://github.com/BlackRoad-OS/blackroad-os-operator)
- **Primary Language:** TypeScript (Node.js)
- **Queue System:** BullMQ / Redis

## Key Features

- 📋 Job queue management with BullMQ
- 🔄 Automatic retry with exponential backoff
- 🎯 Priority-based job scheduling
- 🔐 Secure job execution contexts
- 📊 Real-time job status updates
- 🧠 Agent memory and state management

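To make the "automatic retry with exponential backoff" feature concrete, here is a minimal sketch of how a retry delay could be derived from an initial delay such as `RETRY_BACKOFF_MS`. The helper name `retryDelayMs` and the `maxDelayMs` cap are assumptions for illustration; the Operator's actual schedule may differ.

```typescript
// Hypothetical helper, not taken from the Operator codebase.
// Returns the wait before retry number `attempt` (1-based): the base delay
// doubles on every attempt and is capped at `maxDelayMs` (an assumed cap).
function retryDelayMs(attempt: number, baseDelayMs: number, maxDelayMs = 60000): number {
  if (!Number.isInteger(attempt) || attempt < 1) {
    throw new RangeError("attempt must be a positive integer");
  }
  const delay = baseDelayMs * 2 ** (attempt - 1);
  return Math.min(delay, maxDelayMs);
}

// Delays for the first four retries with a 1-second base delay.
const exampleDelays = [1, 2, 3, 4].map((attempt) => retryDelayMs(attempt, 1000));
console.log(exampleDelays); // [ 1000, 2000, 4000, 8000 ]
```

Capping the delay keeps a long retry chain from pushing a job arbitrarily far into the future.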
## Architecture

```mermaid
flowchart TD
    API[API Service] -->|Submit Job| Operator[Operator Service]
    Operator --> Queue[(Redis Queue)]
    Queue --> Worker1[Worker 1]
    Queue --> Worker2[Worker 2]
    Queue --> WorkerN[Worker N]
    Worker1 --> Packs[Pack Executors]
    Worker2 --> Packs
    WorkerN --> Packs
    Packs --> Results[Job Results]
    Results --> DB[(Database)]
```

## Deployment

The Operator service is deployed using:

- **Platform:** Railway
- **Scaling:** Horizontal scaling via worker processes
- **Environment Variables:** See `.env.example` in repository
- **Health Checks:** `/health`, `/ready`, `/queue-status`

For deployment procedures, see:
- [Operator Runtime Guide](../ops/OPERATOR_RUNTIME.md)
- [Deploy Operator Runbook](runbooks/deploy-operator.md) _(planned)_

## Health Checks

Standard endpoints:

| Endpoint | Purpose | Expected Response |
|----------|---------|-------------------|
| `GET /health` | Basic health check | `200 OK` |
| `GET /ready` | Readiness check (Redis connected) | `200 OK` when ready |
| `GET /queue-status` | Queue metrics | `200 OK` with queue stats |
| `GET /version` | Service version info | `200 OK` with version |

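The endpoint table above can be read as a small routing function. The sketch below is framework-free and purely illustrative: the `ProbeState` shape, the response bodies, and the `503` returned when Redis is not connected are assumptions, since this page only documents the `200 OK` cases.

```typescript
// Hypothetical shapes; the real response bodies are not documented on this page.
interface ProbeState {
  redisConnected: boolean;
  version: string;
  queue: { waiting: number; active: number; failed: number };
}

interface ProbeResponse {
  status: number;
  body: Record<string, unknown>;
}

// Maps a request path to a status code and body for the four documented endpoints.
function handleProbe(path: string, state: ProbeState): ProbeResponse {
  switch (path) {
    case "/health":
      return { status: 200, body: { status: "ok" } };
    case "/ready":
      // Assumption: report 503 until the Redis connection is up.
      return state.redisConnected
        ? { status: 200, body: { ready: true } }
        : { status: 503, body: { ready: false, reason: "redis disconnected" } };
    case "/queue-status":
      return { status: 200, body: { ...state.queue } };
    case "/version":
      return { status: 200, body: { version: state.version } };
    default:
      return { status: 404, body: { error: "not found" } };
  }
}
```

Keeping `/health` independent of Redis while `/ready` depends on it lets a platform distinguish "process is alive" from "process can accept jobs".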
## Job Types

The Operator handles various job types:

### Agent Execution Jobs
Execute agent logic with specific contexts and memory.

### Workflow Jobs
Multi-step workflows with conditional logic and branching.

### Scheduled Jobs
Cron-style recurring tasks.

### Event-Triggered Jobs
Jobs triggered by system events or webhooks.

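One way to model the four job types above in TypeScript is a discriminated union, so that each handler only sees the fields relevant to its type. Every name in this sketch (`OperatorJob`, `kind`, `agentId`, `steps`, `cron`, and so on) is hypothetical; the Operator's real payloads are defined in its repository.

```typescript
// All field names below are hypothetical; the Operator's real payloads may differ.
type OperatorJob =
  | { kind: "agent-execution"; agentId: string; context: Record<string, unknown> }
  | { kind: "workflow"; steps: string[] }
  | { kind: "scheduled"; cron: string; task: string }
  | { kind: "event-triggered"; event: string; payload: unknown };

// Dispatches on the `kind` tag; TypeScript narrows `job` inside each case.
function describeJob(job: OperatorJob): string {
  switch (job.kind) {
    case "agent-execution":
      return `run agent ${job.agentId}`;
    case "workflow":
      return `run workflow with ${job.steps.length} step(s)`;
    case "scheduled":
      return `run ${job.task} on schedule ${job.cron}`;
    case "event-triggered":
      return `handle event ${job.event}`;
    default: {
      // Exhaustiveness check: this assignment stops compiling if a new kind is added
      // without a matching case.
      const unreachable: never = job;
      throw new Error(`unknown job: ${JSON.stringify(unreachable)}`);
    }
  }
}

console.log(describeJob({ kind: "scheduled", cron: "0 * * * *", task: "cleanup" }));
```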
## Job Lifecycle

```
Pending → Active → Completed
   │         │         ↓
   │         └──→ Failed → Retrying
   │                          ↓
   └───────────→ Cancelled
```

1. **Pending:** Job is queued, waiting for worker
2. **Active:** Worker is processing the job
3. **Completed:** Job finished successfully
4. **Failed:** Job encountered an error
5. **Retrying:** Failed job is being retried
6. **Cancelled:** Job was manually cancelled

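The lifecycle above can be sketched as a small transition table. The states come from the list; the exact set of allowed edges is an assumption, in particular that a retried job returns to `active` and that only `pending` jobs are cancelled here.

```typescript
type JobState = "pending" | "active" | "completed" | "failed" | "retrying" | "cancelled";

// Allowed next states for each state. The edge set is illustrative, not authoritative.
const transitions: Record<JobState, readonly JobState[]> = {
  pending: ["active", "cancelled"],
  active: ["completed", "failed"],
  failed: ["retrying"],
  retrying: ["active"], // assumption: a retried job is handed back to a worker
  completed: [],
  cancelled: [],
};

function canTransition(from: JobState, to: JobState): boolean {
  return transitions[from].includes(to);
}

// Returns the new state, or throws if the move is not in the table.
function advance(from: JobState, to: JobState): JobState {
  if (!canTransition(from, to)) {
    throw new Error(`illegal transition: ${from} -> ${to}`);
  }
  return to;
}
```

For example, `advance("pending", "active")` returns `"active"`, while `advance("completed", "active")` throws because `completed` has no outgoing edges in this table.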
## Related Services

- [Service: API](./service-api.md) - Submits jobs to Operator
- [Service: Core](./service-core.md) _(planned)_ - Core business logic
- [Service: Prism Console](./service-prism-console.md) _(planned)_ - Job monitoring UI
- **Packs:** Various pack services that execute specific job types

## Environment Configuration

Key environment variables:

- `REDIS_URL` - Redis connection for queue
- `DATABASE_URL` - PostgreSQL for job metadata
- `WORKER_CONCURRENCY` - Number of concurrent jobs per worker
- `JOB_TIMEOUT_MS` - Default job timeout
- `MAX_RETRIES` - Maximum retry attempts
- `RETRY_BACKOFF_MS` - Initial retry delay

> ⚠️ **Security:** Never commit actual values. Use Railway secrets or equivalent.

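As an illustration of how the variables listed above might be validated at startup, the sketch below parses them into a typed object and fails fast on missing or malformed values. The `OperatorConfig` shape and the helper names are assumptions, not the Operator's actual loader; the environment is passed in as a plain object so the function is easy to test.

```typescript
// Hypothetical config shape built from the variables documented above.
interface OperatorConfig {
  redisUrl: string;
  databaseUrl: string;
  workerConcurrency: number;
  jobTimeoutMs: number;
  maxRetries: number;
  retryBackoffMs: number;
}

type Env = Record<string, string | undefined>;

function requireString(env: Env, name: string): string {
  const value = env[name];
  if (value === undefined || value === "") {
    throw new Error(`missing required environment variable ${name}`);
  }
  return value;
}

function requireInt(env: Env, name: string): number {
  const raw = requireString(env, name);
  const parsed = Number(raw);
  if (!Number.isInteger(parsed) || parsed < 0) {
    throw new Error(`${name} must be a non-negative integer, got "${raw}"`);
  }
  return parsed;
}

// Reads every variable once and throws on the first invalid one.
function loadConfig(env: Env): OperatorConfig {
  return {
    redisUrl: requireString(env, "REDIS_URL"),
    databaseUrl: requireString(env, "DATABASE_URL"),
    workerConcurrency: requireInt(env, "WORKER_CONCURRENCY"),
    jobTimeoutMs: requireInt(env, "JOB_TIMEOUT_MS"),
    maxRetries: requireInt(env, "MAX_RETRIES"),
    retryBackoffMs: requireInt(env, "RETRY_BACKOFF_MS"),
  };
}
```

In a Node.js process this would typically be called once as `loadConfig(process.env)`, so that a misconfigured deployment stops at boot instead of failing on the first job.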
## Development

Local development setup:

```bash
# Clone the repository
git clone https://github.com/BlackRoad-OS/blackroad-os-operator.git
cd blackroad-os-operator

# Install dependencies
npm install

# Set up environment
cp .env.example .env
# Edit .env with local values

# Start Redis (via Docker)
docker run -d -p 6379:6379 redis:latest

# Run in development mode
npm run dev
```

See [Local Development Guide](dev/local-development.md) for more details.

## Monitoring

- **Queue Dashboard:** BullBoard UI (if enabled)
- **Metrics:** Job completion rates, failure rates, latency
- **Logs:** Structured logging with context
- **Alerts:** Configure alerts for queue depth, failed jobs
- **Prism Console:** [Real-time monitoring](../ops/PRISM_CONSOLE.md)

## Performance Tuning

### Worker Concurrency
Adjust `WORKER_CONCURRENCY` based on:
- Available CPU/memory
- Job complexity
- External API rate limits

### Queue Priority
Set job priorities to ensure critical jobs execute first:
- **High:** User-facing operations
- **Normal:** Background tasks
- **Low:** Maintenance, cleanup jobs

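The three priority levels above could be mapped to numeric priorities and used to order waiting jobs, as in the sketch below. The numeric values and the `QueuedJob` shape are assumptions; the "smaller number runs first" convention matches BullMQ's `priority` job option, but this is a standalone illustration rather than the Operator's scheduler.

```typescript
type PriorityLevel = "high" | "normal" | "low";

// Hypothetical numeric mapping: a smaller number means the job runs earlier.
const priorityValue: Record<PriorityLevel, number> = { high: 1, normal: 5, low: 10 };

interface QueuedJob {
  id: string;
  level: PriorityLevel;
  enqueuedAt: number; // arrival order, used to break ties within a level
}

// Returns a new array ordered by priority, then by arrival time.
function orderByPriority(jobs: readonly QueuedJob[]): QueuedJob[] {
  return jobs.slice().sort(
    (a, b) => priorityValue[a.level] - priorityValue[b.level] || a.enqueuedAt - b.enqueuedAt
  );
}

const ordered = orderByPriority([
  { id: "cleanup", level: "low", enqueuedAt: 1 },
  { id: "signup-email", level: "high", enqueuedAt: 3 },
  { id: "report", level: "normal", enqueuedAt: 2 },
  { id: "login", level: "high", enqueuedAt: 2 },
]);
console.log(ordered.map((job) => job.id)); // [ 'login', 'signup-email', 'report', 'cleanup' ]
```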
### Memory Management
Monitor worker memory usage:
- Restart workers periodically if memory leaks are detected
- Use separate queues for memory-intensive jobs

## Troubleshooting

Common issues:

### Jobs stuck in pending
- Check Redis connectivity
- Verify workers are running
- Review worker logs for errors

### High failure rates
- Check job timeout settings
- Review error logs for patterns
- Verify external service availability

### Queue growing indefinitely
- Increase worker count
- Reduce job creation rate
- Identify and fix failing jobs

For debugging procedures, see [Debug Operator Runbook](runbooks/debug-operator.md) _(planned)_.

## Contributing

To contribute to the Operator service:

1. Review [Contributing Guide](../guides/contributing.md)
2. Follow [Coding Standards](../guides/coding-standards.md) _(planned)_
3. Understand job lifecycle and queue patterns
4. Submit PRs with tests

## See Also

- [Operator Runtime](../ops/OPERATOR_RUNTIME.md) - Operational guide
- [Core Primitives](dev/CORE_PRIMITIVES.md) - Job data structures
- [Events and RoadChain](dev/EVENTS_AND_ROADCHAIN.md) - Event-driven architecture
- [Agents Atlas](dev/AGENTS_ATLAS_AND_FRIENDS.md) - Agent ecosystem