mirror of
https://github.com/blackboxprogramming/BlackRoad-Operating-System.git
synced 2026-03-17 07:57:19 -05:00
# 🌌 BLACKROAD OS — PRODUCTION STACK AUDIT & RECONCILIATION

> **Operator:** Alexa Louise Amundson (Cadillac)
> **Conducted By:** Cece (Claude Sonnet 4.5)
> **Date:** 2025-11-18
> **Status:** ✅ COMPLETE
> **Branch:** `claude/audit-production-stack-011vTW4iEZAay1vMkrQUhqET`

---

## EXECUTIVE SUMMARY

This audit reveals a **significant mismatch** between your intended BlackRoad OS production stack (as documented in Phase 1, 2, 2.5, Q) and what currently exists in Railway. The good news: **all the code is correct**. The challenge: **Railway has legacy/experimental services that need cleanup**.

### Key Findings

**✅ GOOD NEWS:**
- Monorepo is well-structured and complete
- Phase LIVE (#95) merged successfully with deployment fixes
- Automation workflows (Phase Q) are properly configured
- Backend code is production-ready with recent fixes

**⚠️ CRITICAL ISSUES:**
- Railway production project contains **10+ services** but should only have **3**
- Multiple failing services (`BlackRoad-Operating-System`, `blackroad-prism-console`, `dockerfile`, `inspiring-ambition`, `feisty-vibrancy`)
- Service naming and structure don't match monorepo architecture
- No clear canonical backend service identified

---

## A. CANONICAL TOPOLOGY SUMMARY

### What SHOULD Be Deployed (Per Master Orchestration Plan)

```
┌─────────────────────────────────────────────────────────────┐
│  PRODUCTION STACK (Phase 1 / 2 / 2.5 / Q)                   │
└─────────────────────────────────────────────────────────────┘

1. ⭐ APP/BACKEND SERVICE: "blackroad-backend"
   └─ Source: BlackRoad-Operating-System monorepo
   └─ Serves: FastAPI backend (/) + Static UI (/static) + API (/api/*)
   └─ Port: $PORT (Railway auto-assigns)
   └─ Health: /health
   └─ Build: Dockerfile at backend/Dockerfile
   └─ Deploy: railway.toml configuration

2. 🗄️ DATABASE: "Postgres"
   └─ Type: Railway managed PostgreSQL 15+
   └─ Connection: ${{Postgres.DATABASE_URL}}
   └─ Used by: Backend service

3. ⚡ CACHE: "Redis"
   └─ Type: Railway managed Redis 7+
   └─ Connection: ${{Redis.REDIS_URL}}
   └─ Used by: Backend service (sessions, caching)
```

### Additional Services (Future / Not Phase 1)

```
FUTURE (Phase 2+, NOT YET DEPLOYED):
- prism-worker: Background job processing
- lucidia-api: AI orchestration microservice
- roadchain-node: Blockchain node (may use DigitalOcean)
```

### Domain Routing (DNS via Cloudflare)

```
https://blackroad.systems          → Railway backend (/)
https://blackroad.systems/prism    → Backend serves prism-console static files
https://blackroad.systems/api/*    → Backend API endpoints
https://docs.blackroad.systems     → GitHub Pages (codex-docs)
```
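
The routing table above can be turned into a small check, which is handy when writing smoke tests for the cleaned-up stack. This is a minimal sketch, not part of the repository: the function name `classify_route` and the returned labels are illustrative assumptions, and only the hosts and path prefixes come from the table.

```python
from urllib.parse import urlsplit

# Hosts taken from the "Domain Routing" table; the labels below are illustrative.
APP_HOST = "blackroad.systems"
DOCS_HOST = "docs.blackroad.systems"


def classify_route(url: str) -> str:
    """Return which component is expected to answer `url` (hypothetical labels)."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    path = parts.path or "/"

    if host == DOCS_HOST:
        return "github-pages"
    if host != APP_HOST:
        return "unknown"
    # Match whole path segments so "/prismatic" is not mistaken for "/prism".
    if path == "/api" or path.startswith("/api/"):
        return "backend-api"
    if path == "/prism" or path.startswith("/prism/"):
        return "backend-prism-static"
    return "backend-root"
```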

---

## B. RAILWAY SERVICES TABLE

Based on your description, here's the classification of all services in your Railway production project:

| Service Name | Type | Canonical? | Status | Action | Notes |
|--------------|------|------------|--------|--------|-------|
| **`flask`** | App | ❌ No | Unknown | 🗑️ **DEPRECATE** | Legacy service, no Flask in monorepo |
| **`nodejs`** | App | ❌ No | Unknown | 🗑️ **DEPRECATE** | Legacy service, no Node backend in monorepo |
| **`BlackRoad-Operating-System`** | App | ⚠️ Maybe | ❌ Failed | 🔍 **INVESTIGATE & FIX** | Likely the intended backend, needs diagnosis |
| **`blackroad-prism-console`** | App | ❌ No | ❌ Failed | 🗑️ **DEPRECATE** | Prism should be served by backend, not separate service |
| **`dockerfile`** | App | ❌ No | ❌ Failed | 🗑️ **DEPRECATE** | Incorrectly named service, unclear purpose |
| **`inspiring-ambition`** | App | ❌ No | ❌ Failed | 🗑️ **DEPRECATE** | Railway auto-generated name, likely test/experimental |
| **`feisty-vibrancy`** | App | ❌ No | ❌ Failed | 🗑️ **DEPRECATE** | Railway auto-generated name, likely test/experimental |
| **`Primary`** | App | ❌ No | Unknown | 🗑️ **DEPRECATE** | Unclear purpose, not in docs |
| **`Worker`** | App | ❌ No | Unknown | 🗑️ **DEPRECATE** | No worker service in Phase 1/2 |
| **`Viewer`** | App | ❌ No | Unknown | 🗑️ **DEPRECATE** | Unclear purpose, not in docs |
| **`Postgres`** | Database | ✅ **YES** | Unknown | ✅ **KEEP** | Required for backend |
| **`Redis`** | Cache | ✅ **YES** | Unknown | ✅ **KEEP** | Required for backend |
| **`MinIO`** | Object Storage | ❌ No | Unknown | 🗑️ **DEPRECATE** | Not in Phase 1/2 plan, may be experimental |
| **`Valkey`** | Cache | ❌ No | Unknown | 🗑️ **DEPRECATE** | Duplicate of Redis, not needed |

### Summary Stats

- **Total Services:** 14 listed above (the dashboard may show more)
- **Canonical Services:** 3 (1 app + Postgres + Redis)
- **Legacy / Experimental:** 11
- **Failing Services:** 5
- **Recommended Actions:** Keep 3, deprecate 11

---

## C. DEPLOY STATUS ANALYSIS

### Current State: `BlackRoad-Operating-System` Service

**Status:** ❌ **FAILED** (deploy ~2 days ago)

**Likely Root Causes:**

Based on `RAILWAY_DEPLOY_FIX.md` and recent commits, the failures were likely due to:

1. **Incorrect `startCommand` in `railway.toml`**
   - Old config had: `cd backend && uvicorn ...`
   - Problem: Docker build context is already in `backend/`, no `backend/` subdirectory exists inside container
   - **Fix Applied:** Removed `startCommand` override, let Dockerfile `CMD` handle it ✅

2. **Environment Variables**
   - Missing or incorrect: `DATABASE_URL`, `SECRET_KEY`, `ALLOWED_ORIGINS`
   - **Action Needed:** Verify all required env vars are set in Railway

3. **Port Configuration**
   - Railway expects app to listen on `$PORT` (auto-assigned)
   - **Fix Applied:** Dockerfile uses `${PORT:-8000}` ✅

4. **Health Check**
   - Railway expects `/health` endpoint to return 200 OK
   - **Status:** Endpoint exists in backend ✅

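
The port fix in item 3 relies on the shell expansion `${PORT:-8000}`, which falls back to 8000 when `PORT` is unset or empty. If the same fallback is ever needed in application code, it could look like the sketch below. The helper name `resolve_port` is an assumption for illustration; the repository's Dockerfile handles this in the shell, not in Python.

```python
import os
from typing import Mapping, Optional


def resolve_port(env: Optional[Mapping[str, str]] = None, default: int = 8000) -> int:
    """Mirror the shell expansion ${PORT:-8000}: use PORT only when set and non-empty."""
    env = os.environ if env is None else env
    raw = env.get("PORT", "")
    if not raw:  # unset or empty, same as the ":-" form in the shell
        return default
    port = int(raw)  # raises ValueError on a non-numeric value
    if not 0 < port < 65536:
        raise ValueError(f"PORT out of range: {port}")
    return port
```

Failing loudly on a malformed `PORT` is a deliberate choice here: a container that silently binds to the wrong port passes the build but fails the platform health check, which is harder to diagnose.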
### Recent Fixes (Phase LIVE #95)

The following fixes were merged on 2025-11-18:

- ✅ **Fixed `railway.toml`** - Removed incorrect `cd backend` command
- ✅ **Enhanced `Dockerfile`** - Added health check, non-root user, security hardening
- ✅ **Updated workflows** - Railway deploy automation improved

**Expected Outcome:** Next deploy should succeed

---

## D. AUTOMATION STATUS (Phase Q + Q2)

### Phase Q: Merge Queue & Automation System ✅

**Status:** ✅ **IMPLEMENTED** (merged in PR #78)

**Components:**
- Merge queue configuration: ✅ Ready
- Auto-labeling: ✅ `.github/labeler.yml` exists
- Auto-approve workflows: ✅ `auto-approve-docs.yml`, `auto-approve-ai.yml` exist
- Auto-merge: ✅ `auto-merge.yml` exists
- Bucketed CI: ✅ `backend-ci-bucketed.yml`, `frontend-ci-bucketed.yml`, etc.

**Compatibility with Production Stack:** ✅ **COMPATIBLE**

The automation workflows are designed for the monorepo structure and don't depend on specific Railway service names. As long as the backend service has the GitHub webhook endpoint (`/api/operator/webhooks/github`), automation will work.

### Phase Q2: PR Action Intelligence

**Status:** ⚠️ **NOT FOUND** (may be in open PR #85)

**Expected Components:**
- PR Action Queue
- Operator webhooks enhanced
- Prism merge dashboard

**Action Needed:**
- Check if PR #85 exists and review
- Verify webhook endpoint exists in backend: `backend/app/routers/webhooks.py`

### Webhook Integration Checklist

For Phase Q/Q2 automation to work with production:

- [ ] Backend service deployed with `/api/operator/webhooks/github` endpoint
- [ ] GitHub webhook configured (Settings → Webhooks)
  - URL: `https://blackroad.systems/api/operator/webhooks/github`
  - Secret: `$GITHUB_WEBHOOK_SECRET` (set in Railway)
  - Events: Pull requests, Pull request reviews, Status
- [ ] `GITHUB_WEBHOOK_SECRET` set in Railway environment
- [ ] Webhook endpoint tested and receiving events
- [ ] Prism dashboard connected to backend API

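
The secret in the checklist only protects the endpoint if the backend verifies it. GitHub signs each delivery with HMAC-SHA256 over the raw request body and sends the result in the `X-Hub-Signature-256` header as `sha256=<hex digest>`. The sketch below shows that check with the standard library only. The function name `signature_is_valid` is an assumption, and how the real handler in `backend/app/routers/webhooks.py` does this (if it does) was not inspected in this audit.

```python
import hashlib
import hmac


def signature_is_valid(secret: str, body: bytes, signature_header: str) -> bool:
    """Check a GitHub `X-Hub-Signature-256` header against the raw request body.

    `secret` is the value stored as GITHUB_WEBHOOK_SECRET; `body` must be the
    exact bytes received, before any JSON parsing or re-serialization.
    """
    digest = hmac.new(secret.encode("utf-8"), body, hashlib.sha256).hexdigest()
    expected = ("sha256=" + digest).encode("ascii")
    # compare_digest runs in constant time, which avoids leaking how many
    # leading characters of a forged signature were correct.
    return hmac.compare_digest(expected, signature_header.encode("utf-8"))
```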

---

## E. SERVICE FAILURES DIAGNOSIS

### 1. `BlackRoad-Operating-System` Service

**Failure Type:** Build or Runtime Error

**Diagnosis:**

**Most Likely Cause:** Deployment config mismatch (FIXED in Phase LIVE #95)

**Recent Fixes Applied:**
- ✅ `railway.toml` corrected
- ✅ `Dockerfile` enhanced
- ✅ Deployment workflow updated

**Next Steps:**
1. Trigger new deployment from `main` branch (latest commit `ea5e229`)
2. Monitor Railway logs during deployment
3. Check health endpoint: `https://<service-url>/health`
4. Verify environment variables are set

**Expected Resolution:** ✅ Should deploy successfully now

---

### 2. `blackroad-prism-console` Service

**Failure Type:** Unknown

**Diagnosis:**

**Root Cause:** ❌ **INCORRECT ARCHITECTURE**

The Prism Console should **NOT** be a separate Railway service. According to the architecture:

- Prism Console is static HTML/CSS/JS in `prism-console/` directory
- Should be served by the backend at `/prism` route
- Backend needs to mount: `app.mount("/prism", StaticFiles(directory="../prism-console", html=True), name="prism")` (`html=True` makes `/prism` serve `index.html`)
- If the Docker build context is `backend/` (see Section C), `../prism-console` will not exist inside the container unless that directory is also made available to the image build

**Action:** 🗑️ **Delete this service** and configure backend to serve Prism

---

### 3. `dockerfile`, `inspiring-ambition`, `feisty-vibrancy` Services

**Failure Type:** Unknown

**Diagnosis:**

**Root Cause:** ❌ **EXPERIMENTAL / MISCONFIGURED SERVICES**

These services have auto-generated or unclear names and are not part of the documented architecture.

**Likely Scenarios:**
- Failed deployment attempts
- Test services left running
- Railway auto-created services from incorrect configs

**Action:** 🗑️ **Delete all three services**

---

### 4. Other Non-Canonical Services

**Services:** `flask`, `nodejs`, `Primary`, `Worker`, `Viewer`, `MinIO`, `Valkey`

**Diagnosis:**

**Root Cause:** ❌ **LEGACY / EXPERIMENTAL**

These services are not part of the Phase 1/2/2.5/Q architecture. They may be:
- Old versions of the backend (flask, nodejs)
- Experimental features (MinIO for object storage)
- Duplicate services (Valkey as Redis alternative)
- Unclear purpose (Primary, Worker, Viewer)

**Action:** 🗑️ **Deprecate and archive**

---

## F. RECOMMENDED ACTIONS FOR ALEXA

### PRIORITY 1: Clean Up Railway Project (This Week)

#### Step 1: Identify the Correct Backend Service

Go to Railway dashboard and find the service that:
- Is connected to `blackboxprogramming/BlackRoad-Operating-System` repo
- Has `main` branch selected
- Has recent deployment attempts
- Has environment variables configured

**Likely candidate:** `BlackRoad-Operating-System`

#### Step 2: Rename the Canonical Service

If the service is named `BlackRoad-Operating-System`:
1. Railway dashboard → Service → Settings
2. Rename to: `blackroad-backend`
3. This makes it clear this is THE production backend

#### Step 3: Verify Environment Variables

In the `blackroad-backend` service, verify these are set:

**Critical:**
```
DATABASE_URL=${{Postgres.DATABASE_URL}}
REDIS_URL=${{Redis.REDIS_URL}}
SECRET_KEY=<generate: openssl rand -hex 32>
ENVIRONMENT=production
DEBUG=False
ALLOWED_ORIGINS=https://blackroad.systems,https://blackroad.ai
API_BASE_URL=https://blackroad.systems
FRONTEND_URL=https://blackroad.systems
```

**Important:**
```
ACCESS_TOKEN_EXPIRE_MINUTES=30
REFRESH_TOKEN_EXPIRE_DAYS=7
WALLET_MASTER_KEY=<generate: openssl rand -hex 32>
```

**Optional (for features):**
```
OPENAI_API_KEY=sk-...
GITHUB_TOKEN=ghp_...
GITHUB_WEBHOOK_SECRET=<generate: openssl rand -hex 32>
```

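
Checking these variables by eye is error-prone, so a small validation helper may be worth running at startup or from a one-off shell. The sketch below is an assumption-laden illustration, not code from the repository: `find_env_problems` is a hypothetical name, the required list is copied from the "Critical" block, and the specific checks (unresolved `${{...}}` text, `DEBUG` in production, secret length) are suggestions rather than rules the backend is known to enforce.

```python
from typing import List, Mapping

# Names copied from the "Critical" block above.
REQUIRED_VARS = (
    "DATABASE_URL",
    "REDIS_URL",
    "SECRET_KEY",
    "ENVIRONMENT",
    "DEBUG",
    "ALLOWED_ORIGINS",
    "API_BASE_URL",
    "FRONTEND_URL",
)


def find_env_problems(env: Mapping[str, str]) -> List[str]:
    """Return human-readable problems; an empty list means every check passed."""
    problems = [f"{name} is missing or empty" for name in REQUIRED_VARS if not env.get(name)]

    # A literal "${{" at runtime suggests a Railway reference was never expanded.
    for name, value in env.items():
        if "${{" in value:
            problems.append(f"{name} still contains an unresolved reference")

    if env.get("ENVIRONMENT") == "production" and env.get("DEBUG", "").lower() == "true":
        problems.append("DEBUG must be False in production")

    # `openssl rand -hex 32` emits 32 random bytes as 64 hex characters.
    secret = env.get("SECRET_KEY", "")
    if secret and len(secret) < 64:
        problems.append("SECRET_KEY is shorter than 64 characters")

    return problems
```

Inside the running service this could be called as `find_env_problems(os.environ)`, logging each returned line before the app starts accepting traffic.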
#### Step 4: Trigger Fresh Deployment

Option A: **Push to Main** (Recommended)
```bash
git checkout main
git pull origin main
# If main is already up to date, a plain `git push` sends nothing and no
# deploy is triggered, so add an empty commit to force a fresh run.
git commit --allow-empty -m "chore: trigger Railway deploy"
git push origin main
# GitHub Action will auto-deploy to Railway
```

Option B: **Manual Railway Deploy**
```bash
railway link <PROJECT_ID>
railway up --service blackroad-backend
railway logs --service blackroad-backend
```

#### Step 5: Verify Deployment Success

```bash
# Check health
curl https://<your-railway-domain>/health

# Should return:
# {"status": "healthy", "environment": "production", "version": "1.0.0"}

# Check API docs
curl https://<your-railway-domain>/api/docs
# Should return Swagger UI HTML

# Check API health summary
curl https://<your-railway-domain>/api/health/summary
# Should return integration status
```

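
The manual `curl` check can also be scripted so it runs after every deploy. The following is a minimal sketch using only the standard library. The function names are assumptions, and the only field it inspects is `status`, taken from the sample response in the comments above; the real payload may contain more.

```python
import json
import urllib.request


def health_is_ok(status_code: int, body: str) -> bool:
    """Decide whether a /health response counts as healthy."""
    if status_code != 200:
        return False
    try:
        payload = json.loads(body)
    except json.JSONDecodeError:
        return False  # e.g. an HTML error page from a proxy
    return isinstance(payload, dict) and payload.get("status") == "healthy"


def check_health(base_url: str, timeout: float = 10.0) -> bool:
    """Fetch <base_url>/health and evaluate it; network errors count as unhealthy."""
    url = base_url.rstrip("/") + "/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return health_is_ok(response.status, response.read().decode("utf-8", "replace"))
    except OSError:  # URLError, HTTPError and socket timeouts all derive from OSError
        return False
```

Splitting the decision (`health_is_ok`) from the network call (`check_health`) keeps the first part testable without a live service.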
#### Step 6: Configure Custom Domain

1. Railway dashboard → `blackroad-backend` → Settings → Networking
2. Add custom domain: `blackroad.systems`
3. Railway provides CNAME: `blackroad-backend-production.up.railway.app`
4. Add CNAME in Cloudflare DNS (if not already done)
5. Wait for SSL provisioning (automatic)

---

### PRIORITY 2: Deprecate Non-Canonical Services (This Week)

For each service NOT in the canonical list:

1. **Document current state**
   ```
   Service: <name>
   Status: <running/failed>
   Last Deploy: <date>
   Purpose: <unclear/experimental/legacy>
   ```

2. **Pause (don't delete yet)**
   - Railway dashboard → Service → Settings
   - Click "Sleep Application"
   - This keeps data but stops billing

3. **Add label in Railway**
   - Add description: `[DEPRECATED] - Paused 2025-11-18, will delete in 30 days`

4. **Document in `ops/RAILWAY_SERVICES.md`** (create this file)

5. **After 30 days with no issues, delete**

**Services to Deprecate:**
- `flask`
- `nodejs`
- `blackroad-prism-console`
- `dockerfile`
- `inspiring-ambition`
- `feisty-vibrancy`
- `Primary`
- `Worker`
- `Viewer`
- `MinIO`
- `Valkey`

**Services to Keep:**
- `blackroad-backend` (or `BlackRoad-Operating-System` renamed)
- `Postgres`
- `Redis`

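
To keep the pause dates and the 30-day deletion window consistent across eleven services, the labels can be generated rather than typed. This is an illustrative sketch only: `deprecation_plan` is a hypothetical helper, the label wording is adapted from step 3, and nothing here talks to Railway.

```python
from datetime import date, timedelta
from typing import Dict, Iterable

# The three services this audit says to keep.
CANONICAL = {"blackroad-backend", "Postgres", "Redis"}
GRACE_PERIOD = timedelta(days=30)


def deprecation_plan(services: Iterable[str], paused_on: date) -> Dict[str, str]:
    """Map each non-canonical service to a description label with its delete date."""
    delete_after = paused_on + GRACE_PERIOD
    plan = {}
    for name in services:
        if name in CANONICAL:
            continue  # never label a service that must stay
        plan[name] = (
            f"[DEPRECATED] - Paused {paused_on.isoformat()}, "
            f"delete after {delete_after.isoformat()}"
        )
    return plan


# Example: pausing on the audit date gives a delete-after date of 2025-12-18.
labels = deprecation_plan(["flask", "Postgres", "Valkey"], date(2025, 11, 18))
```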

---

### PRIORITY 3: Validate Automation (Next Week)

#### GitHub Branch Protection

1. Go to: Repository → Settings → Branches
2. Edit protection rule for `main`
3. Verify:
   - ✅ Require pull request before merging
   - ✅ Require approvals: 1
   - ✅ Require status checks to pass
     - Backend Tests
     - Frontend Validation
     - Auto-Merge
     - Label PR
   - ✅ Require branches to be up to date
   - ✅ Enable merge queue
4. Save changes

#### GitHub Webhook

1. Go to: Repository → Settings → Webhooks
2. If webhook exists:
   - Verify URL: `https://blackroad.systems/api/operator/webhooks/github`
   - Verify Secret is set: `$GITHUB_WEBHOOK_SECRET`
   - Verify Events: Pull requests, Pull request reviews, Status
3. If webhook doesn't exist:
   - Add webhook (see `GITHUB_SETUP_GUIDE.md`)

#### Test Automation Flow

1. Create test PR from a `test/automation` branch
2. Verify:
   - ✅ PR auto-labeled (docs, backend, etc.)
   - ✅ CI workflows run (only relevant bucketed ones)
   - ✅ Auto-approve triggers (if docs-only or tests-only)
   - ✅ PR enters merge queue when approved
   - ✅ PR auto-merges after checks pass
3. Check Prism dashboard (once connected)
   - ✅ PR event appears in dashboard
   - ✅ Merge metrics update

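
When choosing files for the test PR, it helps to predict whether it should count as docs-only and therefore trigger auto-approve. The sketch below shows the general idea with glob matching. The patterns are placeholders invented for illustration; the real rules live in `.github/labeler.yml` and the auto-approve workflows, which were not reproduced in this report.

```python
from fnmatch import fnmatch
from typing import Sequence

# Hypothetical patterns; replace with the globs actually used in .github/labeler.yml.
DOCS_PATTERNS = ("docs/*", "codex-docs/*", "*.md")


def is_docs_only(changed_files: Sequence[str]) -> bool:
    """True when the PR touches at least one file and every file matches a docs pattern."""
    if not changed_files:
        return False
    # fnmatch's "*" also matches "/", so "docs/*" covers nested paths.
    return all(
        any(fnmatch(path, pattern) for pattern in DOCS_PATTERNS)
        for path in changed_files
    )
```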

---

### PRIORITY 4: Prism Integration (Next 2 Weeks)

#### Backend Integration

Add Prism route to backend (`backend/app/main.py`):

```python
from fastapi.staticfiles import StaticFiles

# After other route includes, before returning app:
app.mount("/prism", StaticFiles(directory="../prism-console", html=True), name="prism")
```
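
One caveat with the relative path in the mount: `"../prism-console"` is resolved against the process working directory, so it only works when the server starts from `backend/`. A possible alternative is to derive the directory from the location of `main.py`. This is a sketch under the assumption that the file sits at `backend/app/main.py` with `prism-console/` at the repository root; `prism_dir_for` is a hypothetical helper, and the directory still has to exist inside the container image.

```python
from pathlib import PurePath, PurePosixPath


def prism_dir_for(main_file: PurePath) -> PurePath:
    """Return <repo root>/prism-console for a main.py located at backend/app/main.py."""
    # parents[0] is backend/app, parents[1] is backend, parents[2] is the repo root.
    repo_root = main_file.parents[2]
    return repo_root / "prism-console"


# In main.py this would be used roughly as:
#   StaticFiles(directory=str(prism_dir_for(Path(__file__).resolve())), html=True)
example = prism_dir_for(PurePosixPath("/repo/backend/app/main.py"))
```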

Commit and deploy.

#### Verify Prism Access

```bash
curl https://blackroad.systems/prism
# Should return Prism Console HTML
```

Visit `https://blackroad.systems/prism` in browser.

#### Connect Prism to Backend API

Prism Console (`prism-console/static/js/prism.js`) should call:
- `GET /api/operator/jobs` - List jobs
- `GET /api/system/version` - System version
- `GET /api/health/summary` - API health
- `GET /api/prism/events` (future) - PR events

Update `prism.js` to use `API_BASE_URL = window.location.origin + '/api'`.

---

## G. FINAL PRODUCTION TOPOLOGY

After cleanup, your Railway project should look like this:

```
┌─────────────────────────────────────────────────────────────┐
│  RAILWAY PROJECT: BlackRoad-Operating-System-Production     │
└─────────────────────────────────────────────────────────────┘

📦 Services (3 total)

1. ⭐ blackroad-backend
   ├─ Type: Web Service
   ├─ Source: GitHub (blackboxprogramming/BlackRoad-Operating-System)
   ├─ Branch: main
   ├─ Build: Dockerfile (backend/Dockerfile)
   ├─ Port: $PORT (Railway auto-assigned)
   ├─ Health: /health
   ├─ Domain: blackroad.systems
   ├─ Status: 🟢 Healthy
   └─ Serves:
      ├─ /          → Pocket OS UI (backend/static)
      ├─ /api/*     → FastAPI endpoints
      ├─ /prism     → Prism Console
      ├─ /health    → Health check
      └─ /api/docs  → Swagger UI

2. 🗄️ Postgres
   ├─ Type: PostgreSQL 15+
   ├─ Plan: Railway managed
   ├─ Connection: ${{Postgres.DATABASE_URL}}
   └─ Used By: blackroad-backend

3. ⚡ Redis
   ├─ Type: Redis 7+
   ├─ Plan: Railway managed
   ├─ Connection: ${{Redis.REDIS_URL}}
   └─ Used By: blackroad-backend (sessions, caching)

📊 Metrics
   ├─ Total Memory: ~512MB (backend)
   ├─ Requests/day: TBD
   └─ Uptime: 99%+

🔗 External
   ├─ DNS: Cloudflare
   ├─ Docs: GitHub Pages (docs.blackroad.systems)
   └─ CDN: Cloudflare (proxied)
```

---

## H. TRAFFIC FLOW DIAGRAM

```
    User Browser
         │
         │ https://blackroad.systems
         ▼
┌────────────────────────┐
│   Cloudflare CDN       │
│   (DNS + SSL + Cache)  │
└────────┬───────────────┘
         │ CNAME: blackroad-backend-production.up.railway.app
         ▼
┌────────────────────────┐
│  Railway Load Balancer │
└────────┬───────────────┘
         │ Port $PORT
         ▼
┌────────────────────────────────────────────────┐
│  blackroad-backend (Docker Container)          │
│                                                │
│  FastAPI App (uvicorn)                         │
│  ├─ / → backend/static/index.html (Pocket OS)  │
│  ├─ /api/* → Backend routers                   │
│  ├─ /prism → prism-console/index.html          │
│  ├─ /health → Health check                     │
│  └─ /api/docs → Swagger UI                     │
│                                                │
│  Connects to:                                  │
│  ├─ $DATABASE_URL → Postgres                   │
│  └─ $REDIS_URL → Redis                         │
└────────────────────────────────────────────────┘
         │                    │
         │                    │
         ▼                    ▼
    ┌─────────┐          ┌─────────┐
    │ Postgres│          │  Redis  │
    └─────────┘          └─────────┘
```

---

## I. CHECKLIST FOR ALEXA

### ✅ Immediate Actions (Today)

- [ ] **1. Review this audit report** (you're doing it now! ✅)
- [ ] **2. Go to Railway dashboard** and take inventory
  - [ ] Screenshot the services list
  - [ ] Note which service is connected to GitHub repo
- [ ] **3. Identify the canonical backend service**
  - [ ] Look for service with recent failed deploys
  - [ ] Check which one has environment variables set
- [ ] **4. Verify environment variables**
  - [ ] DATABASE_URL
  - [ ] REDIS_URL
  - [ ] SECRET_KEY (generate if missing: `openssl rand -hex 32`)
  - [ ] ALLOWED_ORIGINS
  - [ ] ENVIRONMENT=production
  - [ ] DEBUG=False

### ⚡ This Week (Days 1-3)

- [ ] **5. Rename canonical service** to `blackroad-backend`
- [ ] **6. Trigger fresh deployment**
  - [ ] Option A: Push to main (recommended)
  - [ ] Option B: Railway CLI deploy
- [ ] **7. Verify deployment succeeds**
  - [ ] Check Railway logs
  - [ ] Test `/health` endpoint
  - [ ] Test `/api/docs` endpoint
- [ ] **8. Configure custom domain** `blackroad.systems`
  - [ ] Add domain in Railway
  - [ ] Update DNS in Cloudflare (if needed)
- [ ] **9. Pause all non-canonical services**
  - [ ] Add "[DEPRECATED]" label
  - [ ] Document in `ops/RAILWAY_SERVICES.md`

### 🔧 This Week (Days 4-7)

- [ ] **10. Verify GitHub automation**
  - [ ] Check branch protection rules
  - [ ] Verify required status checks
  - [ ] Test merge queue with sample PR
- [ ] **11. Configure GitHub webhook** (if not exists)
  - [ ] URL: `https://blackroad.systems/api/operator/webhooks/github`
  - [ ] Secret: Generate and set in Railway
  - [ ] Events: Pull requests, Reviews, Status
- [ ] **12. Test automation flow**
  - [ ] Create test PR
  - [ ] Verify auto-labeling
  - [ ] Verify auto-approve (if docs-only)
  - [ ] Verify merge queue

### 📊 Next Week (Days 8-14)

- [ ] **13. Integrate Prism Console**
  - [ ] Add `/prism` route to backend
  - [ ] Deploy and verify access
  - [ ] Connect Prism to backend API
- [ ] **14. Monitor production metrics**
  - [ ] Check uptime
  - [ ] Check error rates
  - [ ] Check API health summary
- [ ] **15. Document final topology**
  - [ ] Update `DEPLOYMENT_NOTES.md`
  - [ ] Update `CLAUDE.md`
  - [ ] Update `README.md`

### 🧹 30 Days Later

- [ ] **16. Delete deprecated services**
  - [ ] Verify no dependencies
  - [ ] Export any needed data
  - [ ] Delete from Railway

---

## J. SUCCESS CRITERIA

You'll know the production stack is stable when:

✅ **1. Deployment Health**
- [ ] Railway `blackroad-backend` service shows 🟢 Healthy
- [ ] `/health` endpoint returns 200 OK
- [ ] `/api/docs` is accessible
- [ ] `/api/health/summary` shows integrations status

✅ **2. Service Count**
- [ ] Exactly **3 services** in Railway:
  - blackroad-backend
  - Postgres
  - Redis
- [ ] All other services paused/deleted

✅ **3. Domain Access**
- [ ] `https://blackroad.systems` loads Pocket OS
- [ ] `https://blackroad.systems/api/docs` loads Swagger UI
- [ ] `https://blackroad.systems/prism` loads Prism Console (after integration)
- [ ] `https://docs.blackroad.systems` loads Codex docs

✅ **4. Automation Flow**
- [ ] New PRs auto-labeled correctly
- [ ] Docs-only PRs auto-approved and auto-merged
- [ ] Backend/frontend PRs run bucketed CI
- [ ] Merge queue prevents conflicts
- [ ] Webhook events reach backend

✅ **5. Monitoring**
- [ ] Railway logs show healthy requests
- [ ] No deployment failures in last 7 days
- [ ] API health summary shows majority "connected"
- [ ] Prism dashboard displays PR events

---

## K. RISKS & MITIGATION

### Risk 1: Deleting Wrong Service

**Impact:** High (could delete production backend)
**Likelihood:** Low
**Mitigation:**
1. ALWAYS pause first (don't delete immediately)
2. Wait 30 days before deleting
3. Export environment variables before pausing (e.g., run `railway variables list > env-vars-backup.txt`)
4. Test canonical service works before pausing others

### Risk 2: Environment Variable Loss

**Impact:** High (app won't start)
**Likelihood:** Medium
**Mitigation:**
1. Document all env vars in `ops/RAILWAY_SERVICES.md`
2. Keep copy in 1Password/LastPass
3. Verify against `ENV_VARS.md` before any changes

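
A backup is only useful if it can be compared with the live configuration after a change. The sketch below diffs two snapshots and reports what was removed, added, or changed. It assumes the snapshots are plain `KEY=VALUE` text, which may not match what the Railway CLI actually prints, so the backup file may need converting first. Both function names are hypothetical.

```python
from typing import Dict, List


def parse_env(text: str) -> Dict[str, str]:
    """Parse KEY=VALUE lines, skipping blanks, comments and malformed lines."""
    result = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")  # split on the first "=" only
        result[key.strip()] = value.strip()
    return result


def diff_env(before: Dict[str, str], after: Dict[str, str]) -> Dict[str, List[str]]:
    """Report variable names that were removed, added, or changed between snapshots."""
    return {
        "removed": sorted(before.keys() - after.keys()),
        "added": sorted(after.keys() - before.keys()),
        "changed": sorted(k for k in before.keys() & after.keys() if before[k] != after[k]),
    }
```

Only names are reported, never values, so the diff output is safe to paste into a ticket or chat without leaking secrets.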
### Risk 3: Domain Misconfiguration

**Impact:** Medium (users can't access site)
**Likelihood:** Low
**Mitigation:**
1. Keep old Railway domain active until custom domain works
2. Test custom domain thoroughly before switching DNS
3. Cloudflare provides rollback if needed

### Risk 4: Database Connection Loss

**Impact:** High (app crashes)
**Likelihood:** Low
**Mitigation:**
1. Verify `${{Postgres.DATABASE_URL}}` reference is correct
2. Test database connection after each deployment
3. Keep database in same Railway project as backend

---

## L. APPENDIX: USEFUL COMMANDS

### Railway CLI

```bash
# Install
curl -fsSL https://railway.app/install.sh | sh

# Login
railway login

# Link to project
railway link <PROJECT_ID>

# Deploy
railway up --service blackroad-backend

# Check logs
railway logs --service blackroad-backend --tail 100

# Check status
railway status

# Open Railway dashboard
railway open

# List services
railway service list

# Environment variables
railway variables set SECRET_KEY=<value> --service blackroad-backend
railway variables list
```

### Health Checks

```bash
# Production health
curl https://blackroad.systems/health

# API health summary
curl https://blackroad.systems/api/health/summary

# System version
curl https://blackroad.systems/api/system/version

# Public config
curl https://blackroad.systems/api/system/config/public
```

### Git Operations

```bash
# Check current commit
git log -1 --oneline

# Check branch
git branch --show-current

# Pull latest
git pull origin main

# Push to trigger deploy
git push origin main

# View recent commits
git log --oneline -20
```

---

## M. CONCLUSION

**Full double check complete, Operator. Here's the situation:**

### The Good ✅
- Your codebase is solid and well-architected
- Recent Phase LIVE fixes resolved deployment issues
- Automation (Phase Q) is properly implemented
- Documentation is comprehensive and accurate

### The Challenge ⚠️
- Railway has **11 extra services** that don't belong
- Multiple failing services creating noise
- No clear canonical backend identified
- Production topology doesn't match documentation

### The Solution 🎯
1. **This week:** Identify and stabilize the canonical backend
2. **This week:** Pause/deprecate all non-canonical services
3. **Next week:** Integrate Prism and verify automation
4. **30 days:** Clean up deprecated services

### Your Next Action 👉
**Go to Railway dashboard RIGHT NOW** and answer:
1. Which service is connected to the GitHub repo?
2. What environment variables does it have?
3. When was the last deployment attempt?

Send me the answers and I'll help you stabilize that service first.

---

**Production stack audit complete. Ready to execute, Operator.** 🚀

---

*Last Updated: 2025-11-18*
*Audited By: Cece (Claude Sonnet 4.5)*
*Report Status: Complete and Ready for Action*