blackroad-os/blackroad-cluster-tests

Fork 0

Files

blackboxprogramming 75c7f56432 Initial commit — RoadCode import

2026-03-08 20:03:26 -05:00

7.6 KiB

Raw Blame History

BlackRoad Raspberry Pi Cluster Test Suite

Zero physical intervention. All tests over SSH. Non-destructive. Idempotent.

Quick Start

# Run full test suite on all nodes
./run-tests.sh

# Dry run (shows what would be tested)
./run-tests.sh --dry-run

# Fast mode (skips slow tests like SMART, latency)
./run-tests.sh --fast

# Test single node
./run-tests.sh --node lucidia

# Combine flags
./run-tests.sh --fast --node octavia

Prerequisites

Passwordless SSH to all nodes (use ssh-copy-id or Tailscale)
Nodes must be reachable by hostname (use .local mDNS or /etc/hosts)
Bash 4.0+ on orchestrator machine
jq for JSON parsing (optional but recommended)

What Gets Tested

Test Module	Validates	Critical Checks
`test_os.sh`	Boot & OS health	Uptime, kernel, filesystem R/W, boot device
`test_storage.sh`	Storage integrity	Disk/inode usage, SMART health
`test_network.sh`	Network fabric	LAN, Tailscale, DNS, mDNS, inter-node ping
`test_time.sh`	Time authority	Chrony tracking, octavia as NTP source, drift
`test_mqtt.sh`	MQTT fabric	Broker reachable, pub/sub roundtrip, retained msgs
`test_systemd.sh`	Service health	BlackRoad services active, no failed units
`test_hardware.sh`	Hardware status	Temperature, throttling, USB, I2C, HDMI
`test_role.sh`	Role correctness	node.yaml exists, role matches hardware
`test_ui.sh`	Dashboard/UI	Grafana, kiosk service, HDMI output
`test_agent.sh`	Operator layer	Heartbeat MQTT, phase cycling, emotional envelope

Test Results

Results are saved in results/ directory:

Per-node JSON files: {node}_{timestamp}.json
Summary report: summary_{timestamp}.txt

Exit Codes

0 = All tests passed
1 = Warnings present
2 = Failures detected

Example Output

╔════════════════════════════════════════════════════════╗
║  BlackRoad Cluster Test Suite                         ║
║  Fri Dec 26 15:30:00 PST 2025                          ║
╚════════════════════════════════════════════════════════╝

Mode: LIVE | Speed: FULL
Nodes: lucidia alice aria octavia shellfish

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing: lucidia
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  test_os... ✓ PASS
  test_storage... ✓ PASS
  test_network... ✓ PASS
  test_time... ✓ PASS
  test_mqtt... ✓ PASS
  test_systemd... ✓ PASS
  test_hardware... ✓ PASS
  test_role... ✓ PASS
  test_ui... ✓ PASS
  test_agent... ✓ PASS

  Summary: 10 pass | 0 warn | 0 fail

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing: octavia
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  test_os... ✓ PASS
  test_storage... ⚠ WARN
  test_network... ✓ PASS
  test_time... ✓ PASS
  test_mqtt... ✓ PASS
  test_systemd... ✓ PASS
  test_hardware... ✓ PASS
  test_role... ✓ PASS
  test_ui... ✓ PASS
  test_agent... ✓ PASS

  Summary: 9 pass | 1 warn | 0 fail

╔════════════════════════════════════════════════════════╗
║  Cluster Summary                                       ║
╚════════════════════════════════════════════════════════╝

Total Tests: 50
Pass:  48
Warn:  2
Fail:  0

Results saved to: /Users/alexa/blackroad-cluster-tests/results

Status: WARNINGS PRESENT

Safety Guarantees

✅ No destructive operations

No mkfs, dd, fdisk, or partition changes
No EEPROM writes
No forced reboots
No service restarts

✅ Read-only by default

All tests use read operations
MQTT tests use temporary topics and clean up
File system checks are non-invasive

✅ Idempotent execution

Safe to run every minute
Results don't affect system state
No cumulative side effects

Integration Examples

Cron Job (every 15 minutes)

*/15 * * * * cd /home/alexa/blackroad-cluster-tests && ./run-tests.sh --fast

MQTT Alert on Failure

./run-tests.sh || mosquitto_pub -h octavia -t blackroad/alerts/tests -m "Test suite failed at $(date)"

Dashboard Integration

# Parse JSON results and send to monitoring system
./run-tests.sh --fast
for result in results/*_$(date +%Y%m%d)*.json; do
  jq -c . "$result" | curl -X POST https://monitor.blackroad.com/api/test-results -d @-
done

Pre-Deployment Check

#!/bin/bash
# deploy.sh
set -e

echo "Running cluster health check..."
./run-tests.sh --fast

if [ $? -ne 0 ]; then
  echo "❌ Cluster health check failed. Aborting deployment."
  exit 1
fi

echo "✅ Cluster healthy. Proceeding with deployment..."
# ... deployment logic

Customization

Add Custom Test Module

Create modules/test_custom.sh:

#!/usr/bin/env bash
set -euo pipefail

output_result() {
  local status=$1
  local message=$2
  local details=${3:-"{}"}
  echo "{\"test\": \"custom\", \"status\": \"$status\", \"message\": \"$message\", \"details\": $details}"
}

if [[ "${DRY_RUN:-false}" == "true" ]]; then
  output_result "DRY_RUN" "Would check custom metric"
  exit 0
fi

# Your test logic here
# ...

output_result "PASS" "Custom check passed" "{}"

Then add to TEST_MODULES array in run-tests.sh.

Change Target Nodes

Edit NODES array in run-tests.sh:

NODES=(lucidia alice aria octavia shellfish)

Adjust Thresholds

Edit individual test modules. Example in test_storage.sh:

# Change disk warning from 80% to 70%
elif [[ $ROOT_USAGE -gt 70 ]]; then
  output_result "WARN" "Root disk usage high (${ROOT_USAGE}%)" "$DETAILS"

Troubleshooting

SSH Connection Issues

# Test manual SSH
ssh lucidia "echo test"

# Set up passwordless auth
ssh-copy-id lucidia

Missing Tools on Nodes

# Install on each node
ssh lucidia "sudo apt-get update && sudo apt-get install -y jq mosquitto-clients i2c-tools smartmontools"

Slow Test Execution

# Use fast mode to skip SMART checks and latency tests
./run-tests.sh --fast

JSON Parsing Errors

# Install jq on orchestrator machine
brew install jq  # macOS
apt-get install jq  # Debian/Ubuntu

Architecture

run-tests.sh (orchestrator)
    │
    ├─ SSH to each node
    │   │
    │   ├─ modules/test_os.sh → JSON output
    │   ├─ modules/test_storage.sh → JSON output
    │   ├─ modules/test_network.sh → JSON output
    │   └─ ... (all test modules)
    │
    └─ Aggregate results → results/{node}_{timestamp}.json
                        → results/summary_{timestamp}.txt

Each test module:

Runs read-only commands
Outputs single-line JSON to stdout
Returns 0 (script always succeeds; status in JSON)

Orchestrator:

Copies test module via stdin to SSH
Executes remotely
Captures JSON output (last line)
Aggregates into cluster summary

License

MIT - BlackRoad Systems 2025

7.6 KiB Raw Blame History