Files
blackroad-cluster-tests/README.md

7.6 KiB

BlackRoad Raspberry Pi Cluster Test Suite

Zero physical intervention. All tests over SSH. Non-destructive. Idempotent.

Quick Start

# Run full test suite on all nodes
./run-tests.sh

# Dry run (shows what would be tested)
./run-tests.sh --dry-run

# Fast mode (skips slow tests like SMART, latency)
./run-tests.sh --fast

# Test single node
./run-tests.sh --node lucidia

# Combine flags
./run-tests.sh --fast --node octavia

Prerequisites

  • Passwordless SSH to all nodes (use ssh-copy-id or Tailscale)
  • Nodes must be reachable by hostname (use .local mDNS or /etc/hosts)
  • Bash 4.0+ on orchestrator machine
  • jq for JSON parsing (optional but recommended)

What Gets Tested

Test Module Validates Critical Checks
test_os.sh Boot & OS health Uptime, kernel, filesystem R/W, boot device
test_storage.sh Storage integrity Disk/inode usage, SMART health
test_network.sh Network fabric LAN, Tailscale, DNS, mDNS, inter-node ping
test_time.sh Time authority Chrony tracking, octavia as NTP source, drift
test_mqtt.sh MQTT fabric Broker reachable, pub/sub roundtrip, retained msgs
test_systemd.sh Service health BlackRoad services active, no failed units
test_hardware.sh Hardware status Temperature, throttling, USB, I2C, HDMI
test_role.sh Role correctness node.yaml exists, role matches hardware
test_ui.sh Dashboard/UI Grafana, kiosk service, HDMI output
test_agent.sh Operator layer Heartbeat MQTT, phase cycling, emotional envelope

Test Results

Results are saved in results/ directory:

  • Per-node JSON files: {node}_{timestamp}.json
  • Summary report: summary_{timestamp}.txt

Exit Codes

  • 0 = All tests passed
  • 1 = Warnings present
  • 2 = Failures detected

Example Output

╔════════════════════════════════════════════════════════╗
║  BlackRoad Cluster Test Suite                         ║
║  Fri Dec 26 15:30:00 PST 2025                          ║
╚════════════════════════════════════════════════════════╝

Mode: LIVE | Speed: FULL
Nodes: lucidia alice aria octavia shellfish

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing: lucidia
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  test_os... ✓ PASS
  test_storage... ✓ PASS
  test_network... ✓ PASS
  test_time... ✓ PASS
  test_mqtt... ✓ PASS
  test_systemd... ✓ PASS
  test_hardware... ✓ PASS
  test_role... ✓ PASS
  test_ui... ✓ PASS
  test_agent... ✓ PASS

  Summary: 10 pass | 0 warn | 0 fail

━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing: octavia
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  test_os... ✓ PASS
  test_storage... ⚠ WARN
  test_network... ✓ PASS
  test_time... ✓ PASS
  test_mqtt... ✓ PASS
  test_systemd... ✓ PASS
  test_hardware... ✓ PASS
  test_role... ✓ PASS
  test_ui... ✓ PASS
  test_agent... ✓ PASS

  Summary: 9 pass | 1 warn | 0 fail

╔════════════════════════════════════════════════════════╗
║  Cluster Summary                                       ║
╚════════════════════════════════════════════════════════╝

Total Tests: 50
Pass:  48
Warn:  2
Fail:  0

Results saved to: /Users/alexa/blackroad-cluster-tests/results

Status: WARNINGS PRESENT

Safety Guarantees

No destructive operations

  • No mkfs, dd, fdisk, or partition changes
  • No EEPROM writes
  • No forced reboots
  • No service restarts

Read-only by default

  • All tests use read operations
  • MQTT tests use temporary topics and clean up
  • File system checks are non-invasive

Idempotent execution

  • Safe to run every minute
  • Results don't affect system state
  • No cumulative side effects

Integration Examples

Cron Job (every 15 minutes)

*/15 * * * * cd /home/alexa/blackroad-cluster-tests && ./run-tests.sh --fast

MQTT Alert on Failure

./run-tests.sh || mosquitto_pub -h octavia -t blackroad/alerts/tests -m "Test suite failed at $(date)"

Dashboard Integration

# Parse JSON results and send to monitoring system
./run-tests.sh --fast
for result in results/*_$(date +%Y%m%d)*.json; do
  jq -c . "$result" | curl -X POST https://monitor.blackroad.com/api/test-results -d @-
done

Pre-Deployment Check

#!/bin/bash
# deploy.sh
set -e

echo "Running cluster health check..."
./run-tests.sh --fast

if [ $? -ne 0 ]; then
  echo "❌ Cluster health check failed. Aborting deployment."
  exit 1
fi

echo "✅ Cluster healthy. Proceeding with deployment..."
# ... deployment logic

Customization

Add Custom Test Module

Create modules/test_custom.sh:

#!/usr/bin/env bash
set -euo pipefail

output_result() {
  local status=$1
  local message=$2
  local details=${3:-"{}"}
  echo "{\"test\": \"custom\", \"status\": \"$status\", \"message\": \"$message\", \"details\": $details}"
}

if [[ "${DRY_RUN:-false}" == "true" ]]; then
  output_result "DRY_RUN" "Would check custom metric"
  exit 0
fi

# Your test logic here
# ...

output_result "PASS" "Custom check passed" "{}"

Then add to TEST_MODULES array in run-tests.sh.

Change Target Nodes

Edit NODES array in run-tests.sh:

NODES=(lucidia alice aria octavia shellfish)

Adjust Thresholds

Edit individual test modules. Example in test_storage.sh:

# Change disk warning from 80% to 70%
elif [[ $ROOT_USAGE -gt 70 ]]; then
  output_result "WARN" "Root disk usage high (${ROOT_USAGE}%)" "$DETAILS"

Troubleshooting

SSH Connection Issues

# Test manual SSH
ssh lucidia "echo test"

# Set up passwordless auth
ssh-copy-id lucidia

Missing Tools on Nodes

# Install on each node
ssh lucidia "sudo apt-get update && sudo apt-get install -y jq mosquitto-clients i2c-tools smartmontools"

Slow Test Execution

# Use fast mode to skip SMART checks and latency tests
./run-tests.sh --fast

JSON Parsing Errors

# Install jq on orchestrator machine
brew install jq  # macOS
apt-get install jq  # Debian/Ubuntu

Architecture

run-tests.sh (orchestrator)
    │
    ├─ SSH to each node
    │   │
    │   ├─ modules/test_os.sh → JSON output
    │   ├─ modules/test_storage.sh → JSON output
    │   ├─ modules/test_network.sh → JSON output
    │   └─ ... (all test modules)
    │
    └─ Aggregate results → results/{node}_{timestamp}.json
                        → results/summary_{timestamp}.txt

Each test module:

  1. Runs read-only commands
  2. Outputs single-line JSON to stdout
  3. Returns 0 (script always succeeds; status in JSON)

Orchestrator:

  1. Copies test module via stdin to SSH
  2. Executes remotely
  3. Captures JSON output (last line)
  4. Aggregates into cluster summary

License

MIT - BlackRoad Systems 2025