Files
blackroad-cluster-tests/README.md

276 lines
7.6 KiB
Markdown

# BlackRoad Raspberry Pi Cluster Test Suite
**Zero physical intervention. All tests over SSH. Non-destructive. Idempotent.**
## Quick Start
```bash
# Run full test suite on all nodes
./run-tests.sh
# Dry run (shows what would be tested)
./run-tests.sh --dry-run
# Fast mode (skips slow tests like SMART, latency)
./run-tests.sh --fast
# Test single node
./run-tests.sh --node lucidia
# Combine flags
./run-tests.sh --fast --node octavia
```
## Prerequisites
- Passwordless SSH to all nodes (use `ssh-copy-id` or Tailscale)
- Nodes must be reachable by hostname (use `.local` mDNS or `/etc/hosts`)
- Bash 4.0+ on orchestrator machine
- `jq` for JSON parsing (optional but recommended)
## What Gets Tested
| Test Module | Validates | Critical Checks |
|------------|-----------|----------------|
| `test_os.sh` | Boot & OS health | Uptime, kernel, filesystem R/W, boot device |
| `test_storage.sh` | Storage integrity | Disk/inode usage, SMART health |
| `test_network.sh` | Network fabric | LAN, Tailscale, DNS, mDNS, inter-node ping |
| `test_time.sh` | Time authority | Chrony tracking, octavia as NTP source, drift |
| `test_mqtt.sh` | MQTT fabric | Broker reachable, pub/sub roundtrip, retained msgs |
| `test_systemd.sh` | Service health | BlackRoad services active, no failed units |
| `test_hardware.sh` | Hardware status | Temperature, throttling, USB, I2C, HDMI |
| `test_role.sh` | Role correctness | node.yaml exists, role matches hardware |
| `test_ui.sh` | Dashboard/UI | Grafana, kiosk service, HDMI output |
| `test_agent.sh` | Operator layer | Heartbeat MQTT, phase cycling, emotional envelope |
## Test Results
Results are saved in `results/` directory:
- Per-node JSON files: `{node}_{timestamp}.json`
- Summary report: `summary_{timestamp}.txt`
### Exit Codes
- `0` = All tests passed
- `1` = Warnings present
- `2` = Failures detected
## Example Output
```
╔════════════════════════════════════════════════════════╗
║ BlackRoad Cluster Test Suite ║
║ Fri Dec 26 15:30:00 PST 2025 ║
╚════════════════════════════════════════════════════════╝
Mode: LIVE | Speed: FULL
Nodes: lucidia alice aria octavia shellfish
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing: lucidia
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
test_os... ✓ PASS
test_storage... ✓ PASS
test_network... ✓ PASS
test_time... ✓ PASS
test_mqtt... ✓ PASS
test_systemd... ✓ PASS
test_hardware... ✓ PASS
test_role... ✓ PASS
test_ui... ✓ PASS
test_agent... ✓ PASS
Summary: 10 pass | 0 warn | 0 fail
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
Testing: octavia
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
test_os... ✓ PASS
test_storage... ⚠ WARN
test_network... ✓ PASS
test_time... ✓ PASS
test_mqtt... ✓ PASS
test_systemd... ✓ PASS
test_hardware... ✓ PASS
test_role... ✓ PASS
test_ui... ✓ PASS
test_agent... ✓ PASS
Summary: 9 pass | 1 warn | 0 fail
╔════════════════════════════════════════════════════════╗
║ Cluster Summary ║
╚════════════════════════════════════════════════════════╝
Total Tests: 50
Pass: 48
Warn: 2
Fail: 0
Results saved to: /Users/alexa/blackroad-cluster-tests/results
Status: WARNINGS PRESENT
```
## Safety Guarantees
**No destructive operations**
- No `mkfs`, `dd`, `fdisk`, or partition changes
- No EEPROM writes
- No forced reboots
- No service restarts
**Read-only by default**
- All tests use read operations
- MQTT tests use temporary topics and clean up
- File system checks are non-invasive
**Idempotent execution**
- Safe to run every minute
- Results don't affect system state
- No cumulative side effects
## Integration Examples
### Cron Job (every 15 minutes)
```bash
*/15 * * * * cd /home/alexa/blackroad-cluster-tests && ./run-tests.sh --fast
```
### MQTT Alert on Failure
```bash
./run-tests.sh || mosquitto_pub -h octavia -t blackroad/alerts/tests -m "Test suite failed at $(date)"
```
### Dashboard Integration
```bash
# Parse JSON results and send to monitoring system
./run-tests.sh --fast
for result in results/*_$(date +%Y%m%d)*.json; do
jq -c . "$result" | curl -X POST https://monitor.blackroad.com/api/test-results -d @-
done
```
### Pre-Deployment Check
```bash
#!/bin/bash
# deploy.sh
set -e
echo "Running cluster health check..."
./run-tests.sh --fast
if [ $? -ne 0 ]; then
echo "❌ Cluster health check failed. Aborting deployment."
exit 1
fi
echo "✅ Cluster healthy. Proceeding with deployment..."
# ... deployment logic
```
## Customization
### Add Custom Test Module
Create `modules/test_custom.sh`:
```bash
#!/usr/bin/env bash
set -euo pipefail
output_result() {
local status=$1
local message=$2
local details=${3:-"{}"}
echo "{\"test\": \"custom\", \"status\": \"$status\", \"message\": \"$message\", \"details\": $details}"
}
if [[ "${DRY_RUN:-false}" == "true" ]]; then
output_result "DRY_RUN" "Would check custom metric"
exit 0
fi
# Your test logic here
# ...
output_result "PASS" "Custom check passed" "{}"
```
Then add to `TEST_MODULES` array in `run-tests.sh`.
### Change Target Nodes
Edit `NODES` array in `run-tests.sh`:
```bash
NODES=(lucidia alice aria octavia shellfish)
```
### Adjust Thresholds
Edit individual test modules. Example in `test_storage.sh`:
```bash
# Change disk warning from 80% to 70%
elif [[ $ROOT_USAGE -gt 70 ]]; then
output_result "WARN" "Root disk usage high (${ROOT_USAGE}%)" "$DETAILS"
```
## Troubleshooting
### SSH Connection Issues
```bash
# Test manual SSH
ssh lucidia "echo test"
# Set up passwordless auth
ssh-copy-id lucidia
```
### Missing Tools on Nodes
```bash
# Install on each node
ssh lucidia "sudo apt-get update && sudo apt-get install -y jq mosquitto-clients i2c-tools smartmontools"
```
### Slow Test Execution
```bash
# Use fast mode to skip SMART checks and latency tests
./run-tests.sh --fast
```
### JSON Parsing Errors
```bash
# Install jq on orchestrator machine
brew install jq # macOS
apt-get install jq # Debian/Ubuntu
```
## Architecture
```
run-tests.sh (orchestrator)
├─ SSH to each node
│ │
│ ├─ modules/test_os.sh → JSON output
│ ├─ modules/test_storage.sh → JSON output
│ ├─ modules/test_network.sh → JSON output
│ └─ ... (all test modules)
└─ Aggregate results → results/{node}_{timestamp}.json
→ results/summary_{timestamp}.txt
```
Each test module:
1. Runs read-only commands
2. Outputs single-line JSON to stdout
3. Returns 0 (script always succeeds; status in JSON)
Orchestrator:
1. Copies test module via stdin to SSH
2. Executes remotely
3. Captures JSON output (last line)
4. Aggregates into cluster summary
## License
MIT - BlackRoad Systems 2025