blackroad-private-enhancements/docs/TROUBLESHOOTING.md
blackboxprogramming 4acdf1f8ac
Initial commit — RoadCode import
2026-03-08 20:04:29 -05:00


# 🔧 Troubleshooting Guide
Common issues and solutions for BlackRoad-Private infrastructure.
## Deployment Issues
### Railway Deployment Fails
**Symptoms:**
- Workflow fails at "Deploy to Railway" step
- Error: "Authentication failed"
**Solutions:**
1. Verify `RAILWAY_TOKEN` secret is set
2. Check token hasn't expired
3. Regenerate token: https://railway.app/account/tokens
4. Ensure `RAILWAY_PROJECT_ID` matches your project
```bash
# Test locally
railway login
railway link
railway status
```
### Cloudflare Workers Deployment Fails
**Symptoms:**
- Error: "Account ID not found"
- Error: "Zone ID invalid"
**Solutions:**
1. Verify secrets in GitHub:
   - `CLOUDFLARE_API_TOKEN`
   - `CLOUDFLARE_ACCOUNT_ID`
   - `CLOUDFLARE_ZONE_ID`
2. Check API token permissions:
   - Workers: Edit
   - Account Settings: Read
   - Zone: Edit
```bash
# Test locally
wrangler whoami
wrangler deploy
```
### Vercel Deployment Fails
**Symptoms:**
- Error: "Project not found"
- Error: "Team/Organization mismatch"
**Solutions:**
1. Verify all Vercel secrets:
   - `VERCEL_TOKEN`
   - `VERCEL_ORG_ID`
   - `VERCEL_PROJECT_ID`
2. Check token scope includes deployment access
```bash
# Test locally
vercel whoami
vercel --prod
```
## Health Check Issues
### All Health Checks Failing
**Symptoms:**
- Health check workflow shows all platforms unhealthy
- No actual service issues
**Solutions:**
1. Verify health URL secrets are set correctly
2. Check health endpoints return 200 status
3. Ensure endpoints don't require authentication
```bash
# Test manually
curl -v https://your-service-url/api/health
```
### Intermittent Health Check Failures
**Symptoms:**
- Health checks sometimes fail
- Services are actually healthy
**Solutions:**
1. Increase health check timeout (currently 100s)
2. Check platform status pages for outages
3. Review application logs for slow responses
## Build Issues
### npm install Fails
**Symptoms:**
- Build fails at dependency installation
- Error: "EACCES" or "Permission denied"
**Solutions:**
1. Check `package-lock.json` is committed
2. Verify Node version (requires 20+)
3. Clear cache and retry:
```bash
npm cache clean --force
npm ci
```
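The Node version requirement from step 2 can be checked before installing dependencies. This is a minimal sketch: `node_major` and `node_is_supported` are hypothetical helper names, and the default minimum of 20 simply mirrors the requirement stated above.

```bash
# Hypothetical helper: extract the major version from a string like "v20.11.0".
node_major() {
  _v="${1#v}"        # drop the leading "v" if present
  echo "${_v%%.*}"   # keep everything before the first dot
}

# Hypothetical helper: succeed only if the version meets the minimum major
# version (second argument, defaulting to 20).
node_is_supported() {
  [ "$(node_major "$1")" -ge "${2:-20}" ]
}
```

In a build step this could be used as `node_is_supported "$(node --version)" || echo "Node 20+ required" >&2`.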
### Build Timeout
**Symptoms:**
- Build takes too long and times out
- Error: "Process exceeded timeout"
**Solutions:**
1. Optimize build process
2. Increase timeout in `railway.json`
3. Use build cache effectively
4. Consider multi-stage builds
## Secret Issues
### Secret Not Found
**Symptoms:**
- Workflow fails with "secret not found"
- Variable shows as empty
**Solutions:**
1. Go to Settings → Secrets and variables → Actions
2. Verify secret name matches exactly (case-sensitive)
3. Check secret is in correct environment
4. Secret values are never displayed after saving; if in doubt, re-create the secret
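An empty variable can also be caught at the start of a job instead of deep inside a deploy step. The sketch below assumes the secrets are passed to the step as environment variables; `require_vars` is a hypothetical helper, not part of any platform CLI.

```bash
# Hypothetical helper: report every listed environment variable that is
# unset or empty, and return non-zero if at least one is missing.
require_vars() {
  _missing=0
  for _name in "$@"; do
    # Indirect lookup of the variable named in $_name (POSIX-compatible).
    eval "_value=\${$_name:-}"
    if [ -z "$_value" ]; then
      echo "Missing or empty: $_name" >&2
      _missing=1
    fi
  done
  return "$_missing"
}
```

For example, `require_vars RAILWAY_TOKEN RAILWAY_PROJECT_ID || exit 1` stops the step early and names the variable that is empty without printing any secret value.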
### Secret Exposed in Logs
**Symptoms:**
- Secret value visible in workflow logs
**Solutions:**
1. Immediately rotate the secret
2. Check for `echo` commands that output secrets
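3. Rely on masking only for registered secrets: GitHub automatically replaces values stored in GitHub Secrets with `***` in logs, but values derived from a secret (encoded or transformed copies) are not masked automatically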
4. Never log full API responses
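Values that are produced at runtime, such as a token fetched during the job, can be registered for masking with the GitHub Actions `::add-mask::` workflow command. A minimal sketch, where `mask_value` is a hypothetical wrapper name:

```bash
# Hypothetical wrapper: emit the GitHub Actions workflow command that masks
# the given value in subsequent log output of the job.
mask_value() {
  printf '::add-mask::%s\n' "$1"
}
```

Call it before the value is used anywhere that might be logged, for example `mask_value "$SESSION_TOKEN"`, where `SESSION_TOKEN` is a placeholder variable. Masking does not replace rotation: a value that has already appeared in a log should still be rotated.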
## Workflow Issues
### Workflow Doesn't Trigger
**Symptoms:**
- Push to main but no workflow runs
- Manual dispatch button missing
**Solutions:**
1. Check that the workflow file is valid YAML
2. Verify workflow file is in `.github/workflows/`
3. Check branch name matches trigger
4. Ensure workflow is enabled (Actions → Workflows)
```bash
# Validate YAML
yamllint .github/workflows/*.yml
```
### Workflow Stuck on "Queued"
**Symptoms:**
- Workflow shows "Queued" for extended time
- No progress
**Solutions:**
1. Check GitHub Actions minutes quota
2. Verify no concurrent job limits hit
3. Check for required status checks blocking
4. Cancel and re-run workflow
## Platform-Specific Issues
### Railway
#### Port Binding Error
```
Error: address already in use
```
**Solution:**
```env
# Ensure PORT is set as a Railway service variable and that the app listens on it
PORT=3000
```
#### Database Connection Fails
```
Error: connection refused
```
**Solution:**
1. Check database service is running
2. Verify connection string in environment
3. Check Railway service dependencies
### Cloudflare
#### Worker Exceeds Size Limit
```
Error: script too large
```
**Solutions:**
1. Reduce bundle size
2. Use webpack/rollup to minify the bundle
3. Split into multiple workers
4. Use ES modules
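A size check in CI can flag a growing bundle before the deploy step rejects it. This sketch makes two assumptions: the limit is passed in by the caller because the actual limit depends on the Cloudflare plan, and it measures the uncompressed file, which may differ from the size Cloudflare enforces. `bundle_within_limit` is a hypothetical helper name.

```bash
# Hypothetical helper: succeed if the file is no larger than the limit in bytes.
# Usage: bundle_within_limit <file> <limit-bytes>
bundle_within_limit() {
  [ -f "$1" ] || { echo "No such file: $1" >&2; return 2; }
  # Arithmetic expansion strips the padding some wc implementations add.
  _size=$(( $(wc -c < "$1") ))
  echo "$1: $_size bytes (limit $2)"
  [ "$_size" -le "$2" ]
}
```

For example, `bundle_within_limit dist/worker.js 1048576 || exit 1`, where both the path and the limit are placeholders to adjust for the project.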
#### KV Namespace Not Found
```
Error: KV namespace binding not found
```
**Solution:**
```bash
# Create namespace (newer Wrangler releases use `wrangler kv namespace create`)
wrangler kv:namespace create BLACKROAD_PRIVATE_KV
# Update wrangler.toml with the namespace ID printed by the command
```
### Vercel
#### Function Timeout
```
Error: FUNCTION_INVOCATION_TIMEOUT
```
**Solutions:**
1. Optimize function performance
2. Increase timeout in `vercel.json`:
```json
{
  "functions": {
    "api/*.js": {
      "maxDuration": 10
    }
  }
}
```
#### Build Memory Exceeded
```
Error: Command failed with exit code 137
```
**Solutions:**
1. Reduce build complexity
2. Raise the Node.js heap limit to 4 GB with `NODE_OPTIONS="--max-old-space-size=4096"`
3. Upgrade Vercel plan for more memory
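Exit code 137 points at memory because shells report a process killed by a signal as 128 plus the signal number: 137 = 128 + 9, and signal 9 is SIGKILL, which is what the kernel sends when a process is killed for running out of memory. A small sketch of that arithmetic, with `signal_from_status` as a hypothetical helper name:

```bash
# Hypothetical helper: print the signal number encoded in an exit status.
# Statuses above 128 mean "killed by signal (status - 128)"; for anything
# else, nothing is printed.
signal_from_status() {
  if [ "$1" -gt 128 ]; then
    echo $(( $1 - 128 ))
  fi
}
```

`signal_from_status 137` prints `9`. Note that SIGKILL can also come from other sources, such as a platform-enforced timeout, so the build logs are still worth checking.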
## Security Scan Issues
### False Positive Vulnerabilities
**Symptoms:**
- Security scan reports vulnerabilities in dev dependencies
- Vulnerabilities don't affect production
**Solutions:**
1. Audit production dependencies only: `npm audit --omit=dev` (older npm versions: `npm audit --production`)
2. Add exception in workflow if safe
3. Update to patched versions when available
### TruffleHog Blocks Commit
**Symptoms:**
- Secret detected in commit
- Deployment blocked
**Solutions:**
1. **DO NOT** commit actual secrets
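2. Remove the file containing the secret from history (`git filter-branch` still works but prints a warning recommending `git filter-repo`), then force-push the rewritten branches: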
```bash
git filter-branch --force --index-filter \
"git rm --cached --ignore-unmatch path/to/file" \
--prune-empty --tag-name-filter cat -- --all
```
3. Use `.gitignore` for sensitive files
4. Rotate exposed secrets immediately
## Monitoring Issues
### Health Check Alert Spam
**Symptoms:**
- Many health check failure alerts
- Services are actually healthy
**Solutions:**
1. Adjust health check frequency in workflow
2. Implement retry logic
3. Add grace period before alerting
4. Check for network issues
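Retry logic and a grace period can be combined in one small wrapper: the check only counts as failed after several consecutive attempts, with a pause between them. This is a sketch under stated assumptions; `retry` is a hypothetical helper, and the attempt count and delay are values to tune for the workflow.

```bash
# Hypothetical helper: run a command up to <attempts> times, sleeping <delay>
# seconds between failed attempts. Succeeds as soon as one attempt succeeds.
# Usage: retry <attempts> <delay-seconds> <command> [args...]
retry() {
  _attempts=$1
  _delay=$2
  shift 2
  _n=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$_n" -ge "$_attempts" ]; then
      echo "Failed after $_n attempts: $*" >&2
      return 1
    fi
    _n=$((_n + 1))
    sleep "$_delay"
  done
}
```

A health check step could then run `retry 3 10 curl -fsS --max-time 10 "$HEALTH_URL" > /dev/null`, where `HEALTH_URL` is a placeholder for the configured health URL. With these values, an alert fires only if the endpoint fails three times in a row.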
### Missing Deployment Logs
**Symptoms:**
- Workflow succeeds but no logs
- Can't debug issues
**Solutions:**
1. Add more logging to workflows:
```yaml
- name: Debug
  run: |
    echo "Variable: ${{ env.VAR }}"
    ls -la
```
2. Enable debug logging:
   - Re-run the workflow with "Enable debug logging" checked
3. Check platform-specific logs
## Emergency Procedures
### Complete Deployment Failure
**All platforms failing:**
1. **Check Status Pages:**
   - https://www.githubstatus.com
   - https://www.railwaystatus.com
   - https://www.cloudflarestatus.com
   - https://www.vercel-status.com
2. **Rollback:**
```bash
# Railway
railway rollback
# Cloudflare
wrangler rollback
# Vercel
vercel rollback <deployment-url>
```
3. **Manual Deploy:**
```bash
# Deploy to Railway directly
cd BlackRoad-Private
railway up
# Deploy to Cloudflare
wrangler deploy
# Deploy to Vercel
vercel --prod
```
### Data Loss Prevention
**If deployment corrupts data:**
1. **Immediate:**
   - Stop all deployments
   - Isolate affected services
2. **Restore:**
   - Find latest backup artifact
   - Download and extract
   - Redeploy from backup
3. **Verify:**
   - Test all endpoints
   - Check data integrity
   - Monitor for issues
## Getting Help
### Before Opening Issue
1. Check this troubleshooting guide
2. Review workflow logs
3. Check platform status pages
4. Search existing issues
### Creating Issue
Include:
- Platform affected (Railway/Cloudflare/Vercel)
- Workflow run URL
- Error message (full text)
- Steps to reproduce
- Expected vs actual behavior
### Escalation
For critical production issues:
1. Create high-priority issue
2. Tag repository maintainers
3. Contact platform support if needed
## Useful Commands
### Debugging
```bash
# Check all platform statuses
curl -s https://your-railway-url/api/health | jq
curl -s https://your-cloudflare-url/api/health | jq
curl -s https://your-vercel-url/api/health | jq
# View recent deployments
railway deployments
wrangler deployments list
vercel ls
# Check logs
railway logs --tail
wrangler tail
vercel logs
# Test build locally
npm ci
npm run build
npm start
```
### Quick Fixes
```bash
# Clear all caches (keep package-lock.json so installs stay reproducible)
npm cache clean --force
rm -rf node_modules
npm ci
# Reset to working state (discards uncommitted local changes)
git fetch origin
git reset --hard origin/main
# Force redeploy
git commit --allow-empty -m "trigger deploy"
git push
```
## Prevention
### Best Practices
1. **Test Locally First:**
```bash
npm ci && npm run build && npm start
```
2. **Use Feature Branches:**
   - Never push directly to main
   - Use PRs for review
   - Test in staging first
3. **Monitor Actively:**
   - Check health dashboards daily
   - Review security scans weekly
   - Verify backups monthly
4. **Document Changes:**
   - Update README for config changes
   - Note breaking changes
   - Keep runbooks current
5. **Have Rollback Plan:**
   - Know how to rollback each platform
   - Keep previous deployment info
   - Test rollback procedures
## Additional Resources
- **Railway Docs:** https://docs.railway.app
- **Cloudflare Docs:** https://developers.cloudflare.com
- **Vercel Docs:** https://vercel.com/docs
- **GitHub Actions:** https://docs.github.com/actions