🔧 Troubleshooting Guide
Common issues and solutions for BlackRoad-Private infrastructure.
Deployment Issues
Railway Deployment Fails
Symptoms:
- Workflow fails at "Deploy to Railway" step
- Error: "Authentication failed"
Solutions:
- Verify RAILWAY_TOKEN secret is set
- Check token hasn't expired
- Regenerate token: https://railway.app/account/tokens
- Ensure RAILWAY_PROJECT_ID matches your project
# Test locally
railway login
railway link
railway status
Cloudflare Workers Deployment Fails
Symptoms:
- Error: "Account ID not found"
- Error: "Zone ID invalid"
Solutions:
- Verify secrets in GitHub:
  - CLOUDFLARE_API_TOKEN
  - CLOUDFLARE_ACCOUNT_ID
  - CLOUDFLARE_ZONE_ID
- Check API token permissions:
  - Workers: Edit
  - Account Settings: Read
  - Zone: Edit
# Test locally
wrangler whoami
wrangler deploy
Vercel Deployment Fails
Symptoms:
- Error: "Project not found"
- Error: "Team/Organization mismatch"
Solutions:
- Verify all Vercel secrets:
  - VERCEL_TOKEN
  - VERCEL_ORG_ID
  - VERCEL_PROJECT_ID
- Check token scope includes deployment access
# Test locally
vercel whoami
vercel --prod
Health Check Issues
All Health Checks Failing
Symptoms:
- Health check workflow shows all platforms unhealthy
- No actual service issues
Solutions:
- Verify health URL secrets are set correctly
- Check health endpoints return 200 status
- Ensure endpoints don't require authentication
# Test manually
curl -v https://your-service-url/api/health
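The manual test above can be turned into a pass/fail check. The following is a minimal sketch, not the repository's actual health check: the function name check_health and the 10-second default timeout are assumptions made for illustration.

```shell
# Hypothetical helper: succeeds only when the endpoint answers with HTTP 200.
# Usage: check_health <url> [timeout_seconds]
check_health() {
  url="$1"
  max_time="${2:-10}"
  # -s: silent, -o /dev/null: discard the body, -w: print only the status code.
  # curl prints 000 when the request itself fails (DNS, TLS, timeout).
  code=$(curl -s -o /dev/null --max-time "$max_time" -w '%{http_code}' "$url") || true
  if [ "$code" = "200" ]; then
    echo "healthy ($code)"
    return 0
  fi
  echo "unhealthy ($code)"
  return 1
}
```

For example, check_health https://your-service-url/api/health prints the status code it saw, which helps tell an authentication problem (401/403) from an unreachable host (000).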
Intermittent Health Check Failures
Symptoms:
- Health checks sometimes fail
- Services are actually healthy
Solutions:
- Increase health check timeout (currently 100s)
- Check platform status pages for outages
- Review application logs for slow responses
Build Issues
npm install Fails
Symptoms:
- Build fails at dependency installation
- Error: "EACCES" or "Permission denied"
Solutions:
- Check package-lock.json is committed
- Verify Node version (requires 20+)
- Clear cache and retry:
npm cache clean --force
npm ci
Build Timeout
Symptoms:
- Build takes too long and times out
- Error: "Process exceeded timeout"
Solutions:
- Optimize build process
- Increase timeout in railway.json
- Use build cache effectively
- Consider multi-stage builds
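Before changing any platform timeout, it can help to reproduce the limit locally. This is a rough sketch that assumes GNU coreutils timeout is installed; run_with_limit is a hypothetical name, and the limit to use depends on your platform settings.

```shell
# Hypothetical helper: run a command under a time limit and report whether it was cut off.
# Usage: run_with_limit <seconds> <command> [args...]
run_with_limit() {
  limit="$1"
  shift
  rc=0
  timeout "$limit" "$@" || rc=$?
  # GNU timeout exits with 124 when it had to stop the command.
  if [ "$rc" -eq 124 ]; then
    echo "timed out after ${limit}s" >&2
  fi
  return "$rc"
}
```

For example, run_with_limit 600 npm run build fails with status 124 if the build needs more than ten minutes.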
Secret Issues
Secret Not Found
Symptoms:
- Workflow fails with "secret not found"
- Variable shows as empty
Solutions:
- Go to Settings → Secrets and variables → Actions
- Verify secret name matches exactly (case-sensitive)
- Check secret is in correct environment
- GitHub never displays stored secret values; if a value is in doubt, re-create the secret
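A missing secret evaluates to an empty string inside a workflow, so a step can fail fast when a required value is absent. The sketch below assumes the secrets have been mapped into the step's environment (for example through an env: block); require_vars is a hypothetical helper, not part of any tool named in this guide.

```shell
# Hypothetical helper: report every named variable that is unset or empty.
# Usage: require_vars NAME [NAME...]
require_vars() {
  missing=0
  for name in "$@"; do
    # Indirect lookup of the variable called $name; names are trusted input here.
    eval "value=\${$name:-}"
    if [ -z "$value" ]; then
      echo "missing: $name" >&2
      missing=1
    fi
  done
  return "$missing"
}
```

For example, require_vars RAILWAY_TOKEN RAILWAY_PROJECT_ID at the top of a deploy step names the missing variable instead of failing later with an authentication error.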
Secret Exposed in Logs
Symptoms:
- Secret value visible in workflow logs
Solutions:
- Immediately rotate the secret
- Check for echo commands that output secrets
- Rely on *** masking: values stored in GitHub Secrets are masked automatically in logs
- Never log full API responses
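Automatic masking covers only values stored as GitHub Secrets. A value derived at runtime, such as a short-lived token fetched during the job, can be registered with the ::add-mask:: workflow command. A minimal sketch, where mask_value is a hypothetical wrapper:

```shell
# Hypothetical helper: ask the Actions runner to mask a runtime value in later log lines.
# Usage: mask_value "$derived_token"
mask_value() {
  # Nothing to mask when the value is empty.
  [ -n "$1" ] || return 0
  # ::add-mask:: is a GitHub Actions workflow command read from stdout.
  printf '::add-mask::%s\n' "$1"
}
```

Call it before the value is first used, since masking applies only to output produced afterwards.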
Workflow Issues
Workflow Doesn't Trigger
Symptoms:
- Push to main but no workflow runs
- Manual dispatch button missing
Solutions:
- Check workflow file syntax (YAML is valid)
- Verify workflow file is in .github/workflows/
- Check branch name matches trigger
- Ensure workflow is enabled (Actions → Workflows)
# Validate YAML
yamllint .github/workflows/*.yml
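GitHub only loads .yml and .yaml files placed directly in .github/workflows/, so a file in a subdirectory or with another extension never triggers. The following sketch lists what would be picked up; list_workflows is a hypothetical name, and it assumes a find that supports -maxdepth (GNU and BSD both do).

```shell
# Hypothetical helper: list workflow files GitHub would pick up from a repository root.
# Usage: list_workflows [repo_root]
list_workflows() {
  dir="${1:-.}/.github/workflows"
  if [ ! -d "$dir" ]; then
    echo "no workflow directory at $dir" >&2
    return 1
  fi
  # Only .yml and .yaml files directly inside the directory are treated as workflows.
  find "$dir" -maxdepth 1 -type f \( -name '*.yml' -o -name '*.yaml' \) | sort
}
```

If the expected workflow is missing from the output, check its location and extension before debugging the trigger itself.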
Workflow Stuck on "Queued"
Symptoms:
- Workflow shows "Queued" for extended time
- No progress
Solutions:
- Check GitHub Actions minutes quota
- Verify no concurrent job limits hit
- Check for required status checks blocking
- Cancel and re-run workflow
Platform-Specific Issues
Railway
Port Binding Error
Error: address already in use
Solution:
# Ensure the app listens on the PORT variable Railway provides
# (set PORT in the service variables only if a fixed port is required)
PORT=3000
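A hard-coded port is one common source of binding conflicts. A start script can instead read PORT from the environment and fall back to a default. This is a sketch under the assumption that the platform supplies PORT at runtime; resolve_port and the 3000 fallback are illustrative.

```shell
# Hypothetical helper: resolve the port to listen on, preferring the PORT variable.
# Usage: resolve_port [fallback]
resolve_port() {
  fallback="${1:-3000}"
  port="${PORT:-$fallback}"
  # Reject anything that is not a plain number.
  case "$port" in
    ''|*[!0-9]*) echo "invalid PORT: $port" >&2; return 1 ;;
  esac
  echo "$port"
}
```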
Database Connection Fails
Error: connection refused
Solution:
- Check database service is running
- Verify connection string in environment
- Check Railway service dependencies
Cloudflare
Worker Exceeds Size Limit
Error: script too large
Solutions:
- Reduce bundle size
- Use webpack/rollup to minimize
- Split into multiple workers
- Use ES modules
KV Namespace Not Found
Error: KV namespace binding not found
Solution:
# Create namespace (newer Wrangler releases use the form: wrangler kv namespace create)
wrangler kv:namespace create BLACKROAD_PRIVATE_KV
# Update wrangler.toml with namespace ID
Vercel
Function Timeout
Error: FUNCTION_INVOCATION_TIMEOUT
Solutions:
- Optimize function performance
- Increase timeout in vercel.json:
    {
      "functions": {
        "api/*.js": {
          "maxDuration": 10
        }
      }
    }
Build Memory Exceeded
Error: Command failed with exit code 137
Solutions:
- Reduce build complexity
- Use NODE_OPTIONS="--max_old_space_size=4096"
- Upgrade Vercel plan for more memory
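Exit code 137 follows the shell convention that a process killed by signal N reports 128 + N. 137 therefore means signal 9 (SIGKILL), which is what an out-of-memory kill typically looks like. The arithmetic can be sketched as follows; explain_exit is a hypothetical helper.

```shell
# Hypothetical helper: explain an exit code above 128 as "killed by signal N".
# Usage: explain_exit <code>
explain_exit() {
  code="$1"
  if [ "$code" -gt 128 ]; then
    # Shells report death by signal N as 128 + N, so 137 means signal 9 (SIGKILL).
    echo "exit $code: killed by signal $((code - 128))"
  else
    echo "exit $code: process exited on its own"
  fi
}
```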
Security Scan Issues
False Positive Vulnerabilities
Symptoms:
- Security scan reports vulnerabilities in dev dependencies
- Vulnerabilities don't affect production
Solutions:
- Run: npm audit --omit=dev (replaces the deprecated --production flag)
- Add exception in workflow if safe
- Update to patched versions when available
TruffleHog Blocks Commit
Symptoms:
- Secret detected in commit
- Deployment blocked
Solutions:
- DO NOT commit actual secrets
- Remove secret from history:
git filter-branch --force --index-filter \
"git rm --cached --ignore-unmatch path/to/file" \
--prune-empty --tag-name-filter cat -- --all
- Use .gitignore for sensitive files
- Rotate exposed secrets immediately
Monitoring Issues
Health Check Alert Spam
Symptoms:
- Many health check failure alerts
- Services are actually healthy
Solutions:
- Adjust health check frequency in workflow
- Implement retry logic
- Add grace period before alerting
- Check for network issues
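Retry logic with a pause between attempts is one way to give a service a grace period before an alert fires. The sketch below is generic shell, not the project's workflow code; retry and its argument order are assumptions.

```shell
# Hypothetical helper: run a command up to <attempts> times, pausing between tries.
# Usage: retry <attempts> <delay_seconds> <command> [args...]
retry() {
  attempts="$1"
  delay="$2"
  shift 2
  n=1
  while [ "$n" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    echo "attempt $n/$attempts failed" >&2
    # Skip the pause after the final attempt.
    if [ "$n" -lt "$attempts" ]; then
      sleep "$delay"
    fi
    n=$((n + 1))
  done
  return 1
}
```

For example, retry 3 10 curl -sf https://your-service-url/api/health reports failure only after three failed requests spaced ten seconds apart.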
Missing Deployment Logs
Symptoms:
- Workflow succeeds but no logs
- Can't debug issues
Solutions:
- Add more logging to workflows:
    - name: Debug
      run: |
        echo "Variable: ${{ env.VAR }}"
        ls -la
- Enable debug logging:
  - Re-run workflow with "Enable debug logging"
- Check platform-specific logs
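Extra logging is easier to read when each command sits in its own collapsible group and its exit status is printed explicitly. A minimal sketch using the ::group:: and ::endgroup:: workflow commands; run_logged is a hypothetical wrapper.

```shell
# Hypothetical helper: wrap a command in a collapsible log group and print its exit code.
# Usage: run_logged <title> <command> [args...]
run_logged() {
  title="$1"
  shift
  rc=0
  # ::group:: and ::endgroup:: are GitHub Actions workflow commands.
  echo "::group::$title"
  "$@" || rc=$?
  echo "::endgroup::"
  echo "$title exited with status $rc"
  return "$rc"
}
```

For example, run_logged "Build" npm run build keeps the build output folded in the Actions UI while the status line stays visible.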
Emergency Procedures
Complete Deployment Failure
All platforms failing:
- Check the status pages for Railway, Cloudflare, and Vercel
- Rollback:
# Railway
railway rollback
# Cloudflare
wrangler rollback
# Vercel
vercel rollback <deployment-url>
- Manual Deploy:
# Deploy to Railway directly
cd BlackRoad-Private
railway up
# Deploy to Cloudflare
wrangler deploy
# Deploy to Vercel
vercel --prod
Data Loss Prevention
If deployment corrupts data:
- Immediate:
  - Stop all deployments
  - Isolate affected services
- Restore:
  - Find latest backup artifact
  - Download and extract
  - Redeploy from backup
- Verify:
  - Test all endpoints
  - Check data integrity
  - Monitor for issues
Getting Help
Before Opening Issue
- Check this troubleshooting guide
- Review workflow logs
- Check platform status pages
- Search existing issues
Creating Issue
Include:
- Platform affected (Railway/Cloudflare/Vercel)
- Workflow run URL
- Error message (full text)
- Steps to reproduce
- Expected vs actual behavior
Escalation
For critical production issues:
- Create high-priority issue
- Tag repository maintainers
- Contact platform support if needed
Useful Commands
Debugging
# Check all platform statuses
curl -s https://your-railway-url/api/health | jq
curl -s https://your-cloudflare-url/api/health | jq
curl -s https://your-vercel-url/api/health | jq
# View recent deployments
railway deployments
wrangler deployments list
vercel ls
# Check logs
railway logs --tail
wrangler tail
vercel logs
# Test build locally
npm ci
npm run build
npm start
Quick Fixes
# Clear all caches (this regenerates package-lock.json; review the diff before committing)
npm cache clean --force
rm -rf node_modules package-lock.json
npm install
# Reset to working state (discards local commits and uncommitted changes)
git fetch origin
git reset --hard origin/main
# Force redeploy
git commit --allow-empty -m "trigger deploy"
git push
Prevention
Best Practices
- Test Locally First:
    npm ci && npm run build && npm start
- Use Feature Branches:
  - Never push directly to main
  - Use PRs for review
  - Test in staging first
- Monitor Actively:
  - Check health dashboards daily
  - Review security scans weekly
  - Verify backups monthly
- Document Changes:
  - Update README for config changes
  - Note breaking changes
  - Keep runbooks current
- Have Rollback Plan:
  - Know how to rollback each platform
  - Keep previous deployment info
  - Test rollback procedures
Additional Resources
- Railway Docs: https://docs.railway.app
- Cloudflare Docs: https://developers.cloudflare.com
- Vercel Docs: https://vercel.com/docs
- GitHub Actions: https://docs.github.com/actions