# 🔧 Troubleshooting Guide

Common issues and solutions for BlackRoad-Private infrastructure.

## Deployment Issues

### Railway Deployment Fails

**Symptoms:**
- Workflow fails at "Deploy to Railway" step
- Error: "Authentication failed"

**Solutions:**
1. Verify `RAILWAY_TOKEN` secret is set
2. Check token hasn't expired
3. Regenerate token: https://railway.app/account/tokens
4. Ensure `RAILWAY_PROJECT_ID` matches your project

```bash
# Test locally
railway login
railway link
railway status
```

### Cloudflare Workers Deployment Fails

**Symptoms:**
- Error: "Account ID not found"
- Error: "Zone ID invalid"

**Solutions:**
1. Verify secrets in GitHub:
   - `CLOUDFLARE_API_TOKEN`
   - `CLOUDFLARE_ACCOUNT_ID`
   - `CLOUDFLARE_ZONE_ID`
2. Check API token permissions:
   - Workers: Edit
   - Account Settings: Read
   - Zone: Edit

```bash
# Test locally
wrangler whoami
wrangler deploy
```

### Vercel Deployment Fails

**Symptoms:**
- Error: "Project not found"
- Error: "Team/Organization mismatch"

**Solutions:**
1. Verify all Vercel secrets:
   - `VERCEL_TOKEN`
   - `VERCEL_ORG_ID`
   - `VERCEL_PROJECT_ID`
2. Check token scope includes deployment access

```bash
# Test locally
vercel whoami
vercel --prod
```

## Health Check Issues

### All Health Checks Failing

**Symptoms:**
- Health check workflow shows all platforms unhealthy
- No actual service issues

**Solutions:**
1. Verify health URL secrets are set correctly
2. Check health endpoints return 200 status
3. Ensure endpoints don't require authentication

```bash
# Test manually
curl -v https://your-service-url/api/health
```

### Intermittent Health Check Failures

**Symptoms:**
- Health checks sometimes fail
- Services are actually healthy

**Solutions:**
1. Increase health check timeout (currently 100s)
2. Check platform status pages for outages
3. Review application logs for slow responses

## Build Issues

### npm install Fails

**Symptoms:**
- Build fails at dependency installation
- Error: "EACCES" or "Permission denied"

**Solutions:**
1. Check `package-lock.json` is committed
2. Verify Node version (requires 20+)
3. Clear cache and retry:
   ```bash
   npm cache clean --force
   npm ci
   ```

### Build Timeout

**Symptoms:**
- Build takes too long and times out
- Error: "Process exceeded timeout"

**Solutions:**
1. Optimize build process
2. Increase timeout in `railway.json`
3. Use build cache effectively
4. Consider multi-stage builds

## Secret Issues

### Secret Not Found

**Symptoms:**
- Workflow fails with "secret not found"
- Variable shows as empty

**Solutions:**
1. Go to Settings → Secrets and variables → Actions
2. Verify secret name matches exactly (case-sensitive)
3. Check secret is in correct environment
4. Secret values can't be viewed after saving; if in doubt, re-create the secret

### Secret Exposed in Logs

**Symptoms:**
- Secret value visible in workflow logs

**Solutions:**
1. Immediately rotate the secret
2. Check for `echo` commands that output secrets
3. Rely on GitHub's `***` masking: values stored in GitHub Secrets are masked in logs automatically
4. Never log full API responses

## Workflow Issues

### Workflow Doesn't Trigger

**Symptoms:**
- Push to main but no workflow runs
- Manual dispatch button missing

**Solutions:**
1. Check that the workflow file is valid YAML
2. Verify workflow file is in `.github/workflows/`
3. Check branch name matches trigger
4. Ensure workflow is enabled (Actions → Workflows)

```bash
# Validate YAML
yamllint .github/workflows/*.yml
```

### Workflow Stuck on "Queued"

**Symptoms:**
- Workflow shows "Queued" for extended time
- No progress

**Solutions:**
1. Check GitHub Actions minutes quota
2. Verify no concurrent job limits hit
3. Check for required status checks blocking
4. Cancel and re-run workflow

## Platform-Specific Issues

### Railway

#### Port Binding Error

```
Error: address already in use
```

**Solution:**
```env
# Ensure PORT is set in railway.json
PORT=3000
```

#### Database Connection Fails

```
Error: connection refused
```

**Solution:**
1. Check database service is running
2. Verify connection string in environment
3. Check Railway service dependencies

### Cloudflare

#### Worker Exceeds Size Limit

```
Error: script too large
```

**Solutions:**
1. Reduce bundle size
2. Use webpack/rollup to minify the bundle
3. Split into multiple workers
4. Use ES modules

#### KV Namespace Not Found

```
Error: KV namespace binding not found
```

**Solution:**
```bash
# Create namespace
wrangler kv:namespace create BLACKROAD_PRIVATE_KV

# Update wrangler.toml with namespace ID
```

### Vercel

#### Function Timeout

```
Error: FUNCTION_INVOCATION_TIMEOUT
```

**Solutions:**
1. Optimize function performance
2. Increase timeout in `vercel.json`:
   ```json
   {
     "functions": {
       "api/*.js": {
         "maxDuration": 10
       }
     }
   }
   ```

#### Build Memory Exceeded

```
Error: Command failed with exit code 137
```

**Solutions:**
1. Reduce build complexity
2. Use `NODE_OPTIONS="--max_old_space_size=4096"`
3. Upgrade Vercel plan for more memory

## Security Scan Issues

### False Positive Vulnerabilities

**Symptoms:**
- Security scan reports vulnerabilities in dev dependencies
- Vulnerabilities don't affect production

**Solutions:**
1. Run: `npm audit --omit=dev`
2. Add exception in workflow if safe
3. Update to patched versions when available

### TruffleHog Blocks Commit

**Symptoms:**
- Secret detected in commit
- Deployment blocked

**Solutions:**
1. **DO NOT** commit actual secrets
2. Remove secret from history:
   ```bash
   git filter-branch --force --index-filter \
     "git rm --cached --ignore-unmatch path/to/file" \
     --prune-empty --tag-name-filter cat -- --all
   ```
3. Use `.gitignore` for sensitive files
4. Rotate exposed secrets immediately

## Monitoring Issues

### Health Check Alert Spam

**Symptoms:**
- Many health check failure alerts
- Services are actually healthy

**Solutions:**
1. Adjust health check frequency in workflow
2. Implement retry logic
3. Add grace period before alerting
4. Check for network issues

### Missing Deployment Logs

**Symptoms:**
- Workflow succeeds but no logs
- Can't debug issues

**Solutions:**
1. Add more logging to workflows:
   ```yaml
   - name: Debug
     run: |
       echo "Variable: ${{ env.VAR }}"
       ls -la
   ```
2. Enable debug logging:
   - Re-run workflow with "Enable debug logging"
3. Check platform-specific logs

## Emergency Procedures

### Complete Deployment Failure

**All platforms failing:**

1. **Check Status Pages:**
   - https://www.githubstatus.com
   - https://www.railwaystatus.com
   - https://www.cloudflarestatus.com
   - https://www.vercel-status.com

2. **Rollback:**
   ```bash
   # Railway
   railway rollback

   # Cloudflare
   wrangler rollback

   # Vercel
   vercel rollback
   ```

3. **Manual Deploy:**
   ```bash
   # Deploy to Railway directly
   cd BlackRoad-Private
   railway up

   # Deploy to Cloudflare
   wrangler deploy

   # Deploy to Vercel
   vercel --prod
   ```

### Data Loss Prevention

**If deployment corrupts data:**

1. **Immediate:**
   - Stop all deployments
   - Isolate affected services

2. **Restore:**
   - Find latest backup artifact
   - Download and extract
   - Redeploy from backup

3. **Verify:**
   - Test all endpoints
   - Check data integrity
   - Monitor for issues

## Getting Help

### Before Opening an Issue

1. Check this troubleshooting guide
2. Review workflow logs
3. Check platform status pages
4. Search existing issues

### Creating an Issue

Include:
- Platform affected (Railway/Cloudflare/Vercel)
- Workflow run URL
- Error message (full text)
- Steps to reproduce
- Expected vs actual behavior

### Escalation

For critical production issues:
1. Create high-priority issue
2. Tag repository maintainers
3. Contact platform support if needed

## Useful Commands

### Debugging

```bash
# Check all platform statuses
curl -s https://your-railway-url/api/health | jq
curl -s https://your-cloudflare-url/api/health | jq
curl -s https://your-vercel-url/api/health | jq

# View recent deployments
railway deployments
wrangler deployments list
vercel ls

# Check logs
railway logs --tail
wrangler tail
vercel logs

# Test build locally
npm ci
npm run build
npm start
```

### Quick Fixes

```bash
# Clear all caches
npm cache clean --force
rm -rf node_modules package-lock.json
npm install

# Reset to working state
git fetch origin
git reset --hard origin/main

# Force redeploy
git commit --allow-empty -m "trigger deploy"
git push
```

## Prevention

### Best Practices

1. **Test Locally First:**
   ```bash
   npm ci && npm run build && npm start
   ```

2. **Use Feature Branches:**
   - Never push directly to main
   - Use PRs for review
   - Test in staging first

3. **Monitor Actively:**
   - Check health dashboards daily
   - Review security scans weekly
   - Verify backups monthly

4. **Document Changes:**
   - Update README for config changes
   - Note breaking changes
   - Keep runbooks current

5. **Have a Rollback Plan:**
   - Know how to roll back each platform
   - Keep previous deployment info
   - Test rollback procedures

## Additional Resources

- **Railway Docs:** https://docs.railway.app
- **Cloudflare Docs:** https://developers.cloudflare.com
- **Vercel Docs:** https://vercel.com/docs
- **GitHub Actions:** https://docs.github.com/actions
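
The "Implement retry logic" suggestion under Health Check Alert Spam could look roughly like the sketch below. It is only an illustration: the `retry` and `check_health` function names, the 10-second curl timeout, and the attempt and delay values in the usage line are assumptions, not part of the existing workflows.

```bash
# Hypothetical helper: retry <attempts> <delay_seconds> <command...>
# Runs the command until it succeeds or the attempt limit is reached.
retry() {
  local attempts="$1" delay="$2"
  shift 2
  local n=1
  while true; do
    if "$@"; then
      return 0
    fi
    if [ "$n" -ge "$attempts" ]; then
      echo "failed after ${n} attempts: $*" >&2
      return 1
    fi
    n=$((n + 1))
    sleep "$delay"
  done
}

# Hypothetical probe: fails on HTTP 4xx/5xx responses, on connection errors,
# or if the request takes longer than 10 seconds (an assumed limit).
check_health() {
  curl --silent --fail --max-time 10 --output /dev/null "$1"
}

# Usage (placeholder URL, as elsewhere in this guide):
#   retry 3 5 check_health https://your-service-url/api/health
```

Alerting only when `retry` returns non-zero means a single slow response no longer raises an alert, which also acts as a simple grace period.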
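
For the Secret Not Found checks, a pre-flight step can report which expected variables are empty before any deploy command runs. This is a minimal sketch under stated assumptions: `require_vars` is a hypothetical helper, and it assumes each secret is exposed to the step as an exported environment variable (for example through the step's `env:` mapping). It prints only variable names, never values.

```bash
# Hypothetical pre-flight check: require_vars NAME [NAME...]
# Writes "missing: NAME" to stderr for each unset or empty exported variable
# and returns non-zero if any are missing. Values are never printed.
require_vars() {
  local missing=0 name
  for name in "$@"; do
    # printenv outputs nothing for a variable that is unset or empty
    if [ -z "$(printenv "$name")" ]; then
      echo "missing: ${name}" >&2
      missing=1
    fi
  done
  return "$missing"
}

# Usage with the secret names listed in this guide:
#   require_vars RAILWAY_TOKEN RAILWAY_PROJECT_ID \
#     CLOUDFLARE_API_TOKEN CLOUDFLARE_ACCOUNT_ID CLOUDFLARE_ZONE_ID \
#     VERCEL_TOKEN VERCEL_ORG_ID VERCEL_PROJECT_ID
```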