# 🔧 Troubleshooting Guide

Common issues and solutions for BlackRoad-Private infrastructure.

## Deployment Issues

### Railway Deployment Fails

**Symptoms:**
- Workflow fails at "Deploy to Railway" step
- Error: "Authentication failed"

**Solutions:**
1. Verify `RAILWAY_TOKEN` secret is set
2. Check token hasn't expired
3. Regenerate token: https://railway.app/account/tokens
4. Ensure `RAILWAY_PROJECT_ID` matches your project

```bash
# Test locally
railway login
railway link
railway status
```

### Cloudflare Workers Deployment Fails

**Symptoms:**
- Error: "Account ID not found"
- Error: "Zone ID invalid"

**Solutions:**
1. Verify secrets in GitHub:
   - `CLOUDFLARE_API_TOKEN`
   - `CLOUDFLARE_ACCOUNT_ID`
   - `CLOUDFLARE_ZONE_ID`

2. Check API token permissions:
   - Workers: Edit
   - Account Settings: Read
   - Zone: Edit

```bash
# Test locally
wrangler whoami
wrangler deploy
```

### Vercel Deployment Fails

**Symptoms:**
- Error: "Project not found"
- Error: "Team/Organization mismatch"

**Solutions:**
1. Verify all Vercel secrets:
   - `VERCEL_TOKEN`
   - `VERCEL_ORG_ID`
   - `VERCEL_PROJECT_ID`

2. Check token scope includes deployment access

```bash
# Test locally
vercel whoami
vercel --prod
```

## Health Check Issues

### All Health Checks Failing

**Symptoms:**
- Health check workflow shows all platforms unhealthy
- No actual service issues

**Solutions:**
1. Verify health URL secrets are set correctly
2. Check health endpoints return 200 status
3. Ensure endpoints don't require authentication

```bash
# Test manually
curl -v https://your-service-url/api/health
```
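The manual check can also be scripted so that the status code points at the likely cause. This is a minimal sketch, assuming `curl` is installed; `http_status` and `diagnose_status` are hypothetical helper names, and the URL in the example is the same placeholder used above.

```bash
#!/usr/bin/env bash
# Print only the HTTP status code for a URL (curl prints 000 when no response arrives).
http_status() {
  curl -s -o /dev/null --max-time 10 -w '%{http_code}' "$1"
}

# Map a status code to the most likely cause from the list above.
diagnose_status() {
  case "$1" in
    200) echo "healthy" ;;
    401|403) echo "endpoint requires authentication" ;;
    000) echo "no response (check the health URL secret)" ;;
    *) echo "unexpected status $1" ;;
  esac
}

# Example (not run automatically):
#   diagnose_status "$(http_status https://your-service-url/api/health)"
```

Separating the network call from the interpretation keeps the mapping easy to adjust if your endpoints use other status codes.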

### Intermittent Health Check Failures

**Symptoms:**
- Health checks sometimes fail
- Services are actually healthy

**Solutions:**
1. Increase health check timeout (currently 100s)
2. Check platform status pages for outages
3. Review application logs for slow responses

## Build Issues

### npm install Fails

**Symptoms:**
- Build fails at dependency installation
- Error: "EACCES" or "Permission denied"

**Solutions:**
1. Check `package-lock.json` is committed
2. Verify Node version (requires 20+)
3. Clear cache and retry:

```bash
npm cache clean --force
npm ci
```
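The Node version requirement can be checked before installing. A small sketch, assuming the version string has the usual `vMAJOR.MINOR.PATCH` shape printed by `node --version`; `node_major` and `node_version_ok` are hypothetical helper names.

```bash
#!/usr/bin/env bash
# Extract the major version from a string such as "v20.11.0".
node_major() {
  local version="${1#v}"   # drop the leading "v" if present
  echo "${version%%.*}"    # keep everything before the first dot
}

# Succeed when the version string meets the required major version (default 20).
node_version_ok() {
  [ "$(node_major "$1")" -ge "${2:-20}" ]
}

# Example (not run automatically):
#   node_version_ok "$(node --version)" || echo "Node 20+ required"
```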

### Build Timeout

**Symptoms:**
- Build takes too long and times out
- Error: "Process exceeded timeout"

**Solutions:**
1. Optimize build process
2. Increase timeout in `railway.json`
3. Use build cache effectively
4. Consider multi-stage builds

## Secret Issues

### Secret Not Found

**Symptoms:**
- Workflow fails with "secret not found"
- Variable shows as empty

**Solutions:**
1. Go to Settings → Secrets and variables → Actions
2. Verify secret name matches exactly (case-sensitive)
3. Check secret is in correct environment
4. Secret values are never shown after saving; if in doubt, re-create the secret
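An empty variable is easier to spot if the workflow fails fast before deploying. The following is a sketch of such a preflight step, assuming the secrets are passed to the step through `env:` so that they arrive as exported environment variables; `missing_vars` is a hypothetical helper name.

```bash
#!/usr/bin/env bash
# Print the name of every listed environment variable that is unset or empty.
missing_vars() {
  local name
  for name in "$@"; do
    # printenv only sees exported variables, which is how `env:` entries reach a step
    if [ -z "$(printenv "$name")" ]; then
      echo "$name"
    fi
  done
}

# Example (not run automatically):
#   missing="$(missing_vars RAILWAY_TOKEN RAILWAY_PROJECT_ID)"
#   [ -z "$missing" ] || { echo "Missing: $missing" >&2; exit 1; }
```

Only the variable names are printed, never their values, so the check is safe to leave in workflow logs.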

### Secret Exposed in Logs

**Symptoms:**
- Secret value visible in workflow logs

**Solutions:**
1. Immediately rotate the secret
2. Check for `echo` commands that output secrets
3. Rely on masking only for stored secrets: values in GitHub Secrets are automatically replaced with `***` in logs, but values derived from them (encoded or transformed) may not be
4. Never log full API responses
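For values that are not stored secrets, such as a token fetched at runtime, GitHub Actions provides the `::add-mask::` workflow command. A minimal sketch follows; `mask_value` and `DERIVED_TOKEN` are hypothetical names used only for illustration.

```bash
#!/usr/bin/env bash
# Emit the workflow command that tells the runner to mask a value in later log output.
mask_value() {
  printf '::add-mask::%s\n' "$1"
}

# Example (not run automatically): mask the value before any step can print it.
#   mask_value "$DERIVED_TOKEN"
```

Masking applies only to output produced after the command runs, so call it as early as possible.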

## Workflow Issues

### Workflow Doesn't Trigger

**Symptoms:**
- Push to main but no workflow runs
- Manual dispatch button missing

**Solutions:**
1. Check workflow file syntax (YAML is valid)
2. Verify workflow file is in `.github/workflows/`
3. Check branch name matches trigger
4. Ensure workflow is enabled (Actions → Workflows)

```bash
# Validate YAML
yamllint .github/workflows/*.yml
```
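Workflow files must also carry a `.yml` or `.yaml` extension, otherwise GitHub Actions ignores them. As a rough sketch, a stray file can be found like this; `ignored_workflow_files` is a hypothetical helper name, and the default directory is the one named above.

```bash
#!/usr/bin/env bash
# List files in a workflows directory that lack a .yml or .yaml extension.
ignored_workflow_files() {
  local dir="${1:-.github/workflows}" file
  for file in "$dir"/*; do
    [ -f "$file" ] || continue   # skip subdirectories and an unmatched glob
    case "$file" in
      *.yml|*.yaml) ;;           # recognised workflow extensions
      *) echo "$file" ;;
    esac
  done
}

# Example (not run automatically):
#   ignored_workflow_files .github/workflows
```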

### Workflow Stuck on "Queued"

**Symptoms:**
- Workflow shows "Queued" for extended time
- No progress

**Solutions:**
1. Check GitHub Actions minutes quota
2. Verify no concurrent job limits hit
3. Check for required status checks blocking
4. Cancel and re-run workflow

## Platform-Specific Issues

### Railway

#### Port Binding Error
```
Error: address already in use
```

**Solution:**
```env
# Ensure PORT is set in railway.json
PORT=3000
```

#### Database Connection Fails
```
Error: connection refused
```

**Solution:**
1. Check database service is running
2. Verify connection string in environment
3. Check Railway service dependencies

### Cloudflare

#### Worker Exceeds Size Limit
```
Error: script too large
```

**Solutions:**
1. Reduce bundle size
2. Use webpack/rollup to minify the bundle
3. Split into multiple workers
4. Use ES modules

#### KV Namespace Not Found
```
Error: KV namespace binding not found
```

**Solution:**
```bash
# Create namespace
# (newer Wrangler releases spell this `wrangler kv namespace create`, without the colon)
wrangler kv:namespace create BLACKROAD_PRIVATE_KV

# Update wrangler.toml with namespace ID
```
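The `wrangler.toml` update is a `[[kv_namespaces]]` entry that pairs the binding name with the namespace ID. As a sketch, the stanza could be generated like this; `kv_binding_stanza` is a hypothetical helper, and `<namespace-id>` stands for the ID printed by the create command.

```bash
#!/usr/bin/env bash
# Print a wrangler.toml KV binding stanza for a binding name and namespace ID.
kv_binding_stanza() {
  printf '[[kv_namespaces]]\nbinding = "%s"\nid = "%s"\n' "$1" "$2"
}

# Example (not run automatically): review the output before appending it.
#   kv_binding_stanza BLACKROAD_PRIVATE_KV "<namespace-id>" >> wrangler.toml
```

The binding name must match the name the Worker code uses to reach the namespace.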

### Vercel

#### Function Timeout
```
Error: FUNCTION_INVOCATION_TIMEOUT
```

**Solutions:**
1. Optimize function performance
2. Increase timeout in `vercel.json`:
   ```json
   {
     "functions": {
       "api/*.js": {
         "maxDuration": 10
       }
     }
   }
   ```

#### Build Memory Exceeded
```
Error: Command failed with exit code 137
```

**Solutions:**
1. Reduce build complexity
2. Use `NODE_OPTIONS="--max_old_space_size=4096"`
3. Upgrade Vercel plan for more memory

## Security Scan Issues

### False Positive Vulnerabilities

**Symptoms:**
- Security scan reports vulnerabilities in dev dependencies
- Vulnerabilities don't affect production

**Solutions:**
1. Run: `npm audit --omit=dev` (`npm audit --production` on older npm versions)
2. Add exception in workflow if safe
3. Update to patched versions when available

### TruffleHog Blocks Commit

**Symptoms:**
- Secret detected in commit
- Deployment blocked

**Solutions:**
1. **DO NOT** commit actual secrets
2. Remove secret from history:
   ```bash
   git filter-branch --force --index-filter \
     "git rm --cached --ignore-unmatch path/to/file" \
     --prune-empty --tag-name-filter cat -- --all
   ```
3. Use `.gitignore` for sensitive files
4. Rotate exposed secrets immediately

## Monitoring Issues

### Health Check Alert Spam

**Symptoms:**
- Many health check failure alerts
- Services are actually healthy

**Solutions:**
1. Adjust health check frequency in workflow
2. Implement retry logic
3. Add grace period before alerting
4. Check for network issues
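Retry logic and a grace period can be combined in one small wrapper: alert only when every attempt fails. This is a sketch under assumptions, not the workflow's actual implementation; `retry` is a hypothetical helper, and the attempt count and delay in the example are illustrative values.

```bash
#!/usr/bin/env bash
# Run a command up to $1 times, sleeping $2 seconds between attempts.
# Succeeds as soon as one attempt succeeds; fails only if every attempt fails.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"   # grace period before the next attempt
    fi
    i=$((i + 1))
  done
  return 1
}

# Example (not run automatically): three attempts, ten seconds apart.
#   retry 3 10 curl -fsS --max-time 10 https://your-service-url/api/health
```

With `curl -f`, an HTTP error status counts as a failed attempt, so a single slow or failed response no longer triggers an alert on its own.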

### Missing Deployment Logs

**Symptoms:**
- Workflow succeeds but no logs
- Can't debug issues

**Solutions:**
1. Add more logging to workflows:
   ```yaml
   - name: Debug
     run: |
       echo "Variable: ${{ env.VAR }}"
       ls -la
   ```
2. Enable debug logging:
   - Re-run workflow with "Enable debug logging"
3. Check platform-specific logs

## Emergency Procedures

### Complete Deployment Failure

**All platforms failing:**

1. **Check Status Pages:**
   - https://www.githubstatus.com
   - https://www.railwaystatus.com
   - https://www.cloudflarestatus.com
   - https://www.vercel-status.com

2. **Rollback:**
   ```bash
   # Railway
   railway rollback

   # Cloudflare
   wrangler rollback

   # Vercel
   vercel rollback <deployment-url>
   ```

3. **Manual Deploy:**
   ```bash
   # Deploy to Railway directly
   cd BlackRoad-Private
   railway up

   # Deploy to Cloudflare
   wrangler deploy

   # Deploy to Vercel
   vercel --prod
   ```

### Data Loss Prevention

**If deployment corrupts data:**

1. **Immediate:**
   - Stop all deployments
   - Isolate affected services

2. **Restore:**
   - Find latest backup artifact
   - Download and extract
   - Redeploy from backup

3. **Verify:**
   - Test all endpoints
   - Check data integrity
   - Monitor for issues

## Getting Help

### Before Opening Issue

1. Check this troubleshooting guide
2. Review workflow logs
3. Check platform status pages
4. Search existing issues

### Creating Issue

Include:
- Platform affected (Railway/Cloudflare/Vercel)
- Workflow run URL
- Error message (full text)
- Steps to reproduce
- Expected vs actual behavior

### Escalation

For critical production issues:
1. Create high-priority issue
2. Tag repository maintainers
3. Contact platform support if needed

## Useful Commands

### Debugging

```bash
# Check all platform statuses
curl -s https://your-railway-url/api/health | jq
curl -s https://your-cloudflare-url/api/health | jq
curl -s https://your-vercel-url/api/health | jq

# View recent deployments
railway deployments
wrangler deployments list
vercel ls

# Check logs
railway logs --tail
wrangler tail
vercel logs

# Test build locally
npm ci
npm run build
npm start
```

### Quick Fixes

```bash
# Clear all caches
# (this regenerates package-lock.json; review and commit the result)
npm cache clean --force
rm -rf node_modules package-lock.json
npm install

# Reset to working state
# (discards all uncommitted local changes)
git fetch origin
git reset --hard origin/main

# Force redeploy
git commit --allow-empty -m "trigger deploy"
git push
```

## Prevention

### Best Practices

1. **Test Locally First:**
   ```bash
   npm ci && npm run build && npm start
   ```

2. **Use Feature Branches:**
   - Never push directly to main
   - Use PRs for review
   - Test in staging first

3. **Monitor Actively:**
   - Check health dashboards daily
   - Review security scans weekly
   - Verify backups monthly

4. **Document Changes:**
   - Update README for config changes
   - Note breaking changes
   - Keep runbooks current

5. **Have Rollback Plan:**
   - Know how to rollback each platform
   - Keep previous deployment info
   - Test rollback procedures

## Additional Resources

- **Railway Docs:** https://docs.railway.app
- **Cloudflare Docs:** https://developers.cloudflare.com
- **Vercel Docs:** https://vercel.com/docs
- **GitHub Actions:** https://docs.github.com/actions