Production Deployment Checklist - Beta 1.0

Project: cidadao.ai Backend Multi-Agent System
Target Environment: Railway Production
Date: 2025-10-27
Version: Beta 1.0

🎯 Pre-Deployment Checklist

1. Code Quality & Testing ✅

2. Environment Configuration 🔧

Required Environment Variables

Core Services

# CRITICAL - Must be set
SECRET_KEY=<generate-with-scripts/generate_secrets.py>
JWT_SECRET_KEY=<generate-with-scripts/generate_secrets.py>

# LLM Provider Configuration (Primary: Maritaca AI)
LLM_PROVIDER=maritaca
MARITACA_API_KEY=<maritaca-api-key>
MARITACA_MODEL=sabiazinho-3

# LLM Provider (Backup: Anthropic Claude)
ANTHROPIC_API_KEY=<anthropic-key>
ANTHROPIC_MODEL=claude-sonnet-4-20250514
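
The contents of scripts/generate_secrets.py are not shown in this checklist. As a minimal sketch of what such a script might do, assuming it only needs to emit long random strings, Python's standard `secrets` module is enough. The function name `generate_secret` and the 64-byte length are illustrative assumptions, not the project's actual implementation.

```python
import secrets


def generate_secret(num_bytes: int = 64) -> str:
    """Return a URL-safe random string built from num_bytes of OS entropy."""
    return secrets.token_urlsafe(num_bytes)


# Print both values in KEY=value form for pasting into the Railway dashboard.
for name in ("SECRET_KEY", "JWT_SECRET_KEY"):
    print(f"{name}={generate_secret()}")
```

Each call draws fresh entropy, so SECRET_KEY and JWT_SECRET_KEY get independent values.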

Database & Cache

# PostgreSQL (Railway managed)
DATABASE_URL=postgresql://user:pass@host:port/cidadaoai_prod

# Redis (Railway addon or external)
REDIS_URL=redis://default:password@host:port

# Supabase (optional - for enhanced features)
SUPABASE_URL=https://xxx.supabase.co
SUPABASE_SERVICE_ROLE_KEY=<key>

Government APIs

# Portal da Transparência (CRITICAL for real data)
TRANSPARENCY_API_KEY=<portal-api-key>
# Get at: https://api.portaldatransparencia.gov.br/

# Dados.gov.br (optional)
DADOS_GOV_API_KEY=<dados-gov-key>

Monitoring & Observability

# Prometheus metrics
ENABLE_METRICS=true

# Sentry error tracking (recommended)
SENTRY_DSN=<sentry-dsn>
SENTRY_ENVIRONMENT=production

# Log level
LOG_LEVEL=INFO  # Options: DEBUG, INFO, WARNING, ERROR

Security & Rate Limiting

# CORS origins (comma-separated)
CORS_ORIGINS=https://cidadao.ai,https://app.cidadao.ai

# Rate limiting
RATE_LIMIT_PER_MINUTE=60
RATE_LIMIT_BURST=10

# IP Whitelist (optional - for admin endpoints)
IP_WHITELIST=203.0.113.0/24,198.51.100.0/24
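
How the backend interprets RATE_LIMIT_PER_MINUTE and RATE_LIMIT_BURST is not stated here. One common reading is a token bucket: tokens refill at the per-minute rate, and the bucket holds at most the burst size. The sketch below illustrates that reading only. The `TokenBucket` class and its injectable `clock` parameter are hypothetical, and it keeps state in process memory rather than in Redis.

```python
import time


class TokenBucket:
    """Allow rate_per_minute requests on average, with bursts up to burst."""

    def __init__(self, rate_per_minute: int = 60, burst: int = 10, clock=time.monotonic):
        self.refill_per_second = rate_per_minute / 60.0
        self.capacity = float(burst)
        self.tokens = float(burst)  # start with a full bucket
        self.clock = clock
        self.last = clock()

    def allow(self) -> bool:
        now = self.clock()
        # Refill in proportion to elapsed time, capped at the burst capacity.
        elapsed = now - self.last
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_second)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False
```

With the values above, a client could send 10 requests back to back and would then be limited to one request per second until the bucket refills.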

Configuration Validation

# Validate all required env vars are set
railway run python scripts/deployment/validate_config.py
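
The validator script itself is not reproduced in this checklist. A minimal sketch of the kind of check it could perform is shown below, assuming it only needs to confirm that variables are present and non-blank. The `REQUIRED_VARS` tuple is drawn from the variables listed above, and the choice of which ones count as mandatory is an assumption.

```python
import os

# Names taken from the "Required Environment Variables" section; which of them
# the real validator treats as mandatory is an assumption.
REQUIRED_VARS = (
    "SECRET_KEY",
    "JWT_SECRET_KEY",
    "LLM_PROVIDER",
    "MARITACA_API_KEY",
    "DATABASE_URL",
    "REDIS_URL",
    "TRANSPARENCY_API_KEY",
)


def missing_vars(env=None, required=REQUIRED_VARS):
    """Return the required variable names that are unset or blank."""
    env = os.environ if env is None else env
    return [name for name in required if not env.get(name, "").strip()]
```

A deployment script could call `missing_vars()` and abort with a non-zero exit code whenever the returned list is non-empty.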

3. Database & Migrations 🗄️

Backup Current Database

# Single quotes defer $DATABASE_URL expansion to the environment injected by Railway
railway run sh -c 'pg_dump "$DATABASE_URL"' > backup_$(date +%Y%m%d_%H%M%S).sql

Run Migrations
```
railway run alembic upgrade head
```

Verify Migration Success

railway run alembic current
railway run alembic history

Seed Initial Data (if needed)

railway run python scripts/seed_production_data.py

Database Indexes: Verify critical indexes exist
Investigations: investigation_id, user_id, status
Anomalies: contract_id, detection_date, severity
Cache: TTL indexes for automatic cleanup

4. Third-Party Services Verification ✓

LLM Providers

Maritaca AI: Verify API key works

LLM_PROVIDER=maritaca railway run python test_maritaca_integration.py

Anthropic Claude: Verify backup provider

ANTHROPIC_API_KEY=xxx railway run python -c "from anthropic import Anthropic; print(Anthropic().messages.create(model='claude-sonnet-4-20250514', max_tokens=10, messages=[{'role':'user','content':'test'}]))"

Government APIs

Portal da Transparência: Test API key

curl -H "chave-api-dados: $TRANSPARENCY_API_KEY" \
  "https://api.portaldatransparencia.gov.br/api-de-dados/orgaos-siafi"

Federal APIs: Verify IBGE, DataSUS, INEP, PNCP connectivity
```
railway run python scripts/test_federal_apis.py
```

Infrastructure

Redis Connection: Verify cache connectivity

railway run python -c "import os, redis; print(redis.from_url(os.environ['REDIS_URL']).ping())"

PostgreSQL Connection: Verify database connectivity

railway run python -c "from src.infrastructure.database import engine; engine.connect()"

5. Security Hardening 🔒

Application Security

Remove Debug Mode: Ensure DEBUG=False in production
HTTPS Only: Verify all endpoints use HTTPS
Secure Headers: CSP, HSTS, X-Frame-Options configured
CORS: Whitelist only production domains
Rate Limiting: Enabled on all public endpoints
Input Validation: All user inputs sanitized
SQL Injection: Use parameterized queries only
XSS Protection: Output encoding enabled
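
To illustrate the parameterized-query item above, the sketch below binds a user-supplied value as a parameter instead of interpolating it into the SQL string. It uses the standard-library `sqlite3` module as a stand-in for the production PostgreSQL driver. The `investigations` table layout and the `find_by_status` helper are hypothetical.

```python
import sqlite3


def find_by_status(conn: sqlite3.Connection, status: str):
    """Look up rows with the value bound as a parameter, never interpolated."""
    cursor = conn.execute(
        "SELECT investigation_id FROM investigations WHERE status = ?",
        (status,),
    )
    return [row[0] for row in cursor.fetchall()]


# In-memory database used only for this demonstration.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE investigations (investigation_id TEXT, status TEXT)")
conn.execute("INSERT INTO investigations VALUES ('inv-1', 'open')")

print(find_by_status(conn, "open"))             # ['inv-1']
# A classic injection payload is treated as a plain string and matches nothing.
print(find_by_status(conn, "open' OR '1'='1"))  # []
```

The placeholder syntax differs between drivers (`?` for sqlite3, `%s` or named parameters for most PostgreSQL drivers), but the principle is the same.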

Secrets Management

Environment Variables: Never commit secrets to git
Railway Secrets: All sensitive vars in Railway dashboard
Rotate Keys: Generate new SECRET_KEY and JWT_SECRET_KEY
API Keys: Use separate keys for prod vs dev

Access Control

Admin Endpoints: Protected with authentication
IP Whitelist: Restrict admin access if needed
JWT Expiration: Set reasonable token expiry (24h recommended)
Password Hashing: Use bcrypt/argon2 for user passwords
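
The IP whitelist item can be sketched with the standard `ipaddress` module, assuming the IP_WHITELIST value is a comma-separated list of CIDR blocks as in the example configuration. The helper names `parse_allowlist` and `is_allowed` are illustrative. How the backend actually enforces the list (middleware, proxy, or otherwise) is not specified in this checklist.

```python
import ipaddress


def parse_allowlist(raw: str):
    """Parse a comma-separated list of CIDR blocks, as in IP_WHITELIST."""
    return [ipaddress.ip_network(part.strip()) for part in raw.split(",") if part.strip()]


def is_allowed(client_ip: str, networks) -> bool:
    """Return True when client_ip falls inside any allowed network."""
    address = ipaddress.ip_address(client_ip)
    return any(address in network for network in networks)


networks = parse_allowlist("203.0.113.0/24,198.51.100.0/24")
print(is_allowed("203.0.113.7", networks))  # True
print(is_allowed("192.0.2.1", networks))    # False
```

Behind a proxy, the client address usually has to be taken from a trusted forwarding header rather than from the socket peer, which this sketch does not cover.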

6. Performance Optimization ⚡

Application Performance

Connection Pooling: PostgreSQL pool size configured

# In config.py
SQLALCHEMY_POOL_SIZE = 20
SQLALCHEMY_MAX_OVERFLOW = 40

Async Workers: Celery workers configured for background tasks

railway run celery -A src.infrastructure.queue.celery_app worker --loglevel=info

Caching Strategy: Multi-layer caching enabled
Memory cache for hot data
Redis for distributed cache
Database for persistent data
Query Optimization: Indexes on frequently queried fields
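
The multi-layer caching strategy above can be sketched as a read-through lookup: check process memory first, then the shared cache, then the persistent store, promoting the value on the way back. This is an illustration only. `LayeredCache` is a hypothetical class, a plain dict stands in for the Redis client, and `load_from_db` is an assumed callable. TTLs and eviction are deliberately left out.

```python
class LayeredCache:
    """Check memory first, then a shared cache, then the persistent loader."""

    def __init__(self, shared, load_from_db):
        self.memory = {}
        self.shared = shared              # stand-in for a Redis client (dict-like here)
        self.load_from_db = load_from_db  # hypothetical callable: key -> value

    def get(self, key):
        if key in self.memory:
            return self.memory[key]
        value = self.shared.get(key)
        if value is None:
            # Miss in both caches: load from the database and fill the shared layer.
            value = self.load_from_db(key)
            self.shared[key] = value
        # Promote the value so the next read is served from memory.
        self.memory[key] = value
        return value
```

Because the shared layer is filled on a miss, a second worker process with an empty memory layer would still avoid the database for keys another worker has already loaded.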

Resource Limits

Railway Scaling: Configure appropriate resources
RAM: 2GB minimum (4GB recommended for 16 agents)
CPU: 2 vCPUs minimum
Disk: 10GB for logs and temporary files
Timeouts: Configure reasonable timeouts
API requests: 30s
Agent processing: 60s
Database queries: 10s
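
One way to enforce the timeout budgets listed above in async Python code is `asyncio.wait_for`. The sketch below is a hedged illustration: the `TIMEOUTS` mapping and the `run_with_timeout` helper are hypothetical names, and returning `None` on timeout is a simplification that a real service would likely replace with a domain-specific error.

```python
import asyncio

# Timeout budgets in seconds, copied from the list above.
TIMEOUTS = {"api": 30, "agent": 60, "db": 10}


async def run_with_timeout(coro, kind: str, timeouts=TIMEOUTS):
    """Await coro, returning None if it exceeds the budget for its kind."""
    try:
        return await asyncio.wait_for(coro, timeout=timeouts[kind])
    except asyncio.TimeoutError:
        # wait_for cancels the inner coroutine before raising.
        return None
```

Database and HTTP clients usually expose their own timeout settings as well, and those should be set consistently with these budgets.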

7. Monitoring & Alerting 📊

Application Monitoring

Health Endpoint: /health/ returns 200 OK

curl https://cidadao-api-production.up.railway.app/health/

Metrics Endpoint: /health/metrics returns Prometheus metrics

curl https://cidadao-api-production.up.railway.app/health/metrics

Grafana Dashboards: Import production dashboards
Overview dashboard
Per-agent performance
API response times
Error rates

Error Tracking

Sentry Integration: Configure Sentry for error tracking

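import os

import sentry_sdk

sentry_sdk.init(
    dsn=os.getenv("SENTRY_DSN"),
    # Reads SENTRY_ENVIRONMENT, falling back to "production"
    environment=os.getenv("SENTRY_ENVIRONMENT", "production"),
    traces_sample_rate=0.1,  # Trace 10% of transactions
)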

Log Aggregation: Centralize logs (Railway logs, Datadog, etc.)

Alerting Rules

Error Rate: Alert if >5% of requests fail
Response Time: Alert if p95 >2s
Database: Alert if connection pool >80% full
Redis: Alert if memory >90% full
LLM API: Alert if rate limit exceeded
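
In production these rules would normally live in the monitoring stack (for example as Prometheus or Grafana alert rules). Purely to make the thresholds concrete, the sketch below evaluates the four numeric rules against a metrics snapshot. The function name and the metric key names are assumptions, and the LLM rate-limit rule is omitted because it has no numeric threshold here.

```python
def triggered_alerts(metrics: dict) -> list:
    """Return the names of alerting rules whose thresholds are exceeded."""
    rules = {
        "error_rate": metrics["error_rate"] > 0.05,            # >5% of requests fail
        "response_time": metrics["p95_seconds"] > 2.0,         # p95 above 2 s
        "db_pool": metrics["db_pool_usage"] > 0.80,            # pool >80% full
        "redis_memory": metrics["redis_memory_usage"] > 0.90,  # memory >90% full
    }
    return [name for name, fired in rules.items() if fired]


snapshot = {
    "error_rate": 0.06,
    "p95_seconds": 0.4,
    "db_pool_usage": 0.85,
    "redis_memory_usage": 0.50,
}
print(triggered_alerts(snapshot))  # ['error_rate', 'db_pool']
```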

8. Backup & Recovery 💾

Automated Backups

Database Backups: Daily automated backups enabled

# Railway: Enable automatic backups in dashboard
# Or configure manual backup script
0 2 * * * railway run sh -c 'pg_dump "$DATABASE_URL"' | gzip > /backups/db_$(date +\%Y\%m\%d).sql.gz

Backup Retention: Keep 30 days of daily backups
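
If backups are written by the cron job rather than managed by Railway, the 30-day retention has to be enforced separately. A minimal sketch follows, assuming the `db_*.sql.gz` naming from the cron line and using file modification time as the age. The `expired_backups` helper is hypothetical, and it only lists candidates, leaving deletion to the caller.

```python
import time
from pathlib import Path


def expired_backups(directory: str, retention_days: int = 30, now: float = None):
    """Return backup files whose modification time is older than the retention window."""
    now = time.time() if now is None else now
    cutoff = now - retention_days * 86400  # seconds per day
    return sorted(
        path
        for path in Path(directory).glob("db_*.sql.gz")
        if path.stat().st_mtime < cutoff
    )
```

A cleanup job could then call `path.unlink()` on each returned file after confirming that a newer backup exists.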

Disaster Recovery Plan

RTO (Recovery Time Objective): <4 hours
RPO (Recovery Point Objective): <24 hours
Backup Restoration Tested: Successfully restored from backup

Rollback Procedures

Code Rollback: Previous deployment tagged in git

TAG=production-v1.0.0-beta-$(date +%Y%m%d)  # Compute once so tag and push always match
git tag "$TAG"
git push origin "$TAG"

Database Rollback: Alembic downgrade tested

alembic downgrade -1  # Rollback last migration

Railway Rollback: Previous deployment available in Railway dashboard

9. Documentation 📚

API Documentation

OpenAPI Spec: Available at /docs endpoint
Postman Collection: Published and updated
Authentication Guide: JWT token generation documented
Rate Limits: Documented for each endpoint

Runbooks

Incident Response: Steps for handling production incidents
Scaling Guide: How to scale Railway resources
Common Issues: FAQ with solutions

Changelog

Release Notes: Beta 1.0 features documented
Breaking Changes: API changes documented
Migration Guide: For users upgrading from alpha

10. Pre-Launch Testing 🧪

Smoke Tests

Health Check: /health/ returns 200
Authentication: JWT login works

Agent Invocation: Each of 16 agents responds

# Test critical agents
curl -X POST https://cidadao-api-production.up.railway.app/api/v1/chat/message \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"message": "Detectar anomalias em contratos", "agent": "zumbi"}'

Real Data: Actual government data retrieved

curl https://cidadao-api-production.up.railway.app/api/v1/federal/ibge/states

Load Testing

Baseline Performance: Measure under normal load
Target: <200ms p95 response time
Target: >100 requests/second capacity
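
Load-testing tools such as k6 report p95 directly, but it helps to be clear about what the number means: 95% of requests completed at or below that latency. The sketch below computes it with the nearest-rank method from raw samples. The helper names are illustrative, and other percentile definitions (for example with interpolation) give slightly different values on small samples.

```python
import math


def percentile(samples, pct: float) -> float:
    """Nearest-rank percentile: smallest sample with at least pct% of values at or below it."""
    if not samples:
        raise ValueError("samples must not be empty")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def meets_latency_target(latencies_ms, target_ms: float = 200.0) -> bool:
    """Check the baseline target: p95 response time below target_ms."""
    return percentile(latencies_ms, 95) < target_ms


# 19 fast requests and one slow outlier still meet the target.
print(meets_latency_target([100] * 19 + [500]))  # True
```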

Stress Testing: Verify behavior under high load

# Using k6 or locust
k6 run --vus 50 --duration 5m scripts/load_tests/api_test.js

Endurance Testing: 24-hour stability test

Security Testing

OWASP Top 10: Scan for common vulnerabilities

# Using OWASP ZAP or similar
zap-cli quick-scan https://cidadao-api-production.up.railway.app

Dependency Scan: Check for vulnerable dependencies
```
pip-audit
safety check
```

11. Deployment Execution 🚀

Pre-Deployment

Announce Maintenance Window: Notify users
Freeze Code: No new commits to main branch
Team Availability: Ensure team available for rollback

Deployment Steps

Tag Release

git tag v1.0.0-beta
git push origin v1.0.0-beta

Deploy to Railway

railway up
# Or use automatic deployment from GitHub

Run Migrations
```
railway run alembic upgrade head
```
Verify Deployment
```
railway status
railway logs --tail 100
```
Smoke Test Production
Health check passes
Critical endpoints respond
All 16 agents operational
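
The production smoke test above could be scripted rather than run by hand. A minimal sketch using only the standard library is shown below. The base URL and paths are the ones used in the curl examples in this checklist, while the `smoke_test` and `http_status` helpers and the injectable `fetch` parameter are hypothetical. It covers unauthenticated GET endpoints only.

```python
import urllib.request

BASE_URL = "https://cidadao-api-production.up.railway.app"


def http_status(url: str, timeout: float = 10.0) -> int:
    """Return the HTTP status code for a GET request to url."""
    with urllib.request.urlopen(url, timeout=timeout) as response:
        return response.status


def smoke_test(paths, fetch=http_status, base_url: str = BASE_URL) -> dict:
    """Map each path to True when it answers 200, False on any other result."""
    results = {}
    for path in paths:
        try:
            results[path] = fetch(base_url + path) == 200
        except OSError:  # URLError and HTTPError both derive from OSError
            results[path] = False
    return results
```

For example, `smoke_test(["/health/", "/health/metrics", "/api/v1/federal/ibge/states"])` returns a dictionary that a deploy script can inspect before marking the release healthy.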

Post-Deployment

Monitor Errors: Watch Sentry for 1 hour
Monitor Performance: Watch Grafana dashboards
Monitor Logs: Check Railway logs for errors
Update Status Page: Mark deployment as complete

12. Post-Launch Monitoring 👀

First 24 Hours

Hour 1: Active monitoring, team on standby
Hour 6: Review error rates and performance
Hour 24: Full metrics review

First Week

Daily Metrics Review: Error rate, response time, throughput
User Feedback: Monitor support channels
Performance Tuning: Adjust based on real usage

Success Criteria

Uptime: >99.9% (at most about 1.4 minutes of downtime per day, or about 43 minutes per 30 days)
Error Rate: <1% of requests
Response Time: p95 <200ms, p99 <500ms
Agent Success Rate: >95% successful investigations
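
An uptime target translates into a downtime budget by simple arithmetic: the allowed downtime is the complement of the uptime percentage multiplied by the length of the period. The helper below is an illustrative sketch. The function name is an assumption.

```python
def downtime_budget_minutes(uptime_percent: float, period_days: float) -> float:
    """Minutes of downtime allowed over period_days at the given uptime target."""
    return (1 - uptime_percent / 100) * period_days * 24 * 60


print(round(downtime_budget_minutes(99.9, 1), 2))   # 1.44
print(round(downtime_budget_minutes(99.9, 30), 1))  # 43.2
```

At 99.9%, a single 8-minute outage would consume more than five days of budget.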

🚨 Rollback Triggers

Roll back immediately if:
- Error rate >10% for >5 minutes
- API completely unavailable for >2 minutes
- Database corruption detected
- Critical security vulnerability discovered
- LLM provider completely unavailable
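
The rollback triggers combine two timed thresholds with three yes/no conditions. A hedged sketch of that decision as a single function is shown below. The function name, its parameters, and the flag strings are assumptions for illustration, and a real on-call process would still involve human judgment.

```python
def should_roll_back(error_rate: float, error_minutes: float,
                     outage_minutes: float, critical_flags=()) -> bool:
    """Apply the rollback triggers: sustained errors, an outage, or any critical flag."""
    sustained_errors = error_rate > 0.10 and error_minutes > 5
    api_down = outage_minutes > 2
    # critical_flags holds conditions such as "database_corruption",
    # "security_vulnerability", or "llm_provider_down"; any one is enough.
    return sustained_errors or api_down or bool(critical_flags)


print(should_roll_back(0.12, 6, 0))                            # True
print(should_roll_back(0.12, 3, 0))                            # False
print(should_roll_back(0.0, 0, 0, ("database_corruption",)))   # True
```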

Rollback Procedure:
1. Deploy the previous Railway version
2. Roll back the database migration if needed
3. Notify users of the rollback
4. Investigate the root cause

📞 Emergency Contacts

Development Team:
- Anderson Henrique da Silva (Lead Engineer)
- [Add other team members]

Infrastructure:
- Railway Support: support@railway.app
- Database DBA: [contact]

Third-Party Services:
- Maritaca AI Support: [contact]
- Anthropic Support: [contact]
- Portal da Transparência: [contact]

✅ Final Sign-Off

Engineering Lead: All technical requirements met
DevOps: Infrastructure ready and monitored
Security: Security checklist completed
QA: All tests passing, smoke tests successful
Product: Release notes approved

Deployment Approved By:
- Name: ___________________________
- Date: ___________________________
- Signature: ______________________

📊 Deployment Metrics Baseline

Record these metrics before deployment for comparison:

| Metric | Current | Target | Actual (Post-Deploy) |
|---|---|---|---|
| Test Coverage | 44.59% | 80% (v1.0 target) | ___ |
| Tests Passing | 177+ | All | ___ |
| API Response Time (p95) | TBD | <200ms | ___ |
| Error Rate | TBD | <1% | ___ |
| Uptime (30 days) | 99.9% | >99.9% | ___ |
| Agent Success Rate | TBD | >95% | ___ |

Document Version: 1.0
Last Updated: 2025-10-27
Next Review: Before v1.0 production release

Status: ✅ Ready for Beta 1.0 Deployment