🎯 Sprint Plan - 2025-10-13
Author: Anderson Henrique da Silva | Location: Minas Gerais, Brasil | Date: 2025-10-13 07:15:00 -03:00 | Estimated Duration: 2h 15min (135 minutes) | Status: 🚀 READY FOR EXECUTION
📋 Table of Contents
- Overview
- PHASE 0: Workspace Organization
- SPRINT 1: Federal APIs - Bug Fixes
- SPRINT 2: Federal APIs - REST Endpoints
- SPRINT 3: Metrics Warm-up Job
- SPRINT 4: Prometheus Alerts
- SPRINT 5: Final Validation and Documentation
- Success Criteria
🎯 Overview
Main Objective
Turn the basic monitoring infrastructure into a complete production system, with working Federal APIs, real-time metrics, and configured alerts.
Context
- ✅ Monitoring stack (Prometheus + Grafana) tested and functional
- ✅ Federal APIs dashboard created with 17 panels
- ✅ 12 metrics instrumented in the code
- ⚠️ Known bugs: IBGE (Pydantic), INEP (missing method), DataSUS (403/404)
- ⚠️ Metrics not yet populated (separate process)
Expected Deliverables
- Organized workspace (tests, scripts, documentation)
- Federal APIs 100% functional (bugs fixed)
- Public REST endpoints for all Federal APIs
- Warm-up job keeping metrics up to date
- Prometheus alerts configured and tested
- Grafana dashboard showing real data
- Complete, up-to-date documentation
📦 PHASE 0: Workspace Organization
Duration: 30 minutes | Priority: High | Objective: Clean up and organize the project structure
PHASE 0.1: Analysis and Planning (5 min)
Surgical Checklist
- List all test files in the repository root
- Review the existing structure under tests/
- Review the existing structure under scripts/
- Identify temporary files
- Draft the move plan
Files Identified in the Root
✗ test_celery_beat.py → tests/manual/celery/
✗ test_celery_persistence.py → tests/manual/celery/
✗ test_direct_supabase.py → tests/manual/database/
✗ test_federal_apis.py → tests/manual/federal_apis/
✗ test_public_endpoint.py → tests/manual/api/
✗ check_celery_status.py → scripts/monitoring/
✗ cidadao_ai.db → .gitignore (already ignored)
✗ audit_logs/ → logs/audit/
Analysis Commands
# 1. Check for test files in the root
find . -maxdepth 1 -name "test_*.py" -o -name "check_*.py"
# 2. Check the size of temporary files
du -sh cidadao_ai.db audit_logs/ __pycache__/
# 3. Inspect the tests/ structure
ls -la tests/
# 4. Inspect the scripts/ structure
ls -la scripts/
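The same root-directory audit can be done from Python with pathlib; a minimal sketch (the `find_stray_files` helper is hypothetical, mirroring the `find` command above):

```python
from pathlib import Path

def find_stray_files(root: str) -> list[str]:
    """Return root-level test_*.py and check_*.py files that should be relocated."""
    stray = [
        p.name
        for p in Path(root).iterdir()
        if p.is_file()
        and p.suffix == ".py"
        and (p.name.startswith("test_") or p.name.startswith("check_"))
    ]
    return sorted(stray)
```

Running `print(find_stray_files("."))` from the repo root should list exactly the files in the table above.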
PHASE 0.2: Move Test Files (10 min)
Target Structure
tests/
├── manual/                      # New manual tests
│   ├── __init__.py
│   ├── celery/
│   │   ├── __init__.py
│   │   ├── test_beat.py         # test_celery_beat.py
│   │   └── test_persistence.py  # test_celery_persistence.py
│   ├── database/
│   │   ├── __init__.py
│   │   └── test_supabase.py     # test_direct_supabase.py
│   ├── federal_apis/
│   │   ├── __init__.py
│   │   └── test_apis.py         # test_federal_apis.py
│   └── api/
│       ├── __init__.py
│       └── test_public.py       # test_public_endpoint.py
├── unit/                        # Existing
├── integration/                 # Existing
└── ...
Execution Checklist
- Create tests/manual/ with its subdirectories
- Create every required __init__.py
- Move test_celery_beat.py → tests/manual/celery/test_beat.py
- Move test_celery_persistence.py → tests/manual/celery/test_persistence.py
- Move test_direct_supabase.py → tests/manual/database/test_supabase.py
- Move test_federal_apis.py → tests/manual/federal_apis/test_apis.py
- Move test_public_endpoint.py → tests/manual/api/test_public.py
- Update imports if necessary
- Add a README.md to tests/manual/ explaining usage
Execution Commands
# 1. Create the directory structure
mkdir -p tests/manual/{celery,database,federal_apis,api}
# 2. Create __init__.py in every directory
touch tests/manual/__init__.py
touch tests/manual/celery/__init__.py
touch tests/manual/database/__init__.py
touch tests/manual/federal_apis/__init__.py
touch tests/manual/api/__init__.py
# 3. Move files with git mv (preserves history)
git mv test_celery_beat.py tests/manual/celery/test_beat.py
git mv test_celery_persistence.py tests/manual/celery/test_persistence.py
git mv test_direct_supabase.py tests/manual/database/test_supabase.py
git mv test_federal_apis.py tests/manual/federal_apis/test_apis.py
git mv test_public_endpoint.py tests/manual/api/test_public.py
# 4. Verify the moves
git status
Validation
# Confirm the files were moved
ls -la tests/manual/*/
# Confirm the root is clean
ls -la *.py | grep test_
# Expected result: no test_*.py files in the root
PHASE 0.3: Organize Scripts (10 min)
Execution Checklist
- Move check_celery_status.py → scripts/monitoring/check_celery.py
- Consolidate DB test scripts under scripts/testing/database/
- Update scripts/testing/README.md
- Add comments to the moved scripts
Execution Commands
# 1. Move the monitoring script
git mv check_celery_status.py scripts/monitoring/check_celery.py
# 2. Organize the existing DB test scripts under scripts/
mkdir -p scripts/testing/database
git mv scripts/test_db_connection.py scripts/testing/database/
git mv scripts/test_auth_db.py scripts/testing/database/
git mv scripts/simple_db_test.py scripts/testing/database/
git mv scripts/test_supabase_connection.py scripts/testing/database/
# 3. Verify
ls -la scripts/monitoring/
ls -la scripts/testing/database/
PHASE 0.4: Clean Up Temporary Files (5 min)
Execution Checklist
- Move audit_logs/ → logs/audit/
- Add cidadao_ai.db to .gitignore if it is not already there
- Recursively remove every __pycache__ directory
- Remove .pytest_cache if present
- Verify .coverage is not committed
Execution Commands
# 1. Move audit logs
mkdir -p logs/audit
mv audit_logs/* logs/audit/ 2>/dev/null || true
rmdir audit_logs
# 2. Check .gitignore
grep -q "cidadao_ai.db" .gitignore || echo "cidadao_ai.db" >> .gitignore
grep -q "__pycache__" .gitignore || echo "__pycache__/" >> .gitignore
grep -q ".pytest_cache" .gitignore || echo ".pytest_cache/" >> .gitignore
# 3. Clean caches
find . -type d -name "__pycache__" -exec rm -rf {} + 2>/dev/null || true
find . -type d -name ".pytest_cache" -exec rm -rf {} + 2>/dev/null || true
# 4. Check git status
git status
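The cache cleanup can also be scripted portably in Python; a stdlib sketch of the two `find ... -exec rm -rf` commands (the `clean_caches` helper is hypothetical):

```python
import shutil
from pathlib import Path

CACHE_DIRS = ("__pycache__", ".pytest_cache")

def clean_caches(root: str) -> int:
    """Recursively delete cache directories under root; return how many were removed."""
    removed = 0
    for name in CACHE_DIRS:
        for path in Path(root).rglob(name):
            if path.is_dir():
                shutil.rmtree(path, ignore_errors=True)
                removed += 1
    return removed
```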
Validation
# Confirm the cleanup
ls -la | grep -E "(pycache|pytest_cache|audit_logs)"
# Expected result: nothing found
PHASE 0.5: Organization Commit (5 min)
Commit Checklist
- Review all changes with git status
- Verify every moved test is tracked
- Create a descriptive commit
- Update README.md with the new structure
Commands
# 1. Review changes
git status
# 2. Stage the new files
git add tests/manual/
git add scripts/testing/database/
git add scripts/monitoring/check_celery.py
# 3. Commit
git commit -m "refactor: organize workspace structure
Move test files to dedicated directories:
- Manual tests moved to tests/manual/ with categorization
- Celery tests → tests/manual/celery/
- Database tests → tests/manual/database/
- Federal API tests → tests/manual/federal_apis/
- API tests → tests/manual/api/
Organize scripts:
- Monitoring scripts consolidated in scripts/monitoring/
- Testing scripts organized in scripts/testing/database/
Clean temporary files:
- Move audit logs to logs/audit/
- Remove __pycache__ and .pytest_cache
- Update .gitignore for temporary files
Improves project maintainability and clarity."
# 4. Verify the commit
git log -1 --stat
🐛 SPRINT 1: Federal APIs - Bug Fixes
Duration: 40 minutes | Priority: Critical | Objective: Resolve every bug identified during the monitoring tests
Bug Context
BUG 1: IBGE - Pydantic Validation Error
ValidationError: 1 validation error for IBGELocation
id
  Input should be a valid string [type=string_type, input_value=11, input_type=int]
Cause: The IBGE API returns IDs as integers, but the Pydantic model expects strings.
File: src/services/transparency_apis/federal_apis/ibge_client.py
Lines: the response models (IBGELocation, IBGEMunicipality, etc.)
BUG 2: INEP - Method Not Implemented
Cause: The search_institutions method does not exist on the INEPClient class.
File: src/services/transparency_apis/federal_apis/inep_client.py
Missing method: async def search_institutions(self, state: str, limit: int)
BUG 3: DataSUS - Endpoints Return 403/404
Cause: The endpoints may be wrong or may require special authentication.
File: src/services/transparency_apis/federal_apis/datasus_client.py
Needs investigation: check the official DataSUS API documentation.
SPRINT 1.1: Fix IBGE Pydantic Validation (15 min)
Surgical Checklist
- Read ibge_client.py in full
- Identify every Pydantic model
- Check the field types
- Change the `id` field to coerce int → str
- Add a custom validator if necessary
- Test against real API data
- Update the unit tests
Detailed Implementation
Step 1: Read the current model
# View the Pydantic models in the file
grep -A 10 "class IBGE" src/services/transparency_apis/federal_apis/ibge_client.py
Step 2: Pinpoint the problem
# Current (broken) model:
class IBGELocation(BaseModel):
    id: str  # API returns int, but we expect str
    nome: str
Step 3: Solution 1 - Field with a validator
from pydantic import BaseModel, Field, field_validator

class IBGELocation(BaseModel):
    id: str = Field(..., description="State/municipality ID")
    nome: str

    @field_validator('id', mode='before')
    @classmethod
    def coerce_id_to_str(cls, v):
        """Convert integer IDs from IBGE API to strings."""
        return str(v) if isinstance(v, int) else v
Step 4: Solution 2 - Union type (alternative)
from typing import Union

class IBGELocation(BaseModel):
    id: Union[str, int]  # Accepts both
    nome: str

    @field_validator('id', mode='after')
    @classmethod
    def ensure_str(cls, v):
        return str(v)
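The coercion the validator performs is easy to verify in isolation; a stdlib-only sketch of the `mode='before'` behavior (`normalize_location` is a hypothetical helper, not part of the client):

```python
def coerce_id_to_str(value):
    """Mirror of the field validator: IBGE returns integer IDs, we store strings."""
    return str(value) if isinstance(value, int) else value

def normalize_location(payload: dict) -> dict:
    """Apply the coercion to a raw IBGE location payload before model validation."""
    out = dict(payload)
    if "id" in out:
        out["id"] = coerce_id_to_str(out["id"])
    return out
```

Feeding it the failing payload from BUG 1 (`{"id": 11, ...}`) yields a string ID, which is exactly what the Pydantic validator will do inside the model.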
Files to Modify
src/services/transparency_apis/federal_apis/ibge_client.py
└── Models:
    ├── IBGELocation
    ├── IBGEMunicipality
    ├── IBGEState
    └── Any other model with an 'id' field
Test Commands
# 1. Test the IBGE client in isolation
python -c "
import asyncio
from src.services.transparency_apis.federal_apis.ibge_client import IBGEClient

async def test():
    async with IBGEClient() as client:
        states = await client.get_states()
        print(f'✅ States: {len(states)}')
        munis = await client.get_municipalities('33')
        print(f'✅ Municipalities: {len(munis)}')

asyncio.run(test())
"
# 2. Run the relocated manual test
python -m pytest tests/manual/federal_apis/test_apis.py::test_ibge -v
Success Validation
- get_states() returns a list without errors
- get_municipalities() works for every state
- get_population() does not raise ValidationError
- Tests pass without warnings
SPRINT 1.2: Implement INEP search_institutions (15 min)
Surgical Checklist
- Read the INEP API documentation
- Check the available endpoints
- Implement the search_institutions method
- Add a Pydantic response model
- Instrument with Prometheus metrics
- Add caching with a TTL
- Test against real data
- Update the docstring
Detailed Implementation
Step 1: Research the INEP endpoint
# Check base_url and the existing endpoints
grep -n "base_url\|endpoint" src/services/transparency_apis/federal_apis/inep_client.py
Step 2: Review the existing methods
# See how the other methods are implemented
# e.g. use get_education_indicators as a template
Step 3: Implement the search_institutions method
async def search_institutions(
    self,
    state: Optional[str] = None,
    city: Optional[str] = None,
    name: Optional[str] = None,
    limit: int = 20,
    page: int = 1
) -> Dict[str, Any]:
    """
    Search educational institutions.

    Args:
        state: State code (UF) - e.g., 'RJ', 'SP'
        city: City name
        name: Institution name (partial match)
        limit: Max results per page (default: 20)
        page: Page number (default: 1)

    Returns:
        Dict with:
        - total: Total institutions found
        - page: Current page
        - limit: Results per page
        - results: List of institutions

    Example:
        >>> async with INEPClient() as client:
        >>>     results = await client.search_institutions(state="RJ", limit=10)
        >>>     print(f"Found {results['total']} institutions")
    """
    endpoint = "/instituicoes"  # Verify the real endpoint in the documentation
    params = {
        "limit": limit,
        "page": page
    }
    if state:
        params["uf"] = state.upper()
    if city:
        params["municipio"] = city
    if name:
        params["nome"] = name

    # Use the existing _make_request method
    response = await self._make_request(
        method="GET",
        endpoint=endpoint,
        params=params
    )

    return {
        "total": response.get("total", 0),
        "page": page,
        "limit": limit,
        "results": response.get("data", [])
    }
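The filter and pagination plumbing inside `search_institutions` can be factored into a pure function and tested without any HTTP; a sketch under the same assumption that the real INEP query parameters are `uf`, `municipio`, and `nome` (still to be confirmed against the docs):

```python
from typing import Any, Optional

def build_search_params(
    state: Optional[str] = None,
    city: Optional[str] = None,
    name: Optional[str] = None,
    limit: int = 20,
    page: int = 1,
) -> dict[str, Any]:
    """Build the query-string parameters, omitting unset filters."""
    params: dict[str, Any] = {"limit": limit, "page": page}
    if state:
        params["uf"] = state.upper()
    if city:
        params["municipio"] = city
    if name:
        params["nome"] = name
    return params
```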
Step 4: Add a Pydantic model (optional)
class INEPInstitution(BaseModel):
    """Educational institution data model."""
    codigo: str
    nome: str
    uf: str
    municipio: str
    rede: Optional[str] = None         # public/private
    tipo: Optional[str] = None         # federal/state/municipal
    dependencia: Optional[str] = None
Files to Modify
src/services/transparency_apis/federal_apis/inep_client.py
└── Add:
    ├── search_institutions() method
    └── INEPInstitution model (optional)
Test Commands
# 1. Test the new method
python -c "
import asyncio
from src.services.transparency_apis.federal_apis.inep_client import INEPClient

async def test():
    async with INEPClient() as client:
        results = await client.search_institutions(state='RJ', limit=5)
        print(f'✅ Institutions found: {results[\"total\"]}')
        print(f'✅ Results: {len(results[\"results\"])}')

asyncio.run(test())
"
# 2. Run the manual test
python -m pytest tests/manual/federal_apis/test_apis.py::test_inep -v
Success Validation
- The search_institutions method exists and no longer raises AttributeError
- Returns the expected data structure
- Works with different parameter combinations
- Prometheus metrics are recorded
- Tests pass
SPRINT 1.3: Investigate and Fix DataSUS (10 min)
Surgical Checklist
- Read the full error output from the previous tests
- Check which endpoints failed (403/404)
- Research the official DataSUS documentation
- Check whether special authentication is required
- Adjust endpoints or parameters
- Add specific error handling
- Test against real data
- Document known limitations
Detailed Investigation
Step 1: Review the logs from the previous test run
Step 2: Identify the problematic endpoints
# See which endpoints returned 403/404
# Most likely:
# - get_health_indicators() → 403
# - get_hospital_data() → 404
Step 3: Research the documentation
# DataSUS reference URLs:
# - https://datasus.saude.gov.br/
# - http://tabnet.datasus.gov.br/
# - API docs (if any)
Step 4: Implement robust error handling
async def get_health_indicators(self, state_code: Optional[str] = None) -> Dict[str, Any]:
    """
    Get health indicators.

    Note: This endpoint may have restricted access.
    Returns cached or mock data if API returns 403/404.
    """
    endpoint = "/indicadores-saude"
    try:
        params = {"uf": state_code} if state_code else {}
        response = await self._make_request(
            method="GET",
            endpoint=endpoint,
            params=params
        )
        return response
    except HTTPStatusError as e:
        if e.response.status_code in (403, 404):
            # Log warning but don't fail
            logger.warning(
                f"DataSUS endpoint restricted: {endpoint} "
                f"(status {e.response.status_code})"
            )
            # Return structure with limitation note
            return {
                "status": "limited_access",
                "message": "Endpoint requires special authorization",
                "state_code": state_code,
                "indicators": []
            }
        raise
Step 5: Document the limitations
# Add to the class docstring:
"""
DataSUS Client - Brazilian Health Data System

Known Limitations:
- Some endpoints require special authorization (403)
- Hospital data endpoint may be unavailable (404)
- Recommend using search_datasets() for reliable access
"""
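The 403/404 handling above generalizes into a reusable wrapper; a synchronous sketch where `RestrictedError` stands in for `httpx.HTTPStatusError` (both the wrapper and the stand-in exception are illustrative, not part of the client):

```python
import logging
from typing import Any, Callable

logger = logging.getLogger("datasus")

class RestrictedError(Exception):
    """Stand-in for httpx.HTTPStatusError carrying a status code."""
    def __init__(self, status_code: int):
        super().__init__(f"HTTP {status_code}")
        self.status_code = status_code

def degrade_gracefully(fetch: Callable[[], dict[str, Any]], endpoint: str) -> dict[str, Any]:
    """Return a 'limited_access' payload instead of failing on 403/404; re-raise otherwise."""
    try:
        return fetch()
    except RestrictedError as e:
        if e.status_code in (403, 404):
            logger.warning("DataSUS endpoint restricted: %s (status %s)", endpoint, e.status_code)
            return {"status": "limited_access", "endpoint": endpoint, "indicators": []}
        raise
```

The key design point is the final `raise`: only the two known restricted statuses are swallowed, so genuine server errors still surface to the caller and to the metrics.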
Files to Modify
src/services/transparency_apis/federal_apis/datasus_client.py
└── Methods to adjust:
    ├── get_health_indicators()
    ├── get_hospital_data() (if it exists)
    └── Add robust error handling
Test Commands
# 1. Test the known endpoints
python -c "
import asyncio
from src.services.transparency_apis.federal_apis.datasus_client import DataSUSClient

async def test():
    async with DataSUSClient() as client:
        # Search test (known to work)
        datasets = await client.search_datasets('saúde', limit=5)
        print(f'✅ Datasets: {datasets.get(\"result\", {}).get(\"count\", 0)}')
        # Indicators test (may return limited_access)
        indicators = await client.get_health_indicators(state_code='RJ')
        print(f'✅ Indicators status: {indicators.get(\"status\", \"ok\")}')

asyncio.run(test())
"
# 2. Run the full manual test
python -m pytest tests/manual/federal_apis/test_apis.py::test_datasus -v
Success Validation
- No endpoint raises an unhandled exception
- 403/404 responses are handled gracefully
- Returns a consistent data structure
- Logs clearly indicate the limitations
- Tests pass without critical failures
SPRINT 1.4: Commit the Fixes (5 min)
Commit Checklist
- Run every Federal API test
- Verify nothing existing broke
- Check whether test coverage increased
- Create a descriptive commit
- Update the CHANGELOG (if one exists)
Commands
# 1. Run the full Federal API test suite
python -m pytest tests/manual/federal_apis/ -v
# 2. Run the unit tests if they exist
python -m pytest tests/unit/services/transparency_apis/ -v --cov
# 3. Review the changes
git status
git diff
# 4. Commit
git add src/services/transparency_apis/federal_apis/
git commit -m "fix(federal-apis): resolve critical bugs in IBGE, INEP, and DataSUS clients
IBGE Client:
- Fix Pydantic validation error for integer IDs
- Add field validator to coerce int to str
- Update IBGELocation, IBGEMunicipality, IBGEState models
- All endpoints now work without ValidationError
INEP Client:
- Implement missing search_institutions method
- Add support for filtering by state, city, name
- Include pagination parameters (limit, page)
- Add comprehensive docstring with examples
DataSUS Client:
- Add robust error handling for 403/404 responses
- Return graceful degradation for restricted endpoints
- Document known API limitations
- Prevent exceptions from blocking metrics collection
Testing:
- All Federal API manual tests now pass
- Verified with real API calls to production endpoints
Related to: MONITORING_TEST_RESULTS.md findings"
# 5. Verify the commit
git log -1 --stat
🔌 SPRINT 2: Federal APIs - REST Endpoints
Duration: 30 minutes | Priority: High | Objective: Expose the Federal APIs over REST to generate metrics and ease external use
Context
Today the Federal APIs are internal clients only. We need public REST endpoints that:
1. Allow access over HTTP
2. Generate metrics automatically when called
3. Make testing and external use easier
4. Keep the same response structure
Endpoint Layout
GET  /api/v1/federal/ibge/states
POST /api/v1/federal/ibge/municipalities
POST /api/v1/federal/ibge/population
POST /api/v1/federal/datasus/search
POST /api/v1/federal/datasus/indicators
POST /api/v1/federal/inep/search-institutions
POST /api/v1/federal/inep/indicators
SPRINT 2.1: Create the Routes File (10 min)
Surgical Checklist
- Create src/api/routes/federal_apis.py
- Import APIRouter from FastAPI
- Import the Federal API clients
- Define the Pydantic request/response models
- Add tags for the OpenAPI documentation
- Implement the basic endpoints
- Add error handling
Detailed Implementation
Step 1: Create the base file
"""
Federal APIs REST Endpoints

Exposes Brazilian government APIs as REST endpoints.
Generates Prometheus metrics automatically on each call.

Author: Anderson Henrique da Silva
Location: Minas Gerais, Brasil
Date: 2025-10-13
"""
from typing import Optional, Dict, Any

from fastapi import APIRouter, HTTPException, Query
from pydantic import BaseModel, Field

from src.services.transparency_apis.federal_apis.ibge_client import IBGEClient
from src.services.transparency_apis.federal_apis.datasus_client import DataSUSClient
from src.services.transparency_apis.federal_apis.inep_client import INEPClient

router = APIRouter(
    prefix="/api/v1/federal",
    tags=["Federal APIs"]
)
Step 2: Define the request models
# Request Models
class IBGEMunicipalitiesRequest(BaseModel):
    """Request model for IBGE municipalities."""
    state_code: str = Field(..., description="State code (2 digits)", example="33")

class IBGEPopulationRequest(BaseModel):
    """Request model for IBGE population data."""
    state_code: Optional[str] = Field(None, description="State code (2 digits)", example="33")
    municipality_code: Optional[str] = Field(None, description="Municipality code", example="3304557")

class DataSUSSearchRequest(BaseModel):
    """Request model for DataSUS search."""
    query: str = Field(..., description="Search query", example="saúde")
    limit: int = Field(10, ge=1, le=100, description="Max results")

class DataSUSIndicatorsRequest(BaseModel):
    """Request model for DataSUS health indicators."""
    state_code: Optional[str] = Field(None, description="State code (UF)", example="RJ")

class INEPSearchRequest(BaseModel):
    """Request model for INEP institution search."""
    state: Optional[str] = Field(None, description="State code (UF)", example="RJ")
    city: Optional[str] = Field(None, description="City name", example="Rio de Janeiro")
    name: Optional[str] = Field(None, description="Institution name", example="UFRJ")
    limit: int = Field(20, ge=1, le=100, description="Max results")
    page: int = Field(1, ge=1, description="Page number")

class INEPIndicatorsRequest(BaseModel):
    """Request model for INEP education indicators."""
    state: Optional[str] = Field(None, description="State code (UF)", example="RJ")
    year: Optional[int] = Field(None, description="Year", example=2023)
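The `ge`/`le` bounds on `limit` and `page` can be exercised without spinning up FastAPI; a stdlib sketch of the same constraint that Pydantic enforces (`validate_limit` is a hypothetical helper used only for illustration):

```python
def validate_limit(limit: int, lo: int = 1, hi: int = 100) -> int:
    """Reject out-of-range limits, mirroring Field(ge=1, le=100)."""
    if not lo <= limit <= hi:
        raise ValueError(f"limit must be between {lo} and {hi}, got {limit}")
    return limit
```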
Step 3: Implement the IBGE endpoints
@router.get(
    "/ibge/states",
    summary="Get Brazilian States",
    description="Retrieve all Brazilian states from IBGE API"
)
async def get_ibge_states() -> Dict[str, Any]:
    """Get all Brazilian states."""
    try:
        async with IBGEClient() as client:
            states = await client.get_states()
            return {
                "success": True,
                "total": len(states),
                "data": states
            }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@router.post(
    "/ibge/municipalities",
    summary="Get Municipalities by State",
    description="Retrieve municipalities for a specific state"
)
async def get_ibge_municipalities(request: IBGEMunicipalitiesRequest) -> Dict[str, Any]:
    """Get municipalities for a state."""
    try:
        async with IBGEClient() as client:
            municipalities = await client.get_municipalities(state_code=request.state_code)
            return {
                "success": True,
                "state_code": request.state_code,
                "total": len(municipalities),
                "data": municipalities
            }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@router.post(
    "/ibge/population",
    summary="Get Population Data",
    description="Retrieve population data from IBGE"
)
async def get_ibge_population(request: IBGEPopulationRequest) -> Dict[str, Any]:
    """Get population data."""
    try:
        async with IBGEClient() as client:
            population = await client.get_population(
                state_code=request.state_code,
                municipality_code=request.municipality_code
            )
            return {
                "success": True,
                "state_code": request.state_code,
                "municipality_code": request.municipality_code,
                "data": population
            }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
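All three handlers above return the same `success`/`total`/`data` envelope; a small helper (hypothetical, not in the source) would keep that shape consistent across routes:

```python
from typing import Any, Sequence

def success_envelope(data: Sequence[Any], **context: Any) -> dict[str, Any]:
    """Standard response body: success flag, echoed request context, count, and data."""
    return {"success": True, **context, "total": len(data), "data": list(data)}
```

Each handler body then reduces to `return success_envelope(states)` or `return success_envelope(municipalities, state_code=request.state_code)`.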
File to Create
SPRINT 2.2: Implement the DataSUS and INEP Endpoints (10 min)
Detailed Implementation
Step 1: DataSUS endpoints
@router.post(
    "/datasus/search",
    summary="Search DataSUS Datasets",
    description="Search health datasets in DataSUS"
)
async def search_datasus_datasets(request: DataSUSSearchRequest) -> Dict[str, Any]:
    """Search DataSUS datasets."""
    try:
        async with DataSUSClient() as client:
            results = await client.search_datasets(
                query=request.query,
                limit=request.limit
            )
            return {
                "success": True,
                "query": request.query,
                "data": results
            }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@router.post(
    "/datasus/indicators",
    summary="Get Health Indicators",
    description="Retrieve health indicators from DataSUS"
)
async def get_datasus_indicators(request: DataSUSIndicatorsRequest) -> Dict[str, Any]:
    """Get health indicators."""
    try:
        async with DataSUSClient() as client:
            indicators = await client.get_health_indicators(
                state_code=request.state_code
            )
            return {
                "success": True,
                "state_code": request.state_code,
                "data": indicators
            }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
Step 2: INEP endpoints
@router.post(
    "/inep/search-institutions",
    summary="Search Educational Institutions",
    description="Search institutions in INEP database"
)
async def search_inep_institutions(request: INEPSearchRequest) -> Dict[str, Any]:
    """Search educational institutions."""
    try:
        async with INEPClient() as client:
            results = await client.search_institutions(
                state=request.state,
                city=request.city,
                name=request.name,
                limit=request.limit,
                page=request.page
            )
            return {
                "success": True,
                "filters": {
                    "state": request.state,
                    "city": request.city,
                    "name": request.name
                },
                "pagination": {
                    "page": request.page,
                    "limit": request.limit
                },
                "data": results
            }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))

@router.post(
    "/inep/indicators",
    summary="Get Education Indicators",
    description="Retrieve education indicators from INEP"
)
async def get_inep_indicators(request: INEPIndicatorsRequest) -> Dict[str, Any]:
    """Get education indicators."""
    try:
        async with INEPClient() as client:
            indicators = await client.get_education_indicators(
                state=request.state,
                year=request.year
            )
            return {
                "success": True,
                "state": request.state,
                "year": request.year,
                "data": indicators
            }
    except Exception as e:
        raise HTTPException(status_code=500, detail=str(e))
SPRINT 2.3: Register the Router in the Main App (5 min)
Surgical Checklist
- Open src/api/app.py
- Import the federal_apis router
- Register the router on the FastAPI app
- Check the registration order (before the others)
- Verify the endpoints show up in /docs
Implementation
Step 1: Locate the imports in app.py
Step 2: Add the import
Step 3: Register the router
Step 4: Check the order
# Recommended order:
app.include_router(federal_apis.router)  # Federal APIs first
app.include_router(agents.router)        # then agents
app.include_router(chat.router)          # then chat
# ...other routers
Files to Modify
src/api/app.py
└── Add:
    ├── Import: from src.api.routes import federal_apis
    └── Register: app.include_router(federal_apis.router)
SPRINT 2.4: Test the Endpoints (5 min)
Test Checklist
- Start the backend locally
- Open /docs and verify the new endpoints
- Exercise each endpoint via the Swagger UI
- Test via curl or httpie
- Check the metrics at /health/metrics
- Verify Prometheus scraped the metrics
Test Commands
Step 1: Start the backend
Step 2: Hit the endpoints via curl
# 1. IBGE - Get States
curl -X GET http://localhost:8000/api/v1/federal/ibge/states
# 2. IBGE - Get Municipalities
curl -X POST http://localhost:8000/api/v1/federal/ibge/municipalities \
  -H "Content-Type: application/json" \
  -d '{"state_code": "33"}'
# 3. DataSUS - Search
curl -X POST http://localhost:8000/api/v1/federal/datasus/search \
  -H "Content-Type: application/json" \
  -d '{"query": "saúde", "limit": 5}'
# 4. INEP - Search Institutions
curl -X POST http://localhost:8000/api/v1/federal/inep/search-institutions \
  -H "Content-Type: application/json" \
  -d '{"state": "RJ", "limit": 5}'
Step 3: Check the metrics
# Confirm the metrics were generated
curl http://localhost:8000/health/metrics | grep federal_api
# Should show:
# federal_api_requests_total{...} 4
# federal_api_request_duration_seconds_count{...} 4
# etc.
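The grep above can be turned into an automated check; a minimal parser for the Prometheus text exposition format (metric names taken from the expected output; this sketch ignores the edge case of quoted spaces inside label values):

```python
def parse_metric_lines(text: str, prefix: str) -> dict[str, float]:
    """Extract 'name{labels} value' samples whose metric name starts with prefix."""
    samples: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_part, _, value = line.rpartition(" ")
        if name_part.startswith(prefix):
            samples[name_part] = float(value)
    return samples
```

Pointing it at the body of `GET /health/metrics` and asserting the `federal_api_requests_total` counters increased is a quick smoke test for the warm-up job.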
Step 4: Check in Prometheus
Success Validation
- Every endpoint responds with 200 OK
- The returned data is correct
- Swagger UI (/docs) lists the endpoints correctly
- Metrics appear at /health/metrics
- Prometheus is scraping the metrics
SPRINT 2.5: Commit the REST Endpoints (5 min)
git add src/api/routes/federal_apis.py
git add src/api/app.py
git commit -m "feat(api): add REST endpoints for Federal APIs
Expose Federal APIs as public REST endpoints:
IBGE Endpoints:
- GET /api/v1/federal/ibge/states - All Brazilian states
- POST /api/v1/federal/ibge/municipalities - Municipalities by state
- POST /api/v1/federal/ibge/population - Population data
DataSUS Endpoints:
- POST /api/v1/federal/datasus/search - Search health datasets
- POST /api/v1/federal/datasus/indicators - Health indicators
INEP Endpoints:
- POST /api/v1/federal/inep/search-institutions - Search schools/universities
- POST /api/v1/federal/inep/indicators - Education indicators
Features:
- Automatic Prometheus metrics generation on each call
- Comprehensive request/response models with Pydantic
- Error handling with appropriate HTTP status codes
- Full OpenAPI documentation in Swagger UI
- Async/await for optimal performance
Benefits:
- Enables dashboard metrics population
- Facilitates external integrations
- Provides consistent API interface
- Improves system observability
Related to: MONITORING_TEST_RESULTS.md recommendations"
⚡ SPRINT 3: Metrics Warm-up Job
Duration: 20 minutes | Priority: Medium | Objective: Create a scheduled job that keeps the metrics fresh at all times
Context
With the REST endpoints in place, we need a job that:
1. Calls the endpoints periodically
2. Keeps the metrics populated at all times
3. Validates API availability
4. Can run as a cronjob or a background task
SPRINT 3.1: Create the Warm-up Script (15 min)
Surgical Checklist
- Create scripts/monitoring/warmup_federal_apis.py
- Import httpx for the HTTP calls
- Implement a warm-up function per API
- Add detailed logging
- Add error handling
- Add health metrics for the warm-up itself
- Create a main function with scheduling
Detailed Implementation
#!/usr/bin/env python3
"""
Federal APIs Warm-up Job
Periodically calls Federal API endpoints to:
- Keep metrics updated in Prometheus
- Validate API availability
- Pre-warm caches
Author: Anderson Henrique da Silva
Location: Minas Gerais, Brasil
Date: 2025-10-13
Usage:
# Run once
python scripts/monitoring/warmup_federal_apis.py
# Run continuously (every 5 minutes)
python scripts/monitoring/warmup_federal_apis.py --daemon
# Custom interval
python scripts/monitoring/warmup_federal_apis.py --daemon --interval 300
"""
import asyncio
import logging
import sys
import time
from datetime import datetime
from typing import Dict, Any, List
from pathlib import Path
import httpx
import argparse
# Setup logging
logging.basicConfig(
    level=logging.INFO,
    format='%(asctime)s - %(levelname)s - %(message)s'
)
logger = logging.getLogger(__name__)

# Backend URL (adjust if needed)
BACKEND_URL = "http://localhost:8000"
# Warm-up endpoints to call
WARMUP_ENDPOINTS = [
    {
        "name": "IBGE States",
        "method": "GET",
        "url": f"{BACKEND_URL}/api/v1/federal/ibge/states",
        "data": None
    },
    {
        "name": "IBGE Municipalities (RJ)",
        "method": "POST",
        "url": f"{BACKEND_URL}/api/v1/federal/ibge/municipalities",
        "data": {"state_code": "33"}
    },
    {
        "name": "DataSUS Search",
        "method": "POST",
        "url": f"{BACKEND_URL}/api/v1/federal/datasus/search",
        "data": {"query": "saúde", "limit": 5}
    },
    {
        "name": "INEP Search (RJ)",
        "method": "POST",
        "url": f"{BACKEND_URL}/api/v1/federal/inep/search-institutions",
        "data": {"state": "RJ", "limit": 5}
    }
]
async def call_endpoint(
    client: httpx.AsyncClient,
    endpoint: Dict[str, Any]
) -> Dict[str, Any]:
    """
    Call a single endpoint and return the result.

    Args:
        client: Async HTTP client
        endpoint: Endpoint configuration

    Returns:
        Dict with result details
    """
    start_time = time.time()
    try:
        if endpoint["method"] == "GET":
            response = await client.get(endpoint["url"], timeout=10.0)
        else:
            response = await client.post(
                endpoint["url"],
                json=endpoint["data"],
                timeout=10.0
            )
        elapsed = time.time() - start_time
        if response.status_code == 200:
            logger.info(
                f"✅ {endpoint['name']}: {response.status_code} "
                f"({elapsed:.2f}s)"
            )
            return {
                "name": endpoint["name"],
                "status": "success",
                "status_code": response.status_code,
                "elapsed": elapsed
            }
        else:
            logger.warning(
                f"⚠️ {endpoint['name']}: {response.status_code} "
                f"({elapsed:.2f}s)"
            )
            return {
                "name": endpoint["name"],
                "status": "error",
                "status_code": response.status_code,
                "elapsed": elapsed
            }
    except Exception as e:
        elapsed = time.time() - start_time
        logger.error(f"❌ {endpoint['name']}: {str(e)} ({elapsed:.2f}s)")
        return {
            "name": endpoint["name"],
            "status": "failed",
            "error": str(e),
            "elapsed": elapsed
        }
async def warmup_cycle() -> Dict[str, Any]:
"""
Execute one warmup cycle calling all endpoints.
Returns:
Summary of warmup cycle
"""
logger.info("=" * 60)
logger.info(f"🔥 Starting Federal APIs Warm-up - {datetime.now()}")
logger.info("=" * 60)
results: List[Dict[str, Any]] = []
async with httpx.AsyncClient() as client:
# Call all endpoints
for endpoint in WARMUP_ENDPOINTS:
result = await call_endpoint(client, endpoint)
results.append(result)
# Small delay between calls
await asyncio.sleep(0.5)
# Calculate summary
success_count = sum(1 for r in results if r["status"] == "success")
error_count = sum(1 for r in results if r["status"] == "error")
failed_count = sum(1 for r in results if r["status"] == "failed")
total_time = sum(r["elapsed"] for r in results)
summary = {
"timestamp": datetime.now().isoformat(),
"total_endpoints": len(results),
"success": success_count,
"errors": error_count,
"failed": failed_count,
"total_time": total_time,
"results": results
}
logger.info("=" * 60)
logger.info(f"📊 Warmup Summary:")
logger.info(f" ✅ Success: {success_count}/{len(results)}")
logger.info(f" ⚠️ Errors: {error_count}/{len(results)}")
logger.info(f" ❌ Failed: {failed_count}/{len(results)}")
logger.info(f" ⏱️ Total: {total_time:.2f}s")
logger.info("=" * 60)
return summary
async def daemon_mode(interval: int = 300):
"""
Run warmup in daemon mode with periodic execution.
Args:
interval: Seconds between warmup cycles (default: 300 = 5 min)
"""
logger.info(f"🚀 Starting daemon mode (interval: {interval}s)")
cycle_count = 0
while True:
try:
cycle_count += 1
logger.info(f"\n🔄 Cycle #{cycle_count}")
await warmup_cycle()
logger.info(f"😴 Sleeping for {interval}s until next cycle...\n")
await asyncio.sleep(interval)
except KeyboardInterrupt:
logger.info("\n⚠️ Interrupted by user. Exiting...")
break
except Exception as e:
logger.error(f"❌ Error in daemon loop: {e}")
logger.info(f"😴 Waiting {interval}s before retry...")
await asyncio.sleep(interval)
async def main():
"""Main entry point."""
parser = argparse.ArgumentParser(description="Federal APIs Warm-up Job")
parser.add_argument(
"--daemon",
action="store_true",
help="Run continuously in daemon mode"
)
parser.add_argument(
"--interval",
type=int,
default=300,
help="Interval between cycles in seconds (default: 300)"
)
parser.add_argument(
"--backend-url",
type=str,
default="http://localhost:8000",
help="Backend URL (default: http://localhost:8000)"
)
args = parser.parse_args()
# Update backend URL if provided
if args.backend_url != "http://localhost:8000":
global BACKEND_URL
BACKEND_URL = args.backend_url
# Update all endpoint URLs
for endpoint in WARMUP_ENDPOINTS:
endpoint["url"] = endpoint["url"].replace(
"http://localhost:8000",
args.backend_url
)
if args.daemon:
await daemon_mode(interval=args.interval)
else:
await warmup_cycle()
if __name__ == "__main__":
try:
asyncio.run(main())
except KeyboardInterrupt:
logger.info("\n👋 Goodbye!")
sys.exit(0)
File to Create¶
SPRINT 3.2: Create the Systemd File (optional, 5 min)¶
To run it as a Linux service¶
# Create the file /etc/systemd/system/cidadao-warmup.service
sudo nano /etc/systemd/system/cidadao-warmup.service
[Unit]
Description=Cidadão.AI Federal APIs Warm-up Job
After=network.target
[Service]
Type=simple
User=anderson-henrique
WorkingDirectory=/home/anderson-henrique/Documentos/cidadao.ai/cidadao.ai-backend
ExecStart=/home/anderson-henrique/Documentos/cidadao.ai/cidadao.ai-backend/venv/bin/python \
scripts/monitoring/warmup_federal_apis.py --daemon --interval 300
Restart=always
RestartSec=10
[Install]
WantedBy=multi-user.target
# Enable and start the service
sudo systemctl enable cidadao-warmup.service
sudo systemctl start cidadao-warmup.service
sudo systemctl status cidadao-warmup.service
SPRINT 3.3: Test the Warm-up (5 min)¶
Test Commands¶
# 1. Run once
python scripts/monitoring/warmup_federal_apis.py
# Should print:
# ✅ IBGE States: 200 (0.45s)
# ✅ IBGE Municipalities (RJ): 200 (0.32s)
# ✅ DataSUS Search: 200 (0.78s)
# ✅ INEP Search (RJ): 200 (0.56s)
# 2. Test daemon mode for 1 minute
timeout 60 python scripts/monitoring/warmup_federal_apis.py --daemon --interval 30
# 3. Check the generated metrics
curl http://localhost:8000/health/metrics | grep federal_api_requests_total
# 4. Check Grafana
# Open http://localhost:3000
# The Federal APIs dashboard should now show data!
Success Validation¶
- Script runs without errors
- All endpoints are called
- Metrics appear in /health/metrics
- Grafana dashboard shows data
- Prometheus metrics are increasing
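For the metrics check, a few lines of Python can confirm that a counter is present in the `/health/metrics` output. A minimal sketch, assuming the standard Prometheus text exposition format; the sample payload below is illustrative, not real backend output:

```python
def metric_value(exposition: str, metric: str):
    """Return the first sample value for `metric` in Prometheus text format."""
    for line in exposition.splitlines():
        if line.startswith(metric):
            # Sample lines look like: name{labels} value
            return float(line.rsplit(" ", 1)[-1])
    return None

# Illustrative exposition snippet (not real output)
sample = (
    "# HELP federal_api_requests_total Total Federal API requests\n"
    "# TYPE federal_api_requests_total counter\n"
    'federal_api_requests_total{api_name="ibge"} 42\n'
)
print(metric_value(sample, "federal_api_requests_total"))  # 42.0
```

Running the warm-up twice and comparing the two values is a quick way to confirm the counter is actually growing.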
SPRINT 3.4: Commit Warm-up Job (5 min)¶
git add scripts/monitoring/warmup_federal_apis.py
git commit -m "feat(monitoring): add Federal APIs warm-up job
Create automated warm-up job to maintain metrics:
Features:
- Periodically calls all Federal API endpoints
- Keeps Prometheus metrics always updated
- Validates API availability continuously
- Supports one-time and daemon modes
Usage:
# Run once
python scripts/monitoring/warmup_federal_apis.py
# Run continuously (5 min interval)
python scripts/monitoring/warmup_federal_apis.py --daemon
# Custom interval
python scripts/monitoring/warmup_federal_apis.py --daemon --interval 600
Implementation:
- Async HTTP calls with httpx
- Detailed logging with timestamps
- Error handling and retries
- Cycle summary with success/error counts
- Can run as systemd service
Benefits:
- Dashboard always shows current data
- Early detection of API issues
- Pre-warmed caches improve performance
- Automated health monitoring
Related to: MONITORING_TEST_RESULTS.md - Sprint 3"
🔔 SPRINT 4: Prometheus Alerts¶
Duration: 20 minutes Priority: Medium Objective: Configure alert rules for proactive monitoring
Context¶
We need alerts that detect problems before they affect users:
- High error rate (>5%)
- High latency (P95 > 5s)
- Low cache hit rate (<50%)
- Unavailable APIs
SPRINT 4.1: Create the Alert Rules (10 min)¶
Surgical Checklist¶
- Create the file monitoring/prometheus/alerts.yml
- Define error-rate alerts
- Define latency alerts
- Define cache alerts
- Define availability alerts
- Add labels and annotations
- Document thresholds
Detailed Implementation¶
# monitoring/prometheus/alerts.yml
#
# Prometheus Alert Rules for Cidadão.AI
#
# Author: Anderson Henrique da Silva
# Location: Minas Gerais, Brasil
# Date: 2025-10-13

groups:
  - name: federal_apis_alerts
    interval: 30s
    rules:
      # Alert: High Error Rate
      - alert: FederalAPIHighErrorRate
        expr: |
          (
            sum(rate(federal_api_errors_total[5m])) by (api_name)
            /
            sum(rate(federal_api_requests_total[5m])) by (api_name)
          ) > 0.05
        for: 2m
        labels:
          severity: warning
          component: federal_apis
        annotations:
          summary: "High error rate on {{ $labels.api_name }}"
          description: |
            {{ $labels.api_name }} API has error rate of {{ $value | humanizePercentage }}
            over the last 5 minutes (threshold: 5%).
          runbook_url: https://docs.cidadao.ai/runbooks/high-error-rate

      # Alert: Critical Error Rate
      - alert: FederalAPICriticalErrorRate
        expr: |
          (
            sum(rate(federal_api_errors_total[5m])) by (api_name)
            /
            sum(rate(federal_api_requests_total[5m])) by (api_name)
          ) > 0.25
        for: 1m
        labels:
          severity: critical
          component: federal_apis
        annotations:
          summary: "CRITICAL: {{ $labels.api_name }} failing"
          description: |
            {{ $labels.api_name }} API has error rate of {{ $value | humanizePercentage }}
            over the last 5 minutes (threshold: 25%).
            IMMEDIATE ACTION REQUIRED!
          runbook_url: https://docs.cidadao.ai/runbooks/critical-error-rate

      # Alert: High Latency (P95)
      - alert: FederalAPIHighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(federal_api_request_duration_seconds_bucket[5m])) by (api_name, le)
          ) > 5
        for: 3m
        labels:
          severity: warning
          component: federal_apis
        annotations:
          summary: "High latency on {{ $labels.api_name }}"
          description: |
            {{ $labels.api_name }} API P95 latency is {{ $value | humanizeDuration }}
            (threshold: 5s). Users may experience slow responses.
          runbook_url: https://docs.cidadao.ai/runbooks/high-latency

      # Alert: Very High Latency
      - alert: FederalAPIVeryHighLatency
        expr: |
          histogram_quantile(0.95,
            sum(rate(federal_api_request_duration_seconds_bucket[5m])) by (api_name, le)
          ) > 10
        for: 1m
        labels:
          severity: critical
          component: federal_apis
        annotations:
          summary: "CRITICAL: Extreme latency on {{ $labels.api_name }}"
          description: |
            {{ $labels.api_name }} API P95 latency is {{ $value | humanizeDuration }}
            (threshold: 10s). Service severely degraded!
          runbook_url: https://docs.cidadao.ai/runbooks/extreme-latency

      # Alert: Low Cache Hit Rate
      - alert: FederalAPILowCacheHitRate
        expr: |
          (
            sum(rate(federal_api_cache_operations_total{operation="hit"}[10m])) by (api_name)
            /
            sum(rate(federal_api_cache_operations_total[10m])) by (api_name)
          ) < 0.50
        for: 5m
        labels:
          severity: info
          component: federal_apis
        annotations:
          summary: "Low cache hit rate on {{ $labels.api_name }}"
          description: |
            {{ $labels.api_name }} cache hit rate is {{ $value | humanizePercentage }}
            (threshold: 50%). Consider adjusting TTL or cache strategy.
          runbook_url: https://docs.cidadao.ai/runbooks/low-cache-hit

      # Alert: API Down
      - alert: FederalAPIDown
        expr: |
          up{job="cidadao-ai-backend"} == 0
        for: 1m
        labels:
          severity: critical
          component: backend
        annotations:
          summary: "Backend API is DOWN"
          description: |
            Backend API ({{ $labels.instance }}) is not responding.
            All Federal APIs are unavailable!
          runbook_url: https://docs.cidadao.ai/runbooks/api-down

      # Alert: High Retry Rate
      - alert: FederalAPIHighRetryRate
        expr: |
          sum(rate(federal_api_retries_total[5m])) by (api_name) > 1
        for: 3m
        labels:
          severity: warning
          component: federal_apis
        annotations:
          summary: "High retry rate on {{ $labels.api_name }}"
          description: |
            {{ $labels.api_name }} is experiencing {{ $value }} retries/sec
            over the last 5 minutes. Upstream API may be unstable.
          runbook_url: https://docs.cidadao.ai/runbooks/high-retry-rate

      # Alert: Excessive Active Requests
      - alert: FederalAPIExcessiveActiveRequests
        expr: |
          federal_api_active_requests > 20
        for: 2m
        labels:
          severity: warning
          component: federal_apis
        annotations:
          summary: "Many concurrent requests to {{ $labels.api_name }}"
          description: |
            {{ $labels.api_name }} has {{ $value }} active requests.
            May indicate slow responses or traffic spike.
          runbook_url: https://docs.cidadao.ai/runbooks/excessive-requests

  - name: system_alerts
    interval: 30s
    rules:
      # Alert: Prometheus Scrape Failures
      - alert: PrometheusScrapeFailing
        expr: |
          up{job="cidadao-ai-backend"} == 0
          or
          up{job="prometheus"} == 0
        for: 2m
        labels:
          severity: critical
          component: monitoring
        annotations:
          summary: "Prometheus cannot scrape {{ $labels.job }}"
          description: |
            Prometheus failed to scrape {{ $labels.job }} at {{ $labels.instance }}
            for more than 2 minutes. Metrics may be stale!
          runbook_url: https://docs.cidadao.ai/runbooks/scrape-failure

      # Alert: Grafana Down
      - alert: GrafanaDown
        expr: |
          up{job="grafana"} == 0
        for: 5m
        labels:
          severity: warning
          component: monitoring
        annotations:
          summary: "Grafana dashboard is DOWN"
          description: |
            Grafana is not responding. Dashboards unavailable.
          runbook_url: https://docs.cidadao.ai/runbooks/grafana-down
File to Create¶
SPRINT 4.2: Update prometheus.yml (5 min)¶
Verify that alerts.yml is being loaded¶
# Check whether prometheus.yml already has rule_files
grep -n "rule_files" monitoring/prometheus/prometheus.yml
If it is missing, or does not include alerts.yml, add:
# In monitoring/prometheus/prometheus.yml
rule_files:
  - "rules/*.yml"
  - "alerts.yml"  # ← Make sure this line exists
Validate the configuration¶
# Validate the prometheus.yml syntax
# (the image's default entrypoint is prometheus itself, so point it at promtool)
docker run --rm -v $(pwd)/monitoring/prometheus:/etc/prometheus \
  --entrypoint=/bin/promtool \
  prom/prometheus:v2.49.1 \
  check config /etc/prometheus/prometheus.yml
# Should return: SUCCESS
SPRINT 4.3: Reload Prometheus (3 min)¶
Reload Commands¶
# Option 1: Reload via API (recommended; requires Prometheus started
# with --web.enable-lifecycle)
curl -X POST http://localhost:9090/-/reload
# Option 2: Restart the container
docker restart cidadao-prometheus
# Option 3: Docker compose restart
sudo docker-compose -f config/docker/docker-compose.monitoring-minimal.yml restart prometheus
Validation¶
# 1. Verify that Prometheus reloaded
curl http://localhost:9090/api/v1/status/config | jq '.status'
# Should return: "success"
# 2. Check the loaded rules
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].name'
# Should include: "federal_apis_alerts" and "system_alerts"
# 3. Open the Prometheus UI
# http://localhost:9090/alerts
# Should list all the defined alerts
SPRINT 4.4: Commit the Alerts (2 min)¶
git add monitoring/prometheus/alerts.yml
git add monitoring/prometheus/prometheus.yml  # if modified
git commit -m "feat(monitoring): add comprehensive Prometheus alert rules
Configure proactive monitoring with 11 alert rules:
Federal APIs Alerts:
- High Error Rate (>5% for 2min) - WARNING
- Critical Error Rate (>25% for 1min) - CRITICAL
- High Latency P95 (>5s for 3min) - WARNING
- Very High Latency P95 (>10s for 1min) - CRITICAL
- Low Cache Hit Rate (<50% for 5min) - INFO
- High Retry Rate (>1/s for 3min) - WARNING
- Excessive Active Requests (>20 for 2min) - WARNING
System Alerts:
- API Down (>1min) - CRITICAL
- Prometheus Scrape Failing (>2min) - CRITICAL
- Grafana Down (>5min) - WARNING
Features:
- Severity levels: INFO, WARNING, CRITICAL
- Detailed annotations with thresholds
- Runbook URLs for incident response
- Appropriate `for:` durations to avoid flapping
- Component labels for routing
Next Steps:
- Configure Alertmanager for notifications
- Add Slack/Email integrations
- Create runbook documentation
Related to: MONITORING_TEST_RESULTS.md - Sprint 4"
✅ SPRINT 5: Final Validation and Documentation¶
Duration: 15 minutes Priority: High Objective: Test everything end to end and document it
SPRINT 5.1: End-to-End Tests (10 min)¶
Complete Validation Checklist¶
Backend and APIs
- [ ] Backend running without errors
- [ ] All REST endpoints respond
- [ ] Federal APIs free of bugs (IBGE, INEP, DataSUS)
- [ ] Metrics being generated at /health/metrics
Monitoring
- [ ] Prometheus collecting metrics
- [ ] Grafana showing data on the dashboard
- [ ] Alerts loaded in Prometheus
- [ ] Warm-up job populating metrics
Automated Tests
- [ ] Manual tests pass (tests/manual/federal_apis/)
- [ ] Test coverage maintained or increased
- [ ] No tests broken by the changes
Full Validation Commands¶
# 1. Check the backend
curl http://localhost:8000/health
curl http://localhost:8000/api/v1/federal/ibge/states | jq '.success'
# 2. Generate metrics with the warm-up
python scripts/monitoring/warmup_federal_apis.py
# 3. Check the metrics
curl http://localhost:8000/health/metrics | grep federal_api | head -20
# 4. Check the Prometheus targets
curl http://localhost:9090/api/v1/targets | jq '.data.activeTargets[].health'
# 5. Check the alerts
curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[].name'
# 6. Run the tests
python -m pytest tests/manual/federal_apis/ -v
python -m pytest tests/unit/services/transparency_apis/ -v --cov
# 7. Check Grafana
# Open http://localhost:3000
# The "Federal APIs Monitoring" dashboard should show real data!
SPRINT 5.2: Update the Documentation (5 min)¶
Files to Update¶
1. README.md - Monitoring section
## 📊 Monitoring & Observability
### Stack
- **Prometheus**: Metrics collection and alerting
- **Grafana**: Visualization dashboards
- **Federal APIs Dashboard**: Real-time monitoring
### Quick Start
```bash
# Start monitoring stack
sudo docker-compose -f config/docker/docker-compose.monitoring-minimal.yml up -d
# Start backend
make run-dev
# Run warm-up job (keeps metrics updated)
python scripts/monitoring/warmup_federal_apis.py --daemon
```
Accessing Dashboards¶
- Grafana: http://localhost:3000 (admin/cidadao123)
- Prometheus: http://localhost:9090
- Metrics: http://localhost:8000/health/metrics
Federal APIs Endpoints¶
# IBGE
GET /api/v1/federal/ibge/states
POST /api/v1/federal/ibge/municipalities
POST /api/v1/federal/ibge/population
# DataSUS
POST /api/v1/federal/datasus/search
POST /api/v1/federal/datasus/indicators
# INEP
POST /api/v1/federal/inep/search-institutions
POST /api/v1/federal/inep/indicators
Alerts Configured¶
- High error rate (>5%)
- High latency P95 (>5s)
- Low cache hit rate (<50%)
- API unavailability
- Prometheus scrape failures
User Requests
    ↓
Backend API (FastAPI)
    ↓
Federal APIs (IBGE, DataSUS, INEP)
    ↓
Prometheus Metrics
    ↓
Grafana Dashboards
**2. Create docs/monitoring/README.md**

# Monitoring Guide

Complete guide for Cidadão.AI monitoring infrastructure.

## Architecture

## Dashboards

### Federal APIs Monitoring
- **Location**: Grafana > Dashboards > Federal APIs Monitoring
- **UID**: `federal-apis`
- **Panels**: 17 visualization panels

#### Key Metrics
1. **Request Rate**: Requests/sec per API
2. **Error Rate**: Percentage of failed requests
3. **Latency**: P50, P95, P99 response times
4. **Cache Performance**: Hit rate, operations
5. **Retry Rate**: Upstream API stability

## Alerts

### Severity Levels
- **INFO**: Informational, no action required
- **WARNING**: Requires attention, not urgent
- **CRITICAL**: Immediate action required

### Alert Rules

#### FederalAPIHighErrorRate (WARNING)
- **Threshold**: >5% errors for 2 minutes
- **Action**: Check logs, investigate errors

#### FederalAPICriticalErrorRate (CRITICAL)
- **Threshold**: >25% errors for 1 minute
- **Action**: IMMEDIATE - Check backend health, upstream APIs

#### FederalAPIHighLatency (WARNING)
- **Threshold**: P95 >5s for 3 minutes
- **Action**: Check backend performance, database queries

## Warm-up Job

### Purpose
Maintains metrics by periodically calling Federal API endpoints.

### Usage
```bash
# Run once
python scripts/monitoring/warmup_federal_apis.py

# Run continuously (5 min interval)
python scripts/monitoring/warmup_federal_apis.py --daemon

# Custom interval (10 min)
python scripts/monitoring/warmup_federal_apis.py --daemon --interval 600
```
As Systemd Service¶
Troubleshooting¶
No data in Grafana¶
- Check backend is running: `curl http://localhost:8000/health`
- Check Prometheus scraping: http://localhost:9090/targets
- Generate metrics: `python scripts/monitoring/warmup_federal_apis.py`
- Wait 15-30s for Prometheus to collect
Alerts always firing¶
- Check thresholds are appropriate for your load
- Adjust the `for:` duration in alerts.yml
- Verify upstream APIs are stable
High error rate¶
- Check backend logs
- Verify Federal API availability
- Check network connectivity
- Review MONITORING_TEST_RESULTS.md for known issues
Related Files¶
- `monitoring/prometheus/prometheus.yml` - Scrape config
- `monitoring/prometheus/alerts.yml` - Alert rules
- `monitoring/grafana/dashboards/federal-apis-dashboard.json` - Dashboard
- `scripts/monitoring/warmup_federal_apis.py` - Warm-up job
- `MONITORING_TEST_RESULTS.md` - Test results documentation
Commit the Documentation¶
```bash
git add README.md
git add docs/monitoring/
git commit -m "docs: update monitoring documentation

Add comprehensive monitoring documentation:

README Updates:
- Add Monitoring & Observability section
- Document quick start commands
- List all Federal API endpoints
- Document configured alerts

New Documentation:
- docs/monitoring/README.md with complete guide
- Architecture diagrams
- Dashboard usage instructions
- Alert severity levels and actions
- Warm-up job documentation
- Troubleshooting guide

Covers:
- Sprint 1-5 implementations
- Federal APIs REST endpoints
- Warm-up job setup and usage
- Alert rules and responses
- Common issues and solutions

Related to: SPRINT_PLAN_2025-10-13.md completion"
```
🎯 Success Criteria¶
Must Work¶
- ✅ Workspace organized (tests in tests/manual/, scripts organized)
- ✅ Federal APIs bug-free (IBGE, INEP, DataSUS working)
- ✅ REST endpoints responding correctly
- ✅ Metrics being generated at /health/metrics
- ✅ Prometheus collecting metrics (targets UP)
- ✅ Grafana showing data on the dashboard
- ✅ Warm-up job running without errors
- ✅ Alerts loaded in Prometheus
- ✅ Tests passing (manual and unit)
- ✅ Documentation updated
Quality Metrics¶
- Test coverage: maintained or increased (target: >80%)
- Zero critical bugs introduced
- All Federal APIs operational
- P95 response time < 5s
- Error rate < 5%
- Cache hit rate > 50%
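The latency and cache targets can also be sanity-checked offline from raw samples. A rough sketch using a nearest-rank P95; all numbers here are illustrative, not real measurements:

```python
def p95(samples):
    """Nearest-rank P95: the value below which ~95% of samples fall."""
    ordered = sorted(samples)
    idx = min(len(ordered) - 1, max(0, round(0.95 * len(ordered)) - 1))
    return ordered[idx]

# Illustrative numbers, not real measurements
latencies = [0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.1, 1.3, 2.0, 4.2]  # seconds
requests, errors = 1000, 30
cache_hits, cache_ops = 620, 1000

assert p95(latencies) < 5.0           # P95 response time target
assert errors / requests < 0.05       # error-rate target
assert cache_hits / cache_ops > 0.50  # cache hit-rate target
```

In practice these checks come from the Prometheus queries already used by the dashboard; the sketch just makes the thresholds concrete.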
Documented Deliverables¶
- SPRINT_PLAN_2025-10-13.md (this file)
- Descriptive commits in English
- README.md updated
- docs/monitoring/README.md created
- MONITORING_TEST_RESULTS.md (already exists)
📊 Summary Timeline¶
| Phase | Description | Duration | Status |
|---|---|---|---|
| PHASE 0 | Workspace Organization | 30 min | ⏳ Pending |
| SPRINT 1 | Bug Fixes | 40 min | ⏳ Pending |
| SPRINT 2 | REST Endpoints | 30 min | ⏳ Pending |
| SPRINT 3 | Warm-up Job | 20 min | ⏳ Pending |
| SPRINT 4 | Prometheus Alerts | 20 min | ⏳ Pending |
| SPRINT 5 | Validation & Docs | 15 min | ⏳ Pending |
| TOTAL | | 2h 15min | 0% Complete |
🚀 How to Execute This Plan¶
Prerequisites¶
# 1. Backend running
make run-dev
# 2. Monitoring stack running
sudo docker-compose -f config/docker/docker-compose.monitoring-minimal.yml up -d
# 3. Virtual environment active
source venv/bin/activate
Execution¶
# 1. Open this plan in your editor
code SPRINT_PLAN_2025-10-13.md
# 2. Follow each section sequentially
# 3. Tick the checkboxes as you progress
# 4. Run the listed commands
# 5. Validate each step before moving to the next
# 6. Commit at the end of each sprint
Tips¶
- ⏱️ Use a timer to keep focus and pace
- ✅ Tick checkboxes as you complete items
- 🔍 Validate each step before moving on
- 📝 Write descriptive commits (in English)
- 🐛 If you find bugs, document them and keep going
- ⚡ Ask for help if stuck on something for >15 min
📊 APPENDIX A: A/B Testing Framework¶
Discovered during Sprint Planning - 2025-10-13
✅ Framework Already Implemented!¶
The backend ships with a COMPLETE A/B testing framework for testing ML model variations in production.
Location: src/ml/ab_testing.py (514 lines)
1. Estratégias de Alocação (4 tipos)¶
# RANDOM - Alocação aleatória com consistência por usuário
# WEIGHTED - Distribuição ponderada de tráfego
# EPSILON_GREEDY - Explore vs Exploit (10% exploração)
# THOMPSON_SAMPLING - Abordagem Bayesiana com Beta distributions
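The per-user consistency of the RANDOM strategy is typically achieved with hash bucketing. The sketch below is illustrative only, not the project's actual code in src/ml/ab_testing.py:

```python
import hashlib

def allocate(test_name: str, user_id: str, split: float = 0.5) -> str:
    """Hash-based bucketing: 'random' across users, but the same user
    always lands on the same variant for a given test."""
    digest = hashlib.sha256(f"{test_name}:{user_id}".encode()).digest()
    bucket = int.from_bytes(digest[:8], "big") / 2**64  # uniform in [0, 1)
    return "model_a" if bucket < split else "model_b"

variant = allocate("corruption_detector_v2_test", "user-123")
# Repeated calls never flip the variant for the same user
assert all(
    allocate("corruption_detector_v2_test", "user-123") == variant
    for _ in range(100)
)
```

Because the bucket is derived from the test name as well, the same user can land on different variants in different tests, which keeps experiments independent.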
2. Automatic Statistical Analysis¶
- Chi-square test for significance
- Confidence intervals (Wilson score method)
- Automatic lift calculation
- P-value < 0.05 to determine the winner
- Auto-stop when a winner is found
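The Wilson interval and lift are standard formulas. A sketch with illustrative counts (425/500 vs 465/500, matching the example response shown later in this appendix), not taken from the framework's source:

```python
from math import sqrt

def wilson_interval(successes: int, n: int, z: float = 1.96):
    """95% Wilson score interval for a binomial proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    margin = (z / denom) * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2))
    return center - margin, center + margin

# Illustrative counts: model A 425/500 successes, model B 465/500
lo_a, hi_a = wilson_interval(425, 500)
lo_b, hi_b = wilson_interval(465, 500)
lift = (465 / 500 - 425 / 500) / (425 / 500) * 100

# Non-overlapping intervals suggest a real difference
assert hi_a < lo_b
print(f"lift: {lift:.1f}%")  # lift: 9.4%
```

The framework pairs this with a chi-square test: the intervals give an intuitive picture, while the p-value drives the auto-stop decision.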
3. Test States¶
4. Advanced Configuration¶
- Configurable minimum sample size
- Maximum test duration
- Customizable traffic split (50/50, 70/30, etc.)
- Custom success metrics
🔌 Available API Endpoints¶
# Test Management
POST /api/v1/ml/ab-test/create                # Create a new test
POST /api/v1/ml/ab-test/{test_name}/start     # Start a test
POST /api/v1/ml/ab-test/{test_name}/stop      # Stop a test
GET  /api/v1/ml/ab-test/active                # List active tests
# Allocation and Tracking
GET  /api/v1/ml/ab-test/{test_name}/allocate  # Allocate a model to a user
POST /api/v1/ml/ab-test/{test_name}/record    # Record a prediction result
# Analysis and Results
GET  /api/v1/ml/ab-test/{test_name}/status    # Test status
GET  /api/v1/ml/ab-test/{test_name}/analyze   # Statistical analysis
POST /api/v1/ml/ab-test/{test_name}/promote   # Promote the winner to production
📝 Usage Example¶
Create an A/B Test¶
POST /api/v1/ml/ab-test/create
{
"test_name": "corruption_detector_v2_test",
"model_a": {"model_id": "corruption_detector", "version": 1},
"model_b": {"model_id": "corruption_detector", "version": 2},
"allocation_strategy": "thompson_sampling",
"traffic_split": [0.5, 0.5],
"success_metric": "f1_score",
"minimum_sample_size": 1000,
"significance_level": 0.05,
"auto_stop": true,
"duration_hours": 48
}
Start the Test¶
Allocate a Model to a User¶
Returns:
Record a Result¶
POST /api/v1/ml/ab-test/corruption_detector_v2_test/record
{
"model_selection": "model_b",
"success": true,
"prediction_metadata": {
"accuracy": 0.95,
"confidence": 0.87
}
}
Check Status and Analysis¶
Returns:
{
"test_id": "ab_test_corruption_detector_v2_test_20251013_071500",
"status": "running",
"results": {
"model_a": {"predictions": 500, "successes": 425},
"model_b": {"predictions": 500, "successes": 465}
},
"latest_analysis": {
"model_a": {"conversion_rate": 0.85, "sample_size": 500},
"model_b": {"conversion_rate": 0.93, "sample_size": 500},
"p_value": 0.001,
"significant": true,
"winner": "model_b",
"lift": 9.4
}
}
Promote the Winner to Production¶
🔧 Technical Implementation¶
Persistence¶
- Redis: stores the test configuration (90-day retention)
- Key Pattern:
ab_test:{test_name}
Implemented Algorithms¶
1. Thompson Sampling (Bayesian)
# Updates Beta distributions as results arrive
# alpha++ on success, beta++ on failure
# Sample from both and pick the larger
sample_a = beta(alpha_a, beta_a)
sample_b = beta(alpha_b, beta_b)
winner = "model_a" if sample_a > sample_b else "model_b"
2. Epsilon-Greedy
# 10% exploration, 90% exploitation
if random() < 0.1:
    return random_choice(["model_a", "model_b"])
else:
    return best_performing_model
3. Chi-Square Test
# Statistical analysis with scipy
# (chi2_contingency also returns dof and expected frequencies)
chi2, p_value, _, _ = chi2_contingency(contingency_table)
significant = p_value < 0.05
📊 Monitoring Integration¶
The A/B testing framework can be integrated with the monitoring stack:
Custom Metrics for A/B Tests¶
# Add to monitoring/prometheus/alerts.yml
- alert: ABTestSampleSizeTooSmall
  expr: |
    ab_test_sample_size < 100
  for: 1h
  labels:
    severity: warning
- alert: ABTestWinnerFound
  expr: |
    ab_test_winner_confidence > 0.95
  labels:
    severity: info
Grafana Dashboard for A/B Tests¶
- Conversion rate per variant
- P-value over time
- Traffic distribution
- Time to statistical significance
- Comparative lift
💡 Use Cases¶
1. Test New ML Models¶
# Test the accuracy of a new corruption detection model
test = ab_testing.create_test(
    test_name="zumbi_anomaly_detector_v3",
    model_a=("zumbi_detector", 2),
    model_b=("zumbi_detector", 3),
    success_metric="precision"
)
2. Optimize Agents¶
# Compare agent reasoning strategies
test = ab_testing.create_test(
    test_name="anita_reasoning_strategy",
    model_a=("anita_agent", "chain_of_thought"),
    model_b=("anita_agent", "tree_of_thought"),
    allocation_strategy="epsilon_greedy"
)
3. Validate Federal API Features¶
# Test a new cache strategy
test = ab_testing.create_test(
    test_name="federal_api_cache_strategy",
    model_a=("api_cache", "lru"),
    model_b=("api_cache", "lfu"),
    success_metric="cache_hit_rate"
)
🚀 Next Steps (Optional)¶
If you want to expand the A/B testing framework in future sprints:
Future Sprint: A/B Testing Dashboard¶
- Create a Grafana panel dedicated to A/B tests
- Add custom Prometheus metrics
- Integrate with statistical-significance alerts
Future Sprint: Multi-Armed Bandit¶
- Implement a contextual bandits algorithm
- Add support for >2 variants
- Online optimization with reward feedback
Future Sprint: Feature Flags Integration¶
- Connect A/B tests with feature flags
- Gradual rollout based on confidence
- Automatic rollback on degradation
📚 References¶
Main Files:
- src/ml/ab_testing.py - Framework core (514 lines)
- src/api/routes/ml_pipeline.py - REST endpoints (lines 244-370)
- src/ml/training_pipeline.py - Model integration
Technical Documentation:
- Wilson Score Interval: https://en.wikipedia.org/wiki/Binomial_proportion_confidence_interval
- Thompson Sampling: https://en.wikipedia.org/wiki/Thompson_sampling
- Chi-Square Test: https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.chi2_contingency.html
Conclusion: the A/B testing framework is production-ready and fully functional. It requires no additional implementation, only documentation and, potentially, integration with the monitoring dashboards.
END OF SPRINT PLAN
Ready to start? 🚀