📊 Agent Metrics Dashboard

Created: 2025-10-31 Status: ✅ Implemented API Version: v1

Overview

The Agent Metrics Dashboard provides real-time performance monitoring and analytics for all 16 agents in the Cidadão.AI system. It tracks response times, success rates, error patterns, and usage statistics.

Features

✅ Implemented

  1. Real-time Metrics Collection
     - Request counts per agent
     - Response time tracking (avg, p95, p99)
     - Success/failure rates
     - Error tracking with timestamps
     - Quality score monitoring

  2. Prometheus Integration
     - Native Prometheus metrics export
     - Pre-configured Grafana dashboards
     - Time-series data collection
     - Histogram and gauge metrics

  3. API Endpoints
     - /api/v1/metrics/health - Health check
     - /api/v1/metrics/agents/summary - All agents summary
     - /api/v1/metrics/agents/{name}/stats - Individual agent stats
     - /api/v1/metrics/prometheus - Prometheus format export
     - /api/v1/metrics/reset - Reset metrics

Metrics Collected

Per-Agent Metrics

| Metric | Type | Description |
|--------|------|-------------|
| agent_requests_total | Counter | Total requests by agent, action, and status |
| agent_request_duration_seconds | Histogram | Request duration distribution |
| agent_active_requests | Gauge | Currently active requests |
| agent_error_rate | Gauge | Error rate (last 5 minutes) |
| agent_memory_usage_bytes | Gauge | Memory usage per agent |
| agent_reflection_iterations | Histogram | Reflection iterations distribution |
| agent_quality_score | Histogram | Response quality scores |
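The production service exports these as Prometheus metric objects; the stdlib sketch below mirrors the table's semantics (counter, gauge, and the 5-minute error-rate window) to show how one `record_request` call feeds several metrics. Class and method names here are illustrative, not the service's actual API.

```python
import time
from collections import defaultdict

class AgentMetricsSketch:
    """Minimal stand-in for the per-agent metrics in the table above."""

    def __init__(self):
        self.requests_total = defaultdict(int)    # (agent, action, status) -> counter
        self.active_requests = defaultdict(int)   # agent -> gauge
        self.durations = defaultdict(list)        # (agent, action) -> observed seconds
        self.failure_times = defaultdict(list)    # agent -> failure timestamps

    def record_request(self, agent, action, status, duration):
        self.requests_total[(agent, action, status)] += 1
        self.durations[(agent, action)].append(duration)
        if status == "failure":
            self.failure_times[agent].append(time.time())

    def error_rate(self, agent, window_seconds=300.0):
        """Failures within the window divided by the agent's total requests."""
        cutoff = time.time() - window_seconds
        recent_failures = sum(1 for t in self.failure_times[agent] if t >= cutoff)
        total = sum(n for (a, _, _), n in self.requests_total.items() if a == agent)
        return recent_failures / total if total else 0.0

m = AgentMetricsSketch()
m.record_request("zumbi", "analyze", "success", 1.1)
m.record_request("zumbi", "analyze", "failure", 2.0)
print(m.error_rate("zumbi"))  # 0.5
```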

System-Wide Metrics

| Metric | Description |
|--------|-------------|
| Total Requests | Sum of all agent requests |
| Overall Success Rate | System-wide success percentage |
| Average Response Time | Mean response time across all agents |
| Active Agents | Number of agents currently processing |
| Error Trends | Error rate changes over time |
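The system-wide figures are straightforward aggregates of the per-agent numbers; note that the average response time must be weighted by request volume, not averaged naively. A sketch (agent names and dict shape are illustrative):

```python
def summarize(agents):
    """Aggregate per-agent stats into the system-wide metrics listed above."""
    total_requests = sum(a["requests"] for a in agents.values())
    total_successes = sum(a["successes"] for a in agents.values())
    # Weight each agent's mean by its request count for a true system-wide mean
    avg_response_time = (
        sum(a["avg_response_time"] * a["requests"] for a in agents.values())
        / total_requests if total_requests else 0.0
    )
    return {
        "total_requests": total_requests,
        "overall_success_rate": total_successes / total_requests if total_requests else 0.0,
        "average_response_time": avg_response_time,
    }

agents = {
    "zumbi": {"requests": 1500, "successes": 1450, "avg_response_time": 1.2},
    "anita": {"requests": 500, "successes": 490, "avg_response_time": 0.8},
}
print(summarize(agents))  # overall_success_rate: 0.97, average_response_time: 1.1
```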

API Usage

Get All Agents Summary

GET /api/v1/metrics/agents/summary
Authorization: Bearer <token>

Response:
{
  "status": "success",
  "data": {
    "total_agents": 16,
    "total_requests": 10000,
    "total_successes": 9500,
    "total_failures": 500,
    "overall_success_rate": 0.95,
    "agents": {
      "zumbi": {
        "requests": 1500,
        "successes": 1450,
        "failures": 50,
        "avg_response_time": 1.2,
        "error_rate": 0.033
      },
      ...
    }
  }
}

Get Individual Agent Stats

GET /api/v1/metrics/agents/zumbi/stats
Authorization: Bearer <token>

Response:
{
  "status": "success",
  "data": {
    "agent_name": "zumbi",
    "total_requests": 1500,
    "successful_requests": 1450,
    "failed_requests": 50,
    "average_response_time": 1.2,
    "p95_response_time": 2.5,
    "p99_response_time": 3.8,
    "error_rate": 0.033,
    "quality_score": 0.85,
    "last_used": "2025-10-31T19:00:00Z",
    "actions_breakdown": {
      "anomaly_detection": 800,
      "pattern_analysis": 500,
      "fraud_detection": 200
    }
  }
}
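A consumer of this endpoint typically derives health signals from the raw stats, for example checking the success rate and the p95 latency against the KPIs in this document. A sketch using a trimmed sample of the response above (the `max_p95` threshold of 2 seconds matches the KPI section; the function name is illustrative):

```python
import json

# Trimmed sample matching the response shape shown above
raw = '''{"status": "success", "data": {"agent_name": "zumbi",
  "total_requests": 1500, "successful_requests": 1450,
  "failed_requests": 50, "p95_response_time": 2.5}}'''

def check_agent_health(payload, max_p95=2.0):
    """Parse a stats response and flag KPI violations."""
    data = json.loads(payload)["data"]
    return {
        "agent": data["agent_name"],
        "success_rate": data["successful_requests"] / data["total_requests"],
        "meets_latency_kpi": data["p95_response_time"] <= max_p95,
    }

print(check_agent_health(raw))  # zumbi misses the p95 < 2s KPI at 2.5s
```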

Prometheus Metrics Export

GET /api/v1/metrics/prometheus

Response (text/plain):
# HELP agent_requests_total Total number of agent requests
# TYPE agent_requests_total counter
agent_requests_total{agent_name="zumbi",action="analyze",status="success"} 1450.0
agent_requests_total{agent_name="zumbi",action="analyze",status="failure"} 50.0
...

# HELP agent_request_duration_seconds Agent request duration in seconds
# TYPE agent_request_duration_seconds histogram
agent_request_duration_seconds_bucket{agent_name="zumbi",action="analyze",le="0.1"} 100.0
agent_request_duration_seconds_bucket{agent_name="zumbi",action="analyze",le="0.25"} 300.0
...
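For quick debugging without Prometheus itself, the sample lines above can be read with a few lines of Python. This is a deliberately tiny reader for counter samples only, not a full exposition-format parser (it skips `# HELP`/`# TYPE` comments and splits each sample at its last space):

```python
def parse_counters(text):
    """Map 'name{labels}' -> value for counter samples in Prometheus text format."""
    out = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip HELP/TYPE comment lines
        name_labels, value = line.rsplit(" ", 1)
        out[name_labels] = float(value)
    return out

sample = """# TYPE agent_requests_total counter
agent_requests_total{agent_name="zumbi",status="success"} 1450.0
agent_requests_total{agent_name="zumbi",status="failure"} 50.0"""

print(parse_counters(sample))
```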

Grafana Dashboard

Pre-configured Panels

  1. Agent Overview
     - Total requests per agent (bar chart)
     - Success rates comparison (gauge)
     - Average response times (line chart)

  2. Performance Metrics
     - Response time percentiles (heatmap)
     - Request rate over time (area chart)
     - Error rate trends (line chart)

  3. Quality Metrics
     - Quality score distribution (histogram)
     - Reflection iterations (bar chart)
     - Memory usage (line chart)

  4. Alerts & Anomalies
     - Error spike detection
     - Performance degradation alerts
     - Memory leak warnings

Dashboard Configuration

# grafana/dashboards/agent-metrics.json
{
  "title": "Cidadão.AI Agent Metrics",
  "panels": [
    {
      "title": "Agent Request Rate",
      "targets": [
        {
          "expr": "rate(agent_requests_total[5m])",
          "legendFormat": "{{agent_name}}"
        }
      ]
    },
    {
      "title": "Response Time (p95)",
      "targets": [
        {
          "expr": "histogram_quantile(0.95, rate(agent_request_duration_seconds_bucket[5m]))",
          "legendFormat": "{{agent_name}}"
        }
      ]
    }
  ]
}

Monitoring Best Practices

Key Performance Indicators (KPIs)

  1. Availability: > 99.9% uptime per agent
  2. Response Time: p95 < 2 seconds
  3. Success Rate: > 95% for all agents
  4. Error Rate: < 1% in 5-minute windows
  5. Quality Score: > 0.8 average

Alert Thresholds

| Alert | Condition | Severity |
|-------|-----------|----------|
| High Error Rate | > 5% errors in 5 min | Critical |
| Slow Response | p95 > 5 seconds | Warning |
| Agent Down | No requests in 10 min | Critical |
| Memory Leak | Memory growth > 100 MB/hour | Warning |
| Quality Drop | Score < 0.6 | Warning |
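In production these thresholds would live in Prometheus alerting rules; as a quick in-process check, the table can also be evaluated directly against an agent's stats. The field names in the stats dict are illustrative:

```python
# Threshold rules from the table above as (name, predicate, severity) triples
ALERT_RULES = [
    ("High Error Rate", lambda s: s["error_rate_5m"] > 0.05, "critical"),
    ("Slow Response",   lambda s: s["p95_seconds"] > 5.0,    "warning"),
    ("Quality Drop",    lambda s: s["quality_score"] < 0.6,  "warning"),
]

def evaluate_alerts(stats):
    """Return (alert_name, severity) for every rule the stats violate."""
    return [(name, sev) for name, cond, sev in ALERT_RULES if cond(stats)]

print(evaluate_alerts({"error_rate_5m": 0.08, "p95_seconds": 1.2, "quality_score": 0.9}))
# [('High Error Rate', 'critical')]
```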

Performance Optimization

  1. Cache Metrics: Metrics are cached for 5 seconds to reduce overhead
  2. Batch Updates: Metrics are batched before writing
  3. Async Processing: All metrics collection is non-blocking
  4. Memory Management: Circular buffers for time-series data
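The "circular buffers" point can be realized with a bounded `collections.deque`: old samples fall off automatically, keeping memory constant regardless of uptime. A sketch (class name and `maxlen` are assumptions, not the service's actual implementation):

```python
import time
from collections import deque

class TimeSeriesBuffer:
    """Bounded buffer of (timestamp, value) samples for time-series metrics."""

    def __init__(self, maxlen=1000):
        # deque with maxlen drops the oldest sample when full: a circular buffer
        self._samples = deque(maxlen=maxlen)

    def add(self, value, ts=None):
        self._samples.append((ts if ts is not None else time.time(), value))

    def window(self, seconds, now=None):
        """Values observed within the last `seconds`."""
        now = now if now is not None else time.time()
        return [v for t, v in self._samples if t >= now - seconds]

buf = TimeSeriesBuffer(maxlen=3)
for i in range(5):
    buf.add(float(i), ts=float(i))  # samples 0 and 1 are evicted
print(buf.window(10, now=4.0))  # [2.0, 3.0, 4.0]
```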

Integration with Agent Pool

The metrics service automatically integrates with the agent pool to collect:

# Automatic instrumentation in agent_pool.py
import time

async def execute_agent(agent_name: str, message: AgentMessage):
    start_time = time.time()

    # Track active requests
    agent_metrics_service.increment_active_requests(agent_name)

    try:
        result = await agent.process(message)

        # Record success
        agent_metrics_service.record_request(
            agent_name=agent_name,
            action=message.action,
            status="success",
            duration=time.time() - start_time,
            quality_score=result.quality_score
        )
        return result

    except Exception as e:
        # Record failure, then re-raise so callers still see the error
        agent_metrics_service.record_request(
            agent_name=agent_name,
            action=message.action,
            status="failure",
            duration=time.time() - start_time,
            error=str(e)
        )
        raise

    finally:
        agent_metrics_service.decrement_active_requests(agent_name)

Testing

Unit Tests

# Run metrics tests
pytest tests/unit/api/test_agent_metrics.py -v

# Test coverage
pytest tests/unit/api/test_agent_metrics.py --cov=src.services.agent_metrics

Load Testing

# Simulate high load
locust -f tests/load/metrics_load_test.py --host=http://localhost:8000

Manual Testing

# Check metrics health
curl http://localhost:8000/api/v1/metrics/health

# Get Prometheus metrics
curl http://localhost:8000/api/v1/metrics/prometheus

# Get agent summary (requires auth)
curl -H "Authorization: Bearer $TOKEN" \
  http://localhost:8000/api/v1/metrics/agents/summary

Future Enhancements

  1. Advanced Analytics
     - Trend prediction using ML
     - Anomaly detection algorithms
     - Correlation analysis between agents

  2. Custom Dashboards
     - User-configurable dashboards
     - Export to PDF reports
     - Email alerts

  3. Historical Data
     - Long-term storage in TimescaleDB
     - Data retention policies
     - Historical comparisons

  4. Integration
     - Datadog export
     - New Relic integration
     - Custom webhook alerts