Monitoring
All ai.doo services expose health check and Prometheus metrics endpoints for observability.
Health Check Endpoints
| Service | Endpoint | Port | Healthy Response |
|---|---|---|---|
| Ollama | GET /api/tags | 11434 | 200 with model list |
| Hub | GET /health | 2000 | 200 {"status": "ok"} |
| PIKA | GET /health | 8000 | 200 {"status": "ok"} |
| VERA API | GET /health | 4000 | 200 {"status": "ok"} |
Quick check from the host:
curl -sf http://localhost:2000/health && echo "Hub OK"
curl -sf http://localhost:8000/health && echo "PIKA OK"
curl -sf http://localhost:4000/health && echo "VERA OK"
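Docker Healthchecks
Each service's docker-compose.yml should include a healthcheck so Docker can detect failures and report them as container health status. On its own, Docker Compose only marks a failing container as unhealthy; restarting it automatically requires an orchestrator such as Docker Swarm or a separate watchdog. The checks below run inside the container, so they assume curl is installed in each image.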
# Example for Hub
services:
  hub:
    image: ghcr.io/aidoo-biz/hub:latest
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:2000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 15s

# Example for PIKA
services:
  pika:
    image: ghcr.io/aidoo-biz/pika:latest
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s

# Example for VERA
services:
  backend:
    image: ghcr.io/aidoo-biz/vera-backend:latest
    healthcheck:
      test: ["CMD", "curl", "-sf", "http://localhost:4000/health"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
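The health check table also lists Ollama, which the examples above do not cover. The following is a minimal sketch, assuming the official ollama/ollama image and a service named ollama (both are assumptions, not taken from the ai.doo compose files). Because that image may not ship curl, the check runs the ollama list CLI command, which talks to the local server and fails when it is unreachable.

```yaml
# Hypothetical example for Ollama; service and image names are assumptions
services:
  ollama:
    image: ollama/ollama:latest
    healthcheck:
      # `ollama list` exits non-zero when the local server is not responding
      test: ["CMD", "ollama", "list"]
      interval: 30s
      timeout: 10s
      retries: 3
      start_period: 30s
```

If curl is available in the image you use, a check against http://localhost:11434/api/tags in the same style as the other services would work equally well.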
Tip
Use docker ps to see health status. Containers show (healthy), (unhealthy), or (health: starting) next to their status.
Prometheus Metrics
Each service exposes metrics in Prometheus exposition format at /metrics.
| Service | Endpoint |
|---|---|
| Hub | http://hub:2000/metrics |
| PIKA | http://pika:8000/metrics |
| VERA API | http://vera-backend:4000/metrics |
Key Metrics
| Metric | Type | Description |
|---|---|---|
| http_requests_total | counter | Total HTTP requests by method, path, and status code |
| http_request_duration_seconds | histogram | Request latency distribution |
| auth_attempts_total | counter | Login attempts by result (success, failure, locked) |
| auth_lockouts_total | counter | Accounts locked due to failed attempts |
| model_pull_total | counter | Model pull operations by status |
| model_pull_duration_seconds | histogram | Time to pull a model |
| active_users | gauge | Currently active user sessions |
| license_seats_used | gauge | Number of seats consumed |
| license_days_remaining | gauge | Days until license expiry (-1 if unlicensed) |
Prometheus Configuration
Add the ai.doo targets to your prometheus.yml:
scrape_configs:
  - job_name: aidoo-hub
    static_configs:
      - targets: ["hub:2000"]
  - job_name: aidoo-pika
    static_configs:
      - targets: ["pika:8000"]
  - job_name: aidoo-vera
    static_configs:
      - targets: ["vera-backend:4000"]
Note
If Prometheus runs outside Docker, use the host-mapped ports (e.g. localhost:2000). If it runs on the same ollama_network, use the service names as shown above.
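For the first case in the note, here is a sketch of the same configuration for a Prometheus instance running on the host. It assumes each service publishes its port to the host under the same number, which your compose files may or may not do. The scrape_interval value and the service label are illustrative additions, not part of the ai.doo defaults.

```yaml
global:
  scrape_interval: 15s        # illustrative value, tune as needed

scrape_configs:
  - job_name: aidoo-hub
    metrics_path: /metrics    # Prometheus default, shown for clarity
    static_configs:
      - targets: ["localhost:2000"]
        labels:
          service: hub        # hypothetical label for grouping in dashboards
  - job_name: aidoo-pika
    static_configs:
      - targets: ["localhost:8000"]
        labels:
          service: pika
  - job_name: aidoo-vera
    static_configs:
      - targets: ["localhost:4000"]
        labels:
          service: vera
```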
Grafana Dashboard
Setup
- Add Prometheus as a data source in Grafana (http://prometheus:9090).
- Import or create a dashboard with the panels below.
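The data source can also be provisioned from a file instead of through the UI. This is a minimal sketch, assuming Grafana loads provisioning files from its provisioning/datasources directory and that Prometheus is reachable at the URL given above. The file name and the data source name are arbitrary choices.

```yaml
# e.g. provisioning/datasources/prometheus.yml (file name is an assumption)
apiVersion: 1

datasources:
  - name: Prometheus
    type: prometheus
    access: proxy             # Grafana's backend proxies the queries
    url: http://prometheus:9090
    isDefault: true
```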
Recommended Panels
| Panel | Query | Visualisation |
|---|---|---|
| Request rate | rate(http_requests_total[5m]) | Time series, grouped by service |
| Error rate | rate(http_requests_total{status=~"5.."}[5m]) | Time series |
| P95 latency | histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) | Time series, by service |
| Auth failures | rate(auth_attempts_total{result="failure"}[5m]) | Time series |
| Active users | active_users | Stat |
| License seats | license_seats_used | Gauge (max = license seat limit) |
| License expiry | license_days_remaining | Stat with thresholds (red < 30) |
| Model pulls | increase(model_pull_total[24h]) | Stat |
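Several panels are described as grouped by service, which means the raw series have to be aggregated. One possible approach, sketched here as Prometheus recording rules, is to aggregate by the job label that Prometheus attaches to every scraped target. The group and rule names are hypothetical, and the same expressions can be pasted directly into a Grafana panel instead of being recorded.

```yaml
groups:
  - name: aidoo-dashboard          # hypothetical group name
    rules:
      # Request rate per scrape job (one job per service)
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))
      # Share of requests that returned a 5xx status
      - record: job:http_requests_errors:ratio_rate5m
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
      # P95 latency per job; the le label must be kept for histogram_quantile
      - record: job:http_request_duration_seconds:p95_5m
        expr: |
          histogram_quantile(
            0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )
```

To load rules like these, the file would be listed under rule_files in prometheus.yml.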
Alerts
Consider setting up Grafana alerts for:
- Service down: health check returning non-200 for more than 2 minutes.
- High error rate: 5xx rate exceeds 5% of total requests over 5 minutes.
- Auth brute force: auth_lockouts_total increases by more than 3 in 10 minutes.
- License expiring: license_days_remaining falls below 30.
- Disk space: Ollama models volume exceeding 80% capacity.
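The first four conditions can be written as PromQL expressions. The sketch below uses Prometheus alerting rules rather than Grafana-managed alerts, only because the rule file format is compact; the expressions can be reused in Grafana alert queries. It rests on a few assumptions: up (scrape success of /metrics) stands in for the health check, the job names match the scrape configuration shown earlier, and the alert names are hypothetical. Disk space is left out because none of the listed metrics cover it, so it would need a separate exporter.

```yaml
groups:
  - name: aidoo-alerts             # hypothetical group name
    rules:
      - alert: ServiceDown
        # up is 0 when Prometheus cannot scrape the target
        expr: up{job=~"aidoo-.*"} == 0
        for: 2m
      - alert: HighErrorRate
        # 5xx share of all requests over the last 5 minutes
        expr: |
          sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))
            /
          sum by (job) (rate(http_requests_total[5m]))
            > 0.05
      - alert: AuthBruteForce
        expr: increase(auth_lockouts_total[10m]) > 3
      - alert: LicenseExpiring
        # -1 means unlicensed, so only non-negative values are considered
        expr: license_days_remaining >= 0 and license_days_remaining < 30
```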