
# Metrics, dashboards & alerts

The FERAL Brain exposes a Prometheus-compatible /metrics endpoint backed by a named registry in feral-core/observability/metrics.py. A default Grafana dashboard and Prometheus alert rules ship in-tree under ops/.

## Endpoint

GET /metrics returns text in the standard Prometheus 0.0.4 exposition format. Two environment switches gate it:
| Variable | Default | Effect |
| --- | --- | --- |
| `FERAL_METRICS_ENDPOINT` | `1` (on) | Kill switch. Set to `0`/`false`/`off` to silence both the endpoint and every `emit()` call. |
| `FERAL_METRICS_PUBLIC` | `0` (off) | Off-loopback callers receive 404 unless this is set to `1`. Loopback (`127.0.0.1`, `::1`, `localhost`) is always allowed when the kill switch is on. |
The off-loopback default returns 404 (not 401/403) so the response is indistinguishable from “endpoint not mounted” — preserving the pre-W13 public-internet behaviour for unconfigured installs.
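The two switches combine into a single allow/deny decision per request. A minimal sketch of that logic, with hypothetical helper names (the real check lives in the `api/server.py` middleware):

```python
import ipaddress
import os

_FALSY = {"0", "false", "off"}
_TRUTHY = {"1", "true", "on"}

def metrics_enabled() -> bool:
    """Kill switch: FERAL_METRICS_ENDPOINT defaults to on."""
    return os.environ.get("FERAL_METRICS_ENDPOINT", "1").lower() not in _FALSY

def metrics_allowed(client_host: str) -> bool:
    """Decide whether this caller may scrape /metrics (False maps to a 404)."""
    if not metrics_enabled():
        return False  # endpoint silenced entirely
    if os.environ.get("FERAL_METRICS_PUBLIC", "0").lower() in _TRUTHY:
        return True  # explicitly opened to off-loopback callers
    if client_host == "localhost":
        return True  # loopback is always allowed while the kill switch is on
    try:
        return ipaddress.ip_address(client_host).is_loopback
    except ValueError:
        return False  # unparseable host: fail closed into the 404 path
```

Note the fail-closed default: anything that is not provably loopback gets the same 404 as an unmounted endpoint.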

## Registered metrics

Every metric below is defined in feral-core/observability/metrics.py and referenced by either the bundled dashboard, the alert rules, or both. The tests/test_metrics_registry.py suite enforces that contract.
| Metric | Type | Labels | Emitter (owning workstream) |
| --- | --- | --- | --- |
| `feral_http_requests_total` | counter | `method`, `route`, `status` | `api/server.py` HTTP middleware (W13) |
| `feral_http_request_duration_seconds` | histogram | `method`, `route` | `api/server.py` HTTP middleware (W13) |
| `feral_llm_429_total` | counter | `provider` | `agents/llm_provider.py` (W19, deferred) |
| `feral_llm_failover_chain_exhausted_total` | counter | — | `agents/llm_provider.py` (W19, deferred) |
| `feral_sync_active_peers` | gauge | — | `memory/sync.py` (W11, deferred) |
| `feral_sync_failures_total` | counter | `reason` | `memory/sync.py` (W11, deferred) |
| `feral_sync_was_active_recent` | gauge | — | `memory/sync.py` (W11, deferred) |
| `feral_supervisor_approval_queue` | gauge | — | `agents/supervisor.py` (W17, deferred) |
| `feral_tool_denials_total` | counter | `tool` | `security/sandbox_policy.py` (W4, deferred) |
| `feral_sandbox_kills_total` | counter | `reason` | sandbox runner (W4, deferred) |
| `feral_vault_decrypt_errors_total` | counter | — | `security/vault.py` (W9, deferred) |
| `feral_ws_active_sessions` | gauge | — | `api/server.py` WS endpoints (W13.1, deferred) |
W13 ships the registry surface and the proof-of-concept emit() call inside the existing HTTP middleware. The remaining call sites are tracked as W13.1 follow-ups so each owning workstream lands its own changes inside its own owned-paths set.

## Retention

/metrics exposes the current in-process state — the Brain holds no historical samples. Persistence is the scrape pipeline’s job:
  • Recommended retention: 30 days at 15-second scrape interval (the default Prometheus retention is 15 days; double it for the canary-grade signal the alert rules assume).
  • Cardinality budget: every feral_http_requests_total series is bounded by method × route_template × status_class. Routes are recorded using FastAPI’s path template (e.g. /api/jobs/{id}), not the raw URL, so cardinality stays in the low hundreds even under heavy traffic.
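Under those assumptions, a scrape job might look like the following illustrative fragment. The job name and target address are placeholders for your deployment, and retention is set by a server flag rather than the config file:

```yaml
# prometheus.yml — illustrative; adjust job name and target for your install.
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: feral-brain            # hypothetical job name
    metrics_path: /metrics
    static_configs:
      - targets: ["127.0.0.1:9090"]  # assumed Brain address

# Retention is a command-line flag, not a config key:
#   prometheus --storage.tsdb.retention.time=30d
```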

## Importing the dashboard

The bundled dashboard lives at ops/grafana/feral-overview.json (Grafana schema v39, validated against Grafana 11+).
```bash
# 1. From the Grafana UI: Dashboards → New → Import → Upload JSON file
#    Pick ops/grafana/feral-overview.json from your checkout.

# 2. Or via the HTTP API (requires an admin API key):
curl -X POST "$GRAFANA_URL/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq '{dashboard: ., overwrite: true, folderId: 0}' \
        ops/grafana/feral-overview.json)"
```
The dashboard expects a Prometheus datasource — pick yours from the Datasource template variable at the top of the dashboard once it loads.

## Alert rules

ops/prometheus/alerts.yml is loaded directly by Prometheus via the rule_files directive. Drop it next to your existing rule files and reload Prometheus (SIGHUP or POST /-/reload).
| Alert | Severity | Triggers when |
| --- | --- | --- |
| `HighErrorRate` | critical | 5xx ratio > 5% for 10 minutes |
| `LLMAllProvidersDown` | critical | The failover chain ran out of providers in the last 5 minutes |
| `SyncPeerDown` | warning | Zero active sync peers for 15 minutes despite recent activity |
| `SupervisorBacklog` | warning | Supervisor approval queue stuck above 50 for 30 minutes |
| `VaultDecryptFailed` | critical | Any vault decrypt failure in the last 5 minutes |
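For orientation, here is the rough shape of one such rule. This is an illustrative sketch only; the shipped `ops/prometheus/alerts.yml` is authoritative, and its expressions may differ (for instance, in how the `status` label encodes the 5xx class):

```yaml
# Illustrative only — see ops/prometheus/alerts.yml for the real rules.
groups:
  - name: feral
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(feral_http_requests_total{status=~"5.*"}[5m]))
            / sum(rate(feral_http_requests_total[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
```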

## Runbooks

### HighErrorRate

  1. Check journalctl -u feral (or your equivalent log aggregator) for tracebacks.
  2. curl -fsSL http://localhost:9090/health to confirm the Brain is up.
  3. If the spike correlates with a deploy, roll back via feral upgrade --to <previous-version>.

### LLMAllProvidersDown

  1. feral providers status to see which providers errored.
  2. Validate API keys via feral providers test <id>.
  3. If a provider is genuinely down, the chain will recover automatically once it comes back; the alert auto-resolves.

### SyncPeerDown

  1. feral sync peers to list discovered peers.
  2. Check mDNS / passphrase / firewall on both ends.
  3. If the deployment is intentionally single-node, set FERAL_SYNC_DISABLED=1; with sync disabled, the feral_sync_was_active_recent gauge is no longer set to 1, so the alert stays quiet.

### SupervisorBacklog

  1. Drain the queue from the Supervisor UI or feral supervisor approve --all-stale.
  2. If approvals are blocked because the deciding human is offline, increase the alert threshold or page the on-call.

### VaultDecryptFailed

  1. Treat as a security-relevant signal: AEAD failures should never happen in steady state.
  2. Confirm the OS keychain still holds the master key (feral vault status).
  3. Inspect feral-core/security/vault.py logs for the failing entry.

## Adding a new metric

  1. Define the metric in feral-core/observability/metrics.py against REGISTRY, register it in the _METRICS map, and write a one-line docstring naming the eventual emitter and owning workstream.
  2. Add at least one panel in ops/grafana/feral-overview.json or an alert rule in ops/prometheus/alerts.yml that consumes it. The test_metrics_registry.py::test_no_orphan_metrics test fails the build otherwise.
  3. Add an emit() call site inside your owned module. Keep it cheap — the helper no-ops when the kill switch is off, but the label dict still allocates.
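Putting the steps together, a hypothetical walk-through — the metric, helper, and module names below are illustrative, not real feral-core identifiers:

```python
# Step 1 — register the metric with a one-line docstring naming the
# eventual emitter and owning workstream.
_METRICS: dict[str, tuple[str, tuple[str, ...], str]] = {
    "feral_webhook_deliveries_total": (          # hypothetical metric
        "counter",
        ("status",),
        "Outbound webhook deliveries. Emitter: api/webhooks.py (W13.1).",
    ),
}

def emit(name: str, **labels: str) -> None:
    """No-op stand-in for the real helper; checks the metric is registered
    and the label set matches its registration."""
    _metric_type, expected_labels, _doc = _METRICS[name]
    assert set(labels) == set(expected_labels), "label set must match registration"

# Step 3 — the emit() call inside the owning module. The label dict is
# built even when the kill switch later no-ops the call, so keep it small.
def on_webhook_delivered(ok: bool) -> None:
    emit("feral_webhook_deliveries_total", status="ok" if ok else "error")
```

Step 2 (a consuming panel or alert rule) happens in `ops/`, not in Python; without it, `test_no_orphan_metrics` rejects the registration above.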

## Mintlify nav

This page lives at docs/mintlify/operations/metrics.mdx. As of the W13 PR, the operations/ sub-tree has no entry in docs/mintlify/docs.json; the nav owner adds metrics + soak in a single sweep once enough operations pages exist. Tracked in docs/AGENT_PROMPTS_FOLLOWUPS.md.