Metrics, dashboards & alerts
The FERAL Brain exposes a Prometheus-compatible /metrics endpoint backed by a
named registry in feral-core/observability/metrics.py. A default Grafana
dashboard and Prometheus alert rules ship in-tree under ops/.
Endpoint
GET /metrics returns text in the standard Prometheus 0.0.4 exposition
format. Two environment switches gate it:
| Variable | Default | Effect |
|---|---|---|
| FERAL_METRICS_ENDPOINT | 1 (on) | Kill switch. Set to 0/false/off to silence both the endpoint and every emit() call. |
| FERAL_METRICS_PUBLIC | 0 (off) | Off-loopback callers receive 404 unless this is set to 1. Loopback (127.0.0.1, ::1, localhost) is always allowed when the kill switch is on. |
Off-loopback callers get 404 (not 401/403) so that the response is
indistinguishable from “endpoint not mounted”, which preserves the pre-W13
public-internet behaviour for unconfigured installs.
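The gating rules above can be sketched as a small decision function. This is an illustration only, not the Brain's actual handler: the name metrics_status is hypothetical, and returning 404 when the kill switch is off is an assumption based on the "endpoint not mounted" behaviour described above.

```python
import os

# Values that turn a switch off, per the table above.
_OFF_VALUES = {"0", "false", "off"}
_LOOPBACK_HOSTS = {"127.0.0.1", "::1", "localhost"}


def metrics_status(client_host, env=None):
    """Return the HTTP status a GET /metrics caller would see (hypothetical helper)."""
    env = os.environ if env is None else env

    # Kill switch: on by default; 0/false/off silences the endpoint.
    # Assumption: a silenced endpoint answers 404, like an unmounted route.
    endpoint = env.get("FERAL_METRICS_ENDPOINT", "1").strip().lower()
    if endpoint in _OFF_VALUES:
        return 404

    # Loopback callers are always allowed while the kill switch is on.
    if client_host in _LOOPBACK_HOSTS:
        return 200

    # Everyone else needs the explicit public opt-in.
    if env.get("FERAL_METRICS_PUBLIC", "0").strip() == "1":
        return 200
    return 404
```

Passing env explicitly keeps the function easy to exercise in tests without touching the process environment.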
Registered metrics
Every metric below is defined in feral-core/observability/metrics.py and
referenced by either the bundled dashboard, the alert rules, or both. The
tests/test_metrics_registry.py suite enforces that contract.
| Metric | Type | Labels | Emitter (owning workstream) |
|---|---|---|---|
| feral_http_requests_total | counter | method, route, status | api/server.py HTTP middleware (W13) |
| feral_http_request_duration_seconds | histogram | method, route | api/server.py HTTP middleware (W13) |
| feral_llm_429_total | counter | provider | agents/llm_provider.py (W19, deferred) |
| feral_llm_failover_chain_exhausted_total | counter | none | agents/llm_provider.py (W19, deferred) |
| feral_sync_active_peers | gauge | none | memory/sync.py (W11, deferred) |
| feral_sync_failures_total | counter | reason | memory/sync.py (W11, deferred) |
| feral_sync_was_active_recent | gauge | none | memory/sync.py (W11, deferred) |
| feral_supervisor_approval_queue | gauge | none | agents/supervisor.py (W17, deferred) |
| feral_tool_denials_total | counter | tool | security/sandbox_policy.py (W4, deferred) |
| feral_sandbox_kills_total | counter | reason | sandbox runner (W4, deferred) |
| feral_vault_decrypt_errors_total | counter | none | security/vault.py (W9, deferred) |
| feral_ws_active_sessions | gauge | none | api/server.py WS endpoints (W13.1, deferred) |
So far only the HTTP metrics are live: W13 adds the emit() call inside
the existing HTTP middleware. The remaining call sites are tracked as
W13.1 follow-ups so that each owning workstream lands its changes inside
its own owned-paths set.
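The contract that every registered metric is consumed by the dashboard, the alert rules, or both could be checked roughly as follows. This is a sketch of the idea, not the code in tests/test_metrics_registry.py: find_orphan_metrics is a hypothetical name, and matching metric names as plain text (including the usual histogram suffixes) is an assumption about how "referenced" is decided.

```python
import re


def find_orphan_metrics(metric_names, dashboard_text, alerts_text):
    """Return the registered metric names that neither file mentions."""
    consumed = dashboard_text + "\n" + alerts_text
    orphans = []
    for name in metric_names:
        # Match the name as a whole identifier. A histogram used through its
        # _bucket/_sum/_count series still counts as consumed.
        pattern = (
            r"(?<![A-Za-z0-9_])"
            + re.escape(name)
            + r"(?:_bucket|_sum|_count)?(?![A-Za-z0-9_])"
        )
        if not re.search(pattern, consumed):
            orphans.append(name)
    return orphans
```

A test built on this would read both ops/ files as text and assert that the returned list is empty.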
Retention
/metrics exposes the current in-process state — the Brain holds no
historical samples. Persistence is the scrape pipeline’s job:
- Recommended retention: 30 days at 15-second scrape interval (the default Prometheus retention is 15 days; double it for the canary-grade signal the alert rules assume).
- Cardinality budget: every feral_http_requests_total series is bounded by method × route_template × status_class. Routes are recorded using FastAPI’s path template (e.g. /api/jobs/{id}), not the raw URL, so cardinality stays in the low hundreds even under heavy traffic.
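As a worked example of the two figures above, the arithmetic can be written out as below. The helper names are hypothetical, the label-value counts in the usage note are made-up illustrative inputs rather than measurements, and status_class shows only one plausible way to collapse a status code into its class.

```python
def samples_per_series(retention_days, scrape_interval_seconds):
    """Number of samples Prometheus keeps per series over the retention window."""
    return retention_days * 24 * 60 * 60 // scrape_interval_seconds


def status_class(status_code):
    """Collapse an HTTP status code into its class, so 404 becomes "4xx"."""
    return f"{status_code // 100}xx"


def max_series(methods, route_templates, status_classes):
    """Upper bound on series count: the product of distinct label values."""
    return methods * route_templates * status_classes
```

With the recommended settings, samples_per_series(30, 15) is 172800 samples per series. Assuming, say, 4 methods, 20 route templates, and 5 status classes, max_series(4, 20, 5) gives an upper bound of 400 series.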
Importing the dashboard
The bundled dashboard lives at ops/grafana/feral-overview.json (Grafana
schema v39, validated against Grafana 11+).
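One way to import the file without the Grafana UI is Grafana's HTTP API, which accepts dashboards at POST /api/dashboards/db. The sketch below only builds the request. It assumes the bundled file is a plain dashboard model rather than an export with input placeholders, and build_import_request, grafana_url, and api_token are hypothetical names for illustration.

```python
import json
import urllib.request


def build_import_request(dashboard_text, grafana_url, api_token):
    """Build a Grafana dashboard-import request from the dashboard JSON text."""
    dashboard = json.loads(dashboard_text)
    # Clear the numeric id so Grafana assigns its own on the target instance.
    dashboard["id"] = None
    body = json.dumps({"dashboard": dashboard, "overwrite": True}).encode("utf-8")
    return urllib.request.Request(
        grafana_url.rstrip("/") + "/api/dashboards/db",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {api_token}",
            "Content-Type": "application/json",
        },
    )
```

Reading ops/grafana/feral-overview.json into dashboard_text and passing the result to urllib.request.urlopen would send the import. Setting overwrite to True replaces an existing dashboard with the same uid, so drop it if local edits should be preserved.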
Alert rules
ops/prometheus/alerts.yml is loaded directly by Prometheus via the
rule_files directive. Drop it next to your existing rule files and reload
Prometheus (SIGHUP or POST /-/reload).
| Alert | Severity | Triggers when |
|---|---|---|
| HighErrorRate | critical | 5xx ratio > 5% for 10 minutes |
| LLMAllProvidersDown | critical | The failover chain ran out of providers in the last 5 minutes |
| SyncPeerDown | warning | Zero active sync peers for 15 minutes despite recent activity |
| SupervisorBacklog | warning | Supervisor approval queue stuck above 50 for 30 minutes |
| VaultDecryptFailed | critical | Any vault decrypt failure in the last 5 minutes |
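To make the HighErrorRate condition concrete, the sketch below restates it in plain Python. It is not the rule from ops/prometheus/alerts.yml, whose exact expression is not shown here: the function names are hypothetical, and the inputs stand in for per-status request increases and for the ratio seen at each rule evaluation.

```python
def five_xx_ratio(request_increase_by_status):
    """Share of requests whose status label starts with "5"."""
    total = sum(request_increase_by_status.values())
    if total == 0:
        # No traffic means no error ratio to alert on.
        return 0.0
    errors = sum(
        count
        for status, count in request_increase_by_status.items()
        if str(status).startswith("5")
    )
    return errors / total


def high_error_rate_fires(ratios_in_window, threshold=0.05):
    """True only if every evaluation in the hold window exceeded the threshold."""
    return bool(ratios_in_window) and all(r > threshold for r in ratios_in_window)
```

The second function mirrors how a Prometheus hold duration behaves: a single evaluation back under the threshold resets the pending alert, so brief spikes do not page anyone.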
Runbooks
HighErrorRate
- Check journalctl -u feral (or your equivalent log aggregator) for tracebacks.
- Run curl -fsSL http://localhost:9090/health to confirm the Brain is up.
- If the spike correlates with a deploy, roll back via feral upgrade --to <previous-version>.
LLMAllProvidersDown
- Run feral providers status to see which providers errored.
- Validate API keys via feral providers test <id>.
- If a provider is genuinely down, the chain will recover automatically once it comes back; the alert auto-resolves.
SyncPeerDown
- Run feral sync peers to list discovered peers.
- Check mDNS / passphrase / firewall on both ends.
- If the deployment is intentionally single-node, set FERAL_SYNC_DISABLED=1 and make sure the feral_sync_was_active_recent gauge is no longer being set to 1; the alert will then stay quiet.
SupervisorBacklog
- Drain the queue from the Supervisor UI or with feral supervisor approve --all-stale.
- If approvals are blocked because the deciding human is offline, increase the alert threshold or page the on-call.
VaultDecryptFailed
- Treat as a security-relevant signal: AEAD failures should never happen in steady state.
- Confirm the OS keychain still holds the master key (feral vault status).
- Inspect feral-core/security/vault.py logs for the failing entry.
Adding a new metric
- Define the metric in feral-core/observability/metrics.py against REGISTRY, register it in the _METRICS map, and write a one-line docstring naming the eventual emitter and owning workstream.
- Add at least one panel in ops/grafana/feral-overview.json or an alert rule in ops/prometheus/alerts.yml that consumes it. The test_metrics_registry.py::test_no_orphan_metrics test fails the build otherwise.
- Add an emit() call site inside your owned module. Keep it cheap: the helper no-ops when the kill switch is off, but the label dict still allocates.
Mintlify nav
This page lives at docs/mintlify/operations/metrics.mdx. As of the W13 PR,
the operations/ sub-tree has no entry in docs/mintlify/docs.json; the
nav owner adds metrics + soak in a single sweep once enough operations
pages exist. Tracked in docs/AGENT_PROMPTS_FOLLOWUPS.md.