Metrics, dashboards & alerts

The FERAL Brain exposes a Prometheus-compatible /metrics endpoint backed by a named registry in feral-core/observability/metrics.py. A default Grafana dashboard and Prometheus alert rules ship in-tree under ops/.

Endpoint

GET /metrics returns text in the standard Prometheus 0.0.4 exposition format. Two environment switches gate it:

Variable	Default	Effect
`FERAL_METRICS_ENDPOINT`	`1` (on)	Kill switch. Set to `0`/`false`/`off` to silence both the endpoint and every `emit()` call.
`FERAL_METRICS_PUBLIC`	`0` (off)	Callers from non-loopback addresses receive `404` unless this is set to `1`. Loopback callers (`127.0.0.1`, `::1`, `localhost`) are always allowed while the endpoint is enabled, that is, while `FERAL_METRICS_ENDPOINT` has not been switched off.

The off-loopback default returns 404 (not 401/403) so the response is indistinguishable from “endpoint not mounted”. This preserves the pre-W13 public-internet behaviour for unconfigured installs.
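
The gating rules above can be sketched as a small decision function. This is an illustration, not the Brain's actual handler: the function name metrics_status is hypothetical, and the 404 returned when the kill switch is off is an assumption (the text only says the endpoint is silenced).

```python
FALSY = {"0", "false", "off"}
LOOPBACK_HOSTS = {"127.0.0.1", "::1", "localhost"}


def metrics_status(env: dict, client_host: str) -> int:
    """Return the HTTP status a caller would see for GET /metrics."""
    enabled = env.get("FERAL_METRICS_ENDPOINT", "1").strip().lower() not in FALSY
    if not enabled:
        # Assumption: a silenced endpoint looks the same as an unmounted one.
        return 404
    if client_host in LOOPBACK_HOSTS:
        # Loopback callers are always allowed while the endpoint is enabled.
        return 200
    public = env.get("FERAL_METRICS_PUBLIC", "0").strip() == "1"
    # Off-loopback callers get 404 (not 401/403) unless explicitly opted in.
    return 200 if public else 404
```

With an empty environment, a loopback caller gets 200 and any other caller gets 404; setting FERAL_METRICS_PUBLIC=1 is the only way to serve off-loopback callers.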

Registered metrics

Every metric below is defined in feral-core/observability/metrics.py and referenced by either the bundled dashboard, the alert rules, or both. The tests/test_metrics_registry.py suite enforces that contract.

Metric	Type	Labels	Emitter (owning workstream)
`feral_http_requests_total`	counter	`method`, `route`, `status`	`api/server.py` HTTP middleware (W13)
`feral_http_request_duration_seconds`	histogram	`method`, `route`	`api/server.py` HTTP middleware (W13)
`feral_llm_429_total`	counter	`provider`	`agents/llm_provider.py` (W19, deferred)
`feral_llm_failover_chain_exhausted_total`	counter	—	`agents/llm_provider.py` (W19, deferred)
`feral_sync_active_peers`	gauge	—	`memory/sync.py` (W11, deferred)
`feral_sync_failures_total`	counter	`reason`	`memory/sync.py` (W11, deferred)
`feral_sync_was_active_recent`	gauge	—	`memory/sync.py` (W11, deferred)
`feral_supervisor_approval_queue`	gauge	—	`agents/supervisor.py` (W17, deferred)
`feral_tool_denials_total`	counter	`tool`	`security/sandbox_policy.py` (W4, deferred)
`feral_sandbox_kills_total`	counter	`reason`	sandbox runner (W4, deferred)
`feral_vault_decrypt_errors_total`	counter	—	`security/vault.py` (W9, deferred)
`feral_ws_active_sessions`	gauge	—	`api/server.py` WS endpoints (W13.1, deferred)

W13 ships the registry surface and the proof-of-concept emit() call inside the existing HTTP middleware. The remaining call sites are tracked as W13.1 follow-ups so each owning workstream lands its own changes inside its own owned-paths set.

Retention

/metrics exposes the current in-process state — the Brain holds no historical samples. Persistence is the scrape pipeline’s job:

Recommended retention: 30 days at a 15-second scrape interval. The Prometheus default retention is 15 days; doubling it gives the canary-grade signal the alert rules assume.
Cardinality budget: every feral_http_requests_total series is bounded by method × route_template × status_class. Routes are recorded using FastAPI’s path template (e.g. /api/jobs/{id}), not the raw URL, so cardinality stays in the low hundreds even under heavy traffic.
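
Both figures are easy to check with a few lines of arithmetic. The sketch below counts samples per series for the recommended settings and shows why path templates keep the series count flat; the helper names and the sample request tuples are made up for the example.

```python
SCRAPE_INTERVAL_S = 15
RETENTION_DAYS = 30


def samples_per_series(retention_days: int = RETENTION_DAYS,
                       interval_s: int = SCRAPE_INTERVAL_S) -> int:
    """Number of samples one series accumulates over the retention window."""
    return retention_days * 24 * 60 * 60 // interval_s


def series_count(label_sets) -> int:
    """Each distinct (method, route, status) tuple is one time series."""
    return len(set(label_sets))


# 1000 requests for different job IDs, labelled two different ways.
raw_urls = [("GET", f"/api/jobs/{job_id}", "200") for job_id in range(1000)]
templated = [("GET", "/api/jobs/{id}", "200") for _ in range(1000)]

print(samples_per_series())     # 172800 samples per series over 30 days
print(series_count(raw_urls))   # 1000 series when labelling by raw URL
print(series_count(templated))  # 1 series when labelling by path template
```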

Importing the dashboard

The bundled dashboard lives at ops/grafana/feral-overview.json (Grafana schema v39, validated against Grafana 11+).

# 1. From the Grafana UI: Dashboards → New → Import → Upload JSON file
#    Pick ops/grafana/feral-overview.json from your checkout.

# 2. Or via the HTTP API (requires an admin API key):
curl -X POST "$GRAFANA_URL/api/dashboards/db" \
  -H "Authorization: Bearer $GRAFANA_API_KEY" \
  -H "Content-Type: application/json" \
  -d "$(jq '{dashboard: ., overwrite: true, folderId: 0}' \
        ops/grafana/feral-overview.json)"

The dashboard expects a Prometheus datasource — pick yours from the Datasource template variable at the top of the dashboard once it loads.
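
For automation that already runs Python, the same import call can be built with the standard library. This is a sketch mirroring the curl command above, not a supported client: build_import_request is a hypothetical helper, and it only constructs the request so it can be inspected before anything is sent.

```python
import json
import urllib.request


def build_import_request(grafana_url: str, api_key: str,
                         dashboard: dict) -> urllib.request.Request:
    """Build the POST /api/dashboards/db request used to import a dashboard."""
    # Same wrapper object the jq filter in the curl example produces.
    payload = {"dashboard": dashboard, "overwrite": True, "folderId": 0}
    return urllib.request.Request(
        url=grafana_url.rstrip("/") + "/api/dashboards/db",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Usage (performs a real HTTP call, so it is left commented out):
# with open("ops/grafana/feral-overview.json", encoding="utf-8") as fh:
#     request = build_import_request(GRAFANA_URL, GRAFANA_API_KEY, json.load(fh))
# with urllib.request.urlopen(request) as response:
#     print(response.status)
```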

Alert rules

ops/prometheus/alerts.yml is loaded directly by Prometheus via the rule_files directive. Drop it next to your existing rule files and reload Prometheus (send SIGHUP, or POST /-/reload if Prometheus runs with --web.enable-lifecycle).

Alert	Severity	Triggers when
`HighErrorRate`	critical	`5xx` ratio > 5% for 10 minutes
`LLMAllProvidersDown`	critical	The failover chain ran out of providers in the last 5 minutes
`SyncPeerDown`	warning	Zero active sync peers for 15 minutes despite recent activity
`SupervisorBacklog`	warning	Supervisor approval queue stuck above 50 for 30 minutes
`VaultDecryptFailed`	critical	Any vault decrypt failure in the last 5 minutes
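
As a worked example of how a ratio-plus-duration trigger such as HighErrorRate behaves, the sketch below models it in plain Python. It does not reproduce the expression in ops/prometheus/alerts.yml; the function names, the 60-second evaluation interval, and the exact handling of the hold period are assumptions chosen to approximate how a Prometheus rule with a for clause fires.

```python
def error_ratio(counts_by_status: dict) -> float:
    """Share of requests whose status code starts with 5."""
    total = sum(counts_by_status.values())
    if total == 0:
        return 0.0
    errors = sum(
        count for status, count in counts_by_status.items()
        if status.startswith("5")
    )
    return errors / total


def should_fire(ratios, threshold: float = 0.05,
                for_seconds: int = 600, eval_interval_s: int = 60) -> bool:
    """ratios: one value per rule evaluation, oldest first.

    The alert fires only once the condition has held continuously for
    for_seconds, i.e. at every evaluation from the first breach up to and
    including the one for_seconds later.
    """
    needed = for_seconds // eval_interval_s + 1
    recent = list(ratios)[-needed:]
    return len(recent) == needed and all(r > threshold for r in recent)
```

A single evaluation below the threshold resets the hold period, which is why a short burst of 5xx responses does not page anyone.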

Runbooks

`HighErrorRate`

Check journalctl -u feral (or your equivalent log aggregator) for tracebacks.
Run curl -fsSL http://localhost:9090/health to confirm the Brain is up.
If the spike correlates with a deploy, roll back via feral upgrade --to <previous-version>.

`LLMAllProvidersDown`

Run feral providers status to see which providers errored.
Validate API keys via feral providers test <id>.
If a provider is genuinely down, the chain will recover automatically once it comes back; the alert auto-resolves.

`SyncPeerDown`

Run feral sync peers to list discovered peers.
Check mDNS / passphrase / firewall on both ends.
If the deployment is intentionally single-node, set FERAL_SYNC_DISABLED=1 and make sure the feral_sync_was_active_recent gauge is never set to 1; the alert then stays quiet.

`SupervisorBacklog`

Drain the queue from the Supervisor UI or with feral supervisor approve --all-stale.
If approvals are blocked because the deciding human is offline, increase the alert threshold or page the on-call.

`VaultDecryptFailed`

Treat as a security-relevant signal: AEAD failures should never happen in steady state.
Confirm the OS keychain still holds the master key (feral vault status).
Inspect the logs emitted by feral-core/security/vault.py to find the failing entry.

Adding a new metric

Define the metric in feral-core/observability/metrics.py against REGISTRY, register it in the _METRICS map, and write a one-line docstring naming the eventual emitter and owning workstream.
Add at least one panel in ops/grafana/feral-overview.json or an alert rule in ops/prometheus/alerts.yml that consumes it. The test_metrics_registry.py::test_no_orphan_metrics test fails the build otherwise.
Add an emit() call site inside your owned module. Keep it cheap — the helper no-ops when the kill switch is off, but the label dict still allocates.
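
The orphan check in step 2 boils down to looking for every registered name in the dashboard and alert files. A minimal sketch of that idea follows; it is not the code in tests/test_metrics_registry.py, and the function name and the plain-text matching strategy are assumptions.

```python
import re


def orphan_metrics(metric_names, dashboard_text: str, alerts_text: str) -> list:
    """Return registered metrics that neither file mentions."""
    haystack = dashboard_text + "\n" + alerts_text
    return sorted(
        name for name in metric_names
        # Prefix match, so a histogram referenced via its _bucket, _sum or
        # _count series still counts as consumed.
        if not re.search(rf"\b{re.escape(name)}", haystack)
    )
```

A test built on this would read both files from ops/, pass in the names from the registry, and assert that the returned list is empty.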

Mintlify nav

This page lives at docs/mintlify/operations/metrics.mdx. As of the W13 PR, the operations/ sub-tree has no entry in docs/mintlify/docs.json; the nav owner adds metrics + soak in a single sweep once enough operations pages exist. Tracked in docs/AGENT_PROMPTS_FOLLOWUPS.md.
