Agent Reliability KPI Dictionary + Dashboard JSON Schema (2026): Production Template for AI Ops Teams
If your team argues every week about how to calculate the same metric, your dashboard is not mature yet.
You don’t need more charts first.
You need one shared KPI dictionary and one schema contract.
This guide gives you both in a practical format you can implement quickly.
It is the next article in our runtime series after:
- Agent Runtime Checklist
- Agent Runtime Audit Template
- Agent Reliability Dashboard Setup Guide
Now we’re building the layer that keeps all those systems consistent: metric definitions and JSON schema standards.
Why This Matters in Real Teams
Without shared KPI definitions, different teams report different “truths.”
Product may report success rate one way.
Engineering may report another.
Leadership then gets conflicting data and slower decisions.
A KPI dictionary fixes that by locking definitions, formulas, and thresholds.
A dashboard JSON schema goes one step further: it standardizes how those metrics move through your pipeline.
What You’ll Get
- A practical KPI dictionary for agent reliability
- Field-level JSON schema for dashboard ingestion
- Severity and threshold model for alerts
- Weekly summary payload schema for leadership
- Implementation blueprint and validation checklist
KPI Dictionary: Core Metric Set
| KPI Name | Definition | Formula | Target |
|---|---|---|---|
| task_success_quality_pct | Accepted outputs as a share of total outputs | (accepted / total_outputs) x 100 | >= 92% |
| incident_rate_per_1000 | Incidents per 1,000 runs | (incidents / total_runs) x 1000 | <= 3 |
| recovery_success_pct | Recovered runs among recoverable failures | (recovered / recoverable_failures) x 100 | >= 95% |
| policy_escape_rate_pct | Policy-violating outputs not blocked | (escapes / violations) x 100 | 0% |
| p95_latency_ms | 95th percentile total workflow latency | percentile(latency_ms, 95) | <= 8000 |
| cost_per_completed_task_usd | Average cost per completed workflow | total_cost / completed_tasks | Within budget band |
| manual_rework_pct | Outputs requiring human correction | (rework / total_outputs) x 100 | <= 12% |
| approval_bypass_count | High-risk actions executed without approval | count(events) | 0 |
| unauthorized_action_count | Actions outside permission scope | count(events) | 0 |
| alert_to_ack_median_min | Median minutes from alert to acknowledgment | median(ack_time - alert_time) | <= 10 |
In my experience, ten well-defined KPIs outperform fifty loosely defined ones.
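To make the formulas in the table concrete, here is a minimal Python sketch of four of them. The function names, the nearest-rank percentile method, and the choice to return None for an empty window are assumptions for illustration; the dictionary itself does not prescribe them.

```python
from datetime import datetime
from statistics import median
from typing import Optional, Sequence, Tuple


def ratio_pct(numerator: float, denominator: float) -> Optional[float]:
    """Shared helper for the *_pct KPIs: (numerator / denominator) x 100."""
    if denominator == 0:
        return None  # assumption: an empty window means "no data", not 0%
    return numerator / denominator * 100


def incident_rate_per_1000(incidents: int, total_runs: int) -> Optional[float]:
    """(incidents / total_runs) x 1000."""
    if total_runs == 0:
        return None
    return incidents / total_runs * 1000


def p95_latency_ms(latencies_ms: Sequence[float]) -> float:
    """95th percentile using the nearest-rank method (an assumed choice)."""
    if not latencies_ms:
        raise ValueError("no latency samples in window")
    ordered = sorted(latencies_ms)
    rank = (95 * len(ordered) + 99) // 100  # integer form of ceil(0.95 * n)
    return ordered[rank - 1]


def alert_to_ack_median_min(pairs: Sequence[Tuple[datetime, datetime]]) -> float:
    """median(ack_time - alert_time) in minutes; each pair is (alert, ack)."""
    return median((ack - alert).total_seconds() / 60 for alert, ack in pairs)
```

Whatever percentile method and empty-window rule you pick, record them in the dictionary entry. Two teams using different percentile methods will report different p95 values from the same data.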
Metadata Rules for Every KPI
Each KPI should store these metadata fields:
- owner_team (product/engineering/security/ops)
- owner_person (single accountable owner)
- calc_window (hourly/daily/weekly)
- source_tables (data lineage)
- quality_checks (null threshold, outlier handling)
- status_thresholds (green/yellow/red values)
This is where many teams improve fastest, because ownership and lineage become explicit.
Dashboard JSON Schema (Core Payload)
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "title": "AgentReliabilitySnapshot",
  "type": "object",
  "required": [
    "snapshot_id",
    "snapshot_time",
    "workflow_id",
    "workflow_name",
    "environment",
    "status",
    "score_overall",
    "kpis",
    "incidents",
    "cost"
  ],
  "properties": {
    "snapshot_id": { "type": "string" },
    "snapshot_time": { "type": "string", "format": "date-time" },
    "workflow_id": { "type": "string" },
    "workflow_name": { "type": "string" },
    "environment": { "type": "string", "enum": ["prod", "staging", "dev"] },
    "status": { "type": "string", "enum": ["green", "yellow", "red"] },
    "score_overall": { "type": "number", "minimum": 0, "maximum": 100 },
    "kpis": {
      "type": "object",
      "required": [
        "task_success_quality_pct",
        "incident_rate_per_1000",
        "recovery_success_pct",
        "policy_escape_rate_pct",
        "p95_latency_ms",
        "cost_per_completed_task_usd"
      ],
      "properties": {
        "task_success_quality_pct": { "type": "number", "minimum": 0, "maximum": 100 },
        "incident_rate_per_1000": { "type": "number", "minimum": 0 },
        "recovery_success_pct": { "type": "number", "minimum": 0, "maximum": 100 },
        "policy_escape_rate_pct": { "type": "number", "minimum": 0, "maximum": 100 },
        "p95_latency_ms": { "type": "integer", "minimum": 0 },
        "cost_per_completed_task_usd": { "type": "number", "minimum": 0 },
        "manual_rework_pct": { "type": "number", "minimum": 0, "maximum": 100 },
        "approval_bypass_count": { "type": "integer", "minimum": 0 },
        "unauthorized_action_count": { "type": "integer", "minimum": 0 },
        "alert_to_ack_median_min": { "type": "number", "minimum": 0 }
      },
      "additionalProperties": false
    },
    "incidents": {
      "type": "object",
      "required": ["count", "sev1", "sev2", "sev3", "mttr_min"],
      "properties": {
        "count": { "type": "integer", "minimum": 0 },
        "sev1": { "type": "integer", "minimum": 0 },
        "sev2": { "type": "integer", "minimum": 0 },
        "sev3": { "type": "integer", "minimum": 0 },
        "mttr_min": { "type": "number", "minimum": 0 }
      }
    },
    "cost": {
      "type": "object",
      "required": ["total_usd", "budget_variance_pct"],
      "properties": {
        "total_usd": { "type": "number", "minimum": 0 },
        "budget_variance_pct": { "type": "number" }
      }
    },
    "top_risks": {
      "type": "array",
      "items": { "type": "string" },
      "maxItems": 5
    },
    "actions": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["owner", "task", "eta"],
        "properties": {
          "owner": { "type": "string" },
          "task": { "type": "string" },
          "eta": { "type": "string", "format": "date" }
        }
      }
    }
  },
  "additionalProperties": false
}
Why This Schema Works
- Strict required fields reduce missing-data surprises.
- Status and score enable fast top-level decisions.
- KPI object keeps operational metrics grouped; because additionalProperties is false, adding a KPI is an explicit, versioned schema change.
- Incidents and costs are explicit, not buried in notes.
- Action items make reliability review execution-oriented.
Sample Weekly Snapshot JSON
{
  "snapshot_id": "snap_2026_05_25_sales_agent_prod",
  "snapshot_time": "2026-05-25T09:30:00Z",
  "workflow_id": "wf_sales_outreach_v3",
  "workflow_name": "Sales Outreach Agent",
  "environment": "prod",
  "status": "yellow",
  "score_overall": 78.4,
  "kpis": {
    "task_success_quality_pct": 90.8,
    "incident_rate_per_1000": 4.2,
    "recovery_success_pct": 92.1,
    "policy_escape_rate_pct": 0.2,
    "p95_latency_ms": 9100,
    "cost_per_completed_task_usd": 0.47,
    "manual_rework_pct": 13.5,
    "approval_bypass_count": 0,
    "unauthorized_action_count": 0,
    "alert_to_ack_median_min": 8.0
  },
  "incidents": {
    "count": 6,
    "sev1": 0,
    "sev2": 2,
    "sev3": 4,
    "mttr_min": 34
  },
  "cost": {
    "total_usd": 1530,
    "budget_variance_pct": 11.4
  },
  "top_risks": [
    "Latency spikes during campaign bursts",
    "Manual rework above target"
  ],
  "actions": [
    {"owner": "Platform Eng", "task": "Enable fallback route for burst traffic", "eta": "2026-05-29"},
    {"owner": "Product Ops", "task": "Reduce regenerate loops in review flow", "eta": "2026-05-30"}
  ]
}
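Before wiring in a full JSON Schema validator, it can help to see what the contract checks in plain code. The sketch below is a hand-rolled, standard-library stand-in that covers only a subset of the schema (required keys, enums, and 0-100 ranges). It is not a replacement for validating against the schema itself, and the name validate_snapshot is an assumption.

```python
from typing import Any, Dict, List

REQUIRED_TOP_LEVEL = (
    "snapshot_id", "snapshot_time", "workflow_id", "workflow_name",
    "environment", "status", "score_overall", "kpis", "incidents", "cost",
)
REQUIRED_KPIS = (
    "task_success_quality_pct", "incident_rate_per_1000", "recovery_success_pct",
    "policy_escape_rate_pct", "p95_latency_ms", "cost_per_completed_task_usd",
)
ALLOWED = {
    "environment": {"prod", "staging", "dev"},
    "status": {"green", "yellow", "red"},
}


def _is_number(value: Any) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, (int, float)) and not isinstance(value, bool)


def validate_snapshot(payload: Dict[str, Any]) -> List[str]:
    """Return human-readable errors; an empty list means the checks passed."""
    errors: List[str] = []
    for key in REQUIRED_TOP_LEVEL:
        if key not in payload:
            errors.append(f"missing required field: {key}")
    for key, allowed in ALLOWED.items():
        value = payload.get(key)
        if key in payload and (not isinstance(value, str) or value not in allowed):
            errors.append(f"{key} must be one of {sorted(allowed)}")
    score = payload.get("score_overall")
    if score is not None and not (_is_number(score) and 0 <= score <= 100):
        errors.append("score_overall must be a number in 0-100")
    kpis = payload.get("kpis")
    if isinstance(kpis, dict):
        for key in REQUIRED_KPIS:
            if key not in kpis:
                errors.append(f"missing required KPI: {key}")
        for key, value in kpis.items():
            if key.endswith("_pct") and not (_is_number(value) and 0 <= value <= 100):
                errors.append(f"kpis.{key} must be a number in 0-100")
    elif "kpis" in payload:
        errors.append("kpis must be an object")
    return errors
```

Returning a list instead of raising on the first problem lets an ingestion job log every issue in one pass, which shortens the fix cycle for producers.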
Alert Severity Schema
| Severity | Trigger | Response SLA | Default Action |
|---|---|---|---|
| Sev-1 | Any unauthorized action, or policy escape rate > 0.5% | Immediate | Freeze high-risk workflow + incident bridge |
| Sev-2 | Incident rate > 2x baseline or p95 latency > threshold for 2h | 15 min | Rollback or fallback route, start root-cause analysis |
| Sev-3 | Cost drift > 20% or quality drop below yellow threshold | 60 min | Create action ticket + weekly tracking |
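The trigger column can be expressed as one small function evaluated from the top severity down. This is a sketch under stated assumptions: the parameter names are hypothetical, the Sev-1 trigger is read as "any unauthorized action, or a policy escape rate above 0.5%", and the quality trigger uses 92% (the point where quality leaves the green range) as the default floor. Adjust both readings to your own dictionary.

```python
from typing import Optional


def classify_severity(
    unauthorized_action_count: int,
    policy_escape_rate_pct: float,
    incident_rate: float,
    baseline_incident_rate: float,
    hours_p95_over_threshold: float,
    cost_drift_pct: float,
    quality_pct: float,
    quality_floor_pct: float = 92.0,  # assumption: Sev-3 fires once quality leaves green
) -> Optional[str]:
    """Return the highest matching severity, or None when nothing triggers."""
    if unauthorized_action_count > 0 or policy_escape_rate_pct > 0.5:
        return "sev1"
    if incident_rate > 2 * baseline_incident_rate or hours_p95_over_threshold >= 2:
        return "sev2"
    if cost_drift_pct > 20 or quality_pct < quality_floor_pct:
        return "sev3"
    return None
```

Checking Sev-1 first means a workflow that trips several triggers at once is always routed to the strictest response.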
Implementation Blueprint (30-60-90)
Days 1-30
- Publish KPI dictionary v1 with owner sign-off.
- Add JSON schema validation in the ingestion pipeline.
- Launch the dashboard for your top 2 production workflows.
Days 31-60
- Integrate incident and cost objects into the leadership summary view.
- Add trend delta logic (week-over-week change fields).
- Audit data quality drift and null-rate exceptions.
Days 61-90
- Scale schema to all high-impact workflows.
- Add automated anomaly detection for KPI deviations.
- Version the schema with a compatibility policy.
Versioning Strategy for Schema Stability
Use semantic versioning:
- Major: breaking field changes
- Minor: backward-compatible new fields
- Patch: documentation or validation corrections
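Every snapshot should include the following fields. The core schema above sets additionalProperties to false, so declare them in its properties first:
- schema_version
- producer_service
- validation_status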
This prevents painful migration surprises as your agent stack evolves.
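Here is one way a consumer could act on schema_version. It is a sketch, not a prescribed policy: the helper names are hypothetical, and the compatibility rule (same major version, and the payload's minor version not newer than the reader's) is an assumption that follows from the strict additionalProperties setting, since an older reader would reject fields it does not know.

```python
from typing import Tuple


def parse_version(version: str) -> Tuple[int, int, int]:
    """Split 'MAJOR.MINOR.PATCH' into integers; raises ValueError if malformed."""
    major, minor, patch = (int(part) for part in version.split("."))
    return major, minor, patch


def change_type(old: str, new: str) -> str:
    """Classify a version bump as 'major', 'minor', 'patch', or 'none'."""
    old_v, new_v = parse_version(old), parse_version(new)
    if new_v[0] != old_v[0]:
        return "major"
    if new_v[1] != old_v[1]:
        return "minor"
    if new_v[2] != old_v[2]:
        return "patch"
    return "none"


def reader_accepts(reader_version: str, payload_version: str) -> bool:
    """Assumed policy: same major, and payload minor <= reader minor."""
    reader, payload = parse_version(reader_version), parse_version(payload_version)
    return reader[0] == payload[0] and payload[1] <= reader[1]
```

If your readers tolerate unknown fields, you can relax the minor-version check and compare major versions only.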
Common Mistakes to Avoid
- Changing KPI formulas without a changelog notice
- Allowing free-form status labels across teams
- Skipping schema validation in ingestion jobs
- Mixing workflow-level and org-level metrics in the same object without a namespace
- Leaving data quality and metric lineage without an owner
Most reliability confusion comes from loose contracts, not from missing tools.
Pros and Cons of Standardized KPI + Schema Layer
| Pros | Cons |
|---|---|
| One source of truth across teams | Requires early coordination effort |
| Faster incident and review decisions | Needs ongoing schema governance |
| Cleaner automation of weekly reporting | Initial pipeline validation work |
| Easier scaling across many workflows | Versioning discipline required |
FAQ: KPI Dictionary + JSON Schema
1) What is a KPI dictionary for AI reliability?
It is a controlled catalog of metric names, formulas, thresholds, and owners used to avoid inconsistent reporting.
2) Why use JSON schema for dashboards?
Schema validation ensures payloads are consistent, complete, and machine-actionable across systems.
3) How many KPIs should we standardize first?
Start with 8-12 high-impact KPIs. Expand only after stable adoption.
4) Should schema include action items?
Yes. Reliability workflows improve when metrics and actions are linked in one payload.
5) How often should dictionary thresholds be reviewed?
Quarterly, or immediately after major architecture/model changes.
6) Can startups use this too?
Absolutely. A lean version helps startups avoid chaos as traffic scales.
7) What if one team wants custom KPIs?
Allow extension fields in a namespace, but keep core KPIs mandatory.
8) How do we validate payloads?
Run schema validation in CI and ingestion jobs, and fail fast on required-field errors.
9) Should dashboard status be computed or manual?
Primary status should be computed from thresholds; manual override can be allowed with justification notes.
10) What is the next maturity step after this?
Automated anomaly detection and predictive reliability forecasting per workflow.
Final Thoughts
If dashboards are your eyes, KPI dictionaries and schemas are your nervous system.
Without them, teams react slower and reliability drifts quietly.
With them, you get faster decisions, cleaner automation, and safer scale.
Want the next build in this series?
We can publish an Agent Reliability Incident Runbook Library with ready-to-use Sev-1/Sev-2 playbooks, response checklists, and postmortem templates.
Extended KPI Dictionary Fields (Template Format)
For each KPI entry, store this complete record so audits are reproducible.
| Field | Description | Example |
|---|---|---|
| kpi_key | Canonical machine key | task_success_quality_pct |
| display_name | Human readable name | Task Success Quality % |
| business_goal | Why KPI exists | Maintain output trust and user satisfaction |
| formula_text | Readable formula | (accepted / total_outputs) x 100 |
| query_ref | SQL/model reference | metrics.sql#task_success_quality |
| calc_frequency | Update cadence | Hourly + daily aggregate |
| threshold_green | Healthy range | >= 92 |
| threshold_yellow | Watch range | 85-91.99 |
| threshold_red | Intervention range | < 85 |
| alert_severity_map | Alert mapping by range | yellow=sev3, red=sev2 |
| owner | Accountable person | Platform Lead |
| runbook_link | Resolution guide | /runbooks/reliability/task-quality |
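The threshold and alert-mapping rows in this record translate directly into a status function. The sketch below uses the example values from the table for task_success_quality_pct; the function names are hypothetical. It treats the ranges as half-open intervals, so a value such as 91.995, which falls between the written yellow range (85-91.99) and green (>= 92), still gets a status.

```python
from typing import Optional, Tuple

# Taken from the alert_severity_map example row: yellow=sev3, red=sev2.
ALERT_SEVERITY_MAP = {"yellow": "sev3", "red": "sev2"}


def status_for_quality(value: float, green_min: float = 92.0, yellow_min: float = 85.0) -> str:
    """Map a higher-is-better KPI value to green, yellow, or red."""
    if value >= green_min:
        return "green"
    if value >= yellow_min:
        return "yellow"
    return "red"


def status_and_alert(value: float) -> Tuple[str, Optional[str]]:
    """Return (status, alert severity); green carries no alert."""
    status = status_for_quality(value)
    return status, ALERT_SEVERITY_MAP.get(status)
```

Lower-is-better KPIs such as p95_latency_ms need the comparisons reversed, so a production version would carry a direction flag in the dictionary entry.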
Schema Extensions for Multi-Workflow Organizations
As you scale, one snapshot per workflow may not be enough.
Add organization-level and portfolio-level rollups.
{
  "portfolio_summary": {
    "workflow_count": 17,
    "green_count": 10,
    "yellow_count": 5,
    "red_count": 2,
    "weighted_portfolio_score": 81.7
  },
  "workflow_snapshots": [
    { "workflow_id": "wf_1", "score_overall": 88.2, "status": "green" },
    { "workflow_id": "wf_2", "score_overall": 72.4, "status": "yellow" }
  ]
}
This helps leadership decide where to scale and where to stabilize.
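The rollup can be derived from the workflow snapshots rather than maintained by hand. This is a minimal sketch: the payload above does not say how weighted_portfolio_score is weighted, so the per-workflow weight field used here is hypothetical (run volume or business criticality are common choices), and it defaults to 1, which gives a plain average.

```python
from collections import Counter
from typing import Any, Dict, List


def build_portfolio_summary(snapshots: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Aggregate per-workflow snapshots into a portfolio_summary object."""
    counts = Counter(s["status"] for s in snapshots)
    # "weight" is a hypothetical field; a missing weight counts as 1.
    total_weight = sum(s.get("weight", 1) for s in snapshots)
    weighted_sum = sum(s["score_overall"] * s.get("weight", 1) for s in snapshots)
    score = round(weighted_sum / total_weight, 1) if total_weight else None
    return {
        "workflow_count": len(snapshots),
        "green_count": counts["green"],
        "yellow_count": counts["yellow"],
        "red_count": counts["red"],
        "weighted_portfolio_score": score,
    }
```

Computing the counts from the same snapshots that feed the dashboard keeps the portfolio view from drifting away from the workflow view.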
Data Quality Guardrails for KPI Integrity
- Reject payload if required KPI fields are null.
- Reject payload if snapshot_time is older than max staleness window.
- Reject payload if score_overall is outside 0-100.
- Warn if week-over-week change exceeds sanity band (e.g., >30 points) without change notes.
- Track validation pass rate as its own KPI.
After testing this pattern, teams usually find broken ETL mappings much earlier.
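The reject and warn rules above can live in one function that runs after schema validation. This sketch makes a few assumptions: the function and parameter names are hypothetical, the staleness window is passed in by the caller, the 30-point sanity band is applied to score_overall, and now must be a timezone-aware datetime.

```python
from datetime import datetime, timedelta
from typing import Any, Dict, List, Optional, Tuple

REQUIRED_KPIS = (
    "task_success_quality_pct", "incident_rate_per_1000", "recovery_success_pct",
    "policy_escape_rate_pct", "p95_latency_ms", "cost_per_completed_task_usd",
)


def check_guardrails(
    payload: Dict[str, Any],
    now: datetime,
    max_staleness: timedelta,
    previous_score: Optional[float] = None,
    change_notes: str = "",
) -> Tuple[List[str], List[str]]:
    """Return (rejections, warnings) for one snapshot payload."""
    rejections: List[str] = []
    warnings: List[str] = []
    kpis = payload.get("kpis") or {}
    for key in REQUIRED_KPIS:
        if kpis.get(key) is None:
            rejections.append(f"required KPI is null or missing: {key}")
    # fromisoformat() accepts a trailing "Z" only on Python 3.11+, so normalize it.
    snapshot_time = datetime.fromisoformat(payload["snapshot_time"].replace("Z", "+00:00"))
    if now - snapshot_time > max_staleness:
        rejections.append("snapshot_time is older than the staleness window")
    score = payload.get("score_overall")
    if score is None or not 0 <= score <= 100:
        rejections.append("score_overall is outside 0-100")
    elif (
        previous_score is not None
        and abs(score - previous_score) > 30
        and not change_notes.strip()
    ):
        warnings.append("week-over-week score change exceeds 30 points without change notes")
    return rejections, warnings
```

Counting how often rejections comes back empty gives you the validation pass rate from the last guardrail without extra instrumentation.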
Change Management Policy for KPI Formula Updates
Formula changes are dangerous if done silently.
Use this policy:
- Propose change with business rationale.
- Run a historical backfill comparison on the last 8 weeks.
- Document expected score shifts before rollout.
- Announce the version change and effective date.
- Keep the old formula's output in a parallel view for two weeks.
This avoids misleading trend breaks and stakeholder confusion.
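The backfill step is easy to automate once formulas are plain functions. In this sketch, backfill_comparison applies the old and new formula to the same weekly inputs and reports the shift per week. The "new" formula and the accepted_after_edit field are hypothetical, invented only to show the mechanics.

```python
from typing import Callable, Dict, List

WeekInputs = Dict[str, float]
Formula = Callable[[WeekInputs], float]


def backfill_comparison(
    weeks: List[WeekInputs], old: Formula, new: Formula, lookback: int = 8
) -> List[Dict[str, float]]:
    """Apply both formulas to the most recent `lookback` weeks and report the shift."""
    rows = []
    for inputs in weeks[-lookback:]:
        old_value, new_value = old(inputs), new(inputs)
        rows.append({"old": old_value, "new": new_value, "shift": new_value - old_value})
    return rows


def old_quality(week: WeekInputs) -> float:
    """Current dictionary formula: (accepted / total_outputs) x 100."""
    return week["accepted"] / week["total_outputs"] * 100


def new_quality(week: WeekInputs) -> float:
    """Hypothetical revision: outputs accepted only after human edits no longer count."""
    return (week["accepted"] - week["accepted_after_edit"]) / week["total_outputs"] * 100
```

Publishing the per-week shift alongside the announcement lets reviewers tell a formula change apart from a real reliability change.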
Reference Mapping: KPI to Runbook
| KPI | If Red Then | Runbook Action |
|---|---|---|
| task_success_quality_pct | Below 85% | Enable human-review mode and sample 50 failed outputs |
| incident_rate_per_1000 | Above 6 | Freeze releases and run incident cluster analysis |
| policy_escape_rate_pct | Above 0.5% | Block risky endpoints and apply stricter content filters |
| p95_latency_ms | Above 14000 | Switch to low-latency route and inspect queue backlog |
| cost_per_completed_task_usd | Above +20% budget | Activate cost guardrails and route low-risk steps to cheaper model |
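Keeping this mapping as data makes it easy for an alerting job to attach the right runbook action to a red KPI. A minimal sketch follows, with hypothetical names. One assumption to note: the cost row is stated relative to budget, so the check reads a budget variance percentage (as in the cost object of the snapshot) rather than the raw cost_per_completed_task_usd value.

```python
from typing import Callable, Dict, List, Optional, Tuple

Rule = Tuple[Callable[[float], bool], str]

RED_RULES: Dict[str, Rule] = {
    "task_success_quality_pct": (
        lambda v: v < 85, "Enable human-review mode and sample 50 failed outputs"),
    "incident_rate_per_1000": (
        lambda v: v > 6, "Freeze releases and run incident cluster analysis"),
    "policy_escape_rate_pct": (
        lambda v: v > 0.5, "Block risky endpoints and apply stricter content filters"),
    "p95_latency_ms": (
        lambda v: v > 14000, "Switch to low-latency route and inspect queue backlog"),
    # Evaluated against budget variance in percent, not the raw USD value.
    "cost_per_completed_task_usd": (
        lambda v: v > 20,
        "Activate cost guardrails and route low-risk steps to cheaper model"),
}


def red_actions(kpis: Dict[str, float], budget_variance_pct: Optional[float]) -> List[str]:
    """Return 'kpi: action' strings for every KPI currently in its red range."""
    actions = []
    for kpi, (is_red, action) in RED_RULES.items():
        value = budget_variance_pct if kpi == "cost_per_completed_task_usd" else kpis.get(kpi)
        if value is not None and is_red(value):
            actions.append(f"{kpi}: {action}")
    return actions
```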
Team Adoption Playbook
A dictionary and schema are only useful if teams actually use them.
Use this rollout sequence:
- Week 1: align on top 10 KPI keys and thresholds.
- Week 2: validate payloads in staging and fix mapping errors.
- Week 3: launch dashboard with computed status labels.
- Week 4: start weekly review where every action links to KPI movement.
Most people miss this: adoption is a process problem, not a tooling problem.
Case Study: Multi-Agent Team Standardization
A SaaS company had six active agent workflows and conflicting reporting.
Marketing said reliability was improving; engineering said it was not.
They implemented:
- One KPI dictionary with owner and formula per metric
- One JSON schema validated at ingestion
- One weekly dashboard snapshot artifact sent to all teams
Results after one quarter:
- Incident triage time reduced by 31%
- Metric disputes in review meetings dropped significantly
- Score-driven release decisions became consistent
- Leadership confidence in AI ops reporting increased
What changed wasn’t just data quality. Team alignment improved too.
Security and Compliance Add-On Fields
If you operate in regulated environments, include these optional fields:
{
  "compliance": {
    "policy_version": "v3.2",
    "data_residency": "IN",
    "retention_days": 30,
    "contains_pii": true,
    "approval_required": true,
    "audit_hash": "sha256:..."
  }
}
This helps audit and legal teams review runtime posture without separate spreadsheets.
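The payload above does not say what audit_hash covers, so treat the following as one possible convention rather than the defined one: hash a canonical JSON form of the snapshot (sorted keys, compact separators) with any existing audit_hash removed, so the hash never covers itself. The function name is hypothetical.

```python
import hashlib
import json
from typing import Any, Dict


def compute_audit_hash(snapshot: Dict[str, Any]) -> str:
    """Return 'sha256:<hex>' over a canonical JSON form of the snapshot."""
    body = json.loads(json.dumps(snapshot))  # deep copy via a JSON round trip
    # Drop any existing audit_hash so the result is stable when recomputed.
    body.get("compliance", {}).pop("audit_hash", None)
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return "sha256:" + hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Sorting keys matters here: two producers that serialize the same snapshot with different key order should still produce the same hash.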
Future-Proofing the Schema
- Keep core contract strict and small.
- Add extension namespaces for team-specific fields.
- Deprecate fields with sunset dates, not immediate removal.
- Publish a migration guide for every major schema version.
The more workflows you add, the more this discipline pays off.
Additional FAQs
11) Should KPI dictionary live in docs or code?
Both. Keep readable documentation and source-of-truth machine config versioned in code.
12) Can one schema support multiple environments?
Yes. Include the environment enum in the payload and enforce environment-specific thresholds in the evaluation layer.
13) How do we keep schema from becoming bloated?
Review quarterly and remove unused optional fields under the deprecation policy.
14) Should action items be mandatory in red status?
Yes. Require at least one owner and ETA whenever status is red.
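This rule is a cross-field check: whether actions is required depends on the value of status. JSON Schema can express that with its conditional keywords, but a plain check in the ingestion job is often easier to read. A minimal sketch, with a hypothetical function name:

```python
from typing import Any, Dict, List


def red_status_errors(payload: Dict[str, Any]) -> List[str]:
    """Require at least one action with an owner and ETA when status is red."""
    if payload.get("status") != "red":
        return []
    actions = payload.get("actions") or []
    complete = [a for a in actions if a.get("owner") and a.get("eta")]
    if not complete:
        return ["red status requires at least one action with an owner and ETA"]
    return []
```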
15) What metric indicates dashboard maturity?
Track “percentage of workflows with valid weekly snapshots” and “validation pass rate.”






