Performance Calibration — plain-language explainer

Performance Calibration measures what the calibration meeting actually changed — how many ratings moved up, how many moved down, how many held — and attributes that movement to the leaders who own it.

A People Analytics Toolbox component. Built to the portfolio Explainer Standard v1.0. Every claim below is grounded in the spoke's own code and contracts (src/spokes/performance-calibration/, contract 1.0.0); anything not yet built is marked (TBD).

1. What is it?

Performance Calibration is an impact analytics service for the rating-calibration process: feed it the pre-calibration and post-calibration rating for each employee in a cycle, and it tells you exactly how much the calibration room moved those ratings — per cycle, and per leader in the management chain.

It keeps a persisted snapshot of every { preCalibrated → postCalibrated } pair, keyed by tenant × employee × cycle, so the question "what did calibration do this cycle, and whose ratings moved?" has a durable, queryable answer rather than a one-off spreadsheet that gets lost after the meeting.
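To make the keying concrete, here is a minimal in-memory sketch of one row per tenant × employee × cycle, where a repeated key overwrites rather than duplicates. The PairStore class, the pairKey helper, and the sample IDs are hypothetical illustrations; the spoke itself persists to Postgres, and its actual storage code is not shown here.

```typescript
// Hypothetical in-memory stand-in for the persisted snapshot store.
// Field names follow the pair shape described in this explainer; the
// store class itself is an illustration, not the spoke's Postgres code.
type RawRating = number | string | null;

interface CalibrationPair {
  tenantId: string;
  employeeId: string;
  cycleId: string;
  preCalibrated: RawRating;
  postCalibrated: RawRating;
}

// Composite key: one row per tenant x employee x cycle.
function pairKey(tenantId: string, employeeId: string, cycleId: string): string {
  return JSON.stringify([tenantId, employeeId, cycleId]);
}

class PairStore {
  private rows = new Map<string, CalibrationPair>();

  // Upsert: a repeated key overwrites the earlier row instead of adding one.
  upsert(pair: CalibrationPair): void {
    this.rows.set(pairKey(pair.tenantId, pair.employeeId, pair.cycleId), pair);
  }

  get(tenantId: string, employeeId: string, cycleId: string): CalibrationPair | undefined {
    return this.rows.get(pairKey(tenantId, employeeId, cycleId));
  }

  get size(): number {
    return this.rows.size;
  }
}

// Made-up sample data for illustration only.
const store = new PairStore();
const pair: CalibrationPair = {
  tenantId: "tenant-demo",
  employeeId: "emp-001",
  cycleId: "FY25-H1",
  preCalibrated: 3,
  postCalibrated: 4,
};
store.upsert(pair);
store.upsert(pair);                          // simulated retry: still one row
store.upsert({ ...pair, postCalibrated: 5 }); // later correction overwrites
```

After the three writes the store holds a single row for that key, carrying the most recent post-calibration rating.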

Visual — Tier A (live API capture). A real read against the spoke, grounded in the in-repo fixture (integration/performix/fixtures/sample-response.json):

GET /api/spokes/performance-calibration/summary?tenantId=tenant-cal-fixture&cycleIds=FY25-H1
→ {
    "cycles": [
      {
        "cycleId": "FY25-H1", "cycleLabel": "FY25 H1",
        "totalRated": 48,
        "bumpedUp": 6, "bumpedDown": 9, "unchanged": 33,
        "pctBumpedUp": 0.125, "pctBumpedDown": 0.1875, "pctUnchanged": 0.6875
      }
    ],
    "leaderRollups": [
      {
        "layer": "l3", "leaderName": "Alex Ng", "cycleId": "FY25-H1",
        "totalRated": 12,
        "pctBumpedUp": 0.08, "pctBumpedDown": 0.25, "pctUnchanged": 0.67, "pctChanged": 0.33
      }
    ]
  }

(Shape matches CycleQueryResponseSchema in contracts/types.ts; values are the spoke's own committed fixture.)
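The cycle figures in the capture can be checked by hand: each percentage is its count divided by totalRated, and the three counts sum to totalRated. The sketch below encodes that check. The CycleSummary interface is an assumption that mirrors only the fields visible in the sample above (the authoritative shape is CycleQueryResponseSchema), and isInternallyConsistent is a hypothetical helper, not part of the spoke.

```typescript
// Assumed subset of the cycle summary shape, copied from the sample response.
interface CycleSummary {
  cycleId: string;
  totalRated: number;
  bumpedUp: number;
  bumpedDown: number;
  unchanged: number;
  pctBumpedUp: number;
  pctBumpedDown: number;
  pctUnchanged: number;
}

// Values taken verbatim from the FY25-H1 capture above.
const fy25h1: CycleSummary = {
  cycleId: "FY25-H1",
  totalRated: 48,
  bumpedUp: 6,
  bumpedDown: 9,
  unchanged: 33,
  pctBumpedUp: 0.125,
  pctBumpedDown: 0.1875,
  pctUnchanged: 0.6875,
};

// Hypothetical helper: counts must sum to totalRated and each
// percentage must equal count / totalRated within a small tolerance.
function isInternallyConsistent(c: CycleSummary, tolerance = 1e-9): boolean {
  if (c.totalRated <= 0) return false;
  const countsAddUp = c.bumpedUp + c.bumpedDown + c.unchanged === c.totalRated;
  const close = (a: number, b: number): boolean => Math.abs(a - b) <= tolerance;
  return (
    countsAddUp &&
    close(c.pctBumpedUp, c.bumpedUp / c.totalRated) &&
    close(c.pctBumpedDown, c.bumpedDown / c.totalRated) &&
    close(c.pctUnchanged, c.unchanged / c.totalRated)
  );
}

const fixtureIsConsistent = isInternallyConsistent(fy25h1);
```

For the fixture, 6/48, 9/48 and 33/48 reproduce 0.125, 0.1875 and 0.6875 exactly, so the check passes.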

2. What problem does it solve — and why is it different?

The pain it removes: calibration meetings move ratings, but the before picture usually evaporates the moment the meeting ends, so no one can say with numbers whether the room did real work or rubber-stamped the draft — and certainly not which leader's submissions changed the most.

The difference, stated as a shift:

FROM a final rating distribution with no memory of what it looked like before the room, and "calibration happened" as the only evidence.
TO a persisted pre/post pair per employee, rolled up to how many moved up, how many down, how many held — and the same breakdown attributed down the leader hierarchy.

How it differs from the obvious substitutes:

vs. doing it by hand in a spreadsheet — the pre/post columns are exactly the kind of thing that lives in one analyst's workbook and is gone next cycle. This spoke persists each pair under a stable { tenant × employee × cycle } key, so trajectories across cycles and year-over-year comparisons are queries, not archaeology.
vs. generic BI on the final ratings — a dashboard over post-calibration ratings can show the distribution but not the movement; it has no pre-state to diff against. The unit here is the delta, classified up / down / unchanged / incomplete per person.

Visual — Tier B (FROM→TO typographic block). final-distribution-only (movement invisible) → persisted {pre → post} pair per person → up/down/unchanged counts → same counts attributed per leader. A rendered comparison block is a follow-up (FU-A).

3. How does it work?

Inputs → method → outputs, concretely:

Input: calibration pairs — { employeeId, tenantId, cycleId, cycleLabel?, preCalibrated, postCalibrated, leaderChain?, capturedAt? }. Ratings arrive as number | string | null (raw, un-normalized); the optional leaderChain carries the ELT / L3 / L4 / L5 leader names for attribution. Ingested via POST /pairs (service-key gated), which UPSERTs idempotently — duplicate keys overwrite, so ETL retries are safe.
Method — normalize, then classify, then roll up:
1. Normalize each rating through parseRating (src/lib/rating-codec/parseRating.ts): the toolbox five-point ladder plus spreadsheet-decimal fallback, currently a stub pending the full PAT-152 ladder. A rating it can't parse becomes null.
2. Classify the delta per pair (classifyDeltaDirection in core/rollup.ts): post − pre > 0 → up, < 0 → down, = 0 → unchanged. If either side fails to parse, the pair is incomplete and is excluded from the denominators — counts only ever reflect pairs with both parses present.
3. Roll up (buildCalibrationRollups) in one sweep: per-cycle bumped-up / -down / -unchanged counts and percentages, plus per-leader aggregates at each of the four hierarchy layers.
Output: a CycleQueryResponse (cycle summaries + leader rollups) from GET /summary, and an EmployeeTrajectoryResponse (per-cycle pre/post/delta/direction, deduped to the newest row per cycle then truncated to lastNCycles, default 6) from GET /trajectory.
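The classify and roll-up steps above can be sketched as follows. This is an illustration of the stated rules, not the code in core/rollup.ts: sketchClassify and sketchCycleRollup are hypothetical names, ratings are assumed to be already normalized to number | null, the incomplete counter is added here only to make the exclusion visible, and the per-leader layers are omitted for brevity.

```typescript
type Direction = "up" | "down" | "unchanged" | "incomplete";

// Assumed input: ratings already normalized; null means the parse failed.
interface ParsedPair {
  cycleId: string;
  pre: number | null;
  post: number | null;
}

// Rule from the text: post - pre > 0 is up, < 0 is down, = 0 is unchanged;
// a failed parse on either side makes the pair incomplete.
function sketchClassify(pre: number | null, post: number | null): Direction {
  if (pre === null || post === null) return "incomplete";
  const delta = post - pre;
  if (delta > 0) return "up";
  if (delta < 0) return "down";
  return "unchanged";
}

interface CycleCounts {
  totalRated: number; // complete pairs only
  bumpedUp: number;
  bumpedDown: number;
  unchanged: number;
  incomplete: number; // illustrative extra, kept outside the denominator
  pctBumpedUp: number;
  pctBumpedDown: number;
  pctUnchanged: number;
}

function sketchCycleRollup(pairs: ParsedPair[]): Map<string, CycleCounts> {
  const byCycle = new Map<string, CycleCounts>();
  for (const p of pairs) {
    let c = byCycle.get(p.cycleId);
    if (!c) {
      c = {
        totalRated: 0, bumpedUp: 0, bumpedDown: 0, unchanged: 0, incomplete: 0,
        pctBumpedUp: 0, pctBumpedDown: 0, pctUnchanged: 0,
      };
      byCycle.set(p.cycleId, c);
    }
    const direction = sketchClassify(p.pre, p.post);
    if (direction === "incomplete") {
      c.incomplete += 1; // never counted as unchanged
      continue;
    }
    c.totalRated += 1;
    if (direction === "up") c.bumpedUp += 1;
    else if (direction === "down") c.bumpedDown += 1;
    else c.unchanged += 1;
  }
  // Percentages use complete pairs as the denominator.
  byCycle.forEach((c) => {
    if (c.totalRated > 0) {
      c.pctBumpedUp = c.bumpedUp / c.totalRated;
      c.pctBumpedDown = c.bumpedDown / c.totalRated;
      c.pctUnchanged = c.unchanged / c.totalRated;
    }
  });
  return byCycle;
}

// Made-up pairs: one up, one down, two unchanged, one with a failed parse.
const demoPairs: ParsedPair[] = [
  { cycleId: "FY25-H1", pre: 3, post: 4 },
  { cycleId: "FY25-H1", pre: 4, post: 3 },
  { cycleId: "FY25-H1", pre: 3, post: 3 },
  { cycleId: "FY25-H1", pre: 2, post: 2 },
  { cycleId: "FY25-H1", pre: null, post: 3 },
];
const demoRollup = sketchCycleRollup(demoPairs);
```

With these five pairs, totalRated is 4 rather than 5, so pctUnchanged is 2/4 = 0.5. Had the incomplete pair been counted as unchanged, it would have read 3/5 = 0.6, which is the inflation the exclusion rule prevents.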

The science backing is methodological rather than psychometric. The load-bearing discipline is the honest treatment of incomplete parses, which are never silently counted as unchanged: pctUnchanged is defined as calibration-room accuracy among complete parses only, so a cycle with sparse data can't masquerade as a stable one.

Differentiation beat: the practitioner's real question isn't "what's the new distribution"; it's "did calibration do anything, and is one leader's draft systematically generous or harsh?" The per-layer leader rollup answers the second half directly. A leader whose pctBumpedDown stays far above the cycle average across cycles submitted drafts the room kept pulling down (a generous draft); a persistently high pctBumpedUp signals the reverse (a harsh draft). Either pattern is on the record.
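One way to put a number on "far above the cycle average" is to compare a leader's pctBumpedDown with the cycle's, both as a ratio and as a percentage-point gap. The sketch below does that with the fixture values from section 1. compareDownwardRate is a hypothetical helper, and the explainer does not define any threshold for what counts as "far above", so none is assumed here.

```typescript
interface LeaderVsCycle {
  ratio: number | null; // leader rate / cycle rate; null when the cycle rate is 0
  pointGap: number;     // difference in percentage points
}

// Hypothetical helper: compares a leader's downward-bump rate with the cycle's.
function compareDownwardRate(leaderPctDown: number, cyclePctDown: number): LeaderVsCycle {
  return {
    ratio: cyclePctDown > 0 ? leaderPctDown / cyclePctDown : null,
    pointGap: (leaderPctDown - cyclePctDown) * 100,
  };
}

// Fixture values: the L3 leader's pctBumpedDown is 0.25, the cycle's is 0.1875.
const l3VersusCycle = compareDownwardRate(0.25, 0.1875);
// ratio is 4/3 (about 1.33); pointGap is 6.25 percentage points
```

For the fixture, the leader's reports were bumped down at about 1.33 times the cycle rate. With only 12 rated reports, a rate of 0.25 corresponds to three people, so a single-cycle gap of this size is a prompt for a closer look rather than a finding.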

Visual — Tier B (step flow). raw {pre, post} (string|number|null) → parseRating → classifyDeltaDirection (up/down/unchanged/incomplete) → buildCalibrationRollups → { cycle summaries · ELT/L3/L4/L5 leader rollups }.

4. What does it enable?

Concrete uses a practitioner would recognize:

Quantify what calibration changed — report pctUnchanged as the share of ratings the room left alone, and pctBumpedUp / pctBumpedDown as the share it moved, for one cycle or several at once.
Spot leniency or harshness by leader: the ELT/L3/L4/L5 rollups surface how much of each leader's draft the room adjusted (pctChanged per leader) and in which direction (pctBumpedUp vs. pctBumpedDown).
Track an employee's calibration history — the trajectory read returns the trailing N cycles of pre/post/delta for one person, with the ELT + L3 leader attributed at each cycle.
Audit a cycle after the fact — because every {pre → post} pair is persisted, a cycle can be re-examined cycles later without the original workbook.
Feed downstream validity / fairness analysis — the cycle and trajectory reads are the inputs a validity spoke uses to ask whether bumps persist or wash out next cycle.
Detect drift over time — comparing pctUnchanged across cycles shows whether calibration is converging (less movement needed) or the drafts are drifting.
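The employee-history use above relies on the trajectory read, which section 3 describes as deduped to the newest row per cycle and then truncated to lastNCycles (default 6). A minimal sketch of that two-step shaping follows. sketchTrajectory and the TrajectoryRow shape are hypothetical. Ordering cycles by capturedAt, and treating capturedAt as an ISO-8601 string, are assumptions made for this illustration; the explainer does not say how the spoke orders cycles.

```typescript
// Assumed row shape for illustration; not the EmployeeTrajectoryResponse contract.
interface TrajectoryRow {
  cycleId: string;
  capturedAt: string; // assumed ISO-8601 timestamp
  pre: number | null;
  post: number | null;
}

function sketchTrajectory(rows: TrajectoryRow[], lastNCycles = 6): TrajectoryRow[] {
  // Step 1: keep only the newest row for each cycle.
  const newestByCycle = new Map<string, TrajectoryRow>();
  for (const row of rows) {
    const kept = newestByCycle.get(row.cycleId);
    if (!kept || Date.parse(row.capturedAt) > Date.parse(kept.capturedAt)) {
      newestByCycle.set(row.cycleId, row);
    }
  }
  // Step 2: order oldest to newest (assumption: by capturedAt), keep the trailing N.
  const deduped: TrajectoryRow[] = [];
  newestByCycle.forEach((row) => deduped.push(row));
  deduped.sort((a, b) => Date.parse(a.capturedAt) - Date.parse(b.capturedAt));
  return lastNCycles > 0 ? deduped.slice(-lastNCycles) : [];
}

// Made-up rows: FY24-H1 appears twice, and the later capture should win.
const demoRows: TrajectoryRow[] = [
  { cycleId: "FY24-H1", capturedAt: "2024-01-10T00:00:00Z", pre: 3, post: 3 },
  { cycleId: "FY24-H1", capturedAt: "2024-01-20T00:00:00Z", pre: 3, post: 4 },
  { cycleId: "FY24-H2", capturedAt: "2024-07-15T00:00:00Z", pre: 4, post: 4 },
  { cycleId: "FY25-H1", capturedAt: "2025-01-12T00:00:00Z", pre: 4, post: 3 },
];
const fullTrajectory = sketchTrajectory(demoRows);     // three cycles, oldest first
const lastTwoCycles = sketchTrajectory(demoRows, 2);   // FY24-H2 and FY25-H1
```

Deduping before truncating matters: truncating first could spend one of the N slots on a stale duplicate of a cycle and drop an older cycle that should have been kept.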

Visual — (TBD — a rendered per-leader up/down/unchanged stacked-bar across one cycle's leaders).

5. How it fits in the toolbox

Data flow:

Consumes — uploaded HRIS/calibration data: pre/post rating pairs and the leader chain, ingested per tenant via POST /pairs. Ratings normalize through the shared src/lib/rating-codec/parseRating codec (cross-cutting, not a spoke). No external data sources (no BLS/O*NET/Principia) — the inputs are tenant calibration submissions.
Emits — two contracts consumers vendor from src/spokes/performance-calibration/contracts/types.ts: CycleQueryResponse (cycle summaries + leader rollups) and EmployeeTrajectoryResponse (per-employee time series).
Feeds — the registered consumers are performix and vela, both planned (per registry.ts). The README names a future performance-validity consumer that reads cycle + trajectory to quantify downstream effects of calibration.
Adjacency — leader attribution today reads flattened ELT/L3/L4/L5 columns off each pair; a noted TODO is to compose with org-graph ancestor collapse so leader coverage derives from persisted graph edges rather than denormalized chain columns. That cross-spoke link is (TBD — org-graph rollup integration).

Visual — Tier B (typographic data-flow). tenant calibration pairs + leaderChain → POST /pairs (parseRating) → persisted {tenant × employee × cycle} → { GET /summary · GET /trajectory } → performix · vela · (future) performance-validity.

6. Commercialization / packaging

Performance Calibration is a service component, not a standalone product — it is the calibration-impact analytics leg behind a performance-management or talent-review surface, consumed over HTTP + MCP by buyer-facing tools rather than sold on its own.

Data-license posture: the inputs are the customer's own calibration ratings and leader chains; there is no vendor-survey or licensed third-party data in the pipeline, so no external data-license constraint attaches to the analytics. Tenant data stays scoped by tenantId; team-level rollups are subject to the toolbox's standard min-N gate handled elsewhere in the stack.
Anything about pricing tiers or packaged offerings is (TBD) — not earned yet, so not stated.

Visual — (TBD — product-tier placement diagram).

7. The vision

A durable, auditable record of what every calibration room actually changed — movement quantified per cycle and attributed cleanly down the real org hierarchy, feeding the validity and fairness questions that come after.

The near-term direction visible in the code: replace the flattened leader-chain columns with org-graph-derived leader coverage (the integrate-org-graph-rollup TODO in core/rollup.ts), and harden rating normalization as the shared parseRating codec graduates from its current stub toward the full PAT-152 rating ladder. The downstream arc is wiring the planned performance-validity consumer so calibration movement can be tested for persistence — do this cycle's bumps survive to next cycle, or wash out.

Visual — (TBD — the calibration → validity loop as a multi-cycle persistence diagram).

8. Current status

Grounded in the real code state (contract 1.0.0, src/spokes/performance-calibration/, registry status: "live", PAT-159, 2026-05-24):

Shipped: Postgres performance_calibration schema with heartbeat + calibration_pairs tables; the delta-classification + rollup engine (core/rollup.ts); idempotent ingest (POST /pairs, service-key gated); cycle summary read (GET /summary) returning cycle stats + ELT/L3/L4/L5 leader rollups; employee trajectory read (GET /trajectory); GET /health. MCP module registered under performance-calibration.* (.health, .pairs.ingest, .cycles.summary, .employees.trajectory).
In flight / planned: parseRating is a stub today (full rating ladder lands with PAT-152); org-graph-derived leader coverage is a noted TODO, not yet wired; the performix / vela / performance-validity consumers are all planned, not connected.

Visual — Tier A (live capture). GET /api/spokes/performance-calibration/health reports the schema reachability + contract version at request time.

Worked example used above is the spoke's own committed fixture (integration/performix/fixtures/sample-response.json) for cycle FY25-H1: of 48 employees with both ratings parsed, 33 held (pctUnchanged 0.6875), 6 moved up (0.125), and 9 moved down (0.1875). The 12 rated reports of L3 leader Alex Ng moved downward about a third more often than the cycle average (pctBumpedDown 0.25 vs. cycle 0.1875, a ratio of roughly 1.33), which is the generous-draft signal the leader rollup exists to surface: the room pulled that leader's drafts down more than it did the cycle as a whole. No figure here is invented.