Performance Calibration — plain-language explainer
Performance Calibration measures what the calibration meeting actually changed — how many ratings moved up, how many moved down, how many held — and attributes that movement to the leaders who own it.
A People Analytics Toolbox component. Built to the portfolio Explainer Standard v1.0. Every claim below is grounded in the spoke's own code and contracts (src/spokes/performance-calibration/, contract 1.0.0); anything not yet built is marked (TBD).
1. What is it?
Performance Calibration is an impact analytics service for the rating-calibration process: feed it the pre-calibration and post-calibration rating for each employee in a cycle, and it tells you exactly how much the calibration room moved those ratings — per cycle, and per leader in the management chain.
It keeps a persisted snapshot of every { preCalibrated → postCalibrated } pair, keyed by tenant × employee × cycle, so the question "what did calibration do this cycle, and whose ratings moved?" has a durable, queryable answer rather than a one-off spreadsheet that gets lost after the meeting.
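The keying described above can be illustrated with a minimal sketch. It assumes an in-memory Map as a stand-in for the persisted store; `StoredPair`, `pairKey`, and `upsertPair` are hypothetical names for illustration, not the spoke's API, and the tenant and employee ids are invented.

```typescript
// Hypothetical in-memory stand-in for the persisted snapshot store.
type Rating = number | string | null;

interface StoredPair {
  tenantId: string;
  employeeId: string;
  cycleId: string;
  preCalibrated: Rating;
  postCalibrated: Rating;
}

// Composite key: tenant × employee × cycle. Serializing a tuple avoids the
// collisions a plain delimiter could cause if an id happened to contain it.
function pairKey(p: Pick<StoredPair, "tenantId" | "employeeId" | "cycleId">): string {
  return JSON.stringify([p.tenantId, p.employeeId, p.cycleId]);
}

// Writing the same key twice overwrites, so a replayed load leaves one row.
function upsertPair(store: Map<string, StoredPair>, pair: StoredPair): void {
  store.set(pairKey(pair), pair);
}

const store = new Map<string, StoredPair>();
const base = { tenantId: "tenant-a", employeeId: "e-1", cycleId: "FY25-H1" };

upsertPair(store, { ...base, preCalibrated: 3, postCalibrated: 3 });
upsertPair(store, { ...base, preCalibrated: 3, postCalibrated: 4 }); // same key: overwrites
upsertPair(store, { ...base, cycleId: "FY25-H2", preCalibrated: 4, postCalibrated: 4 });
// The store now holds two rows for employee e-1, one per cycle.
```

Because the cycle is part of the key, a later cycle adds a row instead of replacing the earlier one, which is what makes cross-cycle queries possible.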
Visual — Tier A (live API capture). A real read against the spoke, grounded in the in-repo fixture (integration/performix/fixtures/sample-response.json):
GET /api/spokes/performance-calibration/summary?tenantId=tenant-cal-fixture&cycleIds=FY25-H1
→ {
  "cycles": [
    {
      "cycleId": "FY25-H1", "cycleLabel": "FY25 H1",
      "totalRated": 48,
      "bumpedUp": 6, "bumpedDown": 9, "unchanged": 33,
      "pctBumpedUp": 0.125, "pctBumpedDown": 0.1875, "pctUnchanged": 0.6875
    }
  ],
  "leaderRollups": [
    {
      "layer": "l3", "leaderName": "Alex Ng", "cycleId": "FY25-H1",
      "totalRated": 12,
      "pctBumpedUp": 0.08, "pctBumpedDown": 0.25, "pctUnchanged": 0.67, "pctChanged": 0.33
    }
  ]
}
(Shape matches CycleQueryResponseSchema in contracts/types.ts; values are the spoke's own committed fixture.)
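The cycle-level percentages in the capture can be checked against its counts. The sketch below assumes each percentage is its count divided by totalRated; the `CycleSummary` interface is inferred from the field names in the fixture and is not a copy of CycleQueryResponseSchema, and `isInternallyConsistent` is a hypothetical helper.

```typescript
// Field names are read off the fixture; the authoritative shape lives in
// contracts/types.ts and may carry more fields than this sketch assumes.
interface CycleSummary {
  cycleId: string;
  cycleLabel: string;
  totalRated: number;
  bumpedUp: number;
  bumpedDown: number;
  unchanged: number;
  pctBumpedUp: number;
  pctBumpedDown: number;
  pctUnchanged: number;
}

const fy25h1: CycleSummary = {
  cycleId: "FY25-H1",
  cycleLabel: "FY25 H1",
  totalRated: 48,
  bumpedUp: 6,
  bumpedDown: 9,
  unchanged: 33,
  pctBumpedUp: 0.125,
  pctBumpedDown: 0.1875,
  pctUnchanged: 0.6875,
};

// Assumption: the three counts partition totalRated, and each percentage is
// count / totalRated. A zero totalRated fails the check (division yields NaN).
function isInternallyConsistent(c: CycleSummary, tolerance = 1e-9): boolean {
  const close = (a: number, b: number): boolean => Math.abs(a - b) <= tolerance;
  const countsAddUp = c.bumpedUp + c.bumpedDown + c.unchanged === c.totalRated;
  return (
    countsAddUp &&
    close(c.pctBumpedUp, c.bumpedUp / c.totalRated) &&
    close(c.pctBumpedDown, c.bumpedDown / c.totalRated) &&
    close(c.pctUnchanged, c.unchanged / c.totalRated)
  );
}

const fixtureIsConsistent = isInternallyConsistent(fy25h1);
// true: 6 + 9 + 33 = 48, and 6/48 = 0.125, 9/48 = 0.1875, 33/48 = 0.6875
```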
2. What problem does it solve — and why is it different?
The pain it removes: calibration meetings move ratings, but the before picture usually evaporates the moment the meeting ends, so no one can say with numbers whether the room did real work or rubber-stamped the draft — and certainly not which leader's submissions changed the most.
The difference, stated as a shift:
- FROM a final rating distribution with no memory of what it looked like before the room, and "calibration happened" as the only evidence.
- TO a persisted pre/post pair per employee, rolled up to how many moved up, how many down, how many held — and the same breakdown attributed down the leader hierarchy.
How it differs from the obvious substitutes:
- vs. doing it by hand in a spreadsheet: the pre/post columns are exactly the kind of thing that lives in one analyst's workbook and is gone next cycle. This spoke persists each pair under a stable { tenant × employee × cycle } key, so trajectories across cycles and year-over-year comparisons are queries, not archaeology.
- vs. generic BI on the final ratings: a dashboard over post-calibration ratings can show the distribution but not the movement; it has no pre-state to diff against. The unit here is the delta, classified up/down/unchanged/incomplete per person.
Visual — Tier B (FROM→TO typographic block). final-distribution-only (movement invisible) → persisted {pre → post} pair per person → up/down/unchanged counts → same counts attributed per leader. A rendered comparison block is a follow-up (FU-A).
3. How does it work?
Inputs → method → outputs, concretely:
- Input: calibration pairs, { employeeId, tenantId, cycleId, cycleLabel?, preCalibrated, postCalibrated, leaderChain?, capturedAt? }. Ratings arrive as number | string | null (raw, un-normalized); the optional leaderChain carries the ELT / L3 / L4 / L5 leader names for attribution. Ingested via POST /pairs (service-key gated), which UPSERTs idempotently: duplicate keys overwrite, so ETL retries are safe.
- Method (normalize, then classify, then roll up):
  - Normalize each rating through parseRating (src/lib/rating-codec/parseRating.ts): the toolbox five-point ladder plus spreadsheet-decimal fallback. A rating it can't parse becomes null.
  - Classify the delta per pair (classifyDeltaDirection in core/rollup.ts): post − pre > 0 → up, < 0 → down, = 0 → unchanged. If either side fails to parse, the pair is incomplete and is excluded from the denominators; counts only ever reflect pairs with both parses present.
  - Roll up (buildCalibrationRollups) in one sweep: per-cycle bumped-up / -down / -unchanged counts and percentages, plus per-leader aggregates at each of the four hierarchy layers.
- Output: a CycleQueryResponse (cycle summaries + leader rollups) from GET /summary, and an EmployeeTrajectoryResponse (per-cycle pre/post/delta/direction, deduped to the newest row per cycle then truncated to lastNCycles, default 6) from GET /trajectory.
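The normalize, classify, roll-up sequence above can be sketched in a few lines. This is an illustration under stated assumptions, not the code in core/rollup.ts: `normalizeRating` is a hypothetical numeric-only stand-in for parseRating (the real codec also handles the five-point ladder, whose labels are not given here), `classifyDelta` and `summarizeCycle` are hypothetical names, and tracking a separate incomplete count is an assumption.

```typescript
type RawRating = number | string | null;
type Direction = "up" | "down" | "unchanged" | "incomplete";

// Hypothetical stand-in for parseRating: accepts finite numbers and numeric
// strings only; anything else becomes null.
function normalizeRating(raw: RawRating): number | null {
  if (raw === null) return null;
  if (typeof raw === "string" && raw.trim() === "") return null; // Number("") would be 0
  const n = typeof raw === "number" ? raw : Number(raw.trim());
  return Number.isFinite(n) ? n : null;
}

// Sign of (post - pre); a pair with either side unparsed is "incomplete".
function classifyDelta(pre: number | null, post: number | null): Direction {
  if (pre === null || post === null) return "incomplete";
  const delta = post - pre;
  if (delta > 0) return "up";
  if (delta < 0) return "down";
  return "unchanged";
}

interface CycleCounts {
  totalRated: number; // complete pairs only
  bumpedUp: number;
  bumpedDown: number;
  unchanged: number;
  incomplete: number; // assumption: tracked separately, never in the denominator
}

function summarizeCycle(pairs: Array<{ pre: RawRating; post: RawRating }>): CycleCounts {
  const counts: CycleCounts = { totalRated: 0, bumpedUp: 0, bumpedDown: 0, unchanged: 0, incomplete: 0 };
  for (const p of pairs) {
    const direction = classifyDelta(normalizeRating(p.pre), normalizeRating(p.post));
    if (direction === "incomplete") {
      counts.incomplete += 1;
      continue;
    }
    counts.totalRated += 1;
    if (direction === "up") counts.bumpedUp += 1;
    else if (direction === "down") counts.bumpedDown += 1;
    else counts.unchanged += 1;
  }
  return counts;
}

// Invented sample pairs for illustration.
const summary = summarizeCycle([
  { pre: 3, post: 4 },     // up
  { pre: "4", post: 3 },   // down (numeric string parses)
  { pre: 3, post: "3.0" }, // unchanged
  { pre: "n/a", post: 3 }, // incomplete: pre fails to parse
  { pre: 2, post: null },  // incomplete: post missing
]);
const pctUnchanged = summary.totalRated > 0 ? summary.unchanged / summary.totalRated : null;
```

In this sample three pairs are complete, so pctUnchanged is 1/3. Had the two incomplete pairs been counted as unchanged, the figure would have been 3/5, which is the distortion the exclusion rule prevents.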
The science backing is methodological rather than psychometric. The load-bearing discipline is the honest treatment of incomplete parses: they are never silently counted as unchanged. pctUnchanged is defined as calibration-room accuracy among complete parses only, so a cycle with sparse data can't masquerade as a stable one.
Differentiation beat: the practitioner's real question isn't "what's the new distribution"; it's "did calibration do anything, and is one leader's draft systematically generous or harsh?" The per-layer leader rollup answers the second half directly: a leader whose pctBumpedDown sits far above the cycle average, cycle after cycle, submitted drafts the room repeatedly pulled down, and that pattern is on the record.
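A per-layer leader rollup of this kind can be sketched as a single pass that credits each complete pair to every leader named in its chain. The layer keys "elt", "l4", and "l5" are assumed by analogy with the "l3" value in the fixture; `rollUpByLeader`, `ClassifiedPair`, and the leader names are hypothetical, and the input is taken to be already classified.

```typescript
type Layer = "elt" | "l3" | "l4" | "l5"; // assumed keys, modeled on the fixture's "l3"
const LAYERS: Layer[] = ["elt", "l3", "l4", "l5"];

interface ClassifiedPair {
  direction: "up" | "down" | "unchanged" | "incomplete";
  leaderChain?: Partial<Record<Layer, string>>;
}

interface LeaderCounts {
  layer: Layer;
  leaderName: string;
  totalRated: number;
  bumpedUp: number;
  bumpedDown: number;
  unchanged: number;
}

function rollUpByLeader(pairs: ClassifiedPair[]): LeaderCounts[] {
  const byLeader = new Map<string, LeaderCounts>();
  for (const pair of pairs) {
    if (pair.direction === "incomplete") continue; // excluded from denominators
    for (const layer of LAYERS) {
      const leaderName = pair.leaderChain?.[layer];
      if (!leaderName) continue; // no leader recorded at this layer
      const key = `${layer}|${leaderName}`; // layer values never contain "|"
      let row = byLeader.get(key);
      if (!row) {
        row = { layer, leaderName, totalRated: 0, bumpedUp: 0, bumpedDown: 0, unchanged: 0 };
        byLeader.set(key, row);
      }
      row.totalRated += 1;
      if (pair.direction === "up") row.bumpedUp += 1;
      else if (pair.direction === "down") row.bumpedDown += 1;
      else row.unchanged += 1;
    }
  }
  return Array.from(byLeader.values());
}

// Invented sample: one ELT leader over two L3 leaders.
const leaderRows = rollUpByLeader([
  { direction: "down", leaderChain: { elt: "Leader E", l3: "Leader A" } },
  { direction: "unchanged", leaderChain: { elt: "Leader E", l3: "Leader A" } },
  { direction: "up", leaderChain: { elt: "Leader E", l3: "Leader B" } },
  { direction: "incomplete", leaderChain: { elt: "Leader E", l3: "Leader B" } },
]);
```

Here Leader E is credited with three complete pairs, Leader A with two, and Leader B with one; the incomplete pair counts for nobody. Percentages follow by dividing each count by the row's totalRated.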
Visual — Tier B (step flow). raw {pre, post} (string|number|null) → parseRating → classifyDeltaDirection (up/down/unchanged/incomplete) → buildCalibrationRollups → { cycle summaries · ELT/L3/L4/L5 leader rollups }.
4. What does it enable?
Concrete uses a practitioner would recognize:
- Quantify what calibration changed: report pctUnchanged as the share of ratings the room left alone, and pctBumpedUp / pctBumpedDown as the share it moved, for one cycle or several at once.
- Spot leniency or harshness by leader: the ELT/L3/L4/L5 rollups surface which leaders' draft ratings the room systematically adjusted (pctChanged per leader), and in which direction (pctBumpedUp / pctBumpedDown per leader).
- Track an employee's calibration history: the trajectory read returns the trailing N cycles of pre/post/delta for one person, with the ELT + L3 leader attributed at each cycle.
- Audit a cycle after the fact: because every {pre → post} pair is persisted, a cycle can be re-examined cycles later without the original workbook.
- Feed downstream validity / fairness analysis: the cycle and trajectory reads are the inputs a validity spoke uses to ask whether bumps persist or wash out next cycle.
- Detect drift over time: comparing pctUnchanged across cycles shows whether calibration is converging (less movement needed) or the drafts are drifting.
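The trajectory read behind the employee-history use case keeps the newest row per cycle and then the trailing N cycles. A minimal sketch of that two-step shaping follows; it assumes capturedAt is an ISO-8601 timestamp used both to pick the newest row and to order cycles, which the source does not confirm. `TrajectoryRow` and `trailingCycles` are hypothetical names, and the sample rows are invented.

```typescript
interface TrajectoryRow {
  cycleId: string;
  capturedAt: string; // assumption: ISO-8601 timestamp used for recency
  pre: number | null;
  post: number | null;
}

function trailingCycles(rows: TrajectoryRow[], lastNCycles = 6): TrajectoryRow[] {
  // Step 1: keep only the newest row for each cycle.
  const newest = new Map<string, TrajectoryRow>();
  for (const row of rows) {
    const seen = newest.get(row.cycleId);
    if (!seen || Date.parse(row.capturedAt) > Date.parse(seen.capturedAt)) {
      newest.set(row.cycleId, row);
    }
  }
  // Step 2: order cycles oldest to newest, then keep the last N.
  const ordered = Array.from(newest.values()).sort(
    (a, b) => Date.parse(a.capturedAt) - Date.parse(b.capturedAt),
  );
  // slice(-0) would return everything, so zero is handled explicitly.
  return lastNCycles > 0 ? ordered.slice(-lastNCycles) : [];
}

const history: TrajectoryRow[] = [
  { cycleId: "FY24-H1", capturedAt: "2024-07-01T00:00:00Z", pre: 3, post: 3 },
  { cycleId: "FY24-H2", capturedAt: "2025-01-10T00:00:00Z", pre: 3, post: 4 },
  { cycleId: "FY24-H2", capturedAt: "2025-01-15T00:00:00Z", pre: 3, post: 3 }, // later correction
  { cycleId: "FY25-H1", capturedAt: "2025-07-01T00:00:00Z", pre: 3, post: 4 },
];

const lastTwo = trailingCycles(history, 2);
// Two rows: FY24-H2 (the corrected row, post = 3), then FY25-H1.
```

Deduplicating before truncating matters: truncating first could spend one of the N slots on a superseded row for the same cycle.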
Visual — (TBD — a rendered per-leader up/down/unchanged stacked-bar across one cycle's leaders).
5. How it fits in the toolbox
Data flow:
- Consumes: uploaded HRIS/calibration data, namely pre/post rating pairs and the leader chain, ingested per tenant via POST /pairs. Ratings normalize through the shared src/lib/rating-codec/parseRating codec (cross-cutting, not a spoke). No external data sources (no BLS/O*NET/Principia); the inputs are tenant calibration submissions.
- Emits: two contracts that consumers vendor from src/spokes/performance-calibration/contracts/types.ts, CycleQueryResponse (cycle summaries + leader rollups) and EmployeeTrajectoryResponse (per-employee time series).
- Feeds: the registered consumers are performix and vela, both planned (per registry.ts). The README names a future performance-validity consumer that reads cycle + trajectory to quantify downstream effects of calibration.
- Adjacency: leader attribution today reads flattened ELT/L3/L4/L5 columns off each pair; a noted TODO is to compose with org-graph ancestor collapse so leader coverage derives from persisted graph edges rather than denormalized chain columns. That cross-spoke link is (TBD: org-graph rollup integration).
Visual — Tier B (typographic data-flow). tenant calibration pairs + leaderChain → POST /pairs (parseRating) → persisted {tenant × employee × cycle} → { GET /summary · GET /trajectory } → performix · vela · (future) performance-validity.
6. Commercialization / packaging
Performance Calibration is a service component, not a standalone product — it is the calibration-impact analytics leg behind a performance-management or talent-review surface, consumed over HTTP + MCP by buyer-facing tools rather than sold on its own.
- Data-license posture: the inputs are the customer's own calibration ratings and leader chains; there is no vendor-survey or licensed third-party data in the pipeline, so no external data-license constraint attaches to the analytics. Tenant data stays scoped by tenantId; team-level rollups are subject to the toolbox's standard min-N gate handled elsewhere in the stack.
- Anything about pricing tiers or packaged offerings is (TBD): not earned yet, so not stated.
Visual — (TBD — product-tier placement diagram).
7. The vision
A durable, auditable record of what every calibration room actually changed — movement quantified per cycle and attributed cleanly down the real org hierarchy, feeding the validity and fairness questions that come after.
The near-term direction visible in the code: replace the flattened leader-chain columns with org-graph-derived leader coverage (the integrate-org-graph-rollup TODO in core/rollup.ts), and harden rating normalization as the shared parseRating codec graduates from its current stub toward the full PAT-152 rating ladder. The downstream arc is wiring the planned performance-validity consumer so calibration movement can be tested for persistence — do this cycle's bumps survive to next cycle, or wash out.
Visual — (TBD — the calibration → validity loop as a multi-cycle persistence diagram).
8. Current status
Grounded in the real code state (contract 1.0.0, src/spokes/performance-calibration/, registry status: "live", PAT-159, 2026-05-24):
- Shipped: Postgres performance_calibration schema with heartbeat + calibration_pairs tables; the delta-classification + rollup engine (core/rollup.ts); idempotent ingest (POST /pairs, service-key gated); cycle summary read (GET /summary) returning cycle stats + ELT/L3/L4/L5 leader rollups; employee trajectory read (GET /trajectory); GET /health. MCP module registered under performance-calibration.* (.health, .pairs.ingest, .cycles.summary, .employees.trajectory).
- In flight / planned: parseRating is a stub today (full rating ladder lands with PAT-152); org-graph-derived leader coverage is a noted TODO, not yet wired; the performix / vela / performance-validity consumers are all planned, not connected.
Visual — Tier A (live capture). GET /api/spokes/performance-calibration/health reports the schema reachability + contract version at request time.
The worked example used above is the spoke's own committed fixture (integration/performix/fixtures/sample-response.json) for cycle FY25-H1: of 48 employees with both ratings parsed, 33 held (pctUnchanged 0.6875), 6 moved up (0.125), and 9 moved down (0.1875). The L3 leader Alex Ng's 12 rated reports moved downward about a third more often than the cycle average (pctBumpedDown 0.25 vs. cycle 0.1875), the exact harsh-draft signal the leader rollup exists to surface. No figure here is invented.
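The leader-versus-cycle comparison in the worked example reduces to one division per direction. The sketch below uses only the fixture values quoted in this document; `relativeRate` is a hypothetical helper, not part of the spoke.

```typescript
// Values copied from the FY25-H1 fixture quoted in this document.
const cycle = { pctBumpedUp: 0.125, pctBumpedDown: 0.1875 };
const alexNg = { layer: "l3", totalRated: 12, pctBumpedUp: 0.08, pctBumpedDown: 0.25 };

// Hypothetical helper: how a leader's rate compares with the cycle-wide rate.
// Returns null when the cycle rate is zero, where a ratio is undefined.
function relativeRate(leaderPct: number, cyclePct: number): number | null {
  return cyclePct > 0 ? leaderPct / cyclePct : null;
}

const downRatio = relativeRate(alexNg.pctBumpedDown, cycle.pctBumpedDown); // 0.25 / 0.1875 ≈ 1.33
const upRatio = relativeRate(alexNg.pctBumpedUp, cycle.pctBumpedUp);       // 0.08 / 0.125 = 0.64

// Head count behind the leader's downward share: 0.25 × 12 = 3 people.
const movedDownCount = Math.round(alexNg.pctBumpedDown * alexNg.totalRated);
```

With three people behind the downward share, a ratio near 1.33 from a single cycle is a weak signal on its own; it becomes meaningful when the same direction recurs across cycles.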