Performance Validity — plain-language explainer

Performance Validity treats a performance rating as a measurement instrument and asks the questions every rating cycle dodges: is this number reliable, does it predict anything, and does anyone agree on it?

A People Analytics Toolbox component. Built to the portfolio Explainer Standard v1.0. Every claim below is grounded in the spoke's own code and contracts (src/spokes/performance-validity/, contract 1.2.0); anything not yet built is marked (TBD).

1. What is it?

Performance Validity is a panel of psychometric diagnostics for rating instruments — it takes performance ratings (and their cousins: calibration scores, multi-rater allocations, longitudinal cycle histories) and runs the same checks a test publisher runs on a published exam.

The framing is the whole point: a performance rating is a one-item measurement, so it can be interrogated like one. Performance Validity asks, of any rating instrument: does it agree with itself over time (reliability), does it agree with operational reality and with other raters (convergence), does it predict the outcomes it claims to (predictive validity), and how much of the variance in ratings is actually explained by anything measurable at all.

It is mostly stateless: analyzers take data in the request body and return a scored, narrated result. The one exception is the MEI predictive-evidence sink (table mei_predictive_evidence, described in §5).

Visual — Tier A (live API capture). The real fixture request → response for the validity scorecard, captured from the spoke's own integration fixtures (integration/performix/fixtures/):

POST /api/spokes/performance-validity/validity-scorecard
{
  "tenantId": "fixture-tenant",
  "testRetestObservations": [
    { "subjectId": "e1", "score": 3.1, "occasion": "t1" },
    { "subjectId": "e1", "score": 3.3, "occasion": "t2" },
    { "subjectId": "e2", "score": 2.9, "occasion": "t1" },
    { "subjectId": "e2", "score": 3.0, "occasion": "t2" },
    { "subjectId": "e3", "score": 4.0, "occasion": "t1" },
    { "subjectId": "e3", "score": 3.85, "occasion": "t2" }
  ],
  "convergentFocal":     [{ "subjectId": "c1", "score": 0.82 }, ...],
  "convergentComparator":[{ "subjectId": "c1", "score": 0.79 }, ...]
}
→ {
    "reliability": {
      "testRetest": { "coefficient": 0.983, "interpretation": "excellent", "n": 3 },
      "crossCycleStability": { "coefficient": 0, "interpretation": "No cross-cycle panel supplied.", "n": 0 }
    },
    "validity": {
      "convergent":  { "coefficient": 0, "pValue": 1, "n": 0, "interpretation": "convergent: insufficient paired cohort." },
      "predictive":  { "coefficient": 0, "pValue": 1, "n": 0, "interpretation": "Skipped — focal or comparator payload empty." },
      "discriminant":{ "coefficient": 0, "pValue": 1, "n": 0, "interpretation": "Skipped — focal or comparator payload empty." }
    },
    "overallScorecard": "questionable",
    "narration": "Reliability facets: test–retest (3 subjects) · cross-cycle (0). Validity joins: convergent n=0, predictive n=0, discriminant n=0. Composite heuristic: questionable."
  }

(Real fixture pair shipped in the spoke. Note the honesty: a strong test–retest of 0.983 does not earn a passing scorecard when the validity joins have n=0 — the composite stays "questionable" and the narration says why.)
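The test–retest coefficient in the capture can be checked by hand. The following is a minimal TypeScript sketch, assuming the facet is a plain Pearson correlation over each subject's t1/t2 pair. The Observation type and the testRetest function are illustrative names, not the spoke's API, and the real analyzer may treat missing occasions or small n differently.

```typescript
// Illustrative shape mirroring the fixture's testRetestObservations entries.
interface Observation {
  subjectId: string;
  score: number;
  occasion: string;
}

// Pearson correlation across subjects that have both a t1 and a t2 score.
function testRetest(observations: Observation[]): { coefficient: number; n: number } {
  const t1: Record<string, number> = {};
  const t2: Record<string, number> = {};
  for (const o of observations) {
    if (o.occasion === "t1") t1[o.subjectId] = o.score;
    if (o.occasion === "t2") t2[o.subjectId] = o.score;
  }
  // Keep only subjects observed on both occasions.
  const ids = Object.keys(t1).filter((id) => id in t2);
  const n = ids.length;
  if (n < 2) return { coefficient: 0, n };

  const xs = ids.map((id) => t1[id]);
  const ys = ids.map((id) => t2[id]);
  const meanX = xs.reduce((s, v) => s + v, 0) / n;
  const meanY = ys.reduce((s, v) => s + v, 0) / n;

  let sxy = 0;
  let sxx = 0;
  let syy = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    sxy += dx * dy;
    sxx += dx * dx;
    syy += dy * dy;
  }
  const denom = Math.sqrt(sxx * syy);
  // Assumption: a zero-variance column yields 0 rather than NaN.
  return { coefficient: denom === 0 ? 0 : sxy / denom, n };
}

// The six observations from the fixture request above.
const fixture: Observation[] = [
  { subjectId: "e1", score: 3.1, occasion: "t1" },
  { subjectId: "e1", score: 3.3, occasion: "t2" },
  { subjectId: "e2", score: 2.9, occasion: "t1" },
  { subjectId: "e2", score: 3.0, occasion: "t2" },
  { subjectId: "e3", score: 4.0, occasion: "t1" },
  { subjectId: "e3", score: 3.85, occasion: "t2" },
];

const retest = testRetest(fixture);
console.log(retest.coefficient.toFixed(3), retest.n); // 0.983 3
```

Under that assumption the sketch reproduces the captured coefficient of 0.983 on n = 3. With only three subjects, the estimate is best read as a fixture check rather than evidence about a real instrument.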

2. What problem does it solve — and why is it different?

The pain it removes: performance ratings are trusted as data without ever being tested as instruments. They drive pay, promotion, and exits, but almost no organization checks whether the rating from manager A means the same thing as the rating from manager B, whether last cycle's rating predicts this cycle's outcome, or whether the calibration meeting actually improved anything.

The difference, stated as a shift:

FROM a rating treated as ground truth — a number you act on without ever asking what it measures.
TO a rating treated as a measurement with known reliability, convergence, predictive power, and a defensible scorecard — the same discipline a published assessment is held to.

How it differs from the obvious substitutes:

vs. doing it by hand — the statistics here (test–retest reliability, inter-rater agreement, convergent/predictive/discriminant correlations with p-values, stacked OLS variance decomposition, regression-to-mean detection) are textbook psychometrics, but assembling them per cycle in a spreadsheet is error-prone and rarely reproducible. Performance Validity is a callable service with shared inference primitives.
vs. generic BI — a dashboard shows you the distribution of ratings; it does not tell you whether the instrument that produced them is valid. Performance Validity refuses to launder a high reliability coefficient into a passing scorecard when the validity evidence is thin: in the §1 capture, a test–retest of 0.983 still yields "questionable" because every validity join has n=0.

Visual — Tier B (FROM→TO typographic block). The shift above is the visual; a rendered comparison block is a follow-up (FU-A).

3. How does it work?

Inputs → method → outputs, organized around the six diagnostic questions the spoke is built to answer (Q1–Q6 in the contract), plus two composable alignment primitives (described in §5 and worked through in §9).

Q1 — Rating-ops convergence (POST /rating-ops-convergence): given employee ratings + operational metrics, returns the Pearson correlation, a p-value, a strength band (negligible | weak | moderate | strong), and a ranked diagnostic chain of failure modes — why ratings and ops might diverge.
Q2 — Validity scorecard (POST /validity-scorecard): the panel in the worked example. Reliability facets (test–retest, optional inter-rater ICC, cross-cycle stability) plus convergent / predictive / discriminant validity, rolled to an overall verdict (poor → excellent).
Q3 — Predictability decomposition (POST /predictability-decomposition): stacked additive OLS — feed ordered blocks of covariates and get each block's incremental R² (ΔR²), the total explained, and the unexplained residual; optionally bounded by a reliability ceiling (ρ²) so you never claim to explain more variance than the instrument can reliably carry.
Q4 — Trajectory analysis (POST /trajectory-analysis): per-employee longitudinal slope, intercept, variance around the line, slope p-value, and a regression-to-mean signal against the population (or pooled) mean.
Q5 — Calibration value-add (POST /calibration-value-add): a five-test panel (validity-add, reliability-add, convergence-add, direction analysis, behavioral impact) that asks whether a calibration cycle improved the ratings or just reshuffled noise — verdict calibration-adds-value | calibration-neutral | calibration-noise.
Q6 — Change attribution (POST /change-attribution): ranked explanatory factors for an individual's rating delta between two cycles, with an unexplained share.
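The bookkeeping behind Q3 can be sketched in a few lines of TypeScript. This is not the spoke's implementation. It assumes the cumulative R² after each ordered block has already been fitted elsewhere, and that the reliability ceiling simply caps the total that may be reported as explained. The type names, the decompose function, and the sample values are all hypothetical.

```typescript
// Illustrative types; the spoke's actual contract lives in contracts/types.ts.
interface BlockFit {
  block: string;
  // R² of the model containing this block and every earlier block.
  cumulativeR2: number;
}

interface Decomposition {
  increments: { block: string; deltaR2: number }[];
  totalExplained: number;
  unexplained: number;
  explainableRemainder: number;
}

function decompose(fits: BlockFit[], reliabilityCeiling: number = 1): Decomposition {
  let previous = 0;
  // Each block's increment is the gain in R² over the blocks entered before it.
  const increments = fits.map((fit) => {
    const deltaR2 = fit.cumulativeR2 - previous;
    previous = fit.cumulativeR2;
    return { block: fit.block, deltaR2 };
  });
  // Assumption: the ceiling caps the total that may be reported as explained.
  const totalExplained = Math.min(previous, reliabilityCeiling);
  return {
    increments,
    totalExplained,
    unexplained: 1 - totalExplained,
    // Variance the instrument could reliably carry that no block accounts for.
    explainableRemainder: reliabilityCeiling - totalExplained,
  };
}

// Made-up inputs for illustration only; not fixture output.
const decomposition = decompose(
  [
    { block: "tenure", cumulativeR2: 0.1 },
    { block: "manager", cumulativeR2: 0.25 },
    { block: "function", cumulativeR2: 0.3 },
  ],
  0.64,
);
console.log(decomposition);
```

Because the blocks are entered in order, each ΔR² depends on what came before it. Reordering the blocks changes the individual increments but not the total.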

Data sources / science backing. Everything runs on toolbox-internal inference libraries — @/lib/measurement-theory, @/lib/inference, @/lib/diagnostic-chain, @/lib/rating-codec — not on external feeds. The inputs are the caller's own HRIS / ratings / ops data supplied in the request body. Q5/Q6 additionally pull calibration rollups from the sibling performance-calibration spoke over published HTTP GET routes (never a cross-spoke core/ import). The science is classical test theory: reliability as agreement-with-self, validity as the convergent/predictive/discriminant triad, variance decomposition via OLS.

Differentiation beat: the practitioner's real question is not "what's the correlation" — it's "can I trust this rating enough to make a pay or promotion decision on it?" The scorecard answers exactly that, and refuses to pass an instrument on reliability alone.

Visual — Tier B (step flow). ratings + ops/cycle data → [Q1 convergence · Q2 scorecard · Q3 decomposition · Q4 trajectory · Q5 calibration value-add · Q6 attribution] → scored verdict + narration.

4. What does it enable?

Concrete uses a practitioner would recognize:

Audit a rating instrument before you trust it — run the Q2 scorecard on a cycle's ratings and get a defensible reliability/validity verdict, not a vibe.
Prove (or disprove) that calibration earns its meeting time — Q5 returns calibration-adds-value vs. calibration-noise from the actual pre/post snapshots.
Show how little of a rating is explained — Q3's stacked ΔR² makes visible how much rating variance is tenure, manager, function… and how much is unexplained, bounded by what the instrument could ever reliably carry.
Catch regression-to-mean before it's mistaken for a story — Q4 flags employees whose "improvement" or "decline" is just statistical reversion.
Measure whether ratings track operational reality — Q1's rating-ops convergence with a ranked failure-mode chain.
Score multi-rater agreement and three-way (up/down/lateral) alignment — the two PA Instruments below, feeding the Leadership Index.
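To make the regression-to-mean use concrete, here is a hedged TypeScript sketch of a per-employee trajectory fit. It covers only the ordinary least-squares slope, intercept, and residual variance; the slope p-value is omitted. The movesTowardMean check is a hypothetical heuristic written for this example, not the spoke's regression-to-mean signal, and all names and values are illustrative.

```typescript
interface CyclePoint {
  cycle: number;
  rating: number;
}

interface TrajectoryFit {
  slope: number;
  intercept: number;
  residualVariance: number;
}

// Ordinary least-squares line through one employee's ratings over cycles.
function fitTrajectory(points: CyclePoint[]): TrajectoryFit {
  const n = points.length;
  if (n < 2) throw new Error("need at least two cycles");
  const meanX = points.reduce((s, p) => s + p.cycle, 0) / n;
  const meanY = points.reduce((s, p) => s + p.rating, 0) / n;
  let sxx = 0;
  let sxy = 0;
  for (const p of points) {
    sxx += (p.cycle - meanX) * (p.cycle - meanX);
    sxy += (p.cycle - meanX) * (p.rating - meanY);
  }
  if (sxx === 0) throw new Error("cycles must not all be identical");
  const slope = sxy / sxx;
  const intercept = meanY - slope * meanX;
  let sse = 0;
  for (const p of points) {
    const residual = p.rating - (intercept + slope * p.cycle);
    sse += residual * residual;
  }
  // Unbiased residual variance uses n - 2 degrees of freedom; 0 when n = 2.
  const residualVariance = n > 2 ? sse / (n - 2) : 0;
  return { slope, intercept, residualVariance };
}

// Hypothetical heuristic: the first rating sits on one side of the population
// mean and the fitted slope points back toward that mean.
function movesTowardMean(points: CyclePoint[], populationMean: number): boolean {
  const startGap = points[0].rating - populationMean;
  return startGap * fitTrajectory(points).slope < 0;
}

// Made-up history for illustration: a high starter drifting down.
const history: CyclePoint[] = [
  { cycle: 1, rating: 4.5 },
  { cycle: 2, rating: 4.2 },
  { cycle: 3, rating: 3.9 },
  { cycle: 4, rating: 3.6 },
];
const trajectory = fitTrajectory(history);
const reverting = movesTowardMean(history, 3.5);
console.log(trajectory.slope.toFixed(2), trajectory.intercept.toFixed(2), reverting);
```

A decline like this one, from well above the population mean back toward it, is the kind of pattern that can look like a performance story when it may be plain reversion. A real signal would also need to weigh the slope's uncertainty.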

Visual — Tier A (live capture of the alignment primitive). See the worked example in §9.

5. How it fits in the toolbox

Data flow:

Consumes — the caller's ratings / ops / cycle data (request-body, stateless) and, for Q5/Q6, calibration rollups fetched over HTTP from performance-calibration. Depends on toolbox libraries only (measurement-theory, inference, diagnostic-chain, rating-codec).
Emits — six analyzer contracts plus the alignment results. Consumers vendor src/spokes/performance-validity/contracts/types.ts.
Feeds — manager-effectiveness: the directional-alignment PA Instrument is consumed by the Leadership Index Product over HTTP, and the mei-predictive-evidence UPSERT route persists per-domain predictive r² rows (the one stateful surface — table mei_predictive_evidence) that recalibrate MEI's empirical weights.
PA Instruments — POST /alignment and POST /directional-alignment are tagged PA Instrument in the registry (composable measurement primitives; catalog docs/primitives/00-CATALOG.md). They are org-graph-agnostic: the caller supplies cohort membership.

Visual — Tier B (typographic data-flow). ratings + ops/cycle data → Performance Validity → { scorecards/verdicts · directional-alignment → Leadership Index · predictive-evidence → MEI weights }, with performance-calibration HTTP rollups as the upstream source for Q5/Q6.

6. Commercialization / packaging

Performance Validity is a service component, not a standalone product — it is the measurement-quality layer that sits behind compensation, calibration, and leadership surfaces rather than being sold on its own.

Data-license posture: the spoke ships no external licensed data; it operates entirely on the caller's own ratings/ops/cycle inputs and toolbox-internal statistics, so there are no third-party data-license constraints to inherit.
Access: POST analyzers require the standard toolbox service-key header; GET /health is public. MCP tools are registered and discoverable under slug performance-validity.
Anything about pricing tiers or packaged offerings is (TBD) — not earned yet, so not stated.

Visual — (TBD — product-tier placement diagram).

7. The vision

Every performance rating, calibration score, and multi-rater allocation in the toolbox carries a known measurement quality — reliable or not, predictive or not, agreed-upon or not — so no high-stakes decision rests on an untested number.

The near-term direction is grounded in the spoke's own follow-on notes: tightening the Q6 attribution heuristics once worker-resolution + org-graph exports land to feed role/ops milestones, and optional audited-run persistence (analysis_runs) if a consumer asks for replayable diagnostics independent of caller logs.

Visual — (TBD — coverage map of which toolbox instruments have a quality verdict).

8. Current status

Grounded in the real code state (contract 1.2.0, status: "live" in src/lib/contracts/registry.ts):

Shipped: all six Q-analyzers (rating-ops-convergence, validity-scorecard, predictability-decomposition, trajectory-analysis, calibration-value-add, change-attribution), the two PA Instruments (alignment, directional-alignment), the mei-predictive-evidence UPSERT bridge, GET /health, and MCP registration. Schema performance_validity holds the heartbeat + mei_predictive_evidence table; all analytics are otherwise stateless.
In flight / planned: tighter Q6 attribution heuristics pending worker-resolution/org-graph feeds; optional analysis_runs audited-replay persistence; Performix's consumer entry in the registry is still marked planned.

Visual — Tier A (live capture). GET /api/spokes/performance-validity/health reports the live performance_validity heartbeat and contract version at request time.

9. Worked example (load-bearing) — multi-rater alignment

This is real output of the spoke's computeAlignment core (core/alignment.ts), the inverted-coefficient-of-variation primitive. Scenario (clearly labeled illustrative inputs, real computation): three executives each allocate 100 points across three comp levers for one role, and we ask how much they agree, lever by lever.

Input — one value per rater per item:

retention-bonus: 40, 35, 45
base-merit: 30, 10, 50
equity-refresh: 30, 55, 5

Computed result (verified against the core logic: population sd, cv = sd/mean, alignmentScore = round((1 − min(cv/1.5, 1)) × 100)):

retention-bonus — mean 40, sd 4.082, cv 0.102, alignmentScore 93 (near-consensus)
base-merit — mean 30, sd 16.33, cv 0.544, alignmentScore 64 (real disagreement)
equity-refresh — mean 30, sd 20.412, cv 0.68, alignmentScore 55 (the most contested lever)
overall — alignmentScore 70.667, meanCv 0.442, raterCount 3
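The quoted formula is short enough to reproduce. The TypeScript sketch below re-derives the figures from the stated inputs. It is not the source of computeAlignment: the function and type names are illustrative, the zero-mean guard is an assumption, and averaging the rounded per-item scores for the overall figure is inferred from the published 70.667 (212 / 3).

```typescript
interface ItemAlignment {
  item: string;
  mean: number;
  sd: number;
  cv: number;
  alignmentScore: number;
}

// Inverted coefficient of variation for one item, one value per rater.
function itemAlignment(item: string, values: number[]): ItemAlignment {
  const n = values.length;
  const mean = values.reduce((s, v) => s + v, 0) / n;
  // Population variance: divide by n, not n - 1.
  const variance = values.reduce((s, v) => s + (v - mean) * (v - mean), 0) / n;
  const sd = Math.sqrt(variance);
  // Assumption: a zero mean is treated as cv = 0 to avoid dividing by zero.
  const cv = mean === 0 ? 0 : sd / mean;
  const alignmentScore = Math.round((1 - Math.min(cv / 1.5, 1)) * 100);
  return { item, mean, sd, cv, alignmentScore };
}

const allocations: Record<string, number[]> = {
  "retention-bonus": [40, 35, 45],
  "base-merit": [30, 10, 50],
  "equity-refresh": [30, 55, 5],
};

const items = Object.keys(allocations).map((k) => itemAlignment(k, allocations[k]));
// Inferred: the overall score averages the rounded per-item scores.
const overallScore = items.reduce((s, i) => s + i.alignmentScore, 0) / items.length;
const meanCv = items.reduce((s, i) => s + i.cv, 0) / items.length;

for (const i of items) {
  console.log(i.item, i.mean, i.sd.toFixed(3), i.cv.toFixed(3), i.alignmentScore);
}
console.log(overallScore.toFixed(3), meanCv.toFixed(3)); // 70.667 0.442
```

The divisor 1.5 sets where the score bottoms out: any item with a cv of 1.5 or more scores 0, so the scale distinguishes degrees of disagreement only below that point.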

What a practitioner does with it: the executives broadly agree on retention bonuses (93) but are split on how to weight equity refreshes (55). That divergence number is the actionable output: it points the calibration conversation at the lever where agreement is weakest, instead of relitigating the one they already share. When the caller supplies up/down/lateral cohort membership, the same per-item alignment scores roll up through directional-alignment into the Leadership Index.

Every number in §1 is real captured fixture output; every number in §9 is the verified result of the spoke's own computeAlignment core on the stated inputs. No figure here is invented.