Program Evaluation — plain-language explainer

Program Evaluation answers "did this program actually work?" with a defensible causal estimate — and locks down what you said you'd measure before you see the result.

A People Analytics Toolbox component. Built to the portfolio Explainer Standard v1.0. Every claim below is grounded in the spoke's own code and contracts (src/spokes/program-evaluation/, contract 0.2.0); anything not yet built is marked (TBD).

1. What is it?

Program Evaluation is a causal-experimentation engine for people programs. You declare a hypothesis, register the analysis plan before the data comes in, and randomly assign people to a control arm and one or more treatment arms. After the program runs, you get back a lift estimate with a confidence interval rather than a hand-waved "it seemed to help."

It owns five concrete jobs, each backed by real code: hypothesis + arm capture, tamper-evident pre-registration, deterministic treatment assignment, power planning, and persisted lift estimation.

Visual — Tier B (the experiment lifecycle as a step flow).

declare experiment + arms → pre-register the plan (SHA-256 freeze) → assign principals deterministically → run the program → compute lift (point estimate + bootstrap CI) → persist for audit

2. What problem does it solve — and why is it different?

The pain it removes: most "we ran a program" evaluations are decided after the fact, by whoever frames the numbers most favorably — outcomes get cherry-picked, the comparison group is assembled retroactively, and there's no record of what success was supposed to look like.

The difference, stated as a shift:

FROM a post-hoc story told over a dashboard, with the success criterion quietly chosen to match the result.
TO a frozen plan (hypothesis + analysis plan + primary-outcome definition, hashed to a 64-character SHA-256 digest) that cannot be silently rewritten after results land, plus a randomized assignment that's reproducible from a key rather than a coin flip nobody can re-run.

How it differs from the obvious substitutes:

vs. doing it by hand in a spreadsheet — the spoke enforces the discipline: treatment-arm fractions must sum to 1 (validated in ExperimentCreateRequestSchema), the pre-registration is immutable (a rewrite attempt returns HTTP 409), and a lift computation is refused until the pre-registration digest verifies. You cannot accidentally skip the rigor.
vs. a generic A/B product — those are built for web traffic, not workforce programs. Here a treatment arm's audience is an opaque segmentation handle (audienceFilter), so cohorts come from segmentation-studio and published aggregates pass through data-anonymizer's min-N gate before leaving a tenant-safe context.
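
The fraction rule named above can be illustrated with a small check. This is a sketch only, not the spoke's ExperimentCreateRequestSchema: the ArmInput shape, the validateArmFractions name, the two-arm minimum as coded here, and the 1e-9 tolerance are all assumptions made for illustration.

```typescript
// Hypothetical arm shape; only randomizationFraction is named in the explainer.
interface ArmInput {
  name: string;
  randomizationFraction: number;
}

// Returns a list of problems; an empty list means the arms pass this sketch's checks.
function validateArmFractions(arms: ArmInput[], tolerance = 1e-9): string[] {
  const problems: string[] = [];
  if (arms.length < 2) {
    problems.push("at least two arms are required");
  }
  for (const arm of arms) {
    const f = arm.randomizationFraction;
    if (!Number.isFinite(f) || f <= 0 || f > 1) {
      problems.push(`arm "${arm.name}" has a fraction outside (0, 1]`);
    }
  }
  const total = arms.reduce((sum, arm) => sum + arm.randomizationFraction, 0);
  // Compare with a tolerance (an assumption): sums of decimal fractions are
  // rarely exact in floating point.
  if (Math.abs(total - 1) > tolerance) {
    problems.push(`fractions sum to ${total}, expected 1`);
  }
  return problems;
}
```

A 0.5 / 0.5 split returns an empty list; a 0.5 / 0.4 split returns one problem describing the sum.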

Visual — Tier B (FROM→TO typographic block). The shift above is the visual; a rendered comparison block is a follow-up (FU-A).

3. How does it work?

Inputs → method → outputs, concretely:

Input: an experiment declaration — a hypothesis string, a primary outcome id, optional secondary outcomes, a randomization strategy (simple | blocked | cluster | stratified), and two or more treatment arms whose randomizationFraction values must sum to 1.
Method — four real cores:
1. Pre-registration (core/pre-registration.ts) — serializes the frozen bundle (hypothesis text, analysis-plan text, primary-outcome definition) with keys in lexicographic order and takes a SHA-256 digest. The same bundle always produces the same 64-hex digest; any later tampering fails verifyPreRegistrationDigest.
2. Assignment (core/randomization.ts) — a stable weighted draw via HMAC-SHA256 keyed by pepper \0 tenantId \0 experimentId, hashing the principalId. The first 32 bits of the digest become a uniform draw in (0,1), walked across the arms' cumulative normalized weights. Same person + same experiment → same arm, every time, no stored coin flip needed.
3. Power planning (core/power-analysis.ts): classic two-sided, two-sample sizing from Cohen's d, n_per_group = 2·(z_{1−α/2} + z_{1−β})² / d², where α is the significance level, 1−β is the target power, and the z quantiles come from simple-statistics' probit.
4. Lift estimation (core/lift-estimation.ts) — difference-in-means with a percentile bootstrap confidence interval (1,500 iterations by default) and a Welch large-sample two-sided p-value for reporting. The bootstrap RNG is a seeded mulberry32, so a provided seedKey makes the interval reproducible for audits.
Output: a persisted LiftEstimate — point estimate, confidence interval, optional p-value, actual sample size, plus an estimateId and computedAt timestamp.
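
The assignment step in the list above can be sketched as follows. This is not the code in core/randomization.ts: the Arm shape, the assignArm name, the big-endian read of the first 32 bits, and the +0.5 offset that keeps the draw strictly inside (0, 1) are assumptions. Only the key layout (pepper \0 tenantId \0 experimentId) and the HMAC-SHA256 over principalId come from the description.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical arm shape for this sketch.
interface Arm {
  armId: string;
  weight: number; // relative weight; normalized below
}

function assignArm(
  pepper: string,
  tenantId: string,
  experimentId: string,
  principalId: string,
  arms: Arm[],
): string {
  if (arms.length === 0) throw new Error("at least one arm is required");
  // Key layout as described: pepper \0 tenantId \0 experimentId.
  const key = [pepper, tenantId, experimentId].join("\0");
  const digest = createHmac("sha256", key).update(principalId).digest();
  // Assumption: read the first 32 bits big-endian and offset by 0.5 so the
  // draw can never be exactly 0 or 1.
  const draw = (digest.readUInt32BE(0) + 0.5) / 2 ** 32;
  const totalWeight = arms.reduce((sum, arm) => sum + arm.weight, 0);
  let cumulative = 0;
  for (const arm of arms) {
    cumulative += arm.weight / totalWeight;
    if (draw < cumulative) return arm.armId;
  }
  // Floating-point rounding can leave the final cumulative weight just under 1.
  return arms[arms.length - 1].armId;
}
```

Because the draw is a pure function of the key and the principalId, calling assignArm twice with the same arguments returns the same arm, which is the property the memoized assignments rely on.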

Science backing: randomized assignment + pre-registration + difference-in-means with bootstrap CIs are the standard causal-inference toolkit; the statistics primitives are the pinned simple-statistics library (7.8.9), not bespoke math. The spoke does not invent the outcomes themselves — the analysis-plan note in the demo seed says calculator wiring "remains calculus-side," i.e. metric values come from calculus.
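
A minimal sketch of difference-in-means with a percentile bootstrap interval, assuming a numeric seed. How the spoke turns a seedKey into a seed is not shown in this explainer, so the seed parameter, the bootstrapLift name, and the nearest-rank percentile rule are assumptions. The Welch p-value is left out.

```typescript
// mulberry32: a small seeded PRNG returning floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Draw xs.length values from xs with replacement.
function resample(xs: number[], rng: () => number): number[] {
  return xs.map(() => xs[Math.floor(rng() * xs.length)]);
}

// Nearest-rank percentile on an ascending array (an assumption of this sketch).
function percentile(sorted: number[], p: number): number {
  const idx = Math.min(sorted.length - 1, Math.max(0, Math.floor(p * sorted.length)));
  return sorted[idx];
}

function bootstrapLift(
  control: number[],
  variant: number[],
  seed: number,
  iterations = 1500,
  alpha = 0.05,
): { pointEstimate: number; confidenceInterval: [number, number] } {
  if (control.length === 0 || variant.length === 0) {
    throw new Error("both arms need at least one outcome");
  }
  const rng = mulberry32(seed);
  const diffs: number[] = [];
  for (let i = 0; i < iterations; i++) {
    diffs.push(mean(resample(variant, rng)) - mean(resample(control, rng)));
  }
  diffs.sort((x, y) => x - y);
  return {
    pointEstimate: mean(variant) - mean(control),
    confidenceInterval: [percentile(diffs, alpha / 2), percentile(diffs, 1 - alpha / 2)],
  };
}
```

With the same seed and the same outcome arrays, two calls return identical interval bounds, which is what makes a seeded bootstrap auditable.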

Differentiation beat: the practitioner's real question isn't "what's the lift" — it's "will this survive scrutiny?" The frozen digest + the refuse-to-compute-without-verification gate mean the answer is defensible by construction: you can prove what you committed to measuring before you saw a single result.
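
The freeze-then-verify idea can be sketched in a few lines. This is an illustration, not the code in core/pre-registration.ts: the FrozenBundle field names, the JSON serialization details, and the digestBundle, verifyDigest, and assertVerifiedBeforeLift names are assumptions.

```typescript
import { createHash } from "node:crypto";

// Hypothetical field names for the three frozen items the explainer lists.
interface FrozenBundle {
  hypothesis: string;
  analysisPlan: string;
  primaryOutcomeDefinition: string;
}

// Serialize with keys in lexicographic order so equal bundles always
// produce byte-identical JSON, whatever order the caller built them in.
function canonicalize(bundle: FrozenBundle): string {
  const record = bundle as unknown as Record<string, string>;
  const sorted: Record<string, string> = {};
  for (const key of Object.keys(record).sort()) {
    sorted[key] = record[key];
  }
  return JSON.stringify(sorted);
}

// SHA-256 over the canonical form, as 64 lowercase hex characters.
function digestBundle(bundle: FrozenBundle): string {
  return createHash("sha256").update(canonicalize(bundle), "utf8").digest("hex");
}

function verifyDigest(bundle: FrozenBundle, storedDigest: string): boolean {
  return digestBundle(bundle) === storedDigest;
}

// Gate in the spirit of "refuse to compute lift until the digest verifies".
function assertVerifiedBeforeLift(bundle: FrozenBundle, storedDigest: string): void {
  if (!verifyDigest(bundle, storedDigest)) {
    throw new Error("pre-registration digest mismatch; lift computation refused");
  }
}
```

Changing a single character of the hypothesis after the freeze changes the digest, so the gate throws instead of computing a lift against a rewritten plan.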

Visual — Tier A (real persisted lift shape). The demo seed (scripts/seed-program-evaluation-demo.ts) builds two deterministic outcome arrays and runs them through the real computeAndPersistLift. The arm means are fully determined (control 0.4118, n=120; variant 0.4844, n=118 — computed directly from the seed arrays), so the point estimate is real:

POST /api/spokes/program-evaluation/experiments/{id}/lift
→ {
    "estimate": {
      "experimentId": "<demo>",
      "armId": "<variant-arm-id>",
      "baselineArmId": "<control-arm-id>",
      "pointEstimate": 0.0727,
      "confidenceInterval": [ <bootstrap-low>, <bootstrap-high> ],
      "pValue": <welch-two-sided>,
      "sampleSizeActual": 238,
      "estimateId": "<uuid>",
      "computedAt": "<iso8601>"
    }
  }

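(pointEstimate 0.0727 and sampleSizeActual 238 = 120 + 118 are computed directly from the seed's deterministic outcome arrays. The four-decimal arm means quoted above differ by 0.0726; the last-digit gap is presumably rounding of the displayed means. The bootstrap CI and Welch p-value are real outputs but depend on the seed key, so their exact values are left as placeholders rather than invented.)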

4. What does it enable?

Concrete uses a practitioner would recognize:

Evaluate an onboarding nudge — exactly the demo scenario: control vs. variant on an accept-rate proxy, with a lift estimate instead of an anecdote.
Pre-register a manager-training rollout so HR and Legal can both see the success criterion was set before results, not after.
Size a pilot honestly — run power planning to learn the sample size needed to detect a given effect, before committing budget, rather than running an underpowered study that can never conclude.
Assign fairly and reproducibly — deterministic HMAC assignment means re-running the resolver reproduces the same arm for the same person, so audits and re-analyses agree.
Defend a result in a review — the persisted lift estimate carries its CI, p-value, and sample size, all tied to a verifiable pre-registration digest.
Stage a tamper-evident program-evaluation registry — the 409-on-rewrite behavior makes the record of "what we said we'd measure" immutable.
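
For the pilot-sizing use above, the two-sample formula n_per_group = 2·(z_alpha + z_power)² / d² can be worked through in a dependency-free sketch. The spoke is described as using simple-statistics' probit; here a bisection over an Abramowitz-Stegun erf approximation is swapped in so the block runs on its own. The function names and the default alpha and power values are assumptions.

```typescript
// Abramowitz-Stegun 7.1.26 approximation of erf (absolute error about 1.5e-7).
function erf(x: number): number {
  const sign = x < 0 ? -1 : 1;
  const ax = Math.abs(x);
  const t = 1 / (1 + 0.3275911 * ax);
  const poly =
    ((((1.061405429 * t - 1.453152027) * t + 1.421413741) * t - 0.284496736) * t +
      0.254829592) * t;
  return sign * (1 - poly * Math.exp(-ax * ax));
}

function normalCdf(z: number): number {
  return 0.5 * (1 + erf(z / Math.SQRT2));
}

// Inverse normal CDF by bisection; a stand-in for a library probit.
function probit(p: number): number {
  if (!(p > 0 && p < 1)) throw new RangeError("p must be strictly between 0 and 1");
  let lo = -10;
  let hi = 10;
  for (let i = 0; i < 100; i++) {
    const mid = (lo + hi) / 2;
    if (normalCdf(mid) < p) lo = mid;
    else hi = mid;
  }
  return (lo + hi) / 2;
}

// Per-group sample size for a two-sided, two-sample comparison at effect size d.
function sampleSizePerGroup(d: number, alpha = 0.05, power = 0.8): number {
  if (!Number.isFinite(d) || d === 0) throw new RangeError("d must be a nonzero finite number");
  const zAlpha = probit(1 - alpha / 2); // about 1.96 at alpha = 0.05
  const zPower = probit(power); // about 0.84 at power = 0.8
  return Math.ceil((2 * (zAlpha + zPower) ** 2) / (d * d));
}
```

At alpha = 0.05 and power = 0.8, a medium effect (d = 0.5) needs 63 people per arm and a small effect (d = 0.2) needs 393, which shows how quickly a pilot becomes underpowered as the expected effect shrinks.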

Visual — (TBD — a rendered control-vs-variant lift chart with the CI band for the demo experiment).

5. How it fits in the toolbox

Data flow:

Consumes — cohort handles from segmentation-studio (an arm's audienceFilter is an opaque segmentation handle, never a raw filter the spoke interprets); outcome metric values from calculus (the spoke records the primary-outcome id, not the calculation). Tenant context arrives via the PAT-N7 tenant bridge (organizationId / tenantId).
Emits — the program-evaluation contract types (Experiment, TreatmentArm, PreRegistration, LiftEstimate, and their request/response shapes). Consumers vendor src/spokes/program-evaluation/contracts/types.ts.
Feeds — data-anonymizer gates published lift aggregates against the toolbox min-N rule before they leave a tenant-safe context (named explicitly in the spoke README's cross-spoke boundary).
Boundary discipline — per AGENTS.md, it imports only @/spokes/*/contracts/types from siblings; it never reaches into another spoke's core/, db/, or routes.

Visual — Tier B (typographic data-flow). segmentation-studio cohort handle + calculus outcome metric → Program Evaluation (pre-register · assign · lift) → data-anonymizer min-N gate → published evaluation.

6. Commercialization / packaging

Program Evaluation is a service component, not a standalone product — it is the causal-rigor layer a people-analytics or program-management offering composes, sitting behind buyer-facing surfaces rather than being sold on its own.

Auth posture: write endpoints (experiment create, pre-register, assignment resolve, lift compute) require TOOLBOX_SERVICE_KEY (x-toolbox-service-key preferred) per PAT-11; the experiment-detail read is public.
Data-license posture: the spoke stores tenant-supplied program data and metric ids only — it ships no third-party licensed data, so there are no external-data constraints attached to it. Privacy of published aggregates is delegated to data-anonymizer.
Anything about pricing tiers or packaged offerings is (TBD) — not earned yet, so not stated.

Visual — (TBD — product-tier placement diagram).

7. The vision

Every people program — a nudge, a training, a policy change — evaluated the way a clinical trial is: a frozen plan, a fair assignment, and a lift you can defend, with the post-hoc storytelling designed out.

The near-term direction the code points to has three parts. First, the power-planning input still carries a projectedDurationDays "calendar placeholder until an arrival-rate model exists" (core/power-analysis.ts), so a real enrollment/arrival model is a natural next step. Second, richer randomization strategies: blocked, cluster, and stratified are typed in the contract, but simple is the only assignment path exercised today. Third, tighter wiring to calculus, so primary outcomes resolve through the toolbox's metric envelope rather than being passed as ids. Performix is the planned consumer (registry consumers: [{ name: "performix", status: "planned" }]).

Visual — (TBD — the experiment-to-evidence roadmap).

8. Current status

Grounded in the real code state (contract 0.2.0, src/spokes/program-evaluation/, registry status: "live"):

Shipped (PAT-D3-B, 2026-05-22): the program_evaluation schema (migration drizzle/0076_*) with heartbeat, experiments, treatment_arms, pre_registrations (SHA-256 digest), memoized assignments, and persisted lift_estimates; the four cores (pre-registration digest, HMAC weighted assignment, two-arm power planning, bootstrap lift); REST routes POST /experiments, GET /experiments/[id], POST /experiments/[id]/pre-register, POST /assignments/resolve, POST /experiments/[id]/lift, GET /health; MCP tools program-evaluation.*; health-aggregate rollup; the demo seed helper. Contract CONTRACT_VERSION = 0.2.0.
In flight / planned: an arrival-rate model to replace the duration placeholder in power planning; exercising the blocked / cluster / stratified strategies beyond the typed contract; the Performix consumer integration (planned, not yet wired).

Visual — Tier A (live capture). GET /api/spokes/program-evaluation/health reports the real spoke status ({ spoke, status, contractVersion: "0.2.0", schemaReachable, latencyMs, checkedAt }) at request time.

The worked example used above is the spoke's own demo seed (scripts/seed-program-evaluation-demo.ts): a two-arm onboarding-nudge experiment whose deterministic control (mean 0.4118, n=120) and variant (mean 0.4844, n=118) outcome arrays produce a real difference-in-means point estimate of 0.0727 over a total sample of 238, run through the production computeAndPersistLift core. The point estimate and sample size are computed directly from the seed; the bootstrap CI and p-value are genuine outputs whose exact values depend on the seed key and so are not pinned here. No figure is invented.