Program Evaluation — plain-language explainer
Program Evaluation answers "did this program actually work?" with a defensible causal estimate — and locks down what you said you'd measure before you see the result.
A People Analytics Toolbox component. Built to the portfolio Explainer Standard v1.0. Every claim below is grounded in the spoke's own code and contracts (src/spokes/program-evaluation/, contract 0.2.0); anything not yet built is marked (TBD).
1. What is it?
Program Evaluation is a causal-experimentation engine for people programs: you declare a hypothesis, register the analysis plan you commit to before the data comes in, randomly assign people to a control arm and one or more treatment arms, then — after the program runs — get back a lift estimate with a confidence interval rather than a hand-waved "it seemed to help."
It owns five concrete jobs, each backed by real code: hypothesis + arm capture, tamper-evident pre-registration, deterministic treatment assignment, power planning, and persisted lift estimation.
Visual — Tier B (the experiment lifecycle as a step flow).
declare experiment + arms → pre-register the plan (SHA-256 freeze) → assign principals deterministically → run the program → compute lift (point estimate + bootstrap CI) → persist for audit
2. What problem does it solve — and why is it different?
The pain it removes: most "we ran a program" evaluations are decided after the fact, by whoever frames the numbers most favorably — outcomes get cherry-picked, the comparison group is assembled retroactively, and there's no record of what success was supposed to look like.
The difference, stated as a shift:
- FROM a post-hoc story told over a dashboard, with the success criterion quietly chosen to match the result.
- TO a frozen plan (hypothesis + analysis plan + primary-outcome definition, hashed to a 64-character SHA-256 digest) that cannot be silently rewritten after results land, plus a randomized assignment that's reproducible from a key rather than a coin flip nobody can re-run.
How it differs from the obvious substitutes:
- vs. doing it by hand in a spreadsheet: the spoke enforces the discipline. Treatment-arm fractions must sum to 1 (validated in ExperimentCreateRequestSchema), the pre-registration is immutable (a rewrite attempt returns HTTP 409), and a lift computation is refused until the pre-registration digest verifies. You cannot accidentally skip the rigor.
- vs. a generic A/B product: those are built for web traffic, not workforce programs. Here a treatment arm's audience is an opaque segmentation handle (audienceFilter), so cohorts come from segmentation-studio and published aggregates pass through data-anonymizer's min-N gate before leaving a tenant-safe context.
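The sum-to-1 rule can be illustrated with a small standalone check. This is a minimal sketch, not the spoke's ExperimentCreateRequestSchema: the ArmInput shape, the name field, the error strings, and the floating-point tolerance are all assumptions made for illustration. Only randomizationFraction and audienceFilter are names taken from the text.

```typescript
// Hypothetical arm shape: only `randomizationFraction` and `audienceFilter`
// come from the explainer; `name` is illustrative.
interface ArmInput {
  name: string;
  randomizationFraction: number;
  audienceFilter?: string; // opaque segmentation handle, never interpreted here
}

// Assumed tolerance for floating-point sums; the real schema may use another rule.
const FRACTION_TOLERANCE = 1e-9;

// Returns a list of human-readable problems; an empty list means the arms pass.
function validateArmFractions(arms: ArmInput[]): string[] {
  const errors: string[] = [];
  if (arms.length < 2) {
    errors.push("at least two arms are required");
  }
  for (const arm of arms) {
    const f = arm.randomizationFraction;
    if (!(f > 0 && f <= 1)) {
      errors.push(`arm "${arm.name}" has a fraction outside (0, 1]`);
    }
  }
  const total = arms.reduce((sum, arm) => sum + arm.randomizationFraction, 0);
  if (Math.abs(total - 1) > FRACTION_TOLERANCE) {
    errors.push(`fractions sum to ${total}, expected 1`);
  }
  return errors;
}
```

Returning a list of problems rather than throwing on the first one lets a caller report every issue in a single response, which is one reasonable design among several.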
Visual — Tier B (FROM→TO typographic block). The shift above is the visual; a rendered comparison block is a follow-up (FU-A).
3. How does it work?
Inputs → method → outputs, concretely:
- Input: an experiment declaration, consisting of a hypothesis string, a primary outcome id, optional secondary outcomes, a randomization strategy (simple | blocked | cluster | stratified), and two or more treatment arms whose randomizationFraction values must sum to 1.
- Method, four real cores:
  - Pre-registration (core/pre-registration.ts): serializes the frozen bundle (hypothesis text, analysis-plan text, primary-outcome definition) with keys in lexicographic order and takes a SHA-256 digest. The same bundle always produces the same 64-hex digest; any later tampering fails verifyPreRegistrationDigest.
  - Assignment (core/randomization.ts): a stable weighted draw via HMAC-SHA256 keyed by pepper \0 tenantId \0 experimentId, hashing the principalId. The first 32 bits of the digest become a uniform draw in (0,1), walked across the arms' cumulative normalized weights. Same person + same experiment → same arm, every time, no stored coin flip needed.
  - Power planning (core/power-analysis.ts): classic two-sided two-sample sizing from Cohen's d: n_per_group = 2·(z_{1−α/2} + z_{1−β})² / d², where 1−β is the target power, via simple-statistics' probit.
  - Lift estimation (core/lift-estimation.ts): difference-in-means with a percentile bootstrap confidence interval (1,500 iterations by default) and a Welch large-sample two-sided p-value for reporting. The bootstrap RNG is a seeded mulberry32, so a provided seedKey makes the interval reproducible for audits.
- Output: a persisted LiftEstimate, carrying the point estimate, confidence interval, optional p-value, and actual sample size, plus an estimateId and computedAt timestamp.
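The pre-registration freeze described in the method list can be sketched as follows. This is an illustration under stated assumptions, not the contents of core/pre-registration.ts: the FrozenBundle field names, the use of JSON.stringify as the serializer, and the verifyDigest signature are all hypothetical. It only demonstrates the technique the text names, which is sorting keys lexicographically before hashing with SHA-256.

```typescript
import { createHash } from "node:crypto";

// Hypothetical bundle shape: the three parts mirror the explainer,
// but the field names are assumed.
interface FrozenBundle {
  hypothesis: string;
  analysisPlan: string;
  primaryOutcomeDefinition: string;
}

// Serialize with keys in lexicographic order so that key insertion order
// cannot change the digest.
function canonicalize(bundle: FrozenBundle): string {
  const source = bundle as unknown as Record<string, string>;
  const ordered: Record<string, string> = {};
  for (const key of Object.keys(source).sort()) {
    ordered[key] = source[key];
  }
  return JSON.stringify(ordered);
}

// SHA-256 over the canonical string, rendered as 64 lowercase hex characters.
function digestBundle(bundle: FrozenBundle): string {
  return createHash("sha256").update(canonicalize(bundle), "utf8").digest("hex");
}

// Illustrative stand-in for verifyPreRegistrationDigest; the real
// signature is not shown in the explainer.
function verifyDigest(bundle: FrozenBundle, storedDigest: string): boolean {
  return digestBundle(bundle) === storedDigest;
}
```

Because the digest is computed from the canonical form, two bundles with the same content but different key order verify against the same stored digest, while any edit to the hypothesis, plan, or outcome definition does not.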
Science backing: randomized assignment + pre-registration + difference-in-means with bootstrap CIs are the standard causal-inference toolkit; the statistics primitives are the pinned simple-statistics library (7.8.9), not bespoke math. The spoke does not invent the outcomes themselves — the analysis-plan note in the demo seed says calculator wiring "remains calculus-side," i.e. metric values come from calculus.
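As a worked instance of the power-planning formula above, take illustrative inputs that are not drawn from the demo seed: a two-sided significance level of 0.05, a target power of 0.80, and a medium effect of d = 0.5. Writing z_{1−β} for the standard-normal quantile at the target power, the standard-normal quantiles are approximately 1.95996 and 0.84162. Rounding up to whole people is an assumption here; the explainer does not say how the spoke rounds.

```latex
\begin{aligned}
n_{\text{per group}}
  &= \frac{2\,\bigl(z_{1-\alpha/2} + z_{1-\beta}\bigr)^{2}}{d^{2}} \\
  &= \frac{2\,(1.95996 + 0.84162)^{2}}{0.5^{2}} \\
  &= \frac{2 \times 7.8489}{0.25} \approx 62.8
  \;\Rightarrow\; 63 \text{ per group}
\end{aligned}
```

Halving the detectable effect to d = 0.25 quadruples the requirement, since n scales with 1/d². That is why sizing before committing budget matters.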
Differentiation beat: the practitioner's real question isn't "what's the lift" — it's "will this survive scrutiny?" The frozen digest + the refuse-to-compute-without-verification gate mean the answer is defensible by construction: you can prove what you committed to measuring before you saw a single result.
Visual — Tier A (real persisted lift shape). The demo seed (scripts/seed-program-evaluation-demo.ts) builds two deterministic outcome arrays and runs them through the real computeAndPersistLift. The arm means are fully determined (control 0.4118, n=120; variant 0.4844, n=118 — computed directly from the seed arrays), so the point estimate is real:
POST /api/spokes/program-evaluation/experiments/{id}/lift
→ {
  "estimate": {
    "experimentId": "<demo>",
    "armId": "<variant-arm-id>",
    "baselineArmId": "<control-arm-id>",
    "pointEstimate": 0.0727,
    "confidenceInterval": [ <bootstrap-low>, <bootstrap-high> ],
    "pValue": <welch-two-sided>,
    "sampleSizeActual": 238,
    "estimateId": "<uuid>",
    "computedAt": "<iso8601>"
  }
}
(pointEstimate 0.0727 and sampleSizeActual 238 are computed directly from the seed's deterministic outcome arrays; the bootstrap CI and Welch p-value are real outputs but depend on the seed key, so their exact values are left as placeholders rather than invented.)
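The shape of that estimate can be reproduced in miniature with a difference-in-means and a seeded percentile bootstrap. The sketch below is an assumption-laden illustration, not computeAndPersistLift: it takes a numeric seed directly (how the spoke derives one from seedKey is not shown in the text), the percentile index convention is one common choice, and it omits the Welch p-value and persistence. The mulberry32 generator is the widely circulated public-domain routine of that name.

```typescript
// Small seeded PRNG (mulberry32): returns floats in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = Math.imul(a ^ (a >>> 15), 1 | a);
    t = (t + Math.imul(t ^ (t >>> 7), 61 | t)) ^ t;
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function mean(xs: number[]): number {
  return xs.reduce((sum, x) => sum + x, 0) / xs.length;
}

// Mean of one resample drawn with replacement, same size as the original.
function resampleMean(xs: number[], rand: () => number): number {
  let total = 0;
  for (let i = 0; i < xs.length; i++) {
    total += xs[Math.floor(rand() * xs.length)];
  }
  return total / xs.length;
}

// Hypothetical result shape, loosely mirroring the LiftEstimate fields above.
interface LiftSketch {
  pointEstimate: number;
  confidenceInterval: [number, number];
  sampleSizeActual: number;
}

function bootstrapLift(
  control: number[],
  variant: number[],
  seed: number,
  iterations = 1500, // default named in the explainer
  level = 0.95,      // assumed confidence level
): LiftSketch {
  const rand = mulberry32(seed);
  const diffs: number[] = [];
  for (let i = 0; i < iterations; i++) {
    diffs.push(resampleMean(variant, rand) - resampleMean(control, rand));
  }
  diffs.sort((x, y) => x - y);
  // Percentile interval: read the tails off the sorted bootstrap differences.
  const tail = (1 - level) / 2;
  const low = diffs[Math.floor(tail * (iterations - 1))];
  const high = diffs[Math.ceil((1 - tail) * (iterations - 1))];
  return {
    pointEstimate: mean(variant) - mean(control),
    confidenceInterval: [low, high],
    sampleSizeActual: control.length + variant.length,
  };
}
```

Calling bootstrapLift twice with the same arrays and the same seed returns the same interval, which is the property that makes a seeded bootstrap auditable. A different seed gives a slightly different interval around the same point estimate.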
4. What does it enable?
Concrete uses a practitioner would recognize:
- Evaluate an onboarding nudge — exactly the demo scenario: control vs. variant on an accept-rate proxy, with a lift estimate instead of an anecdote.
- Pre-register a manager-training rollout so HR and Legal can both see the success criterion was set before results, not after.
- Size a pilot honestly — run power planning to learn the sample size needed to detect a given effect, before committing budget, rather than running an underpowered study that can never conclude.
- Assign fairly and reproducibly — deterministic HMAC assignment means re-running the resolver reproduces the same arm for the same person, so audits and re-analyses agree.
- Defend a result in a review — the persisted lift estimate carries its CI, p-value, and sample size, all tied to a verifiable pre-registration digest.
- Stage a tamper-evident program-evaluation registry — the 409-on-rewrite behavior makes the record of "what we said we'd measure" immutable.
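The reproducible assignment in the list above rests on a keyed hash rather than a stored coin flip. A minimal sketch of that idea follows, assuming details the explainer does not spell out: the WeightedArm shape, the function name, big-endian reading of the first 32 bits, and the +0.5 offset used to keep the draw strictly inside (0,1) are all hypothetical choices, not the code in core/randomization.ts.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical arm shape for illustration.
interface WeightedArm {
  armId: string;
  weight: number;
}

function assignArm(
  pepper: string,
  tenantId: string,
  experimentId: string,
  principalId: string,
  arms: WeightedArm[],
): string {
  if (arms.length === 0) {
    throw new Error("at least one arm is required");
  }
  // Key is pepper, tenantId and experimentId joined by NUL bytes.
  const key = [pepper, tenantId, experimentId].join("\0");
  const digest = createHmac("sha256", key).update(principalId, "utf8").digest();
  // First 32 bits as an unsigned integer; the +0.5 offset (an assumption)
  // keeps the draw strictly between 0 and 1.
  const draw = (digest.readUInt32BE(0) + 0.5) / 2 ** 32;
  // Walk the cumulative normalized weights until the draw falls inside one.
  const totalWeight = arms.reduce((sum, arm) => sum + arm.weight, 0);
  let cumulative = 0;
  for (const arm of arms) {
    cumulative += arm.weight / totalWeight;
    if (draw < cumulative) {
      return arm.armId;
    }
  }
  // Guard against floating-point shortfall in the cumulative sum.
  return arms[arms.length - 1].armId;
}
```

Because the experiment id is part of the key, the same person can land in different arms across experiments while always landing in the same arm within one. Keeping the pepper secret prevents anyone from predicting assignments from ids alone.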
Visual — (TBD — a rendered control-vs-variant lift chart with the CI band for the demo experiment).
5. How it fits in the toolbox
Data flow:
- Consumes: cohort handles from segmentation-studio (an arm's audienceFilter is an opaque segmentation handle, never a raw filter the spoke interprets); outcome metric values from calculus (the spoke records the primary-outcome id, not the calculation). Tenant context arrives via the PAT-N7 tenant bridge (organizationId/tenantId).
- Emits: the program-evaluation contract types (Experiment, TreatmentArm, PreRegistration, LiftEstimate, and their request/response shapes). Consumers vendor src/spokes/program-evaluation/contracts/types.ts.
- Feeds: data-anonymizer gates published lift aggregates against the toolbox min-N rule before they leave a tenant-safe context (named explicitly in the spoke README's cross-spoke boundary).
- Boundary discipline: per AGENTS.md, it imports only @/spokes/*/contracts/types from siblings; it never reaches into another spoke's core/, db/, or routes.
Visual — Tier B (typographic data-flow). segmentation-studio cohort handle + calculus outcome metric → Program Evaluation (pre-register · assign · lift) → data-anonymizer min-N gate → published evaluation.
6. Commercialization / packaging
Program Evaluation is a service component, not a standalone product — it is the causal-rigor layer a people-analytics or program-management offering composes, sitting behind buyer-facing surfaces rather than being sold on its own.
- Auth posture: write endpoints (experiment create, pre-register, assignment resolve, lift compute) require TOOLBOX_SERVICE_KEY (x-toolbox-service-key preferred) per PAT-11; the experiment-detail read is public.
- Data-license posture: the spoke stores tenant-supplied program data and metric ids only. It ships no third-party licensed data, so there are no external-data constraints attached to it. Privacy of published aggregates is delegated to data-anonymizer.
- Anything about pricing tiers or packaged offerings is (TBD): not earned yet, so not stated.
Visual — (TBD — product-tier placement diagram).
7. The vision
Every people program — a nudge, a training, a policy change — evaluated the way a clinical trial is: a frozen plan, a fair assignment, and a lift you can defend, with the post-hoc storytelling designed out.
The code points to three near-term directions. First, the power-planning input still carries a projectedDurationDays "calendar placeholder until an arrival-rate model exists" (core/power-analysis.ts), so a real enrollment/arrival model is a natural next step. Second, richer randomization strategies: blocked, cluster, and stratified are typed in the contract, but simple is the assignment path exercised today. Third, tighter wiring to calculus, so that primary outcomes resolve through the toolbox's metric envelope rather than being passed as ids. Performix is the planned consumer (registry consumers: [{ name: "performix", status: "planned" }]).
Visual — (TBD — the experiment-to-evidence roadmap).
8. Current status
Grounded in the real code state (contract 0.2.0, src/spokes/program-evaluation/, registry status: "live"):
- Shipped (PAT-D3-B, 2026-05-22): the program_evaluation schema (migration drizzle/0076_*) with heartbeat, experiments, treatment_arms, pre_registrations (SHA-256 digest), memoized assignments, and persisted lift_estimates; the four cores (pre-registration digest, HMAC weighted assignment, two-arm power planning, bootstrap lift); REST routes POST /experiments, GET /experiments/[id], POST /experiments/[id]/pre-register, POST /assignments/resolve, POST /experiments/[id]/lift, GET /health; MCP tools program-evaluation.*; health-aggregate rollup; the demo seed helper. Contract CONTRACT_VERSION = 0.2.0.
- In flight / planned: an arrival-rate model to replace the duration placeholder in power planning; exercising the blocked / cluster / stratified strategies beyond the typed contract; the Performix consumer integration (planned, not yet wired).
Visual — Tier A (live capture). GET /api/spokes/program-evaluation/health reports the real spoke status ({ spoke, status, contractVersion: "0.2.0", schemaReachable, latencyMs, checkedAt }) at request time.
The worked example used above is the spoke's own demo seed (scripts/seed-program-evaluation-demo.ts): a two-arm onboarding-nudge experiment whose deterministic control (mean 0.4118, n=120) and variant (mean 0.4844, n=118) outcome arrays produce a real difference-in-means point estimate of 0.0727 over a total sample of 238, run through the production computeAndPersistLift core. The point estimate and sample size are computed directly from the seed; the bootstrap CI and p-value are genuine outputs whose exact values depend on the run-time seed and so are not pinned here. No figure is invented.