Segmentation Studio — plain-language explainer

Segmentation Studio turns messy HRIS columns into governed, named groups of people — and issues a stable ID for every group so two tools can mean the same thing by "Engineering in the West."

A People Analytics Toolbox component. Built to the portfolio Explainer Standard v1.0. Every claim below is grounded in the spoke's own code and contracts (src/spokes/segmentation-studio/, exported CONTRACT_VERSION = 2.11.0; the CHANGELOG records additive releases through 2.12.0); anything not yet built is marked (TBD).

1. What is it?

Segmentation Studio is the toolbox's cohort-resolution engine: it does two jobs in sequence. First, it normalizes raw HRIS columns — whatever a system happens to call them — into a canonical field library (a column named dept, team, or org_unit all resolve to the canonical department). Second, it resolves operator-defined cuts ({ department in [engineering], region in [west] }) into a concrete cohort: the list of employee IDs in the group and the stable segment-node IDs they matched.

What makes it more than a filter is that the cohort comes back as typed, governed references rather than string matches. "Engineering in the West" stops being a brittle text query inside one app and becomes a dimension + node pair every other spoke can join against and agree on.

Visual — Tier A (real computed output). A cohort resolved against the spoke's own demo seed (drizzle/0007_pat5_seed.sql):

resolveCohort({
  include: [
    { dimensionKey: "department", operator: "in", values: ["engineering"] },
    { dimensionKey: "region",     operator: "in", values: ["west"] }
  ]
})
→ {
    "memberIds": ["EMP-001"],
    "segmentNodeIds": ["pat5-node-engineering", "pat5-node-west"],
    "totalMembers": 1,
    "totalNodes": 2
  }

(Real output of the pure resolveCohort core in core/cohort-resolution.ts against the 6-employee seed. EMP-002 is engineering but East; EMP-005 is West but Product — both correctly excluded by the AND across dimensions.)

2. What problem does it solve — and why is it different?

The pain it removes: every analytics tool re-invents segmentation badly. One app calls the group "Engineering," another "Eng Org," a third matches on a department string that breaks the next reorg — and none of them can be sure they're counting the same people.

The difference, stated as a shift:

FROM a department string matched ad hoc inside each consuming app, re-derived (and re-broken) per tool.
TO a typed reference to a stable, governed dimension + node pair that every spoke resolves identically.

How it differs from the obvious substitutes:

vs. doing it by hand / generic BI filters — a BI filter lives inside one report. Segmentation Studio exposes resolution as a service API any consumer (including an AI agent authoring a query from a natural-language prompt) can call, returning the same member set every time.
vs. enterprise HRIS analytics (Visier, Workday Adaptive) — those bind segment definitions to their own reporting layer at build time. This spoke issues segment IDs as a portable contract, decoupled from any one report.
vs. BI semantic layers (LookML, dbt, Cube) — those are SQL-translation layers needing warehouse access and a config DSL. The spoke resolves against an in-Postgres segment_memberships table with no warehouse and no query language to learn.
vs. statistical clustering (sklearn, scipy) — that discovers groups statistically; this resolves operator-defined groups consistently. The operator owns the boundaries; the spoke just answers "who's in this one" the same way every time.

Visual — Tier B (FROM→TO typographic block). dept / team / org_unit (string match, per-app) → canonical "department" dimension → governed node "engineering" (stable ID, every consumer).

3. How does it work?

Inputs → method → outputs, framed as the questions a practitioner actually asks.

"What is this column?" — The field-normalization layer matches each raw HRIS column against a canonical field library. Match precedence runs priority-catalog name → name-variant array → unmapped, emitting a per-column confidence and the variant it matched (FieldMappingSuggestion). Duplicate-target detection drops the confidence of a second column that suggests an already-claimed canonical key (core/field-mapping.ts). Optionally it also runs union-find identity resolution on employee/email/name columns to dedupe people across files before normalizing.

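The match precedence above can be sketched as follows. This is a minimal illustration, not the code in core/field-mapping.ts: the CanonicalField shape, the fields of FieldMappingSuggestion, the normalize helper, and every confidence value (1.0, 0.8, halving on a duplicate target) are assumptions chosen for the example.

```typescript
// Hypothetical shapes. Only the name FieldMappingSuggestion comes from the explainer;
// its fields and all confidence numbers below are illustrative assumptions.
interface CanonicalField {
  key: string;          // canonical key, e.g. "department"
  priorityName: string; // priority-catalog name
  variants: string[];   // known name variants
}

interface FieldMappingSuggestion {
  column: string;
  canonicalKey: string | null; // null means unmapped
  confidence: number;
  matchedVariant: string | null;
}

// Assumed normalization: trim, lowercase, collapse spaces and hyphens to "_".
const normalize = (s: string): string =>
  s.trim().toLowerCase().replace(/[\s\-]+/g, "_");

function suggestMappings(
  columns: string[],
  catalog: CanonicalField[],
): FieldMappingSuggestion[] {
  const claimed = new Set<string>(); // canonical keys already suggested
  return columns.map((column) => {
    const name = normalize(column);
    let suggestion: FieldMappingSuggestion = {
      column, canonicalKey: null, confidence: 0, matchedVariant: null,
    };

    // Step 1: priority-catalog name.
    const byPriority = catalog.find((f) => normalize(f.priorityName) === name);
    if (byPriority) {
      suggestion = {
        column, canonicalKey: byPriority.key, confidence: 1.0,
        matchedVariant: byPriority.priorityName,
      };
    } else {
      // Step 2: name-variant array. Step 3 (unmapped) is the default above.
      for (const field of catalog) {
        const variant = field.variants.find((v) => normalize(v) === name);
        if (variant !== undefined) {
          suggestion = {
            column, canonicalKey: field.key, confidence: 0.8, matchedVariant: variant,
          };
          break;
        }
      }
    }

    // Duplicate-target detection: a later column pointing at an already-claimed
    // key keeps its suggestion but loses confidence (halved here by assumption).
    if (suggestion.canonicalKey !== null) {
      if (claimed.has(suggestion.canonicalKey)) suggestion.confidence /= 2;
      else claimed.add(suggestion.canonicalKey);
    }
    return suggestion;
  });
}
```

With a one-entry catalog for department (variants dept, team, org_unit), the columns Dept, team, and shoe_size come back as a 0.8 match, a 0.4 duplicate-target match, and an unmapped column respectively.
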
"Who is in this group?" — The cohort resolver (core/cohort-resolution.ts) is a pure function over the materialized segment_memberships table. Semantics: AND across dimensions, OR within a dimension's values, exclude rules subtract. The contract is deliberately a small filter-config ({ include: SegmentRule[], exclude?: SegmentRule[] }) — no expression DSL, no SQL surface — which keeps it LLM-friendly: an agent can author { dimensionKey, operator, values } without knowing SQL.

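The resolver semantics above (AND across dimensions, OR within a dimension's values, exclude rules subtract) can be sketched in a few lines. This is a sketch under stated assumptions rather than the contents of core/cohort-resolution.ts: the Membership row layout is a hypothetical stand-in for segment_memberships, only the "in" operator is handled, and the sample rows cover just the three employees named in this explainer, with made-up node IDs for East and Product.

```typescript
// SegmentRule's three fields and the response keys come from the explainer;
// the Membership row layout is an assumed stand-in for segment_memberships.
interface SegmentRule {
  dimensionKey: string;
  operator: "in"; // only the operator shown in the explainer is sketched
  values: string[];
}
interface CohortCriteria { include: SegmentRule[]; exclude?: SegmentRule[]; }
interface Membership {
  employeeId: string;
  dimensionKey: string;
  nodeKey: string; // e.g. "engineering"
  nodeId: string;  // e.g. "pat5-node-engineering"
}

// Rows of one employee that satisfy a rule: OR within the rule's values.
function matchRule(rule: SegmentRule, rows: Membership[]): Membership[] {
  return rows.filter(
    (row) => row.dimensionKey === rule.dimensionKey && rule.values.indexOf(row.nodeKey) !== -1,
  );
}

function resolveCohort(criteria: CohortCriteria, memberships: Membership[]) {
  // Group rows per employee; one person may hold several rows (multi-membership).
  const byEmployee = new Map<string, Membership[]>();
  for (const row of memberships) {
    const rows = byEmployee.get(row.employeeId);
    if (rows) rows.push(row);
    else byEmployee.set(row.employeeId, [row]);
  }

  const memberIds: string[] = [];
  const segmentNodeIds: string[] = [];
  byEmployee.forEach((rows, employeeId) => {
    // Assumes one include rule per dimension, so AND across rules is AND across dimensions.
    const hits = criteria.include.map((rule) => matchRule(rule, rows));
    const included = hits.every((hit) => hit.length > 0);
    // Any matching exclude rule removes the employee.
    const excluded = (criteria.exclude ?? []).some((rule) => matchRule(rule, rows).length > 0);
    if (!included || excluded) return;
    memberIds.push(employeeId);
    for (const hit of hits) {
      for (const row of hit) {
        if (segmentNodeIds.indexOf(row.nodeId) === -1) segmentNodeIds.push(row.nodeId);
      }
    }
  });
  return { memberIds, segmentNodeIds, totalMembers: memberIds.length, totalNodes: segmentNodeIds.length };
}

// Illustrative rows only; "pat5-node-east" and "pat5-node-product" are hypothetical IDs.
const sampleMemberships: Membership[] = [
  { employeeId: "EMP-001", dimensionKey: "department", nodeKey: "engineering", nodeId: "pat5-node-engineering" },
  { employeeId: "EMP-001", dimensionKey: "region", nodeKey: "west", nodeId: "pat5-node-west" },
  { employeeId: "EMP-002", dimensionKey: "department", nodeKey: "engineering", nodeId: "pat5-node-engineering" },
  { employeeId: "EMP-002", dimensionKey: "region", nodeKey: "east", nodeId: "pat5-node-east" },
  { employeeId: "EMP-005", dimensionKey: "department", nodeKey: "product", nodeId: "pat5-node-product" },
  { employeeId: "EMP-005", dimensionKey: "region", nodeKey: "west", nodeId: "pat5-node-west" },
];

const sampleCohort = resolveCohort(
  {
    include: [
      { dimensionKey: "department", operator: "in", values: ["engineering"] },
      { dimensionKey: "region", operator: "in", values: ["west"] },
    ],
  },
  sampleMemberships,
);
// sampleCohort.memberIds is ["EMP-001"];
// sampleCohort.segmentNodeIds is ["pat5-node-engineering", "pat5-node-west"].
```

Note that in this sketch an empty include list matches every employee; whether the real resolver treats that case the same way is not stated here.
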
Inputs — raw HRIS rows (HTTP, or pulled live via the Workday SOAP/OAuth and BambooHR REST connectors); operator-authored SegmentDefinition criteria. Outputs — FieldMappingSuggestion[] + normalized rows on ingest; { memberIds, segmentNodeIds, totalMembers, totalNodes } on cohort resolve; a versioned SegmentationPack snapshot for downstream consumers.

Data sources and the science backing. The canonical catalog carries per-source attribution (PAT-51): authoritative taxonomies — O*NET, BLS, NAICS, GICS, U.S. Census, Radford / Mercer / CompAnalyst — can each contribute their own definition prose and name variants to a field or segment. The Tier-2 canonical segments are cited to their sources: generational cohorts to Pew Research, socio-economic-status bands to peer-reviewed scales (Hauser-Warren SEI, Nakao-Treas SEI, Ganzeboom ISEI, UK NS-SEC), per-jurisdiction diversity taxonomies to their statistical agencies (US EEO-1, UK ONS, Canada StatCan, EU Eurostat, Australia ABS, Singapore SingStat) — which deliberately do not crosswalk, since EEO-1 "Asian" is not ONS "Asian / Asian British." Where a construct maps to a validated I/O-psych measure (engagement, psychological safety, burnout), the link to a Principia construct is recorded.

Differentiation beat. The practitioner's real question isn't "filter this report" — it's "will the next tool count the same people I just counted?" The stable dimension + node ID is the answer: it travels across spokes and survives a reorg, so a cohort defined once is the same cohort everywhere.

Visual — Tier B (step flow). raw columns → field-mapping (priority → variant → unmapped, per-column confidence) → [optional identity-resolve] → segment_memberships → resolveCohort(criteria) → { memberIds, segmentNodeIds }.
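
The optional identity-resolve step in the flow above relies on union-find. A minimal sketch of that technique follows, assuming a hypothetical PersonRecord shape and matching only on normalized email and exact normalized name; fuzzy name matching is left out, and none of these names are taken from the spoke's code.

```typescript
// Hypothetical record shape for rows coming from different HRIS files.
interface PersonRecord {
  rowId: string;
  email?: string;
  name?: string;
}

// Classic union-find (disjoint-set) over record indexes, with path halving.
class UnionFind {
  private parent: number[];
  constructor(size: number) {
    this.parent = Array.from({ length: size }, (_, i) => i);
  }
  find(x: number): number {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]]; // point to grandparent
      x = this.parent[x];
    }
    return x;
  }
  union(a: number, b: number): void {
    const rootA = this.find(a);
    const rootB = this.find(b);
    if (rootA !== rootB) this.parent[rootB] = rootA;
  }
}

// Returns groups of rowIds judged to be the same person.
function resolveIdentities(records: PersonRecord[]): string[][] {
  const uf = new UnionFind(records.length);
  const firstSeen = new Map<string, number>(); // match key -> first record index

  records.forEach((record, index) => {
    const keys: string[] = [];
    if (record.email) keys.push("email:" + record.email.trim().toLowerCase());
    if (record.name) {
      keys.push("name:" + record.name.trim().toLowerCase().replace(/\s+/g, " "));
    }
    for (const key of keys) {
      const previous = firstSeen.get(key);
      if (previous === undefined) firstSeen.set(key, index);
      else uf.union(previous, index); // shared key: same person
    }
  });

  const groups = new Map<number, string[]>();
  records.forEach((record, index) => {
    const root = uf.find(index);
    const group = groups.get(root);
    if (group) group.push(record.rowId);
    else groups.set(root, [record.rowId]);
  });
  return Array.from(groups.values());
}
```

Union-find makes the matches transitive: if row A shares an email with row B and row B shares a name with row C, all three land in one group. Exact-name matching on its own will also merge two different people who share a name, which is one reason a production resolver would weigh signals rather than treat every shared key as conclusive.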

4. What does it enable?

Concrete uses a practitioner would recognize:

Onboard a new HRIS export without hand-mapping every column — ingest suggests canonical keys with confidence and flags what it couldn't place.
Define a cohort once, reuse it everywhere — "Engineering, West, excluding contractors" becomes a stable ID that calculus, preference-modeler, and consumer apps all resolve identically.
Support matrixed orgs — multi-membership dimensions (isMultiMembership = true) let one person belong to "Engineering" + "Platform Squad" + "EMEA" simultaneously, instead of losing information to a single-membership model.
Segment on governed, cited taxonomies — generational cohort (Pew), SES scales, per-jurisdiction diversity, compensation-position bands — with the source definition attached, not invented.
Hot-swap calculated segments — attrition-risk and Nine-Box bands keep the bucket shape + ID stable while the algorithm underneath is upgraded as more signals come online, so consumers never break.
Merge multiple HRIS files deterministically — identity-resolve answers "who is who" across files, then data-join merges their rows under per-column OVERWRITE / IGNORE / FILL_HOLES policy with a column-lineage and conflict report.
Walk the manager graph — resolve everyone downstream of a focal person (reporting-chain), distinct from finance-org or operational-org groupings.
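
The per-column merge policies in the data-join item above can be sketched as below. The semantics are assumptions made for illustration: OVERWRITE lets a non-empty supplemental value replace the anchor value, IGNORE always keeps the anchor value, and FILL_HOLES uses the supplemental value only where the anchor is empty. The joinRow name, the Row and JoinResult shapes, and the FILL_HOLES default are hypothetical.

```typescript
type JoinPolicy = "OVERWRITE" | "IGNORE" | "FILL_HOLES";
type Row = Record<string, string | null | undefined>;

interface JoinResult {
  merged: Record<string, string | null>;
  lineage: Record<string, "anchor" | "supplemental">; // which file supplied each column
  conflicts: { column: string; anchor: string; supplemental: string }[];
}

// A value counts as present when it is a non-blank string.
const hasValue = (v: string | null): v is string => v !== null && v.trim() !== "";

function joinRow(
  anchor: Row,
  supplemental: Row,
  policies: Record<string, JoinPolicy>,
  defaultPolicy: JoinPolicy = "FILL_HOLES", // assumed default
): JoinResult {
  const result: JoinResult = { merged: {}, lineage: {}, conflicts: [] };
  const columns = Array.from(
    new Set(Object.keys(anchor).concat(Object.keys(supplemental))),
  );

  for (const column of columns) {
    const a = anchor[column] ?? null;
    const s = supplemental[column] ?? null;
    const policy = policies[column] ?? defaultPolicy;

    // Report a conflict whenever both sides hold different values,
    // regardless of which side the policy ends up keeping.
    if (hasValue(a) && hasValue(s) && a !== s) {
      result.conflicts.push({ column, anchor: a, supplemental: s });
    }

    const useSupplemental =
      hasValue(s) &&
      (policy === "OVERWRITE" || (policy === "FILL_HOLES" && !hasValue(a)));

    result.merged[column] = useSupplemental ? s : a;
    result.lineage[column] = useSupplemental ? "supplemental" : "anchor";
  }
  return result;
}
```

Because the function is pure and the policy is decided per column, the same pair of rows always merges the same way, and the lineage and conflict lists explain every value in the merged row.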

Visual — Tier B (typographic list). The seven verbs above map 1:1 to live routes: hris/ingest · cohorts/resolve · segments/define · identity/resolve · data-join/run · canonical-segments · org/reporting-chain/resolve.
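
The reporting-chain use listed above amounts to a downstream walk of the manager graph. A minimal sketch, assuming a hypothetical ReportingEdge row (employee to manager) and a breadth-first traversal; the route's real request and response shapes are not shown in this explainer.

```typescript
// Hypothetical edge shape: one row per employee, pointing at their manager.
interface ReportingEdge {
  employeeId: string;
  managerId: string | null; // null for the top of the chain
}

// Everyone downstream of the focal person, direct and indirect, in breadth-first order.
function resolveReportingChain(focalId: string, edges: ReportingEdge[]): string[] {
  // Invert the edges: manager -> direct reports.
  const reportsOf = new Map<string, string[]>();
  for (const edge of edges) {
    if (edge.managerId === null) continue;
    const reports = reportsOf.get(edge.managerId);
    if (reports) reports.push(edge.employeeId);
    else reportsOf.set(edge.managerId, [edge.employeeId]);
  }

  const seen = new Set<string>([focalId]); // guards against cycles in dirty data
  const queue: string[] = [focalId];
  const downstream: string[] = [];

  while (queue.length > 0) {
    const current = queue.shift() as string;
    for (const report of reportsOf.get(current) ?? []) {
      if (seen.has(report)) continue;
      seen.add(report);
      downstream.push(report);
      queue.push(report);
    }
  }
  return downstream;
}
```

The seen set matters in practice: HRIS exports occasionally contain manager loops, and without the guard the walk would never terminate.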

5. How it fits in the toolbox

Segmentation Studio is a clean leaf in the dependency DAG: it issues segment IDs and consumes from no other toolbox spoke.

Consumes — raw HRIS data (uploaded, or pulled via Workday / BambooHR connectors) and curated public taxonomies (O*NET, BLS, NAICS, GICS, Census, Pew, the SES scales). It holds canonical bucket definitions for compensation-position bands but not the formula math — anycomp owns the compa-ratio formula, which consumers vendor directly (a deliberate cross-spoke split, PAT-57).
Emits — resolved cohorts (ResolveCohortResponse), canonical field/segment catalogs, and versioned SegmentationPack snapshots. Consumers vendor src/spokes/segmentation-studio/contracts/types.ts.
Feeds — calculus's metric × segment grouping key; preference-modeler's by-segment cross-tab; data-anonymizer's min-N gate, which calls in for cohort sizes before any team-level rollup is released; and Performix's segmentation adapter. Without this spoke, "engagement by department" is a string match in each consumer; with it, it's a typed reference.
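
As one illustration of the consumer side, a min-N gate like the one data-anonymizer applies could be built on the resolved cohort size. This is a hypothetical sketch: the minNGate function, the GateDecision type, and any particular threshold are assumptions; only the ResolveCohortResponse name and its four fields come from this explainer.

```typescript
// Field names follow the cohort-resolve output described in this explainer.
interface ResolveCohortResponse {
  memberIds: string[];
  segmentNodeIds: string[];
  totalMembers: number;
  totalNodes: number;
}

// Hypothetical decision type for a consumer-side gate.
type GateDecision =
  | { released: true; cohortSize: number }
  | { released: false; reason: string };

// Release a rollup only when the cohort has at least minN members.
function minNGate(cohort: ResolveCohortResponse, minN: number): GateDecision {
  if (!Number.isInteger(minN) || minN < 1) {
    throw new RangeError("minN must be a positive integer");
  }
  if (cohort.totalMembers >= minN) {
    return { released: true, cohortSize: cohort.totalMembers };
  }
  // Deliberately omit the exact size so the refusal itself leaks nothing.
  return { released: false, reason: "cohort has fewer than " + minN + " members" };
}
```

Because the gate reads the size from the resolved cohort rather than re-deriving the group, it counts the same people as every other consumer of that cohort.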

Visual — Tier B (typographic data-flow). HRIS + public taxonomies → Segmentation Studio (normalize + resolve) → { calculus metric×segment · preference-modeler cross-tab · data-anonymizer min-N · Performix adapter }.

6. Commercialization / packaging

Segmentation Studio is a service component, not a standalone product — it is the shared cohort-ID issuer that the toolbox's analytics surfaces compose, so it sits behind buyer-facing tooling rather than being sold on its own.

Data-license posture: the structure and definitions the spoke ships (canonical field library, segment predicates, the toolbox's own taxonomies) are first-party. Vendor-derived content — Radford / Mercer / CompAnalyst definitions and their name variants — is held separately under the per-source attribution layer (PAT-51) precisely so first-party structure can be distributed openly while licensed source content stays governed by its own terms.
Anything about pricing tiers or packaged offerings is (TBD) — not earned yet, so not stated.

Visual — (TBD — product-tier placement diagram showing the spoke behind the analytics surfaces that compose it).

7. The vision

One governed definition of "who is in this group," issued once and resolved identically by every tool in the portfolio — including AI agents authoring cohorts from plain language — and upgradeable underneath without ever breaking a consumer.

The direction is two-fold. First, deepen the canonical catalog with more cited, governed taxonomies (the Tier-2 batches added generational, SES, per-jurisdiction diversity, comp-position, company-size/stage, education, and Mintzberg org dimensions). Second, widen the calculated-segment family: attrition-risk and Nine-Box ship today on the hot-swap pattern, while PAT-63's performance / competency / behavioral / psychological meta-segmentation dimensions are seeded but their resolver wiring is (TBD — PAT-63-FU-A). Per-tenant canonical-field overrides and a custom-segment authoring UI are filed follow-ups, not yet built.

Visual — (TBD — a catalog-growth + calculated-segment hot-swap map).

8. Current status

Grounded in the real code state (exported CONTRACT_VERSION = 2.11.0; CHANGELOG additive through 2.12.0; src/spokes/segmentation-studio/; registry status: "live"):

Shipped: field-normalization ingest with per-column confidence; the pure multi-membership cohort resolver (AND-across / OR-within / exclude); SegmentDefinition registration; batch identity resolution (union-find on email + exact + fuzzy name); anchor-supplemental data-join with lineage + conflict report; OneModel semantic-layer import; reporting-chain walk; versioned segmentation packs; declarative Config_Segmentation rules engine; live HRIS connectors (Workday SOAP + OAuth/RaaS, BambooHR REST) persisting tenant-scoped datasets + an ingestion_jobs audit; the Tier-1/Tier-2 canonical-segment batches with per-source attribution; calculated segments (attrition-risk, Nine-Box) on the hot-swap pattern. At least 25 HTTP routes and a matching MCP tool surface are registered.
Seeded but not yet wired: PAT-63 meta-segmentation dimensions (performance / competency / behavioral / psychological) — rows exist with placeholder predicates; resolver wiring is (TBD — PAT-63-FU-A).
Filed follow-ups (not built): recipe-execution for derived dimensions; per-tenant canonical-field overrides; a custom-segment authoring UI.

Visual — Tier A (live capture). GET /api/spokes/segmentation-studio/health and GET /api/spokes/segmentation-studio/canonical-fields report the real shipped catalog and reachability at request time.

The worked example in section 1 is the real output of the spoke's pure resolveCohort core (core/cohort-resolution.ts) against its own demo seed (drizzle/0007_pat5_seed.sql): the cohort "Engineering ∧ West" over six seeded employees resolves to exactly one member (EMP-001) and two matched node IDs, while EMP-002 (Engineering/East) and EMP-005 (Product/West) correctly fall out of the AND. No figure here is invented.