Data Anonymizer — plain-language explainer
Data Anonymizer is the privacy floor every other toolbox component stands on: it finds personal information, masks or tokenizes it, and refuses to show a team-level number when the team is too small to be anonymous.
A People Analytics Toolbox component. Built to the portfolio Explainer Standard v1.0. Every claim below is grounded in the spoke's own code and contracts (src/spokes/data-anonymizer/, contract 1.2.0); anything not yet built is marked (TBD).
1. What is it?
Data Anonymizer is a stateless privacy primitive — four operations a service calls at the moment data is about to leave a trusted boundary: detect which fields and text contain personal information, redact that information out of free text, tokenize an identifier into a stable but non-reversible token, and gate a cohort number against a minimum-headcount threshold so small groups never get reported.
It is deliberately not a job-running platform. There is no async pipeline, no policy engine, no stored copy of your HRIS. Each operation is a pure call: text or a value goes in, a labeled result comes back. The complexity lives in the rule catalog and the risk-prioritized overlap resolution, not in orchestration.
Visual — Tier B (the four primitives). What the spoke exposes, as a step list:
- detect: GET /pii-rules lists the active detection rules; field-name and free-text matching run on top of them.
- redact: POST /redact turns text into redacted text plus flagged spans.
- tokenize: POST /tokenize turns a value into a deterministic, non-reversible token.
- transform: POST /transform applies a named anonymization strategy (fake-but-realistic substitution).
- gate: POST /min-n-check says whether a cohort is large enough to report.
2. What problem does it solve — and why is it different?
The pain it removes: every people-analytics surface eventually wants to show team-level or respondent-level data, and every one of them faces the same three failure modes — leaking PII into a UI a third party might see, joining datasets on a raw identifier that travels with the person, and reporting a "segment of three" that anyone can de-anonymize. Solving this by hand means hand-rolled regexes, ad-hoc hashing, and a min-headcount rule that lives in one analyst's spreadsheet.
The difference, stated as a shift:
- FROM privacy logic scattered across every spoke, each reinventing detection, hashing, and the small-group rule.
- TO one component every consumer calls inline at the data egress point, with a shared rule catalog, deterministic tokens that stay consistent across services, and a single min-N gate the whole toolbox checks against.
How it differs from the obvious substitutes (per the spoke's own positioning):
- vs. enterprise privacy platforms (BigID, OneTrust, Privitar) — those are governance + discovery stacks priced per workflow; this is a free-standing service primitive, four endpoints, callable from any HTTP client in a few lines.
- vs. cloud DLP services (AWS Macie, Google Cloud DLP, Azure Purview) — those classify at scale on stored objects; this is sub-second per call, in-process for any dependent service, and its rule set is operator-editable in plain SQL.
- vs. open-source PII libraries (presidio, scrubadub) — those detect well but have no min-N gate, no persisted deterministic-tokenization, and no risk-prioritized overlap resolution. This combines all four.
3. How does it work?
Inputs → method → outputs, concretely, per primitive:
- detect / redact. Input is text (optionally a fieldName and a category filter). The detector runs the active rule catalog, both column-header patterns and free-text content patterns, over the text. Matches are sorted highest-risk-first (the category→risk map in contracts/types.ts: identifier, email, financial, name, trade_secret are high; phone, address, company, survey are medium), then overlapping matches are dropped in favor of the higher-risk, longer one. Output is { redactedText, spans, fieldNameMatched }, where each span carries { start, end, category, risk, ruleName, matched, redactedAs }, an audit trail of exactly what was masked.
- tokenize. Input is { tenantId, field, value }. The value is run through HMAC-SHA256 under a per-tenant derived key (the master secret is the env var DATA_ANONYMIZER_TOKEN_SECRET). The same input always yields the same token within a tenant; the token is not reversible without the master secret. Tokens are cached in the tokenization_maps table so a consumer can re-tokenize deterministically without recomputation.
- transform. Input is { tenantId, field, value, strategy }. Applies one of the named anonymization strategies (fake_name, safe_email, salary_band, shift_date, deterministic_id, redact, … 19 in the enum) to produce a fake-but-realistic substitute rather than a mask.
- gate. Input is { segmentId, respondentCount, threshold }; output is { ok, reason? }. A pure comparison: ok = respondentCount >= threshold, with a human-readable reason when it fails.
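The tokenize step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the spoke's code: the document only says that HMAC-SHA256 runs under a per-tenant derived key, so the derivation shown here (an HMAC of the tenant ID under the master secret), the field:value message layout, and the helper names deriveTenantKey and tokenizeSketch are all hypothetical.

```typescript
import { createHmac } from "node:crypto";

// ASSUMPTION: the per-tenant key is HMAC-SHA256(masterSecret, tenantId).
// The spoke's real derivation scheme is not described in this document.
function deriveTenantKey(masterSecret: string, tenantId: string) {
  return createHmac("sha256", masterSecret).update(tenantId).digest();
}

// ASSUMPTION: the field name is mixed into the message so the same value in
// two different fields yields two different tokens.
function tokenizeSketch(
  masterSecret: string,
  tenantId: string,
  field: string,
  value: string,
): string {
  const key = deriveTenantKey(masterSecret, tenantId);
  return createHmac("sha256", key).update(`${field}:${value}`).digest("hex");
}

// Stand-in for the DATA_ANONYMIZER_TOKEN_SECRET env var; never hard-code a real secret.
const exampleSecret = "example-only-secret";

const tokenFirst = tokenizeSketch(exampleSecret, "tenant-a", "employeeId", "E-1001");
const tokenAgain = tokenizeSketch(exampleSecret, "tenant-a", "employeeId", "E-1001");
const tokenOtherTenant = tokenizeSketch(exampleSecret, "tenant-b", "employeeId", "E-1001");

console.log(tokenFirst === tokenAgain);       // true: same tenant, same input
console.log(tokenFirst === tokenOtherTenant); // false: the tenant feeds the key
```

Under these assumptions, a token is a 64-character hex string that is stable within a tenant and differs across tenants, which is what lets consumers join on the token inside one tenant without carrying the raw identifier.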
Data sources and the science behind it. The detection rules are a regex catalog (lifted from the donor data-anonymizer service and tuned), seeded as 24 default global rules covering names, identifiers, email/phone, addresses, company/department, financial, and survey fields. The risk model is the category→risk table in the contract. There are no external priors here — this is rule-driven privacy logic, not a statistical model. (No Principia, BLS, O*NET, or NAICS dependency.)
Differentiation beat. The practitioner's real question is not "did it find some PII" — it is "can I trust which classification it chose when two rules collide?" The risk-prioritized resolution answers that directly and is tuned for HRIS / survey workflows: an SSN-formatted string (123-45-6789) is matched by both the broad phone-content rule (medium) and the SSN-content rule (high); the spoke classifies it as identifier (high), because misclassifying an SSN as a phone number is materially worse than the inverse. Generic libraries treat all matches as equal-priority.
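The collision rule above (highest risk first, then longest match, then drop anything that overlaps a kept match) can be illustrated with a short sketch. The type names, the numeric risk ranking, and the function name dropOverlapsSketch are assumptions made for illustration; the spoke's own dropOverlaps may differ in its details.

```typescript
type Risk = "high" | "medium";

interface Candidate {
  start: number; // inclusive offset into the text
  end: number;   // exclusive offset into the text
  category: string;
  risk: Risk;
}

// ASSUMPTION: a simple numeric ranking stands in for the contract's risk table.
const RISK_RANK: Record<Risk, number> = { high: 2, medium: 1 };

// Sort highest-risk-first, then longest-first, then keep a candidate only if
// it does not overlap one that has already been kept.
function dropOverlapsSketch(candidates: Candidate[]): Candidate[] {
  const ordered = [...candidates].sort(
    (a, b) =>
      RISK_RANK[b.risk] - RISK_RANK[a.risk] ||
      (b.end - b.start) - (a.end - a.start) ||
      a.start - b.start,
  );
  const kept: Candidate[] = [];
  for (const c of ordered) {
    const overlaps = kept.some((k) => c.start < k.end && k.start < c.end);
    if (!overlaps) kept.push(c);
  }
  // Return the survivors in document order.
  return kept.sort((a, b) => a.start - b.start);
}

// In "SSN 123-45-6789" both rules claim offsets 4..15.
const resolved = dropOverlapsSketch([
  { start: 4, end: 15, category: "phone", risk: "medium" },
  { start: 4, end: 15, category: "identifier", risk: "high" },
]);

console.log(resolved.map((c) => c.category)); // [ 'identifier' ]
```

The order in which the rules fire does not matter in this sketch: the sort runs before any match is kept, so the medium-risk phone match can never displace the high-risk identifier match.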
4. What does it enable?
Concrete uses a practitioner would recognize:
- Redact free text before it hits a UI: run a survey verbatim or a manager comment through /redact before rendering it on an insight card a third party might see; the returned spans are the audit of what was masked.
- Join datasets without carrying the person: tokenize an employee ID or email per tenant, then join on the token across spokes; the same person resolves to the same token, but the token reveals nothing.
- Enforce the small-group rule everywhere: call /min-n-check before showing any segment-level rollup, so a "segment of three" is suppressed by the same gate across the whole toolbox.
- Generate fake-but-realistic demo or export data: apply transform strategies (fake_name, safe_email, salary_band) to produce a usable non-sensitive copy of an HRIS extract.
- Layer tenant-specific rules: a tenant can add its own trade_secret rules on top of the global catalog; GET /pii-rules?tenantId= merges them (tenant row wins on name collision).
- Audit a privacy review: the redaction_audit_log table records input hash + categories + span count per call, so a reviewer can verify what was masked without retaining the raw text.
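The rule layering mentioned in the list above (global catalog first, tenant rows winning on a name collision) amounts to a keyed merge. The sketch below assumes a simplified rule shape; the field names, the tenantId-null convention for global rows, and the example tenant rules are hypothetical and are not taken from the pii_rules schema.

```typescript
interface PiiRuleSketch {
  name: string;
  category: string;
  pattern: string;
  tenantId: string | null; // ASSUMPTION: null marks a global rule
}

// Merge by rule name: insert global rows first, then let tenant rows
// overwrite any entry that shares a name.
function layerRules(
  globalRules: PiiRuleSketch[],
  tenantRules: PiiRuleSketch[],
): PiiRuleSketch[] {
  const byName = new Map<string, PiiRuleSketch>();
  for (const rule of globalRules) byName.set(rule.name, rule);
  for (const rule of tenantRules) byName.set(rule.name, rule);
  return Array.from(byName.values());
}

// Illustrative rows only; the tenant rules below are made up for this example.
const globalCatalog: PiiRuleSketch[] = [
  { name: "Email Content Pattern", category: "email", pattern: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}", tenantId: null },
  { name: "SSN Content Pattern", category: "identifier", pattern: "\\d{3}-?\\d{2}-?\\d{4}", tenantId: null },
];
const tenantCatalog: PiiRuleSketch[] = [
  { name: "Email Content Pattern", category: "email", pattern: "[a-z.]+@example\\.test", tenantId: "demo" },
  { name: "Project Codename", category: "trade_secret", pattern: "Project\\s+[A-Z][a-z]+", tenantId: "demo" },
];

const effectiveRules = layerRules(globalCatalog, tenantCatalog);
console.log(effectiveRules.map((r) => `${r.name} (${r.tenantId ?? "global"})`));
```

Here the merged catalog holds three rules: the tenant's version of the email rule, the untouched global SSN rule, and the tenant-only trade_secret rule.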
Visual — Tier B (consumer call sites). Real in-toolbox dependents: preference-modeler's by-segment cross-tab and segmentation-studio's cohort resolution call min-n-check before display; cross-spoke joins call tokenize; insight-card surfaces call redact before rendering free text.
5. How it fits in the toolbox
Data flow:
- Consumes: nothing from other spokes. Its only inputs are the caller's text/values and its own rule catalog (the pii_rules table, seeded global rules plus optional per-tenant overrides). The min-N gate is a pure evaluator: the caller supplies respondentCount (typically fetched from segmentation-studio) and threshold (the calling surface's policy); this spoke does not own segment metadata.
- Emits: no insights and no metric × segment lists of its own. It returns labeled privacy results: RedactionResponse, TokenizationResponse, TransformResponse, MinNCheckResponse, PiiRulesResponse. Consumers vendor src/spokes/data-anonymizer/contracts/types.ts.
- Feeds: every spoke that surfaces team-level or respondent-level data. It is the privacy floor the others stand on: they call it before displaying or transmitting respondent-level data.
- Boundary note: the spoke labels and redacts on request; it does not refuse to return PII. Trust-boundary enforcement is a consumer concern by design.
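Because the gate is a pure comparison and enforcement sits with the consumer, the whole pattern fits in a few lines. The sketch below mirrors the { segmentId, respondentCount, threshold } → { ok, reason? } shape described earlier; the wording of the reason string, the threshold of 5, the segment ID, and the displaySegmentMean helper are illustrative assumptions rather than the spoke's actual output or policy.

```typescript
interface MinNCheckRequest {
  segmentId: string;
  respondentCount: number;
  threshold: number;
}

interface MinNCheckResponse {
  ok: boolean;
  reason?: string;
}

// Pure evaluator: ok = respondentCount >= threshold.
function minNCheckSketch(req: MinNCheckRequest): MinNCheckResponse {
  if (req.respondentCount >= req.threshold) return { ok: true };
  return {
    ok: false,
    // ASSUMPTION: the real reason text is not quoted in this document.
    reason: `segment ${req.segmentId} has ${req.respondentCount} respondents; minimum is ${req.threshold}`,
  };
}

// Consumer-side enforcement: the gate only answers; the caller suppresses.
function displaySegmentMean(mean: number, gate: MinNCheckResponse): string {
  return gate.ok ? mean.toFixed(2) : "suppressed";
}

const smallSegment = minNCheckSketch({ segmentId: "eng-emea", respondentCount: 3, threshold: 5 });
const largeSegment = minNCheckSketch({ segmentId: "eng-emea", respondentCount: 5, threshold: 5 });

console.log(displaySegmentMean(4.2, smallSegment)); // suppressed
console.log(displaySegmentMean(4.2, largeSegment)); // 4.20
```

A count equal to the threshold passes, since the comparison is >= rather than >.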
Visual — Tier B (typographic data-flow). caller text / value + rule catalog → Data Anonymizer { detect · redact · tokenize · transform · gate } → labeled privacy result the consumer trusts before display.
6. Commercialization / packaging
Data Anonymizer is a service primitive, not a standalone product — it is the privacy layer the consuming surfaces and the broader compensation/analytics offerings stand on, rather than something sold on its own.
- Data-license posture: the rule catalog is regex authored in-repo; there is no third-party licensed data inside this spoke, which is part of why it can be a free-standing primitive. The master tokenization secret (DATA_ANONYMIZER_TOKEN_SECRET) is a per-deploy env var, not shipped data.
- Auth posture: the read endpoints (/pii-rules, /strategies, /health) are public; the write endpoints (/redact, /tokenize, /transform, /min-n-check) require the toolbox service key (POST gate).
- Anything about pricing tiers or packaged offerings is (TBD); not earned yet, so not stated.
Visual — (TBD — product-tier placement diagram showing the privacy primitive under the consuming surfaces).
7. The vision
A privacy floor every people-analytics surface calls by reflex, with detection that classifies correctly even when rules collide, tokens that stay consistent across services, and a small-group gate the whole portfolio shares, so that getting privacy right is the default rather than a responsibility each team carries alone.
The near-term direction lives in the spoke's own follow-up cards: tuning the content regexes at the root (the donor's phone regex over-grabs trailing punctuation; the SSN regex doesn't gracefully handle a trailing period — today the risk-priority resolution masks the symptom) and broadening the transform strategy set for synthetic-realistic export pipelines. Both are filed as PAT-4-FU follow-ups, not yet shipped.
Visual — (TBD — a rule-catalog coverage map: categories covered vs. follow-up-tuning queue).
8. Current status
Grounded in the real code state (contract 1.2.0, src/spokes/data-anonymizer/, registry status: "live"):
- Shipped: the four-plus-one primitives, all live. Routes: GET /pii-rules (layered global + tenant rules), POST /redact (redacted text + audited spans), POST /tokenize (HMAC, per-tenant key, cached in tokenization_maps), POST /transform (19 named strategies), GET /strategies (discovery), POST /min-n-check, GET /health. Risk-prioritized overlap resolution. Seeded catalog: 24 global rules (drizzle/0005_pat4_seed.sql) plus a demo tenant override + two trade_secret rules (drizzle/0011_pat4_fu_a_tenant_seed.sql). MCP tools registered (data-anonymizer.pii-rules, .redact, .tokenize, .transform, .strategies, .min-n-check, .emit-insight-cards, .health). Audit table redaction_audit_log.
- In flight / planned (PAT-4-FU): root-cause regex tuning for the phone/SSN content patterns; a broader transform strategy set for HRIS-export / demo-data pipelines.
Note: the spoke README still cites CONTRACT_VERSION = "1.1.0" and lists only four endpoints; the live contract is 1.2.0 and the transform + strategies endpoints are wired (registry + routes confirm). The README lags the code.
Visual — Tier B (in-repo reference). Live status is verifiable at GET /api/spokes/data-anonymizer/health and the registry entry in src/lib/contracts/registry.ts (status: "live", contract imported from the spoke).
Load-bearing worked example (real, traced through the spoke's own code)
A real POST /redact call, traced through detectSpans + apply (src/lib/pii-rules/detect.ts, apply.ts) against the seeded content rules pat4-rule-email-content, pat4-rule-ssn-content, and pat4-rule-phone-content (drizzle/0005_pat4_seed.sql).
Input text:
Contact Jane at jane.doe@acme.com or SSN 123-45-6789
What happens, step by step:
- The email-content rule ([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}, category email, risk high) matches jane.doe@acme.com.
- The SSN-content rule (\d{3}-?\d{2}-?\d{4}, category identifier, risk high) matches 123-45-6789. The broad phone-content rule (category phone, risk medium) also spans that same substring.
- Detection sorts highest-risk-first; dropOverlaps keeps the identifier (high) match and discards the overlapping phone (medium) match. This is the HRIS-tuned resolution: the SSN is never mislabeled as a phone number.
- Each kept span is replaced with [REDACTED:<CATEGORY>].
Result (shape per the RedactionResponse contract; values traced from the code, not invented):
redactedText: Contact Jane at [REDACTED:EMAIL] or SSN [REDACTED:IDENTIFIER]
spans:
- { category: "email", risk: "high", ruleName: "Email Content Pattern", matched: "jane.doe@acme.com", redactedAs: "[REDACTED:EMAIL]" }
- { category: "identifier", risk: "high", ruleName: "SSN Content Pattern", matched: "123-45-6789", redactedAs: "[REDACTED:IDENTIFIER]" }
What a practitioner does with it: render redactedText into the UI; keep spans in the privacy review so they can show what was masked and why (category + risk + the rule that fired) without ever retaining the raw values. The free-text token "Jane" is left intact — it is a bare first name with no header context, so no content rule fires on it; that boundary (name-in-prose) is exactly where a consumer would add a tenant rule or fall back to transform.
Every value above is traced from the spoke's real detection/redaction code and seeded rule catalog. No metric or figure here is invented.
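For readers who want to reproduce the trace, the sketch below is a small stand-in for the apply step. It is not the spoke's detectSpans or apply code. It uses only the two patterns quoted in the worked example (email and SSN); the phone rule is left out because its pattern is not given here, so no overlap arises and the resolution step is skipped. The SpanSketch shape, the RULES_SKETCH table, and the redactSketch name are assumptions.

```typescript
interface SpanSketch {
  start: number;
  end: number;
  category: string;
  matched: string;
  redactedAs: string;
}

// Patterns copied from the worked example; everything else is illustrative.
const RULES_SKETCH = [
  { category: "email", pattern: /[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}/ },
  { category: "identifier", pattern: /\d{3}-?\d{2}-?\d{4}/ },
];

function redactSketch(text: string): { redactedText: string; spans: SpanSketch[] } {
  const spans: SpanSketch[] = [];
  for (const rule of RULES_SKETCH) {
    // A fresh global regex per call avoids shared lastIndex state.
    const re = new RegExp(rule.pattern.source, "g");
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      spans.push({
        start: m.index,
        end: m.index + m[0].length,
        category: rule.category,
        matched: m[0],
        redactedAs: `[REDACTED:${rule.category.toUpperCase()}]`,
      });
    }
  }
  spans.sort((a, b) => a.start - b.start);

  // Replace right-to-left so the offsets of earlier spans stay valid.
  let redactedText = text;
  for (let i = spans.length - 1; i >= 0; i--) {
    const s = spans[i];
    redactedText = redactedText.slice(0, s.start) + s.redactedAs + redactedText.slice(s.end);
  }
  return { redactedText, spans };
}

const traced = redactSketch("Contact Jane at jane.doe@acme.com or SSN 123-45-6789");
console.log(traced.redactedText);
// Contact Jane at [REDACTED:EMAIL] or SSN [REDACTED:IDENTIFIER]
```

As in the trace, the bare first name "Jane" passes through untouched, since neither pattern matches it.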