Data Anonymizer — plain-language explainer

Data Anonymizer is the privacy floor every other toolbox component stands on: it finds personal information, masks or tokenizes it, and refuses to show a team-level number when the team is too small to be anonymous.

A People Analytics Toolbox component. Built to the portfolio Explainer Standard v1.0. Every claim below is grounded in the spoke's own code and contracts (src/spokes/data-anonymizer/, contract 1.2.0); anything not yet built is marked (TBD).


1. What is it?

Data Anonymizer is a stateless privacy primitive: a small set of operations a service calls at the moment data is about to leave a trusted boundary. The four core operations are: detect which fields and text contain personal information, redact that information out of free text, tokenize an identifier into a stable but non-reversible token, and gate a cohort number against a minimum-headcount threshold so small groups never get reported. A fifth operation, transform, substitutes a fake-but-realistic value instead of a mask.

It is deliberately not a job-running platform. There is no async pipeline, no policy engine, no stored copy of your HRIS. Each operation is a pure call: text or a value goes in, a labeled result comes back. The complexity lives in the rule catalog and the risk-prioritized overlap resolution, not in orchestration.

Visual — Tier B (the four primitives). What the spoke exposes, as a step list:

  • detect: GET /pii-rules lists the active detection rules; field-name and free-text matching run on top of them.
  • redact: POST /redact turns text into redacted text plus flagged spans.
  • tokenize: POST /tokenize turns a value into a deterministic, non-reversible token.
  • transform: POST /transform applies a named anonymization strategy (fake-but-realistic substitution).
  • gate: POST /min-n-check says whether a cohort is large enough to report.

2. What problem does it solve — and why is it different?

The pain it removes: every people-analytics surface eventually wants to show team-level or respondent-level data, and every one of them faces the same three failure modes — leaking PII into a UI a third party might see, joining datasets on a raw identifier that travels with the person, and reporting a "segment of three" that anyone can de-anonymize. Solving this by hand means hand-rolled regexes, ad-hoc hashing, and a min-headcount rule that lives in one analyst's spreadsheet.

The difference, stated as a shift:

  • FROM privacy logic scattered across every spoke, each reinventing detection, hashing, and the small-group rule.
  • TO one component every consumer calls inline at the data egress point, with a shared rule catalog, deterministic tokens that stay consistent across services, and a single min-N gate the whole toolbox checks against.

How it differs from the obvious substitutes (per the spoke's own positioning):

  • vs. enterprise privacy platforms (BigID, OneTrust, Privitar): those are governance + discovery stacks priced per workflow; this is a free-standing service primitive with a handful of endpoints, callable from any HTTP client in a few lines.
  • vs. cloud DLP services (AWS Macie, Google Cloud DLP, Azure Purview): those classify at scale on stored objects; this is sub-second per call, in-process for any dependent service, and its rule set is operator-editable in plain SQL.
  • vs. open-source PII libraries (presidio, scrubadub): those detect well but have no min-N gate, no persisted deterministic tokenization, and no risk-prioritized overlap resolution. This spoke combines detection with all three.

3. How does it work?

Inputs → method → outputs, concretely, per primitive:

  • detect / redact. Input is text (optionally a fieldName and a category filter). The detector runs the active rule catalog — both column-header patterns and free-text content patterns — over the text. Matches are sorted highest-risk-first (the category→risk map in contracts/types.ts: identifier, email, financial, name, trade_secret are high; phone, address, company, survey are medium), then overlapping matches are dropped in favor of the higher-risk, longer one. Output is { redactedText, spans, fieldNameMatched }, where each span carries { start, end, category, risk, ruleName, matched, redactedAs } — an audit trail of exactly what was masked.
  • tokenize. Input is { tenantId, field, value }. The value is run through HMAC-SHA256 under a per-tenant derived key (the master secret is the env var DATA_ANONYMIZER_TOKEN_SECRET). The same input always yields the same token within a tenant. HMAC is one-way, so the token cannot be inverted, and without the master secret it cannot be recomputed from a guessed value either. Tokens are cached in the tokenization_maps table so a consumer can re-tokenize deterministically without recomputation.
  • transform. Input is { tenantId, field, value, strategy }. Applies one of the named anonymization strategies (fake_name, safe_email, salary_band, shift_date, deterministic_id, redact, … 19 in the enum) to produce a fake-but-realistic substitute rather than a mask.
  • gate. Input is { segmentId, respondentCount, threshold }; output is { ok, reason? }. A pure comparison: ok = respondentCount >= threshold, with a human-readable reason when it fails.
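
The tokenize step above can be sketched as follows. This is a minimal illustration, not the spoke's code: the key-derivation scheme (an HMAC of the tenant ID under the master secret), the "field:value" message layout, and the function names deriveTenantKey and tokenize are assumptions made here. Only HMAC-SHA256, the per-tenant key, and the DATA_ANONYMIZER_TOKEN_SECRET variable come from the description above.

```typescript
import { createHmac } from "node:crypto";

// Hypothetical derivation: per-tenant key = HMAC-SHA256(masterSecret, tenantId).
// The spoke's real derivation scheme is not shown in this explainer.
function deriveTenantKey(masterSecret: string, tenantId: string): Buffer {
  return createHmac("sha256", masterSecret).update(tenantId).digest();
}

// Deterministic token: HMAC-SHA256 over "field:value" under the tenant key.
// Binding the field name into the message is an assumption, made so the same
// value in two different fields yields two different tokens.
function tokenize(
  masterSecret: string,
  tenantId: string,
  field: string,
  value: string,
): string {
  const key = deriveTenantKey(masterSecret, tenantId);
  return createHmac("sha256", key).update(`${field}:${value}`).digest("hex");
}
```

In a deployment the masterSecret argument would come from DATA_ANONYMIZER_TOKEN_SECRET; the sketch takes it as a parameter so it stays easy to test. The tokenization_maps cache is left out.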

Data sources and the science behind it. The detection rules are a regex catalog (lifted from the donor data-anonymizer service and tuned), seeded as 24 default global rules covering names, identifiers, email/phone, addresses, company/department, financial, and survey fields. The risk model is the category→risk table in the contract. There are no external priors here — this is rule-driven privacy logic, not a statistical model. (No Principia, BLS, O*NET, or NAICS dependency.)

Differentiation beat. The practitioner's real question is not "did it find some PII" — it is "can I trust which classification it chose when two rules collide?" The risk-prioritized resolution answers that directly and is tuned for HRIS / survey workflows: an SSN-formatted string (123-45-6789) is matched by both the broad phone-content rule (medium) and the SSN-content rule (high); the spoke classifies it as identifier (high), because misclassifying an SSN as a phone number is materially worse than the inverse. Generic libraries treat all matches as equal-priority.
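
The collision handling described above can be sketched as a sort followed by a greedy overlap filter. The Match shape, the numeric rank table, the end-exclusive offsets, and the name resolveOverlaps are assumptions for illustration; the spoke's own implementation may differ in detail.

```typescript
type Risk = "high" | "medium";

// Hypothetical match shape; `end` is treated as exclusive in this sketch.
interface Match {
  start: number;
  end: number;
  category: string;
  risk: Risk;
}

const RISK_RANK: Record<Risk, number> = { high: 2, medium: 1 };

function resolveOverlaps(matches: Match[]): Match[] {
  // Highest risk first; among equal risk, the longer match first.
  const ranked = matches.slice().sort(
    (a, b) =>
      RISK_RANK[b.risk] - RISK_RANK[a.risk] ||
      (b.end - b.start) - (a.end - a.start) ||
      a.start - b.start,
  );
  const kept: Match[] = [];
  for (const m of ranked) {
    // Two half-open ranges overlap when each starts before the other ends.
    const collides = kept.some((k) => m.start < k.end && k.start < m.end);
    if (!collides) kept.push(m);
  }
  // Return in document order so callers can walk the text left to right.
  return kept.sort((a, b) => a.start - b.start);
}
```

Given a phone (medium) match and an identifier (high) match over the same substring, only the identifier match survives, which is the behavior the SSN example relies on.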

4. What does it enable?

Concrete uses a practitioner would recognize:

  • Redact free text before it hits a UI — run a survey verbatim or a manager comment through /redact before rendering it on an insight card a third party might see; the returned spans are the audit of what was masked.
  • Join datasets without carrying the person — tokenize an employee ID or email per tenant, then join on the token across spokes; the same person resolves to the same token, but the token reveals nothing.
  • Enforce the small-group rule everywhere — call /min-n-check before showing any segment-level rollup, so a "segment of three" is suppressed by the same gate across the whole toolbox.
  • Generate fake-but-realistic demo or export data — apply transform strategies (fake_name, safe_email, salary_band) to produce a usable non-sensitive copy of an HRIS extract.
  • Layer tenant-specific rules — a tenant can add its own trade_secret rules on top of the global catalog; GET /pii-rules?tenantId= merges them (tenant row wins on name collision).
  • Audit a privacy review — the redaction_audit_log table records input hash + categories + span count per call, so a reviewer can verify what was masked without retaining the raw text.
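
The tenant-rule layering mentioned in the list above (tenant row wins on name collision) can be sketched as a keyed merge. The PiiRule fields and the layerRules name are hypothetical; the real pii_rules columns are not listed in this explainer.

```typescript
// Hypothetical rule shape; tenantId is null for a global rule.
interface PiiRule {
  name: string;
  category: string;
  pattern: string;
  tenantId: string | null;
}

function layerRules(globalRules: PiiRule[], tenantRules: PiiRule[]): PiiRule[] {
  const byName = new Map<string, PiiRule>();
  for (const rule of globalRules) byName.set(rule.name, rule);
  // Setting an existing key replaces the value, so the tenant row wins.
  for (const rule of tenantRules) byName.set(rule.name, rule);
  return Array.from(byName.values());
}
```

Because a Map keeps a key's original insertion position when its value is replaced, an overriding tenant rule takes the slot of the global rule it shadows, and tenant-only rules land at the end.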

Visual — Tier B (consumer call sites). Real in-toolbox dependents: preference-modeler's by-segment cross-tab and segmentation-studio's cohort resolution call min-n-check before display; cross-spoke joins call tokenize; insight-card surfaces call redact before rendering free text.
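
A consumer-side use of the gate before display might look like the sketch below. The comparison (respondentCount >= threshold) and the { ok, reason? } shape come from the description of the gate; the reason wording, the SegmentRow shape, and the suppressSmallSegments helper are assumptions, and the threshold of 5 in the usage note is purely illustrative.

```typescript
interface MinNCheckRequest {
  segmentId: string;
  respondentCount: number;
  threshold: number;
}

interface MinNCheckResponse {
  ok: boolean;
  reason?: string;
}

// Pure comparison, as described for the gate primitive.
function minNCheck(req: MinNCheckRequest): MinNCheckResponse {
  if (req.respondentCount >= req.threshold) return { ok: true };
  return {
    ok: false,
    // Hypothetical wording; the spoke's actual reason text is not shown here.
    reason: `segment ${req.segmentId} has ${req.respondentCount} respondents; minimum is ${req.threshold}`,
  };
}

// Hypothetical rollup row; score is nulled when the segment is too small.
interface SegmentRow {
  segmentId: string;
  respondentCount: number;
  score: number | null;
}

function suppressSmallSegments(rows: SegmentRow[], threshold: number): SegmentRow[] {
  return rows.map((row) => {
    const check = minNCheck({
      segmentId: row.segmentId,
      respondentCount: row.respondentCount,
      threshold,
    });
    return check.ok ? row : { ...row, score: null };
  });
}
```

With a threshold of 5, a segment of three comes back with score set to null while a segment of twelve passes through unchanged.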

5. How it fits in the toolbox

Data flow:

  • Consumes: nothing from other spokes. Its only inputs are the caller's text/values and its own rule catalog (the pii_rules table, seeded global rules plus optional per-tenant overrides). The min-N gate is a pure evaluator: the caller supplies respondentCount (typically fetched from segmentation-studio) and threshold (the calling surface's policy); this spoke does not own segment metadata.
  • Emits: no insights and no metric × segment lists of its own. It returns labeled privacy results: RedactionResponse, TokenizationResponse, TransformResponse, MinNCheckResponse, PiiRulesResponse. Consumers vendor src/spokes/data-anonymizer/contracts/types.ts.
  • Feeds: every spoke that surfaces team-level or respondent-level data. It is the privacy floor the others stand on: they call it before displaying or transmitting respondent-level data.
  • Boundary note: the spoke labels and redacts on request; it does not refuse to return PII. Trust-boundary enforcement is a consumer concern by design.

Visual — Tier B (typographic data-flow). caller text / value + rule catalog → Data Anonymizer { detect · redact · tokenize · transform · gate } → labeled privacy result the consumer trusts before display.

6. Commercialization / packaging

Data Anonymizer is a service primitive, not a standalone product — it is the privacy layer the consuming surfaces and the broader compensation/analytics offerings stand on, rather than something sold on its own.

  • Data-license posture: the rule catalog is regex authored in-repo; there is no third-party licensed data inside this spoke, which is part of why it can be a free-standing primitive. The master tokenization secret (DATA_ANONYMIZER_TOKEN_SECRET) is a per-deploy env var, not shipped data.
  • Auth posture: the read endpoints (/pii-rules, /strategies, /health) are public; the write endpoints (/redact, /tokenize, /transform, /min-n-check) require the toolbox service key (POST gate).
  • Anything about pricing tiers or packaged offerings is (TBD) — not earned yet, so not stated.
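
The auth posture above can be made concrete with a small request builder that encodes the public-read / keyed-write split in its types. This is a sketch under assumptions: the base path is inferred from the health URL cited under Current status, and the header that carries the service key is not named in this explainer, so it is taken as a caller-supplied parameter. The names buildRead, buildWrite, and ServiceAuth are hypothetical.

```typescript
// Assumption: all routes share the prefix of the documented health URL.
const BASE_PATH = "/api/spokes/data-anonymizer";

type ReadRoute = "/pii-rules" | "/strategies" | "/health";
type WriteRoute = "/redact" | "/tokenize" | "/transform" | "/min-n-check";

// The header name is supplied by the caller because it is not documented here.
interface ServiceAuth {
  headerName: string;
  serviceKey: string;
}

interface BuiltRequest {
  url: string;
  method: "GET" | "POST";
  headers: Record<string, string>;
  body?: string;
}

// Read endpoints are public, so no auth header is attached.
function buildRead(origin: string, route: ReadRoute): BuiltRequest {
  return { url: `${origin}${BASE_PATH}${route}`, method: "GET", headers: {} };
}

// Write endpoints require the toolbox service key.
function buildWrite(
  origin: string,
  route: WriteRoute,
  payload: unknown,
  auth: ServiceAuth,
): BuiltRequest {
  if (auth.serviceKey.length === 0) {
    throw new Error(`${route} requires the toolbox service key`);
  }
  return {
    url: `${origin}${BASE_PATH}${route}`,
    method: "POST",
    headers: {
      "content-type": "application/json",
      [auth.headerName]: auth.serviceKey,
    },
    body: JSON.stringify(payload),
  };
}
```

The result can be handed to any HTTP client; keeping the builder free of network calls makes the split easy to check in isolation.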

Visual — (TBD — product-tier placement diagram showing the privacy primitive under the consuming surfaces).

7. The vision

A privacy floor every people-analytics surface calls by reflex: detection that classifies correctly even when rules collide, tokens that stay consistent across services, and a small-group gate the whole portfolio shares, so that getting privacy right is the default rather than something each team has to solve on its own.

The near-term direction lives in the spoke's own follow-up cards: tuning the content regexes at the root (the donor's phone regex over-grabs trailing punctuation; the SSN regex doesn't gracefully handle a trailing period — today the risk-priority resolution masks the symptom) and broadening the transform strategy set for synthetic-realistic export pipelines. Both are filed as PAT-4-FU follow-ups, not yet shipped.

Visual — (TBD — a rule-catalog coverage map: categories covered vs. follow-up-tuning queue).

8. Current status

Grounded in the real code state (contract 1.2.0, src/spokes/data-anonymizer/, registry status: "live"):

  • Shipped: the four-plus-one primitives, all live.
    • Routes: GET /pii-rules (layered global + tenant rules), POST /redact (redacted text + audited spans), POST /tokenize (HMAC, per-tenant key, cached in tokenization_maps), POST /transform (19 named strategies), GET /strategies (discovery), POST /min-n-check, GET /health.
    • Detection: risk-prioritized overlap resolution.
    • Seeded catalog: 24 global rules (drizzle/0005_pat4_seed.sql) plus a demo tenant override + two trade_secret rules (drizzle/0011_pat4_fu_a_tenant_seed.sql).
    • MCP tools registered: data-anonymizer.pii-rules, .redact, .tokenize, .transform, .strategies, .min-n-check, .emit-insight-cards, .health.
    • Audit table: redaction_audit_log.
  • In flight / planned (PAT-4-FU): root-cause regex tuning for the phone/SSN content patterns; a broader transform strategy set for HRIS-export / demo-data pipelines.

Note: the spoke README still cites CONTRACT_VERSION = "1.1.0" and lists only four endpoints; the live contract is 1.2.0 and the transform + strategies endpoints are wired (registry + routes confirm). The README lags the code.

Visual — Tier B (in-repo reference). Live status is verifiable at GET /api/spokes/data-anonymizer/health and the registry entry in src/lib/contracts/registry.ts (status: "live", contract imported from the spoke).


Load-bearing worked example (real, traced through the spoke's own code)

A real POST /redact call, traced through detectSpans + apply (src/lib/pii-rules/detect.ts, apply.ts) against the seeded content rules pat4-rule-email-content, pat4-rule-ssn-content, and pat4-rule-phone-content (drizzle/0005_pat4_seed.sql).

Input text:

Contact Jane at jane.doe@acme.com or SSN 123-45-6789

What happens, step by step:

  1. The email-content rule ([a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\.[a-zA-Z]{2,}, category email, risk high) matches jane.doe@acme.com.
  2. The SSN-content rule (\d{3}-?\d{2}-?\d{4}, category identifier, risk high) matches 123-45-6789. The broad phone-content rule (category phone, risk medium) also spans that same substring.
  3. Detection sorts highest-risk-first; dropOverlaps keeps the identifier (high) match and discards the overlapping phone (medium) match. This is the HRIS-tuned resolution: the SSN is never mislabeled as a phone number.
  4. Each kept span is replaced with [REDACTED:<CATEGORY>].

Result (shape per the RedactionResponse contract; values traced from the code, not invented):

  • redactedText: Contact Jane at [REDACTED:EMAIL] or SSN [REDACTED:IDENTIFIER]
  • spans:
    • { category: "email", risk: "high", ruleName: "Email Content Pattern", matched: "jane.doe@acme.com", redactedAs: "[REDACTED:EMAIL]" }
    • { category: "identifier", risk: "high", ruleName: "SSN Content Pattern", matched: "123-45-6789", redactedAs: "[REDACTED:IDENTIFIER]" }

What a practitioner does with it: render redactedText into the UI, and keep the spans for the privacy review so it can show what was masked and why (category + risk + the rule that fired). Each span's matched field holds the raw value, so drop that field before storing spans if the review must not retain raw values. The word "Jane" is left intact: it is a bare first name with no header context, so no content rule fires on it. That boundary (name-in-prose) is exactly where a consumer would add a tenant rule or fall back to transform.

Every value above is traced from the spoke's real detection/redaction code and seeded rule catalog. No metric or figure here is invented.
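
The worked example can be reproduced with a short standalone sketch. It is not the spoke's detectSpans or apply code: it uses only the email and SSN patterns quoted in the steps, leaves out the phone rule (its pattern is not given in this explainer) and therefore the overlap step, and assumes end-exclusive offsets. The Rule shape and the redact function name are hypothetical.

```typescript
interface Rule {
  ruleName: string;
  category: string;
  risk: "high" | "medium";
  pattern: string; // regex source, compiled with the global flag below
}

interface Span {
  start: number;
  end: number; // exclusive (an assumption of this sketch)
  category: string;
  risk: "high" | "medium";
  ruleName: string;
  matched: string;
  redactedAs: string;
}

// The two content patterns quoted in steps 1 and 2 of the worked example.
const RULES: Rule[] = [
  {
    ruleName: "Email Content Pattern",
    category: "email",
    risk: "high",
    pattern: "[a-zA-Z0-9._%+-]+@[a-zA-Z0-9.-]+\\.[a-zA-Z]{2,}",
  },
  {
    ruleName: "SSN Content Pattern",
    category: "identifier",
    risk: "high",
    pattern: "\\d{3}-?\\d{2}-?\\d{4}",
  },
];

function redact(text: string, rules: Rule[]): { redactedText: string; spans: Span[] } {
  const spans: Span[] = [];
  for (const rule of rules) {
    const re = new RegExp(rule.pattern, "g");
    let m: RegExpExecArray | null;
    while ((m = re.exec(text)) !== null) {
      spans.push({
        start: m.index,
        end: m.index + m[0].length,
        category: rule.category,
        risk: rule.risk,
        ruleName: rule.ruleName,
        matched: m[0],
        redactedAs: `[REDACTED:${rule.category.toUpperCase()}]`,
      });
    }
  }
  spans.sort((a, b) => a.start - b.start);
  // Replace right to left so earlier offsets stay valid while splicing.
  let redactedText = text;
  for (const s of spans.slice().reverse()) {
    redactedText = redactedText.slice(0, s.start) + s.redactedAs + redactedText.slice(s.end);
  }
  return { redactedText, spans };
}
```

Calling redact("Contact Jane at jane.doe@acme.com or SSN 123-45-6789", RULES) yields the redactedText shown in the result above, with one email span and one identifier span.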