This document specifies the engineering architecture for identity resolution in HAAK. The ontological grounding is in [[11-identity]]; this document is about how identifiers flow through the system, how external registries are queried, and how the local identity hub grows over time.
## Hub-and-Spoke Model
The local entity_identifiers table is the authoritative hub. External registries are read-only spokes. HAAK never writes upstream.
```
              ┌──────────────┐
              │   entity_    │
              │ identifiers  │
              │ (local hub)  │
              └──────┬───────┘
     ┌───────┬───────┼───────┬──────────┐
     ▼       ▼       ▼       ▼          ▼
 contacts  github  orcid  semantic   spotify
   .db      API     API   scholar      API
                            API
```
Each spoke contributes identifiers in a single namespace. The hub holds the canonical mapping. Resolution runs locally — no external service has authority over canonical identity. External services provide evidence; the hub renders judgment.
Data flow is always inward. An enrichment script queries an external API, receives candidate identifiers, evaluates confidence, and writes to entity_identifiers. The hub never pushes identifiers outward. The external registry never learns what canonical_id its identifier maps to locally.
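The inward write can be sketched as a minimal, idempotent insert into the hub. This is a hypothetical minimal schema (the real entity_identifiers table likely has more columns); the `INSERT OR IGNORE` makes re-running an enrichment pass a no-op:

```python
import sqlite3

# Hypothetical minimal schema for the local hub (assumed columns).
SCHEMA = """
CREATE TABLE IF NOT EXISTS entity_identifiers (
    canonical_id TEXT NOT NULL,
    namespace    TEXT NOT NULL,
    external_id  TEXT NOT NULL,
    confidence   REAL NOT NULL,
    source       TEXT NOT NULL,
    created_at   TEXT NOT NULL DEFAULT (datetime('now')),
    UNIQUE (namespace, external_id)
)
"""

def write_inward(conn: sqlite3.Connection, candidates: list[tuple]) -> int:
    """Write candidate (canonical_id, namespace, external_id, confidence,
    source) rows into the hub; duplicates are ignored, so enrichment
    passes stay idempotent. Returns the number of rows actually added."""
    cur = conn.executemany(
        "INSERT OR IGNORE INTO entity_identifiers "
        "(canonical_id, namespace, external_id, confidence, source) "
        "VALUES (?, ?, ?, ?, ?)",
        candidates,
    )
    conn.commit()
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
candidates = [  # placeholder identifiers, not real data
    ("person:example-1", "orcid", "0000-0000-0000-0000", 0.9, "orcid-enrichment"),
    ("person:example-1", "github", "example-user", 0.9, "repos-pass"),
]
n = write_inward(conn, candidates)
```

Re-running `write_inward` with the same candidates adds nothing, which matters for the idempotent-rebuild invariant below.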
## Current State (v1)
The resolution script (scripts/resolve_identities.py) runs five passes over local data:
| Pass | Source | Namespace(s) | Method | Confidence |
|---|---|---|---|---|
| 1 | contacts.db | contact, email, phone | Union-find dedup: email merge unrestricted, phone merge within name clusters, name merge excluding noise | 1.0 (email), 0.8 (phone), 0.85 (name) |
| 2 | personas/ | persona | Normalized name match against contact canonical groups | 0.95 |
| 3 | papers.db | paper-author | Family name + first initial match | 0.9 |
| 4 | repos.db | github | GitHub username match against persona frontmatter or contact | 0.9 |
| 5 | Hard-coded | model, agent-session | Claude model registration, session registration from persona frontmatter | 1.0 |
Output: 12,081 identifiers across 9 namespaces, 4,968 canonical groups from 6,888 contacts (28% dedup).
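The Pass 1 dedup can be illustrated with a minimal union-find sketch for the unrestricted email merge (the phone and name merges add cluster constraints on top of the same machinery). Contact IDs and addresses here are illustrative:

```python
class UnionFind:
    """Minimal union-find used to merge contacts that share evidence."""
    def __init__(self):
        self.parent: dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def merge_by_email(contacts: dict[str, list[str]]) -> dict[str, str]:
    """Unrestricted email merge: contacts sharing any address join one
    group. Returns contact_id -> canonical root."""
    uf = UnionFind()
    seen: dict[str, str] = {}  # email -> first contact holding it
    for cid, emails in contacts.items():
        uf.find(cid)
        for e in emails:
            e = e.strip().lower()
            if e in seen:
                uf.union(seen[e], cid)
            else:
                seen[e] = cid
    return {cid: uf.find(cid) for cid in contacts}

groups = merge_by_email({
    "c1": ["zmainen@neuro.org"],
    "c2": ["zmainen@neuro.org", "z@gmail.com"],
    "c3": ["z@gmail.com"],
    "c4": ["other@x.org"],
})
```

Here c1, c2, and c3 chain into one canonical group through shared addresses, while c4 stays separate.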
Known limitations:
- Email alias resolution is string-exact (zmainen@neuro ≠ zachary.mainen@neuro)
- ORCID populated for 1 person only
- No external registry enrichment yet
- Paper-author matching uses family name + initial only (ambiguous for common names)
- Multi-value email fields partially handled (semicolons split, but some leak)
## Enrichment Pipeline Pattern
Adding a new external registry follows five steps. This is the reusable pattern for all future enrichment.
### Step 1: Register the registry
Add the registry as an entity. In practice, this means choosing a namespace slug and documenting what identifiers it issues.
```python
REGISTRY = {
    "namespace": "semantic-scholar",
    "name": "Semantic Scholar",
    "id_format": "AUTH-XXXXXXXX (author), PAPER-XXXXXXXX (paper)",
    "api_base": "https://api.semanticscholar.org/graph/v1",
    "rate_limit": "100 req/5 min (unauthenticated)",
}
```
### Step 2: Define the namespace
Add the namespace to the resolution script's known namespaces. The namespace determines how identifiers are stored and queried.
### Step 3: Write the adapter
An adapter script queries the external API and returns candidate identifier pairs:
```python
def enrich_from_semantic_scholar(canonical_id: str, name: str, known_papers: list[str]) -> list[tuple]:
    """Returns [(namespace, external_id, confidence, source), ...]"""
    # Query API by name
    # Cross-check against known paper DOIs for disambiguation
    # Return high-confidence matches only
```
Adapters are stateless functions. They take a canonical_id and whatever local evidence is available (name, email, known DOIs), query the external API, and return candidate identifiers with confidence scores.
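A testable sketch of this pattern, with the API call injected as a function so the adapter stays a pure mapping from evidence to candidates. The overlap thresholds and the `search_authors` shape are assumptions, not the real Semantic Scholar client:

```python
def enrich_from_s2(
    canonical_id: str,
    name: str,
    known_dois: set[str],
    search_authors,          # injected: name -> [(author_id, set_of_dois)]
    min_overlap: int = 2,    # assumed threshold, tune against real data
) -> list[tuple]:
    """Stateless adapter sketch: query candidates by name, keep those whose
    paper list overlaps known local DOIs, and return
    (namespace, external_id, confidence, source) tuples."""
    out = []
    for author_id, candidate_dois in search_authors(name):
        overlap = len(known_dois & candidate_dois)
        if overlap >= min_overlap:
            confidence = 0.9 if overlap >= 3 else 0.7
            out.append(("semantic-scholar-author", author_id,
                        confidence, "semantic-scholar-enrichment"))
    return out

# Stubbed search result standing in for the real API call.
def fake_search(name: str):
    return [("AUTH-1", {"10.1/a", "10.1/b", "10.1/c"}),
            ("AUTH-2", {"10.1/x"})]

matches = enrich_from_s2(
    "person:example", "Z Mainen", {"10.1/a", "10.1/b", "10.1/c"}, fake_search)
```

Because the API client is injected, the adapter can be unit-tested without network access, and the same disambiguation logic can be reused for ORCID.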
### Step 4: Wire into the pipeline
Add the adapter as a pass in resolve_identities.py. Enrichment passes run after dedup passes — they operate on canonical groups, not raw contacts.
### Step 5: Record provenance
Every enrichment-sourced identifier gets:
- source: adapter name (e.g., "semantic-scholar-enrichment")
- confidence: set by the adapter based on match quality
- Timestamp via created_at
The enrichment act itself is a situation in the ontological sense — it has actors (the script, the API), methods (name matching, DOI cross-check), and outputs (new identifier-belongings).
## Prioritized Roadmap
Registries ordered by value density — how many local entities they can enrich per unit of implementation effort.
### 1. Semantic Scholar (academic author disambiguation)
Value: ~300 personas and paper-author strings can be matched. Resolves the paper-author ambiguity problem (family name + initial is insufficient for "Wang Y" or "Li J"). Author IDs are stable. Paper IDs cross-link to DOIs already in papers.db.
Method: For each persona with known papers, query S2 author search by name, disambiguate by matching DOIs from papers.db against the candidate's paper list. High-overlap = high confidence.
Namespaces added: semantic-scholar-author, semantic-scholar-paper
### 2. Sender resolution (email ↔ contact linking)
Value: Email messages in gmail.db have sender/recipient addresses. Mapping these to canonical entities makes every email a situation with identified participants. Huge volume — ~500K messages.
Method: Extract unique sender addresses from gmail.db, match against entity_identifiers namespace="email". Most will match directly (same email strings). Remaining unmatched addresses go through domain-aware alias resolution (strip dots, check plus-addressing, match local-part patterns within same domain).
Namespaces added: None new — uses existing email namespace but increases coverage.
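The first pass of sender resolution is a plain exact match against the hub's email namespace; only the residue needs alias resolution. A minimal sketch, with an assumed email-to-canonical_id index as it might be loaded from entity_identifiers:

```python
def resolve_senders(
    senders: set[str], email_index: dict[str, str]
) -> tuple[dict[str, str], set[str]]:
    """Exact-match pass: map each sender address to a canonical_id via the
    hub's email namespace; collect the rest for domain-aware alias
    resolution."""
    resolved: dict[str, str] = {}
    unmatched: set[str] = set()
    for addr in senders:
        cid = email_index.get(addr.strip().lower())
        if cid is not None:
            resolved[addr] = cid
        else:
            unmatched.add(addr)
    return resolved, unmatched

# Illustrative index; in practice built from namespace="email" rows.
index = {"zmainen@neuro.org": "person:1", "ana@champ.org": "person:2"}
resolved, unmatched = resolve_senders(
    {"ZMainen@neuro.org", "ana@champ.org", "stranger@x.com"}, index)
```

At ~500K messages the unique-address set is far smaller than the message count, so this pass is cheap; only `unmatched` pays the cost of alias rules.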
### 3. ORCID (systematic enrichment)
Value: Authoritative, persistent, researcher-controlled identifiers. Currently only 1 person has ORCID in the system. All PIs and postdocs at Champalimaud have ORCIDs.
Method: For personas with known publications, query ORCID public API by name + affiliation. Cross-check returned publication list against papers.db DOIs. Store as orcid namespace.
Namespaces added: Enriches existing orcid namespace.
### 4. Spotify (artist identity)
Value: music.db has track metadata with artist names. Spotify API returns artist IDs. Linking artists to canonical entities enables: "what papers has this person published?" → "what music do they make?" (for the rare person who does both).
Method: For tracks in music.db, query Spotify API for artist IDs. Create entities for artists not already in the system. For known persons who are also musicians, merge.
Namespaces added: spotify-artist
### 5. Wikidata crosswalks
Value: Wikidata is a hub of hubs — it maps between ORCID, VIAF, ISNI, IMDb, MusicBrainz, and hundreds of other identifiers. One Wikidata lookup can yield 5+ external IDs.
Method: For high-confidence entities (PIs, public figures), query Wikidata by ORCID or name + affiliation. Harvest all P-number identifier properties. Each becomes a new namespace entry.
Namespaces added: wikidata, plus whatever identifiers Wikidata yields (viaf, isni, musicbrainz-artist, imdb-person, etc.)
### 6. Public archives (entity linking)
Value: public-archives.db has 1.6M entity mentions extracted from DOJ documents, WikiLeaks cables, ICIJ leaks. Linking these to canonical entities enables cross-source triangulation (the core capability described in [[09-situation-graph]]).
Method: Named entity matching with disambiguation. High-profile entities (public figures, institutions) first. Conservative confidence thresholds — false positives in this domain have reputational risk.
Namespaces added: epstein-doj, wikileaks, icij
## Email Alias Resolution (v2)
The highest-impact near-term fix. Current string-exact email matching fragments entities that use multiple addresses at the same domain.
Rules (domain-aware):
- Gmail dot-insensitivity: z.mainen@gmail.com = zmainen@gmail.com
- Plus-addressing: zmainen+list@gmail.com = zmainen@gmail.com
- Institutional aliases: within the same domain, zmainen@neuro.fc.org and zachary.mainen@neuro.fc.org are candidates for merge (require name confirmation)
- Domain equivalence: @neuro.fchampalimaud.org and @fundacaochampalimaud.pt are the same institution (maintain a small domain-equivalence table)
These rules apply only within the email namespace and only increase confidence when the name cluster already overlaps.
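The mechanical rules (dots, plus-addressing, domain equivalence) can be sketched as a single normalization function; the equivalence table entries here are assumed, and institutional local-part aliases are deliberately left alone because they require name confirmation:

```python
# Assumed domain-equivalence table; the real one is maintained by hand.
DOMAIN_EQUIV = {
    "neuro.fchampalimaud.org": "fundacaochampalimaud.pt",
}

def normalize_email(addr: str) -> str:
    """Apply the domain-aware rules: lowercase, plus-addressing, Gmail
    dot-insensitivity, and domain equivalence. Institutional local-part
    aliases are NOT collapsed here; they need name confirmation."""
    local, _, domain = addr.strip().lower().partition("@")
    local = local.split("+", 1)[0]               # strip plus-addressing
    if domain in ("gmail.com", "googlemail.com"):
        local = local.replace(".", "")           # Gmail dot-insensitivity
        domain = "gmail.com"
    domain = DOMAIN_EQUIV.get(domain, domain)    # domain equivalence
    return f"{local}@{domain}"
```

Two addresses are alias candidates when their normalized forms match; per the rule above, the merge is only confirmed when the name cluster already overlaps.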
## Confidence Calibration
Confidence is ordinal, not probabilistic. The scale:
| Range | Meaning | Example |
|---|---|---|
| 1.0 | Definitional | Hard-coded agent registration, manual verification |
| 0.95 | Near-certain | Shared unique email address |
| 0.9 | High | Persona-to-contact name match, DOI cross-validated S2 author |
| 0.8 | Moderate | Phone within name cluster, name-only match with common name |
| 0.7 | Tentative | Single-paper author match, domain-inferred email alias |
| < 0.7 | Review required | Flagged for human confirmation before use |
Merges below 0.7 are never automatic. They are stored as candidates and surfaced for human review.
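The review gate reduces to a simple partition over candidate merges; the tuple shape here is illustrative:

```python
REVIEW_THRESHOLD = 0.7  # merges below this are never automatic

def route_merges(candidates: list[tuple[str, str, float]]):
    """Partition candidate merges: apply automatically at or above the
    threshold, queue the rest for human review."""
    applied, review = [], []
    for cand in candidates:
        (applied if cand[2] >= REVIEW_THRESHOLD else review).append(cand)
    return applied, review

applied, review = route_merges([
    ("c1", "c2", 0.95),   # shared unique email
    ("c3", "c4", 0.8),    # phone within name cluster
    ("c5", "c6", 0.6),    # weak name-only match -> human review
])
```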
## Invariants
- Local authority. entity_identifiers is the single source of truth for canonical identity. External registries provide evidence, not verdicts.
- Read-only spokes. No enrichment script writes to an external service. Data flows inward only.
- Idempotent rebuild.
resolveidentities.pyclears and rebuilds entityidentifiers from source data. Any manual correction must be encoded as a rule in the script, not as a row edit. - Provenance on every identifier. The
sourcefield records how the identifier entered the system. No anonymous insertions. - Confidence ordering. Higher-confidence identifiers take precedence in canonical_id assignment. When merging groups, the group with more high-confidence identifiers becomes the canonical target.
- No false merges. Phone numbers only merge within name clusters. Common names require secondary evidence (shared affiliation, shared paper, shared email domain). Conservative merging is preferable to aggressive dedup.
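The confidence-ordering invariant for choosing a merge target can be sketched as follows; the 0.9 cutoff for "high confidence" and the tie-break by group size are assumptions, not a spec:

```python
def pick_canonical_target(groups: dict[str, list[float]],
                          high: float = 0.9) -> str:
    """Choose the merge target: the group with more identifiers at or
    above the (assumed) high-confidence threshold, ties broken by total
    group size."""
    def score(gid: str) -> tuple[int, int]:
        confs = groups[gid]
        return (sum(c >= high for c in confs), len(confs))
    return max(groups, key=score)

target = pick_canonical_target({
    "g1": [1.0, 0.9, 0.8],        # two high-confidence identifiers
    "g2": [0.9, 0.8, 0.8, 0.8],   # larger, but only one high-confidence
})
```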
architecture · 25 · identity-resolution · 2026-03-14 · zach + claude