This document specifies the engineering architecture for identity resolution in HAAK. The ontological grounding is in [[11-identity]]; this document is about how identifiers flow through the system, how external registries are queried, and how the local identity hub grows over time.
## Hub-and-Spoke Model
The local entity_identifiers table is the authoritative hub. External registries are read-only spokes. HAAK never writes upstream.
```
              ┌──────────────┐
              │   entity_    │
              │ identifiers  │
              │ (local hub)  │
              └──────┬───────┘
     ┌───────┬───────┼───────┬──────────┐
     ▼       ▼       ▼       ▼          ▼
 contacts  github  orcid  semantic   spotify
   .db      API     API   scholar      API
                            API
```
Each spoke contributes identifiers in a single namespace. The hub holds the canonical mapping. Resolution runs locally — no external service has authority over canonical identity. External services provide evidence; the hub renders judgment.
Data flow is always inward. An enrichment script queries an external API, receives candidate identifiers, evaluates confidence, and writes to entity_identifiers. The hub never pushes identifiers outward. The external registry never learns what canonical_id its identifier maps to locally.
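The inward write can be sketched as a minimal, idempotent insert into the hub. This is a hypothetical minimal schema (the real entity_identifiers table likely has more columns); the `INSERT OR IGNORE` makes re-running an enrichment pass a no-op:

```python
import sqlite3

# Hypothetical minimal schema for the local hub (assumed columns).
SCHEMA = """
CREATE TABLE IF NOT EXISTS entity_identifiers (
    canonical_id TEXT NOT NULL,
    namespace    TEXT NOT NULL,
    external_id  TEXT NOT NULL,
    confidence   REAL NOT NULL,
    source       TEXT NOT NULL,
    created_at   TEXT NOT NULL DEFAULT (datetime('now')),
    UNIQUE (namespace, external_id)
)
"""

def write_inward(conn: sqlite3.Connection, candidates: list[tuple]) -> int:
    """Write candidate (canonical_id, namespace, external_id, confidence,
    source) rows into the hub; duplicates are ignored, so enrichment
    passes stay idempotent. Returns the number of rows actually added."""
    cur = conn.executemany(
        "INSERT OR IGNORE INTO entity_identifiers "
        "(canonical_id, namespace, external_id, confidence, source) "
        "VALUES (?, ?, ?, ?, ?)",
        candidates,
    )
    conn.commit()
    return cur.rowcount

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
candidates = [  # placeholder identifiers, not real data
    ("person:example-1", "orcid", "0000-0000-0000-0000", 0.9, "orcid-enrichment"),
    ("person:example-1", "github", "example-user", 0.9, "repos-pass"),
]
n = write_inward(conn, candidates)
```

Re-running `write_inward` with the same candidates adds nothing, which matters for the idempotent-rebuild invariant below.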
## Current State (v1)
The resolution script (scripts/resolve_identities.py) runs five passes over local data:
| Pass | Source | Namespace(s) | Method | Confidence |
|---|---|---|---|---|
| 1 | contacts.db | contact, email, phone | Union-find dedup: email merge unrestricted, phone merge within name clusters, name merge excluding noise | 1.0 (email), 0.8 (phone), 0.85 (name) |
| 2 | personas/ | persona | Normalized name match against contact canonical groups | 0.95 |
| 3 | papers.db | paper-author | Family name + first initial match | 0.9 |
| 4 | repos.db | github | GitHub username match against persona frontmatter or contact | 0.9 |
| 5 | Hard-coded | model, agent-session | Claude model registration, session registration from persona frontmatter | 1.0 |
Output: 12,081 identifiers across 9 namespaces, 4,968 canonical groups from 6,888 contacts (28% dedup).
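The Pass 1 dedup can be illustrated with a minimal union-find sketch for the unrestricted email merge (the phone and name merges add cluster constraints on top of the same machinery). Contact IDs and addresses here are illustrative:

```python
class UnionFind:
    """Minimal union-find used to merge contacts that share evidence."""
    def __init__(self):
        self.parent: dict[str, str] = {}

    def find(self, x: str) -> str:
        self.parent.setdefault(x, x)
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a: str, b: str) -> None:
        ra, rb = self.find(a), self.find(b)
        if ra != rb:
            self.parent[rb] = ra

def merge_by_email(contacts: dict[str, list[str]]) -> dict[str, str]:
    """Unrestricted email merge: contacts sharing any address join one
    group. Returns contact_id -> canonical root."""
    uf = UnionFind()
    seen: dict[str, str] = {}  # email -> first contact holding it
    for cid, emails in contacts.items():
        uf.find(cid)
        for e in emails:
            e = e.strip().lower()
            if e in seen:
                uf.union(seen[e], cid)
            else:
                seen[e] = cid
    return {cid: uf.find(cid) for cid in contacts}

groups = merge_by_email({
    "c1": ["zmainen@neuro.org"],
    "c2": ["zmainen@neuro.org", "z@gmail.com"],
    "c3": ["z@gmail.com"],
    "c4": ["other@x.org"],
})
```

Here c1, c2, and c3 chain into one canonical group through shared addresses, while c4 stays separate.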
Known limitations:
- Email alias resolution is string-exact (zmainen@neuro ≠ zachary.mainen@neuro)
- ORCID populated for 1 person only
- No external registry enrichment yet
- Paper-author matching uses family name + initial only (ambiguous for common names)
- Multi-value email fields partially handled (semicolons split, but some leak)
## Enrichment Pipeline Pattern
Adding a new external registry follows five steps. This is the reusable pattern for all future enrichment.
### Step 1: Register the registry
Add the registry as an entity. In practice, this means choosing a namespace slug and documenting what identifiers it issues.
```python
REGISTRY = {
    "namespace": "semantic-scholar",
    "name": "Semantic Scholar",
    "id_format": "AUTH-XXXXXXXX (author), PAPER-XXXXXXXX (paper)",
    "api_base": "https://api.semanticscholar.org/graph/v1",
    "rate_limit": "100 req/5 min (unauthenticated)",
}
```
### Step 2: Define the namespace
Add the namespace to the resolution script's known namespaces. The namespace determines how identifiers are stored and queried.
### Step 3: Write the adapter
An adapter script queries the external API and returns candidate identifier pairs:
```python
def enrich_from_semantic_scholar(canonical_id: str, name: str, known_papers: list[str]) -> list[tuple]:
    """Returns [(namespace, external_id, confidence, source), ...]"""
    # Query API by name
    # Cross-check against known paper DOIs for disambiguation
    # Return high-confidence matches only
```
Adapters are stateless functions. They take a canonical_id and whatever local evidence is available (name, email, known DOIs), query the external API, and return candidate identifiers with confidence scores.
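A testable sketch of this pattern, with the API call injected as a function so the adapter stays a pure mapping from evidence to candidates. The overlap thresholds and the `search_authors` shape are assumptions, not the real Semantic Scholar client:

```python
def enrich_from_s2(
    canonical_id: str,
    name: str,
    known_dois: set[str],
    search_authors,          # injected: name -> [(author_id, set_of_dois)]
    min_overlap: int = 2,    # assumed threshold, tune against real data
) -> list[tuple]:
    """Stateless adapter sketch: query candidates by name, keep those whose
    paper list overlaps known local DOIs, and return
    (namespace, external_id, confidence, source) tuples."""
    out = []
    for author_id, candidate_dois in search_authors(name):
        overlap = len(known_dois & candidate_dois)
        if overlap >= min_overlap:
            confidence = 0.9 if overlap >= 3 else 0.7
            out.append(("semantic-scholar-author", author_id,
                        confidence, "semantic-scholar-enrichment"))
    return out

# Stubbed search result standing in for the real API call.
def fake_search(name: str):
    return [("AUTH-1", {"10.1/a", "10.1/b", "10.1/c"}),
            ("AUTH-2", {"10.1/x"})]

matches = enrich_from_s2(
    "person:example", "Z Mainen", {"10.1/a", "10.1/b", "10.1/c"}, fake_search)
```

Because the API client is injected, the adapter can be unit-tested without network access, and the same disambiguation logic can be reused for ORCID.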
### Step 4: Wire into the pipeline
Add the adapter as a pass in resolve_identities.py. Enrichment passes run after dedup passes — they operate on canonical groups, not raw contacts.
### Step 5: Record provenance
Every enrichment-sourced identifier gets:
- source: adapter name (e.g., "semantic-scholar-enrichment")
- confidence: set by the adapter based on match quality
- Timestamp via created_at
The enrichment act itself is a situation in the ontological sense — it has actors (the script, the API), methods (name matching, DOI cross-check), and outputs (new identifier-belongings).
## Prioritized Roadmap
Registries ordered by value density — how many local entities they can enrich per unit of implementation effort.
### 1. Semantic Scholar (academic author disambiguation)
Value: ~300 personas and paper-author strings can be matched. Resolves the paper-author ambiguity problem (family name + initial is insufficient for "Wang Y" or "Li J"). Author IDs are stable. Paper IDs cross-link to DOIs already in papers.db.
Method: For each persona with known papers, query S2 author search by name, disambiguate by matching DOIs from papers.db against the candidate's paper list. High-overlap = high confidence.
Namespaces added: semantic-scholar-author, semantic-scholar-paper
### 2. Sender resolution (email ↔ contact linking)
Value: Email messages in gmail.db have sender/recipient addresses. Mapping these to canonical entities makes every email a situation with identified participants. Huge volume — ~500K messages.
Method: Extract unique sender addresses from gmail.db, match against entity_identifiers namespace="email". Most will match directly (same email strings). Remaining unmatched addresses go through domain-aware alias resolution (strip dots, check plus-addressing, match local-part patterns within same domain).
Namespaces added: None new — uses existing email namespace but increases coverage.
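The first pass of sender resolution is a plain exact match against the hub's email namespace; only the residue needs alias resolution. A minimal sketch, with an assumed email-to-canonical_id index as it might be loaded from entity_identifiers:

```python
def resolve_senders(
    senders: set[str], email_index: dict[str, str]
) -> tuple[dict[str, str], set[str]]:
    """Exact-match pass: map each sender address to a canonical_id via the
    hub's email namespace; collect the rest for domain-aware alias
    resolution."""
    resolved: dict[str, str] = {}
    unmatched: set[str] = set()
    for addr in senders:
        cid = email_index.get(addr.strip().lower())
        if cid is not None:
            resolved[addr] = cid
        else:
            unmatched.add(addr)
    return resolved, unmatched

# Illustrative index; in practice built from namespace="email" rows.
index = {"zmainen@neuro.org": "person:1", "ana@champ.org": "person:2"}
resolved, unmatched = resolve_senders(
    {"ZMainen@neuro.org", "ana@champ.org", "stranger@x.com"}, index)
```

At ~500K messages the unique-address set is far smaller than the message count, so this pass is cheap; only `unmatched` pays the cost of alias rules.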
### 3. ORCID (systematic enrichment)
Value: Authoritative, persistent, researcher-controlled identifiers. Currently only 1 person has ORCID in the system. All PIs and postdocs at Champalimaud have ORCIDs.
Method: For personas with known publications, query ORCID public API by name + affiliation. Cross-check returned publication list against papers.db DOIs. Store as orcid namespace.
Namespaces added: Enriches existing orcid namespace.
### 4. Spotify (artist identity)
Value: music.db has track metadata with artist names. Spotify API returns artist IDs. Linking artists to canonical entities enables: "what papers has this person published?" → "what music do they make?" (for the rare person who does both).
Method: For tracks in music.db, query Spotify API for artist IDs. Create entities for artists not already in the system. For known persons who are also musicians, merge.
Namespaces added: spotify-artist
### 5. Wikidata crosswalks
Value: Wikidata is a hub of hubs — it maps between ORCID, VIAF, ISNI, IMDb, MusicBrainz, and hundreds of other identifiers. One Wikidata lookup can yield 5+ external IDs.
Method: For high-confidence entities (PIs, public figures), query Wikidata by ORCID or name + affiliation. Harvest all P-number identifier properties. Each becomes a new namespace entry.
Namespaces added: wikidata, plus whatever identifiers Wikidata yields (viaf, isni, musicbrainz-artist, imdb-person, etc.)
### 6. Public archives (entity linking)
Value: public-archives.db has 1.6M entity mentions extracted from DOJ documents, WikiLeaks cables, ICIJ leaks. Linking these to canonical entities enables cross-source triangulation (the core capability described in [[09-situation-graph]]).
Method: Named entity matching with disambiguation. High-profile entities (public figures, institutions) first. Conservative confidence thresholds — false positives in this domain have reputational risk.
Namespaces added: epstein-doj, wikileaks, icij
## Email Alias Resolution (v2)
The highest-impact near-term fix. Current string-exact email matching fragments entities that use multiple addresses at the same domain.
Rules (domain-aware):
- Gmail dot-insensitivity: z.mainen@gmail.com = zmainen@gmail.com
- Plus-addressing: zmainen+list@gmail.com = zmainen@gmail.com
- Institutional aliases: within the same domain, zmainen@neuro.fc.org and zachary.mainen@neuro.fc.org are candidates for merge (require name confirmation)
- Domain equivalence: @neuro.fchampalimaud.org and @fundacaochampalimaud.pt are the same institution (maintain a small domain-equivalence table)
These rules apply only within the email namespace and only increase confidence when the name cluster already overlaps.
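The mechanical rules (dots, plus-addressing, domain equivalence) can be sketched as a single normalization function; the equivalence table entries here are assumed, and institutional local-part aliases are deliberately left alone because they require name confirmation:

```python
# Assumed domain-equivalence table; the real one is maintained by hand.
DOMAIN_EQUIV = {
    "neuro.fchampalimaud.org": "fundacaochampalimaud.pt",
}

def normalize_email(addr: str) -> str:
    """Apply the domain-aware rules: lowercase, plus-addressing, Gmail
    dot-insensitivity, and domain equivalence. Institutional local-part
    aliases are NOT collapsed here; they need name confirmation."""
    local, _, domain = addr.strip().lower().partition("@")
    local = local.split("+", 1)[0]               # strip plus-addressing
    if domain in ("gmail.com", "googlemail.com"):
        local = local.replace(".", "")           # Gmail dot-insensitivity
        domain = "gmail.com"
    domain = DOMAIN_EQUIV.get(domain, domain)    # domain equivalence
    return f"{local}@{domain}"
```

Two addresses are alias candidates when their normalized forms match; per the rule above, the merge is only confirmed when the name cluster already overlaps.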
## Confidence Calibration
Confidence is ordinal, not probabilistic. The scale:
| Range | Meaning | Example |
|---|---|---|
| 1.0 | Definitional | Hard-coded agent registration, manual verification |
| 0.95 | Near-certain | Shared unique email address |
| 0.9 | High | Persona-to-contact name match, DOI cross-validated S2 author |
| 0.8 | Moderate | Phone within name cluster, name-only match with common name |
| 0.7 | Tentative | Single-paper author match, domain-inferred email alias |
| < 0.7 | Review required | Flagged for human confirmation before use |
Merges below 0.7 are never automatic. They are stored as candidates and surfaced for human review.
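The review gate reduces to a simple partition over candidate merges; the tuple shape here is illustrative:

```python
REVIEW_THRESHOLD = 0.7  # merges below this are never automatic

def route_merges(candidates: list[tuple[str, str, float]]):
    """Partition candidate merges: apply automatically at or above the
    threshold, queue the rest for human review."""
    applied, review = [], []
    for cand in candidates:
        (applied if cand[2] >= REVIEW_THRESHOLD else review).append(cand)
    return applied, review

applied, review = route_merges([
    ("c1", "c2", 0.95),   # shared unique email
    ("c3", "c4", 0.8),    # phone within name cluster
    ("c5", "c6", 0.6),    # weak name-only match -> human review
])
```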
## Invariants
- Local authority. entity_identifiers is the single source of truth for canonical identity. External registries provide evidence, not verdicts.
- Read-only spokes. No enrichment script writes to an external service. Data flows inward only.
- Idempotent rebuild.
resolveidentities.pyclears and rebuilds entityidentifiers from source data. Any manual correction must be encoded as a rule in the script, not as a row edit. - Provenance on every identifier. The
sourcefield records how the identifier entered the system. No anonymous insertions. - Confidence ordering. Higher-confidence identifiers take precedence in canonical_id assignment. When merging groups, the group with more high-confidence identifiers becomes the canonical target.
- No false merges. Phone numbers only merge within name clusters. Common names require secondary evidence (shared affiliation, shared paper, shared email domain). Conservative merging is preferable to aggressive dedup.
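The confidence-ordering invariant for choosing a merge target can be sketched as follows; the 0.9 cutoff for "high confidence" and the tie-break by group size are assumptions, not a spec:

```python
def pick_canonical_target(groups: dict[str, list[float]],
                          high: float = 0.9) -> str:
    """Choose the merge target: the group with more identifiers at or
    above the (assumed) high-confidence threshold, ties broken by total
    group size."""
    def score(gid: str) -> tuple[int, int]:
        confs = groups[gid]
        return (sum(c >= high for c in confs), len(confs))
    return max(groups, key=score)

target = pick_canonical_target({
    "g1": [1.0, 0.9, 0.8],        # two high-confidence identifiers
    "g2": [0.9, 0.8, 0.8, 0.8],   # larger, but only one high-confidence
})
```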
architecture · 25 · identity-resolution · 2026-03-14 · zach + claude