Identity as Belonging

This document establishes the ontological treatment of identity — how the system determines that two references pick out the same entity. Identity is not a primitive. It is a consequence of the belonging structure: identifiers are belongings with specific qualities, and identification is the judgment that two identifier-belongings share a canonical target.


#The Problem

The same person appears fragmented across data sources. "Zach Mainen" in contacts, "Mainen ZF" in paper author lists, "zmainen" on GitHub, ORCID 0000-0003-2446-9869 in journal metadata, a phone number in WhatsApp, an email address in Gmail. Each source knows one facet. None knows the whole.

This is not a data-cleaning problem. It is an ontological problem. The person is an entity. Each identifier is a belonging — the entity belongs to the namespace (email system, GitHub, ORCID registry) with a quality that specifies the kind of identification. The question "are these the same person?" is a question about whether two belongings share a canonical target.


#Identifiers as Belongings

Definition I1 (Identifier). An identifier is a belonging where an entity belongs to a namespace with quality "identified-by." The namespace is itself an entity (an institution or system that issues identifiers). The external_id is the specific handle within that namespace.

entity      belongs-to  namespace     quality: "identified-by"
                                       external_id: "zmainen"

This follows directly from the quality mechanism (R2, R3). "Identified-by" is a quality in the provenance family — it records how an entity is referenced within a system. Like all qualities, it is an entity, related to other qualities via meta-qualities:

  • "identified-by" is an instance of "reference" (classification)
  • "identified-by" implies "registered-in" (the entity is known to the namespace)
  • "identified-by" applies-to belongings whose target is a namespace (domain constraint)
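
As a rough illustration, Definition I1 and the notation above could be modeled as a small record type. The class name `Belonging`, its field names, and the entity label "zach-mainen" are hypothetical, chosen to mirror the notation; they are not taken from the system's code.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Belonging:
    """Hypothetical record: `entity` belongs to `target` with a `quality`."""
    entity: str                        # the entity that belongs
    target: str                        # what it belongs to (here, a namespace)
    quality: str                       # e.g. "identified-by" or "registry"
    external_id: Optional[str] = None  # handle within the namespace, if any

    def is_identifier(self) -> bool:
        # Definition I1: an identifier is a belonging with quality
        # "identified-by" that carries a handle in the target namespace.
        return self.quality == "identified-by" and self.external_id is not None


# Assumed example values, following the "zmainen" handle used above.
github_handle = Belonging("zach-mainen", "github", "identified-by", "zmainen")
github_registry = Belonging("github", "software-infrastructure", "registry")
```

Under this sketch an identifier is not a separate kind of object: it is an ordinary belonging that happens to carry the "identified-by" quality.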

Definition I2 (Namespace). A namespace is an entity that issues identifiers. GitHub, ORCID, a phone system, an email domain, a contacts database — each is a namespace. Namespaces are entities with quality "registry" in their belonging to the broader information ecosystem.

Within a namespace, identifiers are unique by definition (that is what namespaces do). Across namespaces, the same entity may have different identifiers, and different entities may have colliding surface forms ("Lima" is both a person and a city). Cross-namespace identification is where the ontological work happens.
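
One way to picture the distinction is an index keyed by the pair (namespace, external_id) rather than by the surface form alone. The sketch below is an assumption for illustration: the `register` helper, the "places" namespace, and the entity labels are hypothetical.

```python
from typing import Dict, Tuple

# Hypothetical index: (namespace, external_id) -> local entity label.
IdentifierKey = Tuple[str, str]


def register(index: Dict[IdentifierKey, str], namespace: str,
             external_id: str, entity: str) -> None:
    """Add an identifier, refusing a second entity for the same key."""
    key = (namespace, external_id)
    existing = index.get(key)
    if existing is not None and existing != entity:
        # Within one namespace an identifier picks out exactly one entity.
        raise ValueError(f"{key} already identifies {existing!r}")
    index[key] = entity


index: Dict[IdentifierKey, str] = {}
# The same surface form in two namespaces yields two distinct keys.
register(index, "contacts", "Lima", "person-lima")
register(index, "places", "Lima", "city-lima")
```

Uniqueness is enforced per namespace, so the two "Lima" entries coexist. Deciding whether keys from different namespaces refer to the same entity is left to resolution.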


#Canonical Identity

Definition I3 (Canonical identity). A canonical identity is the local designation of an entity as a single node in the belonging graph. When the system determines that two identifier-belongings refer to the same entity, it assigns them a shared canonical_id. The canonical_id is itself an identifier, but one in the local namespace, with quality "canonical."

The canonical identity is authoritative locally. It does not claim global truth. It claims: within this system, these identifiers are treated as co-referential. The confidence field on each identifier-belonging records how certain the co-reference judgment is.

Definition I4 (Co-reference). Two identifier-belongings are co-referential when they share the same canonical_id. Co-reference is transitive: if identifier A and identifier B share a canonical_id, and identifier B and identifier C share a canonical_id, then A and C are co-referential. This transitivity is implemented through union-find during resolution (see [[25-identity-resolution]]).
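
The transitivity in Definition I4 can be illustrated with a minimal union-find. This is a generic textbook sketch, not the resolution code described in [[25-identity-resolution]]; the `UnionFind` class and the (namespace, external_id) tuples used as keys are assumptions.

```python
from typing import Dict, Hashable


class UnionFind:
    """Minimal disjoint-set structure with path compression."""

    def __init__(self) -> None:
        self.parent: Dict[Hashable, Hashable] = {}

    def find(self, item: Hashable) -> Hashable:
        # Unseen items start as their own root.
        self.parent.setdefault(item, item)
        root = item
        while self.parent[root] != root:
            root = self.parent[root]
        # Path compression: point every node on the path at the root.
        while self.parent[item] != root:
            self.parent[item], item = root, self.parent[item]
        return root

    def union(self, a: Hashable, b: Hashable) -> None:
        # Merge the two sets; the root of `a` becomes the shared root.
        root_a, root_b = self.find(a), self.find(b)
        if root_a != root_b:
            self.parent[root_b] = root_a


uf = UnionFind()
a = ("contacts", "Zach Mainen")
b = ("github", "zmainen")
c = ("orcid", "0000-0003-2446-9869")
uf.union(a, b)  # judged co-referential
uf.union(b, c)  # judged co-referential
same_entity = uf.find(a) == uf.find(c)  # follows by transitivity
```

In this sketch the root of each set plays the role of the shared canonical_id: A and C were never compared directly, yet they end up with the same root.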


#Registries as Entities

External registries — Semantic Scholar, ORCID, Wikidata, MusicBrainz, IMDb — are entities in the ontology. They are not authoritative hubs. They are namespaces that the local system reads from, never writes to.

Each registry belongs to the information ecosystem with quality "registry" and has its own belongings:

semantic-scholar  belongs-to  academic-infrastructure  quality: "registry"
orcid             belongs-to  academic-infrastructure  quality: "registry"
github            belongs-to  software-infrastructure  quality: "registry"
spotify           belongs-to  music-infrastructure     quality: "registry"

When the system enriches a local entity with an external identifier, it creates a new belonging:

entity    belongs-to  semantic-scholar  quality: "identified-by"
                                         external_id: "AUTH-12345"
                                         confidence: 0.95
                                         source: "enrichment-script"

The enrichment is a situation — an act of extraction with actors (the script, the API), methods (name matching, DOI lookup), and materials (the API response, the local entity record). This follows the principle from [[10-relations-applied]]: every act of extraction is itself a situation in the graph.
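
A minimal sketch of how an enrichment situation and the belonging it produces might be recorded together. The class names, the `record` method, and the convention that the first actor becomes the `source` are all hypothetical; the field values repeat the example above.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class IdentifierBelonging:
    entity: str
    namespace: str
    external_id: str
    confidence: float
    source: str
    quality: str = "identified-by"


@dataclass
class EnrichmentSituation:
    """Hypothetical record of one act of extraction."""
    actors: List[str]
    methods: List[str]
    materials: List[str]
    produced: List[IdentifierBelonging] = field(default_factory=list)

    def record(self, entity: str, namespace: str, external_id: str,
               confidence: float) -> IdentifierBelonging:
        # Assumption: the first actor is recorded as the belonging's source.
        belonging = IdentifierBelonging(
            entity, namespace, external_id, confidence, source=self.actors[0]
        )
        self.produced.append(belonging)
        return belonging


situation = EnrichmentSituation(
    actors=["enrichment-script", "semantic-scholar-api"],
    methods=["name matching", "DOI lookup"],
    materials=["API response", "local entity record"],
)
new_belonging = situation.record("entity", "semantic-scholar", "AUTH-12345", 0.95)
```

Keeping the produced belongings on the situation means each identifier can be traced back to the act of extraction that created it.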


#The Denormalized Index

The entity_identifiers table in data/entities.db is not the ontological structure — it is a denormalized index over it. The ontological structure is the quality graph of identifier-belongings. The table flattens this for efficient lookup:

CREATE TABLE entity_identifiers (
    canonical_id   TEXT,    -- local canonical designation
    namespace      TEXT,    -- the registry/system (email, github, orcid, ...)
    external_id    TEXT,    -- the handle within that namespace
    confidence     REAL,    -- certainty of co-reference judgment
    source         TEXT     -- provenance of the identification
);

This is the same pattern as the situation graph in [[09-situation-graph]]: relationships are derived from belongings, stored in a denormalized index for query performance, but grounded in the ontological primitives. The index is rebuilt from source data (contacts, personas, papers, repos) by the resolution script — it is a materialization, not a primary record.
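
The "materialization, not a primary record" idea can be sketched with Python's standard `sqlite3` module and an in-memory database. The `rebuild_index` and `lookup` helpers, the "person-1" label, and the sample rows (including their confidence values) are assumptions for illustration, not the actual resolution script.

```python
import sqlite3

# Hypothetical identifier-belongings; in the real system these are derived
# from source data (contacts, personas, papers, repos).
belongings = [
    ("person-1", "github", "zmainen", 0.95, "repos"),
    ("person-1", "orcid", "0000-0003-2446-9869", 0.95, "papers"),
]


def rebuild_index(conn: sqlite3.Connection, rows) -> None:
    """Drop and recreate the index: it can always be regenerated."""
    conn.execute("DROP TABLE IF EXISTS entity_identifiers")
    conn.execute(
        """CREATE TABLE entity_identifiers (
               canonical_id TEXT, namespace TEXT, external_id TEXT,
               confidence REAL, source TEXT)"""
    )
    conn.executemany(
        "INSERT INTO entity_identifiers VALUES (?, ?, ?, ?, ?)", rows
    )
    conn.commit()


def lookup(conn: sqlite3.Connection, namespace: str, external_id: str):
    """Return the canonical_id for a handle, or None if it is unknown."""
    row = conn.execute(
        "SELECT canonical_id FROM entity_identifiers "
        "WHERE namespace = ? AND external_id = ?",
        (namespace, external_id),
    ).fetchone()
    return row[0] if row else None


conn = sqlite3.connect(":memory:")
rebuild_index(conn, belongings)
```

Because the table is dropped and refilled on every rebuild, nothing is lost if it is deleted; the belongings remain the source of truth.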


#Identity and Withdrawal

The identity problem is a direct manifestation of withdrawal (R7, [[02-relations]]). Each namespace reveals a partial view of the entity — the email system sees the email address, GitHub sees the username, the paper system sees the author string. No single namespace sees the whole entity. The canonical identity is the system's attempt to reconstruct the entity from its partial projections.

This reconstruction is always incomplete. New namespaces may reveal new facets. Two entities believed to be distinct may turn out to be the same person (merge). One entity may turn out to be two people sharing a phone number (split). The identity graph is not a static truth — it is a working hypothesis, refined by each new belonging that enters the system.

The confidence field encodes this epistemic humility. A shared email address gives high confidence (0.95). A matching family name + first initial gives moderate confidence (0.9). A shared phone number within the same name cluster gives lower confidence (0.8). The numbers are not probabilities in a formal sense — they are ordinal rankings of evidential strength, used to prioritize review and flag uncertain merges.
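
Treating the values as ordinal ranks suggests using them only for comparison and sorting. A minimal sketch follows, in which the evidence labels, the `review_queue` helper, and the review threshold are hypothetical; only the three confidence values come from the text.

```python
from typing import List, Tuple

# Confidence per kind of evidence, as stated above. The keys are assumed labels.
EVIDENCE_CONFIDENCE = {
    "shared-email": 0.95,
    "family-name-first-initial": 0.9,
    "shared-phone-in-name-cluster": 0.8,
}

Merge = Tuple[str, str]  # (merge label, kind of evidence)


def review_queue(merges: List[Merge], threshold: float = 0.9) -> List[Merge]:
    """Return merges ranked below `threshold`, least certain first."""
    flagged = [m for m in merges if EVIDENCE_CONFIDENCE[m[1]] < threshold]
    # Only the ordering of the values is used, never their magnitude.
    return sorted(flagged, key=lambda m: EVIDENCE_CONFIDENCE[m[1]])


merges = [
    ("merge-a", "shared-email"),
    ("merge-b", "shared-phone-in-name-cluster"),
    ("merge-c", "family-name-first-initial"),
]
to_review = review_queue(merges)
```

With the assumed threshold of 0.9, only the phone-based merge is flagged. Raising the threshold widens the queue without changing the order.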


#Scope and Continuation

This document establishes identity as belonging. The engineering implementation — hub-and-spoke architecture, enrichment pipeline, resolution algorithm, prioritized roadmap — is in [[25-identity-resolution]].


ontology · 11 · identity · 2026-03-14 · zach + claude

Ontology 11 — Identity as Belonging — 2026 — Zachary F. Mainen / HAAK