Domain Branches

Architecture 02 (forkability) distinguishes two layers: the system layer (portable) and the data layer (instance-specific). This document subdivides the data layer into domain branches — independently selectable bundles of databases, projects, and configuration that different installations opt into. A branch is not a git branch. It is a domain in the ontological sense: a context that determines which materials exist and which services run.


#The Problem

HAAK's data layer has grown from a handful of project directories to 25+ databases totaling 10+ GB, and it will soon reach 500+ GB once reference corpora are included. These databases serve radically different purposes: some are private communications, some are institutional work products, some are public reference data. They fall under different contexts, access policies, and infrastructure, and they belong to different people.

A fork of HAAK — someone cloning the repo to run their own instance — should not need to install Semantic Scholar's 200 GB academic graph if they're a musician. A scientist should not need Discogs. Everyone needs identity resolution. No one outside the original author needs their WhatsApp history.

The data layer is not one thing. It is a collection of domains, each independently present or absent.


#Branch Definitions

#Core

Present in every HAAK install. The coordination infrastructure — without these, the system cannot maintain identity, track state, or discover its own contents.

| Database | What | Size |
| --- | --- | --- |
| entities.db | Identity graph, belongings, entity_identifiers | 283 MB |
| vault.db | Credentials, services, database registry | 128 KB |
| todos.db | Action items | 704 KB |
| files.db | Filesystem index | 63 MB |
| threads.db | Conceptual thread graph | 2 MB |
| storage.db | Cloud storage index | 20 KB |

Plus: patterns/, .claude/, CLAUDE.md, ontology/, foundations/ — the system layer from architecture 02.

#Personal

Private life. Belongs to the individual, never shared, never on institutional infrastructure. Deploys to: personal server (for always-on access) and personal machine.

| Database | What | Size |
| --- | --- | --- |
| contacts.db | Deduplicated contacts | 3 MB |
| gateway.db | Unified messaging (all channels) | 100 MB |
| whatsapp/messages.db | WhatsApp history | 93 MB |
| signal/messages.db | Signal history | varies |
| matrix/messages.db | Matrix history | varies |
| google_personal.db | Personal email + Drive | 604 MB |
| notes.db | Apple Notes | 56 MB |
| arc.db | Browser history | 63 MB |
| music.db | Music library + DJ sets | 4 MB |
| spotify.db | Listening history | 256 KB |
| events.db | Calendar events | 2 MB |
| conversations.db | ChatGPT archive | 45 MB |

Plus: personal projects, personal personas, any strategy/ docs that are life-planning rather than work.

#Work

Professional/institutional context. Bound to a specific employer or collaboration. Deploys to: work infrastructure (institutional cloud, lab server) and work machine. Credentials come from the institution.

| Database | What | Size |
| --- | --- | --- |
| google_work.db | Work email + Drive | 3.7 GB |
| papers.db | Curated paper collection (eLife corpus) | 481 MB |
| repos.db | GitHub repo mirror | 192 KB |
| transcripts.db | Meeting transcripts | 10 MB |

Plus: work projects (manuscripts, grants, reviews), work personas (colleagues, reviewers), lab-specific configuration.

The work branch is institution-specific. A Champalimaud researcher and a CSHL researcher would have different work branches with different Google accounts, different repos, different paper collections. The system layer is the same; the work data is local to each institution.

#Reference

Large public corpora. Opt-in by topic. Each sub-branch is independently installable. Deploys to: wherever has the disk and compute (typically a cloud instance, since these are too large for most laptops to carry permanently).

| Sub-branch | Databases | Combined size | Source |
| --- | --- | --- | --- |
| reference:academic | s2ag.db, dblp.db | ~205 GB | Semantic Scholar, DBLP |
| reference:music | discogs.db, musicbrainz.db | ~45 GB | Discogs, MusicBrainz |
| reference:cultural | imdb.db | ~10 GB | IMDb |
| reference:crosswalk | wikidata.db | ~70 GB | Wikidata (filtered) |
| reference:investigative | public-archives.db, reference.db | ~17 GB | DOJ, WikiLeaks, ICIJ |

Reference branches are read-mostly. They get bulk-ingested on a schedule (weekly for S2AG, monthly for Discogs, etc.) and queried by enrichment scripts and agents. They are the library — public knowledge loaded locally for fast access.


#Branch Manifest

Each HAAK installation declares its active branches in a manifest. The manifest lives in data/manifest.toml:

[install]
name = "haak-zach"           # unique install name
host = "mac"                 # mac, exoscale, gcloud, ...

[branches]
core = true                  # always true
personal = true
work = true
reference = ["academic", "music", "crosswalk", "investigative"]

Scripts, hooks, and skills read the manifest to determine what's available:

  • check-services.sh only monitors databases in active branches
  • sync-dbs.sh only syncs databases in active branches
  • /search only queries databases in active branches
  • resolve_identities.py only links sources in active branches
  • Ingest scripts skip branches not in the manifest

#Deployment Topology

For the current HAAK instance (Zach's):

Mac (haak-zach-mac)                 Exoscale (haak-zach-exo)           GCloud (haak-zach-gcloud)
branches:                           branches:                          branches:
  core ✓                              core ✓                             core ✓
  personal ✓                          personal ✓                         work ✓
  work ✓                              (gateway, viewer,                  reference:academic ✓
  reference: on-demand                 bridges; always on)               reference:music ✓
                                                                         reference:crosswalk ✓
                                                                         reference:investigative ✓

Sync tool: scripts/sync-dbs.sh — branch-aware rsync. Reads manifest.toml on both source and target, computes branch intersection, syncs only matching databases. Runs as cron every 5 minutes (sync-dbs.sh --all). Status view: sync-dbs.sh --status.

Sync flow:

  • entities.db (core) syncs across all three — it's the identity hub
  • personal databases live on Mac + Exoscale, never on GCloud
  • work databases live on Mac + GCloud, not on Exoscale
  • reference databases live on GCloud, pulled to Mac on demand
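The branch-intersection rule can be sketched as a set operation. This is an illustration under assumptions, not the logic of `sync-dbs.sh` itself: the function `databases_to_sync`, the per-host branch sets (read off the topology above), and the database-to-branch mapping are hypothetical.

```python
# Active branches per host, taken from the deployment topology.
MAC = {"core", "personal", "work"}  # reference is pulled on demand
EXOSCALE = {"core", "personal"}
GCLOUD = {
    "core", "work",
    "reference:academic", "reference:music",
    "reference:crosswalk", "reference:investigative",
}

# Hypothetical database -> branch mapping (a small subset).
DATABASE_BRANCH = {
    "entities.db": "core",
    "gateway.db": "personal",
    "google_work.db": "work",
    "s2ag.db": "reference:academic",
}

def databases_to_sync(source: set[str], target: set[str],
                      db_branch: dict[str, str]) -> list[str]:
    """Databases whose branch is active on both source and target."""
    shared = source & target  # branch intersection
    return sorted(db for db, branch in db_branch.items() if branch in shared)

mac_exo = databases_to_sync(MAC, EXOSCALE, DATABASE_BRANCH)
mac_gcloud = databases_to_sync(MAC, GCLOUD, DATABASE_BRANCH)
exo_gcloud = databases_to_sync(EXOSCALE, GCLOUD, DATABASE_BRANCH)
print(mac_exo, mac_gcloud, exo_gcloud)
```

Because the intersection of Exoscale and GCloud is only `core`, personal databases can never reach GCloud through this path, which matches the sync flow described above.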

Enrichment flow:

  • Enrichment scripts run on GCloud (where reference DBs live)
  • They read s2ag.db + wikidata.db, write to entities.db
  • entities.db syncs back to Exoscale and Mac

#Vault Integration

The databases table in vault.db gains a branch column:

ALTER TABLE databases ADD COLUMN branch TEXT;

Values: core, personal, work, reference:academic, reference:music, reference:cultural, reference:crosswalk, reference:investigative.

The vault refresh scan (already built) uses the branch field to:

  • Report health only for active branches
  • Flag databases present on disk but not in the manifest (orphaned)
  • Flag databases in the manifest but missing from disk (needed)
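A sketch of the orphaned/needed check using Python's built-in `sqlite3` module against an in-memory database. The one-column starting schema for `databases`, the `classify` helper, and the sample rows are assumptions for illustration; the real table in vault.db likely has more columns. "Not in the manifest" is interpreted here as "branch not active in the manifest".

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema; the actual databases table may differ.
conn.execute("CREATE TABLE databases (name TEXT PRIMARY KEY)")
conn.execute("ALTER TABLE databases ADD COLUMN branch TEXT")
conn.executemany(
    "INSERT INTO databases (name, branch) VALUES (?, ?)",
    [
        ("entities.db", "core"),
        ("contacts.db", "personal"),
        ("papers.db", "work"),
        ("discogs.db", "reference:music"),
    ],
)

def classify(conn: sqlite3.Connection, active: set[str],
             on_disk: set[str]) -> tuple[list[str], list[str]]:
    """Return (orphaned, needed) database names."""
    orphaned, needed = [], []
    rows = conn.execute("SELECT name, branch FROM databases ORDER BY name")
    for name, branch in rows:
        present = name in on_disk
        if present and branch not in active:
            orphaned.append(name)   # on disk, branch not active
        elif not present and branch in active:
            needed.append(name)     # branch active, file missing
    return orphaned, needed

orphaned, needed = classify(
    conn,
    active={"core", "work"},
    on_disk={"entities.db", "discogs.db"},
)
print(orphaned, needed)
```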

#Forkability Revisited

Architecture 02's two-layer model becomes three:

| Layer | What | Portable? | Example |
| --- | --- | --- | --- |
| System | Skills, patterns, agents, ontology | Yes: clone and use | .claude/, patterns/, ontology/ |
| Data: Core | Identity graph, vault, coordination | Partially: schema portable, data instance-specific | entities.db, vault.db |
| Data: Branches | Domain-specific databases + projects | No: selected per install, populated per user | s2ag.db, google_work.db, music.db |

A fresh fork starts with: system layer + empty core databases + a manifest with only core = true. The user adds branches as they need them. Each branch brings its own ingest scripts, sync configuration, and skill extensions.


#Branch Dependencies

Some branches depend on others:

| Branch | Requires |
| --- | --- |
| core | (none) |
| personal | core |
| work | core |
| reference:academic | core (for entity linking) |
| reference:music | core |
| reference:cultural | core |
| reference:crosswalk | core, plus at least one other reference branch to be useful |
| reference:investigative | core |
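The dependency table above could be enforced when a manifest is loaded. The following is a minimal sketch, assuming the active branches have already been resolved to labels such as `reference:crosswalk`; the `REQUIRES` mapping mirrors the table, while `check_dependencies` and its message format are hypothetical.

```python
# Hard requirements per branch, mirroring the dependency table.
REQUIRES: dict[str, set[str]] = {
    "core": set(),
    "personal": {"core"},
    "work": {"core"},
    "reference:academic": {"core"},
    "reference:music": {"core"},
    "reference:cultural": {"core"},
    "reference:crosswalk": {"core"},
    "reference:investigative": {"core"},
}

def check_dependencies(active: set[str]) -> list[str]:
    """Return human-readable problems; an empty list means the set is valid."""
    problems = []
    for branch in sorted(active):
        if branch not in REQUIRES:
            problems.append(f"{branch}: unknown branch")
            continue
        for dep in sorted(REQUIRES[branch] - active):
            problems.append(f"{branch}: requires {dep}")
    # crosswalk is only useful next to at least one other reference branch
    if "reference:crosswalk" in active:
        others = {b for b in active if b.startswith("reference:")}
        others.discard("reference:crosswalk")
        if not others:
            problems.append(
                "reference:crosswalk: needs another reference branch to be useful"
            )
    return problems

print(check_dependencies({"core", "reference:crosswalk"}))
```

Treating the crosswalk rule as a reported problem rather than a hard failure is a design choice of this sketch; the table only says the branch is not useful on its own.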

The enrichment pipeline (architecture 25) operates across branches: it reads from reference branches and writes to core (entities.db). This is the only cross-branch write path.


architecture · 26 · domain-branches · 2026-03-15 · zach + claude

Architecture 26 — Domain Branches — 2026 — Zachary F. Mainen / HAAK