Architecture 02 (forkability) distinguishes two layers: the system layer (portable) and the data layer (instance-specific). This document subdivides the data layer into domain branches — independently selectable bundles of databases, projects, and configuration that different installations opt into. A branch is not a git branch. It is a domain in the ontological sense: a context that determines which materials exist and which services run.
# The Problem
HAAK's data layer has grown from a handful of project directories to 25+ databases totaling 10+ GB, soon 500+ GB with reference corpora. These databases serve radically different purposes: some are private communications, some are institutional work products, some are public reference data. They belong to different contexts and different people, and they demand different access policies and different infrastructure.
A fork of HAAK — someone cloning the repo to run their own instance — should not need to install Semantic Scholar's 200 GB academic graph if they're a musician. A scientist should not need Discogs. Everyone needs identity resolution. No one outside the original author needs their WhatsApp history.
The data layer is not one thing. It is a collection of domains, each independently present or absent.
# Branch Definitions
## Core
Present in every HAAK install. The coordination infrastructure — without these, the system cannot maintain identity, track state, or discover its own contents.
| Database | What | Size |
|---|---|---|
| entities.db | Identity graph, belongings, entity_identifiers | 283 MB |
| vault.db | Credentials, services, database registry | 128 KB |
| todos.db | Action items | 704 KB |
| files.db | Filesystem index | 63 MB |
| threads.db | Conceptual thread graph | 2 MB |
| storage.db | Cloud storage index | 20 KB |
Plus: patterns/, .claude/, CLAUDE.md, ontology/, foundations/ — the system layer from architecture 02.
## Personal
Private life. Belongs to the individual, never shared, never on institutional infrastructure. Deploys to: personal server (for always-on access) and personal machine.
| Database | What | Size |
|---|---|---|
| contacts.db | Deduplicated contacts | 3 MB |
| gateway.db | Unified messaging (all channels) | 100 MB |
| whatsapp/messages.db | WhatsApp history | 93 MB |
| signal/messages.db | Signal history | varies |
| matrix/messages.db | Matrix history | varies |
| google_personal.db | Personal email + Drive | 604 MB |
| notes.db | Apple Notes | 56 MB |
| arc.db | Browser history | 63 MB |
| music.db | Music library + DJ sets | 4 MB |
| spotify.db | Listening history | 256 KB |
| events.db | Calendar events | 2 MB |
| conversations.db | ChatGPT archive | 45 MB |
Plus: personal projects, personal personas, any strategy/ docs that are life-planning rather than work.
## Work
Professional/institutional context. Bound to a specific employer or collaboration. Deploys to: work infrastructure (institutional cloud, lab server) and work machine. Credentials come from the institution.
| Database | What | Size |
|---|---|---|
| google_work.db | Work email + Drive | 3.7 GB |
| papers.db | Curated paper collection (eLife corpus) | 481 MB |
| repos.db | GitHub repo mirror | 192 KB |
| transcripts.db | Meeting transcripts | 10 MB |
Plus: work projects (manuscripts, grants, reviews), work personas (colleagues, reviewers), lab-specific configuration.
The work branch is institution-specific. A Champalimaud researcher and a CSHL researcher would have different work branches with different Google accounts, different repos, different paper collections. The system layer is the same; the work data is local to each institution.
## Reference
Large public corpora. Opt-in by topic. Each sub-branch is independently installable. Deploys to: wherever has the disk and compute (typically a cloud instance, since these are too large for most laptops to carry permanently).
| Sub-branch | Databases | Combined size | Source |
|---|---|---|---|
| reference:academic | s2ag.db, dblp.db | ~205 GB | Semantic Scholar, DBLP |
| reference:music | discogs.db, musicbrainz.db | ~45 GB | Discogs, MusicBrainz |
| reference:cultural | imdb.db | ~10 GB | IMDb |
| reference:crosswalk | wikidata.db | ~70 GB | Wikidata (filtered) |
| reference:investigative | public-archives.db, reference.db | ~17 GB | DOJ, WikiLeaks, ICIJ |
Reference branches are read-mostly. They get bulk-ingested on a schedule (weekly for S2AG, monthly for Discogs, etc.) and queried by enrichment scripts and agents. They are the library — public knowledge loaded locally for fast access.
# Branch Manifest
Each HAAK installation declares its active branches in a manifest. The manifest lives in data/manifest.toml:
```toml
[install]
name = "haak-zach"   # unique install name
host = "mac"         # mac, exoscale, gcloud, ...

[branches]
core = true          # always true
personal = true
work = true
reference = ["academic", "music", "crosswalk", "investigative"]
```
Scripts, hooks, and skills read the manifest to determine what's available:
- `check-services.sh` only monitors databases in active branches
- `sync-dbs.sh` only syncs databases in active branches
- `/search` only queries databases in active branches
- `resolve_identities.py` only links sources in active branches
- Ingest scripts skip branches not in the manifest
# Deployment Topology
For the current HAAK instance (Zach's):
| Host | Branches | Notes |
|---|---|---|
| Mac (haak-zach-mac) | core, personal, work, reference (on-demand) | |
| Exoscale (haak-zach-exo) | core, personal | gateway, viewer, bridges always on |
| GCloud (haak-zach-gcloud) | core, work, reference:academic, reference:music, reference:crosswalk, reference:investigative | |
Sync tool: `scripts/sync-dbs.sh`, a branch-aware rsync wrapper. It reads `manifest.toml` on both source and target, computes the branch intersection, and syncs only the matching databases. Runs via cron every 5 minutes (`sync-dbs.sh --all`); status view: `sync-dbs.sh --status`.
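The intersection rule can be expressed in a few lines. This is a sketch of the selection logic only (the actual tool is a shell script); the registry mapping and branch sets below are illustrative, with branch tags already flattened into `reference:<topic>` form.

```python
def sync_set(src: set[str], dst: set[str], registry: dict[str, str]) -> list[str]:
    """Databases eligible to sync: their branch must be active on BOTH hosts."""
    common = src & dst
    return sorted(db for db, branch in registry.items() if branch in common)

# Example: Mac and GCloud share only core and work, so entities.db and
# google_work.db sync, while personal gateway.db and s2ag.db do not.
mac = {"core", "personal", "work"}
gcloud = {"core", "work", "reference:academic", "reference:crosswalk"}
registry = {
    "entities.db": "core",            # identity hub, syncs everywhere
    "gateway.db": "personal",         # never leaves Mac/Exoscale
    "google_work.db": "work",
    "s2ag.db": "reference:academic",  # active on GCloud only
}
print(sync_set(mac, gcloud, registry))  # ['entities.db', 'google_work.db']
```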
Sync flow:
- entities.db (core) syncs across all three — it's the identity hub
- personal databases live on Mac + Exoscale, never on GCloud
- work databases live on Mac + GCloud, not on Exoscale
- reference databases live on GCloud, pulled to Mac on demand
Enrichment flow:
- Enrichment scripts run on GCloud (where reference DBs live)
- They read s2ag.db + wikidata.db, write to entities.db
- entities.db syncs back to Exoscale and Mac
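The enrichment read/write shape can be sketched with SQLite's `ATTACH`. The `entity_identifiers` table is named in this document; everything else here (the `entities` and `papers` schemas, the title-match join, the function name) is invented for illustration.

```python
import sqlite3

def enrich_dois(entities_path: str, reference_path: str) -> int:
    """Attach a reference DB and copy matched DOIs into the core identity graph.

    Returns the number of rows inserted. Hypothetical schema: entities(id, name),
    ref.papers(title, doi), entity_identifiers(entity_id, scheme, value).
    """
    con = sqlite3.connect(entities_path)
    con.execute("ATTACH DATABASE ? AS ref", (reference_path,))
    cur = con.execute(
        """
        INSERT OR IGNORE INTO entity_identifiers (entity_id, scheme, value)
        SELECT e.id, 'doi', p.doi
        FROM entities e JOIN ref.papers p ON p.title = e.name
        """
    )
    con.commit()
    inserted = cur.rowcount
    con.close()
    return inserted
```

The key point the sketch captures: the reference database is only ever read, and the write lands in `entities.db`, which then syncs back to the other hosts.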
# Vault Integration
The `databases` table in `vault.db` gains a `branch` column:

```sql
ALTER TABLE databases ADD COLUMN branch TEXT;
```

Values: `core`, `personal`, `work`, `reference:academic`, `reference:music`, `reference:cultural`, `reference:crosswalk`, `reference:investigative`.
The vault refresh scan (already built) uses the branch field to:
- Report health only for active branches
- Flag databases present on disk but not in the manifest (orphaned)
- Flag databases in the manifest but missing from disk (needed)
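The orphaned/needed checks reduce to set differences. A minimal sketch, assuming a `{db_name: branch}` registry read from `vault.db`, the flattened set of active branch tags, and the set of database files actually found on disk (the `audit` function and return shape are illustrative):

```python
def audit(registry: dict[str, str], on_disk: set[str],
          active: set[str]) -> dict[str, set[str]]:
    """Compare expected databases (registry filtered by active branches) to disk."""
    expected = {db for db, branch in registry.items() if branch in active}
    return {
        "orphaned": on_disk - expected,  # present on disk, not expected by the manifest
        "needed": expected - on_disk,    # expected by the manifest, missing from disk
    }
```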
# Forkability Revisited
Architecture 02's two-layer model becomes three:
| Layer | What | Portable? | Example |
|---|---|---|---|
| System | Skills, patterns, agents, ontology | Yes — clone and use | .claude/, patterns/, ontology/ |
| Data: Core | Identity graph, vault, coordination | Partially — schema portable, data instance-specific | entities.db, vault.db |
| Data: Branches | Domain-specific databases + projects | No — selected per install, populated per user | s2ag.db, google_work.db, music.db |
A fresh fork starts with: system layer + empty core databases + a manifest with only core = true. The user adds branches as they need them. Each branch brings its own ingest scripts, sync configuration, and skill extensions.
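Under the manifest schema shown earlier, a fresh fork's `data/manifest.toml` would reduce to something like this (the install name is a placeholder):

```toml
[install]
name = "haak-fork"   # chosen by the new user
host = "mac"

[branches]
core = true          # the only branch a fresh fork starts with
```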
# Branch Dependencies
Some branches depend on others:
| Branch | Requires |
|---|---|
| core | — |
| personal | core |
| work | core |
| reference:academic | core (for entity linking) |
| reference:music | core |
| reference:cultural | core |
| reference:crosswalk | core, plus at least one other reference branch to be useful |
| reference:investigative | core |
The enrichment pipeline (architecture 25) operates across branches: it reads from reference branches and writes to core (entities.db). This is the only cross-branch write path.
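The dependency table is simple enough to validate mechanically. A sketch, assuming the flattened branch tags used above; the `DEPENDS` mapping mirrors the table, and the crosswalk check is emitted as a separate advisory since it is a usefulness condition, not a hard requirement:

```python
# Hard requirements per branch, mirroring the table above.
DEPENDS = {
    "core": set(),
    "personal": {"core"},
    "work": {"core"},
    "reference:academic": {"core"},
    "reference:music": {"core"},
    "reference:cultural": {"core"},
    "reference:crosswalk": {"core"},
    "reference:investigative": {"core"},
}

def validate(active: set[str]) -> list[str]:
    """Return a list of problems; empty means the branch set is consistent."""
    problems = []
    for branch in sorted(active):
        for dep in sorted(DEPENDS.get(branch, set()) - active):
            problems.append(f"{branch} requires {dep}")
    # Advisory: crosswalk only pays off alongside another reference branch.
    if "reference:crosswalk" in active and not any(
        b.startswith("reference:") and b != "reference:crosswalk" for b in active
    ):
        problems.append("reference:crosswalk is only useful with another reference branch")
    return problems
```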
architecture · 26 · domain-branches · 2026-03-15 · zach + claude
Architecture 26 — Domain Branches — 2026 — Zachary F. Mainen / HAAK