Architecture 02 (forkability) distinguishes two layers: the system layer (portable) and the data layer (instance-specific). This document subdivides the data layer into domain branches — independently selectable bundles of databases, projects, and configuration that different installations opt into. A branch is not a git branch. It is a domain in the ontological sense: a context that determines which materials exist and which services run.
# The Problem
HAAK's data layer has grown from a handful of project directories to 25+ databases totaling 10+ GB, soon 500+ GB with reference corpora. These databases serve radically different purposes: some are private communications, some are institutional work products, some are public reference data. They belong to different contexts and different people, and they demand different access policies and different infrastructure.
A fork of HAAK — someone cloning the repo to run their own instance — should not need to install Semantic Scholar's 200 GB academic graph if they're a musician. A scientist should not need Discogs. Everyone needs identity resolution. No one outside the original author needs their WhatsApp history.
The data layer is not one thing. It is a collection of domains, each independently present or absent.
# Branch Definitions
## Core
Present in every HAAK install. The coordination infrastructure — without these, the system cannot maintain identity, track state, or discover its own contents.
| Database | What | Size |
|---|---|---|
| entities.db | Identity graph, belongings, entity_identifiers | 283 MB |
| vault.db | Credentials, services, database registry | 128 KB |
| todos.db | Action items | 704 KB |
| files.db | Filesystem index | 63 MB |
| threads.db | Conceptual thread graph | 2 MB |
| storage.db | Cloud storage index | 20 KB |
Plus: patterns/, .claude/, CLAUDE.md, ontology/, foundations/ — the system layer from architecture 02.
## Personal
Private life. Belongs to the individual, never shared, never on institutional infrastructure. Deploys to: personal server (for always-on access) and personal machine.
| Database | What | Size |
|---|---|---|
| contacts.db | Deduplicated contacts | 3 MB |
| gateway.db | Unified messaging (all channels) | 100 MB |
| whatsapp/messages.db | WhatsApp history | 93 MB |
| signal/messages.db | Signal history | varies |
| matrix/messages.db | Matrix history | varies |
| google_personal.db | Personal email + Drive | 604 MB |
| notes.db | Apple Notes | 56 MB |
| arc.db | Browser history | 63 MB |
| music.db | Music library + DJ sets | 4 MB |
| spotify.db | Listening history | 256 KB |
| events.db | Calendar events | 2 MB |
| conversations.db | ChatGPT archive | 45 MB |
Plus: personal projects, personal personas, any strategy/ docs that are life-planning rather than work.
## Work
Professional/institutional context. Bound to a specific employer or collaboration. Deploys to: work infrastructure (institutional cloud, lab server) and work machine. Credentials come from the institution.
| Database | What | Size |
|---|---|---|
| google_work.db | Work email + Drive | 3.7 GB |
| papers.db | Curated paper collection (eLife corpus) | 481 MB |
| repos.db | GitHub repo mirror | 192 KB |
| transcripts.db | Meeting transcripts | 10 MB |
Plus: work projects (manuscripts, grants, reviews), work personas (colleagues, reviewers), lab-specific configuration.
The work branch is institution-specific. A Champalimaud researcher and a CSHL researcher would have different work branches with different Google accounts, different repos, different paper collections. The system layer is the same; the work data is local to each institution.
## Reference
Large public corpora. Opt-in by topic. Each sub-branch is independently installable. Deploys to: wherever has the disk and compute (typically a cloud instance, since these are too large for most laptops to carry permanently).
| Sub-branch | Databases | Combined size | Source |
|---|---|---|---|
| reference:academic | s2ag.db, dblp.db | ~205 GB | Semantic Scholar, DBLP |
| reference:music | discogs.db, musicbrainz.db | ~45 GB | Discogs, MusicBrainz |
| reference:cultural | imdb.db | ~10 GB | IMDb |
| reference:crosswalk | wikidata.db | ~70 GB | Wikidata (filtered) |
| reference:investigative | public-archives.db, reference.db | ~17 GB | DOJ, WikiLeaks, ICIJ |
Reference branches are read-mostly. They get bulk-ingested on a schedule (weekly for S2AG, monthly for Discogs, etc.) and queried by enrichment scripts and agents. They are the library — public knowledge loaded locally for fast access.
# Branch Manifest
Each HAAK installation declares its active branches in a manifest. The manifest lives in data/manifest.toml:
```toml
[install]
name = "haak-zach"   # unique install name
host = "mac"         # mac, exoscale, gcloud, ...

[branches]
core = true          # always true
personal = true
work = true
reference = ["academic", "music", "crosswalk", "investigative"]
```
Scripts, hooks, and skills read the manifest to determine what's available:
- `check-services.sh` only monitors databases in active branches
- `sync-dbs.sh` only syncs databases in active branches
- `/search` only queries databases in active branches
- `resolve_identities.py` only links sources in active branches
- Ingest scripts skip branches not in the manifest
# Deployment Topology
For the current HAAK instance (Zach's):
| Host | Branches | Notes |
|---|---|---|
| Mac (haak-zach-mac) | core, personal, work, reference (on-demand) | |
| Exoscale (haak-zach-exo) | core, personal | gateway, viewer, bridges always on |
| GCloud (haak-zach-gcloud) | core, work, reference:academic, reference:music, reference:crosswalk, reference:investigative | |
Sync tool: `scripts/sync-dbs.sh`, a branch-aware rsync wrapper. It reads `manifest.toml` on both source and target, computes the branch intersection, and syncs only the matching databases. Runs via cron every 5 minutes (`sync-dbs.sh --all`); status view: `sync-dbs.sh --status`.
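The intersection rule can be expressed in a few lines. This is a sketch of the selection logic only (the actual tool is a shell script); the registry mapping and branch sets below are illustrative, with branch tags already flattened into `reference:<topic>` form.

```python
def sync_set(src: set[str], dst: set[str], registry: dict[str, str]) -> list[str]:
    """Databases eligible to sync: their branch must be active on BOTH hosts."""
    common = src & dst
    return sorted(db for db, branch in registry.items() if branch in common)

# Example: Mac and GCloud share only core and work, so entities.db and
# google_work.db sync, while personal gateway.db and s2ag.db do not.
mac = {"core", "personal", "work"}
gcloud = {"core", "work", "reference:academic", "reference:crosswalk"}
registry = {
    "entities.db": "core",            # identity hub, syncs everywhere
    "gateway.db": "personal",         # never leaves Mac/Exoscale
    "google_work.db": "work",
    "s2ag.db": "reference:academic",  # active on GCloud only
}
print(sync_set(mac, gcloud, registry))  # ['entities.db', 'google_work.db']
```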
Sync flow:
- entities.db (core) syncs across all three — it's the identity hub
- personal databases live on Mac + Exoscale, never on GCloud
- work databases live on Mac + GCloud, not on Exoscale
- reference databases live on GCloud, pulled to Mac on demand
Enrichment flow:
- Enrichment scripts run on GCloud (where reference DBs live)
- They read s2ag.db + wikidata.db, write to entities.db
- entities.db syncs back to Exoscale and Mac
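The enrichment read/write shape can be sketched with SQLite's `ATTACH`. The `entity_identifiers` table is named in this document; everything else here (the `entities` and `papers` schemas, the title-match join, the function name) is invented for illustration.

```python
import sqlite3

def enrich_dois(entities_path: str, reference_path: str) -> int:
    """Attach a reference DB and copy matched DOIs into the core identity graph.

    Returns the number of rows inserted. Hypothetical schema: entities(id, name),
    ref.papers(title, doi), entity_identifiers(entity_id, scheme, value).
    """
    con = sqlite3.connect(entities_path)
    con.execute("ATTACH DATABASE ? AS ref", (reference_path,))
    cur = con.execute(
        """
        INSERT OR IGNORE INTO entity_identifiers (entity_id, scheme, value)
        SELECT e.id, 'doi', p.doi
        FROM entities e JOIN ref.papers p ON p.title = e.name
        """
    )
    con.commit()
    inserted = cur.rowcount
    con.close()
    return inserted
```

The key point the sketch captures: the reference database is only ever read, and the write lands in `entities.db`, which then syncs back to the other hosts.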
# Vault Integration
The `databases` table in `vault.db` gains a `branch` column:

```sql
ALTER TABLE databases ADD COLUMN branch TEXT;
```

Values: `core`, `personal`, `work`, `reference:academic`, `reference:music`, `reference:cultural`, `reference:crosswalk`, `reference:investigative`.
The vault refresh scan (already built) uses the branch field to:
- Report health only for active branches
- Flag databases present on disk but not in the manifest (orphaned)
- Flag databases in the manifest but missing from disk (needed)
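The orphaned/needed checks reduce to set differences. A minimal sketch, assuming a `{db_name: branch}` registry read from `vault.db`, the flattened set of active branch tags, and the set of database files actually found on disk (the `audit` function and return shape are illustrative):

```python
def audit(registry: dict[str, str], on_disk: set[str],
          active: set[str]) -> dict[str, set[str]]:
    """Compare expected databases (registry filtered by active branches) to disk."""
    expected = {db for db, branch in registry.items() if branch in active}
    return {
        "orphaned": on_disk - expected,  # present on disk, not expected by the manifest
        "needed": expected - on_disk,    # expected by the manifest, missing from disk
    }
```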
# Forkability Revisited
Architecture 02's two-layer model becomes three:
| Layer | What | Portable? | Example |
|---|---|---|---|
| System | Skills, patterns, agents, ontology | Yes — clone and use | .claude/, patterns/, ontology/ |
| Data: Core | Identity graph, vault, coordination | Partially — schema portable, data instance-specific | entities.db, vault.db |
| Data: Branches | Domain-specific databases + projects | No — selected per install, populated per user | s2ag.db, google_work.db, music.db |
A fresh fork starts with: system layer + empty core databases + a manifest with only core = true. The user adds branches as they need them. Each branch brings its own ingest scripts, sync configuration, and skill extensions.
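Under the manifest schema shown earlier, a fresh fork's `data/manifest.toml` would reduce to something like this (the install name is a placeholder):

```toml
[install]
name = "haak-fork"   # chosen by the new user
host = "mac"

[branches]
core = true          # the only branch a fresh fork starts with
```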
# Branch Dependencies
Some branches depend on others:
| Branch | Requires |
|---|---|
| core | — |
| personal | core |
| work | core |
| reference:academic | core (for entity linking) |
| reference:music | core |
| reference:cultural | core |
| reference:crosswalk | core, plus at least one other reference branch to be useful |
| reference:investigative | core |
The enrichment pipeline (architecture 25) operates across branches: it reads from reference branches and writes to core (entities.db). This is the only cross-branch write path.
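The dependency table is simple enough to validate mechanically. A sketch, assuming the flattened branch tags used above; the `DEPENDS` mapping mirrors the table, and the crosswalk check is emitted as a separate advisory since it is a usefulness condition, not a hard requirement:

```python
# Hard requirements per branch, mirroring the table above.
DEPENDS = {
    "core": set(),
    "personal": {"core"},
    "work": {"core"},
    "reference:academic": {"core"},
    "reference:music": {"core"},
    "reference:cultural": {"core"},
    "reference:crosswalk": {"core"},
    "reference:investigative": {"core"},
}

def validate(active: set[str]) -> list[str]:
    """Return a list of problems; empty means the branch set is consistent."""
    problems = []
    for branch in sorted(active):
        for dep in sorted(DEPENDS.get(branch, set()) - active):
            problems.append(f"{branch} requires {dep}")
    # Advisory: crosswalk only pays off alongside another reference branch.
    if "reference:crosswalk" in active and not any(
        b.startswith("reference:") and b != "reference:crosswalk" for b in active
    ):
        problems.append("reference:crosswalk is only useful with another reference branch")
    return problems
```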
architecture · 26 · domain-branches · 2026-03-15 · zach + claude
Architecture 26 — Domain Branches — 2026 — Zachary F. Mainen / HAAK