HAAK is moving from ad-hoc mixed filesystem/database storage to a principled architecture in which the filesystem is authoritative for all HAAK-originated data and databases are derived build artifacts. This document argues for that architecture from ontological first principles, specifies its taxonomy, describes the build system, and lays out the migration plan.
#1. The Principle
The ontology establishes that the filesystem is the materialization of the situation-nesting structure (Definition S4 of ontology/12). A project directory materializes a project-situation. A manuscript file materializes the intellectual act of writing. A review round directory materializes the review episode. These are provenance-quality belongings: the file belongs to the situation it records, with quality "materialization."
The Library Theorem (Foundation 02) grounds the complementary claim: hierarchical indexed retrieval over this materialized structure costs O(log N) per access; scanning a flat store costs Ω(N). The index.md hierarchy is the mechanism that makes this bound achievable. Databases do not participate in this hierarchy. They are opaque blobs to the index system, invisible to navigating agents, and legible only through queries that presuppose knowledge of their schema. A database is a compiled artifact — useful for cross-cutting access, but not a place where knowledge lives.
The principle follows: for all data that HAAK originates, the filesystem is the authority and the database is a derived artifact. Files are the data. The directory hierarchy is the index. SQLite databases are compiled views built by a deterministic build step, analogous to how a compiler produces object code from source. The source is versioned, human-readable, agent-navigable, and durable. The compiled artifact is .gitignored, rebuildable, and disposable.
This is not a novel architecture. It is the "baked data" pattern described by Simon Willison: source files are authority, a build step produces SQLite as a read-optimized projection. What this document adds is the ontological grounding — why the filesystem is the right authority — and the taxonomy that determines which data falls under this principle and which does not.
#2. The Taxonomy
Not all databases in HAAK should become filesystem-authoritative. The right distinction tracks where the data originates and who holds the ground truth.
#Category 1: HAAK-Originated
Data that HAAK itself creates, curates, and owns. The filesystem representation — markdown, YAML, JSONL — is the source of truth. A build step compiles it into SQLite for queryable cross-cutting access. If the database is deleted, make db rebuilds it from the files. If a file is deleted, that data is gone.
Current databases in this category: books.db (4 books, 64K), lab.db (123 people, 77 projects, 874 documents; 524K), letters.db (874 letters, 2M), music.db (6,786 tracks, 8M), personal.db (4 people, 27 documents, 13 insurance policies; 60K), conversations.db (1,205 conversations, 15,734 messages; 45M), sessions.db (40 sessions, 128K), vault.db (19 services, 22 credentials; 128K), threads.db (286 threads, 2.1M), transcripts.db (196 transcripts, 27,347 segments; 10M), todos.db (467 items, 680K), events.db (1,904 events, 1.1M), photos.db (10 photos, 44K), web_clips.db (0 clips, 12K).
Also: per-session .db files under data/sessions/ (hundreds of Claude Code session databases). And agent-runner.db (192K), which tracks agent sessions and ledger entries.
#Category 2: External Mirrors
Data that originates in an external service. HAAK maintains a local cache for fast querying, but the external service is authoritative. Sync scripts pull fresh data; the local database is the cache, not the source.
Current databases: googlework.db (3.6G — Gmail messages, Drive files, Calendar), googlepersonal.db (604M), contacts.db (8.4M — extracted from Gmail headers + Google Contacts), repos.db (136K — GitHub repository metadata), health.db (2.3G — Apple Health export), arc.db (64M — browser history), notes.db (80M — Apple Notes mirror), mirror.db (7.1M — Apple Reminders/Calendar state).
These stay as databases. The filesystem-authority principle does not apply because the files would not be the source of truth — the external service is. A markdown file claiming to represent a Gmail message would be a declaration about external mutable state, not the data itself. This is the Terraform lesson: filesystem authority works when files ARE the data. It breaks when files are declarations about state that lives elsewhere. External mirrors need databases precisely because they store the binding between HAAK's view and the external identity.
#Category 3: Research Corpora
Large ingested datasets from external sources, used for research queries but not originated by HAAK. The data is too voluminous for per-record filesystem representation and too static to need file-level versioning.
Current databases: public-archives.db (5.2G — 2.4M documents, 1.6M entities from DOJ, WikiLeaks, ICIJ, DDoSecrets), elife.db (482M — eLife journal corpus), papers.db (449M — academic paper metadata), files.db (81M — file metadata index).
These stay as databases. They are ingested, not originated. Their build step is an ingest script, not a filesystem scan. The ontological status is clear: they are materials that belong to the HAAK domain with quality "reference" — library holdings, not authored works.
#Category 4: Agent Infrastructure
Data generated by the agent runtime itself — session transcripts, dispatch logs, coordination state. Some of this is already filesystem-native (board.md, index.md files, agent mandates). The rest should become filesystem-native, with databases baked from the files.
Current databases: console.db (24K), gateway.db (112M — messaging gateway state), entities.db (307M — the entity graph itself).
The entity graph (entities.db) deserves special attention. Under the new architecture it becomes a baked artifact — built from filesystem belongings, not maintained directly. The index.md files, YAML frontmatter, and explicit belonging declarations in the filesystem are the source; entities.db is compiled from them. This is architecturally coherent: the entity graph is a projection of the filesystem state, exactly as architecture 37 already describes.
The gateway database is an operational store for a running daemon. It mirrors external messaging state (WhatsApp, Signal, email) and therefore falls under Category 2, not Category 4.
#Summary Table
| Category | Authority | DB role | Rebuild command | Examples |
|---|---|---|---|---|
| HAAK-originated | Filesystem (md/yaml/jsonl) | Baked view | make db | books, lab, letters, music, personal, conversations, threads, transcripts, todos, events, vault, sessions, photos, web_clips, agent-runner |
| External mirrors | External service | Local cache | sync scripts | googlework, googlepersonal, contacts, repos, health, arc, notes, mirror, gateway |
| Research corpora | External datasets | Ingested data | ingest scripts | public-archives, elife, papers, files |
| Agent infrastructure | Filesystem (boards, indices) | Baked view | indexer/scribe | entities.db, console.db |
#Empty Databases to Delete
Six databases at 0 bytes serve no purpose: chatgpt.db, gmail.db, dispatch.db, mailbox.db, agent_mailbox.db, agents.db. These should be deleted outright. Where functionality was planned (mailbox, dispatch), the implementation has moved elsewhere (the mailbox skill, the gateway).
#3. The Build System
The baked data pattern requires a build system: a deterministic process that reads filesystem source and produces SQLite output. The design follows three constraints from the HAAK principles: no hidden state (principle 1 — the build is reproducible from files alone), documents not code (principle 7 — the source is markdown/YAML, not programming constructs), and single source of truth (principle 8 — each datum lives in exactly one file).
#Architecture
A Makefile at the project root declares each baked database as a target. Each target has a Python builder script that reads filesystem source and produces SQLite. The builder is idempotent: running it twice on the same source produces byte-identical output (modulo SQLite internal ordering, which is normalized by sorting before insert).
```makefile
data/books.db: data/books/*.yaml
	python infra/scripts/bake_books.py

data/lab.db: personas/*.md projects/lab/**/*.md
	python infra/scripts/bake_lab.py

data/entities.db: **/index.md **/*.md
	python infra/scripts/bake_entities.py
```
The make db target rebuilds all baked databases. Individual targets rebuild specific databases. The build step is also validation — malformed files break the build, providing immediate feedback.
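Each builder follows the same shape: delete the artifact, recreate the schema, walk the source files in sorted order, insert. A minimal sketch of what a builder like bake_books.py could look like; the books table columns and file layout are illustrative assumptions, not the actual schema:

```python
# Minimal sketch of a builder in the bake_books.py style; the books table
# columns and file layout are illustrative assumptions, not the real schema.
import sqlite3
from pathlib import Path

import yaml  # PyYAML

SRC = Path("data/books")
DST = Path("data/books.db")

def bake() -> None:
    DST.unlink(missing_ok=True)                      # always rebuild from scratch
    db = sqlite3.connect(DST)
    db.execute(
        "CREATE TABLE books (slug TEXT PRIMARY KEY, title TEXT NOT NULL, "
        "author TEXT NOT NULL, status TEXT NOT NULL, rating INTEGER)"
    )
    # Sorting the source glob keeps insert order stable across runs.
    for path in sorted(SRC.glob("*.yaml")):
        record = yaml.safe_load(path.read_text())
        db.execute(
            "INSERT INTO books VALUES (?, ?, ?, ?, ?)",
            (path.stem, record["title"], record["author"],
             record["status"], record.get("rating")),
        )
    db.commit()
    db.close()

if __name__ == "__main__":
    bake()
```

Sorting the glob before insert is what keeps repeated builds producing identical row order, the idempotence property described above.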
#Invariants
The build system maintains several invariants that together guarantee the architecture's coherence:
No direct DB edits. All mutations flow through files. An agent that wants to add a book writes a YAML file; the build step produces the database row. An agent that wants to update a person's role edits the persona markdown; the build step reflects the change. The database is never opened for INSERT, UPDATE, or DELETE by anything other than the builder.
Build is validation. Each builder validates source files against their schema (see section 4) before inserting. A YAML file missing a required field breaks the build. A markdown file with malformed frontmatter breaks the build. The build step surfaces errors that would otherwise accumulate silently in direct-to-database writes.
DB is .gitignored. Baked databases are not committed to version control. They are ephemeral artifacts, rebuilt on any machine from the committed source files. This eliminates merge conflicts on binary files, reduces repository size, and makes the source/artifact distinction physically enforced.
Agents query the DB, read the files. The baked database serves cross-cutting queries: "which people are on more than one project," "which threads reference this paper," "what events happened in March." Individual content retrieval — reading a letter, reviewing a transcript, checking a todo — goes through the filesystem and the index hierarchy. The database is a lens; the filesystem is the library.
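As an illustration of this division of labor, a cross-cutting query like "which people are on more than one project" runs against the baked database. The table and column names below are assumptions about what bake_lab.py might produce:

```python
# Hypothetical cross-cutting query against the baked lab.db; the
# project_people table and its columns are assumed, not taken from the schema.
import sqlite3

db = sqlite3.connect("data/lab.db")
rows = db.execute(
    """
    SELECT person_id, COUNT(DISTINCT project_id) AS n_projects
    FROM project_people
    GROUP BY person_id
    HAVING COUNT(DISTINCT project_id) > 1
    ORDER BY n_projects DESC
    """
).fetchall()
for person_id, n_projects in rows:
    print(f"{person_id}: {n_projects} projects")
db.close()
```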
#4. Schema Enforcement
Without schema enforcement, filesystem authority degrades into filesystem chaos. Three layers prevent drift, each catching errors at a different stage.
#Schema Files
A .schema.yaml per collection declares required fields, optional fields, field types, and controlled vocabularies. The schema file lives alongside its collection — data/books/.schema.yaml governs all files in data/books/. Schema files are human-readable, filesystem-native, versioned in git, and serve as documentation of the collection's structure.
```yaml
# data/books/.schema.yaml
required:
  title: string
  author: string
  status: enum [reading, read, want-to-read, abandoned]
optional:
  isbn: string
  rating: integer [1-5]
  date_read: date
  tags: list[string]
  notes: string
```
The controlled vocabularies from architecture 37 — domain, concern, status — become the seed vocabulary for index-level schemas. Collection-specific vocabularies extend them for domain-specific fields.
#Build-Time Validation
Each builder script validates every source file against its schema before inserting into SQLite. Validation checks: required fields present, field types correct, enum values within vocabulary, date formats parseable. A single validation failure stops the build and reports the offending file, field, and violation. This is the safety net — no malformed data enters the baked database.
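A sketch of what this validation pass might look like for a YAML collection, handling only a few of the schema's field types (string, integer, enum); the full type vocabulary and error handling are left open:

```python
# Sketch of build-time validation against a collection's .schema.yaml.
# Only string, integer, and enum specs are handled; dates, lists, and ranges
# are omitted from this sketch.
from pathlib import Path

import yaml

def check_type(value, spec: str) -> bool:
    if spec == "string":
        return isinstance(value, str)
    if spec.startswith("integer"):                    # range suffix ignored here
        return isinstance(value, int)
    if spec.startswith("enum"):
        allowed = spec[spec.index("[") + 1 : spec.index("]")].split(", ")
        return value in allowed
    return True                                       # unhandled specs pass in this sketch

def validate_file(path: Path, schema: dict) -> list[str]:
    record = yaml.safe_load(path.read_text()) or {}
    errors = []
    for field, spec in schema.get("required", {}).items():
        if field not in record:
            errors.append(f"{path}: missing required field '{field}'")
        elif not check_type(record[field], spec):
            errors.append(f"{path}: '{field}' does not match {spec}")
    for field, spec in schema.get("optional", {}).items():
        if field in record and not check_type(record[field], spec):
            errors.append(f"{path}: '{field}' does not match {spec}")
    return errors

def validate_collection(src: Path) -> None:
    schema = yaml.safe_load((src / ".schema.yaml").read_text())
    errors = [e for f in sorted(src.glob("*.yaml")) for e in validate_file(f, schema)]
    if errors:
        raise SystemExit("\n".join(errors))           # one failure stops the build
```

The same function could back the write-time check in the next subsection, which would keep the two layers from drifting apart.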
#Write-Time Validation
The /write skill validates frontmatter against the schema file before writing. This catches errors at authoring time, before they reach the build. Agents that bypass /write — writing directly via the Write tool — bypass this check, but they are caught at build time. The two layers are complementary: write-time is fast feedback for well-behaved agents; build-time is the backstop for everything else.
#5. Index Enforcement
The index hierarchy is the mechanism that makes the Library Theorem's O(log N) bound achievable. If indices drift from the filesystem — files exist without index entries, index entries point to missing files — retrieval degrades toward O(N). Three mechanisms keep indices honest.
#Post-Write Hook
A Claude Code hook fires after file creation. It checks whether the parent directory's index.md contains an entry for the new file. If not, it warns the agent immediately. The hook does not auto-repair — inserting a meaningful description requires judgment — but it prevents the common failure mode of creating a file and forgetting to index it. The hook is deterministic (a file check, not a judgment call), making it appropriate for a hook rather than an agent.
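A sketch of the check itself, independent of how the hook plumbing delivers the written path (a CLI argument is assumed here):

```python
# Sketch of the post-write index check. How the hook receives the written
# path depends on the hook configuration; a CLI argument is assumed here.
import sys
from pathlib import Path

def check_indexed(written: Path) -> int:
    index = written.parent / "index.md"
    if written.name == "index.md":
        return 0                                 # the index itself needs no entry
    if not index.exists():
        print(f"warning: {written.parent} has no index.md", file=sys.stderr)
        return 1
    if written.name not in index.read_text():
        print(f"warning: {written} is not listed in {index}", file=sys.stderr)
        return 1
    return 0

if __name__ == "__main__":
    sys.exit(check_indexed(Path(sys.argv[1])))
```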
#Build-Time Audit
The make db step walks every directory, compares the filesystem against index.md entries, and reports two kinds of inconsistency:
Orphans — files on disk without a corresponding index entry. These are invisible to navigating agents and violate the Library Theorem's premise (all items indexed). The build reports them; a human or agent adds the missing entry.
Ghosts — index entries without a corresponding file. These send navigating agents to dead ends. The build reports them; the stale entry is removed.
The audit is advisory, not blocking. A few orphans should not prevent the database from building. But the report is prominent — appended to the build output and optionally posted to the board — so drift is visible.
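A sketch of the audit walk; how index entries reference files is an assumption here (a filename appearing anywhere in index.md counts as indexed, and markdown link targets are read as the index's claims about files):

```python
# Sketch of the orphan/ghost audit run as part of make db. The index entry
# format is an assumption: filenames mentioned anywhere in index.md count as
# indexed, and markdown link targets are treated as the index's claims.
import re
from pathlib import Path

SKIP = {"index.md", ".schema.yaml"}

def audit(root: Path) -> tuple[list[str], list[str]]:
    orphans, ghosts = [], []
    for index in sorted(root.rglob("index.md")):
        directory = index.parent
        text = index.read_text()
        for name in sorted(p.name for p in directory.iterdir() if p.is_file()):
            if name not in SKIP and name not in text:
                orphans.append(f"orphan: {directory / name}")        # on disk, not indexed
        for target in re.findall(r"\]\(([^)]+)\)", text):
            if "://" in target or target.startswith("#"):
                continue                                             # external links, anchors
            if not (directory / target).exists():
                ghosts.append(f"ghost: {index} -> {target}")         # indexed, not on disk
    return orphans, ghosts

if __name__ == "__main__":
    orphans, ghosts = audit(Path("."))
    for line in orphans + ghosts:
        print(line)                                                  # advisory, not blocking
```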
#Index as Manifest
A conceptual inversion that strengthens the architecture: treat the index not just as a catalog of what exists, but as a manifest of what should exist. Under this interpretation, an orphan file (present on disk, absent from the index) is not merely undiscoverable — it is unauthorized. The index declares the collection's membership; the filesystem instantiates it. Files that appear without index entries are anomalous and require review.
This inversion is not enforced mechanically — HAAK does not delete unindexed files. But it changes the default expectation. The question shifts from "did someone forget to index this file?" to "should this file exist, and if so, what is it?" The manifest interpretation aligns with Definition S4: the directory materializes a situation, and the index declares the situation's configuration. A file outside the index is a materialization outside the declared configuration.
#6. Migration Plan
Each HAAK-originated database must be migrated from database-authoritative to filesystem-authoritative. The migration follows a standard sequence: export current data to filesystem format, validate that the export is complete, build a new database from the export, diff against the original, and switch over. Priority is determined by a combination of simplicity (smaller databases first to build confidence) and impact (databases that agents query frequently).
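A sketch of the export and diff steps for a simple case like books.db; the table, columns, and backup location are illustrative assumptions:

```python
# Sketch of the export-and-diff steps of a migration, using books.db as the
# example. Table, columns, and file layout are illustrative assumptions.
import sqlite3
from pathlib import Path

import yaml

def export_books(db_path: str, out_dir: Path) -> None:
    out_dir.mkdir(parents=True, exist_ok=True)
    db = sqlite3.connect(db_path)
    db.row_factory = sqlite3.Row
    for row in db.execute("SELECT slug, title, author, status, rating FROM books"):
        record = {k: row[k] for k in row.keys() if k != "slug" and row[k] is not None}
        (out_dir / f"{row['slug']}.yaml").write_text(yaml.safe_dump(record, sort_keys=True))
    db.close()

def same_content(old_db: str, new_db: str) -> bool:
    # Compare logical content rather than bytes: dump both databases as SQL.
    dumps = []
    for path in (old_db, new_db):
        db = sqlite3.connect(path)
        dumps.append(sorted(db.iterdump()))
        db.close()
    return dumps[0] == dumps[1]

# Typical sequence: export_books("data/books.db", Path("data/books")),
# rebuild with make db, then same_content against a backup of the original.
```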
#Phase 1: Trivial Migrations (week 1)
books.db (64K, 4 records) — Export each book to data/books/<slug>.yaml. Four files. Trivial builder. Immediate switchover.
personal.db (60K, 4 people + 27 documents + 13 insurance policies) — Export to data/personal/people/<name>.yaml, data/personal/documents/<slug>.yaml, data/personal/insurance/<slug>.yaml. Builder reads the tree, produces SQLite. The data is sensitive; files should be in .gitignore (they already are under data/).
web_clips.db (12K, 0 records) — Empty. Delete the database. Create data/web_clips/ with a .schema.yaml for future clips. Builder produces a database from whatever files exist.
photos.db (44K, 10 records) — Export to data/photos/<slug>.yaml or integrate photo metadata into the media directory structure. Trivial.
vault.db (128K, 19 services + 22 credentials) — Export to data/vault/services/<slug>.yaml, data/vault/credentials/<slug>.yaml. Sensitive data requires encryption at rest; the files must be GPG-encrypted or the vault directory must be excluded from sync. The current vault.db is unencrypted SQLite, which is already a security concern — filesystem migration is an opportunity to add encryption.
#Phase 2: Structured Migrations (weeks 2-3)
todos.db (680K, 467 items) — Export to data/todos/<id>.yaml or a single data/todos.jsonl file. JSONL is appropriate here: todos are uniform, numerous, and often queried in bulk. The builder reads the JSONL and produces SQLite. Dependencies (tododeps) and tags (todotags) embed in the YAML/JSONL record.
events.db (1.1M, 1,904 events) — Export to data/events/<year>/<slug>.yaml or a data/events.jsonl. Events are calendar-like: date, title, participants, location. JSONL suits the volume and uniformity.
sessions.db (128K, 40 sessions) — Export session metadata to data/sessions/index.jsonl. Per-session .db files under data/sessions/claude/ should be migrated to per-session directories with session.yaml (metadata) and transcript.md (content). This aligns with the session-as-situation model from ontology/12.
lab.db (524K, 123 people + 77 projects + 874 documents) — The richest migration in this phase. People become persona files (many already exist in personas/); projects become project directories (many already exist in projects/lab/); documents become markdown files within their project directories. The builder walks personas/, projects/lab/, and data/lab/ to produce the compiled database. Cross-references (project-person links, project-document links) are expressed as YAML frontmatter fields, not as separate join tables.
#Phase 3: Large Migrations (weeks 4-6)
threads.db (2.1M, 286 threads) — Export each thread to data/threads/<slug>.md with YAML frontmatter carrying thread metadata (canonical terms, waypoints, intersections). Thread bodies are markdown. References and citations embed in frontmatter. The builder produces SQLite with FTS for cross-thread search; a sketch of the FTS build appears after this phase's entries.
letters.db (2M, 874 letters) — Export each letter to data/letters/<year>/<slug>.md. Letter metadata (recipients, date, type) in YAML frontmatter; letter text in markdown body. The builder produces SQLite. A data/letters/ directory already exists; the migration moves authority from the single database file to a directory of files.
transcripts.db (10M, 196 transcripts + 27,347 segments) — Export each transcript to data/transcripts/<slug>/transcript.md with segments as markdown sections. Speaker metadata and tags in frontmatter. The builder produces SQLite with FTS for segment search. At 10M this is the first database where build time matters — expect 10-30 seconds.
music.db (8M, 6,786 tracks) — Export to data/music/tracks.jsonl (one line per track), data/music/playlists/<slug>.yaml, data/music/sets/<slug>.yaml. Track-level JSONL is appropriate given the volume and uniformity. The builder reads JSONL and YAML, produces SQLite with cross-references.
conversations.db (45M, 1,205 conversations + 15,734 messages) — The largest originated database. Export each conversation to data/conversations/<id>/conversation.yaml (metadata + summary) with messages.jsonl (message stream). At 1,205 conversations this produces ~1,200 directories. The builder reads the tree, produces SQLite with FTS. Build time will be 30-60 seconds. This migration should be validated carefully — conversations carry historical context that is difficult to reconstruct if lost.
agent-runner.db (192K) — Export session metadata to data/agent-runner/sessions/<id>.yaml, ledger entries to data/agent-runner/ledger.jsonl, turn history to per-session directories. The builder compiles these into SQLite. This aligns with the agent-runner's existing content-addressed ledger design.
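Several Phase 3 builders add full-text search. A sketch of what the FTS layer might look like for threads, assuming one markdown file per thread with YAML frontmatter; the paths, frontmatter fields, and table names are assumptions:

```python
# Sketch of a Phase 3 builder with full-text search, using threads as the
# example. Paths, frontmatter fields, and table names are assumptions.
import sqlite3
from pathlib import Path

import yaml

def split_frontmatter(text: str) -> tuple[dict, str]:
    # Expects "---\n<yaml>\n---\n<body>"; returns ({}, text) when absent.
    if text.startswith("---\n"):
        _, fm, body = text.split("---\n", 2)
        return yaml.safe_load(fm) or {}, body
    return {}, text

def bake_threads(src: Path = Path("data/threads"), dst: str = "data/threads.db") -> None:
    Path(dst).unlink(missing_ok=True)
    db = sqlite3.connect(dst)
    db.execute("CREATE TABLE threads (slug TEXT PRIMARY KEY, title TEXT, body TEXT)")
    db.execute("CREATE VIRTUAL TABLE threads_fts USING fts5(slug, title, body)")
    for path in sorted(src.glob("*.md")):
        meta, body = split_frontmatter(path.read_text())
        row = (path.stem, meta.get("title", path.stem), body)
        db.execute("INSERT INTO threads VALUES (?, ?, ?)", row)
        db.execute("INSERT INTO threads_fts VALUES (?, ?, ?)", row)
    db.commit()
    db.close()

# Cross-thread search then becomes:
#   SELECT slug FROM threads_fts WHERE threads_fts MATCH 'library theorem';
```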
#7. Duplicate Databases to Consolidate
Three groups of databases serve overlapping functions:

- sessions.db + per-session session.db files — consolidate into filesystem-native session directories.
- console.db (24K) + conversations.db (45M) — console conversations should merge into the conversations collection.
- mailbox.db (0B) + agent_mailbox.db (0B) + dispatch.db (0B) — all empty, all superseded by the mailbox skill. Delete all three.
#8. Precedents
The filesystem-authoritative architecture has precedents across several domains, each illustrating a different aspect of the pattern.
Beancount and hledger implement plain-text accounting: financial transactions are authored as text files, and reporting tools compile views (balance sheets, income statements) from those files. The text file is authoritative; the reports are derived. The accounting community has maintained this pattern for over a decade, demonstrating that even data with strong consistency requirements (double-entry bookkeeping) can be filesystem-authoritative.
Static site generators (Hugo, Jekyll, Eleventy) implement the baked data pattern for web publishing. Markdown files with YAML frontmatter are the source; the build step produces HTML with taxonomy pages, tag indices, and search indexes. The generated site is disposable — make clean && make rebuilds it. HAAK's baked databases are the same architecture applied to structured data rather than web content.
Obsidian operates as a markdown vault with derived indices. All knowledge lives in .md files; the application builds a graph view, backlink index, and search index at runtime. Vault portability — the ability to move the folder to another tool — is a stated design goal, enabled by filesystem authority.
Git itself demonstrates the pattern at the infrastructure level. Git's object store uses loose files (one file per object) that are periodically compiled into packfiles with derived .idx index files. The loose objects are authoritative; packfiles are a read-optimized projection. The .idx file is a baked artifact — delete it, and git index-pack rebuilds it from the packfile.
Terraform provides the cautionary counterexample. Terraform uses filesystem-native configuration files (.tf) to declare infrastructure, but the declared infrastructure lives in external services (AWS, GCP). The configuration file is not the data — it is a declaration about external mutable state. Terraform must maintain a separate state file (.tfstate) to track the binding between declaration and reality. This is precisely the situation that HAAK's external mirrors face: the data lives elsewhere, and the local representation is a cache, not an authority. The lesson is clear — filesystem authority works when files ARE the data; it breaks when files are declarations about data that lives somewhere else.
Derek Sivers' "tech independence" philosophy articulates the principle at a personal level: keep all important data in plain text files under your control. No vendor lock-in, no proprietary formats, no dependency on a running service to access your own records. HAAK's filesystem authority extends this from personal philosophy to system architecture.
#9. Belongings: Situations as the Locus
The filesystem-authority architecture established in sections 1–8 determines where data lives and how databases derive from it. This section addresses the complementary question: how do the relationships between entities — who participates in what, which domains apply, what plays which role — get represented in the filesystem? The answer follows directly from the ontology's treatment of belongings and axis-roles.
#The Principle
The ontology's central relational claim is R4: "Nothing IS an actor in absolute terms. Things PARTICIPATE AS actors in specific situations." Axis-roles — actor, method, material, domain — are qualities of belongings, not intrinsic properties of entities (ontology/02-relations, Definition R4). A person is not intrinsically a principal investigator; they participate as principal investigator in specific situations. The belonging carries the role; the entity carries only intrinsic properties (name, ORCID, institutional affiliation as a persistent fact).
This distinction maps cleanly onto the filesystem. Entity files — persona descriptions in personas/, method documents in patterns/ — carry intrinsic properties: who this person is, what this method does, independent of any particular engagement. Situation files — the index.md of a project, program, or episode directory — carry participatory belongings: who plays what role in this situation. The entity file says what something is. The situation file says what it does here.
#Three Loci
Belongings in the filesystem architecture distribute across three loci, each handling a distinct ontological mode.
Situation frontmatter carries participatory belongings — the declarations of who and what plays which role in a given situation. These are the belongings that vary across situations: Cazettes is a postdoc in the Mainen lab situation but an alumna in a future situation. The frontmatter format encodes them directly:
```yaml
type: situation
scale: program        # or project, episode
status: active
belongings:
  - entity: mainen-z
    quality: principal-investigator
    since: 2007
  - entity: cazettes-f
    quality: postdoc
    since: 2019
    until: 2025-06
domains:
  - champalimaud
  - neuroscience
```
Departed participants retain their entry with an until: date rather than being deleted. The situation's history is part of its configuration — the record of who participated and when is as much a part of the materialization as the record of what was produced.
The quality registry (ontology/quality-registry.yaml) carries the meta-quality relationships between qualities themselves — the quality graph described in ontology/02-relations. This is where principal-investigator is declared as an instance of participant, where sender is declared as the inverse of receiver, where member is declared as implied by principal-investigator. The registry is a small, stable file (hundreds of entries, rarely changing) that enables the inference machinery: implication chains, inverse traversal, transitivity. It is not a belonging store — it is the grammar that makes belonging stores interpretable.
Compositional belongings — part-of/contains relationships between entities — are structural rather than situational. A chapter belongs to a book. A track belongs to an album. A file belongs to a directory. These do not vary across situations; they are facts about the entity's composition. The bake step extracts them from the filesystem hierarchy itself (directories contain files) and from explicit frontmatter fields where non-hierarchical composition exists. They populate entities.db as belongings with composition-family qualities.
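A sketch of the compositional extraction, deriving part-of belongings directly from the tree; the quality name "part-of" is an assumption, standing in for whatever composition-family quality the registry defines:

```python
# Sketch of compositional-belonging extraction from the directory tree itself:
# each file belongs, with a composition-family quality, to the directory that
# contains it. The quality name "part-of" is an assumption.
from pathlib import Path

def compositional_belongings(root: Path):
    for path in sorted(root.rglob("*")):
        if path.is_file() and path.name != "index.md":
            yield {
                "entity": str(path.relative_to(root)),
                "belongs_to": str(path.parent.relative_to(root)),
                "quality": "part-of",
            }
```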
#Design Decisions
Several design choices follow from the ontological commitments and the filesystem-authority principle.
Use the specific quality directly. A belonging declares quality: principal-investigator, not quality: actor with a sub-field role: PI. The quality graph resolves the hierarchy — principal-investigator is an instance of participant, which is an instance of actor. Queries at any level of generality traverse the instance chain. Storing the most specific quality preserves information; the bake step and quality registry handle generalization.
Temporal bounds, never deletion. The since: and until: fields record when a belonging was active. A postdoc who departs gets until: 2025-06, not removal from the list. This follows from the materialization principle (S4 of ontology/12-situation-nesting): the situation file materializes the situation's full history, not just its current state. Deletion destroys provenance.
Filesystem hierarchy as sole inheritance mechanism. A project nested within a program inherits participants from the program through the directory hierarchy — the same mechanism that S3 (situation subsumption) describes. There is no parent: field in the frontmatter. Nesting is structural, not declared. Cross-domain affiliation — a project that participates in both champalimaud and neuroscience — uses the domains: list, which records domain-quality belongings without implying hierarchical containment.
Selective inheritance. Not all qualities propagate downward through situation nesting. Only qualities whose entry in the quality registry declares implies: participant inherit — a principal-investigator on the program is a participant in the program's projects, but a visiting-speaker at the program level does not automatically become a participant in every sub-project. The bake step resolves inheritance by walking the nesting hierarchy and the quality graph together, and flags inferred belongings separately from declared ones in the baked database. Agents can distinguish "declared here" from "inherited from parent" in query results. A sketch of this resolution follows these decisions.
type: situation distinguishes situation directories from structural directories. The frontmatter field type: situation marks a directory's index.md as carrying situation semantics. Directories without this marker — data/, patterns/, infra/ — are structural scaffolding, not situations. The bake step skips unmarked directories when extracting participatory belongings.
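A sketch of how the bake step might combine these decisions: resolve instance-of chains through the quality registry, skip directories without type: situation, and walk the nesting hierarchy to separate declared from inherited belongings. The registry field names (instance_of, implies), the frontmatter layout, and the choice to record inherited belongings as plain participant are assumptions:

```python
# Sketch of bake-step resolution over the quality registry and nesting
# hierarchy. Registry field names, frontmatter layout, and the decision to
# record inherited belongings with quality "participant" are assumptions.
from pathlib import Path

import yaml

def frontmatter(index_md: Path) -> dict:
    text = index_md.read_text()
    if not text.startswith("---\n"):
        return {}
    return yaml.safe_load(text.split("---\n", 2)[1]) or {}

def generalizations(quality: str, registry: dict) -> set[str]:
    # Instance-of chain: principal-investigator -> participant -> actor.
    chain, current = set(), quality
    while current and current not in chain:
        chain.add(current)
        current = registry.get(current, {}).get("instance_of")
    return chain

def inherits(quality: str, registry: dict) -> bool:
    # Only qualities whose registry entry implies participant propagate downward.
    return "participant" in registry.get(quality, {}).get("implies", [])

def situation_belongings(index_md: Path, registry: dict) -> list[dict]:
    meta = frontmatter(index_md)
    if meta.get("type") != "situation":
        return []                                     # structural directory: skipped
    declared = [dict(b, provenance="declared") for b in meta.get("belongings", [])]
    inherited = []
    for ancestor in index_md.parent.parents:          # walk the nesting hierarchy upward
        parent_index = ancestor / "index.md"
        if not parent_index.exists():
            continue
        for b in situation_belongings(parent_index, registry):
            if b["provenance"] == "declared" and inherits(b["quality"], registry):
                inherited.append(dict(b, quality="participant", provenance="inherited"))
    return declared + inherited
```

The generalizations function is what would let a query for actor match a stored principal-investigator, per the first design decision above.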
#Domains and Situations
Long-lived situations that play the domain role (S2 of ontology/12-situation-nesting) remain situations in their frontmatter typing. Champalimaud is a domain — it persists independently of any particular action within it. The Mainen lab is a situation with domain-like properties: it provides context for sub-projects, carries resources, and attracts participants, but it exists because of the ongoing research activity and would terminate if that activity ceased. The frontmatter records this: type: situation, scale: program, with domains: [champalimaud, neuroscience] declaring the persistent domains it draws context from. The distinction matters for the bake step: domains are always available as context; situations are available only within their temporal bounds.
#Wikilinks as Unqualified Belongings
Inline [[entity-id]] references in document bodies are extracted by the bake step as unqualified belongings with quality reference. These are navigational — they record that a document mentions an entity — but they are not the primary belonging mechanism. The structured frontmatter carries the ontological work: who participates as what, with temporal bounds and specific qualities. Wikilinks supplement this with a lightweight discovery layer, useful for "show me everything that mentions Cazettes" queries without requiring every mention to be a formal belonging declaration.
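Extraction is a regular-expression pass over document bodies; a sketch, with the [[entity-id|label]] variant included as an assumption:

```python
# Sketch of wikilink extraction into unqualified reference belongings.
# The [[entity-id|label]] variant is an assumption beyond the plain form.
import re
from pathlib import Path

WIKILINK = re.compile(r"\[\[([^\]|]+)(?:\|[^\]]*)?\]\]")

def reference_belongings(doc: Path):
    for entity_id in sorted(set(WIKILINK.findall(doc.read_text()))):
        yield {"document": str(doc), "entity": entity_id.strip(), "quality": "reference"}
```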
#Migration
The migration is incremental. The bake step tolerates index.md files without type: situation frontmatter — it warns and skips them. No existing file breaks. Migration proceeds one project at a time, starting with a dense subgraph: projects/lab/mainen-lab/ and its sub-projects, which have enough participants, temporal depth, and nesting to exercise the full representation model.
The first migration pass populates only participants and domains. The frontmatter schema accommodates all four axis-roles — actor, method, material, domain — but methods and materials are added to specific situations only when concrete queries demand them. Start with the belongings that agents actually need to resolve: who works on this project, and in what domain does it sit. The rest follows from use.
#10. What This Does Not Change
Several aspects of the current architecture are already correct and should not be modified.
External mirrors stay as databases. Google Workspace data, Apple Health exports, GitHub repository metadata, browser history, Apple Notes — all originate outside HAAK. The external service is authoritative. Local databases are caches refreshed by sync scripts. Applying filesystem authority here would create Terraform's problem: files that declare what external state should look like, requiring a state-binding mechanism to reconcile declaration with reality.
Research corpora stay as databases. Public archives (5.2G, 2.4M documents), eLife (482M), papers (449M) — these are ingested external datasets with millions of rows. Per-record filesystem representation is neither practical nor useful. The ingest script is their build step; the external dataset is their authority.
The entity graph becomes a baked artifact. entities.db (307M) currently occupies an ambiguous position — partly maintained by sync daemons, partly by direct writes. Under filesystem authority, it becomes fully baked. The index.md files, YAML frontmatter, and belonging declarations in the filesystem are the source. The build step (bake_entities.py) walks the filesystem, extracts entities and belongings, and produces entities.db. This is the same role architecture 37 already assigns to Veda's sync daemon, but made deterministic and reproducible.
The ontology, foundations, and constitution are unaffected. They are already filesystem-native. They are the reason filesystem authority is correct — the theoretical ground that this document applies to the data layer.
The operational gateway stays as a database. gateway.db is a running daemon's state store for message routing. It mirrors external messaging services and maintains delivery state. This is operational infrastructure, not authored content.
#11. The Ontological Argument
The taxonomy in section 2 is not arbitrary. It follows from the ontology's treatment of materialization and authority.
A HAAK-originated datum — a book record, a letter, a todo item — is a material entity that belongs to a situation with quality "materialization." The letter materializes the act of writing it. The book record materializes the act of reading and annotating. The todo materializes the intention to act. In each case, the file IS the materialization. The materialization relation is direct: the file is the material evidence of the situation it records.
When we store this datum in a database instead, we introduce a layer of indirection. The database row is a representation of a representation — a compiled encoding of the file that materializes the situation. The row belongs to the database with quality "encoding"; the database belongs to the build process with quality "output." The provenance chain lengthens: situation → file (materialization) → database row (encoding). But only the first link carries semantic authority. The file is evidence of what happened. The database row is a derived index of that evidence.
For external mirrors, the ontological structure is different. A Gmail message does not materialize a HAAK situation — it materializes a communication situation that occurred in Google's domain. HAAK's copy of it belongs to the original with quality "cache" or "replica." The local database is the most natural representation of this cached state because the cache must track sync cursors, external identifiers, and reconciliation state that have no natural filesystem representation.
The taxonomy thus reduces to a single ontological question: does the file directly materialize a HAAK situation, or does it cache state from an external domain? If the former, the file is authoritative. If the latter, the database is the appropriate local representation.
strategy · 32 · filesystem authority · 2026-03-21 · zach + claude