Lab Database Architecture

The lab is a graph with two primary entities — people and projects — and everything else connects through one or both. This document describes the data model, its ontological grounding, and the…

#The ontological claim

The relational situational ontology (01-objects, 02-relations) provides the architecture. The lab database is an applied instance of the belongs-to model: entities participate in situations through belongings, each qualified by a quality that carries the semantics. No special-purpose relation types. No flat tables of attributes. The quality on the belonging does the work.

Entities are the first-class objects: people and projects. Neither contains the other. A person does not belong to a project the way a chapter belongs to a book. A person participates in a project — that participation is a situation with its own qualities (role, dates, status). The person and the project are peers.

Situations are the rich junctions. projectpeople is not a bare foreign-key pair — it is a materialized situation with independent qualities. Guido left the lab (person status: alumni) but remains active on the IBL project (projectpeople status: active, role: collaborator). These are two independent facts about different belongings. The person's relationship to the lab is one belonging; the person's relationship to the project is another. Their qualities vary independently.

Materials are the things that belong to entities or situations: documents, papers, datasets, repos, bibliography entries. A document belongs to a person (as author) and to a funding source (as deliverable). A paper belongs to a project (as output) and to authors (as creation). The quality on each belonging distinguishes the mode of dependence.

Qualities are the properties that specify the nature of each belonging: status, role, dates, amounts. They are not columns on the entity table — they are properties of the belonging itself. A person's status in the lab (active/alumni/visiting/candidate) is a quality of the person-to-lab belonging. A person's role on a project (PI/postdoc/student/collaborator) is a quality of the person-to-project belonging.

#Schema

#Entity tables

people — lab members past, present, and candidate. The status field records the person's relationship to the lab as an organizational unit: active, alumni, visiting, candidate. This is distinct from any project-level status.

Current schema carries: id, name, role, entityid, orcid, email, phone, nationality, startdate, enddate, terminationdate, status, institution, currentposition, source, s2id, notes. 104 rows populated from three sources: people/*.md profiles, entities.db, and the master roster. ORCID filled for 55 via public API lookup. Phone enriched for 73 from WhatsApp and contacts.db.

projects — research projects. Each project is a node in the graph, linked to people, funding, papers, repos, and other projects. Fields: id, name, slug, status (active/completed/paused/proposed), path (filesystem location in projects/lab/), description, gdrivefolderid, startdate, enddate, piid, parentid (for sub-projects), repourl, datapath. 32 rows, sparsely populated — most discovered from GDrive folders linked to grants.

identities — organizational identities the lab operates under (Mainen Lab, Latent States, HAAK, TTT). 5 rows. Projects belong to identities via project_identities (46 links).

#Situation tables

These are the core of the model. Each is a materialized situation with its own qualities.

projectpeople — who works on what project. Has its own status (active/completed/contributor), independent of people.status. Carries role (PI/postdoc/student/technician/collaborator), startdate, end_date. Currently 0 rows — the key table to populate in Phase 2.

project_funding — which funding sources support which projects. 6 rows. Bridges the funding table (61 grants and fellowships) to projects.

funding — unified grants and fellowships. The type field (grant/fellowship/contract/scholarship) and scope field (lab/personal) distinguish scope. personid links personal fellowships to their holder. parentid enables grant hierarchy (sub-grants under umbrella awards). 61 rows covering ERC, FCT, Simons, La Caixa, and others. Already carries gdrivefolderid for document discovery.

supervision — person-to-person with role (primary/co-supervisor/thesis_committee). 0 rows — to be populated from milestone and thesis records.

milestones — career events: thesiscommittee, qualifyingexam, defense, contract_renewal. 0 rows — to be populated from HR records and GDrive documents.

documentlinks — polymorphic junction linking documents to any entity type. entitytype (person/funding/project) + entity_id + role (subject/author/recipient/applicant/source). 432 rows across 874 documents.

#Tables to add

projectpapers — publications as outputs of projects. Link to Semantic Scholar IDs (don't duplicate S2 metadata — fetch on demand). Fields: projectid, s2paperid, role (primary/secondary/review), year.

projectrelations — project-to-project links. Relation types: parent/child (already via parentid), spin-off, depends-on, successor. Enables the project graph: face-decoding spawned face-flipping-decoding; IBL depends on infrastructure work at CCU.

projectrepos — links to repos.db entries (197 repos already indexed across 5 GitHub orgs). Fields: projectid, repo_id (FK to repos.db), role (primary/data/analysis/infrastructure).

projectdata — datasets and storage locations. Fields: projectid, name, storagetype (local/exoscale/gdrive), pathorurl, sizebytes, description.

projectreferences — key input papers for each project (bibliography, not outputs). Fields: projectid, s2paperid, role (foundational/methodological/comparison), notes.

#Discovery pipeline

Projects should not be manually enumerated. They are discovered from existing data sources and merged:

GDrive scan. data/gdrivework.db contains the lab's shared drive structure. The funding table already carries gdrivefolder_id for 61 grants. Project-shaped folders (those containing subfolders for data, manuscripts, figures) can be identified by structure.

Repos scan. data/repos.db holds 197 repositories across 5 GitHub organizations (zmainen, mainenlab, latentstates, davcrom, int-brain-lab). Each repo maps to zero or one projects. Repo names, descriptions, and activity dates provide project metadata.

Filesystem match. projects/lab/mainen-lab/ contains the active project hierarchy. Directory names and index.md files carry descriptions, status, and participant lists.

Merge and deduplicate. A project may appear in all three sources under different names. The merge step creates canonical project entries and records the source of each field. Conflicts are flagged for human resolution, not auto-resolved.

Enrich from index.md. Each project directory's index.md carries metadata (participants, status, description) that can be parsed and loaded into the projects table.

#Consumers

The same entity graph serves multiple renderers. This is the payoff of the data model: build once, render many ways.

Table viewer (infra/daemons/table-viewer/, port 18796). Already built. Serves any SQLite table with filtering, sorting, search. The lab management interface for Zach and lab managers. Shows people by status, projects by activity, funding by deadline.

Public website (web/zmainen/). Team page generated from people (status=active). Project pages generated from projects. Publication lists generated from projectpapers + S2 API. Alumni page from people (status=alumni) with currentposition and institution.

Grant reports. The Simons progress report personnel table — currently hand-maintained in markdown — should be generatable from lab.db by querying: all people where projectpeople.projectid IN (projects funded by Simons) AND project_people.status = active, with roles and dates. Same for FCT, ERC, La Caixa reports.

Internal lab portal. Potential future: richer view for lab members showing project membership, milestones, shared resources, upcoming deadlines.

HAAK engagements. A project in lab.db IS the same entity as a project directory in the filesystem. The path field on the projects table points to the filesystem location. Engagements are project-scoped work sessions — the bridge between the entity graph and the operational system.

#Current state

Built in the March 2026 session:

Table	Rows	Source
people	104	profiles, entities.db, master roster
funding	61	GDrive grants folders, manual
documents	874	GDrive scan
document_links	432	automated linking
projects	32	GDrive folders
project_identities	46	manual assignment
identities	5	manual
project_funding	6	manual
project_people	0	—
supervision	0	—
milestones	0	—

Infrastructure: table viewer at infra/daemons/table-viewer/ (server.py + index.html), port 18796. ORCID bulk lookup (55 filled from public API). Phone enrichment from WhatsApp + contacts (73 filled).

#Build plan

#Phase 1 — People + basic viewer (done)

People table populated. Table viewer operational. ORCID and phone enrichment complete. Document linking automated. This phase established the infrastructure.

#Phase 2 — Projects + project_people (next)

Flesh out the projects table via the discovery pipeline. Populate project_people with independent status — this is the most important junction table because it captures the many-to-many relationship between people and projects with temporal and role qualities. Source: index.md files in project directories, GDrive folder membership, git commit history.

Deliverable: every active project has a row with PI, participants, status, and path. Every person-project link has role and dates.

#Phase 3 — Papers + project graph

Integrate Semantic Scholar API for publication data. Create projectpapers linking publications to projects. Build projectrelations for the project dependency graph (spin-offs, dependencies, successors). Create project_references for key input papers.

Deliverable: for any project, query its outputs (papers), its inputs (references), and its relationships to other projects.

#Phase 4 — Web publishing pipeline

Generate website pages from lab.db. Team page from people (active). Project pages from projects + projectpeople + projectpapers. Publication list from project_papers + S2 metadata. Alumni page with current positions.

Deliverable: web/zmainen/ generated from lab.db, not hand-maintained.

#Phase 5 — Grant report generation

Template-driven report generation from lab.db. Personnel tables, publication lists, project summaries — all queryable and formatted per funder requirements (Simons, FCT, ERC, La Caixa each have different formats).

Deliverable: python generate_report.py --funder simons --period 2025-2026 produces a personnel table and publication list.

#Phase 6 — Internal lab portal

Richer web interface for lab members. Project dashboards, milestone tracking, shared resource inventory, upcoming deadlines. Builds on the table viewer infrastructure.

Deliverable: authenticated web app at a lab-internal URL.

#Design decisions

People and projects are peers. The temptation is to make projects contain people, or people contain projects. Neither is correct. A person outlives any project; a project outlives any person's tenure on it. The junction table (project_people) is the situation where they meet, and it carries its own temporal bounds.

Independent status fields. A person's lab status and their project status are independent qualities of different belongings. This is not redundancy — it is the ontology working correctly. Guido is alumni (left the lab) but active on IBL (still collaborating). These are two facts about two belongings.

Don't duplicate Semantic Scholar. Papers are referenced by S2 ID. Metadata (title, authors, year, venue, citations) is fetched from S2 on demand, not stored locally. Local storage: only the project-to-paper link and any local annotations (role, notes).

Discovery over enumeration. Projects are discovered from existing data sources (GDrive, repos, filesystem), not manually listed. This ensures completeness and surfaces projects that might otherwise be invisible. Manual curation happens after discovery, not instead of it.

SQLite, not the entity system. The entity system (entities.db, strategy 20) is the universal graph. lab.db is a domain-specific materialized view — optimized for lab management queries, not for ontological generality. The two are bridged by entity_id on the people table: every person in lab.db maps to a canonical entity. If the entity system matures to the point where lab.db queries can be expressed as entity/belonging traversals, lab.db becomes a view. Until then, it's a pragmatic store.

haak · strategy 29 · lab database architecture · 2026-03-20 · zach + claude

Strategy 29 — Lab Database Architecture — 2026 — Zachary F. Mainen / HAAK