ADR-0026: Global-brain inbound ingest — a shared staging + review spine
Status: Accepted (2026-06-23) — ai-engineer APPROVE-WITH-CHANGES + ceo-blair APPROVE-WITH-CONDITIONS; condition (a) (in-row provenance for the direct-write tier) verified in merged PR #174. Conditions (b)/(c)/(d) gate the future staged adapters (research, cross-pollination), not Phase A. See the Review log.
Date: 2026-06-22
Deciders: Seth Shoultes (Lead Architect), Blair Williams (CEO)
Author: Seth Shoultes
Context
The global (“collective”) brain is real and shipping: brain_entries rows with
tenant_id="global", seeded from Hive Mind by a pull job
(ADR-0006, scripts/seed_brain.py).
As of 2026-06-22, staging holds 488 hive_docs rows (MemberPress/WordPress KB
substrate) and zero curated content — the substrate exists, the moat does not.
Three independent efforts now need to write curated content into that global
brain, and each has been designing its own path:
- Authored playbooks (Omar / Brain Content Lead) — net-new operator
playbooks. Phase A is in review (PR #174, branch
feat/global-brain-playbook-ingest; not yet on main): a markdown inlet
(scripts/ingest_playbooks.py) writing collection="playbook",
source_type="playbook", PR-as-gate.
- Primary research (the Interviewer service) — de-identified interview
findings, pushed service-to-service. Pitched in
interviewer@pitch/memberintel-global-brain-feed as a research_candidates
staging table + content-lead review + OIDC ingest endpoint.
- Cross-pollination (V2+) — candidate entries abstracted from per-customer
brains, gated by k-anonymity and the three-role boundary
(ADR-0014).
The seed is a pull; nothing today accepts an inbound write to the global
brain. If each effort builds its own staging table, review surface, and embed
path, we get three divergent pipelines, three audit stories, and three places
to get provenance or the sensitive-vertical exclusion wrong. The v1-spec (§8.1)
already says the global brain includes “content authored by Caseproof staff” but
leaves the mechanics unspecified — this ADR specifies them, once.
Decision
Adopt one inbound ingest spine for the global brain. Sources are adapters;
the staging table, review/approval step, embed-on-approve, and provenance are
shared.
1. Provenance is first-class. Stamp the exact JSONB key
metadata.source_type on every global brain_entries row. Canonical
values: hive_docs | playbook | primary_research | cross_pollinated.
Retrieval and audit branch on this key, never on free-text guessing.
This is a new, dedicated key — distinct from the existing free-form
metadata.source that seeded rows already carry (the Hive Mind seed passes
through Hive Mind’s own metadata, where source holds values like "hive" /
"wordpress_kb"; see scripts/seed_brain.py and tests/unit/test_seed_brain.py).
We do not overload source. Backfill: a one-time UPDATE stamps
metadata.source_type = 'hive_docs' on existing tenant_id='global' rows so the
field is total, not sparse, before retrieval starts reading it. New rows set it
at write time. Keeping it in JSONB (vs. a first-class column) is the V1 choice;
see Open Question 2 for the promote-to-column trigger.
2. One staging table for non-self-authored content:
global_brain_candidates (supersedes the interview pitch’s interview-only
research_candidates). Columns at minimum: id, source_type,
payload/content, provenance (JSONB — study/run ids, source linkage),
anonymization_status, review_status, review_reason, reviewed_by,
created_at. Columns are nullable-by-source with a required set per
source_type — e.g. anonymization_status is required for
primary_research and cross_pollinated, N/A for playbook (no
anonymization step). This keeps one table from silently re-growing three
divergent shapes. Cross-pollination’s source linkage lives in a separate
audit table the global brain never queries, per ADR-0014 — the candidate table
references it, it does not inline it.
3. One review → embed-on-approve path. A content-lead approval promotes a
candidate: chunk → Voyage voyage-3-lite 512d → write brain_entries
(tenant_id="global", source_type=…). Embed strictly on approve, never on
stage — rejected candidates must cost zero embed dollars and zero
re-identification surface. On reject, hard-delete the candidate payload and
source linkage (ADR-0014’s re-ID-risk requirement) but retain a
content-free rejection stub (id, source_type, review_status='rejected',
review_reason, reviewed_by, timestamps) so “was this ever considered and
rejected?” stays answerable in a post-mortem without retaining the sensitive
text. The review surface is the embryonic Brain Console; it starts as a
minimal approve/reject list and grows authoring, retrieval-preview, and the
cross-poll queue over time. Review SLO target: 95% of candidates decided within
7 days (consistent with arch-cross-pollination).
4. One write-path guard. The sensitive-vertical exclusion runs at the
promote-to-brain_entries step, regardless of source — so no adapter can
bypass it.
5. Two authorization tiers for “who may stage.”
- Staff-authored content (playbooks) is trusted at the source. Phase A
keeps PR-as-gate: markdown marked status: approved writes directly to
brain_entries via ingest_playbooks.py. It may migrate behind the
staging table later for one uniform audit trail, but is not required to.
Non-negotiable for the direct-write tier: every row must carry its own
in-row provenance so the global brain alone answers “what entered and who
approved it” — source_type, content_hash, author, and the
ingest_commit (the merged commit that includes the approved file, which
resolves to the PR and its reviewer). Without this, the one source that
bypasses staging is also the one with no DB-queryable approver trail.
- Machine-/customer-derived content (research, cross-pollination)
must go through global_brain_candidates + human review — it is never
written directly.
6. The customer brain editor is out of scope. The per-customer four-document
editor (ADR-0015, Brain
Vault #124) stays as-is; its guardrail is “refuse tenant_id='global'”. Global
ingest gets its own surface. We do not overload the customer editor.
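The approve and reject paths in Decisions 3 and 4 can be read as a small state machine. The following is a minimal Python sketch under stated assumptions, not the repo's implementation: `Candidate`, `promote`, `reject`, and `SensitiveVerticalError` are hypothetical names, the `embed` callable stands in for the Voyage call, the `is_sensitive_vertical` predicate is a placeholder for the definition that does not exist yet, and chunking is omitted.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Optional


@dataclass
class Candidate:
    """Hypothetical in-memory stand-in for a global_brain_candidates row."""
    id: int
    source_type: str
    payload: Optional[str]        # sensitive text; cleared on reject
    provenance: Optional[dict]    # source linkage; cleared on reject
    review_status: str = "pending"
    review_reason: Optional[str] = None
    reviewed_by: Optional[str] = None
    reviewed_at: Optional[datetime] = None


class SensitiveVerticalError(Exception):
    """Raised when the write-path guard blocks a promotion."""


def promote(
    cand: Candidate,
    reviewer: str,
    embed: Callable[[str], list],
    is_sensitive_vertical: Callable[[Candidate], bool],
) -> dict:
    """Approve a candidate and build the brain_entries row for it."""
    # Decision 4: the guard runs at promote time for every source.
    # Running it before embed() is an assumption of this sketch.
    if is_sensitive_vertical(cand):
        raise SensitiveVerticalError(f"candidate {cand.id} blocked by guard")
    # Decision 3: this is the only place embed() is called.
    vector = embed(cand.payload)
    cand.review_status = "approved"
    cand.reviewed_by = reviewer
    cand.reviewed_at = datetime.now(timezone.utc)
    return {
        "tenant_id": "global",
        "content": cand.payload,
        "embedding": vector,
        "metadata": {"source_type": cand.source_type},
    }


def reject(cand: Candidate, reviewer: str, reason: str) -> Candidate:
    """Drop payload and linkage; keep only the content-free stub fields."""
    cand.payload = None
    cand.provenance = None
    cand.review_status = "rejected"
    cand.review_reason = reason
    cand.reviewed_by = reviewer
    cand.reviewed_at = datetime.now(timezone.utc)
    return cand
```

In a real table the reject step would be an UPDATE that nulls the sensitive columns plus a delete in the linkage audit table; the in-memory version only shows which fields survive.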
Consequences
Positive:
- The Interviewer pitch and the playbook inlet build toward the same spine;
the pitch’s endpoint + staging becomes the shared infrastructure, not a
one-off. Cross-pollination (V2) plugs in rather than re-inventing.
- Provenance + a single write-path guard make “what’s in the global brain and
where did it come from” answerable for audit and counsel.
- Retrieval can weight or filter by source_type (e.g. prefer playbook over
hive_docs substrate) from one consistent field.
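As an illustration of weighting by source_type, here is a small Python sketch. The `rerank` function, the hit shape (`score` plus `metadata`), and every number in `SOURCE_WEIGHTS` are assumptions made for the example; the ADR only says playbook content may be preferred over hive_docs substrate.

```python
from typing import Dict, List

# Hypothetical multipliers. The values are placeholders, not tuned weights.
SOURCE_WEIGHTS: Dict[str, float] = {
    "playbook": 1.2,
    "primary_research": 1.1,
    "cross_pollinated": 1.0,
    "hive_docs": 0.9,
}


def rerank(hits: List[dict], weights: Dict[str, float] = SOURCE_WEIGHTS) -> List[dict]:
    """Sort hits by similarity score scaled by the source_type weight."""
    def weighted(hit: dict) -> float:
        # Direct indexing fails loudly if a row lacks source_type, which the
        # one-time backfill in Decision 1 is meant to rule out.
        source_type = hit["metadata"]["source_type"]
        return hit["score"] * weights[source_type]

    return sorted(hits, key=weighted, reverse=True)


hits = [
    {"id": "a", "score": 0.80, "metadata": {"source_type": "hive_docs"}},
    {"id": "b", "score": 0.70, "metadata": {"source_type": "playbook"}},
]
ranked_ids = [hit["id"] for hit in rerank(hits)]
```

With these placeholder weights the playbook hit outranks the hive_docs hit despite its lower raw score, which is the behaviour a single consistent field makes possible.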
Negative / costs:
- The interview pitch must rename research_candidates → global_brain_candidates
and treat its table as shared. Minor, but it changes the pitch’s schema.
- Phase A playbooks (direct write) and staged sources (research/cross-poll) have
two write paths until/unless playbooks migrate behind staging — a deliberate,
documented asymmetry, not drift.
- source_type in JSONB (not a column) means retrieval filters use a JSONB
predicate; acceptable at V1 volumes, revisit if it becomes a hot path.
Neutral / follow-ups:
- Consent + counsel gate the research source only (interview-participant
consent must cover “anonymized input may inform a customer-facing KB”). The
spine is buildable; the research adapter is the part that waits on counsel.
The spine must not let the research adapter write a candidate before
consent_version is stamped on the row. Playbooks have no such gate
(staff-authored, no customer PII).
- The Hive Mind seed remaining a local cron (not a managed job) is a separate
operational gap, tracked independently of this ADR.
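The consent gate and the per-source required columns from Decision 2 could be enforced by one check at staging time. The sketch below is hypothetical: `REQUIRED_BY_SOURCE`, `StagingRefused`, and `validate_for_staging` are illustrative names, the exact required sets are assumptions, and treating consent_version as a plain row key is a guess at where it is stamped.

```python
from typing import Dict, FrozenSet

# Required keys per source_type. anonymization_status follows Decision 2 and
# consent_version follows the research follow-up; the full sets are assumed.
REQUIRED_BY_SOURCE: Dict[str, FrozenSet[str]] = {
    "playbook": frozenset({"payload"}),
    "primary_research": frozenset(
        {"payload", "anonymization_status", "consent_version"}
    ),
    "cross_pollinated": frozenset({"payload", "anonymization_status"}),
}


class StagingRefused(ValueError):
    """Raised when a candidate may not be written to the staging table."""


def validate_for_staging(row: dict) -> None:
    """Refuse rows whose source_type is unknown or whose required keys are empty."""
    source_type = row.get("source_type")
    if source_type not in REQUIRED_BY_SOURCE:
        raise StagingRefused(f"unknown source_type: {source_type!r}")
    missing = sorted(
        key for key in REQUIRED_BY_SOURCE[source_type]
        if row.get(key) in (None, "")
    )
    if missing:
        raise StagingRefused(
            f"{source_type} candidate missing: {', '.join(missing)}"
        )
```

Calling this before the INSERT keeps the refusal in the spine rather than in each adapter, so a research adapter cannot skip it.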
Material choice for Blair:
- The two-tier authorization asymmetry (staff-authored writes directly;
machine-/customer-derived must stage) is a security-posture call about who may
write to the shared corpus every customer reads — the same class of decision
ADR-0014 elevated to Blair + counsel. Approve it with the in-row provenance
requirement (Decision 5) attached as a condition. Trigger to revisit: if
playbooks ever accept non-staff contributors (community/partner content), this
tier must collapse into staging.
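For concreteness, the in-row provenance that Decision 5 requires of a directly written playbook row might be assembled as follows. This is a sketch, not the code in scripts/ingest_playbooks.py: `playbook_metadata` is a hypothetical helper, SHA-256 over UTF-8 bytes is an assumed choice of hash, and placing all four keys under metadata is also an assumption.

```python
import hashlib


def playbook_metadata(content: str, author: str, ingest_commit: str) -> dict:
    """Build the provenance keys for a directly written playbook row."""
    if not author or not ingest_commit:
        # Without these the row cannot answer "who approved it" on its own.
        raise ValueError("author and ingest_commit are required for direct writes")
    return {
        "source_type": "playbook",
        # Assumed hash: the ADR names content_hash but not the algorithm.
        "content_hash": hashlib.sha256(content.encode("utf-8")).hexdigest(),
        "author": author,
        "ingest_commit": ingest_commit,
    }
```

Refusing to build the metadata when author or ingest_commit is empty is one way to make the condition fail at write time instead of surfacing later in an audit.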
Alternatives considered
- Three independent pipelines. Rejected — divergent audit/provenance and
three chances to mishandle anonymization or the sensitive-vertical exclusion.
- Extend the customer brain editor for global. Rejected — wrong data model
(per-customer 4-doc vs. flat shared corpus) and it fights its own
tenant_id='global' refusal guard.
- Everything through staging, including playbooks. Viable and cleaner for a
single audit trail, but adds a review hop to staff-authored content whose
author is the reviewer; deferred as an option, not required for Phase A.
Open questions
- Sensitive-vertical exclusion definition. Decision 4 enforces it at the
single write-path chokepoint, but the list/predicate of “sensitive
vertical” doesn’t exist in the repo yet. It’s only as good as that
definition. Owner: Product Lead + counsel (ADR-0014 ties medical/advocacy
exclusion to counsel). Blocking for the research/cross-poll adapters, not for
Phase A playbooks.
- source_type JSONB → column trigger. JSONB is correct for V1 (matches
what Phase A ships; retrieval cost unchanged). When V1.5 dual-search unions
tenant + global, a source_type filter becomes a JSONB predicate inside the
global leg. Revisit promoting it to a first-class column at ~50k global
rows (today: 488), not on vibes.
- Embed-on-approve discipline. Confirm no adapter embeds at stage time —
only on promote. (Stated in Decision 3; called out here as the thing to
verify in each adapter’s tests.)
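One way to check the embed-on-approve discipline per adapter is a shared test helper built around a counting test double. The sketch below assumes each adapter exposes a stage function that receives the embedder as an argument; `CountingEmbedder`, `assert_stage_does_not_embed`, and `stage_ok` are hypothetical names, not existing test utilities.

```python
from typing import Callable, List


class CountingEmbedder:
    """Test double that records embed calls instead of calling an API."""

    def __init__(self, dims: int = 512) -> None:
        self.dims = dims
        self.calls: List[str] = []

    def __call__(self, text: str) -> List[float]:
        self.calls.append(text)
        return [0.0] * self.dims


def assert_stage_does_not_embed(
    stage: Callable[[dict, CountingEmbedder], object],
    candidate: dict,
) -> None:
    """Run an adapter's stage step and fail if it touched the embedder."""
    embedder = CountingEmbedder()
    stage(candidate, embedder)
    assert embedder.calls == [], f"stage embedded {len(embedder.calls)} chunk(s)"


def stage_ok(candidate: dict, embed: CountingEmbedder) -> dict:
    """Hypothetical well-behaved adapter: stores the row, never embeds."""
    return {"review_status": "pending", **candidate}


assert_stage_does_not_embed(
    stage_ok, {"source_type": "primary_research", "payload": "finding"}
)
```

An adapter that embeds during staging trips the assertion, so the same helper can sit in the research and cross-pollination test suites.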
Review log
- 2026-06-22 — ai-engineer review: APPROVE-WITH-CHANGES. Required changes
folded in: corrected the Phase A “shipped” claim (it’s PR #174, unmerged);
in-row provenance mandated for the direct-write tier (Decision 5);
reject-handling reconciled with ADR-0014 (hard-delete payload + linkage,
retain content-free rejection stub); per-source required-column set
(Decision 2); sensitive-vertical owner, JSONB→column trigger, and
embed-on-approve raised as open questions; asymmetry flagged for Blair.
Cost back-of-envelope confirmed negligible (~$0.002 one-time to embed the
launch playbooks; retrieval unchanged). Pending: ceo-blair approval.
- 2026-06-22 — Copilot review (PR #175). Addressed: Status line now
consistent with this log; provenance key pinned to metadata.source_type
(distinct from the seed’s free-form metadata.source) with a one-time
backfill for existing rows; Consequences headings aligned to the ADR
template (**Positive:** / **Negative / costs:**).
- 2026-06-22 — ceo-blair approval gate: APPROVE-WITH-CONDITIONS. The
two-tier authorization asymmetry is approved as the security posture,
conditioned on:
  - (a) In-row provenance for the direct-write tier is non-negotiable and
  ships with the playbook inlet (Decision 5). Status: satisfied —
  implemented in PR #174 (metadata.source_type / content_hash / author /
  ingest_commit); to be verified at merge.
  - (b) The sensitive-vertical exclusion list is a launch gate for the
  staged adapters, owned by Product Lead + counsel (Allen). Phase A
  playbooks proceed without it; research/cross-poll may not promote a row
  until the predicate exists and counsel signs it.
  - (c) Consent gates the research adapter at the row level — the spine
  refuses to stage a research candidate without consent_version stamped.
  - (d) The asymmetry-collapse trigger is binding: if playbooks ever accept
  non-staff contributors, the direct-write tier collapses into staging.

Cost confirmed negligible; SPEC §4 honesty clean (staff-authored /
consented-de-identified content only). Status → Accepted when (a) is
verified in the merged PR #174.