
ADR-0029: Citation content access and entitlement

ADR-0029 (Accepted, Closed 2026-07-01; ai-engineer APPROVE-WITH-CHANGES, 2026-07-01): Citation content access and entitlement.

Status: Accepted — Closed (2026-07-01). ai-engineer APPROVE-WITH-CHANGES
(folded in) + ceo-blair APPROVE-WITH-CONDITIONS. The deployed decision (#245/#252) stands,
and the closing condition — the code-layer metadata allowlist + de-identification regression
test (#256) — merged in #261 (a follow-up type-guard hardening landed in #262). The
format_citations payload now emits only the allowlist by construction.
Date: 2026-07-01
Deciders: Seth Shoultes (Lead Architect), Blair Williams (CEO)
Author: Seth Shoultes

Context

The advisor cites its sources. In the chat payload,
src/memberintel/api/retrieval/citations.py::format_citations forwards each cited source’s
metadata plus the brain_entries row id, but never the body/content. That is
deliberate: chat responses stay small and the LLM prompt context already carries the
retrieved content it needs.

Important nuance (surfaced in review): format_citations currently forwards the
entire stored metadata dict verbatim ("metadata": result.metadata, where
search.py sets metadata=entry.metadata_ in full) — not a curated field set. So the chat
payload ships every key written at ingest. That is the exact path by which the author
field leaked an internal analysis name (PR #246, scrubbed at the data layer). De-identifying
by scrubbing each ingest path is fragile; the code layer should enforce it. See the Decision
fast-follow.

The member UI now needs the full text of a cited source on demand — the citation
full-text modal (#245 / #252) opens a playbook’s body when a member clicks the ↗ pop-out.
The body is not in the chat payload, so it must be fetched separately.

Two forces are in tension: (1) keep chat payloads lean — don’t bundle every cited body into
every response when most are never expanded; (2) the global brain is a shared, all-tenant
corpus (tenant_id='global') while per-customer brains are private (tenant_id=<user>), so
reading content by id must not become a cross-tenant read primitive (the ADR-0014 boundary).

Decision

Citations carry metadata only in the chat payload; the full body is fetched lazily by
id through a dedicated read endpoint.

GET /api/v1/brain/sources/{source_id} (src/memberintel/api/brain/router.py):

Deps: get_current_user + get_session. No enforce_not_suspended — it is a
read, matching the other brain GETs (soul / bible / memories); the suspended-gate is for
mutations only.
Entitlement guard: load the BrainEntry by id; return 404 unless
row.tenant_id == "global" OR row.tenant_id == str(user.id). 404 (not 403) for
everything else — including another tenant’s real row — so the endpoint never confirms the
existence of a row the caller cannot read (no id-enumeration oracle across tenants).
Response: SourceContentResponse {id, collection, title, category, last_reviewed, body}.
body is the stored content with the leading title line split off. Every metadata read
uses an isinstance(val, str) guard, so a null / non-string value degrades to "" / null —
never the literal "None" and never a Pydantic 500.
Read-only: no write path, no mutation, no tool-call surface.
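
The guard and response shaping described above can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the handler in router.py: the BrainEntry dataclass, SourceNotFound, read_source and _str_or are hypothetical names, the real route would map the exception to an HTTP 404, and which fields fall back to "" versus null is a guess (the ADR only says values degrade to "" / null).

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional

GLOBAL_TENANT = "global"


@dataclass
class BrainEntry:
    """Hypothetical stand-in for a brain_entries row."""
    id: str
    tenant_id: str
    collection: str
    content: str
    metadata_: Dict[str, Any] = field(default_factory=dict)


class SourceNotFound(Exception):
    """Assumed to be translated into an HTTP 404 by the route handler."""


def _str_or(meta: Dict[str, Any], key: str, default: Optional[str]) -> Optional[str]:
    # isinstance guard: None or any non-string value falls back to the default,
    # so str(None) == "None" can never leak into the response.
    val = meta.get(key)
    return val if isinstance(val, str) else default


def read_source(row: Optional[BrainEntry], user_id: Any) -> Dict[str, Any]:
    # An absent row and a non-entitled row take the same exit, so the caller
    # cannot tell "does not exist" apart from "belongs to another tenant".
    if row is None or row.tenant_id not in (GLOBAL_TENANT, str(user_id)):
        raise SourceNotFound()

    # Assumption: the first line of the stored content is the title line.
    _title_line, _, body = row.content.partition("\n")
    meta = row.metadata_
    return {
        "id": row.id,
        "collection": row.collection,
        "title": _str_or(meta, "title", ""),  # assumed "" fallback
        "category": _str_or(meta, "category", None),
        "last_reviewed": _str_or(meta, "last_reviewed", None),
        "body": body.lstrip("\n"),
    }
```

Because both failure cases raise the same exception, a uniform 404 falls out of the control flow rather than depending on the handler remembering to hide the distinction.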

source_id is the brain_entries row UUID already emitted by format_citations
("id": str(result.id)) — an opaque key, so no internal file path is exposed to the
client.

Fast-follow (required, tracked separately): format_citations must emit an explicit
metadata allowlist (title, url, summary, category, last_reviewed) rather than the raw
entry.metadata_ dict — mirroring the field whitelist the /sources response already uses.
This moves de-identification from ingest-time hygiene (remember to scrub every key) to a
code-enforced boundary (unknown keys like author can never reach the client), and makes
this ADR’s “metadata only” claim true by construction. One-line change + a test asserting an
unexpected key (e.g. author) is dropped. Tracked as #256; its merge is the trigger to
move this ADR to Closed.
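
As a rough sketch of what the allowlist and its regression test could look like (the field set is the one named above; the helper names, the payload shape beyond "id" and "metadata", and the sample values are assumptions, and the non-dict guard is a defensive choice made here rather than a description of the merged change):

```python
from typing import Any, Dict

# Field set taken from this ADR; everything else is dropped.
CITATION_METADATA_ALLOWLIST = ("title", "url", "summary", "category", "last_reviewed")


def allowlisted_metadata(raw: Any) -> Dict[str, Any]:
    # Defensive guard (an assumption): non-dict metadata yields an empty dict.
    if not isinstance(raw, dict):
        return {}
    # Build the output from the allowlist, so unknown keys cannot pass through.
    return {key: raw[key] for key in CITATION_METADATA_ALLOWLIST if key in raw}


def format_citation(result_id: Any, raw_metadata: Any) -> Dict[str, Any]:
    # Hypothetical single-citation formatter mirroring the payload keys
    # quoted in this ADR ("id" and "metadata").
    return {"id": str(result_id), "metadata": allowlisted_metadata(raw_metadata)}


def test_planted_key_never_reaches_payload() -> None:
    planted = {
        "title": "Example playbook",
        "author": "internal-analysis-name",  # planted non-allowlisted field
        "path": "playbooks/example.md",      # hypothetical internal path
    }
    payload = format_citation(42, planted)
    assert payload == {"id": "42", "metadata": {"title": "Example playbook"}}
    assert "author" not in payload["metadata"]
```

Iterating over the allowlist rather than over the stored dict is the point: a key added at ingest later is excluded by default instead of needing a matching scrub.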

Consequences

Positive:

Chat payloads stay metadata-only regardless of how many sources are cited — no
per-response body bloat. The body round-trip happens only on an explicit expand and is
cached client-side per session.
Reuses the existing tenant model as the security boundary; the endpoint can only ever
return content the advisor could already retrieve and cite (the global corpus, or the
caller’s own brain).
Additive and read-only — no schema change, no migration, no new write surface, no new
auth surface (reuses the standard member-auth deps).

Negative / costs:

A new authenticated data-egress path on the global brain: any authenticated member can
read any global body by id. Accepted — the global brain is the shared cited corpus by
design (ADR-0026); there is nothing private about a playbook a member was just shown.
One extra authenticated request per citation-expand (bounded, session-cached).

Mitigations:

404-not-403 on absent / non-entitled ids (no existence oracle over private tenant content).
isinstance(val, str) metadata guards (fail-safe rendering; a documented recurring coercion
bug class).
Provenance: every metadata field format_citations forwards must be de-identified —
the author field leaked an internal analysis name via this exact path and was scrubbed
(PR #246). The body is already the de-identified “MemberIntel-connected sites” corpus.

Alternatives considered

Bundle full bodies into the chat payload. Rejected — heavier payloads on every
response, and most cited sources are never expanded; wasted bandwidth + serialization for
a rare interaction.
Serve by file path (playbook metadata already carries path). Rejected — leaks an
internal repo path to the client and is not stable across collections; the row UUID is
opaque and collection-agnostic.
403 for non-entitled ids. Rejected — a 403 confirms the row exists, giving an
id-enumeration oracle over private tenant content. 404 is uniform for absent + non-entitled.
Widen search_brain / reuse a query endpoint. Rejected — retrieval is query-driven;
fetching one known citation by id is a distinct, simpler read with its own entitlement
shape.

Open questions

If non-playbook global collections (hive_docs, benchmarks) later get the ↗ modal,
confirm their bodies are equally safe to surface verbatim (hive_docs already carry a
public url; benchmarks are aggregate / de-identified). No change needed for V1 —
playbooks only.
Rate-limiting: the endpoint is authenticated and cheap, but if abused as a bulk
global-corpus scraper, add a per-user rate limit. Not needed at V1 scale.
summary is listed as a forwarded citation field but the /sources response omits it
(the modal shows the full body, so summary is redundant there). Intentional — recorded so
it isn’t “restored” later.

Review log

2026-07-01 — ai-engineer review: APPROVE-WITH-CHANGES. Entitlement boundary is
correct (opaque-UUID lookup, tenant-as-boundary, uniform 404, no id-enumeration oracle;
test suite proves each arm). Read-only, zero token cost, no cost gate, no material choice
for Blair. Required change (fast-follow): the ADR’s “metadata only [curated set]”
claim didn’t match the code — format_citations forwards the full entry.metadata_ dict
verbatim (the author leak path). Folded in above as a required allowlist fast-follow +
corrected the Context and the file path. Also noted the summary-in-modal SPEC
clarification. This is a recurring de-id bug class → the fix belongs at the code layer.
2026-07-01 — ceo-blair: APPROVE-WITH-CONDITIONS. Per-tenant isolation holds
(global-or-own, 404-not-403); zero cost/vendor movement — nothing needing an architecture
veto beyond ratification. The one non-negotiable: provenance is a hard line (never imply
non-customer data trains the AI). Scrubbing the author instance (#246) fixed the symptom,
not the class — as long as the payload forwards a stored metadata blob, we’re one bad
ingest from re-leaking. Condition (blocks “closed,” not deployment): the metadata
allowlist fast-follow must land WITH a regression test asserting a planted non-allowlisted
field (e.g. author/internal name) never reaches the payload. Trigger to close: the
allowlist PR number. Field set / handler shape / caching / test placement deferred to Seth.

Relates to ADR-0014 (cross-tenant boundary),
ADR-0026 (global-brain ingest),
ADR-0028 (profile-aware retrieval).

For: Seth Shoultes, AI Engineer, Blair Williams, Santiago Perez Asis, Product Lead