decision
ADR-0029: Citation content access and entitlement
Status: Accepted — Closed (2026-07-01). ai-engineer APPROVE-WITH-CHANGES
(folded in) + ceo-blair APPROVE-WITH-CONDITIONS. The deployed decision (#245/#252) stands,
and the closing condition — the code-layer metadata allowlist + de-identification regression
test (#256) — merged in #261 (a follow-up type-guard hardening landed in #262). The
format_citations payload now emits only the allowlist by construction.
Date: 2026-07-01
Deciders: Seth Shoultes (Lead Architect), Blair Williams (CEO)
Author: Seth Shoultes
Context
The advisor cites its sources. In the chat payload,
src/memberintel/api/retrieval/citations.py::format_citations forwards each cited source’s
metadata plus the brain_entries row id, but never the body/content. That is
deliberate: chat responses stay small and the LLM prompt context already carries the
retrieved content it needs.
Important nuance (surfaced in review): format_citations currently forwards the
entire stored metadata dict verbatim ("metadata": result.metadata, where
search.py sets metadata=entry.metadata_ in full) — not a curated field set. So the chat
payload ships every key written at ingest. That is the exact path by which the author
field leaked an internal analysis name (PR #246, scrubbed at the data layer). De-identifying
by scrubbing each ingest path is fragile; the code layer should enforce it. See the Decision
fast-follow.
The member UI now needs the full text of a cited source on demand — the citation
full-text modal (#245 / #252) opens a playbook’s body when a member clicks the ↗ pop-out.
The body is not in the chat payload, so it must be fetched separately.
Two forces are in tension: (1) keep chat payloads lean — don’t bundle every cited body into
every response when most are never expanded; (2) the global brain is a shared, all-tenant
corpus (tenant_id='global') while per-customer brains are private (tenant_id=<user>), so
reading content by id must not become a cross-tenant read primitive (the ADR-0014 boundary).
Decision
Citations carry metadata only in the chat payload; the full body is fetched lazily by
id through a dedicated read endpoint.
GET /api/v1/brain/sources/{source_id} (src/memberintel/api/brain/router.py):
- Deps: get_current_user + get_session. No enforce_not_suspended: it is a read, matching
  the other brain GETs (soul / bible / memories); the suspended-gate is for mutations only.
- Entitlement guard: load the BrainEntry by id; return 404 unless
  row.tenant_id == "global" OR row.tenant_id == str(user.id). 404 (not 403) for
  everything else, including another tenant’s real row, so the endpoint never confirms the
  existence of a row the caller cannot read (no id-enumeration oracle across tenants).
- Response: SourceContentResponse {id, collection, title, category, last_reviewed, body}.
  body is the stored content with the leading title line split off. Every metadata read
  uses an isinstance(val, str) guard, so a null / non-string value degrades to "" / null,
  never the literal "None" and never a Pydantic 500.
- Read-only: no write path, no mutation, no tool-call surface.
source_id is the brain_entries row UUID already emitted by format_citations
("id": str(result.id)) — an opaque key, so no internal file path is exposed to the
client.
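The entitlement guard above can be sketched in isolation. This is a minimal illustration, not the handler in router.py: Row, SourceNotFound, can_read and load_source are hypothetical names, an in-memory dict stands in for the database session, and SourceNotFound stands in for whatever 404 the web framework raises.

```python
from dataclasses import dataclass
from typing import Optional
from uuid import UUID, uuid4

GLOBAL_TENANT = "global"


class SourceNotFound(Exception):
    """Stand-in for an HTTP 404 (hypothetical; a real handler raises the framework's 404)."""


@dataclass(frozen=True)
class Row:
    # Hypothetical stand-in for the BrainEntry row; only the fields the guard needs.
    id: UUID
    tenant_id: str
    content: str


def can_read(row: Optional[Row], user_id: UUID) -> bool:
    """True only for a global row or a row owned by the caller."""
    if row is None:
        return False
    return row.tenant_id == GLOBAL_TENANT or row.tenant_id == str(user_id)


def load_source(rows: dict, source_id: UUID, user_id: UUID) -> Row:
    """Return the row, or raise the same error for absent and non-entitled ids."""
    row = rows.get(source_id)
    if not can_read(row, user_id):
        # Identical outcome whether the row is missing or belongs to another
        # tenant, so the response cannot be used to probe for foreign ids.
        raise SourceNotFound(str(source_id))
    return row
```

Because the missing-row and wrong-tenant cases share one branch, there is no code path that could return a 403 for a row that exists but is not readable.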
Fast-follow (required, tracked separately): format_citations must emit an explicit
metadata allowlist (title, url, summary, category, last_reviewed) rather than the raw
entry.metadata_ dict — mirroring the field whitelist the /sources response already uses.
This moves de-identification from ingest-time hygiene (remember to scrub every key) to a
code-enforced boundary (unknown keys like author can never reach the client), and makes
this ADR’s “metadata only” claim true by construction. One-line change + a test asserting an
unexpected key (e.g. author) is dropped. Tracked as #256; its merge is the trigger to
move this ADR to Closed.
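The allowlist and its regression test could look roughly like the sketch below. It is an illustration under stated assumptions, not the change merged for #256: the field set comes from the paragraph above, while CITATION_METADATA_ALLOWLIST, allowlisted_metadata and the test function are hypothetical names.

```python
from collections.abc import Mapping
from typing import Any, Optional

# Field set named in the fast-follow; the constant name is hypothetical.
CITATION_METADATA_ALLOWLIST = ("title", "url", "summary", "category", "last_reviewed")


def allowlisted_metadata(metadata: Optional[Mapping]) -> dict:
    """Copy only allowlisted keys; anything else written at ingest is dropped."""
    if not isinstance(metadata, Mapping):
        # Missing or malformed metadata yields an empty dict rather than an error.
        return {}
    return {key: metadata[key] for key in CITATION_METADATA_ALLOWLIST if key in metadata}


def test_unexpected_key_is_dropped() -> None:
    """Regression-style check: a planted non-allowlisted field never survives."""
    stored: dict = {
        "title": "Reducing churn",
        "category": "retention",
        "author": "planted-internal-name",  # planted value, not real data
    }
    emitted = allowlisted_metadata(stored)
    assert "author" not in emitted
    assert emitted == {"title": "Reducing churn", "category": "retention"}
```

Building the output from the allowlist, rather than deleting known-bad keys from the stored dict, is what makes the boundary hold for keys nobody has thought of yet.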
Consequences
Positive:
- Chat payloads stay metadata-only regardless of how many sources are cited: no
  per-response body bloat. The body round-trip happens only on an explicit expand and is
  cached client-side per session.
- Reuses the existing tenant model as the security boundary; the endpoint can only ever
  return content the advisor could already retrieve and cite (the global corpus, or the
  caller’s own brain).
- Additive and read-only: no schema change, no migration, no new write surface, no new
  auth surface (reuses the standard member-auth deps).
Negative / costs:
- A new authenticated data-egress path on the global brain: any authenticated member can
  read any global body by id. Accepted: the global brain is the shared cited corpus by
  design (ADR-0026); there is nothing private about a playbook a member was just shown.
- One extra authenticated request per citation-expand (bounded, session-cached).
Mitigations:
- 404-not-403 on absent / non-entitled ids (no existence oracle over private tenant content).
- isinstance(val, str) metadata guards (fail-safe rendering; a documented recurring coercion
  bug class).
- Provenance: every metadata field format_citations forwards must be de-identified; the
  author field leaked an internal analysis name via this exact path and was scrubbed
  (PR #246). The body is already the de-identified “MemberIntel-connected sites” corpus.
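The isinstance(val, str) guard and the title/body split can be illustrated as follows. This is a sketch only: str_or_default and split_title are hypothetical helpers, and it assumes the title occupies exactly the first line of the stored content, which the ADR does not spell out.

```python
from typing import Any, Optional


def str_or_default(value: Any, default: Optional[str] = "") -> Optional[str]:
    """Return value only if it is already a str; never coerce with str()."""
    # str(None) would produce the literal "None", which is the bug class
    # the guard exists to prevent.
    return value if isinstance(value, str) else default


def split_title(content: str) -> tuple:
    """Split stored content into (leading title line, remaining body)."""
    first, _, rest = content.partition("\n")
    # Drop blank lines between the title line and the body.
    return first.strip(), rest.lstrip("\n")


metadata = {"title": None, "category": 7, "last_reviewed": "2026-06-30"}
title = str_or_default(metadata.get("title"))              # "" rather than "None"
category = str_or_default(metadata.get("category"), None)  # None for a non-string
reviewed = str_or_default(metadata.get("last_reviewed"))   # passes through unchanged
```

Passing default=None suits fields that are nullable in the response; the empty-string default suits fields that must always be a string.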
Alternatives considered
- Bundle full bodies into the chat payload. Rejected: heavier payloads on every
  response, and most cited sources are never expanded; wasted bandwidth + serialization for
  a rare interaction.
- Serve by file path (playbook metadata already carries path). Rejected: leaks an
  internal repo path to the client and is not stable across collections; the row UUID is
  opaque and collection-agnostic.
- 403 for non-entitled ids. Rejected: a 403 confirms the row exists, giving an
  id-enumeration oracle over private tenant content. 404 is uniform for absent + non-entitled.
- Widen search_brain / reuse a query endpoint. Rejected: retrieval is query-driven;
  fetching one known citation by id is a distinct, simpler read with its own entitlement
  shape.
Open questions
- If non-playbook global collections (hive_docs, benchmarks) later get the ↗ modal,
  confirm their bodies are equally safe to surface verbatim (hive_docs already carry a
  public url; benchmarks are aggregate / de-identified). No change needed for V1:
  playbooks only.
- Rate-limiting: the endpoint is authenticated and cheap, but if abused as a bulk
  global-corpus scraper, add a per-user rate limit. Not needed at V1 scale.
- summary is listed as a forwarded citation field but the /sources response omits it
  (the modal shows the full body, so summary is redundant there). Intentional; recorded so
  it isn’t “restored” later.
Review log
- 2026-07-01, ai-engineer review: APPROVE-WITH-CHANGES. Entitlement boundary is
  correct (opaque-UUID lookup, tenant-as-boundary, uniform 404, no id-enumeration oracle;
  test suite proves each arm). Read-only, zero token cost, no cost gate, no material choice
  for Blair. Required change (fast-follow): the ADR’s “metadata only [curated set]”
  claim didn’t match the code: format_citations forwards the full entry.metadata_ dict
  verbatim (the author leak path). Folded in above as a required allowlist fast-follow +
  corrected the Context and the file path. Also noted the summary-in-modal SPEC
  clarification. This is a recurring de-id bug class → the fix belongs at the code layer.
- 2026-07-01, ceo-blair: APPROVE-WITH-CONDITIONS. Per-tenant isolation holds
  (global-or-own, 404-not-403); zero cost/vendor movement, so nothing needing an architecture
  veto beyond ratification. The one non-negotiable: provenance is a hard line (never imply
  non-customer data trains the AI). Scrubbing the author instance (#246) fixed the symptom,
  not the class: as long as the payload forwards a stored metadata blob, we’re one bad
  ingest from re-leaking. Condition (blocks “closed,” not deployment): the metadata
  allowlist fast-follow must land WITH a regression test asserting a planted non-allowlisted
  field (e.g. author / internal name) never reaches the payload. Trigger to close: the
  allowlist PR number. Field set / handler shape / caching / test placement deferred to Seth.
Relates to ADR-0014 (cross-tenant boundary),
ADR-0026 (global-brain ingest),
ADR-0028 (profile-aware retrieval).