ADR-0029: Citation content access and entitlement

ADR-0029 (**Accepted — Closed** (2026-07-01). `ai-engineer` APPROVE-WITH-CHANGES, 2026-07-01): Citation content access and entitlement.

Status: Accepted — Closed (2026-07-01). ai-engineer APPROVE-WITH-CHANGES
(folded in) + ceo-blair APPROVE-WITH-CONDITIONS. The deployed decision (#245/#252) stands,
and the closing condition — the code-layer metadata allowlist + de-identification regression
test (#256) — merged in #261 (a follow-up type-guard hardening landed in #262). The
format_citations payload now emits only the allowlist by construction.
Date: 2026-07-01
Deciders: Seth Shoultes (Lead Architect), Blair Williams (CEO)
Author: Seth Shoultes

Context

The advisor cites its sources. In the chat payload,
src/memberintel/api/retrieval/citations.py::format_citations forwards each cited source’s
metadata plus the brain_entries row id, but never the body/content. That is
deliberate: chat responses stay small and the LLM prompt context already carries the
retrieved content it needs.

Important nuance (surfaced in review): format_citations currently forwards the
entire stored metadata dict verbatim ("metadata": result.metadata, where
search.py sets metadata=entry.metadata_ in full) — not a curated field set. So the chat
payload ships every key written at ingest. That is the exact path by which the author
field leaked an internal analysis name (PR #246, scrubbed at the data layer). De-identifying
by scrubbing each ingest path is fragile; the code layer should enforce it. See the Decision
fast-follow.

The member UI now needs the full text of a cited source on demand — the citation
full-text modal (#245 / #252) opens a playbook’s body when a member clicks the ↗ pop-out.
The body is not in the chat payload, so it must be fetched separately.

Two forces are in tension: (1) keep chat payloads lean — don’t bundle every cited body into
every response when most are never expanded; (2) the global brain is a shared, all-tenant
corpus (tenant_id='global') while per-customer brains are private (tenant_id=<user>), so
reading content by id must not become a cross-tenant read primitive (the ADR-0014 boundary).

Decision

Citations carry metadata only in the chat payload; the full body is fetched lazily by id
through a dedicated read endpoint.

GET /api/v1/brain/sources/{source_id} (src/memberintel/api/brain/router.py):

  • Deps: get_current_user + get_session. No enforce_not_suspended — it is a
    read, matching the other brain GETs (soul / bible / memories); the suspended-gate is for
    mutations only.
  • Entitlement guard: load the BrainEntry by id; return 404 unless
    row.tenant_id == "global" OR row.tenant_id == str(user.id). 404 (not 403) for
    everything else — including another tenant’s real row — so the endpoint never confirms the
    existence of a row the caller cannot read (no id-enumeration oracle across tenants).
  • Response SourceContentResponse {id, collection, title, category, last_reviewed, body}.
    body is the stored content with the leading title line split off. Every metadata read
    uses an isinstance(val, str) guard, so a null / non-string value degrades to "" / null,
    never to the literal "None" and never to a Pydantic 500.
  • Read-only: no write path, no mutation, no tool-call surface.
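
The entitlement guard described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the router code: BrainEntry here is a stand-in dataclass, the dict-backed store replaces the database session, and load_entitled_entry, SourceNotFound, and http_status_for are hypothetical names introduced for the example.

```python
from dataclasses import dataclass
from typing import Mapping
from uuid import UUID, uuid4

GLOBAL_TENANT = "global"  # tenant_id of the shared corpus, per this ADR


class SourceNotFound(Exception):
    """Hypothetical error; a route handler would map it to HTTP 404."""


@dataclass(frozen=True)
class BrainEntry:
    """Stand-in for a brain_entries row (only the fields the guard needs)."""
    id: UUID
    tenant_id: str
    content: str


def load_entitled_entry(
    store: Mapping[UUID, BrainEntry], source_id: UUID, user_id: UUID
) -> BrainEntry:
    """Return the row only if it is global or owned by the caller.

    A missing row and a row owned by another tenant raise the same error,
    so the caller cannot tell the two cases apart.
    """
    row = store.get(source_id)
    if row is None or row.tenant_id not in (GLOBAL_TENANT, str(user_id)):
        raise SourceNotFound(str(source_id))
    return row


def http_status_for(
    store: Mapping[UUID, BrainEntry], source_id: UUID, user_id: UUID
) -> int:
    """Illustrates the uniform mapping: 200 when entitled, 404 otherwise."""
    try:
        load_entitled_entry(store, source_id, user_id)
    except SourceNotFound:
        return 404
    return 200
```

Because both failure arms go through the same exception, there is no code path that could return a 403 for another tenant’s row, which is the property the bullet above relies on.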

source_id is the brain_entries row UUID already emitted by format_citations
("id": str(result.id)) — an opaque key, so no internal file path is exposed to the
client.

Fast-follow (required, tracked separately): format_citations must emit an explicit
metadata allowlist (title, url, summary, category, last_reviewed) rather than the raw
entry.metadata_ dict — mirroring the field whitelist the /sources response already uses.
This moves de-identification from ingest-time hygiene (remember to scrub every key) to a
code-enforced boundary (unknown keys like author can never reach the client), and makes
this ADR’s “metadata only” claim true by construction. One-line change + a test asserting an
unexpected key (e.g. author) is dropped. Tracked as #256; its merge is the trigger to
move this ADR to Closed.
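
A minimal sketch of the allowlist and its regression test follows. The field names come from this ADR; the helper names (allowlisted_metadata, format_citation), the function signature, and the sample metadata values are assumptions for illustration and are not taken from the merged change.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any

# Field set named in this ADR; anything else is dropped.
CITATION_METADATA_ALLOWLIST = ("title", "url", "summary", "category", "last_reviewed")


def allowlisted_metadata(metadata: Any) -> dict[str, Any]:
    """Copy only allowlisted keys; a non-mapping value yields an empty dict."""
    if not isinstance(metadata, Mapping):
        return {}
    return {key: metadata[key] for key in CITATION_METADATA_ALLOWLIST if key in metadata}


def format_citation(result_id: Any, metadata: Any) -> dict[str, Any]:
    """Hypothetical single-citation formatter: opaque id plus curated metadata."""
    return {"id": str(result_id), "metadata": allowlisted_metadata(metadata)}


def test_planted_non_allowlisted_field_is_dropped() -> None:
    # Sample values are made up; "author" and "path" stand in for unexpected keys.
    stored = {
        "title": "Churn playbook",
        "category": "retention",
        "author": "internal-analysis-name",
        "path": "playbooks/example.md",
    }
    payload = format_citation("row-1", stored)
    assert set(payload["metadata"]) <= set(CITATION_METADATA_ALLOWLIST)
    assert "author" not in payload["metadata"]
    assert "path" not in payload["metadata"]
    assert payload["metadata"]["title"] == "Churn playbook"
```

Building the output from the allowlist, rather than deleting known-bad keys from the stored dict, is what makes the boundary hold for keys nobody has thought of yet.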

Consequences

Positive:

  • Chat payloads stay metadata-only regardless of how many sources are cited — no
    per-response body bloat. The body round-trip happens only on an explicit expand and is
    cached client-side per session.
  • Reuses the existing tenant model as the security boundary; the endpoint can only ever
    return content the advisor could already retrieve and cite (the global corpus, or the
    caller’s own brain).
  • Additive and read-only — no schema change, no migration, no new write surface, no new
    auth surface (reuses the standard member-auth deps).

Negative / costs:

  • A new authenticated data-egress path on the global brain: any authenticated member can
    read any global body by id. Accepted — the global brain is the shared cited corpus by
    design (ADR-0026); there is nothing private about a playbook a member was just shown.
  • One extra authenticated request per citation-expand (bounded, session-cached).

Mitigations:

  • 404-not-403 on absent / non-entitled ids (no existence oracle over private tenant content).
  • isinstance(val, str) metadata guards (fail-safe rendering; a documented recurring coercion
    bug class).
  • Provenance: every metadata field format_citations forwards must be de-identified —
    the author field leaked an internal analysis name via this exact path and was scrubbed
    (PR #246). The body is already the de-identified “MemberIntel-connected sites” corpus.
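
The isinstance guard and the title-line split can be illustrated as below. This is a sketch only: the helper names are hypothetical, a plain dict stands in for SourceContentResponse, and which fields fall back to "" versus null is an assumption (here title and category fall back to "", last_reviewed to None). It also assumes the first line of the stored content is the title.

```python
from __future__ import annotations

from collections.abc import Mapping
from typing import Any, Optional


def str_or(value: Any, default: Optional[str]) -> Optional[str]:
    """Return value only if it is a str; otherwise the default.

    The naive alternative, str(value), turns None into the literal "None".
    """
    return value if isinstance(value, str) else default


def split_leading_title(content: str) -> tuple[str, str]:
    """Split off the first line (assumed to be the title) from the rest."""
    title_line, _, body = content.partition("\n")
    return title_line.strip(), body.lstrip("\n")


def build_source_fields(content: Any, metadata: Any) -> dict[str, Optional[str]]:
    """Assemble response fields without ever coercing a non-string value."""
    meta = metadata if isinstance(metadata, Mapping) else {}
    _, body = split_leading_title(str_or(content, ""))
    return {
        "title": str_or(meta.get("title"), ""),              # assumed fallback: ""
        "category": str_or(meta.get("category"), ""),        # assumed fallback: ""
        "last_reviewed": str_or(meta.get("last_reviewed"), None),  # assumed: null
        "body": body,
    }
```

With metadata such as {"title": None, "category": 42}, every field still comes back as a string or None, so a response model declaring str and optional-str fields would have nothing to reject.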

Alternatives considered

  • Bundle full bodies into the chat payload. Rejected — heavier payloads on every
    response, and most cited sources are never expanded; wasted bandwidth + serialization for
    a rare interaction.
  • Serve by file path (playbook metadata already carries path). Rejected — leaks an
    internal repo path to the client and is not stable across collections; the row UUID is
    opaque and collection-agnostic.
  • 403 for non-entitled ids. Rejected — a 403 confirms the row exists, giving an
    id-enumeration oracle over private tenant content. 404 is uniform for absent + non-entitled.
  • Widen search_brain / reuse a query endpoint. Rejected — retrieval is query-driven;
    fetching one known citation by id is a distinct, simpler read with its own entitlement
    shape.

Open questions

  • If non-playbook global collections (hive_docs, benchmarks) later get the ↗ modal,
    confirm their bodies are equally safe to surface verbatim (hive_docs already carry a
    public url; benchmarks are aggregate / de-identified). No change needed for V1 —
    playbooks only.
  • Rate-limiting: the endpoint is authenticated and cheap, but if abused as a bulk
    global-corpus scraper, add a per-user rate limit. Not needed at V1 scale.
  • summary is listed as a forwarded citation field but the /sources response omits it
    (the modal shows the full body, so summary is redundant there). Intentional — recorded so
    it isn’t “restored” later.

Review log

  • 2026-07-01 — ai-engineer review: APPROVE-WITH-CHANGES. Entitlement boundary is
    correct (opaque-UUID lookup, tenant-as-boundary, uniform 404, no id-enumeration oracle;
    test suite proves each arm). Read-only, zero token cost, no cost gate, no material choice
    for Blair. Required change (fast-follow): the ADR’s “metadata only [curated set]”
    claim didn’t match the code — format_citations forwards the full entry.metadata_ dict
    verbatim (the author leak path). Folded in above as a required allowlist fast-follow +
    corrected the Context and the file path. Also noted the summary-in-modal SPEC
    clarification. This is a recurring de-id bug class → the fix belongs at the code layer.
  • 2026-07-01 — ceo-blair: APPROVE-WITH-CONDITIONS. Per-tenant isolation holds
    (global-or-own, 404-not-403); zero cost/vendor movement — nothing needing an architecture
    veto beyond ratification. The one non-negotiable: provenance is a hard line (never imply
    non-customer data trains the AI). Scrubbing the author instance (#246) fixed the symptom,
    not the class — as long as the payload forwards a stored metadata blob, we’re one bad
    ingest from re-leaking. Condition (blocks “closed,” not deployment): the metadata
    allowlist fast-follow must land WITH a regression test asserting a planted non-allowlisted
    field (e.g. author/internal name) never reaches the payload. Trigger to close: the
    allowlist PR number. Field set / handler shape / caching / test placement deferred to Seth.

Relates to ADR-0014 (cross-tenant boundary),
ADR-0026 (global-brain ingest),
ADR-0028 (profile-aware retrieval).

For: Seth Shoultes, AI Engineer, Blair Williams, Santiago Perez Asis, Product Lead