ADR-0030: Entry-level performance signal for the global brain (v1a)

Status: Accepted — conditions open (2026-07-01). ai-engineer APPROVE-WITH-CHANGES
(a real helpful_rate denominator bug found + fixed in-branch with a regression test) +
ceo-blair APPROVE-WITH-CONDITIONS (rock shift approved directionally; conditions below).
Renumbered from ADR-0029 → 0030 (ADR-0029 is citation content access, merged in #257).
Date: 2026-06-30 (accepted 2026-07-01)
Deciders: Omar (content), Seth (lead architect / ai-engineer review), Blair (ceo-blair approval)

Review log

  • 2026-07-01 — ai-engineer APPROVE-WITH-CHANGES. Shape sound (read-side rollup over
    data we already store; v1a/v1b split correct; per-tenant isolation is defense-in-depth —
    the refresh SQL and the admin endpoint both re-scope to tenant_id='global'). Zero token
    cost. Real bug caught + fixed in-branch: helpful_rate divided total up-votes by
    feedback-bearing message count, so a multi-voter message could print a rate > 1.0 (and
    the panel sorts on helpful_rate_30d). Fixed to a true share-of-votes up/(up+down) with
    a regression test pinning [0,1] on a multi-voter message. Eval-coverage gate satisfied.
  • 2026-07-01 — ceo-blair APPROVE-WITH-CONDITIONS. The volume→quality rock shift is the
    right strategic call; cost is a non-issue; privacy/honesty line intact (citation/feedback
    counts on our entries, no customer-content signal). Conditions (these block “closed,”
    not ship): (1) the deferral of the numeric helpful-rate target is time-boxed to end of
    Q3 2026, with a separate lightweight sign-off once v1a has about a quarter of
    baseline, not a re-litigation of the shift; (2) the DoD’s pruning half must be a real
    weekly ritual with a named owner and a place to log prunes (document the
    content-review ritual, not just the ops runbook). Deferred implementation choices
    (HTTP-endpoint vs Cloud Run Job, table vs matview, cadence, panel UX) explicitly left
    to Seth. Tracked under issue #11.
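
The denominator bug described in the first review entry can be reproduced in a few lines. This is a minimal sketch, not the code in the branch: the function names are hypothetical, and returning None for an entry with no votes is an assumption the ADR does not state.

```python
from typing import Optional

def helpful_rate_buggy(up: int, feedback_messages: int) -> Optional[float]:
    """Pre-fix shape: up-votes divided by the count of feedback-bearing messages."""
    return up / feedback_messages if feedback_messages else None

def helpful_rate(up: int, down: int) -> Optional[float]:
    """Share of votes: up / (up + down). None when there are no votes (assumption)."""
    total = up + down
    return up / total if total else None

# One cited message with three voters: two thumbs-up, one thumbs-down.
up, down, feedback_messages = 2, 1, 1
buggy = helpful_rate_buggy(up, feedback_messages)  # 2 / 1, escapes [0, 1]
fixed = helpful_rate(up, down)                     # 2 / 3, stays inside [0, 1]
```

Because the panel sorts on helpful_rate_30d, a value above 1.0 would float a multi-voter entry to the top for the wrong reason; the share-of-votes form is bounded by construction.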

Context

The “50 entries” rock for the global brain measures volume, not value. Today
we cannot tell which entries solve operator problems in deployed brains —
which are retrieved, which are cited, and which lead to answers users find
helpful. A polished but ineffective entry counts the same as one that
quietly carries the product.

The message-level loop is already in place: message_feedback (👍/👎 with
failure category, one vote per user/message), Message.sources (every
assistant turn stores its cited brain-entry ids), and feedback/eval_capture.py
(thumbs-down triage). Nothing rolls those signals up to the entry level,
and the full ranked retrieval set is not yet logged.

Spec: docs/superpowers/specs/2026-06-30-entry-performance-v1a-design.md
(committed). Related: issue #11 (content rock signal) under epic #8;
ADR-0010 (observability); ADR-0014 (cross-pollination boundary);
ADR-0028 (profile-aware retrieval).

Decision

Ship the instrumentation in two steps, decided as separate ADRs because the
steps have different owners and ship on different timelines.

v1a (this ADR): read-side rollup, no new event stream.

A new entry_performance Postgres table is materialized nightly by a
single CTE-based SQL statement in src/memberintel/feedback/entry_performance.py.
The aggregation joins Message.sources to message_feedback to produce
per-entry citation count, last-cited timestamp, feedback-bearing citation
count, thumbs-up/down counts, and helpful rate — each as both all-time and
30d-window columns. Scope is strictly brain_entries.tenant_id = 'global' AND collection <> 'memory'. Every eligible entry gets a row (zero-citation
entries write rows of zeros), so the panel surfaces dead-weight candidates
even before v1b’s retrieval log lands.
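
The shape of the rollup can be sketched in plain Python over in-memory data. The real aggregation is a single SQL statement, so this is only an illustration of the logic; the record shapes, the dictionary keys, and the choice to key the 30-day window on the message timestamp are all assumptions.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

def empty_row():
    # Zero-citation entries keep this row of zeros, so dead weight stays visible.
    return {"citations": 0, "citations_30d": 0, "last_cited_at": None,
            "feedback_citations": 0, "up": 0, "down": 0, "up_30d": 0, "down_30d": 0}

def share(up, down):
    total = up + down
    return up / total if total else None  # None when no votes (assumption)

def rollup(entry_ids, messages, feedback, now):
    """Return {entry_id: stats} for every eligible entry, cited or not."""
    cutoff = now - timedelta(days=30)
    votes = defaultdict(lambda: {"up": 0, "down": 0})
    for message_id, vote in feedback:
        votes[message_id][vote] += 1

    rows = {entry_id: empty_row() for entry_id in entry_ids}
    for msg in messages:
        msg_votes = votes.get(msg["id"])       # None if the message has no feedback
        recent = msg["created_at"] >= cutoff
        for entry_id in set(msg["sources"]):   # count each entry once per message
            row = rows.get(entry_id)
            if row is None:                    # not an eligible global entry
                continue
            row["citations"] += 1
            row["citations_30d"] += int(recent)
            if row["last_cited_at"] is None or msg["created_at"] > row["last_cited_at"]:
                row["last_cited_at"] = msg["created_at"]
            if msg_votes:
                row["feedback_citations"] += 1
                row["up"] += msg_votes["up"]
                row["down"] += msg_votes["down"]
                if recent:
                    row["up_30d"] += msg_votes["up"]
                    row["down_30d"] += msg_votes["down"]
    for row in rows.values():
        row["helpful_rate"] = share(row["up"], row["down"])
        row["helpful_rate_30d"] = share(row["up_30d"], row["down_30d"])
    return rows

# Illustrative data only: two messages, four votes, three eligible entries.
now = datetime(2026, 7, 1, tzinfo=timezone.utc)
messages = [
    {"id": "m1", "created_at": datetime(2026, 6, 20, tzinfo=timezone.utc), "sources": ["e1", "e2"]},
    {"id": "m2", "created_at": datetime(2026, 3, 1, tzinfo=timezone.utc), "sources": ["e1"]},
]
feedback = [("m1", "up"), ("m1", "up"), ("m1", "down"), ("m2", "down")]
rows = rollup(["e1", "e2", "e3"], messages, feedback, now)
```

Here e3 is never cited but still gets a row of zeros, which is what lets the panel list it as a dead-weight candidate.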

Trigger uses the existing codebase IaC convention: a
google_cloud_scheduler_job fires daily at 03 UTC and HTTP-POSTs to a
new internal endpoint POST /internal/entry-performance/refresh on the
existing memberintel_api Cloud Run service. The endpoint is OIDC-token
authenticated, matching wrap_up_sweep_backfill verbatim. The CLI at
scripts/refresh_entry_performance.py is kept for manual/ops invocation
but is not the production trigger. We chose this over introducing a
google_cloud_run_v2_job resource because that pattern does not yet exist
in this codebase and pulling it in for one task would add IAM scaffolding
for no benefit — the refresh is a fast single-statement aggregation that
fits an HTTP endpoint cleanly.

The admin panel at admin/src/routes/brain/performance/+page.svelte is a
sortable table served by GET /api/v1/admin/brain/global/performance (auth:
require_cf_access_admin). No filter chips in v1a; sort columns only.
Default thresholds for the underperformer and dead-weight filters are deferred
until 2–3 weekly review passes reveal the natural cliffs in the data; picking
a threshold now would either hide entries worth reviewing or flood the
panel with false positives.

v1b (future ADR, owned by AI Engineer): retrieval-event log.

Log the full ranked set returned by search_brain() per turn (entry id,
score, rank, cited y/n) → BigQuery sink per ADR-0010 / ADR-0021. This
closes the retrieved-but-not-cited gap and enables true “never retrieved”
detection. Out of scope for this ADR; will be ADR-0031.
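
To make the v1b record concrete, one possible row shape is sketched below. The design belongs to the future ADR, so everything here is hypothetical: the RetrievalEvent name, the turn_id field, and the JSON-lines serialization are illustrative, and only the entry id, score, rank, and cited fields come from the text above.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class RetrievalEvent:
    turn_id: str    # hypothetical key tying rows to one assistant turn
    entry_id: str
    score: float
    rank: int       # 1-based position in the ranked set
    cited: bool

def events_for_turn(turn_id, ranked, cited_ids):
    """ranked: list of (entry_id, score) in retrieval order."""
    return [
        RetrievalEvent(turn_id, entry_id, score, rank, entry_id in cited_ids)
        for rank, (entry_id, score) in enumerate(ranked, start=1)
    ]

# Illustrative turn: three entries retrieved, one cited.
events = events_for_turn("turn-1", [("e1", 0.91), ("e2", 0.74), ("e3", 0.40)], {"e1"})
retrieved_not_cited = [e.entry_id for e in events if not e.cited]
lines = [json.dumps(asdict(e)) for e in events]  # one JSON object per event
```

With rows like these, "retrieved but not cited" is a filter on cited, and "never retrieved" is any eligible entry with no rows at all.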

Definition-of-done for issue #11. The raw “50 entries” bar is replaced
by two operational signals: (a) helpful-rate of cited answers trends up
quarter-over-quarter on the entries we own, and (b) dead entries get
pruned via the weekly review ritual. Volume stops being the goal;
performance does.

Privacy (ADR-0014): all aggregates are over tenant_id='global'
entries (Tier 1: system-level data — MemberPress KB / curated playbooks).
Feedback votes contribute to entry-level counts; no per-user or
per-customer content is exposed. No cross-tenant data flow. v1a clears
the boundary on inspection; this ADR is the formal record.

Consequences

Positive:

  • Closes the feedback loop the message-level signal opened: per-answer
    feedback now feeds back into which entries we keep, rewrite, or retire.
  • Replaces a volume metric with a quality metric for the content rock
    without breaking any existing surface.
  • Cheap to ship: read-side rollup over data we already store, single SQL
    statement, no new event stream, no schema change to high-traffic tables.
  • Idempotent: re-running the refresh is safe, and recovery from a failed
    nightly run is trivial.
  • Makes content evals (the production-signal half) explicit and ownable —
    distinct from the AI Engineer’s technical/retrieval evals.
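
The idempotency claim rests on the refresh overwriting rows instead of appending them. A minimal sketch of that property follows, using SQLite from the standard library as a stand-in for Postgres (both accept this ON CONFLICT ... DO UPDATE form). The column names other than updated_at are assumptions, and the real statement computes the values in a CTE, not from a Python list.

```python
import sqlite3

# Hypothetical subset of the entry_performance columns.
SCHEMA = """
CREATE TABLE entry_performance (
    entry_id       TEXT PRIMARY KEY,
    citation_count INTEGER NOT NULL,
    thumbs_up      INTEGER NOT NULL,
    thumbs_down    INTEGER NOT NULL,
    updated_at     TEXT NOT NULL
)
"""

UPSERT = """
INSERT INTO entry_performance (entry_id, citation_count, thumbs_up, thumbs_down, updated_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(entry_id) DO UPDATE SET
    citation_count = excluded.citation_count,
    thumbs_up      = excluded.thumbs_up,
    thumbs_down    = excluded.thumbs_down,
    updated_at     = excluded.updated_at
"""

def refresh(conn, rows, run_date):
    # One transaction: either every row is refreshed or none is.
    with conn:
        conn.executemany(UPSERT, [(*row, run_date) for row in rows])

conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
rows = [("e1", 2, 2, 2), ("e2", 1, 2, 1), ("e3", 0, 0, 0)]
refresh(conn, rows, "2026-07-01")
refresh(conn, rows, "2026-07-02")  # re-run: rows are overwritten, not duplicated
snapshot = conn.execute(
    "SELECT entry_id, citation_count, updated_at FROM entry_performance ORDER BY entry_id"
).fetchall()
```

A failed nightly run therefore needs no cleanup: the next successful run brings every row to the current state.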

Negative / costs:

  • v1a only sees cited entries. “Retrieved but not cited” and “never
    retrieved” cannot be distinguished until v1b ships.
  • The nightly cadence means the panel can be up to 24h stale. Acceptable
    for a weekly review ritual; not acceptable if we ever want sub-day
    feedback loops on rewrites (that would push us to a shorter refresh
    interval or an on-demand refresh button).
  • One more endpoint + scheduler job to keep observable. Cloud Logging
    filters and the runbook at docs/runbooks/11-entry-performance-refresh.md
    carry the operational load.
  • refreshed_at returned by the admin endpoint is computed as the newest
    updated_at across all eligible rows (not page-scoped). This adds one
    small extra query per request; acceptable for v1a and keeps staleness
    reporting accurate even with pagination.
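
The refreshed_at point in the last bullet above can be illustrated with a small sketch. The response shape, the function name, and the sort key are assumptions; in the service the maximum would presumably come from the extra query the bullet mentions, not from a Python list.

```python
from datetime import datetime, timezone

def performance_page(rows, limit, offset):
    """Return one page of rows plus a staleness stamp taken over ALL rows."""
    ordered = sorted(rows, key=lambda r: r["entry_id"])
    items = ordered[offset:offset + limit]
    # Not page-scoped: the newest updated_at across every eligible row.
    refreshed_at = max((r["updated_at"] for r in rows), default=None)
    return {"items": items, "refreshed_at": refreshed_at}

# Illustrative rows: the newest timestamp sits on a row outside the first page.
rows = [
    {"entry_id": "e1", "updated_at": datetime(2026, 7, 1, tzinfo=timezone.utc)},
    {"entry_id": "e2", "updated_at": datetime(2026, 7, 1, tzinfo=timezone.utc)},
    {"entry_id": "e3", "updated_at": datetime(2026, 7, 2, tzinfo=timezone.utc)},
]
first_page = performance_page(rows, limit=2, offset=0)
```

A page-scoped maximum would report 2026-07-01 for the first page here, understating how fresh the table is.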

Mitigations:

  • v1b retrieval log is scoped and owned (AI Engineer) — the gap is named,
    not silent.
  • The scheduler job can be force-run with gcloud scheduler jobs run, which fires
    the internal refresh endpoint for emergency rebuilds (documented in the runbook).
  • The CLI script is preserved so a developer with DATABASE_URL access
    can refresh from a local shell during incident response.

Alternatives considered

  • Build v1a + v1b under one ADR. Rejected: v1b’s instrumentation
    touches retrieval — code owned by the AI Engineer, with timing tied to
    their roadmap. Coupling them blocks v1a behind v1b’s schedule for no
    product benefit. Two ADRs lets each ship on its own merit.

  • Cloud Run Job instead of HTTP endpoint. Rejected: the existing
    scheduled-work pattern in this codebase is cloud_scheduler → cloud_run_v2_service HTTP endpoint (wrap_up_sweep_tick,
    wrap_up_sweep_backfill, sites_sweep_tick). Introducing a
    cloud_run_v2_job resource for one task means new IAM, new SA, new
    IaC pattern — without a payoff. The refresh is one fast SQL statement;
    an HTTP endpoint fits.

  • SQL view (or materialized view) instead of a refreshed table.
    Rejected: the unnest-and-join is non-trivial; doing it on every admin
    panel load is wasteful, and materialized views in Postgres still need
    scheduled refresh. A plain table with a nightly UPSERT job is the
    simplest shape that gives us indexed reads and a clear place to add
    derived columns later (windowed metrics, trend deltas).

  • Fixed threshold filters in v1a (e.g., underperformer: ≥10
    feedback-bearing citations in 30d AND helpful_rate_30d < 0.6).
    Rejected: at current corpus scale (~21 live playbooks against a 50
    target, modest feedback volume) almost no entry would clear the
    denominator. The filter would be empty and quietly misleading. Sortable
    table only in v1a; promote to defaults after 2–3 weekly passes.

  • Replace “50 entries” rock today with a pure helpful-rate target.
    Rejected for now (the target is not set by this ADR): until v1a has run for a quarter, we don’t
    have the baseline to set a target. This ADR proposes the rock shift,
    but committing to a numeric helpful-rate goal needs data first.
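
The rejected fixed-threshold rule, and the sort-only alternative chosen for v1a, can be sketched as follows. The rows are made-up illustrative values, not measurements, and the feedback_citations_30d key is a hypothetical column name; only helpful_rate_30d and the two threshold numbers come from the text above.

```python
def is_underperformer(row, min_feedback_30d=10, max_rate_30d=0.6):
    """The example rule from the rejected alternative."""
    rate = row["helpful_rate_30d"]
    return (row["feedback_citations_30d"] >= min_feedback_30d
            and rate is not None
            and rate < max_rate_30d)

def sort_worst_first(rows):
    # Entries with no votes (rate None) sort last, not as if they scored 0.0.
    return sorted(rows, key=lambda r: (r["helpful_rate_30d"] is None,
                                       r["helpful_rate_30d"] or 0.0))

# Illustrative low-volume rows (invented numbers).
rows = [
    {"entry_id": "e1", "feedback_citations_30d": 4, "helpful_rate_30d": 0.50},
    {"entry_id": "e2", "feedback_citations_30d": 2, "helpful_rate_30d": 1.00},
    {"entry_id": "e3", "feedback_citations_30d": 0, "helpful_rate_30d": None},
]
flagged = [r["entry_id"] for r in rows if is_underperformer(r)]
ranked = [r["entry_id"] for r in sort_worst_first(rows)]
```

With feedback counts this low the filter returns nothing even though e1 has a weak rate, while the sorted view still puts e1 at the top for a reviewer to judge.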

For: Seth Shoultes, AI Engineer, Blair Williams, Santiago Perez Asis, Product Lead