decision
ADR-0030: Entry-level performance signal for the global brain (v1a)
ADR-0030 (**Accepted — conditions open** (2026-07-01). `ai-engineer` APPROVE-WITH-CHANGES, 2026-06-30): Entry-level performance signal for the global brain (v1a).
Status: Accepted — conditions open (2026-07-01). ai-engineer APPROVE-WITH-CHANGES
(a real helpful_rate denominator bug found + fixed in-branch with a regression test) +
ceo-blair APPROVE-WITH-CONDITIONS (rock shift approved directional; conditions below).
Renumbered from ADR-0029 → 0030 (ADR-0029 is citation content access, merged in #257).
Date: 2026-06-30 (accepted 2026-07-01)
Deciders: Omar (content), Seth (lead architect / ai-engineer review), Blair (ceo-blair approval)
Review log
- 2026-07-01: ai-engineer APPROVE-WITH-CHANGES. Shape sound (read-side rollup over
data we already store; v1a/v1b split correct; per-tenant isolation is defense-in-depth:
the refresh SQL and the admin endpoint both re-scope to tenant_id='global'). Zero token
cost. Real bug caught + fixed in-branch: helpful_rate divided total up-votes by
feedback-bearing message count, so a multi-voter message could print a rate > 1.0 (and
the panel sorts on helpful_rate_30d). Fixed to a true share-of-votes up/(up+down) with
a regression test pinning [0,1] on a multi-voter message. Eval-coverage gate satisfied.
- 2026-07-01: ceo-blair APPROVE-WITH-CONDITIONS. The volume→quality rock shift is the
right strategic call; cost is a non-issue; privacy/honesty line intact (citation/feedback
counts on our entries, no customer-content signal). Conditions (block “closed,” not
ship): (1) the deferral of the numeric helpful-rate target is time-boxed to end of
Q3 2026, with a separate lightweight sign-off once v1a has ~a quarter of baseline, not a
re-litigation of the shift; (2) the DoD’s pruning half must be a real weekly ritual with
a named owner and a place to log prunes (document the content-review ritual, not just the
ops runbook). Deferred implementation choices (HTTP-endpoint vs Cloud Run Job, table vs
matview, cadence, panel UX) explicitly left to Seth. Tracked under issue #11.
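The denominator bug recorded in the review log is easy to reproduce in isolation. The sketch below is illustrative only: the function names and the vote counts are hypothetical and are not the code in the branch. It contrasts the pre-fix shape (up-votes over feedback-bearing message count) with the share-of-votes shape up/(up+down). Returning None when an entry has no votes is an assumption of this sketch.

```python
from typing import Optional


def helpful_rate_buggy(up: int, feedback_messages: int) -> Optional[float]:
    """Pre-fix shape: up-votes divided by feedback-bearing message count."""
    if feedback_messages == 0:
        return None
    return up / feedback_messages


def helpful_rate(up: int, down: int) -> Optional[float]:
    """Post-fix shape: share of votes, which cannot leave [0, 1]."""
    total = up + down
    if total == 0:
        return None  # assumption: "no rate yet" is reported as None, not 0
    return up / total


# One message cited the entry and three different users up-voted it.
up, down, feedback_messages = 3, 0, 1
buggy = helpful_rate_buggy(up, feedback_messages)  # 3.0, outside [0, 1]
fixed = helpful_rate(up, down)                     # 1.0
```

Because the panel sorts on helpful_rate_30d, a value above 1.0 would have pushed multi-voter entries to the top of the table for the wrong reason.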
Context
The “50 entries” rock for the global brain measures volume, not value. Today
we cannot tell which entries solve operator problems in deployed brains —
which are retrieved, which are cited, and which lead to answers users find
helpful. A polished but ineffective entry counts the same as one that
quietly carries the product.
The message-level loop is already in place: message_feedback (👍/👎 with
failure category, one vote per user/message), Message.sources (every
assistant turn stores its cited brain-entry ids), and feedback/eval_capture.py
(thumbs-down triage). Nothing rolls those signals up to the entry level,
and the full ranked retrieval set is not yet logged.
Spec: docs/superpowers/specs/2026-06-30-entry-performance-v1a-design.md
(committed). Related: issue #11 (content rock signal) under epic #8;
ADR-0010 (observability); ADR-0014 (cross-pollination boundary);
ADR-0028 (profile-aware retrieval).
Decision
Ship a two-step instrumentation, decided as separate ADRs because they
have different owners and ship on different timelines.
v1a (this ADR): read-side rollup, no new event stream.
A new entry_performance Postgres table is materialized nightly by a
single CTE-based SQL statement in src/memberintel/feedback/entry_performance.py.
The aggregation joins Message.sources to message_feedback to produce
per-entry citation count, last-cited timestamp, feedback-bearing citation
count, thumbs-up/down counts, and helpful rate — each as both all-time and
30d-window columns. Scope is strictly brain_entries.tenant_id = 'global' AND collection <> 'memory'. Every eligible entry gets a row (zero-citation
entries write rows of zeros), so the panel surfaces dead-weight candidates
even before v1b’s retrieval log lands.
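As a rough illustration of the rollup described above, here is an in-memory Python sketch of the same aggregation. It is not the CTE in entry_performance.py. The field names, the input shapes, the choice to window on the message timestamp, and the choice to credit a message's votes to every entry it cites are all assumptions made for the example.

```python
from collections import Counter
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional


@dataclass
class EntryStats:
    citations: int = 0
    citations_30d: int = 0
    last_cited_at: Optional[datetime] = None
    feedback_citations: int = 0
    feedback_citations_30d: int = 0
    up: int = 0
    down: int = 0
    up_30d: int = 0
    down_30d: int = 0

    @staticmethod
    def _rate(up: int, down: int) -> Optional[float]:
        total = up + down
        return up / total if total else None

    @property
    def helpful_rate(self) -> Optional[float]:
        return self._rate(self.up, self.down)

    @property
    def helpful_rate_30d(self) -> Optional[float]:
        return self._rate(self.up_30d, self.down_30d)


def rollup(eligible_entry_ids, messages, feedback, now):
    """Aggregate message-level citations and votes to one row per eligible entry."""
    cutoff = now - timedelta(days=30)
    votes = {}  # message_id -> Counter({"up": n, "down": m})
    for message_id, vote in feedback:
        votes.setdefault(message_id, Counter())[vote] += 1

    # Every eligible entry gets a row, so uncited entries appear as zeros.
    stats = {entry_id: EntryStats() for entry_id in eligible_entry_ids}
    for msg in messages:
        recent = msg["created_at"] >= cutoff
        msg_votes = votes.get(msg["id"], Counter())
        has_feedback = int(sum(msg_votes.values()) > 0)
        for entry_id in set(msg["sources"]):
            row = stats.get(entry_id)
            if row is None:
                continue  # cited id is outside the eligible (global, non-memory) set
            row.citations += 1
            row.feedback_citations += has_feedback
            row.up += msg_votes["up"]
            row.down += msg_votes["down"]
            if row.last_cited_at is None or msg["created_at"] > row.last_cited_at:
                row.last_cited_at = msg["created_at"]
            if recent:
                row.citations_30d += 1
                row.feedback_citations_30d += has_feedback
                row.up_30d += msg_votes["up"]
                row.down_30d += msg_votes["down"]
    return stats


# Made-up data: e1 is cited twice (once inside the window), e3 is never cited.
now = datetime(2026, 7, 1, tzinfo=timezone.utc)
messages = [
    {"id": "m1", "created_at": now - timedelta(days=2), "sources": ["e1", "e2"]},
    {"id": "m2", "created_at": now - timedelta(days=60), "sources": ["e1"]},
]
feedback = [("m1", "up"), ("m1", "up"), ("m1", "down"), ("m2", "down")]
stats = rollup(["e1", "e2", "e3"], messages, feedback, now)
```

In this sample, e3 still gets a row of zeros with no helpful rate, which is the dead-weight candidate case the panel is meant to surface.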
Trigger uses the existing codebase IaC convention: a
google_cloud_scheduler_job fires daily at 03
new internal endpoint
POST /internal/entry-performance/refresh on the existing
memberintel_api Cloud Run service. The endpoint is OIDC-token authenticated, matching
wrap_up_sweep_backfill verbatim. The CLI at scripts/refresh_entry_performance.py is kept
for manual/ops invocation but is not the production trigger. We chose this over introducing a
google_cloud_run_v2_job resource because that pattern does not yet exist in this codebase,
and pulling it in for one task would add IAM scaffolding
for no benefit: the refresh is a fast single-statement aggregation that
fits an HTTP endpoint cleanly.
The admin panel at admin/src/routes/brain/performance/+page.svelte is a
sortable table served by GET /api/v1/admin/brain/global/performance (auth:
require_cf_access_admin). No filter chips in v1a; sort columns only.
Thresholds for underperformer / dead weight defaults are deferred until
2–3 weekly review passes reveal the natural cliffs in the data — picking
a threshold now would either hide entries worth reviewing or flood the
panel with false positives.
v1b (future ADR, owned by AI Engineer): retrieval-event log.
Log the full ranked set returned by search_brain() per turn (entry id,
score, rank, cited y/n) → BigQuery sink per ADR-0010 / ADR-0021. This
closes the retrieved-but-not-cited gap and enables true “never retrieved”
detection. Out of scope for this ADR; will be ADR-0031.
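To make the v1b gap concrete, the sketch below shows one possible shape for a per-turn retrieval event carrying the four fields named above (entry id, score, rank, cited). It is a hypothetical illustration, not a proposal for ADR-0031: the class name, the turn_id field, the 1-based rank, and the JSON-lines serialization are assumptions, and nothing here reflects the actual signature of search_brain().

```python
import json
from dataclasses import asdict, dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class RetrievalEvent:
    turn_id: str   # hypothetical identifier for the assistant turn
    entry_id: str
    score: float
    rank: int      # assumption: 1-based position in the ranked set
    cited: bool


def build_events(
    turn_id: str,
    ranked: Iterable[Tuple[str, float]],
    cited_ids: Set[str],
) -> List[RetrievalEvent]:
    """Turn a best-first list of (entry_id, score) into one event per entry."""
    return [
        RetrievalEvent(turn_id, entry_id, score, rank, entry_id in cited_ids)
        for rank, (entry_id, score) in enumerate(ranked, start=1)
    ]


def to_json_lines(events: Iterable[RetrievalEvent]) -> str:
    """Serialize events one JSON object per line, a common shape for log sinks."""
    return "\n".join(json.dumps(asdict(e), sort_keys=True) for e in events)


events = build_events("t1", [("e1", 0.9), ("e2", 0.4)], cited_ids={"e1"})
retrieved_not_cited = [e.entry_id for e in events if not e.cited]
```

With events of this kind, "retrieved but not cited" becomes a simple filter, and "never retrieved" is any eligible entry with no events at all.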
Definition-of-done for issue #11. The raw “50 entries” bar is replaced
by two operational signals: (a) helpful-rate of cited answers trends up
quarter-over-quarter on the entries we own, and (b) dead entries get
pruned via the weekly review ritual. Volume stops being the goal;
performance does.
Privacy (ADR-0014): all aggregates are over tenant_id='global'
entries (Tier 1: system-level data — MemberPress KB / curated playbooks).
Feedback votes contribute to entry-level counts; no per-user or
per-customer content is exposed. No cross-tenant data flow. v1a clears
the boundary on inspection; this ADR is the formal record.
Consequences
Positive:
- Closes the feedback loop the message-level signal opened: per-answer
feedback now feeds back into which entries we keep, rewrite, or retire.
- Replaces a volume metric with a quality metric for the content rock
without breaking any existing surface.
- Cheap to ship: read-side rollup over data we already store, single SQL
statement, no new event stream, no schema change to high-traffic tables.
- Idempotent: re-running the refresh is safe, and recovery from a failed
nightly run is trivial.
- Makes content evals (the production-signal half) explicit and ownable,
distinct from the AI Engineer’s technical/retrieval evals.
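The idempotency claim rests on the refresh being an upsert keyed by entry. The sketch below shows that property using Python's built-in sqlite3 as a stand-in for Postgres so it can run anywhere. The table columns, the refresh helper, and the sample rows are hypothetical, and the real statement is a single CTE-based aggregation rather than a per-row insert.

```python
import sqlite3

# Hypothetical, reduced column set; the real table carries more columns.
SCHEMA = """
CREATE TABLE entry_performance (
    entry_id    TEXT PRIMARY KEY,
    citations   INTEGER NOT NULL,
    thumbs_up   INTEGER NOT NULL,
    thumbs_down INTEGER NOT NULL,
    updated_at  TEXT NOT NULL
)
"""

UPSERT = """
INSERT INTO entry_performance (entry_id, citations, thumbs_up, thumbs_down, updated_at)
VALUES (?, ?, ?, ?, ?)
ON CONFLICT(entry_id) DO UPDATE SET
    citations   = excluded.citations,
    thumbs_up   = excluded.thumbs_up,
    thumbs_down = excluded.thumbs_down,
    updated_at  = excluded.updated_at
"""


def refresh(conn: sqlite3.Connection, rows, run_date: str) -> None:
    """Write the full recomputed aggregate; safe to call any number of times."""
    with conn:  # one transaction: a failed run leaves the previous rows intact
        conn.executemany(UPSERT, [(*row, run_date) for row in rows])


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
rows = [("e1", 2, 2, 2), ("e2", 0, 0, 0)]  # made-up aggregates

refresh(conn, rows, "2026-07-01")
refresh(conn, rows, "2026-07-02")  # re-run: rows are overwritten, not duplicated

result = conn.execute(
    "SELECT entry_id, citations, updated_at FROM entry_performance ORDER BY entry_id"
).fetchall()
```

Running the refresh twice leaves two rows, each stamped with the later run date, so recovering from a failed nightly run amounts to running it again.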
Negative / costs:
- v1a only sees cited entries. “Retrieved but not cited” and “never
retrieved” cannot be distinguished until v1b ships.
- The nightly cadence means the panel can be up to 24h stale. Acceptable
for a weekly review ritual; not acceptable if we ever want sub-day
feedback loops on rewrites (would push us to a shorter refresh interval or an
on-demand refresh button).
- One more endpoint + scheduler job to keep observable. Cloud Logging
filters and the runbook at docs/runbooks/11-entry-performance-refresh.md
carry the operational load.
- refreshed_at returned by the admin endpoint is computed as the newest
updated_at across all eligible rows (not page-scoped). This adds one
small extra query per request; acceptable for v1a and keeps staleness
reporting accurate even with pagination.
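The difference between a page-scoped and a table-wide refreshed_at can be shown with a small in-memory sketch. The function name, the response shape, and the sample rows are hypothetical. In the real endpoint the table-wide value comes from a separate query, not a Python list.

```python
from datetime import datetime, timezone
from typing import Any, Dict, List


def performance_page(rows: List[Dict[str, Any]], offset: int, limit: int) -> Dict[str, Any]:
    """Return one page of rows plus a refreshed_at computed over ALL rows."""
    page = rows[offset:offset + limit]
    # Table-wide maximum, deliberately not max() over `page`.
    refreshed_at = max((r["updated_at"] for r in rows), default=None)
    return {"items": page, "refreshed_at": refreshed_at}


old = datetime(2026, 6, 30, tzinfo=timezone.utc)
new = datetime(2026, 7, 1, tzinfo=timezone.utc)
rows = [
    {"entry_id": "e1", "updated_at": old},
    {"entry_id": "e2", "updated_at": old},
    {"entry_id": "e3", "updated_at": new},
]

first_page = performance_page(rows, offset=0, limit=2)
page_scoped = max(r["updated_at"] for r in first_page["items"])
```

Here the first page holds only the older rows, so a page-scoped value would report the table as a day staler than it is, while the table-wide value reports the newest write.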
Mitigations:
- v1b retrieval log is scoped and owned (AI Engineer): the gap is named,
not silent.
- The internal refresh endpoint can be force-fired via
gcloud scheduler jobs run for emergency rebuilds (documented in the runbook).
- The CLI script is preserved so a developer with DATABASE_URL access
can refresh from a local shell during incident response.
Alternatives considered
- Build v1a + v1b under one ADR. Rejected: v1b’s instrumentation
touches retrieval, code owned by the AI Engineer, with timing tied to
their roadmap. Coupling them blocks v1a behind v1b’s schedule for no
product benefit. Two ADRs lets each ship on its own merit.
- Cloud Run Job instead of HTTP endpoint. Rejected: the existing
scheduled-work pattern in this codebase is cloud_scheduler → cloud_run_v2_service
HTTP endpoint (wrap_up_sweep_tick, wrap_up_sweep_backfill, sites_sweep_tick).
Introducing a cloud_run_v2_job resource for one task means new IAM, new SA, new
IaC pattern, without a payoff. The refresh is one fast SQL statement;
an HTTP endpoint fits.
- SQL view (or materialized view) instead of a refreshed table.
Rejected: the unnest-and-join is non-trivial; doing it on every admin
panel load is wasteful, and materialized views in Postgres still need
scheduled refresh. A plain table with a nightly UPSERT job is the
simplest shape that gives us indexed reads and a clear place to add
derived columns later (windowed metrics, trend deltas).
- Fixed threshold filters in v1a (e.g., underperformer: ≥10
feedback-bearing citations in 30d AND helpful_rate_30d < 0.6).
Rejected: at current corpus scale (~21 live playbooks against a 50
target, modest feedback volume) almost no entry would clear the
denominator. The filter would be empty and quietly misleading. Sortable
table only in v1a; promote to defaults after 2–3 weekly passes.
- Replace “50 entries” rock today with a pure helpful-rate target.
Rejected (not by this ADR): until v1a has run for a quarter, we don’t
have the baseline to set a target. This ADR proposes the rock shift,
but committing to a numeric helpful-rate goal needs data first.
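The fixed-threshold alternative above can be written as a small predicate, which makes the "empty and quietly misleading" failure mode easy to see. This is a sketch under stated assumptions: the feedback_citations_30d field name and the sample rows are made up for illustration and are not real corpus data. The thresholds are the example values quoted in the alternative.

```python
from typing import List, Optional, TypedDict


class PerformanceRow(TypedDict):
    entry_id: str
    feedback_citations_30d: int        # hypothetical column name
    helpful_rate_30d: Optional[float]  # None when the entry has no votes


def is_underperformer(
    row: PerformanceRow,
    min_feedback_citations: int = 10,
    max_helpful_rate: float = 0.6,
) -> bool:
    """The rejected filter: enough feedback AND a low 30d helpful rate."""
    rate = row["helpful_rate_30d"]
    return (
        row["feedback_citations_30d"] >= min_feedback_citations
        and rate is not None
        and rate < max_helpful_rate
    )


def worst_first(rows: List[PerformanceRow]) -> List[PerformanceRow]:
    """The v1a approach: a plain sort, lowest rate first, unrated entries last."""
    return sorted(
        rows,
        key=lambda r: (r["helpful_rate_30d"] is None, r["helpful_rate_30d"] or 0.0),
    )


# Made-up rows at low feedback volume, for illustration only.
rows: List[PerformanceRow] = [
    {"entry_id": "e1", "feedback_citations_30d": 4, "helpful_rate_30d": 0.25},
    {"entry_id": "e2", "feedback_citations_30d": 0, "helpful_rate_30d": None},
    {"entry_id": "e3", "feedback_citations_30d": 2, "helpful_rate_30d": 1.0},
]

flagged = [r["entry_id"] for r in rows if is_underperformer(r)]
ranked = [r["entry_id"] for r in worst_first(rows)]
```

With rows like these the filter flags nothing even though e1 has a poor rate, whereas the sorted view still puts e1 at the top for a reviewer to judge.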