MemberIntel KB

spec

Observability & Incident Response

Describes the three-destination telemetry model — Cloud Logging for debugging, BigQuery for business analytics, locked-down BigQuery for audit — plus domain-specific metrics, on-call structure, and pre-written runbooks for the five highest-stakes failure modes.

Observability and incident response

The framing: three audiences, three different needs.

Most teams lump observability into one bucket and end up with a system that serves nobody well. For MemberIntel there are three distinct audiences for telemetry, and they want different things:

Engineers debugging production. They need traces, logs, and metrics that tell them why a specific request failed for a specific user. High cardinality, short retention, fast queries.

Cindy and Blair watching the business. They need cost-per-Free-user dashboards, conversion funnels, AI quality drift alerts, and the success metrics from the SPEC (signup rate, conversion rate, WAU). Lower cardinality, longer retention, periodic queries.

Privacy counsel and the future auditor. They need an immutable record of who accessed what data, when, and why. Append-only, very long retention, queryable on demand but not frequently.

These three needs map to three different storage destinations and three different operational disciplines. Trying to serve all three from one logging stream is how you end up either over-paying for retention on debugging logs or under-investing in audit on the things that actually matter.

The stack, concretely.

For the engineering audience: Cloud Logging plus Cloud Trace plus Cloud Monitoring, all built into GCP. The default. Don’t fight this. Cloud Logging is fine for application logs at your scale, Cloud Trace gives you distributed tracing across Cloud Run services, Cloud Monitoring gives you metrics and alerts. The integrations with Cloud Run, Cloud SQL, and Cloud Tasks are first-class, which means you spend zero time on instrumentation plumbing for the standard stuff.

The wrinkle is that you’ll want a little more than the defaults. Specifically, OpenTelemetry instrumentation in the application code. The reason is that Cloud Trace’s auto-instrumentation captures HTTP requests and database queries, but it misses the things that matter most for an LLM-shaped product: which tool calls fired, how long the LLM call took, how many tokens went in and out, which model was used, which retrieval was performed. You want every request to produce a trace that shows the chat message → tool calls → LLM call → response, with token counts and model attached as span attributes. This is a couple of days of OTel setup and it pays back forever, because every “why was this response slow” or “why did this customer’s chat cost so much” question becomes a trace query rather than a forensic dig.
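
A minimal sketch of what that looks like with the OpenTelemetry Python API. The span names, attribute names (llm.tokens.*), and the select_tools / call_llm helpers are illustrative, not a settled convention:

```python
# Sketch of manual OTel instrumentation around a chat request.
# select_tools() and call_llm() are hypothetical app helpers.
from opentelemetry import trace

tracer = trace.get_tracer("memberintel.chat")

def answer_chat_message(message, tenant_id):
    with tracer.start_as_current_span("chat.handle_message") as span:
        span.set_attribute("tenant.id", tenant_id)            # IDs only, never content
        for tool in select_tools(message):
            with tracer.start_as_current_span(f"tool.{tool.name}"):
                tool.run()
        with tracer.start_as_current_span("llm.call") as llm_span:
            result = call_llm(message)                          # your LLM client wrapper
            llm_span.set_attribute("llm.model", result.model)
            llm_span.set_attribute("llm.tokens.input", result.input_tokens)
            llm_span.set_attribute("llm.tokens.output", result.output_tokens)
        return result.text    # prompt and response text deliberately stay out of the span
```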

For the business audience: BigQuery, fed by Cloud Logging’s BigQuery sink and by application-level event emission. The pattern is that operational logs go to Cloud Logging (short retention, expensive per-GB) and important events — chat messages, upgrades, downgrades, agent actions, brain updates, cross-pollination runs — get emitted as structured events to BigQuery (longer retention, cheap per-GB, queryable with SQL). Cindy’s cost-per-Free-user dashboard is a BigQuery query, not a Cloud Logging filter. The conversion funnel is a BigQuery query. The cohort analysis the SPEC requires is a BigQuery query.
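
A sketch of the event-emission side, assuming the BigQuery streaming-insert client; the dataset, table, and field names are placeholders:

```python
# Sketch: application-level structured events streamed to BigQuery.
# "memberintel_events.events" and the row fields are assumptions.
from datetime import datetime, timezone
from google.cloud import bigquery

client = bigquery.Client()

def emit_event(event_type: str, tenant_id: str, payload: dict) -> None:
    row = {
        "event_type": event_type,              # e.g. "chat_message", "tier_upgrade"
        "tenant_id": tenant_id,
        "occurred_at": datetime.now(timezone.utc).isoformat(),
        **payload,                              # counts and IDs only, never customer-data values
    }
    errors = client.insert_rows_json("memberintel_events.events", [row])
    if errors:
        raise RuntimeError(f"BigQuery insert failed: {errors}")
```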

This separation matters for cost. Cloud Logging at scale gets expensive fast — somewhere around the 10K-customer mark you’ll feel it. BigQuery at the same data volume is roughly an order of magnitude cheaper for the patterns you’ll actually run. The right mental model is “Cloud Logging is for debugging the last 7 days; BigQuery is for analyzing the last year.”

For the audit audience: a separate, locked-down BigQuery dataset with restricted IAM, organization-policy-enforced retention, and write-only access from the application. This is where audit events live: tier changes, brain edits, cross-pollination promotions, agent actions, data exports, account deletions, consent flow events, IAM grants and revocations on the GCP side. The dataset is read-only to almost everyone (Cindy for compliance dashboards, privacy counsel on request) and write-only to the application service accounts. Nobody can delete from it — that’s an organization policy, not just an IAM convention.

The audit dataset is what you hand a regulator if they ask “show us every time customer X’s data was accessed.” It’s also what protects you internally — a future incident where someone says “did anyone access this data inappropriately” is answerable in five minutes if the audit log is right.

The metrics that actually matter for MemberIntel.

Beyond the standard SLI/SLO stuff (latency, error rate, availability), there are domain-specific signals that this product needs to watch from day one because they’re either privacy-critical or economically critical:

Token spend per customer per day. Both as a sum and as a moving average. The threshold for “this customer is spending more than expected” should fire an alert that goes to Seth, not to PagerDuty — it’s an investigate-this-tomorrow signal, not a wake-someone-up signal. The SPEC’s hard token caps are the hard guard; this metric is the soft guard that catches users approaching the cap, abuse patterns, or prompt regressions that suddenly cause much longer outputs.
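
A sketch of that soft guard as an after-each-LLM-call check. The threshold, the spend counter, and the Slack webhook are all assumptions; the point is that it posts to a channel rather than paging:

```python
# Soft-guard sketch: roll the customer's daily token spend and post to Slack
# (not the pager) when it crosses a threshold.
import os
import requests

DAILY_TOKEN_THRESHOLD = 500_000  # illustrative; tune per tier

def record_token_spend(tenant_id: str, tokens_used: int, spend_store) -> None:
    # spend_store.increment_daily is a hypothetical counter (Redis, Postgres, etc.)
    total_today = spend_store.increment_daily(tenant_id, tokens_used)
    if total_today > DAILY_TOKEN_THRESHOLD:
        requests.post(
            os.environ["COST_ALERTS_SLACK_WEBHOOK"],
            json={"text": f"Tenant {tenant_id} at {total_today} tokens today "
                          f"(threshold {DAILY_TOKEN_THRESHOLD}) — investigate tomorrow."},
            timeout=5,
        )
```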

Free-tier cost-per-cohort. The signup-month cohort and their cumulative LLM spend. The SPEC explicitly calls for this dashboard at launch. It needs to exist on day one and Cindy needs access to it without asking Seth.
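
The dashboard behind it is roughly one query over the events dataset. Table and column names here are invented; the real schema will differ:

```python
# Sketch of the cost-per-cohort query. Assumes signup_date is a DATE column
# and llm_cost_usd is tracked per usage event.
from google.cloud import bigquery

COHORT_COST_SQL = """
SELECT
  FORMAT_DATE('%Y-%m', c.signup_date)                 AS signup_cohort,
  COUNT(DISTINCT c.tenant_id)                         AS free_tier_customers,
  SUM(e.llm_cost_usd)                                 AS cumulative_llm_spend_usd,
  SUM(e.llm_cost_usd) / COUNT(DISTINCT c.tenant_id)   AS spend_per_customer_usd
FROM memberintel_events.customers AS c
JOIN memberintel_events.llm_usage AS e USING (tenant_id)
WHERE c.tier = 'free'
GROUP BY signup_cohort
ORDER BY signup_cohort
"""

def cohort_costs():
    return list(bigquery.Client().query(COHORT_COST_SQL).result())
```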

RLS violation attempts. This one is subtle. You should have an integration test that’s permanently running in production-like staging that confirms a tenant cannot access another tenant’s data. If that test ever fails, every alert goes off. In production itself, you can also instrument Postgres to log query patterns that look suspicious — queries on tenant tables without the expected app.current_tenant_id set. The expected count is zero. Anything above zero is a P1.
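
A sketch of the staging isolation check, assuming psycopg2 and a members table; the only firm part is the app.current_tenant_id session-variable convention:

```python
# Always-on staging check: tenant A must see zero of tenant B's rows under RLS.
import psycopg2

def assert_tenant_isolation(dsn: str, tenant_a: str, tenant_b: str) -> None:
    with psycopg2.connect(dsn) as conn, conn.cursor() as cur:
        # Adopt tenant A's identity for this session.
        cur.execute("SELECT set_config('app.current_tenant_id', %s, false)", (tenant_a,))
        # Attempt to read tenant B's data; RLS should return nothing.
        cur.execute("SELECT count(*) FROM members WHERE tenant_id = %s", (tenant_b,))
        leaked = cur.fetchone()[0]
        assert leaked == 0, (
            f"RLS VIOLATION: tenant {tenant_a} can see {leaked} rows belonging to {tenant_b}"
        )
```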

Cross-pollination pipeline health. Every run of the cross-pollination job emits structured events: tenants eligible, candidates drafted, candidates approved, candidates rejected, k-anonymity floor violations, anonymization-step failures. These go to the audit dataset and to a Cindy-facing dashboard. The signal you watch most closely is rejection rate — both directions, per the earlier conversation. Below 10%, the upstream filtering is too loose; above 50%, the LLM drafting is failing.
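
The band check itself is trivial. A sketch, with the counts assumed to come from the run's structured events and the 10%/50% band taken from above:

```python
# Out-of-band check on cross-pollination rejection rate.
def check_rejection_rate(candidates_drafted: int, candidates_rejected: int) -> str | None:
    if candidates_drafted == 0:
        return "no candidates drafted — pipeline may be stalled"
    rate = candidates_rejected / candidates_drafted
    if rate < 0.10:
        return f"rejection rate {rate:.0%} below band — upstream filtering too loose"
    if rate > 0.50:
        return f"rejection rate {rate:.0%} above band — LLM drafting is failing"
    return None  # in band, nothing to flag
```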

Eval suite drift. The nightly eval run produces a pass-rate per scenario category. If the rate drops, an alert fires. This catches Anthropic model updates that subtly change behavior, retrieval changes that affect grounding, and prompt edits that looked fine in code review but degrade on certain inputs. Worth wiring up in V1 even if it’s just “post to a Slack channel” — by V1.5 it becomes a release-blocking gate.
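
The V1 "post to a Slack channel" version is a few lines; the per-category floor and the webhook URL are assumptions:

```python
# Nightly eval-drift check, V1 form: post drifted categories to Slack.
import os
import requests

PASS_RATE_FLOOR = 0.90  # illustrative per-category floor

def report_eval_drift(pass_rates: dict[str, float]) -> None:
    drifted = {cat: rate for cat, rate in pass_rates.items() if rate < PASS_RATE_FLOOR}
    if drifted:
        lines = "\n".join(f"- {cat}: {rate:.0%}" for cat, rate in drifted.items())
        requests.post(
            os.environ["EVAL_ALERTS_SLACK_WEBHOOK"],
            json={"text": f"Nightly eval pass rate below {PASS_RATE_FLOOR:.0%}:\n{lines}"},
            timeout=5,
        )
```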

Hallucination rate proxy. The SPEC has a target of <1% on financial-data answers. The way you measure this in production is sampled human review of chat outputs, plus an automated check that every chat response with a numeric claim has a citation attached (per SPEC §8.4). The automated check is cheap — flag responses that mention numbers but don’t have a data_id reference in the trace — and gets you most of the way. The sampled review is a content-lead task once a week.
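
The automated half is roughly the check below; the regex and the citation format are assumptions:

```python
# Cheap check: flag responses that mention numbers but carry no data_id citation.
import re

NUMERIC_CLAIM = re.compile(r"[$€£]?\d[\d,]*(\.\d+)?%?")

def flag_uncited_numeric_response(response_text: str, citations: list[str]) -> bool:
    """Return True if the response makes a numeric claim with no citation attached."""
    has_number = bool(NUMERIC_CLAIM.search(response_text))
    has_citation = any(c.startswith("data_id:") for c in citations)  # assumed citation format
    return has_number and not has_citation
```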

Sync pipeline reliability. The MP and Stripe sync pipelines fail in interesting ways: rate limits, schema changes, plugin updates breaking the customer’s MP install. Every sync attempt emits success/failure with reason. The failure rate per customer is what matters — a single customer with 100% sync failure for a week is a customer who’s about to churn because their dashboard is wrong, and proactive support outreach on this signal is one of the highest-leverage things Cindy can do.
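
The outreach list is one query over those sync events; table and column names are again invented:

```python
# Customers whose dashboard has likely been wrong for days: high sync failure
# rate over the trailing week.
from google.cloud import bigquery

FAILING_SYNCS_SQL = """
SELECT
  tenant_id,
  COUNTIF(status = 'failure') / COUNT(*) AS failure_rate,
  ANY_VALUE(reason)                       AS sample_reason
FROM memberintel_events.sync_attempts
WHERE attempted_at >= TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
GROUP BY tenant_id
HAVING failure_rate > 0.5
ORDER BY failure_rate DESC
"""

def customers_needing_outreach():
    return list(bigquery.Client().query(FAILING_SYNCS_SQL).result())
```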

The on-call structure for a 4-engineer team.

The realistic picture for V1: Seth and the Senior AI Engineer share primary on-call, Ronald is secondary, Cindy is the customer-facing escalation point but doesn’t carry a pager. That’s a rotation of two people, which is the minimum that doesn’t burn someone out. Once the team grows past four engineers, you can move to a proper rotation.

What you want to avoid in V1 is a culture where every alert pages someone. The fastest path to burnout is over-paging. The discipline is to be ruthless about which alerts are pageable.

Pageable alerts (wake someone at 3am): customer-facing API down, Cloud SQL down, RLS violation detected, cross-pollination job leaked across tenants (which should be impossible by construction but the alert exists as a backstop), sustained error rate above 1%, payment-processor webhook failures (because customers can’t upgrade, which directly costs revenue).

Tomorrow alerts (slack channel, Seth reviews in the morning): single-customer sync failures, cost-per-customer outliers, eval suite drift below threshold but above floor, cross-pollination rejection rate out of band.

Weekly review (no alert, dashboard only): free-tier cost-per-cohort, conversion funnel health, brain growth metrics, customer engagement patterns.

The ratio you want is roughly 1 pageable alert per week or less, ideally per month. If you’re getting paged more than that, your alerts are tuned wrong, not your reliability.
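
One way to keep that discipline honest is to write the tiering down as config rather than folklore. A sketch, with illustrative alert names and channels:

```python
# Alert routing as reviewable data; names and channels are illustrative.
ALERT_ROUTING = {
    "pageable": {   # wake someone at 3am
        "channel": "pagerduty",
        "alerts": ["api_down", "cloud_sql_down", "rls_violation",
                   "cross_pollination_leak", "error_rate_sustained",
                   "stripe_webhook_failures"],
    },
    "tomorrow": {   # Slack channel, reviewed in the morning
        "channel": "#alerts-tomorrow",
        "alerts": ["customer_sync_failure", "cost_per_customer_outlier",
                   "eval_drift_soft", "cross_pollination_rejection_band"],
    },
    "weekly": {     # dashboards only, no alert
        "channel": None,
        "alerts": ["free_tier_cohort_cost", "conversion_funnel",
                   "brain_growth", "engagement"],
    },
}
```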

Runbooks: the document that turns alerts into action.

For every pageable alert, there's a runbook. The runbook lives in the same repo as the application code, in a docs/runbooks/ directory, and is linked directly from the alert. The format is short and blunt: what this alert means, what the immediate action is, what the rollback is, and who you escalate to if it gets worse.

The runbooks for V1 that you can write now without waiting for incidents to teach you:

The “API down” runbook: check Cloud Run revision status, check whether the latest deploy looks suspicious, rollback to previous revision, escalate to Seth if rollback doesn’t help in 5 minutes.

The “Cloud SQL down” runbook: check Cloud SQL status page, check whether failover happened automatically, check connection pool status from the app, escalate to Seth immediately because Cloud SQL incidents almost always need vendor support.

The “RLS violation detected” runbook: this is the brand-defining one. The action is to take the affected service offline immediately (not gracefully — kill the traffic), notify Seth and Cindy, and not bring it back up until the source is identified. Practice this one in staging. The instinct in a real incident is to “investigate while it’s running” — that instinct is wrong here. Customer data is leaking; stop the bleeding.

The “cross-pollination leak suspected” runbook: pause the cross-pollination scheduled job, halt content-lead review of pending candidates, page Seth and Cindy, audit the most recent runs. This is an “act first, investigate second” alert.

The “payment failure spike” runbook: check Stripe status, check whether webhook delivery is succeeding, check whether the dunning sequence is firing correctly, page Ally Roger if the issue is on the billing-integration side.

Write these before you launch, not after the first incident. They take half a day each and they’re the difference between a calm response and panic.

Status page and customer comms.

A status page is non-optional for a paid SaaS. Use Atlassian Statuspage, Better Stack Status, or a similar managed product — don’t build your own. The status page subscribes to the same alerts your team sees, with a manual gate (“this is customer-affecting, post to status page”). The discipline is to over-communicate during incidents, not under. A 10-minute outage that you posted about is less brand-damaging than a 10-minute outage you stayed silent about.

Customer-comms during incidents is Cindy’s responsibility per the JD (“Customer comms in event of incident”). Pre-write three templates: degraded service, partial outage, full outage. Have Cindy post them with one click. Don’t compose during the incident.

Privacy and observability collide here.

One important rule that’s easy to get wrong: logs and traces should never contain customer-data values. Tenant IDs, request IDs, user IDs are fine. Member emails, transaction amounts, chat message contents are not. The temptation when debugging is to dump everything; resist it. Set up log scrubbing at the application layer: a structured logger that takes a context object and only emits the safe fields, with a developer-only verbose mode that’s never enabled in production.
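
A sketch of that logger with an allow-list of safe fields; the field names are illustrative, and the verbose flag is the developer-only escape hatch that never gets set in production:

```python
# Structured logger that only emits allow-listed fields, so scrubbing is the
# logger's job rather than each call site's.
import json
import logging
import os

SAFE_FIELDS = {"tenant_id", "request_id", "user_id", "event", "duration_ms", "status"}

logger = logging.getLogger("memberintel")

def log_event(level: int, event: str, **context) -> None:
    safe = {k: v for k, v in context.items() if k in SAFE_FIELDS}
    dropped = set(context) - SAFE_FIELDS
    if dropped and os.environ.get("VERBOSE_DEV_LOGS") == "1":   # never enabled in production
        safe["unscrubbed"] = {k: context[k] for k in dropped}
    logger.log(level, json.dumps({"event": event, **safe}))
```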

The audit dataset is the place where the fact of an event is recorded, with enough context to investigate (“user X updated brain entry Y at time Z”), but not the content of the data (“here’s the entry text”). If counsel needs the content, the application has a separate, gated query path that requires elevated permissions and gets logged when used. Logging-the-logging-access is what privacy professionals call meta-audit, and it’s the right pattern for this product.
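
A sketch of the gated path, assuming a hypothetical elevated role and an audit emitter; the shape matters more than the names:

```python
# Gated content access with meta-audit: reading the actual data requires an
# elevated role, and the read itself is recorded in the audit dataset.
def read_brain_entry_content(entry_id: str, requester, brain_store, audit) -> str:
    if "privacy.content_reader" not in requester.roles:    # hypothetical elevated role
        raise PermissionError("content access requires the elevated privacy role")
    audit.emit(                                             # meta-audit: log the access itself
        event_type="audit_content_access",
        actor=requester.user_id,
        target=entry_id,
        reason=requester.access_reason,
    )
    return brain_store.get_entry_text(entry_id)
```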

This rule has implications for AI traces specifically. The OTel spans for chat calls should record token counts, model used, latency, tool calls — but not the prompt text or the response text. When debugging an actual prompt issue, engineers reproduce in staging with synthetic data; they don’t read prod prompts. This is harder than it sounds and worth being firm about.

The cost of observability.

Worth flagging: observability isn’t free. At your scale, Cloud Logging plus Cloud Monitoring plus BigQuery plus Cloud Trace will run somewhere in the $500-2000/month range during V1 and grow with traffic. Budget for it explicitly rather than discovering it. The cost-per-Free-user model in the SPEC doesn’t currently include observability; it should.

A small cost-control move that pays back: most logs and traces don’t need to be retained beyond 30 days. Default Cloud Logging retention is 30 days for the _Default bucket and that’s correct — don’t extend it. The BigQuery audit dataset is the long-retention store, and it’s much cheaper per GB than Cloud Logging.

The decisions to make to move forward.

  1. Confirm the three-destination model: Cloud Logging for short-term debug, BigQuery for business analytics, locked-down BigQuery dataset for audit. Recommendation: yes.

  2. OTel instrumentation in V1 from day one, not added later. Recommendation: yes — adding it later means rewriting every code path that calls the LLM, which is the core of your product.

  3. Status page selected and configured before GA, not after the first incident. Recommendation: Atlassian Statuspage or Better Stack Status, integrated with Cloud Monitoring.

  4. Pageable alert discipline: target 1 page per month or less. Recommendation: aggressively un-page alerts during the first month post-launch as you learn what’s noise.

  5. Runbooks written for the five highest-stakes failure modes before GA: API down, DB down, RLS violation, cross-pollination leak, payment failure spike. Recommendation: half a day each, allocated explicitly in Phase 5.

  6. The customer-data-never-in-logs rule, enforced at the logger level. Recommendation: yes, and include it in the Phase 1 architecture review with privacy counsel so it’s documented.

The natural next threads from here are: secrets management at depth (Stripe tokens, OAuth refresh tokens, key rotation strategy — this gets dense but it’s the part counsel will scrutinize most), the LLM cost-control architecture (entitlement layer’s role in routing, per-customer budget enforcement, abuse prevention — economically critical for the Free tier), or the auth and identity layer (MP-license OAuth flow, standalone email/password fallback, MFA, session management — touches both security and product UX).

For: Seth Shoultes, Santiago Perez Asis