ADR-0010: Observability — Cloud Logging + OpenTelemetry + BigQuery

Status: Accepted
Date: 2026-05-14
Deciders: Seth (Lead Architect)

Context

The MemberIntel V1 API runs on Cloud Run (memberintel-api-staging) in GCP project memberintel-v1. It is a Python FastAPI application using SQLAlchemy (Cloud SQL Postgres), the Anthropic SDK for LLM calls, the Voyage AI SDK for embeddings, and httpx for outbound HTTP. The cost review and SPEC Open Q11 require per-request cost tracking and the ability to produce a weekly cost-per-cohort report.

Current state:

  1. Structured logging is in place for LLM costs. _record_cost() in src/memberintel/llm/call.py logs llm.cost as a structured line with tier, operation, input_tokens, and output_tokens. The application uses structlog with JSONRenderer (configured in src/memberintel/api/app.py), so these logs arrive in Cloud Logging as structured JSON payloads.
  2. No request tracing exists. There is no correlation ID linking a chat request through retrieval, LLM call, and response. When a user reports a slow or failed interaction, we cannot trace the request end-to-end.
  3. No metrics collection exists. Cloud Monitoring dashboards are scaffolded in Terraform but not populated. There are no application-level metrics for API latency, error rates, LLM token usage, embedding latency, or quota enforcement counts.
  4. No BigQuery sink exists. Cloud Logging is the only destination. The cost review requires an audit dataset (locked down) and an analytics dataset (normal access) in BigQuery for cost-per-cohort reporting.
  5. No alerting policies exist. 5xx rate spikes, quota exhaustion, and embedding latency outliers are invisible until a user complains.

The forces at play:

  • GCP-native first. We already run on Cloud Run with Cloud Logging. Adding a Cloud Logging sink to BigQuery is zero-code infra. OpenTelemetry instrumentation is the standard way to add request tracing and metrics to a FastAPI app without coupling to GCP SDKs.
  • Cost discipline requires structured data. The _record_cost() line already produces structured JSON. Sinking it to BigQuery turns log lines into queryable rows for the weekly cost-per-cohort report.
  • Single-vendor observability is simpler than multi-vendor. Cloud Logging + Cloud Monitoring + BigQuery keeps the stack inside GCP. OpenTelemetry is vendor-neutral at the instrumentation layer, so we can redirect traces/metrics elsewhere later without rewriting app code.

Decision

1. Structured JSON logging to Cloud Logging (already in place, extended)

Continue using structlog with JSONRenderer as the application logger. Extend the structured payload beyond _record_cost():

  • Add a request_id (UUID) to every log line emitted during a request, injected via structlog.contextvars in a FastAPI middleware. This provides the correlation ID that links all log lines for a single request.
  • Add user_id, tier, and operation to the structlog context at the point where they become known (after auth and entitlement check).
  • Emit structured log lines for: request start, retrieval latency, embedding latency, LLM call start/end (with token counts), quota enforcement decisions (check_and_consume accept/reject), and request end with status code.

The existing _record_cost() function stays as-is. It picks up the request_id automatically, provided structlog.contextvars.merge_contextvars is in the structlog processor chain.

2. OpenTelemetry for request tracing and metrics

Add the opentelemetry-* Python packages. Of the three telemetry signals, traces and metrics are instrumented now and logs are deferred:

  • Traces: opentelemetry-instrumentation-fastapi auto-instruments request handling. Custom spans are added for search_brain() (retrieval), call() / call_stream() (LLM), embed_texts() / embed_query() (embeddings), and check_and_consume() (quota). Export to Cloud Trace via opentelemetry-exporter-gcp-trace.
  • Metrics: opentelemetry-instrumentation-fastapi provides HTTP server duration/count metrics. Custom instruments are added for LLM tokens consumed (counter, llm.tokens{direction=input|output}), embedding latency (histogram, embedding.latency_ms), quota rejections (counter, quota.rejected{operation}), and active requests (gauge). Export to Cloud Monitoring either via opentelemetry-exporter-gcp-monitoring (preferred, fewer moving parts) or via opentelemetry-exporter-prometheus scraped by a Cloud Run sidecar.
  • Logs: OpenTelemetry log bridge is deferred to post-V1. Structured structlog to Cloud Logging is sufficient for now.

Instrumentation lives in src/memberintel/api/telemetry.py. The FastAPI app initializes telemetry at startup in app.py, reading OTEL_EXPORTER from the environment (gcp or none, defaulting to none locally).
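The startup switch described above might look like the following sketch. Only the OTEL_EXPORTER variable and its two values (gcp, none) come from this ADR; the function names resolve_exporter and init_telemetry are hypothetical, and the exporter wiring itself is left as a comment rather than guessed at.

```python
import os
from typing import Mapping, Optional

VALID_EXPORTERS = ("gcp", "none")


def resolve_exporter(env: Optional[Mapping[str, str]] = None) -> str:
    """Return the exporter named by OTEL_EXPORTER, defaulting to 'none'."""
    env = os.environ if env is None else env
    value = (env.get("OTEL_EXPORTER") or "none").strip().lower()
    if value not in VALID_EXPORTERS:
        # Fail fast at startup instead of silently running without telemetry.
        raise ValueError(
            f"OTEL_EXPORTER must be one of {VALID_EXPORTERS}, got {value!r}"
        )
    return value


def init_telemetry(env: Optional[Mapping[str, str]] = None) -> bool:
    """Return True when exporters should be installed, False for a no-op setup."""
    if resolve_exporter(env) == "none":
        # Local default: leave the OpenTelemetry API in its no-op state.
        return False
    # "gcp": this is where the Cloud Trace and Cloud Monitoring exporters
    # would be attached to the tracer and meter providers.
    return True
```

Rejecting unknown values is a design choice in this sketch: a typo such as OTEL_EXPORTER=gpc then fails the deploy rather than producing a service that silently exports nothing.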

3. Cloud Logging sink to BigQuery

Create two BigQuery datasets via Terraform:

  • memberintel_audit: locked-down dataset. IAM restricted to the memberintel-audit-viewers group. Receives only the entries matched by a log sink with the filter logName:"memberintel-api" AND (severity>=ERROR OR jsonPayload.message=~"llm.cost"), that is, errors and cost lines. Retention: 365 days.
  • memberintel_analytics: normal-access dataset. IAM allows the memberintel-analytics-viewers group. Receives all Cloud Logging entries from the service via a second sink (no severity filter). Retention: 90 days, after which tables auto-expire.

Both sinks use the bigquery destination type with use_partitioned_tables enabled (partitioned by timestamp, clustered by jsonPayload.tier and jsonPayload.operation) for efficient cost-per-cohort queries.
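To make the routing rules of the two sinks concrete, the following sketch mirrors them as a Python predicate. It is an approximation for reasoning and unit tests, not the Cloud Logging filter engine: it assumes the logName clause has already matched, and it assumes the =~ operator behaves as an unanchored regular-expression search. The function name destinations is hypothetical.

```python
import re

# Numeric values of the Cloud Logging LogSeverity enum.
LOG_SEVERITY = {
    "DEFAULT": 0, "DEBUG": 100, "INFO": 200, "NOTICE": 300, "WARNING": 400,
    "ERROR": 500, "CRITICAL": 600, "ALERT": 700, "EMERGENCY": 800,
}

# Same pattern as the sink filter. The unescaped '.' matches any character.
_COST_PATTERN = re.compile(r"llm.cost")


def destinations(entry: dict) -> list:
    """Return the datasets a service log entry would land in."""
    targets = ["memberintel_analytics"]  # second sink: no severity filter
    severity = LOG_SEVERITY.get(entry.get("severity", "DEFAULT"), 0)
    message = str((entry.get("jsonPayload") or {}).get("message", ""))
    if severity >= LOG_SEVERITY["ERROR"] or _COST_PATTERN.search(message):
        targets.append("memberintel_audit")
    return targets
```

Because the pattern's dot is a wildcard, a message such as llm_cost would also be routed to the audit dataset under this reading; escaping the dot in the filter would tighten the match if that matters.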

The weekly cost-per-cohort report queries memberintel_analytics:

SELECT
  jsonPayload.tier AS tier,
  jsonPayload.operation AS operation,
  DATE(timestamp) AS date,
  SUM(CAST(jsonPayload.input_tokens AS INT64)) AS total_input_tokens,
  SUM(CAST(jsonPayload.output_tokens AS INT64)) AS total_output_tokens,
  COUNT(*) AS call_count
FROM `memberintel-v1.memberintel_analytics.cloud_logging_sink`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
  AND jsonPayload.message = "llm.cost"
GROUP BY tier, operation, date
ORDER BY date, tier, operation
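The query returns token totals, not dollars. One possible post-processing step, sketched below, converts the result rows into an estimated cost per (tier, operation) cohort. It assumes a single per-million-token price applies to every call, which may not hold if tiers use different models; TokenPrice and cost_per_cohort are hypothetical names, and the prices must be supplied by the caller from the provider's current price list.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, Tuple


@dataclass(frozen=True)
class TokenPrice:
    """USD per one million tokens (caller-supplied, not defined in this ADR)."""
    input_per_million: float
    output_per_million: float


def cost_per_cohort(
    rows: Iterable[dict], price: TokenPrice
) -> Dict[Tuple[str, str], float]:
    """Sum estimated USD cost per (tier, operation) across all report rows."""
    totals: Dict[Tuple[str, str], float] = {}
    for row in rows:
        key = (row["tier"], row["operation"])
        cost = (
            row["total_input_tokens"] * price.input_per_million
            + row["total_output_tokens"] * price.output_per_million
        ) / 1_000_000
        # Rows for the same cohort on different dates accumulate.
        totals[key] = totals.get(key, 0.0) + cost
    return totals
```

The row keys match the column aliases in the query (tier, operation, total_input_tokens, total_output_tokens), so rows fetched from BigQuery as dictionaries can be passed in directly.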

4. Cloud Monitoring dashboards

Define four dashboards in Terraform (google_monitoring_dashboard resources):

  • API Latency: p50/p95/p99 of http.server.duration by route. Annotated with Cloud Run container restarts.
  • Error Rates: 4xx rate by route, 5xx rate by route. Rolling 5-minute window.
  • LLM Token Usage: llm.tokens counter by direction (input/output) and tier. Cumulative sum over 24 hours with per-cohort breakdown.
  • Quota Enforcement: quota.rejected counter by operation. Stacked bar chart by tier.

Dashboards are read-only for the memberintel-operators IAM group.

5. Alerting policies

Define three alert policies in Terraform (google_monitoring_alert_policy resources):

  • 5xx rate spike: Alert when 5xx error rate exceeds 5% over a 5-minute rolling window on the memberintel-api-staging Cloud Run service. Notify memberintel-oncall PagerDuty channel.
  • Quota exhaustion: Alert when quota.rejected count exceeds 10 in any 5-minute window for a single operation. Notify memberintel-oncall PagerDuty channel and Seth via email.
  • Embedding latency: Alert when p95 of embedding.latency_ms exceeds 2000ms over a 10-minute window. Notify Seth via email (embedding latency is not page-worthy but requires investigation).

All alert policies route notifications through a single google_monitoring_notification_channel per destination (PagerDuty, email).
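Cloud Monitoring evaluates these conditions server-side, but the threshold semantics can be checked locally. The sketch below models the 5xx rule (more than 5% of requests failing within a 5-minute rolling window); ErrorRateWindow is a hypothetical helper for reasoning about and unit-testing the threshold, not part of the alerting stack, and it assumes events are recorded in timestamp order.

```python
from collections import deque
from typing import Deque, Tuple


class ErrorRateWindow:
    """Rolling-window 5xx rate check mirroring the alert condition."""

    def __init__(self, window_seconds: float = 300.0, threshold: float = 0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events: Deque[Tuple[float, bool]] = deque()  # (timestamp, is_5xx)

    def record(self, timestamp: float, status_code: int) -> None:
        self._events.append((timestamp, 500 <= status_code <= 599))

    def breached(self, now: float) -> bool:
        # Drop events that have aged out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()
        if not self._events:
            return False
        errors = sum(1 for _, is_5xx in self._events if is_5xx)
        # "Exceeds 5%" is a strict comparison: exactly 5% does not alert.
        return errors / len(self._events) > self.threshold
```

Note that a ratio-based rule can fire on very low traffic (one failure out of ten requests is 10%), which may be worth keeping in mind when tuning the staging policy.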

Consequences

Positive:

  • End-to-end request tracing from auth through retrieval, LLM call, and response via Cloud Trace + request_id in structlog context.
  • LLM cost data is queryable in BigQuery within minutes of a request, enabling the weekly cost-per-cohort report without custom ETL.
  • OpenTelemetry instrumentation is vendor-neutral; switching trace/metrics backends is confined to the exporter configuration in telemetry.py (selected by OTEL_EXPORTER) and does not require rewriting application code.
  • Audit dataset is IAM-locked for compliance; analytics dataset is queryable by the memberintel-analytics-viewers group for product decisions.
  • Alerting on 5xx rates, quota rejections, and embedding latency catches regressions before users report them.
  • No external SaaS observability vendor (Datadog, Honeycomb) — GCP-native keeps costs predictable and avoids another vendor dependency.

Negative / costs:

  • OpenTelemetry Python packages add ~15 dependencies and ~2ms per-request overhead for trace/metrics export. This is acceptable for V1 traffic levels but must be monitored.
  • BigQuery sink incurs storage costs (audit dataset at 365-day retention) and query costs. The analytics dataset’s 90-day auto-expire mitigates this.
  • Cloud Trace charges per ingested span beyond a monthly free allotment (2.5 million spans per billing account at the time of writing; verify against current pricing). V1 traffic is expected to stay well below this.
  • structlog contextvars require careful lifecycle management in async FastAPI — the context must be cleared at request end to avoid leaking data between requests.
  • Two BigQuery datasets mean two IAM groups to maintain. This is intentional (separation of audit and analytics access) but adds administrative overhead.

Mitigations:

  • The telemetry module (src/memberintel/api/telemetry.py) is the single seam for all instrumentation. Disabling telemetry locally is one env var (OTEL_EXPORTER=none).
  • FastAPI middleware that sets structlog.contextvars uses try/finally to guarantee context cleanup.
  • BigQuery sink filters are narrow (only the Cloud Run service’s logName), avoiding unnecessary data ingestion from other GCP resources.
  • The audit dataset is small (only errors + cost lines) — estimated < 1 GB/month even at peak V1 traffic.
  • If per-request overhead becomes a concern, traces can be sampled (1-in-10) while metrics remain 100%.
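The 1-in-10 trace sampling mentioned in the last mitigation is typically done deterministically on the trace ID, so that every service in a request path makes the same decision. The sketch below reproduces that idea with the standard library only; in the application, the OpenTelemetry SDK's TraceIdRatioBased sampler (usually wrapped in ParentBased) would be used instead of a hand-rolled function. The name should_sample is hypothetical.

```python
# Only the low 64 bits of the 128-bit trace ID take part in the decision.
TRACE_ID_LIMIT = (1 << 64) - 1


def should_sample(trace_id: int, ratio: float = 0.1) -> bool:
    """Return True if the trace falls inside the sampled fraction."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0.0 and 1.0")
    # IDs below this bound are kept; with uniform IDs that is `ratio` of them.
    bound = round(ratio * (TRACE_ID_LIMIT + 1))
    return (trace_id & TRACE_ID_LIMIT) < bound
```

Because the decision depends only on the trace ID, a trace is either recorded in full or not at all, and metrics are unaffected since they are aggregated independently of trace sampling.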

Alternatives considered

  • Datadog / Honeycomb / Grafana Cloud — rejected for V1: adds a third-party SaaS vendor on the critical path, increases monthly cost ($23-31/host for Datadog APM alone), and duplicates data that Cloud Logging already collects. Revisit if multi-cloud or multi-cluster observability becomes a requirement.
  • Self-hosted Prometheus + Jaeger on GKE — rejected: Cloud Run does not run on GKE; running a separate metrics/traces infra cluster contradicts the “single vendor” principle and adds operational burden disproportionate to V1 traffic.
  • Cloud Logging only (no OpenTelemetry) — rejected: Cloud Logging alone cannot produce request traces (spans linked by trace ID) or application-level metrics (histograms, counters). Without OTel, diagnosing a slow chat request requires manual log grep across retrieval, LLM, and embedding lines — a 10-minute task per incident instead of a single trace view.
  • Direct BigQuery inserts from the app (no Cloud Logging sink) — rejected: would require a BigQuery client in the app process, coupling the FastAPI request path to BigQuery availability and adding ~50ms per insert. The sink pattern decouples app code from BigQuery and provides at-least-once delivery with no app-level latency impact.
For: Seth Shoultes, AI Engineer, Blair Williams, Santiago Perez Asis, Product Lead