ADR-0010: Observability — Cloud Logging + OpenTelemetry + BigQuery
Status: Accepted
Date: 2026-05-14
Deciders: Seth (Lead Architect)
Context
The MemberIntel V1 API runs on Cloud Run (memberintel-api-staging) in GCP project memberintel-v1. It is a Python FastAPI application using SQLAlchemy (Cloud SQL Postgres), the Anthropic SDK for LLM calls, the Voyage AI SDK for embeddings, and httpx for outbound HTTP. The cost review and SPEC Open Q11 require per-request cost tracking and the ability to produce a weekly cost-per-cohort report.
Current state:
- Structured logging is in place for LLM costs. _record_cost() in src/memberintel/llm/call.py logs llm.cost as a structured line with tier, operation, input_tokens, and output_tokens. The application uses structlog with JSONRenderer (configured in src/memberintel/api/app.py), so these logs arrive in Cloud Logging as structured JSON payloads.
- No request tracing exists. There is no correlation ID linking a chat request through retrieval, LLM call, and response. When a user reports a slow or failed interaction, we cannot trace the request end-to-end.
- No metrics collection exists. Cloud Monitoring dashboards are scaffolded in Terraform but not populated. There are no application-level metrics for API latency, error rates, LLM token usage, embedding latency, or quota enforcement counts.
- No BigQuery sink exists. Cloud Logging is the only destination. The cost review requires an audit dataset (locked down) and an analytics dataset (normal access) in BigQuery for cost-per-cohort reporting.
- No alerting policies exist. 5xx rate spikes, quota exhaustion, and embedding latency outliers are invisible until a user complains.
The forces at play:
- GCP-native first. We already run on Cloud Run with Cloud Logging. Adding a Cloud Logging sink to BigQuery is zero-code infra. OpenTelemetry instrumentation is the standard way to add request tracing and metrics to a FastAPI app without coupling to GCP SDKs.
- Cost discipline requires structured data. The _record_cost() line already produces structured JSON. Sinking it to BigQuery turns log lines into queryable rows for the weekly cost-per-cohort report.
- Single-vendor observability is simpler than multi-vendor. Cloud Logging + Cloud Monitoring + BigQuery keeps the stack inside GCP. OpenTelemetry is vendor-neutral at the instrumentation layer, so we can redirect traces/metrics elsewhere later without rewriting app code.
Decision
1. Structured JSON logging to Cloud Logging (already in place, extended)
Continue using structlog with JSONRenderer as the application logger. Extend the structured payload beyond _record_cost():
- Add a request_id (UUID) to every log line emitted during a request, injected via structlog.contextvars in a FastAPI middleware. This provides the correlation ID that links all log lines for a single request.
- Add user_id, tier, and operation to the structlog context at the point where they become known (after auth and entitlement check).
- Emit structured log lines for: request start, retrieval latency, embedding latency, LLM call start/end (with token counts), quota enforcement decisions (check_and_consume accept/reject), and request end with status code.
The existing _record_cost() function stays as-is, but gains the request_id from contextvars automatically.
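The request_id binding described above can be sketched with the standard library alone. This is a minimal illustration, not the project's middleware: bind(), clear(), log(), and RequestIdMiddleware are hypothetical names, and a plain contextvars.ContextVar stands in for the structlog context. In the real app, structlog.contextvars.bind_contextvars() and clear_contextvars() would presumably play the roles of bind() and clear().

```python
import asyncio
import contextvars
import uuid

# Stand-in for the structlog context: one ContextVar holding the bound fields.
_log_context = contextvars.ContextVar("log_context", default={})


def bind(**fields):
    """Add fields to the current request's logging context."""
    _log_context.set({**_log_context.get(), **fields})


def clear():
    """Drop every bound field so nothing leaks into the next request."""
    _log_context.set({})


def log(event, **fields):
    """Return the structured line that would be emitted (context + event fields)."""
    return {**_log_context.get(), "event": event, **fields}


class RequestIdMiddleware:
    """Pure-ASGI middleware: bind a fresh request_id, always clear on exit."""

    def __init__(self, app):
        self.app = app

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http":
            await self.app(scope, receive, send)
            return
        clear()
        bind(request_id=str(uuid.uuid4()))
        try:
            await self.app(scope, receive, send)
        finally:
            # Runs on success and on exceptions alike.
            clear()
```

Because the fields live in the context rather than in each call site, a function such as _record_cost() would pick up request_id without any change to its own code.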
2. OpenTelemetry for request tracing and metrics
Add the opentelemetry-* Python packages and instrument three telemetry signals:
- Traces: opentelemetry-instrumentation-fastapi auto-instruments request handling. Custom spans are added for search_brain() (retrieval), call()/call_stream() (LLM), embed_texts()/embed_query() (embeddings), and check_and_consume() (quota). Export to Cloud Trace via opentelemetry-exporter-gcp-trace.
- Metrics: opentelemetry-instrumentation-fastapi provides HTTP server duration/count metrics. Custom counters and histograms cover LLM tokens consumed (llm.tokens{direction=input|output}), embedding latency (embedding.latency_ms), and quota rejections (quota.rejected{operation}), plus a gauge for active requests. Export to Cloud Monitoring via opentelemetry-exporter-prometheus scraped by the Cloud Run sidecar, or via opentelemetry-exporter-gcp-monitoring (preferred, fewer moving parts).
- Logs: OpenTelemetry log bridge is deferred to post-V1. Structured structlog output to Cloud Logging is sufficient for now.
Instrumentation lives in src/memberintel/api/telemetry.py. The FastAPI app initializes telemetry at startup in app.py, reading OTEL_EXPORTER from the environment (gcp or none, defaulting to none locally).
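A minimal sketch of how the startup switch might look, assuming the two values named above. resolve_exporter() and init_telemetry() are hypothetical names, not the contents of telemetry.py. The gcp branch imports the OpenTelemetry SDK and the Cloud Trace exporter lazily, so a local run with OTEL_EXPORTER unset needs neither package installed.

```python
import os

VALID_EXPORTERS = ("gcp", "none")


def resolve_exporter(env=None):
    """Read OTEL_EXPORTER, defaulting to "none" so local runs export nothing."""
    env = os.environ if env is None else env
    value = env.get("OTEL_EXPORTER", "none").strip().lower()
    if value not in VALID_EXPORTERS:
        raise ValueError(
            f"OTEL_EXPORTER must be one of {VALID_EXPORTERS}, got {value!r}"
        )
    return value


def init_telemetry(env=None):
    """Configure tracing once at startup; return the exporter that was selected."""
    exporter = resolve_exporter(env)
    if exporter == "gcp":
        # Imported lazily: only the Cloud Run deployment needs these packages.
        from opentelemetry import trace
        from opentelemetry.exporter.cloud_trace import CloudTraceSpanExporter
        from opentelemetry.sdk.trace import TracerProvider
        from opentelemetry.sdk.trace.export import BatchSpanProcessor

        provider = TracerProvider()
        provider.add_span_processor(BatchSpanProcessor(CloudTraceSpanExporter()))
        trace.set_tracer_provider(provider)
    return exporter
```

Rejecting unknown values at startup, rather than silently falling back to none, is a design choice of this sketch: a typo in the environment then fails the deploy instead of quietly disabling tracing.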
3. Cloud Logging sink to BigQuery
Create two BigQuery datasets via Terraform:
- memberintel_audit: locked-down dataset. IAM restricted to the memberintel-audit-viewers group. Receives only the Cloud Logging entries that match the sink filter logName:"memberintel-api" AND (severity>=ERROR OR jsonPayload.message=~"llm.cost"). Retention: 365 days.
- memberintel_analytics: normal-access dataset. IAM allows the memberintel-analytics-viewers group. Receives all Cloud Logging entries from the service via a second sink (no severity filter). Retention: 90 days, then auto-expire.
Both sinks use the bigquery destination type with use_partitioned_tables enabled (partitioned by timestamp, clustered by jsonPayload.tier and jsonPayload.operation) for efficient cost-per-cohort queries.
The weekly cost-per-cohort report queries memberintel_analytics:
SELECT
jsonPayload.tier AS tier,
jsonPayload.operation AS operation,
DATE(timestamp) AS date,
SUM(CAST(jsonPayload.input_tokens AS INT64)) AS total_input_tokens,
SUM(CAST(jsonPayload.output_tokens AS INT64)) AS total_output_tokens,
COUNT(*) AS call_count
FROM `memberintel-v1.memberintel_analytics.cloud_logging_sink`
WHERE timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 7 DAY)
AND jsonPayload.message = "llm.cost"
GROUP BY tier, operation, date
ORDER BY date, tier, operation
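The grouping in the query above can be mirrored in plain Python, which may be useful for unit-testing the shape of llm.cost log lines before they reach BigQuery. This is a sketch under assumptions: cost_per_cohort() is a hypothetical helper, entries are assumed to be dicts with an ISO-8601 timestamp and a jsonPayload carrying the fields the query reads, and the 7-day window is left to the caller.

```python
from collections import defaultdict
from datetime import datetime


def cost_per_cohort(entries):
    """Aggregate llm.cost entries by (tier, operation, date), like the GROUP BY."""
    totals = defaultdict(
        lambda: {"input_tokens": 0, "output_tokens": 0, "call_count": 0}
    )
    for entry in entries:
        payload = entry["jsonPayload"]
        if payload.get("message") != "llm.cost":
            continue  # same filter as the WHERE clause
        day = datetime.fromisoformat(entry["timestamp"]).date().isoformat()
        row = totals[(payload["tier"], payload["operation"], day)]
        # int() mirrors the CAST(... AS INT64) in the query.
        row["input_tokens"] += int(payload["input_tokens"])
        row["output_tokens"] += int(payload["output_tokens"])
        row["call_count"] += 1
    # Order by date, then tier, then operation, as in the ORDER BY.
    return dict(sorted(totals.items(), key=lambda kv: (kv[0][2], kv[0][0], kv[0][1])))
```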
4. Cloud Monitoring dashboards
Define four dashboards in Terraform (google_monitoring_dashboard resources):
- API Latency: p50/p95/p99 of http.server.duration by route. Annotated with Cloud Run container restarts.
- Error Rates: 4xx rate by route, 5xx rate by route. Rolling 5-minute window.
- LLM Token Usage: llm.tokens counter by direction (input/output) and tier. Cumulative sum over 24 hours with per-cohort breakdown.
- Quota Enforcement: quota.rejected counter by operation. Stacked bar chart by tier.
Dashboards are read-only for the memberintel-operators IAM group.
5. Alerting policies
Define three alert policies in Terraform (google_monitoring_alert_policy resources):
- 5xx rate spike: Alert when 5xx error rate exceeds 5% over a 5-minute rolling window on the memberintel-api-staging Cloud Run service. Notify memberintel-oncall PagerDuty channel.
- Quota exhaustion: Alert when quota.rejected count exceeds 10 in any 5-minute window for a single operation. Notify memberintel-oncall PagerDuty channel and Seth via email.
- Embedding latency: Alert when p95 of embedding.latency_ms exceeds 2000ms over a 10-minute window. Notify Seth via email (embedding latency is not page-worthy but requires investigation).
All alert policies route notifications through a single google_monitoring_notification_channel per destination (PagerDuty, email).
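The 5xx condition is evaluated by Cloud Monitoring, not by application code, but its semantics can be illustrated in a few lines. The sketch below is only a way to reason about the threshold: RollingErrorRate is a hypothetical class, timestamps are plain seconds, and "exceeds 5%" is read as strictly greater than 0.05.

```python
from collections import deque


class RollingErrorRate:
    """Track (timestamp, status) pairs and report the 5xx rate over a window."""

    def __init__(self, window_seconds=300, threshold=0.05):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._events = deque()

    def record(self, timestamp, status_code):
        """Store one finished request; timestamps must be non-decreasing."""
        self._events.append((timestamp, status_code))

    def rate(self, now):
        """Fraction of requests in the last window_seconds that returned 5xx."""
        # Drop events that have fallen out of the window.
        while self._events and self._events[0][0] <= now - self.window_seconds:
            self._events.popleft()
        if not self._events:
            return 0.0
        errors = sum(1 for _, status in self._events if 500 <= status <= 599)
        return errors / len(self._events)

    def should_alert(self, now):
        """True only when the rate is strictly above the threshold."""
        return self.rate(now) > self.threshold
```

With 5 errors in 100 requests the rate is exactly 5% and no alert fires under this reading; one more 5xx tips it over.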
Consequences
Positive:
- End-to-end request tracing from auth through retrieval, LLM call, and response via Cloud Trace + request_id in structlog context.
- LLM cost data is queryable in BigQuery within minutes of a request, enabling the weekly cost-per-cohort report without custom ETL.
- OpenTelemetry instrumentation is vendor-neutral; switching trace/metrics backends requires changing one config env var, not rewriting application code.
- Audit dataset is IAM-locked for compliance; analytics dataset is openly queryable for product decisions.
- Alerting on 5xx rates, quota rejections, and embedding latency catches regressions before users report them.
- No external SaaS observability vendor (Datadog, Honeycomb) — GCP-native keeps costs predictable and avoids another vendor dependency.
Negative / costs:
- OpenTelemetry Python packages add ~15 dependencies and ~2ms per-request overhead for trace/metrics export. This is acceptable for V1 traffic levels but must be monitored.
- BigQuery sink incurs storage costs (audit dataset at 365-day retention) and query costs. The analytics dataset’s 90-day auto-expire mitigates this.
- Cloud Trace's free tier is a monthly span allotment (2.5 million spans per month at the time of writing), not a per-second rate; above that, per-span charges apply. V1 traffic is expected to stay well below this.
- structlog contextvars require careful lifecycle management in async FastAPI: the context must be cleared at request end to avoid leaking data between requests.
- Two BigQuery datasets mean two IAM groups to maintain. This is intentional (separation of audit and analytics access) but adds administrative overhead.
Mitigations:
- The telemetry module (src/memberintel/api/telemetry.py) is the single seam for all instrumentation. Disabling telemetry locally is one env var (OTEL_EXPORTER=none).
- FastAPI middleware that sets structlog.contextvars uses try/finally to guarantee context cleanup.
- BigQuery sink filters are narrow (only the Cloud Run service’s logName), avoiding unnecessary data ingestion from other GCP resources.
- The audit dataset is small (only errors + cost lines) — estimated < 1 GB/month even at peak V1 traffic.
- If per-request overhead becomes a concern, traces can be sampled (1-in-10) while metrics remain 100%.
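The 1-in-10 sampling mentioned in the last mitigation could be expressed as a ratio-based decision on the trace ID. The sketch below is a standard-library illustration of that idea, similar in spirit to the OpenTelemetry SDK's TraceIdRatioBased sampler, which is what the telemetry module would more likely configure. should_sample() and TRACE_ID_LIMIT are hypothetical names.

```python
TRACE_ID_LIMIT = 1 << 64  # the decision uses the low 64 bits of the 128-bit trace ID


def should_sample(trace_id, ratio=0.1):
    """Deterministic head sampling: the same trace ID always gets the same answer."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    bound = round(ratio * TRACE_ID_LIMIT)
    # Keep the trace when its low 64 bits fall below the bound.
    return (trace_id & (TRACE_ID_LIMIT - 1)) < bound
```

Deciding from the trace ID rather than from a random draw means every service that sees the same trace reaches the same decision, so sampled traces stay complete. Metrics are unaffected because they are recorded outside the sampler.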
Alternatives considered
- Datadog / Honeycomb / Grafana Cloud — rejected for V1: adds a third-party SaaS vendor on the critical path, increases monthly cost ($23-31/host for Datadog APM alone), and duplicates data that Cloud Logging already collects. Revisit if multi-cloud or multi-cluster observability becomes a requirement.
- Self-hosted Prometheus + Jaeger on GKE — rejected: Cloud Run does not run on GKE; running a separate metrics/traces infra cluster contradicts the “single vendor” principle and adds operational burden disproportionate to V1 traffic.
- Cloud Logging only (no OpenTelemetry) — rejected: Cloud Logging alone cannot produce request traces (spans linked by trace ID) or application-level metrics (histograms, counters). Without OTel, diagnosing a slow chat request requires manual log grep across retrieval, LLM, and embedding lines — a 10-minute task per incident instead of a single trace view.
- Direct BigQuery inserts from the app (no Cloud Logging sink) — rejected: would require a BigQuery client in the app process, coupling the FastAPI request path to BigQuery availability and adding ~50ms per insert. The sink pattern decouples app code from BigQuery and provides at-least-once delivery with no app-level latency impact.