
LLM Cost-Control Architecture

Defines the four-layer cost-control stack — rate limiting, entitlement service, per-call token budgets, and continuous spend monitoring — with Redis-backed quota counters, server-side model routing enforcement, and a global daily circuit breaker.


The framing: cost control is an architectural concern, not a feature.

Most teams treat LLM cost like they treat compute cost — “we’ll watch the dashboard, optimize when we have to.” That works for compute, because compute cost scales roughly linearly with users and grows slowly. LLM cost doesn’t behave that way. A single user with a runaway prompt loop can spend $50 in an hour. A single regression in retrieval logic that causes “include the full corpus in the prompt instead of top-K” can 10x your spend overnight. A bot abusing the Free tier signup can rack up thousands in cost before manual intervention catches it.

The architectural implication is that cost controls have to be enforced server-side at the request level, before the LLM is called, with hard limits that cannot be bypassed by prompts, by application bugs, or by user behavior. This is the entitlement layer’s most important job, and it’s why the SPEC calls out that it’s a single source of truth, not scattered checks.

The four layers of defense.

Cost control isn’t one mechanism — it’s a layered system where each layer catches what the previous one missed. The layers, from outermost to innermost:

Layer 1: rate limiting at the request boundary. Per-user, per-IP, per-endpoint. Blunt instrument that catches the obvious abuse: scripted attacks, bot floods, accidental client-side loops. Cloud Armor or a simple token-bucket in the application handles this.
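
For illustration, a minimal in-process token bucket in the shape Layer 1 needs. This is a sketch, not the production design: in production the counters would live in Redis (or the limit sits in Cloud Armor) so limits hold across instances.

    import time

    class TokenBucket:
        """Allows `burst` immediate requests, refilling at `rate_per_sec`."""

        def __init__(self, rate_per_sec: float, burst: int):
            self.rate = rate_per_sec
            self.burst = float(burst)
            self.tokens = float(burst)
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            # Refill proportionally to elapsed time, capped at the burst size.
            self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False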

Layer 2: entitlement check before the LLM call. This is the SPEC’s central control. Before any LLM call fires, the request passes through the entitlement service which answers: does this user have quota remaining for this operation? Is this model allowed for this user? Is this tier currently in good standing? If any answer is no, the request is rejected with a clear error before any tokens are spent.

Layer 3: per-call token budget enforcement. Even an authorized request shouldn’t be allowed to spend unboundedly. Every LLM call has a max_tokens cap on output. The cap differs by tier and by operation. This is the layer that catches “the LLM decided to produce a 50,000-token response for a chat message.” Set the cap, server-side, on every call.

Layer 4: continuous spend monitoring. The catch-all. Aggregate spend per user per day rolls up to a real-time view. Thresholds fire alerts. Outliers get auto-throttled. This is what catches the patterns the first three layers don’t anticipate — a user who’s within rate limits, within entitlement, within per-call caps, but somehow accumulating high spend through volume of small calls.

Each layer should fail closed. If the entitlement service is down, requests are rejected, not allowed-by-default. If spend monitoring can’t read the latest counters, the user is throttled, not unconstrained. The asymmetry of cost-control failures — wrongly allowing is expensive, wrongly denying is annoying — should drive every decision in this stack.
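
Fail-closed turns into a small guard around every check. A minimal sketch, with illustrative names (Decision, the client object) that aren't from the SPEC:

    from dataclasses import dataclass

    @dataclass
    class Decision:
        allowed: bool
        reason: str = ""

    def checked_entitlement(client, user_id: str, operation: str):
        """Deny-on-error wrapper: an exception can never become an allow."""
        try:
            # client.check_and_consume is the entitlement API sketched below.
            return client.check_and_consume(user_id, operation=operation)
        except Exception as exc:
            # Redis down, timeout, bug in the check path: all of it fails closed.
            return Decision(allowed=False, reason=f"entitlement_unavailable: {exc}")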

The entitlement service in detail.

This is the most important piece. The SPEC mentions it but doesn’t go deep on the shape, so let me sketch what it actually looks like in practice.

The entitlement service is a small, dedicated service (or a well-isolated module if you prefer not to split) that owns the answer to “what is this user allowed to do, right now.” It exposes a synchronous API that every LLM call path consults before executing.

The data model is simple but the schema matters. Every user has an entitlement record that captures the following (a sketch in code follows the list):

  • Tier (free, pro_trial, pro, churned).
  • Trial state (start_date, end_date, card_on_file, charge_attempted, charge_succeeded). Becomes critical for V1.5.
  • Quota counters: chat_messages_this_month, sonnet_tokens_this_day, haiku_tokens_this_day, agent_actions_this_hour (V1.5+), site_analyses_this_week.
  • Quota reset timestamps (when does each counter reset).
  • Hard caps per quota.
  • Soft warning thresholds.
  • Account flags: is_throttled, is_suspended, abuse_flag_reason.
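
A minimal Python sketch of that record, with assumed field names. The live quota counters are deliberately absent: they live in Redis, per the next paragraph, and only their caps, thresholds, and reset times sit here.

    from dataclasses import dataclass, field
    from datetime import datetime

    @dataclass
    class Entitlement:
        user_id: str
        tier: str                           # "free" | "pro_trial" | "pro" | "churned"
        # Trial state: unused until V1.5, but in the schema from day one
        trial_start: datetime | None = None
        trial_end: datetime | None = None
        card_on_file: bool = False
        charge_attempted: bool = False
        charge_succeeded: bool = False
        # Hard caps and soft warning thresholds, keyed by counter name,
        # e.g. "chat_messages_this_month" or "sonnet_tokens_this_day"
        hard_caps: dict[str, int] = field(default_factory=dict)
        soft_thresholds: dict[str, int] = field(default_factory=dict)
        # When each counter next resets, keyed the same way
        resets_at: dict[str, datetime] = field(default_factory=dict)
        # Account flags
        is_throttled: bool = False
        is_suspended: bool = False
        abuse_flag_reason: str | None = None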

The quota counters live in a fast store — Redis or Memorystore — because every chat message reads and writes them and Postgres latency would dominate the request. Postgres is the source of truth for tier and state; Redis is the source of truth for “how many messages has this user sent this month.” Both are kept consistent through the standard pattern: Redis is updated atomically on the hot path, with periodic reconciliation against Postgres for long-term durability.

The entitlement check is a single function call from the chat handler:

entitlement.check_and_consume(user_id, operation="chat_message", model="haiku")

It returns either Allowed(quota_remaining=4) or Denied(reason="monthly_cap_reached", upgrade_path=...). The function increments the counter atomically as part of the check — no separate “check then consume” race condition. If the call to the LLM fails or the request is rolled back, the counter is decremented in a finally block, but only if the LLM call demonstrably didn’t happen. Better to slightly over-count than to under-count and bleed quota.
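
A sketch of what that function looks like against redis-py. The key scheme, the caps, and the Allowed/Denied shapes are illustrative, not the SPEC's:

    import datetime
    from dataclasses import dataclass
    import redis

    r = redis.Redis()
    CAPS = {("chat_message", "free"): 50, ("chat_message", "pro"): 1000}  # assumed

    @dataclass
    class Allowed:
        quota_remaining: int

    @dataclass
    class Denied:
        reason: str

    def check_and_consume(user_id: str, tier: str, operation: str):
        cap = CAPS[(operation, tier)]
        month = datetime.date.today().strftime("%Y-%m")
        key = f"quota:{operation}:{user_id}:{month}"
        used = r.incr(key)  # atomic: check and consume are one step, no race
        if used == 1:
            r.expire(key, 40 * 24 * 3600)  # safety TTL past the monthly reset
        if used > cap:
            r.decr(key)  # hand the unit back; nothing was consumed
            return Denied(reason="monthly_cap_reached")
        return Allowed(quota_remaining=cap - used)

    def refund(user_id: str, operation: str):
        # Call only when the LLM call demonstrably never fired.
        month = datetime.date.today().strftime("%Y-%m")
        r.decr(f"quota:{operation}:{user_id}:{month}")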

The model routing is enforced at this layer too. A free user calling the chat endpoint passes through entitlement, which returns the allowed model (Haiku). The chat handler passes that model through to the LLM call. There is no path in the codebase where a user can specify their own model. There is no user-facing setting for “preferred model.” The model is determined by tier, period. This is what the SPEC means by “server-side enforcement of model routing” and getting it wrong even once is a brand-defining bug — a free user accidentally getting Sonnet costs you 30x and the business model assumption collapses.
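
The routing itself can be a pure function of tier, which is the whole point: nothing user-supplied reaches the lookup. Model identifiers here are placeholders, not exact API model names.

    # Server-side model routing. This mapping is the only place a model name
    # appears; no request parameter can reach or override it.
    MODEL_BY_TIER = {
        "free": "haiku",
        "pro_trial": "sonnet",
        "pro": "sonnet",
    }

    def model_for(tier: str) -> str:
        return MODEL_BY_TIER[tier]  # KeyError on an unknown tier fails closed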

The token budget per call.

Every LLM call in your application has a max_tokens parameter. Set it explicitly, never use the model default. The defaults are designed for API safety, not for your unit economics.

A reasonable starting matrix:

  • Free chat (Haiku): max_tokens=1000 on output. Inputs are bounded by the chat context, which is itself trimmed before the call.
  • Pro chat (Sonnet): max_tokens=2000 on output. Higher because Pro users expect deeper answers.
  • Weekly digest (Haiku for Free, Sonnet for Pro): max_tokens=1500.
  • Insight cards (Sonnet): max_tokens=500 per card. Multiple cards per dashboard.
  • Site analysis (Sonnet, but cached weekly): max_tokens=4000. Worth more because it runs weekly per customer, not per request.
  • Cross-pollination drafting (Sonnet): max_tokens=1500 per candidate.

These are starting points; tune based on real data. The discipline is that every LLM call is wrapped in a function that requires max_tokens to be set explicitly — a code-review checklist item, ideally a lint rule. Don’t let “I forgot to set the cap” be a possible failure mode.
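
A sketch of that wrapper against the Anthropic Messages API, where max_tokens is already a required parameter; the wrapper's job is making the cap impossible to forget at your own call sites. Adapt the client setup to your stack.

    import anthropic

    _client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

    def llm_call(*, model: str, max_tokens: int, messages: list[dict]):
        # Keyword-only, no default: forgetting the cap is a TypeError at the
        # call site, not a silent fall-through to some permissive default.
        if max_tokens <= 0:
            raise ValueError("max_tokens must be an explicit positive cap")
        return _client.messages.create(
            model=model, max_tokens=max_tokens, messages=messages
        )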

Input tokens are trickier because they accumulate from retrieval. The pattern that works: the retrieval layer returns a pre-budgeted set of context. If the global brain returns 5 candidates at 500 tokens each and the per-customer brain returns 5 at 500 tokens each, that’s 5000 tokens of context, plus the system prompt (1000), plus the chat history (variable, capped at 4000), plus the user’s new message. Total context budget: 12,000 input tokens, hard-capped. If the retrieval would return more, it’s truncated by relevance score before the call. Setting an input-token ceiling per call is what prevents the “we accidentally included the whole corpus” failure mode.
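
A sketch of the truncation step, assuming retrieval hands back (relevance_score, text, token_count) tuples and that token counting happens upstream (approximate counting is fine for budgeting):

    def budget_context(chunks: list[tuple[float, str, int]], ceiling: int) -> list[str]:
        """Keep the highest-relevance chunks that fit under `ceiling` tokens."""
        kept, spent = [], 0
        for score, text, tokens in sorted(chunks, key=lambda c: c[0], reverse=True):
            if spent + tokens > ceiling:
                continue  # skip what doesn't fit; a smaller chunk may still fit
            kept.append(text)
            spent += tokens
        return kept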

The per-customer spend dashboard.

The SPEC requires this and Cindy’s JD makes her responsible for it. The shape:

A single BigQuery view that aggregates LLM events from the structured event stream (the event record that feeds it is sketched after the list):

  • Per user, per day: tokens by model, USD spend, operation breakdown.
  • Per user, lifetime: cumulative spend.
  • Per cohort (signup month): average spend per user over time.
  • Per tier: total spend, average spend per user, P50/P90/P99 spend per user.
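
A sketch of the event record that feeds this view. The per-token rates are placeholders to show the USD derivation; real rates belong in config, updated when pricing changes.

    import datetime
    import json
    from dataclasses import asdict, dataclass

    # Placeholder USD-per-1K-token rates, keyed by model and direction.
    USD_PER_1K = {
        "haiku": {"in": 0.00025, "out": 0.00125},
        "sonnet": {"in": 0.003, "out": 0.015},
    }

    @dataclass
    class LLMEvent:
        user_id: str
        tier: str
        operation: str
        model: str
        input_tokens: int
        output_tokens: int
        usd: float
        ts: str

    def emit_llm_event(user_id, tier, operation, model, input_tokens, output_tokens):
        rate = USD_PER_1K[model]
        usd = (input_tokens * rate["in"] + output_tokens * rate["out"]) / 1000
        event = LLMEvent(user_id, tier, operation, model, input_tokens,
                         output_tokens, round(usd, 6),
                         datetime.datetime.now(datetime.timezone.utc).isoformat())
        print(json.dumps(asdict(event)))  # stand-in for the structured event sink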

Cindy’s dashboard slices this several ways: cost-per-Free-user (rolling average), top spenders this week, spend variance from baseline. The dashboard refreshes hourly, not real-time — real-time isn’t needed for this audience, and hourly is cheap.

There’s a separate engineering view that’s closer to real-time: spend over the last hour, anomaly detection on individual users, spend velocity (USD per minute) for the system as a whole. This is what fires alerts when something’s going wrong economically.

The abuse prevention pattern.

The Free tier is the abuse target. Bots signing up, scraping responses, racking up cost. The mitigations stack in layers:

Signup friction calibrated to risk. The SPEC says “no friction beyond consent” for Free signup. That’s right for the experience but wrong for abuse defense without other guards. The compromise: email verification required, basic captcha on signup, MP license verification preferred (which the SPEC’s MP-license-based auth already provides). If the user comes through MP, they’re a verified MP customer and abuse risk is low. If they come through standalone signup at memberintel.com, they get more friction.

Anomaly detection on first-week behavior. New accounts that immediately consume their entire monthly quota in the first 24 hours are suspicious. New accounts that hit the chat cap on day one and immediately try to upgrade and fail are suspicious. Pattern-match these on a sliding window and auto-throttle pending review. The throttle is reversible — if a real user gets caught, support unblocks them — but the default is conservative.

IP-based limits at the edge. Cloud Armor rules: more than N signups per IP per day blocks the IP. More than N requests per minute throttles. These catch the obvious automated abuse.

Hard global cost caps. The system as a whole has a daily spend ceiling. If LLM spend across all users exceeds the ceiling in a day, new requests are throttled or rejected. This is the circuit breaker. Set it well above expected normal usage but low enough that a runaway loop or attack triggers it before doing serious damage. The SPEC’s projection of $1.10/Free user × 50K users = $55K/month implies roughly $1800/day at full free-tier scale. A circuit breaker at $5000/day catches catastrophic anomalies without ever firing on legitimate traffic.
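
A sketch of the breaker, assuming redis-py, with spend tracked in integer microdollars so increments stay atomic and float-free:

    import datetime
    import redis

    r = redis.Redis()
    DAILY_CEILING_USD = 5000.0  # config, not code: raisable without a deploy

    def record_spend_and_check(usd: float) -> bool:
        """Add one call's cost to today's total; False means trip the breaker."""
        key = "spend:global:" + datetime.date.today().isoformat()
        total = r.incrby(key, int(usd * 1_000_000))  # microdollars, atomic
        r.expire(key, 2 * 24 * 3600)                 # let yesterday's key age out
        return total / 1_000_000 <= DAILY_CEILING_USD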

The economics check that should run weekly.

A scheduled report that Cindy and Seth see every Monday morning:

  • Free-tier average cost-per-user, this week vs last week vs the SPEC target ($1.10).
  • Pro-tier average cost-per-user, this week vs last week vs the SPEC target ($6-12).
  • Top 10 Free spenders by week — anyone significantly above the average is either a real heavy user (worth understanding), a misconfigured account (worth fixing), or an abuser (worth blocking).
  • Cost-per-cohort by signup month — does the curve trend up over time as the per-customer brain grows and more retrieval happens, or does it stabilize?
  • Conversion rate per cohort, as the second axis — the question Cindy actually cares about is “what are the unit economics of the Free tier,” and that requires both cost and conversion side by side.

This report is the economic flywheel of the product. If cost-per-Free-user trends above $1.10 sustainably, either the conversion rate has to increase to compensate, or the Free tier features tighten, or the cost gets engineered down. Catching that drift early is what keeps the model viable.

The V1.5 wrinkle: agent actions are a different cost shape.

When V1.5 ships, agent actions add a new cost dimension. Each action involves an LLM call to propose, possibly several to refine, plus the actual MP MCP execution. The eval suite runs on every release and that’s a meaningful per-deploy cost. The trial period (14-day Pro trial with card upfront) means users on day 1 of trial behave like Pro users in cost terms but haven’t paid yet — the unit economics of the trial depend on conversion at day 14.

The entitlement service needs to be extended for V1.5 to track agent action quotas (the SPEC mentions 20 actions per hour as a starting cap), trial state machine progression, and to handle the trial-to-paid or trial-to-free transitions cleanly. Worth designing the entitlement schema in V1 with these extension points in mind so V1.5 doesn’t require a refactor — specifically, the trial state fields and the operation-specific quota counters should be present even if not yet used.

The thing most teams underbuild: the kill switch.

You should have a single feature flag that says “disable all LLM calls system-wide.” It exists for two scenarios: (1) a critical bug is causing runaway spend and you need to stop the bleeding before fixing the cause, or (2) Anthropic has an outage and you want to fail gracefully rather than retry-storm them.

The kill switch is a flag in your config layer that every LLM call checks before firing. When tripped, the application returns a degraded response: “The AI advisor is temporarily unavailable. Your dashboard and data are still working.” It’s tripped via a deploy-config change or a manual flag flip in your config service — not via code change, because in an incident you don’t want to be waiting for CI.
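
A sketch of the pattern, with an environment variable standing in for the config-service flag; the guard wraps every LLM call path the same way the entitlement check does.

    import os

    DEGRADED_MESSAGE = ("The AI advisor is temporarily unavailable. "
                        "Your dashboard and data are still working.")

    def llm_enabled() -> bool:
        # In practice this reads your config service, cached for a few seconds.
        return os.environ.get("LLM_KILL_SWITCH", "off") != "on"

    def guarded_llm_call(call, *args, **kwargs):
        if not llm_enabled():
            return {"degraded": True, "message": DEGRADED_MESSAGE}
        return call(*args, **kwargs)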

There are smaller versions of this: a kill switch for cross-pollination (already implied by the earlier conversation), a kill switch for the agent (V1.5+), a kill switch for the eval suite (so it doesn’t compete for tokens during an incident). Build the pattern once and apply it everywhere LLM calls happen.

The decisions to make to move forward.

  1. Entitlement service as its own module with strict boundaries — every LLM call path goes through it, no exceptions, enforced by code review and ideally by a linter rule. Recommendation: yes.

  2. Redis (or Memorystore) for hot quota counters, Postgres for durable state, periodic reconciliation. Recommendation: yes — the latency math doesn’t work otherwise.

  3. max_tokens set explicitly on every LLM call, with defaults derived from the operation type, never from the model. Recommendation: yes, enforced as a wrapper function that requires the parameter.

  4. Per-customer cost dashboard live by GA, not as a post-launch addition. Recommendation: yes — Cindy needs this from day one to do her job and the SPEC requires it.

  5. Daily global cost circuit breaker, set at roughly 3-5x projected normal usage. Recommendation: $3-5K/day for V1, scale with growth.

  6. System-wide LLM kill switch, tested in staging before GA. Recommendation: yes — and test it. The first time you trip it shouldn’t be during an incident.

  7. The entitlement schema designed with V1.5 trial state fields present in V1 so V1.5 is additive, not a refactor. Recommendation: yes.

The natural next threads from here: the auth and identity layer (MP license OAuth flow, standalone fallback, MFA, session lifecycle — touches the SPEC’s Open Q4 about Stripe Connect vs customer OAuth), secrets management at depth (the OAuth refresh token storage problem, customer Stripe credentials, key rotation), or the data sync pipeline architecture (the MP and Stripe pipelines have their own reliability and back-pressure problems that are worth working through).

For: Seth Shoultes, Cindy Thoennessen, Blair Williams