MemberIntel KB · reference

V1 Cost Discipline Review

AI Engineer persona's feasibility review of V1 unit economics: the architecture as specified lands at ~$1.07/Free and ~$10.75/Pro against the $1.10 / $6–12 targets, provided input-token ceilings are pinned, model routing is typed from the entitlement service, retrieval is pre-budgeted, and site analysis stays weekly-cached. The scariest finding is that the Free tier breaks even only at the SPEC's floor conversion target; that is a Blair conversation about circuit-breaker authority, not an architecture fix.

Date: 2026-05-08
Persona: ai-engineer (Lead Architect / AI Engineer, Seth Shoultes’ role)
Source persona file: .claude/agents/ai-engineer.md (KB repo)
Review focus: SPEC §5.5 (cost controls) + architecture deep-dive against V1 unit-economics targets ($1.10/Free, $6–12/Pro at $29 list)
Sources read: v1-spec, seth-jd, arch-overview, arch-llm-cost, arch-evals

This is the inaugural artifact from the persona-based subagent system. It was produced as a smoke test of the ai-engineer agent definition, but the substance is real and load-bearing for O2 (V1 product implementation kickoff).


1. What’s sound

The four-layer cost stack in arch-llm-cost.md (rate limit → entitlement check-and-consume → explicit max_tokens per call → continuous spend monitoring + global circuit breaker) is the right shape, and pinning current_tenant_id via RLS middleware in arch-overview.md keeps tenant isolation an invariant rather than a discipline. The SPEC §8.4 commitment to Anthropic-direct, no LangChain, no fine-tuning is correct and I want to keep it that way.

2. What I’d change before we build it

  • Tighten the Free chat cap from “5 messages/month” to “5 exchanges/month with a hard input-token ceiling per turn.” SPEC §5.1 specifies 5 messages, but arch-llm-cost.md already budgets 12K input tokens per call and 1K Haiku output. Five turns × ~13K combined tokens at Haiku 4.5 pricing is fine; five turns where retrieval bloats input to 30K+ is not. Constraint: cost-per-Free ≤ $1.10.
  • Move “Haiku-only for Free” from a tier-routing rule to a typed model handle that the entitlement service mints. The handler should never receive a model string; it receives a token issued by entitlement.check_and_consume(...) that names the resolved model. This closes the “admin tool, debug surface, internal job” hole the JD red-line names. Constraint: never accidentally route a Free user to Sonnet.
  • Make max_tokens enforcement a wrapper, not a convention. arch-llm-cost.md already says this; I want it as a single llm.call(operation, ...) function where operation is an enum and the per-op caps live in one table. No raw Anthropic SDK calls allowed except inside that wrapper — CI lint rule. Constraint: cost discipline.
  • Pre-budget retrieval, don’t post-truncate. search_global_brain and search_customer_brain should each return a token-budgeted page, not a top-K that the caller may forget to trim. Constraint: cost + citation discipline (truncating mid-chunk corrupts citations).
  • Cross-pollination job runs as the migration role and is the only code that touches multi-tenant data. Already implied in arch-overview.md; I want it written as an ADR with the anonymization step explicit, because Risk #10 (cross-pollination privacy slip) is the single bug that blows up the moat narrative.
  • Eval gate before any prompt or retrieval change ships. Per arch-evals.md the tier-routing-safety category is 100%-pass blocking — I want that wired into CI before Phase 1 ends, not Phase 2. JD Risk #1 mitigation is the eval suite; we don’t get to ship without it.
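The typed-handle and single-wrapper recommendations above can be sketched as follows. This is illustrative only: `Operation`, `OP_CAPS`, `check_and_consume`, and `llm_call` are hypothetical names, not the real V1 entitlement API, and the ~4-chars-per-token estimate is a placeholder for a real tokenizer.

```python
from dataclasses import dataclass
from enum import Enum

class Operation(Enum):
    FREE_CHAT = "free_chat"
    PRO_CHAT = "pro_chat"
    SITE_ANALYSIS = "site_analysis"

# Per-operation caps live in ONE table, per the wrapper bullet above.
OP_CAPS = {
    Operation.FREE_CHAT:     {"model": "haiku",  "max_input": 12_000, "max_output": 1_000},
    Operation.PRO_CHAT:      {"model": "sonnet", "max_input": 12_000, "max_output": 2_000},
    Operation.SITE_ANALYSIS: {"model": "sonnet", "max_input": 8_000,  "max_output": 4_000},
}

@dataclass(frozen=True)
class ModelHandle:
    """Minted only by the entitlement service; handlers never see a raw model string."""
    operation: Operation
    model: str
    max_input: int
    max_output: int

def check_and_consume(tenant_id: str, operation: Operation) -> ModelHandle:
    # A real version would atomically decrement the tenant's quota and
    # raise when exhausted; that bookkeeping is elided here.
    caps = OP_CAPS[operation]
    return ModelHandle(operation, caps["model"], caps["max_input"], caps["max_output"])

def llm_call(handle: ModelHandle, prompt: str) -> str:
    # The only place the Anthropic SDK would be invoked (CI lint enforces this).
    if len(prompt) // 4 > handle.max_input:  # crude ~4 chars/token estimate
        raise ValueError("input exceeds per-operation ceiling")
    return f"[{handle.model} reply, max_tokens={handle.max_output}]"
```

Because the handler receives a `ModelHandle` rather than a string, there is no code path where a Free-tier request can name Sonnet: a Free user can only ever hold the handle the entitlement service minted.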

3. What’s a material choice (Blair sign-off + ADR)

Five ADRs (0001–0005) were slated for drafting this sprint; all are now written and committed in the V1 product repo.

4. Open questions for the SPEC

  • Q3 (token budgets) is still TBD in §5.5. The numbers in arch-llm-cost.md (Free chat 1K out, Pro chat 2K out, etc.) are architecture-deep-dive numbers, not SPEC numbers. That’s drift. Fold them back into §5.5 as the V1 starting matrix.
  • Q7 (MP API surface) blocks the sync pipeline cost model. Live Pro sync at “daily refresh” (§5.2) costs different things if MP exposes a delta endpoint vs. forces full re-pulls. Need Paul’s commitment before Phase 1 freeze.
  • Q8 (hosting) — recommend marking this resolved as GCP per the brainstorm in arch-overview.md; it is currently still listed as open in §14.
  • Q9 (vector store) — recommend marking this resolved as pgvector, with the namespace/Pinecone-fallback note.
  • Q11 (telemetry/observability) — open and load-bearing. The weekly cost-per-cohort report arch-llm-cost.md requires is the observability stack for the Product Lead. Need a pick (Cloud Logging + BigQuery view, or LangSmith-style LLM tracer, or both) before Phase 2.
  • New Q to add: what’s the Free-tier input-token cap per chat turn? SPEC §5.5 names output budgets; input is where the runaway lives.
  • New Q to add: does cross-pollination spend count against any user, or is it global P&L? §5.4 buries it at “$0.10/mo amortized” but doesn’t say where the line item lives in monitoring.

5. Cost back-of-envelope — does the architecture hit $1.10 Free / $6–12 Pro?

Assumptions (Anthropic public pricing, Apr 2026 — flag if these are stale):

  • Haiku 4.5: ~$1/M input, ~$5/M output
  • Sonnet 4.6: ~$3/M input, ~$15/M output
  • Free chat budget per turn: 12K input + 1K output (per arch-llm-cost.md)
  • Pro chat budget per turn: 12K input + 2K output

Free user ($1.10 target)

| Component | Math | Cost |
|---|---|---|
| 5 Haiku chat turns | 5 × (12K × $1/M + 1K × $5/M) = 5 × ($0.012 + $0.005) | $0.085 |
| Weekly digest (Haiku, ~1.5K out, ~6K in, ×4) | 4 × (6K × $1/M + 1.5K × $5/M) = 4 × ($0.006 + $0.0075) | $0.054 |
| Site analysis (Sonnet, weekly cached, ~8K in + 4K out, ×4) | 4 × (8K × $3/M + 4K × $15/M) = 4 × ($0.024 + $0.060) | $0.336 |
| Cross-pollination amortized | per §5.4 | $0.10 |
| Sync infra + storage + auth | per §5.4 | $0.50 |
| **Total** | | **~$1.07** |

That lands under $1.10 — but only if (a) site analysis stays Sonnet-cached weekly and never triggers ad-hoc on a Free user, (b) chat input stays bounded at 12K (not 30K because we forgot to pre-budget retrieval), and (c) infra cost actually holds at $0.50. SPEC §5.4 calls $0.50 “data sync infra + storage + auth” — that’s optimistic for GCP at 50K free users; Cloud SQL + pgvector + Cloud Run + Secret Manager + audit logging will more realistically land $0.70–$0.90 amortized per Free user. Aggressive caching alone does not get us to $1.10. Tight retrieval budgeting + a hard input-token ceiling per turn is what gets us there. I want the input ceiling in the SPEC, not just the architecture deep-dive.
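A minimal sketch of "pre-budget, don't post-truncate": the retrieval tool returns whole chunks until a token budget is spent, never a partial chunk. `budget_page` and the chunk shape (each chunk carrying a precomputed token count) are assumptions for illustration, not the real `search_global_brain` contract.

```python
def budget_page(chunks, token_budget):
    """Return whole chunks until the budget is spent. Never truncate
    mid-chunk: a partial chunk corrupts its citation."""
    page, used = [], 0
    for chunk in chunks:
        if used + chunk["tokens"] > token_budget:
            break  # the next chunk would overflow; stop at a chunk boundary
        page.append(chunk)
        used += chunk["tokens"]
    return page

# A top-K result of 20 chunks (~18K tokens) gets pre-budgeted to 8K:
chunks = [{"id": i, "tokens": 900} for i in range(20)]
page = budget_page(chunks, token_budget=8_000)
# 8 whole chunks fit (8 × 900 = 7,200 ≤ 8,000); the 9th would overflow.
```

The caller never sees the unbudgeted top-K, so "we forgot to trim" stops being a possible failure mode.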

Pro user ($6–12 target)

| Component | Math | Cost |
|---|---|---|
| 50 Sonnet chat turns/mo | 50 × (12K × $3/M + 2K × $15/M) = 50 × ($0.036 + $0.030) | $3.30 |
| Daily site-analysis refresh + on-demand (Sonnet) | ~30 × (8K × $3/M + 4K × $15/M) = 30 × $0.084 | $2.52 |
| Sonnet digest + ~5 insight cards/day × 30 (500 out each) | 150 × (4K × $3/M + 0.5K × $15/M) = 150 × ($0.012 + $0.0075) | $2.93 |
| Sync infra + storage + brain ops + support | per §5.4 | $2.00 |
| **Total** | | **~$10.75** |

Lands inside the $6–12 band, but the daily site-analysis refresh is the swing factor. SPEC §5.2 says “live, daily refresh” of MP/Stripe — fine — and weekly cached + on-demand for site analysis. If we accidentally let “daily refresh” leak into site analysis (i.e., re-fetch + Sonnet-extract every day) we add ~$2/Pro/mo and start crowding $12. Site analysis is weekly + on-demand only. Lock that in code, not in convention.
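Both tables above can be reproduced with a few lines of arithmetic, using the Apr 2026 pricing assumptions (flag if stale):

```python
# $/token for each model, per the stated Apr 2026 assumptions.
HAIKU_IN, HAIKU_OUT = 1 / 1e6, 5 / 1e6
SONNET_IN, SONNET_OUT = 3 / 1e6, 15 / 1e6

def call_cost(in_tokens, out_tokens, in_rate, out_rate):
    return in_tokens * in_rate + out_tokens * out_rate

free = (
    5 * call_cost(12_000, 1_000, HAIKU_IN, HAIKU_OUT)      # chat turns
    + 4 * call_cost(6_000, 1_500, HAIKU_IN, HAIKU_OUT)     # weekly digest
    + 4 * call_cost(8_000, 4_000, SONNET_IN, SONNET_OUT)   # weekly cached site analysis
    + 0.10 + 0.50                                          # cross-pollination + infra (§5.4)
)
pro = (
    50 * call_cost(12_000, 2_000, SONNET_IN, SONNET_OUT)   # chat turns
    + 30 * call_cost(8_000, 4_000, SONNET_IN, SONNET_OUT)  # site-analysis refresh
    + 150 * call_cost(4_000, 500, SONNET_IN, SONNET_OUT)   # digest + insight cards
    + 2.00                                                 # infra + support (§5.4)
)
print(free, pro)  # totals land at ~$1.07 Free and ~$10.75 Pro
```

Keeping the model as a function of the input/output budgets makes the sensitivity obvious: doubling Free chat input to 24K adds $0.06/user, while letting site analysis go daily on Pro adds roughly $2/user.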

The number that actually matters

50K Free × $1.07 = $53.5K/mo Free-tier burn. SPEC §14 Risk #2 names this. Covering math: 5% of 50K = 2,500 Pro × $29 × ~70% margin after Stripe ≈ $50K/mo gross. The model breaks even at the SPEC’s floor conversion target. If we hit 8% conversion the model works; at 3% it does not, and the Free tier becomes a subsidy.

That’s not an architecture-fixable problem — that’s a Blair conversation about the conversion-rate floor below which we tighten Free entitlements (Risk #2 mitigation already names “circuit-breaker authority”). Build the entitlement layer with the dials exposed (chat cap, site-analysis cadence, digest model) so the Product Lead can turn them without a deploy when the weekly cost-per-cohort review demands it.
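What "dials exposed without a deploy" could look like: entitlement values read from a config store (a DB row or feature-flag service) at request time, overlaid on code defaults. All names and keys here are hypothetical sketches, not the real entitlement schema.

```python
# Code ships defaults; the Product Lead edits overrides in a config
# store, and the next request picks them up with no deploy.
DEFAULT_DIALS = {
    "free_chat_exchanges_per_month": 5,
    "free_site_analysis_cadence_days": 7,
    "free_digest_model": "haiku",
}

def load_dials(store: dict) -> dict:
    """Overlay operator overrides on defaults; unknown keys are ignored."""
    return {k: store.get(k, v) for k, v in DEFAULT_DIALS.items()}

# Circuit-breaker scenario: conversion dips below the floor, so the
# Product Lead tightens Free chat from 5 exchanges to 3.
dials = load_dials({"free_chat_exchanges_per_month": 3})
```

The entitlement service reads `dials` on every `check_and_consume`, so tightening Free is a config edit under the weekly cost-per-cohort review, not a release.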


Bottom line: architecture as specified can hit the cost targets if we (1) pin input-token ceilings, not just output, (2) make model routing a typed handle from the entitlement service, not a string, (3) pre-budget retrieval at the tool layer, and (4) keep site analysis on a weekly cache for both tiers. The Free-tier P&L is real but it’s a conversion-rate question more than an architecture question — expose the entitlement dials for the Product Lead’s circuit-breaker authority. ADRs 0001–0005 this sprint.

For: Seth Shoultes · Blair Williams · Product Lead · AI Engineer