V1 Cost Discipline Review
AI Engineer persona's feasibility review of V1 unit economics: the architecture as specified can hit the $1.07 Free / $10.75 Pro targets if input-token ceilings are pinned, model routing is typed from the entitlement service, retrieval is pre-budgeted, and site analysis stays weekly-cached. The scariest finding is that the Free tier only breaks even at the SPEC's floor conversion target — a Blair conversation about circuit-breaker authority, not an architecture fix.
Date: 2026-05-08
Persona: ai-engineer (Lead Architect / AI Engineer — Seth Shoultes’ role)
Source persona file: .claude/agents/ai-engineer.md (KB repo)
Review focus: SPEC §5.5 (cost controls) + architecture deep-dive against V1 unit-economics targets ($1.10/Free, $6–12/Pro at $29 list)
Sources read: v1-spec, seth-jd, arch-overview, arch-llm-cost, arch-evals
This is the inaugural artifact from the persona-based subagent system. It was produced as a smoke test of the ai-engineer agent definition, but the substance is real and load-bearing for O2 (V1 product implementation kickoff).
1. What’s sound
The four-layer cost stack in arch-llm-cost.md (rate limit → entitlement check-and-consume → explicit max_tokens per call → continuous spend monitoring + global circuit breaker) is the right shape, and pinning current_tenant_id via RLS middleware in arch-overview.md keeps tenant isolation an invariant rather than a discipline. The SPEC §8.4 commitment to Anthropic-direct, no LangChain, no fine-tuning is correct and I want to keep it that way.
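To make the RLS invariant concrete, here is a minimal sketch of per-request tenant pinning. The setting name `app.current_tenant_id` and the psycopg-style connection API are my assumptions for illustration, not the repo's actual middleware:

```python
# Sketch: pin the tenant once per request so RLS policies see it as an
# invariant. Assumes a Postgres policy that reads
# current_setting('app.current_tenant_id'); names are illustrative.
from contextlib import contextmanager

@contextmanager
def tenant_scope(conn, tenant_id: str):
    """Pin the tenant for every query issued inside this scope."""
    with conn.cursor() as cur:
        # set_config(..., is_local=true) scopes the value to the current
        # transaction, so a pooled connection can't leak a stale tenant.
        cur.execute(
            "SELECT set_config('app.current_tenant_id', %s, true)",
            (tenant_id,),
        )
    yield conn
```

The point of the scope shape is that handler code never sets the tenant itself; it runs inside a scope the middleware opened, which is what turns isolation from a discipline into an invariant.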
2. What I’d change before we build it
- Tighten the Free chat cap from “5 messages/month” to “5 exchanges/month with a hard input-token ceiling per turn.” SPEC §5.1 specifies 5 messages, but `arch-llm-cost.md` already budgets 12K input tokens per call and 1K Haiku output. Five turns × ~13K combined tokens at Haiku 4.5 pricing is fine; five turns where retrieval bloats input to 30K+ is not. Constraint: cost-per-Free ≤ $1.10.
- Move “Haiku-only for Free” from a tier-routing rule to a typed model handle that the entitlement service mints. The handler should never receive a model string; it receives a token issued by `entitlement.check_and_consume(...)` that names the resolved model. This closes the “admin tool, debug surface, internal job” hole the JD red-line names. Constraint: never accidentally route a Free user to Sonnet.
- Make `max_tokens` enforcement a wrapper, not a convention. `arch-llm-cost.md` already says this; I want it as a single `llm.call(operation, ...)` function where `operation` is an enum and the per-op caps live in one table. No raw Anthropic SDK calls allowed except inside that wrapper — CI lint rule. Constraint: cost discipline.
- Pre-budget retrieval, don’t post-truncate. `search_global_brain` and `search_customer_brain` should each return a token-budgeted page, not a top-K that the caller may forget to trim. Constraint: cost + citation discipline (truncating mid-chunk corrupts citations).
- Cross-pollination job runs as the migration role and is the only code that touches multi-tenant data. Already implied in `arch-overview.md`; I want it written as an ADR with the anonymization step explicit, because Risk #10 (cross-pollination privacy slip) is the single bug that blows up the moat narrative.
- Eval gate before any prompt or retrieval change ships. Per `arch-evals.md` the tier-routing-safety category is 100%-pass blocking — I want that wired into CI before Phase 1 ends, not Phase 2. JD Risk #1 mitigation is the eval suite; we don’t get to ship without it.
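The wrapper-plus-typed-handle shape described above, sketched in Python. Operation names, cap values, and the `ModelHandle` fields are illustrative; `client` stands in for the Anthropic SDK, whose real call signature differs:

```python
# Sketch of the single llm.call(operation, ...) seam: per-op caps live in
# one table, and the model arrives as a typed handle minted by the
# entitlement service, never as a raw string. All names are illustrative.
from dataclasses import dataclass
from enum import Enum

class Operation(Enum):
    FREE_CHAT = "free_chat"
    PRO_CHAT = "pro_chat"
    WEEKLY_DIGEST = "weekly_digest"
    SITE_ANALYSIS = "site_analysis"

# One table, one place: per-operation (input ceiling, output cap) in tokens.
TOKEN_CAPS: dict[Operation, tuple[int, int]] = {
    Operation.FREE_CHAT: (12_000, 1_000),
    Operation.PRO_CHAT: (12_000, 2_000),
    Operation.WEEKLY_DIGEST: (6_000, 1_500),
    Operation.SITE_ANALYSIS: (8_000, 4_000),
}

@dataclass(frozen=True)
class ModelHandle:
    """Minted by entitlement.check_and_consume; never a raw string."""
    model: str
    operation: Operation

def call(handle: ModelHandle, operation: Operation, prompt_tokens: int, client) -> dict:
    """The only allowed path to the SDK (CI lints raw calls elsewhere)."""
    if handle.operation is not operation:
        raise PermissionError("model handle was not minted for this operation")
    max_in, max_out = TOKEN_CAPS[operation]
    if prompt_tokens > max_in:
        raise ValueError(f"input {prompt_tokens} tokens exceeds {max_in} cap")
    # max_tokens is enforced here, not by caller convention.
    return client.create(model=handle.model, max_tokens=max_out)
```

Note that both ceilings live in the wrapper: the handle mismatch check closes the routing hole, and the input check is the “hard input-token ceiling per turn” the first bullet asks for.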
3. What’s a material choice (Blair sign-off + ADR)
Five ADRs to draft this sprint (all now written and committed in the V1 product repo):
- ADR-0001 — Entitlement service shape. Single module,
check_and_consumeatomic, Redis hot-path + Postgres durable, model handle issued by the service. Includes V1.5 trial-state fields now so V1.5 is additive. - ADR-0002 — Model routing single source of truth. Tier → operation → model is a server-side table; no client input, no admin override path, no env-flag bypass.
- ADR-0003 — pgvector + tenant_id partition — partition key invariant so a Pinecone migration is a connection-string swap, not a schema change. (Resolves SPEC Open Q9.)
- ADR-0004 — GCP + Cloud SQL + Cloud Run + Secret Manager per the conversation already captured in
arch-overview.md. (Resolves SPEC Open Q8.) Privacy counsel will ask; ADR is the artifact. - ADR-0005 — Anthropic dependency + mitigation. Direct API in V1, no LangChain, but the
llm.call(operation, ...)wrapper is the seam we’d use for portability. Surfacing this for Blair so it’s an explicit accepted risk, not a hand-wave (JD pushback trigger #10).
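A sketch of the ADR-0001 shape, with an in-memory counter standing in for the Redis hot path. Tier names, quotas, and model strings are illustrative placeholders, not the real routing table:

```python
# Sketch of check_and_consume: decrement quota and mint the model handle
# in one step, so a handler can never obtain a model without consuming
# entitlement. In-memory dict stands in for Redis INCR; a real
# implementation would also write durably to Postgres.
from dataclasses import dataclass

TIER_QUOTA = {"free": 5, "pro": 50}                       # chat exchanges/month
TIER_MODEL = {"free": "haiku-4.5", "pro": "sonnet-4.6"}   # server-side table

@dataclass(frozen=True)
class Grant:
    model: str
    remaining: int

class EntitlementService:
    def __init__(self):
        self._used: dict[tuple[str, str], int] = {}  # (tenant, month) -> count

    def check_and_consume(self, tenant_id: str, tier: str, month: str) -> Grant:
        key = (tenant_id, month)
        used = self._used.get(key, 0)
        quota = TIER_QUOTA[tier]
        if used >= quota:
            raise PermissionError("quota exhausted for this period")
        self._used[key] = used + 1  # Redis INCR would make this step atomic
        # The handler receives this grant, never a raw model string.
        return Grant(model=TIER_MODEL[tier], remaining=quota - used - 1)
```

The design point ADR-0001 should capture: the grant is the only artifact the handler sees, so the “check” and the “route” cannot drift apart.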
4. Open questions for the SPEC
- Q3 (token budgets) is still TBD in §5.5. The numbers in `arch-llm-cost.md` (Free chat 1K out, Pro chat 2K out, etc.) are architecture-deep-dive numbers, not SPEC numbers. That’s drift. Fold them back into §5.5 as the V1 starting matrix.
- Q7 (MP API surface) blocks the sync pipeline cost model. Live Pro sync at “daily refresh” (§5.2) costs different things if MP exposes a delta endpoint vs. forces full re-pulls. Need Paul’s commitment before Phase 1 freeze.
- Q8 (hosting) — recommend marking it resolved as GCP per the brainstorm in `arch-overview.md`. Currently still listed as open in §14.
- Q9 (vector store) — recommend marking it resolved as pgvector with the namespace/Pinecone-fallback note.
- Q11 (telemetry/observability) — open and load-bearing. The weekly cost-per-cohort report `arch-llm-cost.md` requires is the observability stack for the Product Lead. Need a pick (Cloud Logging + BigQuery view, or a LangSmith-style LLM tracer, or both) before Phase 2.
- New Q to add: what’s the Free-tier input-token cap per chat turn? SPEC §5.5 names output budgets; input is where the runaway lives.
- New Q to add: does cross-pollination spend count against any user, or is it global P&L? §5.4 buries it at “$0.10/mo amortized” but doesn’t say where the line item lives in monitoring.
5. Cost back-of-envelope — does the architecture hit $1.10 Free / $6–12 Pro?
Assumptions (Anthropic public pricing, Apr 2026 — flag if these are stale):
- Haiku 4.5: ~$1/M input, ~$5/M output
- Sonnet 4.6: ~$3/M input, ~$15/M output
- Free chat budget per turn: 12K input + 1K output (per `arch-llm-cost.md`)
- Pro chat budget per turn: 12K input + 2K output
Free user ($1.10 target)
| Component | Math | Cost |
|---|---|---|
| 5 Haiku chat turns | 5 × (12K × $1/M + 1K × $5/M) = 5 × ($0.012 + $0.005) | $0.085 |
| Weekly digest (Haiku, ~1.5K out, ~6K in × 4) | 4 × (6K × $1/M + 1.5K × $5/M) = 4 × ($0.006 + $0.0075) | $0.054 |
| Site analysis (Sonnet, weekly cached, ~8K in + 4K out × 4) | 4 × (8K × $3/M + 4K × $15/M) = 4 × ($0.024 + $0.060) | $0.336 |
| Cross-pollination amortized | per §5.4 | $0.10 |
| Sync infra + storage + auth | per §5.4 | $0.50 |
| Total | | ~$1.07 |
That lands under $1.10 — but only if (a) site analysis stays Sonnet-cached weekly and never triggers ad-hoc on a Free user, (b) chat input stays bounded at 12K (not 30K because we forgot to pre-budget retrieval), and (c) infra cost actually holds at $0.50. SPEC §5.4 calls $0.50 “data sync infra + storage + auth” — that’s optimistic for GCP at 50K free users; Cloud SQL + pgvector + Cloud Run + Secret Manager + audit logging will more realistically land $0.70–$0.90 amortized per Free user. Aggressive caching alone does not get us to $1.10. Tight retrieval budgeting + a hard input-token ceiling per turn is what gets us there. I want the input ceiling in the SPEC, not just the architecture deep-dive.
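The Free-tier table above, re-derived in code so the weekly budget review can rerun it when prices move. The per-million rates are this section's Apr 2026 assumptions, nothing more:

```python
# Re-derivation of the Free-tier cost table. Rates are the Apr 2026
# assumptions stated above (Haiku ~$1/M in, ~$5/M out; Sonnet ~$3/M in,
# ~$15/M out); the $0.10 and $0.50 line items come from SPEC §5.4.
HAIKU_IN, HAIKU_OUT = 1 / 1e6, 5 / 1e6
SONNET_IN, SONNET_OUT = 3 / 1e6, 15 / 1e6

chat = 5 * (12_000 * HAIKU_IN + 1_000 * HAIKU_OUT)    # 5 Haiku turns -> $0.085
digest = 4 * (6_000 * HAIKU_IN + 1_500 * HAIKU_OUT)   # 4 weekly digests -> $0.054
site = 4 * (8_000 * SONNET_IN + 4_000 * SONNET_OUT)   # 4 cached analyses -> $0.336
total = chat + digest + site + 0.10 + 0.50            # + §5.4 amortized items
# total = 1.075, the ~$1.07 in the table: under the $1.10 target, but with
# only ~$0.03 of headroom before infra drift eats it.
```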
Pro user ($6–12 target)
| Component | Math | Cost |
|---|---|---|
| 50 Sonnet chat turns/mo | 50 × (12K × $3/M + 2K × $15/M) = 50 × ($0.036 + $0.030) | $3.30 |
| Daily site analysis refresh + on-demand (Sonnet) | ~30 × (8K × $3/M + 4K × $15/M) = 30 × $0.084 | $2.52 |
| Sonnet digest + ~5 insight cards/day × 30 (500 out each) | 150 × (4K × $3/M + 0.5K × $15/M) = 150 × ($0.012 + $0.0075) | $2.93 |
| Sync infra + storage + brain ops + support | per §5.4 | $2.00 |
| Total | | ~$10.75 |
Lands inside the $6–12 band, but the daily site-analysis refresh is the swing factor. SPEC §5.2 says “live, daily refresh” of MP/Stripe — fine — and weekly cached + on-demand for site analysis. If we accidentally let “daily refresh” leak into site analysis (i.e., re-fetch + Sonnet-extract every day) we add ~$2/Pro/mo and start crowding $12. Site analysis is weekly + on-demand only. Lock that in code, not in convention.
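The ~$2/Pro/mo swing claimed above, derived from the budgeted 8K-in / 4K-out Sonnet site-analysis shape and this section's price assumptions:

```python
# The daily-refresh leak, quantified: per-run Sonnet site analysis at the
# budgeted 8K-in / 4K-out shape, weekly-cached vs. accidentally daily.
# Rates are this section's Apr 2026 assumptions.
per_run = 8_000 * 3 / 1e6 + 4_000 * 15 / 1e6  # $0.084 per analysis
weekly = 4 * per_run                           # ~$0.34/mo if weekly-cached
daily = 30 * per_run                           # ~$2.52/mo if refreshed daily
leak = daily - weekly                          # ~$2.18/mo: the "crowding $12" delta
```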
The number that actually matters
50K Free × $1.07 = $53.5K/mo Free-tier burn. SPEC §14 Risk #2 names this. Covering math: 5% of 50K = 2,500 Pro × $29 × ~70% margin after Stripe ≈ $50K/mo gross. The model breaks even at the SPEC’s floor conversion target. If we hit 8% conversion the model works; at 3% it does not, and the Free tier becomes a subsidy.
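The breakeven arithmetic above, parameterized by conversion rate so the cohort review can track where we sit. All figures are this section's own assumptions ($1.07/Free, $29 Pro list, ~70% margin after Stripe):

```python
# Free-tier P&L as a function of conversion rate, using this section's
# assumptions: $1.07 burn per Free user, $29 Pro list, ~70% net margin.
def monthly_p_and_l(free_users: int, conversion: float) -> float:
    free_burn = free_users * 1.07
    pro_gross = free_users * conversion * 29 * 0.70
    return pro_gross - free_burn

# At the SPEC's 5% floor the model roughly breaks even (slightly negative);
# 8% conversion works, 3% does not.
```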
That’s not an architecture-fixable problem — that’s a Blair conversation about the conversion-rate floor below which we tighten Free entitlements (Risk #2 mitigation already names “circuit-breaker authority”). Build the entitlement layer with the dials exposed (chat cap, site-analysis cadence, digest model) so the Product Lead can turn them without a deploy when the weekly cost-per-cohort review demands it.
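A sketch of what “dials exposed” could look like. Field names and defaults are illustrative, and the real store would be the entitlement service's config table, not code constants:

```python
# Sketch: runtime-tunable Free-tier entitlement dials the Product Lead can
# turn without a deploy. Names and defaults are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class FreeTierDials:
    chat_cap_per_month: int = 5
    site_analysis_cadence_days: int = 7     # weekly cache
    digest_model: str = "haiku-4.5"
    input_token_ceiling: int = 12_000

def tighten_for_low_conversion(d: FreeTierDials) -> FreeTierDials:
    # One example circuit-breaker move: halve the chat cap and slow the
    # site-analysis cadence; leave model routing and ceilings alone.
    return FreeTierDials(
        chat_cap_per_month=max(1, d.chat_cap_per_month // 2),
        site_analysis_cadence_days=14,
        digest_model=d.digest_model,
        input_token_ceiling=d.input_token_ceiling,
    )
```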
Bottom line: architecture as specified can hit the cost targets if we (1) pin input-token ceilings, not just output, (2) make model routing a typed handle from the entitlement service, not a string, (3) pre-budget retrieval at the tool layer, and (4) keep site analysis on a weekly cache for both tiers. The Free-tier P&L is real but it’s a conversion-rate question more than an architecture question — expose the entitlement dials for the Product Lead’s circuit-breaker authority. ADRs 0001–0005 this sprint.