V1 Cost Discipline Review
AI Engineer persona's feasibility review of V1 unit economics: the architecture as specified can hit the $1.07 Free / $10.75 Pro targets if input-token ceilings are pinned, model routing is typed from the entitlement service, retrieval is pre-budgeted, and site analysis stays weekly-cached. The scariest finding is that the Free tier only breaks even at the SPEC's floor conversion target — a Blair conversation about circuit-breaker authority, not an architecture fix.
Date: 2026-05-08
Persona: ai-engineer (Lead Architect / AI Engineer — Seth Shoultes’ role)
Source persona file: .claude/agents/ai-engineer.md (KB repo)
Review focus: SPEC §5.5 (cost controls) + architecture deep-dive against V1 unit-economics targets ($1.10/Free, $6–12/Pro at $29 list)
Sources read: v1-spec, seth-jd, arch-overview, arch-llm-cost, arch-evals
This is the inaugural artifact from the persona-based subagent system. It was produced as a smoke test of the ai-engineer agent definition, but the substance is real and load-bearing for O2 (V1 product implementation kickoff).
1. What’s sound
The four-layer cost stack in arch-llm-cost.md (rate limit → entitlement check-and-consume → explicit max_tokens per call → continuous spend monitoring + global circuit breaker) is the right shape, and pinning current_tenant_id via RLS middleware in arch-overview.md keeps tenant isolation an invariant rather than a discipline. The SPEC §8.4 commitment to Anthropic-direct, no LangChain, no fine-tuning is correct and I want to keep it that way.
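To make the RLS invariant concrete, here is a minimal sketch of per-request tenant pinning. The setting name `app.current_tenant_id` and the psycopg-style connection API are my assumptions for illustration, not the repo's actual middleware:

```python
# Sketch: pin the tenant once per request so RLS policies see it as an
# invariant. Assumes a Postgres policy that reads
# current_setting('app.current_tenant_id'); names are illustrative.
from contextlib import contextmanager

@contextmanager
def tenant_scope(conn, tenant_id: str):
    """Pin the tenant for every query issued inside this scope."""
    with conn.cursor() as cur:
        # set_config(..., is_local=true) scopes the value to the current
        # transaction, so a pooled connection can't leak a stale tenant.
        cur.execute(
            "SELECT set_config('app.current_tenant_id', %s, true)",
            (tenant_id,),
        )
    yield conn
```

The point of the scope shape is that handler code never sets the tenant itself; it runs inside a scope the middleware opened, which is what turns isolation from a discipline into an invariant.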
2. What I’d change before we build it
- Tighten the Free chat cap from “5 messages/month” to “5 exchanges/month with a hard input-token ceiling per turn.” SPEC §5.1 specifies 5 messages, but `arch-llm-cost.md` already budgets 12K input tokens per call and 1K Haiku output. Five turns × ~13K combined tokens at Haiku 4.5 pricing is fine; five turns where retrieval bloats input to 30K+ is not. Constraint: cost-per-Free ≤ $1.10.
- Move “Haiku-only for Free” from a tier-routing rule to a typed model handle that the entitlement service mints. The handler should never receive a model string; it receives a token issued by `entitlement.check_and_consume(...)` that names the resolved model. This closes the “admin tool, debug surface, internal job” hole the JD red-line names. Constraint: never accidentally route a Free user to Sonnet.
- Make `max_tokens` enforcement a wrapper, not a convention. `arch-llm-cost.md` already says this; I want it as a single `llm.call(operation, ...)` function where `operation` is an enum and the per-op caps live in one table. No raw Anthropic SDK calls allowed except inside that wrapper — CI lint rule. Constraint: cost discipline.
- Pre-budget retrieval, don’t post-truncate. `search_global_brain` and `search_customer_brain` should each return a token-budgeted page, not a top-K that the caller may forget to trim. Constraint: cost + citation discipline (truncating mid-chunk corrupts citations).
- Cross-pollination job runs as the migration role and is the only code that touches multi-tenant data. Already implied in `arch-overview.md`; I want it written as an ADR with the anonymization step explicit, because Risk #10 (cross-pollination privacy slip) is the single bug that blows up the moat narrative.
- Eval gate before any prompt or retrieval change ships. Per `arch-evals.md` the tier-routing-safety category is 100%-pass blocking — I want that wired into CI before Phase 1 ends, not Phase 2. JD Risk #1 mitigation is the eval suite; we don’t get to ship without it.
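The wrapper-plus-typed-handle shape described above, sketched in Python. Operation names, cap values, and the `ModelHandle` fields are illustrative; `client` stands in for the Anthropic SDK, whose real call signature differs:

```python
# Sketch of the single llm.call(operation, ...) seam: per-op caps live in
# one table, and the model arrives as a typed handle minted by the
# entitlement service, never as a raw string. All names are illustrative.
from dataclasses import dataclass
from enum import Enum

class Operation(Enum):
    FREE_CHAT = "free_chat"
    PRO_CHAT = "pro_chat"
    WEEKLY_DIGEST = "weekly_digest"
    SITE_ANALYSIS = "site_analysis"

# One table, one place: per-operation (input ceiling, output cap) in tokens.
TOKEN_CAPS: dict[Operation, tuple[int, int]] = {
    Operation.FREE_CHAT: (12_000, 1_000),
    Operation.PRO_CHAT: (12_000, 2_000),
    Operation.WEEKLY_DIGEST: (6_000, 1_500),
    Operation.SITE_ANALYSIS: (8_000, 4_000),
}

@dataclass(frozen=True)
class ModelHandle:
    """Minted by entitlement.check_and_consume; never a raw string."""
    model: str
    operation: Operation

def call(handle: ModelHandle, operation: Operation, prompt_tokens: int, client) -> dict:
    """The only allowed path to the SDK (CI lints raw calls elsewhere)."""
    if handle.operation is not operation:
        raise PermissionError("model handle was not minted for this operation")
    max_in, max_out = TOKEN_CAPS[operation]
    if prompt_tokens > max_in:
        raise ValueError(f"input {prompt_tokens} tokens exceeds {max_in} cap")
    # max_tokens is enforced here, not by caller convention.
    return client.create(model=handle.model, max_tokens=max_out)
```

Note that both ceilings live in the wrapper: the handle mismatch check closes the routing hole, and the input check is the “hard input-token ceiling per turn” the first bullet asks for.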
3. What’s a material choice (Blair sign-off + ADR)
Five ADRs to draft this sprint (all now written and committed in the V1 product repo):
- ADR-0001 — Entitlement service shape. Single module,
check_and_consumeatomic, Redis hot-path + Postgres durable, model handle issued by the service. Includes V1.5 trial-state fields now so V1.5 is additive. - ADR-0002 — Model routing single source of truth. Tier → operation → model is a server-side table; no client input, no admin override path, no env-flag bypass.
- ADR-0003 — pgvector + tenant_id partition — partition key invariant so a Pinecone migration is a connection-string swap, not a schema change. (Resolves SPEC Open Q9.)
- ADR-0004 — GCP + Cloud SQL + Cloud Run + Secret Manager per the conversation already captured in
arch-overview.md. (Resolves SPEC Open Q8.) Privacy counsel will ask; ADR is the artifact. - ADR-0005 — Anthropic dependency + mitigation. Direct API in V1, no LangChain, but the
llm.call(operation, ...)wrapper is the seam we’d use for portability. Surfacing this for Blair so it’s an explicit accepted risk, not a hand-wave (JD pushback trigger #10).
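A sketch of the ADR-0001 shape, with an in-memory counter standing in for the Redis hot path. Tier names, quotas, and model strings are illustrative placeholders, not the real routing table:

```python
# Sketch of check_and_consume: decrement quota and mint the model handle
# in one step, so a handler can never obtain a model without consuming
# entitlement. In-memory dict stands in for Redis INCR; a real
# implementation would also write durably to Postgres.
from dataclasses import dataclass

TIER_QUOTA = {"free": 5, "pro": 50}                       # chat exchanges/month
TIER_MODEL = {"free": "haiku-4.5", "pro": "sonnet-4.6"}   # server-side table

@dataclass(frozen=True)
class Grant:
    model: str
    remaining: int

class EntitlementService:
    def __init__(self):
        self._used: dict[tuple[str, str], int] = {}  # (tenant, month) -> count

    def check_and_consume(self, tenant_id: str, tier: str, month: str) -> Grant:
        key = (tenant_id, month)
        used = self._used.get(key, 0)
        quota = TIER_QUOTA[tier]
        if used >= quota:
            raise PermissionError("quota exhausted for this period")
        self._used[key] = used + 1  # Redis INCR would make this step atomic
        # The handler receives this grant, never a raw model string.
        return Grant(model=TIER_MODEL[tier], remaining=quota - used - 1)
```

The design point ADR-0001 should capture: the grant is the only artifact the handler sees, so the “check” and the “route” cannot drift apart.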
4. Open questions for the SPEC
- Q3 (token budgets) is still TBD in §5.5. The numbers in `arch-llm-cost.md` (Free chat 1K out, Pro chat 2K out, etc.) are architecture-deep-dive numbers, not SPEC numbers. That’s drift. Fold them back into §5.5 as the V1 starting matrix.
- Q7 (MP API surface) blocks the sync pipeline cost model. Live Pro sync at “daily refresh” (§5.2) costs different things if MP exposes a delta endpoint vs. forces full re-pulls. Need Paul’s commitment before Phase 1 freeze.
- Q8 (hosting) — recommend marking it resolved as GCP per the brainstorm in `arch-overview.md`. Currently still listed as open in §14.
- Q9 (vector store) — recommend marking it resolved as pgvector with the namespace/Pinecone-fallback note.
- Q11 (telemetry/observability) — open and load-bearing. The weekly cost-per-cohort report `arch-llm-cost.md` requires is the observability stack for the Product Lead. Need a pick (Cloud Logging + BigQuery view, or a LangSmith-style LLM tracer, or both) before Phase 2.
- New Q to add: what’s the Free-tier input-token cap per chat turn? SPEC §5.5 names output budgets; input is where the runaway lives.
- New Q to add: does cross-pollination spend count against any user, or is it global P&L? §5.4 buries it at “$0.10/mo amortized” but doesn’t say where the line item lives in monitoring.
5. Cost back-of-envelope — does the architecture hit $1.10 Free / $6–12 Pro?
Assumptions (Anthropic public pricing, Apr 2026 — flag if these are stale):
- Haiku 4.5: ~$1/M input, ~$5/M output
- Sonnet 4.6: ~$3/M input, ~$15/M output
- Free chat budget per turn: 12K input + 1K output (per `arch-llm-cost.md`)
- Pro chat budget per turn: 12K input + 2K output
Free user ($1.10 target)
| Component | Math | Cost |
|---|---|---|
| 5 Haiku chat turns | 5 × (12K × $1/M + 1K × $5/M) = 5 × ($0.012 + $0.005) | $0.085 |
| Weekly digest (Haiku, ~1.5K out, ~6K in × 4) | 4 × (6K × $1/M + 1.5K × $5/M) = 4 × ($0.006 + $0.0075) | $0.054 |
| Site analysis (Sonnet, weekly cached, ~8K in + 4K out × 4) | 4 × (8K × $3/M + 4K × $15/M) = 4 × ($0.024 + $0.060) | $0.336 |
| Cross-pollination amortized | per §5.4 | $0.10 |
| Sync infra + storage + auth | per §5.4 | $0.50 |
| Total | | ~$1.07 |
That lands under $1.10 — but only if (a) site analysis stays Sonnet-cached weekly and never triggers ad-hoc on a Free user, (b) chat input stays bounded at 12K (not 30K because we forgot to pre-budget retrieval), and (c) infra cost actually holds at $0.50. SPEC §5.4 calls $0.50 “data sync infra + storage + auth” — that’s optimistic for GCP at 50K free users; Cloud SQL + pgvector + Cloud Run + Secret Manager + audit logging will more realistically land $0.70–$0.90 amortized per Free user. Aggressive caching alone does not get us to $1.10. Tight retrieval budgeting + a hard input-token ceiling per turn is what gets us there. I want the input ceiling in the SPEC, not just the architecture deep-dive.
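The Free-tier table above, re-derived in code so the weekly budget review can rerun it when prices move. The per-million rates are this section's Apr 2026 assumptions, nothing more:

```python
# Re-derivation of the Free-tier cost table. Rates are the Apr 2026
# assumptions stated above (Haiku ~$1/M in, ~$5/M out; Sonnet ~$3/M in,
# ~$15/M out); the $0.10 and $0.50 line items come from SPEC §5.4.
HAIKU_IN, HAIKU_OUT = 1 / 1e6, 5 / 1e6
SONNET_IN, SONNET_OUT = 3 / 1e6, 15 / 1e6

chat = 5 * (12_000 * HAIKU_IN + 1_000 * HAIKU_OUT)    # 5 Haiku turns -> $0.085
digest = 4 * (6_000 * HAIKU_IN + 1_500 * HAIKU_OUT)   # 4 weekly digests -> $0.054
site = 4 * (8_000 * SONNET_IN + 4_000 * SONNET_OUT)   # 4 cached analyses -> $0.336
total = chat + digest + site + 0.10 + 0.50            # + §5.4 amortized items
# total = 1.075, the ~$1.07 in the table: under the $1.10 target, but with
# only ~$0.03 of headroom before infra drift eats it.
```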
Pro user ($6–12 target)
| Component | Math | Cost |
|---|---|---|
| 50 Sonnet chat turns/mo | 50 × (12K × $3/M + 2K × $15/M) = 50 × ($0.036 + $0.030) | $3.30 |
| Daily site analysis refresh + on-demand (Sonnet) | ~30 × (8K × $3/M + 4K × $15/M) = 30 × $0.084 | $2.52 |
| Sonnet digest + ~5 insight cards/day × 30 (500 out each) | 150 × (4K × $3/M + 0.5K × $15/M) = 150 × ($0.012 + $0.0075) | $2.93 |
| Sync infra + storage + brain ops + support | per §5.4 | $2.00 |
| Total | | ~$10.75 |
Lands inside the $6–12 band, but the daily site-analysis refresh is the swing factor. SPEC §5.2 says “live, daily refresh” of MP/Stripe — fine — and weekly cached + on-demand for site analysis. If we accidentally let “daily refresh” leak into site analysis (i.e., re-fetch + Sonnet-extract every day) we add ~$2/Pro/mo and start crowding $12. Site analysis is weekly + on-demand only. Lock that in code, not in convention.
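The ~$2/Pro/mo swing claimed above, derived from the budgeted 8K-in / 4K-out Sonnet site-analysis shape and this section's price assumptions:

```python
# The daily-refresh leak, quantified: per-run Sonnet site analysis at the
# budgeted 8K-in / 4K-out shape, weekly-cached vs. accidentally daily.
# Rates are this section's Apr 2026 assumptions.
per_run = 8_000 * 3 / 1e6 + 4_000 * 15 / 1e6  # $0.084 per analysis
weekly = 4 * per_run                           # ~$0.34/mo if weekly-cached
daily = 30 * per_run                           # ~$2.52/mo if refreshed daily
leak = daily - weekly                          # ~$2.18/mo: the "crowding $12" delta
```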
The number that actually matters
50K Free × $1.07 = $53.5K/mo Free-tier burn. SPEC §14 Risk #2 names this. Covering math: 5% of 50K = 2,500 Pro × $29 × ~70% margin after Stripe ≈ $50K/mo gross. The model breaks even at the SPEC’s floor conversion target. If we hit 8% conversion the model works; at 3% it does not, and the Free tier becomes a subsidy.
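The breakeven arithmetic above, parameterized by conversion rate so the cohort review can track where we sit. All figures are this section's own assumptions ($1.07/Free, $29 Pro list, ~70% margin after Stripe):

```python
# Free-tier P&L as a function of conversion rate, using this section's
# assumptions: $1.07 burn per Free user, $29 Pro list, ~70% net margin.
def monthly_p_and_l(free_users: int, conversion: float) -> float:
    free_burn = free_users * 1.07
    pro_gross = free_users * conversion * 29 * 0.70
    return pro_gross - free_burn

# At the SPEC's 5% floor the model roughly breaks even (slightly negative);
# 8% conversion works, 3% does not.
```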
That’s not an architecture-fixable problem — that’s a Blair conversation about the conversion-rate floor below which we tighten Free entitlements (Risk #2 mitigation already names “circuit-breaker authority”). Build the entitlement layer with the dials exposed (chat cap, site-analysis cadence, digest model) so the Product Lead can turn them without a deploy when the weekly cost-per-cohort review demands it.
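A sketch of what “dials exposed” could look like. Field names and defaults are illustrative, and the real store would be the entitlement service's config table, not code constants:

```python
# Sketch: runtime-tunable Free-tier entitlement dials the Product Lead can
# turn without a deploy. Names and defaults are illustrative only.
from dataclasses import dataclass

@dataclass(frozen=True)
class FreeTierDials:
    chat_cap_per_month: int = 5
    site_analysis_cadence_days: int = 7     # weekly cache
    digest_model: str = "haiku-4.5"
    input_token_ceiling: int = 12_000

def tighten_for_low_conversion(d: FreeTierDials) -> FreeTierDials:
    # One example circuit-breaker move: halve the chat cap and slow the
    # site-analysis cadence; leave model routing and ceilings alone.
    return FreeTierDials(
        chat_cap_per_month=max(1, d.chat_cap_per_month // 2),
        site_analysis_cadence_days=14,
        digest_model=d.digest_model,
        input_token_ceiling=d.input_token_ceiling,
    )
```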
Bottom line: architecture as specified can hit the cost targets if we (1) pin input-token ceilings, not just output, (2) make model routing a typed handle from the entitlement service, not a string, (3) pre-budget retrieval at the tool layer, and (4) keep site analysis on a weekly cache for both tiers. The Free-tier P&L is real but it’s a conversion-rate question more than an architecture question — expose the entitlement dials for the Product Lead’s circuit-breaker authority. ADRs 0001–0005 this sprint.