spec
CI/CD & Code Flow
Defines the three separate promotion pipelines — code, Terraform, and database migrations — with GitHub Actions, Workload Identity Federation, manual prod gates, and the eval suite as a release-blocking check.
CI/CD and how code flows
The shape of the pipeline.
The pipeline has three parallel concerns that interleave: code (the application), infrastructure (Terraform), and database (migrations). Each has its own promotion path, its own gates, and its own failure modes. Most teams accidentally couple them — one big “deploy” button that does all three — and regret it. Keep them separate.
Code flow: from PR to production.
A developer pushes a branch. GitHub Actions (or Cloud Build — pick one and don’t mix) runs on the PR:
Lint, type-check, unit tests, integration tests with a throwaway Postgres container. The integration tests include the RLS isolation harness from the earlier conversation: they spin up a test database, create two tenants, authenticate as one tenant, and confirm its queries cannot see the other tenant’s data. This test suite is non-negotiable and has to pass for the PR to merge. Make it a required check on the branch protection rule.
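The PR gate above can be sketched as a single GitHub Actions job with a throwaway Postgres service container. The job name, make targets, and Postgres version are illustrative, not prescriptive:

```yaml
# Sketch only: adapt the make targets to your actual task runner.
name: pr-checks
on: pull_request
jobs:
  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 5s
          --health-timeout 5s
          --health-retries 10
    steps:
      - uses: actions/checkout@v4
      - run: make lint typecheck test-unit
      - run: make test-integration   # includes the RLS isolation harness
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/postgres
```

The service container is created and destroyed per run, so the RLS harness always starts from a clean database.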
The eval suite runs on PRs that touch prompts, retrieval logic, or the brain. Per the SPEC, prompts are version-controlled and offline evals run on each release. In practice this means you have a separate evals/ directory with golden test cases — questions paired with expected behaviors — and the suite runs Claude against your latest prompt and scores the output. This is slower and more expensive than unit tests, so it runs on prompt-related PRs only, not every commit.
On merge to main, the same workflow builds a Docker image and pushes it to Artifact Registry in memberintel-shared, tagged with both latest and the git SHA. Tagging with the SHA is important — latest is mutable and unsuitable for rollback.
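A minimal sketch of the build-and-push step, assuming an Artifact Registry repository named `app` (the region and repo name are placeholders):

```yaml
- name: Build and push image
  run: |
    # Both tags point at the same image; the SHA tag is what rollbacks use.
    IMAGE="us-central1-docker.pkg.dev/memberintel-shared/app/api"
    docker build -t "$IMAGE:latest" -t "$IMAGE:$GITHUB_SHA" .
    docker push --all-tags "$IMAGE"
```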
The image is then deployed to memberintel-staging automatically. Smoke tests run against staging. If they pass, the image is eligible for prod.
Production deploy is a separate, manual step — a workflow_dispatch trigger or a “promote to prod” button. It takes the SHA you want, validates that it’s already passed staging, and deploys it to memberintel-prod. This separation matters: staging deploys are continuous, prod deploys are deliberate. You don’t want a hot-fix-merging engineer to accidentally ship four other unreviewed changes that happened to also be on main.
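One way to shape the manual gate, assuming GitHub Actions environments with required reviewers; the script paths are hypothetical:

```yaml
name: promote-to-prod
on:
  workflow_dispatch:
    inputs:
      sha:
        description: Git SHA to promote (must have passed staging)
        required: true
jobs:
  deploy:
    runs-on: ubuntu-latest
    environment: production   # required reviewers make this a deliberate step
    steps:
      - uses: actions/checkout@v4
      # hypothetical scripts: validate the SHA passed staging, then deploy it
      - run: ./scripts/assert-passed-staging.sh "${{ inputs.sha }}"
      - run: ./scripts/deploy.sh memberintel-prod "${{ inputs.sha }}"
```

The `environment: production` line is what attaches the required-reviewer approval; without it the dispatch would run immediately.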
The deploy mechanism itself.
For Cloud Run, deploys use revisions and traffic splits. The standard pattern: deploy a new revision with --no-traffic, run smoke tests against the revision URL directly, then shift traffic with gcloud run services update-traffic. Either go straight to 100% or do a 10% canary for ten minutes before full cut-over. For V1, all-at-once is fine — you don’t have the volume yet to make canaries meaningful, and the rollback is fast (one command points traffic back to the previous revision).
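The no-traffic pattern as workflow steps (the service name and smoke-test script are illustrative):

```yaml
- run: >
    gcloud run deploy api-staging
    --image "$IMAGE:$GITHUB_SHA"
    --no-traffic --tag candidate
- run: ./scripts/smoke-test.sh   # hypothetical; hits the candidate revision's tagged URL
- run: gcloud run services update-traffic api-staging --to-latest
```

The `--tag candidate` gives the new revision a stable URL to smoke-test before any real traffic reaches it.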
Rollback is the part teams chronically under-invest in. A rollback should be one command, executable from a runbook by anyone on call, taking under 60 seconds. With Cloud Run revisions this is essentially free — gcloud run services update-traffic api-prod --to-revisions=PREVIOUS_REVISION=100. Document it. Test it. Don’t discover the rollback is broken during your first incident.
Infrastructure flow: Terraform with a remote backend.
Terraform state lives in a GCS bucket in memberintel-shared, with object versioning enabled and state locking handled natively by the GCS backend. One state file per environment.
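The backend block, per environment; the bucket name is illustrative, and object versioning on the bucket gives you state history for free:

```hcl
terraform {
  backend "gcs" {
    bucket = "memberintel-shared-tfstate"
    prefix = "env/staging"   # one prefix (one state file) per environment
  }
}
```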
The Terraform repo is separate from the application repo. This decoupling matters because infrastructure changes happen at a different cadence than code — you don’t want every PR triggering a Terraform plan, and you don’t want infra changes blocking on app tests.
A Terraform PR runs terraform plan against the relevant environment and posts the plan as a PR comment. The reviewer reads the plan as part of code review — “this will create one Cloud SQL instance, modify two IAM bindings, destroy zero resources” — and approves or rejects based on what the plan shows, not just on the code. This is critical. Reading the plan is the actual review; reading the code is secondary.
On merge, Terraform applies to staging automatically. Production apply is gated behind a manual approval step in GitHub Actions environments (or equivalent). Same separation as code — staging is fluid, prod requires intent.
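A sketch of the two jobs; posting the rendered plan as a PR comment is usually a marketplace action or a small `gh` script, elided here:

```yaml
jobs:
  plan:
    if: github.event_name == 'pull_request'
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform plan -no-color
      # post the rendered plan output as a PR comment for the reviewer

  apply-prod:
    if: github.event_name == 'push' && github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    environment: production   # the manual approval gate
    steps:
      - uses: actions/checkout@v4
      - uses: hashicorp/setup-terraform@v3
      - run: terraform init && terraform apply -auto-approve
```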
The service account that runs Terraform is scoped per environment. The staging Terraform service account has no permissions in prod. The prod Terraform service account has elevated permissions in prod but is only ever used by the GitHub Actions workflow with manual approval — never by a human directly.
This means your laptop has zero Terraform credentials for prod. If you need to do something to prod that Terraform doesn’t support, you go through the workflow. This sounds annoying until you’ve watched a competent engineer accidentally terraform destroy the wrong workspace and then it sounds essential.
Database migrations: the third pipeline.
Migrations are the most dangerous part of the system because they’re the only thing that can corrupt data. They get their own pipeline.
Migrations live in the application repo (because they’re tightly coupled to schema-aware code) but run in a separate workflow from app deploys. Use a real migration tool — Flyway, Liquibase, Alembic if Python, Prisma migrate if TS, sqlx if Rust. Don’t write your own.
The pattern that works:
Migration files are versioned, immutable, and forward-only. No editing a migration after it’s been merged. If a migration was wrong, you write a new migration that corrects it. This rule is what makes migrations safely reproducible across environments and is non-negotiable.
On every PR, CI runs the migration against a fresh Postgres container and confirms it applies cleanly. It also runs the previous schema’s migrations followed by the new one to confirm the migration is additive on top of existing data shape.
On merge to main, migrations apply to staging automatically before the app code is deployed. This catches a category of bug — “code expects a column that doesn’t exist yet” — at staging rather than prod.
For prod, migrations apply via a separate, manually-triggered workflow that runs before the matching app deploy. The migration runner uses the migration role (the one that bypasses RLS, per the earlier conversation), and the role is provisioned just-in-time by Terraform: created at workflow start, granted necessary permissions, used by the migration tool, then revoked at workflow end. A standing migration role with full access to prod is a footgun.
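The PR-time migration check from the pattern above, sketched; the `migrate.sh` wrapper around your migration tool is hypothetical:

```yaml
jobs:
  migration-check:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: postgres:16
        env:
          POSTGRES_PASSWORD: test
        ports:
          - 5432:5432
        options: >-
          --health-cmd pg_isready
          --health-interval 5s
          --health-retries 10
    steps:
      - uses: actions/checkout@v4
      # applies every migration from scratch against a fresh database,
      # which exercises both the new migration and the full history
      - run: ./scripts/migrate.sh "$DATABASE_URL"
        env:
          DATABASE_URL: postgres://postgres:test@localhost:5432/postgres
```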
Migrations that drop columns, drop tables, or transform data require an additional layer: a written-up plan posted to the PR (or wherever your team does design review), explicit Seth or Senior AI Engineer approval beyond the regular code review, and ideally a dry-run against a staging database loaded with synthetic data of similar shape and volume.
The pattern for risky migrations is expand-contract: ship a migration that adds the new shape without removing the old, deploy code that uses both shapes, deploy code that uses only the new shape, then ship a follow-up migration that removes the old shape. Three steps minimum, no big-bang renames or drops. This is slower and feels excessive — and it’s what keeps you from a 3am rollback.
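A sketch of what the migration halves look like for a hypothetical column rename; the table and column names are invented for illustration, and a backfill on a large table should be batched rather than run as one UPDATE:

```sql
-- Migration N (expand): add the new column; leave the old one in place.
ALTER TABLE members ADD COLUMN display_name text;
UPDATE members SET display_name = full_name WHERE display_name IS NULL;

-- ...two code deploys happen between these files:
-- first write to both columns, then read/write only display_name...

-- Migration N+k (contract): only once nothing references full_name.
ALTER TABLE members DROP COLUMN full_name;
```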
The interaction between the three pipelines.
The three pipelines (code, infra, migrations) need to coordinate at deploy time. The order is always: infra first (so the resources exist), migrations second (so the schema matches), code third (so the running app uses the new schema). If you deploy code before migrations, the code crashes. If you deploy migrations before infra, the migration target may not exist.
For most changes only one pipeline runs. Infra changes are rare, migrations are weekly-ish, code changes are continuous. When all three need to coordinate — say, adding a new feature that requires a new Cloud Tasks queue, a new table, and new application code — the deploy is staged across the three pipelines manually. There’s no clever “one button does all three” — manual coordination is the right answer at your scale, because the alternative is a coupling that bites you when only one of the three needs to change.
The eval suite as a release gate, not a CI step.
This is worth calling out separately because the SPEC treats it as a release-blocking concern.
In V1, the eval suite runs offline against a fixed test set on prompt-related PRs. That’s the lightweight version. Pre-launch and post-launch, the eval suite earns a more serious role: it runs nightly against prod-equivalent prompts and a representative scenario set, and any drop in pass rate above a threshold (say, 5%) raises an alert. This catches Anthropic model updates that subtly change behavior, prompt edits that look fine in PR review but degrade on certain scenarios, and retrieval changes that affect grounding.
By V1.5, the agent eval suite — distinct from the chat eval suite — becomes a hard release gate. The SPEC is explicit about this: 100% pass on safety-critical evals, 95% on others, and a failure blocks release. That’s a stronger gate than most application changes get, and it’s correct given the action surface.
The eval suite needs its own infrastructure: a place to store golden test cases, a runner that calls Claude with the production prompts and current retrieval logic, and a scoring layer that compares outputs to expected behaviors. The scoring is the hard part — it’s usually Claude itself acting as judge, scoring the candidate output on a rubric. Build this once, properly, in V1; you’ll lean on it for V1.5 agent evals and again for V2 cross-platform evals.
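A minimal sketch of the runner-plus-scorer shape. In the real suite, `generate` calls Claude with the production prompts and retrieval logic, and `judge` is Claude scoring the output against a rubric; both are injected here so the harness itself stays testable without API calls, and all the names are illustrative:

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class GoldenCase:
    question: str
    rubric: str  # the expected behavior the judge scores against

def run_eval_suite(
    cases: list[GoldenCase],
    generate: Callable[[str], str],     # model call with production prompts
    judge: Callable[[str, str], bool],  # judge call: (output, rubric) -> pass?
) -> float:
    """Run every golden case and return the overall pass rate."""
    passed = sum(1 for c in cases if judge(generate(c.question), c.rubric))
    return passed / len(cases)

def should_alert(baseline: float, current: float, threshold: float = 0.05) -> bool:
    """Nightly drift check: alert when the pass rate drops past the threshold."""
    return (baseline - current) > threshold
```

Keeping the generate and judge calls behind plain callables is also what lets the same harness serve the V1.5 agent evals with a different scoring rubric.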
Secrets, briefly, because they touch CI/CD.
Secrets live in Secret Manager, per environment. CI workflows access secrets via Workload Identity Federation — no service account keys, no JSON files in GitHub secrets. WIF lets a GitHub Actions workflow assume a GCP identity based on the repo and branch, with no long-lived credentials anywhere. Set this up once and you’ll never go back to service account keys. Claude Code can scaffold the WIF setup; it’s a moderately fiddly Terraform module that has a lot of working examples to reference.
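The workflow side of WIF, sketched; the provider path and service account are placeholders for the values your Terraform module creates:

```yaml
permissions:
  id-token: write   # lets the job mint an OIDC token for GCP to verify
  contents: read
steps:
  - uses: google-github-actions/auth@v2
    with:
      workload_identity_provider: projects/123456789/locations/global/workloadIdentityPools/github/providers/github-repo
      service_account: ci-deployer@memberintel-shared.iam.gserviceaccount.com
```

Subsequent `gcloud` and Terraform steps in the same job pick up these credentials automatically, with no key material stored anywhere.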
Application code reads secrets at startup from Secret Manager, not from environment variables. This means rotating a secret doesn’t require a redeploy — restart the service and it picks up the new value. It also means secrets aren’t visible in deployment configs, audit logs of gcloud run deploy, or in the Cloud Run revision UI.
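A sketch of startup secret loading. The client mirrors the shape of google-cloud-secret-manager’s `SecretManagerServiceClient` but is injected so the function can be stubbed in tests; the names are illustrative:

```python
def load_secret(client, project: str, secret: str, version: str = "latest") -> str:
    """Read one secret value from Secret Manager at process startup.

    `client` is a SecretManagerServiceClient (or a stub with the same shape);
    injecting it keeps this function testable without GCP credentials.
    """
    name = f"projects/{project}/secrets/{secret}/versions/{version}"
    response = client.access_secret_version(name=name)
    return response.payload.data.decode("utf-8")
```

Rotating a secret then means adding a new version in Secret Manager; the next restart resolves `latest` to the new value with no redeploy.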
The one secret category that’s different: build-time secrets like the Anthropic API key used by the eval suite in CI. Those go through GitHub’s encrypted secrets, scoped to the appropriate environment, and never written to logs.
What the team needs to build vs configure.
For V1, this entire pipeline is mostly configuration and Terraform, not application code. With Claude Code, scaffolding it is a one-week project for a single engineer if they’re focused. The non-obvious time sink is testing the pipeline — running through every failure mode (failed migration, failed deploy, failed rollback, leaked secret rotation) in staging until you trust it. Budget the second week for that.
The temptation will be to ship the application before the pipeline is fully solid because “we can fix CI/CD later.” Resist it. The pipeline is the thing that lets you ship safely once you have customers; building it under pressure post-launch is how outages happen.
The decisions to make to move forward.
- GitHub Actions or Cloud Build. Recommendation: GitHub Actions. Better ecosystem, better Claude Code support for scaffolding, easier to reason about, and the one place Cloud Build wins (tight GCP integration) is no longer meaningful. The team’s repos are likely already on GitHub.
- One repo for app + migrations, separate repo for Terraform. Recommendation: yes. The cadence and access patterns are different enough to justify the split.
- Manual approval gate on prod for both code and infra. Recommendation: yes, via GitHub Actions environments with required reviewers. Friction is the point — every prod deploy should have an intentional human in it during V1.
- Workload Identity Federation from day one. Recommendation: yes. Service account keys are a known footgun and there’s no reason to start with them.
- Eval suite as nightly drift detection in V1, hard release gate in V1.5. Recommendation: yes, and this is a build commitment Seth should own personally because it’s the thing that protects the brand promise.
- Expand-contract migration discipline as a documented team norm, not a tool-enforced rule. Recommendation: yes. Tools can’t catch every variant; a team norm enforced through code review can.
The natural next threads from here are: observability and incident response (logging, metrics, tracing, on-call structure — and how privacy counsel’s audit-log requirements get satisfied through the same system), secrets management at depth (key rotation, customer Stripe tokens, the OAuth refresh token storage), or the LLM cost-control architecture (per-customer budget caps, the entitlement layer’s role in routing, the abuse-prevention pieces).