ADR-0009: CI/CD Pipeline — GitHub Actions with Workload Identity Federation
Status: Accepted
Date: 2026-05-14
Deciders: Seth (Lead Architect)
Context
MemberIntel V1 currently deploys manually: make build (Docker), docker push to Artifact Registry, gcloud run services update. This works for a solo developer but creates three problems as the team grows:
- No gate on `main`. Pushing to `main` deploys without any automated checks, so a bad merge can reach staging instantly.
- Service account key drift. The `gcloud` CLI authenticates via a user credential or a downloaded JSON key. JSON keys are long-lived by default, get misplaced, and run counter to the principle of short-lived credentials.
- No audit trail for deployments. Without a CI run linking a commit SHA to a Cloud Run revision, rollbacks require tribal knowledge of which SHA was “last good.”
The spec (Slice 1, CI/CD section) calls for GitHub Actions with Workload Identity Federation for all GCP deploys. Workload Identity Federation is already configured in the GCP project (project number 28785943838, memberintel-v1): the IAM pool, provider, and service account bindings exist, although they were created manually and are not yet captured in infra/main.tf (see Consequences). The GitHub repo (sethshoultes/memberintel) has a branch protection rule on main requiring PRs and three status checks.
Two workflow files already exist in the repo:
- `.github/workflows/ci.yml`: lint, guard-rules, and test (triggered on push and PR)
- `.github/workflows/deploy.yml`: build/push/deploy (triggered on push to `main` + `workflow_dispatch`)
Both are functional and already use google-github-actions/auth@v2 with workload_identity_provider and id-token: write permissions. This ADR ratifies the architecture those workflows implement and records the decisions that shaped them.
Decision
1. GitHub Actions for all CI/CD
Every merge to main and every pull request against main runs through GitHub Actions. There is no path to production that bypasses CI.
PR checks (ci.yml, triggered on push and pull_request):
| Job | Tools | Purpose |
|---|---|---|
| `lint` | ruff (check + format), mypy | Catch style violations and type errors before review |
| `guard-rules` | git grep scripts | Enforce architectural invariants: Anthropic SDK only in `src/memberintel/llm/`, `.model_id` reads only in `src/memberintel/llm/`, `X-Mi-Tier` header not referenced in `src/`, Voyage SDK only in `src/memberintel/api/retrieval/` |
| `test` | pytest (unit + integration + evals) against a `pgvector/pgvector:pg16` service container | Validate correctness before merge |
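The three PR checks in the table could be laid out roughly as follows. This is a minimal sketch, not the repository's actual `ci.yml`: the grep pattern, the `DATABASE_URL` variable name, and the assumption that ruff, mypy, and pytest are installed as Poetry dev dependencies are all illustrative. Only one of the four guard rules is shown.

```yaml
# Hypothetical sketch of ci.yml; names marked "assumed" are not from the repo.
name: ci
on: [push, pull_request]

jobs:
  lint:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      # Assumes ruff and mypy are declared as Poetry dev dependencies.
      - run: pip install poetry && poetry install
      - run: poetry run ruff check .
      - run: poetry run ruff format --check .
      - run: poetry run mypy src/

  guard-rules:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Anthropic SDK only under src/memberintel/llm/
        run: |
          # git grep exits 0 when it finds a match, so any match fails the job.
          # The import pattern is an assumed approximation of the real rule.
          if git grep -nE '^[[:space:]]*(import|from)[[:space:]]+anthropic' \
               -- 'src/' ':!src/memberintel/llm/'; then
            echo "Anthropic SDK imported outside src/memberintel/llm/" >&2
            exit 1
          fi

  test:
    runs-on: ubuntu-latest
    services:
      postgres:
        image: pgvector/pgvector:pg16
        env:
          POSTGRES_PASSWORD: postgres
        ports:
          - 5432:5432
        # Wait until Postgres accepts connections before the steps start.
        options: >-
          --health-cmd "pg_isready -U postgres"
          --health-interval 10s
          --health-timeout 5s
          --health-retries 5
    env:
      # Assumed variable name; the application may read its DSN differently.
      DATABASE_URL: postgresql://postgres:postgres@localhost:5432/postgres
      ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - run: pip install poetry && poetry install
      - run: poetry run pytest
```

Keeping each guard rule as its own named step means a violation shows up in the checks UI under the name of the invariant it broke.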
Main-branch deploy (deploy.yml, triggered on push to main or workflow_dispatch):
- test: re-runs the full suite (unit + integration) as a gate before the image is built.
- build-and-push: builds the Docker image (`python:3.12-slim` multi-stage, Poetry `--only main`), tags it with the commit SHA and `latest`, pushes to `us-docker.pkg.dev/memberintel-v1/memberintel-repo/memberintel-api`.
- deploy-staging: runs automatically on push to `main`. Calls `gcloud run deploy memberintel-api-staging --region us-central1` with the SHA-tagged image.
- deploy-production: runs only via `workflow_dispatch` with `environment: production`. Promotes the same image to `memberintel-api` with `--min-instances 1`.
- run-migrations: connects to Cloud SQL to run `alembic upgrade head` before traffic shifts (currently stubbed; full migration path pending Alembic-in-container wiring).
- notify: always-run summary job that writes image tag, staging result, and production result to `$GITHUB_STEP_SUMMARY`.
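The job ordering above can be sketched as a dependency graph using `needs` and `if`. This is an illustration under stated assumptions, not the repository's `deploy.yml`: the `environment` dispatch input, the `make test` target, and the exact conditions are hypothetical, the GCP authentication steps are left out to keep the graph readable, and the stubbed `run-migrations` job is omitted.

```yaml
# Hypothetical sketch of the deploy.yml job graph (auth steps omitted).
name: deploy
on:
  push:
    branches: [main]
  workflow_dispatch:
    inputs:
      environment:              # assumed input name
        type: choice
        options: [staging, production]
        default: staging

env:
  IMAGE: us-docker.pkg.dev/memberintel-v1/memberintel-repo/memberintel-api

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: make test          # assumed Makefile target

  build-and-push:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      # Tag the same build with both the commit SHA and latest.
      - run: |
          docker build -t "$IMAGE:${{ github.sha }}" -t "$IMAGE:latest" .
          docker push --all-tags "$IMAGE"

  deploy-staging:
    needs: build-and-push
    if: github.event_name == 'push'
    runs-on: ubuntu-latest
    steps:
      - run: >-
          gcloud run deploy memberintel-api-staging
          --region us-central1
          --image "$IMAGE:${{ github.sha }}"

  deploy-production:
    needs: build-and-push
    if: github.event_name == 'workflow_dispatch' && inputs.environment == 'production'
    environment: production
    runs-on: ubuntu-latest
    steps:
      - run: >-
          gcloud run deploy memberintel-api
          --region us-central1
          --image "$IMAGE:${{ github.sha }}"
          --min-instances 1

  notify:
    needs: [build-and-push, deploy-staging, deploy-production]
    if: always()                # report even when an upstream job fails or is skipped
    runs-on: ubuntu-latest
    steps:
      - run: |
          {
            echo "Image tag: ${{ github.sha }}"
            echo "Staging: ${{ needs.deploy-staging.result }}"
            echo "Production: ${{ needs.deploy-production.result }}"
          } >> "$GITHUB_STEP_SUMMARY"
```

Because both deploy jobs reference `${{ github.sha }}` rather than `latest`, the Cloud Run revision always points at the image built from the commit that triggered the run.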
2. Workload Identity Federation — no service account keys
All GCP authentication in CI uses the google-github-actions/auth@v2 action with OIDC-based Workload Identity Federation. The GitHub Actions OIDC token (short-lived, audience-scoped) is exchanged for a GCP access token via a configured pool and provider. No JSON key files are stored in GitHub Secrets.
Required GitHub Secrets (four; the three GCP values are non-sensitive identifiers, while the Anthropic key is a genuine credential):
| Secret | Contents |
|---|---|
| `GCP_WORKLOAD_IDENTITY_PROVIDER` | Full provider resource name: `projects/28785943838/locations/global/workloadIdentityPools/<pool>/providers/<provider>` |
| `GCP_SERVICE_ACCOUNT` | Email of the CI service account, e.g. `github-actions@memberintel-v1.iam.gserviceaccount.com` |
| `GCP_PROJECT_ID` | `memberintel-v1` |
| `ANTHROPIC_API_KEY` | API key for integration/eval test runs (stored as secret because it is a third-party credential, not a GCP auth artifact) |
The CI service account is scoped to the minimum roles needed: roles/run.admin (deploy), roles/artifactregistry.writer (push images), roles/cloudsql.admin (migrations), and roles/secretmanager.secretAccessor (read secrets at deploy time). This is an improvement over the previous gcloud CLI setup, which relied on the developer’s personal Owner-level credentials.
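For illustration, the keyless authentication described in this section typically comes down to one `permissions` block and one `auth` step per job. The sketch below assumes the three GCP secrets listed above; the workflow name, job name, and `workflow_dispatch` trigger are placeholders chosen only to make the fragment a complete workflow.

```yaml
# Hypothetical minimal workflow showing only the WIF authentication steps.
name: wif-auth-sketch
on: workflow_dispatch

jobs:
  authenticate:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      id-token: write           # required so the job can request a GitHub OIDC token
    steps:
      - uses: actions/checkout@v4
      # Exchanges the GitHub OIDC token for short-lived GCP credentials.
      - uses: google-github-actions/auth@v2
        with:
          workload_identity_provider: ${{ secrets.GCP_WORKLOAD_IDENTITY_PROVIDER }}
          service_account: ${{ secrets.GCP_SERVICE_ACCOUNT }}
          project_id: ${{ secrets.GCP_PROJECT_ID }}
      - uses: google-github-actions/setup-gcloud@v2
      # Lets later docker push commands authenticate to Artifact Registry.
      - run: gcloud auth configure-docker us-docker.pkg.dev --quiet
```

Without `id-token: write`, the `auth` step has no OIDC token to exchange and fails, so the permission has to be granted on every job that talks to GCP.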
3. Staging is automatic, production is manual
Pushing to main always deploys to staging. Production deploys require an explicit workflow_dispatch with environment: production. This matches the team’s current capacity: one engineer, manual verification on staging before promoting. When the team grows, a GitHub Environment protection rule (required reviewers) can gate production further without changing the pipeline.
4. Branch protection enforces the gate
The main branch protection rule requires:
- Pull request before merging
- Three status checks must pass: `lint`, `guard-rules`, `test`
This ensures no commit reaches main — and therefore no commit triggers a staging deploy — without passing the full CI suite.
Consequences
Positive:
- Every change to `main` is vetted by lint, type-check, architectural guard rules, and tests before it reaches staging.
- GCP authentication uses short-lived OIDC tokens; no service account JSON keys to rotate or leak.
- The commit SHA is the image tag in Artifact Registry, making every Cloud Run revision traceable to a specific commit and CI run.
- Staging deploys are automatic (fast feedback loop); production deploys are intentional (no accidental promotions).
- Guard-rule jobs enforce the architectural boundaries documented in ADR-0002 (model routing single source of truth) and ADR-0005 (Anthropic dependency mitigation) at CI time, not just in code review.
Negative / costs:
- CI minutes on private repos cost money; the deploy workflow re-runs the full test suite as a pre-build gate (duplicating the PR run).
- Workload Identity Federation setup is one-time GCP configuration that is not yet expressed in the current Terraform (`infra/main.tf` does not yet include the WIF pool/provider resources). The pool and provider were created manually via `gcloud iam workload-identity-pools` commands.
- The `run-migrations` job is currently a stub (`gcloud sql connect` without an actual Alembic command); wiring migrations through Cloud Run’s startup or a separate Job is a follow-up.
- The `latest` tag on Artifact Registry images creates a moving target; teams that need reproducible rollbacks must reference the SHA tag explicitly.
Mitigations:
- Duplicate test runs are acceptable for now (small team, fast test suite). If minutes become costly, the deploy workflow can skip its own test job and rely on the branch-protection gate, trusting that `main` only receives merged PRs that already passed CI.
- WIF pool/provider resources should be added to `infra/main.tf` in a follow-up to make the infrastructure fully declarative. The current manual setup is documented in the GCP project and is stable.
- Migrations will be wired as a Cloud Run Job (or run at container startup) in a future slice; the stub in `deploy.yml` is a placeholder that does not block deploys.
- The SHA tag is always available alongside `latest`; production deploys reference the SHA tag explicitly via `${{ github.sha }}`.
Alternatives considered
- Cloud Build — GCP-native CI/CD. Rejected: the team’s expertise and existing workflows are in GitHub Actions; Cloud Build would require maintaining two CI systems if PR checks stay in GitHub.
- Service account JSON keys stored in GitHub Secrets — the “obvious” alternative for GCP auth from CI. Rejected: JSON keys are long-lived, cannot be scoped to a single repository, and must be rotated manually. WIF tokens are short-lived, audience-scoped, and require zero rotation.
- GitOps with ArgoCD / Flux — rejected for V1: the team is one engineer, the stack is Cloud Run (not Kubernetes), and GitOps adds operational complexity with no corresponding benefit. Can be revisited if the team adopts GKE or grows to multi-environment promotion gates.
- Automatic production deploys on merge to `main`. Rejected: the spec calls for manual promotion. Staging is the automatic target; production requires human verification and an explicit `workflow_dispatch`.