ADR-0009: CI/CD Pipeline — GitHub Actions with Workload Identity Federation

Status: Accepted
Date: 2026-05-14
Deciders: Seth (Lead Architect)

Context

MemberIntel V1 currently deploys manually: make build (Docker), docker push to Artifact Registry, gcloud run services update. This works for a solo developer but creates three problems as the team grows:

  1. No gate on main. Nothing checks a commit before it lands on main, so a bad merge can be built and deployed to staging unverified.
  2. Service account key drift. The gcloud CLI authenticates via a user credential or a downloaded JSON key. JSON keys do not expire by default, get misplaced or leaked, and violate the principle of short-lived tokens.
  3. No audit trail for deployments. Without a CI run linking a commit SHA to a Cloud Run revision, rollbacks require tribal knowledge of which SHA was “last good.”

The spec (Slice 1, CI/CD section) calls for GitHub Actions with Workload Identity Federation for all GCP deploys. Workload Identity Federation is already configured in the GCP project (memberintel-v1, project number 28785943838): the IAM pool, provider, and service account bindings exist, although they were created manually and are not yet in Terraform (see Consequences). The GitHub repo (sethshoultes/memberintel) has a branch protection rule on main requiring PRs and three status checks.

Two workflow files already exist in the repo:

  • .github/workflows/ci.yml — lint, guard-rules, and test (triggered on push and PR)
  • .github/workflows/deploy.yml — build/push/deploy (triggered on push to main + workflow_dispatch)

Both are functional and already use google-github-actions/auth@v2 with workload_identity_provider and id-token: write permissions. This ADR ratifies the architecture those workflows implement and records the decisions that shaped them.

Decision

1. GitHub Actions for all CI/CD

Every merge to main and every pull request against main runs through GitHub Actions. There is no path to production that bypasses CI.

PR checks (ci.yml, triggered on push and pull_request):

Job | Tools | Purpose
lint | ruff (check + format), mypy | Catch style violations and type errors before review
guard-rules | git grep scripts | Enforce architectural invariants: Anthropic SDK only in src/memberintel/llm/, .model_id reads only in src/memberintel/llm/, X-Mi-Tier header not referenced in src/, Voyage SDK only in src/memberintel/api/retrieval/
test | pytest (unit + integration + evals) against a pgvector/pgvector:pg16 service container | Validate correctness before merge
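
The guard-rules job is implemented with git grep scripts that are not reproduced here. As a rough illustration of what one invariant checks (Anthropic SDK imports allowed only under src/memberintel/llm/), here is a minimal Python sketch. The function name, the regular expression, and the assumption that the SDK is imported as the top-level module `anthropic` are all hypothetical and not taken from the repo.

```python
import re
from pathlib import PurePosixPath

# Assumption: the Anthropic SDK is imported as the top-level module `anthropic`.
FORBIDDEN_IMPORT = re.compile(r"^\s*(?:import|from)\s+anthropic\b", re.MULTILINE)

# The only directory where the import is allowed, per the guard rule.
ALLOWED_DIR = PurePosixPath("src/memberintel/llm")


def find_violations(files: dict[str, str]) -> list[str]:
    """Return paths that import the SDK outside the allowed directory.

    `files` maps repo-relative POSIX paths to file contents.
    """
    violations = []
    for path, text in sorted(files.items()):
        inside_allowed_dir = ALLOWED_DIR in PurePosixPath(path).parents
        if not inside_allowed_dir and FORBIDDEN_IMPORT.search(text):
            violations.append(path)
    return violations
```

A CI step built on this idea would fail the job whenever the returned list is non-empty; the other three invariants follow the same pattern with a different expression and allowed directory.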

Main-branch deploy (deploy.yml, triggered on push to main or workflow_dispatch):

  1. test — re-runs the full suite (unit + integration) as a gate before the image is built.
  2. build-and-push — builds the Docker image (python:3.12-slim multi-stage, Poetry --only main), tags it with the commit SHA and latest, pushes to us-docker.pkg.dev/memberintel-v1/memberintel-repo/memberintel-api.
  3. deploy-staging — runs automatically on push to main. Calls gcloud run deploy memberintel-api-staging --region us-central1 with the SHA-tagged image.
  4. deploy-production — runs only via workflow_dispatch with environment: production. Promotes the same image to memberintel-api with --min-instances 1.
  5. run-migrations — connects to Cloud SQL to run alembic upgrade head before traffic shifts (currently stubbed; full migration path pending Alembic-in-container wiring).
  6. notify — always-run summary job that writes image tag, staging result, and production result to $GITHUB_STEP_SUMMARY.
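
The tagging scheme in build-and-push can be sketched as follows. This is an illustration only: the helper name `image_refs` is hypothetical, and the check for a 40-character hex SHA assumes the workflow passes the full commit SHA rather than an abbreviated one.

```python
import re

# Registry path as given in the build-and-push step above.
REGISTRY_PATH = "us-docker.pkg.dev/memberintel-v1/memberintel-repo/memberintel-api"


def image_refs(commit_sha: str) -> dict[str, str]:
    """Return the immutable SHA-tagged reference and the moving `latest` reference."""
    if not re.fullmatch(r"[0-9a-f]{40}", commit_sha):
        raise ValueError(f"expected a full 40-character commit SHA, got {commit_sha!r}")
    return {
        "sha": f"{REGISTRY_PATH}:{commit_sha}",  # what the deploy jobs should reference
        "latest": f"{REGISTRY_PATH}:latest",     # convenience tag, moves on every build
    }
```

Only the `sha` reference identifies a specific build, which is why the deploy jobs use it rather than `latest`.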

2. Workload Identity Federation — no service account keys

All GCP authentication in CI uses the google-github-actions/auth@v2 action with OIDC-based Workload Identity Federation. The GitHub Actions OIDC token (short-lived, audience-scoped) is exchanged for a GCP access token via a configured pool and provider. No JSON key files are stored in GitHub Secrets.
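
The auth action performs the token exchange internally, so the workflows never build this request themselves. To show what "exchanged" means, the sketch below constructs the body of a Google Security Token Service request as I understand that API. The function name is hypothetical, and the pool and provider IDs in the usage note are placeholders, not the project's real values.

```python
# Google Security Token Service endpoint used for the federation exchange.
STS_TOKEN_URL = "https://sts.googleapis.com/v1/token"


def build_sts_exchange_body(provider_resource: str, github_oidc_token: str) -> dict[str, str]:
    """Build the JSON body that swaps a GitHub OIDC token for a federated GCP token.

    `provider_resource` is the full provider resource name, i.e. the value
    stored in GCP_WORKLOAD_IDENTITY_PROVIDER.
    """
    return {
        "grantType": "urn:ietf:params:oauth:grant-type:token-exchange",
        # The audience is the provider resource name prefixed with the IAM host.
        "audience": f"//iam.googleapis.com/{provider_resource}",
        "scope": "https://www.googleapis.com/auth/cloud-platform",
        "requestedTokenType": "urn:ietf:params:oauth:token-type:access_token",
        "subjectTokenType": "urn:ietf:params:oauth:token-type:jwt",
        "subjectToken": github_oidc_token,
    }
```

For example, `build_sts_exchange_body("projects/28785943838/locations/global/workloadIdentityPools/example-pool/providers/example-provider", token)` yields a body that would be POSTed to `STS_TOKEN_URL`. The federated token returned by that call is then used to impersonate the CI service account, and both tokens are short-lived.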

Required GitHub Secrets (four: three non-sensitive GCP identifiers and one third-party API key):

Secret | Contents
GCP_WORKLOAD_IDENTITY_PROVIDER | Full provider resource name: projects/28785943838/locations/global/workloadIdentityPools/<pool>/providers/<provider>
GCP_SERVICE_ACCOUNT | Email of the CI service account, e.g. github-actions@memberintel-v1.iam.gserviceaccount.com
GCP_PROJECT_ID | memberintel-v1
ANTHROPIC_API_KEY | API key for integration/eval test runs (stored as secret because it is a third-party credential, not a GCP auth artifact)
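
A malformed identifier in these secrets tends to surface only as an opaque auth failure in the deploy job. The sketch below shows one way the three GCP identifiers could be shape-checked up front. It is not part of the existing workflows; the function name is hypothetical, and it assumes the service account lives in the same project as GCP_PROJECT_ID, as in the example email above.

```python
import re

# Shape of a Workload Identity Federation provider resource name.
PROVIDER_RE = re.compile(
    r"projects/(?P<project_number>\d+)/locations/global/"
    r"workloadIdentityPools/(?P<pool>[^/]+)/providers/(?P<provider>[^/]+)"
)


def check_gcp_identifiers(secrets: dict[str, str]) -> list[str]:
    """Return a list of human-readable problems; an empty list means all checks passed."""
    problems = []

    provider = secrets.get("GCP_WORKLOAD_IDENTITY_PROVIDER", "")
    if not PROVIDER_RE.fullmatch(provider):
        problems.append("GCP_WORKLOAD_IDENTITY_PROVIDER is not a full provider resource name")

    project_id = secrets.get("GCP_PROJECT_ID", "")
    if not project_id:
        problems.append("GCP_PROJECT_ID is empty")

    # Assumption: the CI service account belongs to the same project.
    account = secrets.get("GCP_SERVICE_ACCOUNT", "")
    if not project_id or not account.endswith(f"@{project_id}.iam.gserviceaccount.com"):
        problems.append("GCP_SERVICE_ACCOUNT is not a service account email in the project")

    return problems
```

The check validates shape only. It cannot confirm that the pool, provider, or service account actually exist in GCP.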

The CI service account is scoped to the minimum roles needed: roles/run.admin (deploy), roles/artifactregistry.writer (push images), roles/cloudsql.admin (migrations), and roles/secretmanager.secretAccessor (read secrets at deploy time). This is an improvement over the previous gcloud CLI setup, which relied on the developer’s personal Owner-level credentials.
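
Role scoping can drift over time as bindings are added by hand. One way to audit it, sketched below, is to compare the roles actually bound to the CI service account against the four listed above. The function names are hypothetical, and the input is assumed to be a project IAM policy in the JSON shape printed by `gcloud projects get-iam-policy --format=json`.

```python
# The four roles this ADR lists for the CI service account.
EXPECTED_ROLES = frozenset({
    "roles/run.admin",
    "roles/artifactregistry.writer",
    "roles/cloudsql.admin",
    "roles/secretmanager.secretAccessor",
})


def roles_for_member(policy: dict, member: str) -> set[str]:
    """Collect every role bound to `member` in a project IAM policy."""
    return {
        binding["role"]
        for binding in policy.get("bindings", [])
        if member in binding.get("members", [])
    }


def role_drift(policy: dict, member: str) -> tuple[set[str], set[str]]:
    """Return (unexpected roles, missing roles) for `member`."""
    granted = roles_for_member(policy, member)
    return granted - EXPECTED_ROLES, EXPECTED_ROLES - granted
```

Here `member` would be the string `serviceAccount:` followed by the value of GCP_SERVICE_ACCOUNT. A non-empty first set means the account holds more than the ADR intends.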

3. Staging is automatic, production is manual

Pushing to main always deploys to staging. Production deploys require an explicit workflow_dispatch with environment: production. This matches the team’s current capacity: one engineer, manual verification on staging before promoting. When the team grows, a GitHub Environment protection rule (required reviewers) can gate production further without changing the pipeline.

4. Branch protection enforces the gate

The main branch protection rule requires:

  • Pull request before merging
  • Three status checks must pass: lint, guard-rules, test

This ensures no commit reaches main — and therefore no commit triggers a staging deploy — without passing the full CI suite.

Consequences

Positive:

  • Every change to main is vetted by lint, type-check, architectural guard rules, and tests before it reaches staging.
  • GCP authentication uses short-lived OIDC tokens; no service account JSON keys to rotate or leak.
  • The commit SHA is the image tag in Artifact Registry, making every Cloud Run revision traceable to a specific commit and CI run.
  • Staging deploys are automatic (fast feedback loop); production deploys are intentional (no accidental promotions).
  • Guard-rule jobs enforce the architectural boundaries documented in ADR-0002 (model routing single source of truth) and ADR-0005 (Anthropic dependency mitigation) at CI time, not just in code review.

Negative / costs:

  • CI minutes on private repos cost money; the deploy workflow re-runs the full test suite as a pre-build gate (duplicating the PR run).
  • Workload Identity Federation setup is one-time GCP configuration that is not yet captured in Terraform (infra/main.tf does not include the WIF pool/provider resources). The pool and provider were created manually via gcloud iam workload-identity-pools commands.
  • The run-migrations job is currently a stub (gcloud sql connect without an actual Alembic command); wiring migrations through Cloud Run’s startup or a separate Job is a follow-up.
  • The latest tag on Artifact Registry images creates a moving target; teams that need reproducible rollbacks must reference the SHA tag explicitly.

Mitigations:

  • Duplicate test runs are acceptable for now (small team, fast test suite). If minutes become costly, the deploy workflow can skip its own test job and rely on the branch-protection gate, trusting that main only receives merged PRs that already passed CI.
  • WIF pool/provider resources should be added to infra/main.tf in a follow-up to make the infrastructure fully declarative. The current manual setup is documented in the GCP project and is stable.
  • Migrations will be wired as a Cloud Run Job (or run at container startup) in a future slice; the stub in deploy.yml is a placeholder that does not block deploys.
  • The SHA tag is always available alongside latest; production deploys reference the SHA tag explicitly via ${{ github.sha }}.

Alternatives considered

  • Cloud Build — GCP-native CI/CD. Rejected: the team’s expertise and existing workflows are in GitHub Actions; Cloud Build would require maintaining two CI systems if PR checks stay in GitHub.
  • Service account JSON keys stored in GitHub Secrets — the “obvious” alternative for GCP auth from CI. Rejected: JSON keys are long-lived, cannot be scoped to a single repository, and must be rotated manually. WIF tokens are short-lived, audience-scoped, and require zero rotation.
  • GitOps with ArgoCD / Flux — rejected for V1: the team is one engineer, the stack is Cloud Run (not Kubernetes), and GitOps adds operational complexity with no corresponding benefit. Can be revisited if the team adopts GKE or grows to multi-environment promotion gates.
  • Automatic production deploys on merge to main — rejected: the spec calls for manual promotion. Staging is the automatic target; production requires human verification and an explicit workflow_dispatch.
For: Seth Shoultes, AI Engineer, Blair Williams, Santiago Perez Asis, Product Lead