Secrets Management
Defines five distinct secret categories with separate lifecycles, a layered KMS key hierarchy per environment, Secret Manager naming conventions with path-prefix IAM, 5-minute TTL caching, and the hard rule that no human ever reads a production secret.
The framing: secrets aren’t all the same.
Most teams have a single mental model — “we put credentials in Secret Manager” — and miss that secrets fall into categories with very different lifecycle, blast-radius, and rotation requirements. For MemberIntel, there are at least five categories worth treating separately:
System credentials. Things like the Anthropic API key, Stripe platform keys, the database password the migration role uses. One per environment, used by the application itself, not tied to any individual customer.
Per-customer OAuth credentials. Stripe refresh tokens. Eventually BuddyBoss tokens in V2. One per customer, used by background workers acting on the customer’s behalf, lifecycle tied to the customer’s connection state.
Per-customer signing keys. The MP-license OAuth flow’s per-install signing keys. One per customer-MP-install pair, used to validate banner-click tokens, stored on both your side and the customer’s WordPress side.
Application-level signing keys. Session signing keys, JWT keys (if you ever use them), webhook-signature verification keys. One per environment, used by the application to validate things it issued itself.
KMS root keys. The encryption keys that encrypt everything else. One per environment, lifecycle measured in years not days, never directly used by the application.
Each category has different rotation rules, different blast radius if compromised, and different storage patterns. The discipline is to treat them as five categories explicitly, not as one homogeneous bucket.
The KMS key hierarchy: the foundation everything else sits on.
Start with the root of the trust chain. Cloud KMS gives you customer-managed encryption keys (CMEK), and the structure you want is layered.
A keyring per environment: one each in memberintel-prod, memberintel-staging, and memberintel-dev. Within each keyring, named keys for specific purposes:
- app-data-key: encrypts customer data at rest. Used by Cloud SQL CMEK.
- secrets-key: encrypts Secret Manager values. Used by Secret Manager CMEK.
- backups-key: encrypts Cloud SQL automated backups and any GCS backups.
- audit-key: encrypts the audit dataset in BigQuery.
Why separate keys for separate purposes: if any one key is compromised, the blast radius is bounded. Loss of secrets-key doesn’t compromise customer data at rest, only the encrypted secrets layer. Loss of audit-key doesn’t compromise live data, only historical audit records. This separation also makes rotation tractable — rotating audit-key doesn’t require touching the live data path.
Each key has its own rotation schedule. Cloud KMS handles automatic rotation natively: it generates a new version and makes it the primary, while old versions remain enabled, so decryption still works against historical ciphertext and new encryptions use the new primary. The recommended cadence: every 90 days for active keys. Over a year, a key rotates four times, and old versions remain available for decrypting historical data.
The keys are owned by a service account that nobody has direct credentials to — KMS-admin access is granted just-in-time through Workload Identity Federation in approved workflows. The principle is that no human ever holds KMS-admin credentials at rest. Privacy counsel will appreciate this and your future self will appreciate it more during an incident.
Cloud SQL CMEK and the encryption-at-rest story.
Cloud SQL with CMEK enabled uses your app-data-key to encrypt the database storage. This is a one-line Terraform setting and there’s no operational cost — encryption and decryption happen transparently at the storage layer. Performance impact is negligible.
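A sketch of that setting in Terraform. The region, instance name, keyring name, and tier here are assumptions, not decisions from this document; the key name and 90-day rotation follow the conventions above:

```hcl
resource "google_kms_key_ring" "prod" {
  name     = "memberintel"    # keyring name assumed
  location = "us-central1"    # region assumed; must match the SQL instance
}

resource "google_kms_crypto_key" "app_data_key" {
  name            = "app-data-key"
  key_ring        = google_kms_key_ring.prod.id
  rotation_period = "7776000s" # 90 days
}

# The "one-line setting": point the instance at the key.
resource "google_sql_database_instance" "main" {
  name                = "memberintel-prod-sql" # assumed
  database_version    = "POSTGRES_15"
  region              = "us-central1"
  encryption_key_name = google_kms_crypto_key.app_data_key.id

  settings {
    tier = "db-custom-2-8192" # assumed
  }
}
```

One operational note: the Cloud SQL service agent also needs roles/cloudkms.cryptoKeyEncrypterDecrypter on the key before the instance can be created with CMEK.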
The reason this matters isn’t that Google is going to read your data — they’re not, the GCP service account terms forbid it. The reason is that with CMEK, you control whether Google can read your data. If you ever needed to revoke access, you could destroy the key version and the underlying storage becomes unreadable. This is a “break glass” capability rather than a daily operational concern, but it’s the lever that lets your privacy counsel write defensible language about data sovereignty.
Same pattern for Cloud SQL automated backups (separate key, can rotate independently), the audit BigQuery dataset (separate key, longer retention story), and any GCS buckets used for backups or exports.
Secret Manager: the working layer.
Secret Manager is where the application reads secrets from at runtime. Every secret category lives here, organized by naming convention:
- system/anthropic-api-key — system credential
- system/stripe-platform-key — system credential
- system/db-app-password — system credential
- customer/{tenant_id}/stripe-refresh-token — per-customer OAuth
- customer/{tenant_id}/mp-signing-key — per-customer signing key
- app/session-signing-key — application-level signing
- app/webhook-signature-key-stripe — application-level signing
The path-prefixed naming matters because IAM in Secret Manager is per-secret. The application’s main service account has read access to system/* and app/*. The sync worker service account has read access to customer/*/stripe-refresh-token. The auth service has read access to customer/*/mp-signing-key. Each service account gets only the secrets it needs.
The per-customer secrets are the largest category by count. At 50K free users with Stripe connected, that’s 50K secrets. Secret Manager handles that fine — there’s no scaling concern at that volume — but the IAM management gets interesting. You don’t grant access secret-by-secret; you grant access to the path prefix via Conditional IAM bindings. The sync worker service account has a binding like “read access to secrets matching customer/*/stripe-refresh-token.” Adding a new customer doesn’t require an IAM change.
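One wrinkle worth noting when implementing this: Secret Manager secret IDs can’t contain slashes, so the path convention maps to a delimiter in the literal ID (hyphens in the sketch below, which is an assumption, not a decision from this document). The prefix grant then becomes a conditional binding with a CEL expression on the resource name. Project and service-account names here are illustrative:

```hcl
resource "google_project_iam_member" "sync_worker_customer_secrets" {
  project = "memberintel-prod"
  role    = "roles/secretmanager.secretAccessor"
  member  = "serviceAccount:sync-worker@memberintel-prod.iam.gserviceaccount.com"

  condition {
    title = "customer-stripe-refresh-tokens-only"
    # Note: in IAM conditions, resource.name may carry the project *number*
    # rather than the project ID; verify against a real audit log entry.
    expression = <<-EOT
      resource.name.startsWith("projects/memberintel-prod/secrets/customer-") &&
      resource.name.endsWith("-stripe-refresh-token")
    EOT
  }
}
```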
Secret Manager values are versioned. Every update creates a new version; old versions are retained until explicitly destroyed. Rotation happens by adding a new version — applications read the latest version and pick up the new value on next read. This is what makes rotation operationally tractable.
Application-level secret access patterns.
Where teams chronically get this wrong is the cache layer. Reading from Secret Manager on every request adds latency (~50ms) and cost. The naive solution is to read at startup and cache forever. The right solution is to read at startup and cache with a short TTL (5 minutes), with explicit refresh on certain triggers.
The pattern:
The application has a small secrets module that wraps Secret Manager. On first read, it fetches and caches with a 5-minute TTL. Subsequent reads within the TTL return the cached value. After TTL expiration, the next read refetches.
The 5-minute TTL is a balance: short enough that a rotated secret takes effect within 5 minutes without redeploy, long enough that you’re not hammering Secret Manager on every request.
Some secrets need explicit refresh: webhook signing keys when Stripe sends a notification that the key rotated, OAuth refresh tokens when Stripe returns a “refresh token rotated” response. The secrets module exposes a refresh(secret_name) method that bypasses the cache and updates the cached value immediately.
The secrets module never logs secret values. Period. Logging library configurations should redact known-sensitive keys, but the safer rule is “secrets never enter the logging path at all.” The secrets module returns a wrapper type that, if accidentally serialized to logs, prints <redacted> instead of the value. Defense in depth, because one logging mistake is the difference between a non-incident and a public disclosure.
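A minimal sketch of this secrets module, combining the TTL cache, the explicit refresh hook, and the redacted-by-default wrapper type. The fetch callable stands in for a Secret Manager client (reading the latest version); it is injected here so the caching behavior is testable in isolation:

```python
import time
from typing import Callable, Dict, Tuple


class Secret:
    """Wrapper that never reveals its value through repr/str/logging."""

    def __init__(self, value: str):
        self._value = value

    def reveal(self) -> str:
        # The only deliberate way to get the raw value.
        return self._value

    def __repr__(self) -> str:
        return "<redacted>"

    __str__ = __repr__


class SecretsCache:
    """TTL cache in front of a fetch function (e.g. a Secret Manager read)."""

    def __init__(self, fetch: Callable[[str], str], ttl_seconds: float = 300.0):
        self._fetch = fetch
        self._ttl = ttl_seconds
        self._cache: Dict[str, Tuple[float, Secret]] = {}

    def get(self, name: str) -> Secret:
        entry = self._cache.get(name)
        if entry is not None and time.monotonic() - entry[0] < self._ttl:
            return entry[1]  # still fresh: serve from cache
        return self.refresh(name)

    def refresh(self, name: str) -> Secret:
        """Bypass the cache: refetch and update the cached value immediately."""
        secret = Secret(self._fetch(name))
        self._cache[name] = (time.monotonic(), secret)
        return secret
```

The wrapper type is the defense-in-depth piece: anything that accidentally stringifies a Secret for a log line emits <redacted> rather than the value.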
The OAuth refresh token storage problem.
This is the secret category that earns the most scrutiny because it gives access to customer payment data. Worth being explicit about the lifecycle.
When a customer connects Stripe:
- User clicks “Connect Stripe” in MemberIntel settings.
- OAuth flow with Stripe completes, Stripe returns a refresh token.
- MemberIntel encrypts the refresh token using secrets-key and writes it to Secret Manager at customer/{tenant_id}/stripe-refresh-token.
- MemberIntel writes a stripe_connection record in Postgres with metadata (when connected, scope, account ID) but not the refresh token itself.
- Audit log entry: “tenant {tenant_id} connected Stripe at {timestamp}, requested by user {user_id}.”
When the sync worker runs:
- Worker authenticates as its own service account.
- Worker reads the refresh token from Secret Manager (audit-logged on the GCP side).
- Worker exchanges refresh token for short-lived access token via Stripe API.
- Worker uses access token to fetch data, throws it away when done.
- Worker never logs the refresh token, never stores the access token longer than the request.
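The worker steps above can be sketched so the token-handling discipline is visible in the code shape itself. The three callables stand in for the Secret Manager read, the Stripe token exchange, and the Stripe data fetch (all hypothetical signatures); the short-lived access token never escapes the function:

```python
from typing import Callable, List


def run_stripe_sync(
    tenant_id: str,
    read_secret: Callable[[str], str],     # Secret Manager read (audit-logged)
    exchange_token: Callable[[str], str],  # refresh token -> short-lived access token
    fetch_data: Callable[[str], List],     # access token -> customer data
) -> List:
    """One sync run: the access token lives only inside this function."""
    refresh_token = read_secret(f"customer/{tenant_id}/stripe-refresh-token")
    access_token = exchange_token(refresh_token)
    try:
        return fetch_data(access_token)
    finally:
        # Nothing outlives the request, and neither token is ever logged.
        del access_token
```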
When the customer disconnects Stripe:
- User clicks “Disconnect Stripe” in settings.
- MemberIntel calls Stripe’s revoke endpoint with the refresh token.
- MemberIntel destroys the secret in Secret Manager (destroy, not just delete — destroy removes the value permanently).
- MemberIntel updates the stripe_connection record to disconnected.
- Audit log entry: “tenant {tenant_id} disconnected Stripe at {timestamp}.”
When the customer deletes their MemberIntel account (GDPR):
- The data deletion job runs.
- All per-customer secrets matching customer/{tenant_id}/* are destroyed.
- The Stripe revoke is called on any active connections.
- Audit log entries are written for each destruction.
The destruction step matters. A secret that’s “deleted” but recoverable is not actually deleted from a privacy perspective. Secret Manager’s destroy operation is irrevocable, which is what you want. Document this in the data deletion runbook.
The MP-license signing key: a different lifecycle.
These keys are established when the customer first activates the MP-MemberIntel integration. They’re shared between MemberIntel and the customer’s WordPress install. They have to rotate periodically but rotation requires coordination with the customer’s MP plugin.
The pattern that works:
Each MP install has a current-key and a previous-key. When MemberIntel rotates the key, the new key becomes current, the old becomes previous, and previous remains valid for, say, 24 hours. The MP plugin polls a known endpoint daily and picks up new keys; tokens signed with the previous key during the overlap period still validate. After the overlap period, the previous key is destroyed.
This lets MemberIntel rotate per-customer signing keys on a schedule (say, every 90 days, per customer, smeared across the population) without coordinating outage windows with every customer. The MP plugin handles rotation transparently.
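A sketch of the current/previous validation state machine for one install. HMAC-SHA256 is assumed as the banner-click token scheme (the document doesn’t specify the signature algorithm); the emergency path drops the previous key immediately, matching the compromise procedure described below:

```python
import hashlib
import hmac
import time
from typing import Optional


class InstallSigningKeys:
    """current/previous key pair for one customer MP install."""

    def __init__(self, current: bytes, overlap_seconds: float = 24 * 3600):
        self.current = current
        self.previous: Optional[bytes] = None
        self._rotated_at: float = 0.0
        self._overlap = overlap_seconds

    def rotate(self, new_key: bytes, emergency: bool = False) -> None:
        # Emergency rotation: no overlap, the old key dies immediately.
        self.previous = None if emergency else self.current
        self.current = new_key
        self._rotated_at = time.monotonic()

    def _previous_valid(self) -> bool:
        return (self.previous is not None
                and time.monotonic() - self._rotated_at < self._overlap)

    def sign(self, payload: bytes) -> str:
        return hmac.new(self.current, payload, hashlib.sha256).hexdigest()

    def verify(self, payload: bytes, signature: str) -> bool:
        if hmac.compare_digest(self.sign(payload), signature):
            return True
        if self._previous_valid():
            prev = hmac.new(self.previous, payload, hashlib.sha256).hexdigest()
            return hmac.compare_digest(prev, signature)
        return False
```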
If a key is suspected to be compromised, the rotation can happen immediately with no overlap — the customer’s banner-click experience breaks for whoever has the old key, but if it’s compromised that’s the right tradeoff. Document the emergency rotation procedure in a runbook.
The signing keys are stored in Secret Manager on the MemberIntel side and in the WordPress database on the customer’s side. The customer’s side is, frankly, your weakest link — WordPress databases get backed up to Dropbox by half-trained admins all the time, and your signing key goes with the backup. The mitigation is per-customer scoping (one compromised key affects one customer) and rapid rotation capability (you can revoke fast). This is also why the sensitive-operation re-authentication pattern from the auth conversation matters — even if a banner-click token is compromised, the attacker can’t drain a customer’s data without re-authenticating.
The system-credentials category and the rotation problem.
System credentials — Anthropic API key, Stripe platform key, database passwords — have a different problem: they’re used by every running instance of the application, and rotating them requires coordination.
The pattern:
System credentials are read from Secret Manager with the 5-minute TTL pattern. To rotate, you add a new secret version. Within 5 minutes, every running application instance has picked up the new value. The old version remains valid in the upstream service (Anthropic, Stripe) for an overlap window.
The Anthropic API key specifically: Anthropic’s pattern (as of last check) lets you have multiple active API keys per workspace, and you can rotate by creating a new key, updating Secret Manager, waiting for propagation, then revoking the old key. Total rotation window: ~10 minutes with no downtime.
The database password rotation is harder because Cloud SQL doesn’t support overlapping passwords on the same role. The workaround: create two app roles (app_role_a and app_role_b), rotate by switching which one the application uses. Application reads the active role name from a secret too, so the switch is a Secret Manager update. Old role gets its password invalidated after the switch propagates. Requires more setup but eliminates the “everyone reconnect at once” problem.
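The indirection can be sketched as follows. The secret names system/db-active-role and system/db-password-{role}, and the host, are hypothetical (the document only names the two roles); the point is that switching roles is just a Secret Manager update picked up by the TTL cache:

```python
from typing import Callable


def build_dsn(read_secret: Callable[[str], str]) -> str:
    """Assemble a Postgres DSN via the active-role indirection.

    'system/db-active-role' is a hypothetical secret holding 'app_role_a'
    or 'app_role_b'; each role's password lives under its own secret name.
    """
    role = read_secret("system/db-active-role")
    password = read_secret(f"system/db-password-{role}")
    # Host is a placeholder; in practice this comes from config.
    return f"postgresql://{role}:{password}@10.0.0.5:5432/memberintel"
```

Rotation then means: set the idle role’s new password, flip system/db-active-role, wait for the TTL to propagate, and invalidate the old role’s password.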
System credential rotation cadence: every 90 days at minimum, or immediately if compromise is suspected. Privacy counsel will ask for a documented rotation schedule; provide one and stick to it.
The webhook signature verification keys.
Stripe sends webhooks signed with a key Stripe gives you when you set up the webhook endpoint. MemberIntel needs that key to verify incoming webhooks are actually from Stripe. The key lives in Secret Manager as app/webhook-signature-key-stripe (or per-environment variants).
The wrinkle: Stripe lets you have multiple active webhook signing keys for rotation purposes. You can add a new endpoint signing secret, update MemberIntel, then remove the old one. Same overlap pattern as system credentials.
Webhook handlers should verify signatures against all currently-valid signing keys, not just the latest. This handles the rotation overlap window correctly. The verification logic: try the latest signing key; if the signature doesn’t match, try the previous; if neither matches, reject the webhook as untrusted.
The signed payload also includes a timestamp; reject webhooks where the timestamp is more than 5 minutes off from server time, to prevent replay attacks. Stripe’s docs cover this in detail; the pattern is well-trodden.
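A hedged sketch of that verification. It follows the scheme Stripe documents (HMAC-SHA256 over "{timestamp}.{payload}", carried in a header of the form t=...,v1=...), with simplified parsing that assumes a single v1 entry; consult Stripe's docs for the full header grammar:

```python
import hashlib
import hmac
import time
from typing import Iterable, Optional

TOLERANCE_SECONDS = 300  # reject timestamps more than 5 minutes off


def verify_stripe_webhook(payload: bytes, sig_header: str,
                          signing_keys: Iterable[str],
                          now: Optional[float] = None) -> bool:
    """Verify against every currently-valid key (handles rotation overlap)."""
    parts = dict(p.split("=", 1) for p in sig_header.split(","))
    timestamp = int(parts["t"])
    now = time.time() if now is None else now
    if abs(now - timestamp) > TOLERANCE_SECONDS:
        return False  # stale or future-dated: possible replay
    signed_payload = f"{timestamp}.".encode() + payload
    for key in signing_keys:  # latest first, then previous
        expected = hmac.new(key.encode(), signed_payload,
                            hashlib.sha256).hexdigest()
        if hmac.compare_digest(expected, parts.get("v1", "")):
            return True
    return False
```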
The thing teams underbuild: secret access auditing.
Cloud Audit Logs records every Secret Manager access — who read what secret when. By default these go to the platform audit log, which has a 400-day retention but lives in a place most engineers never look.
Wire these into your audit BigQuery dataset with a sink. Now you have a queryable record of every secret access: which service account, which secret, when. If a secret is suspected compromised, you query “every access to this secret in the last 30 days” and you have a complete picture in seconds.
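A sketch of that query against the sink dataset. The dataset name (audit), project, and tenant ID are assumptions; the wildcard table name follows the standard layout BigQuery uses for audit-log sink exports:

```sql
-- Every access to one customer's Stripe refresh token in the last 30 days.
SELECT
  timestamp,
  protopayload_auditlog.authenticationInfo.principalEmail AS accessor,
  protopayload_auditlog.methodName AS method
FROM `memberintel-prod.audit.cloudaudit_googleapis_com_data_access_*`
WHERE protopayload_auditlog.resourceName
        LIKE '%secrets/customer-t123-stripe-refresh-token%'
  AND timestamp > TIMESTAMP_SUB(CURRENT_TIMESTAMP(), INTERVAL 30 DAY)
ORDER BY timestamp DESC;
```

Anything in the accessor column that isn’t the sync worker’s service account is an immediate investigation.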
This also lets you alert on anomalous access patterns: a service account that reads a secret it doesn’t normally read, a secret being read at a time of day when no scheduled job runs, a secret being read by a user identity instead of a service account (which should never happen if your IAM is configured correctly).
For the per-customer OAuth tokens, this audit trail is what lets you tell a customer “your Stripe credentials were accessed only by our automated sync worker, on this schedule, no exceptions.” That’s a strong claim and it requires this audit infrastructure to back it up.
A practical rule that simplifies everything: no human ever sees a production secret.
State this as a team norm: production secrets — the actual values — are never read by human eyes. Engineers debug using staging credentials and synthetic data. If a production issue requires examining a real value (it almost never does), the access is gated through a break-glass procedure with multi-person approval and is itself audited.
This sounds restrictive until you’ve thought about the alternatives. The alternative — engineers occasionally needing to “check” a production secret to debug — leaks secrets through screen shares, notes, terminal history, and human memory. Once a human has seen a production secret, you have to assume it’s compromised and rotate. Better to never let humans see them in the first place.
The implementation: production secrets are accessible only to specific service accounts. No human IAM principal has direct read access. Break-glass procedure goes through a workflow that requires two-person approval, accesses the secret programmatically (not through the console), and writes a heavily-audited record. In practice this is invoked once or twice a year if at all.
Privacy counsel will love this. So will future-you.
The decisions to make to move forward.
- KMS key hierarchy with separate keys for app-data, secrets, backups, audit. Recommendation: yes — and set up the keyrings in Phase 1 before anything else needs them.
- Cloud SQL CMEK enabled, automated backups encrypted with separate key. Recommendation: yes — it’s a Terraform variable, no operational cost.
- Secret Manager naming convention with path prefixes (system/, customer/, app/) and IAM bindings on prefixes via Conditional IAM. Recommendation: yes.
- Secrets module with 5-minute TTL cache, redacted-by-default wrapper type, explicit refresh hooks. Recommendation: yes.
- Per-customer signing key pattern with current/previous overlap, MP plugin polls for rotation. Recommendation: yes — coordinate with Paul Carter’s team since the MP plugin needs to support this.
- 90-day rotation cadence for system credentials, immediate emergency rotation procedure documented. Recommendation: yes.
- Workload Identity Federation for all CI/CD secret access, no service account keys anywhere. Recommendation: yes.
- Secret access auditing wired to BigQuery audit dataset with anomaly alerts. Recommendation: yes.
- The “no human reads production secrets” rule, with break-glass procedure documented. Recommendation: yes — and write the break-glass procedure before launch, not after the first incident.
- Database password rotation via dual-role pattern, application reads active role name from Secret Manager. Recommendation: yes — costs a day of upfront work, eliminates a class of operational pain forever.
The natural next threads from here: the AI eval suite as architecture (the structure that makes evals into a real release gate, important for V1.5 agent gating and for catching prompt regressions), or pulling back to the higher level — how everything we’ve discussed sequences across the actual Phase 1 and Phase 2 of the build given the team and timeline. Either is a good place to land.