Data Sync Pipeline
Defines the three separate sync pipelines — MP (queue-based with per-customer concurrency controls), Stripe (webhooks for Pro, polling for Free), and site analysis (weekly-cached Claude calls) — with a shared convergence layer and a platform-agnostic canonical schema designed for V2 expansion.
Data sync pipeline architecture
The framing: this isn’t one pipeline, it’s three.
The SPEC implies sync is a single concept but the reality is three different pipelines with different failure modes, different cadences, and different reliability requirements:
The MP sync pipeline pulls membership and transaction data from the customer’s MemberPress site. Live for Pro, monthly snapshot for Free. Runs against the customer’s WordPress install over HTTP, depends on the MP plugin being installed and updated, depends on the customer’s hosting being reachable.
The Stripe sync pipeline pulls payment, subscription, and customer data from the customer’s Stripe account via OAuth. Cadence the same as MP. Runs against Stripe’s API, well-documented and reliable on Stripe’s side, but the customer’s data shape is unique to their setup.
The site analysis pipeline runs Claude over the customer’s public website to extract niche, audience, positioning, and content patterns. Weekly cached for both tiers, on-demand for Pro. Runs against the public web, not against an authenticated source.
Each has different reliability characteristics, different cost shapes, and different things that go wrong. Treating them as one “sync pipeline” in the codebase produces a worst-of-all-worlds layer that’s hard to debug. Treating them as three lets each have the right architecture for its job.
Pipeline 1: the MP sync — the hardest of the three.
This is where most of the engineering complexity lives, because it’s the one running against a customer-controlled environment that you don’t own.
The MP customer’s WordPress install is, from your perspective, an unreliable service. They might be on $5/month shared hosting that returns 504 under load. They might have a security plugin that blocks your User-Agent. They might have updated MP to a version with breaking schema changes. They might be on Cloudflare with rate limits you don’t know about. They might have moved hosts and the URL is now pointing to a parked domain. All of these happen in the wild and your sync needs to degrade gracefully.
The architectural pattern that handles this: a queue-based, retrying, isolated-per-customer sync worker, with circuit breakers and customer-facing visibility into sync health.
The shape:
A scheduler (Cloud Scheduler, daily for Pro, monthly for Free) puts sync jobs onto Cloud Tasks queues, one job per customer. Each job carries the tenant_id and the type of sync (full or incremental). The queue lets you control concurrency — at 50K free users syncing monthly, you don’t want all 50K firing on the first of the month. Cloud Tasks dispatches at a controlled rate, say 100 concurrent workers max, smearing the load across the day or week.
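A minimal sketch of the enqueue side with the google-cloud-tasks client. Project, queue, and worker URL are placeholders; the load smear works by giving each task a schedule_time spread randomly across the window, so a monthly cohort disperses instead of firing at once:

```python
import json
import random
from datetime import datetime, timedelta, timezone

from google.cloud import tasks_v2
from google.protobuf import timestamp_pb2

client = tasks_v2.CloudTasksClient()
# Placeholder project / location / queue names.
parent = client.queue_path("memberintel-prod", "us-central1", "mp-sync-free")

def enqueue_sync(tenant_id: str, sync_type: str, smear: timedelta) -> None:
    """Enqueue one per-customer sync job at a random point inside the smear window."""
    dispatch_at = datetime.now(timezone.utc) + timedelta(
        seconds=random.uniform(0, smear.total_seconds())
    )
    schedule_time = timestamp_pb2.Timestamp()
    schedule_time.FromDatetime(dispatch_at)

    task = tasks_v2.Task(
        http_request=tasks_v2.HttpRequest(
            http_method=tasks_v2.HttpMethod.POST,
            url="https://sync-worker.example.run.app/sync",  # placeholder worker URL
            headers={"Content-Type": "application/json"},
            body=json.dumps({"tenant_id": tenant_id, "sync_type": sync_type}).encode(),
        ),
        schedule_time=schedule_time,  # Cloud Tasks holds the task until this moment
    )
    client.create_task(parent=parent, task=task)
```

The free-tier monthly queue would use a smear window of days; the Pro daily queue a window of hours.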
Workers are Cloud Run services that pick up a job, fetch a tenant-scoped MP API token from Secret Manager, hit the customer’s MP API endpoints, transform the results, write to the per-customer warehouse (with the RLS tenant context set per the earlier conversation), and update sync state. The worker is stateless; all state lives in Postgres or in the queue.
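The worker itself is an HTTP service, since Cloud Tasks pushes jobs over HTTP. A sketch of the envelope, assuming one Secret Manager secret per tenant (naming is illustrative) and hypothetical fetch_from_mp / transform / write_to_warehouse / mark_sync_succeeded helpers:

```python
from flask import Flask, request
from google.cloud import secretmanager

app = Flask(__name__)
secrets = secretmanager.SecretManagerServiceClient()

@app.post("/sync")
def run_sync():
    job = request.get_json()
    tenant_id, sync_type = job["tenant_id"], job["sync_type"]

    # Tenant-scoped MP API token; one secret per install (illustrative naming).
    version = secrets.access_secret_version(
        name=f"projects/memberintel-prod/secrets/mp-token-{tenant_id}/versions/latest"
    )
    token = version.payload.data.decode()

    try:
        records = fetch_from_mp(tenant_id, token, sync_type)  # hypothetical helpers
        write_to_warehouse(tenant_id, transform(records))     # sets the RLS tenant context
        mark_sync_succeeded(tenant_id)
    except TransientSyncError:                                # hypothetical exception type
        return "", 503  # any non-2xx response makes Cloud Tasks retry with backoff
    return "", 200
```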
Retries with exponential backoff are first-class. Cloud Tasks handles this natively — if a worker returns 5xx, the task gets retried with backoff. The retry budget should be generous (10 attempts over ~24 hours) for transient failures, but it has to give up eventually. Permanent failures move the customer’s sync to a “failed” state and surface it to the user.
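Both the retry budget and the concurrency cap are queue configuration, not worker code. A sketch of setting them with the same client, mirroring the numbers above (10 attempts over roughly 24 hours, 100 concurrent dispatches):

```python
from google.cloud import tasks_v2
from google.protobuf import duration_pb2, field_mask_pb2

client = tasks_v2.CloudTasksClient()

queue = tasks_v2.Queue(
    name=client.queue_path("memberintel-prod", "us-central1", "mp-sync-free"),
    rate_limits=tasks_v2.RateLimits(
        max_concurrent_dispatches=100,  # at most 100 sync workers in flight
    ),
    retry_config=tasks_v2.RetryConfig(
        max_attempts=10,
        max_retry_duration=duration_pb2.Duration(seconds=24 * 3600),  # give up after ~24h
        min_backoff=duration_pb2.Duration(seconds=60),
        max_backoff=duration_pb2.Duration(seconds=3600),
        max_doublings=5,  # backoff doubles from 60s to ~32min, then grows toward the cap
    ),
)
client.update_queue(
    queue=queue,
    update_mask=field_mask_pb2.FieldMask(paths=["rate_limits", "retry_config"]),
)
```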
The “surface it to the user” part is the thing teams underbuild. A customer whose dashboard is showing data from three weeks ago because their MP install has been throwing 500s the whole time will assume the product is broken, not that their site is broken. The pattern that works: every customer’s sync state is visible in their MemberIntel settings (“Last successful sync: 3 days ago, status: degraded, last error: cannot reach https://yoursite.com/wp-json/memberintel/v1/sync”). Sync failures past a threshold trigger an in-app banner and an email to the customer with what to check.
This is also where the proactive support outreach from the observability conversation pays off. A customer with sustained sync failures is a customer at risk of churning; Cindy’s team flags them and reaches out. Most sync failures are fixable on the customer’s side (plugin needs updating, site has a security rule blocking us, hosting is overloaded) and the help-them-fix-it conversation is also a relationship-builder.
The MP API surface dependency.
This is SPEC Open Q7, and it’s load-bearing. The sync pipeline needs the MP plugin to expose the data MemberIntel needs to read. Per the phased plan, Seth + Paul Carter resolve this in May. Worth being concrete about what “the MP API surface” needs to look like, because the sync architecture depends on the answer.
What MemberIntel needs from MP per the SPEC:
- Members: list, status, level, signup date, last activity
- Transactions: list, amount, currency, status, member, date
- Subscriptions: list, status, level, member, billing period
- Content protection rules: levels and rule mappings
- Site-level metadata: MP version, configured plans, configured email templates
The MP-side delivery options:
Option A: a dedicated MemberIntel REST endpoint in the MP plugin. The plugin exposes /wp-json/memberintel/v1/sync (and sub-endpoints) that returns exactly what MemberIntel needs in the format MemberIntel wants. Authentication via a per-install token established during the OAuth-style banner flow.
Option B: use MP’s existing public REST API. MP probably already has /wp-json/mp/v1/members and similar. Use those, transform on the MemberIntel side.
Option C: the MP MCP that V1.5 depends on. The V1.5 spec confirms MP is shipping an MCP. If it’s ready early, V1 could consume it for read operations as well.
Option A is what V1 should commit to. It’s a small additive piece of MP-side work (Paul Carter’s team), it gives MemberIntel exactly the shape of data it wants without transformation overhead on every sync, and it’s versioned independently from MP’s main REST API so MP can evolve their public API without breaking MemberIntel. The cost is one engineer-week of MP plugin work, which is well within Phase 1.
The MP plugin endpoints should be designed to support incremental sync from day one. Full syncs of large customers — say, 50K members and 200K transactions — are slow and expensive. The plugin endpoint should accept a since=<timestamp> parameter and return only records changed since then. Most syncs are incremental and fast; only the first sync per customer is full. This is a five-minute design decision in May that pays back forever; missing it means every sync is a full refetch and your costs and latencies are 100x worse.
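A sketch of what the incremental pull looks like from the worker's side, assuming the Option A endpoint grows a members sub-endpoint that honors since and page parameters (names are illustrative):

```python
import requests

def fetch_changed_members(site_url: str, token: str, since: str) -> list[dict]:
    """Pull only members changed since the stored cursor (ISO-8601 timestamp)."""
    records: list[dict] = []
    page = 1
    while True:
        resp = requests.get(
            f"{site_url}/wp-json/memberintel/v1/sync/members",  # hypothetical sub-endpoint
            params={"since": since, "page": page, "per_page": 500},
            headers={"Authorization": f"Bearer {token}"},
            timeout=30,
        )
        resp.raise_for_status()  # 5xx propagates, so Cloud Tasks retries the whole job
        batch = resp.json()
        records.extend(batch["members"])
        if not batch.get("has_more"):
            break
        page += 1
    return records

# The first sync per customer passes an epoch timestamp (full sync). After each
# successful run, persist max(updated_at) from the batch as the new cursor.
```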
Backpressure and the “MP install is slow” problem.
A real failure mode: a customer’s MP install responds, but slowly — say, 30 seconds per request. Your sync job times out, retries, times out again, eventually gives up. Meanwhile the customer’s WordPress install has been hammered and is even slower. You’ve made it worse.
The mitigation: per-customer concurrency limit of 1 (a customer can have at most one sync in flight at a time, enforced by the queue). Per-customer rate limits on retries (no more than N requests per minute to a single customer’s site). Per-customer timeouts that adapt — if a customer’s MP install consistently takes 20 seconds to respond, the timeout for that customer is 30 seconds, not the global 5-second default. The sync state in Postgres tracks the customer’s average response time and the worker reads it to set per-request timeouts.
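The adaptive-timeout logic is small once sync state carries a moving average of the customer's response time (a hypothetical avg_response_ms column). A sketch:

```python
DEFAULT_TIMEOUT_S = 5.0   # global default for sites with no history
MAX_TIMEOUT_S = 30.0      # never wait longer than this per request
ALPHA = 0.2               # EWMA smoothing factor

def request_timeout(avg_response_ms: float | None) -> float:
    """Per-request timeout: default for fast sites, padded for slow ones, capped."""
    if avg_response_ms is None:
        return DEFAULT_TIMEOUT_S
    return min(max(DEFAULT_TIMEOUT_S, avg_response_ms / 1000 * 1.5), MAX_TIMEOUT_S)

def update_avg(avg_response_ms: float | None, observed_ms: float) -> float:
    """Exponentially weighted moving average, written back to the sync state row."""
    if avg_response_ms is None:
        return observed_ms
    return ALPHA * observed_ms + (1 - ALPHA) * avg_response_ms
```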
This is over-engineering for V1 launch and you can ship without it. But the cases where it bites are exactly the ones where you want to look most professional — large enterprise-y customers with slow self-hosted WordPress installs. Worth designing the schema with these fields present even if the adaptive logic ships in V1.5.
Schema drift: when the customer updates MP and breaks your sync.
A customer updates MP to a new version. MP changes the response shape on their end, dropping a field your sync depended on. Your sync starts failing — or worse, succeeding but writing partial data. Now every dashboard insight for that customer is wrong.
The defense:
The sync worker validates every response against a schema (Pydantic, Zod, or equivalent). Any unexpected response is a hard failure, not a “best effort” success. Better to fail loudly and tell the customer “MP version 4.2 isn’t supported yet, please contact support” than to silently write malformed data.
The schema validation also captures which MP version produced the response, via a version field in the dedicated endpoint the MP plugin exposes (Option A above). If the version doesn’t match what MemberIntel expects, MemberIntel knows to handle it specifically — either with a transformer for the new version, or by failing gracefully with a “we’ll support this version soon” message.
This implies MemberIntel and the MP plugin need to be loosely versioned together. The MP plugin advertises a sync-protocol version; MemberIntel knows which versions it supports. When the MP team ships a breaking change, MemberIntel gets a heads-up and ships support before the MP version goes general. Paul Carter’s team needs to be on the same release cadence as MemberIntel for sync-related changes — worth documenting this dependency in the cross-team agreement.
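A sketch of that validation boundary with Pydantic (v2), assuming the Option A endpoint returns the protocol and MP versions alongside the payload; the field list follows the member data above and is illustrative:

```python
from datetime import datetime

from pydantic import BaseModel, ValidationError

SUPPORTED_PROTOCOL_VERSIONS = {"1.0", "1.1"}  # illustrative version set

class Member(BaseModel):
    id: int
    status: str
    level: str
    signup_date: datetime
    last_activity: datetime | None = None

class SyncResponse(BaseModel):
    sync_protocol_version: str
    mp_version: str
    members: list[Member]

class UnsupportedVersion(Exception):
    pass

def parse_sync_response(raw: dict) -> SyncResponse:
    """Hard-fail on anything unexpected; never write best-effort partial data."""
    try:
        resp = SyncResponse.model_validate(raw)
    except ValidationError as exc:
        raise RuntimeError(f"MP response failed schema validation: {exc}") from exc
    if resp.sync_protocol_version not in SUPPORTED_PROTOCOL_VERSIONS:
        raise UnsupportedVersion(
            f"MP sync protocol {resp.sync_protocol_version} not supported yet"
        )
    return resp
```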
Pipeline 2: Stripe sync — easier but with its own twists.
Stripe’s API is fast, well-documented, well-versioned. The technical sync is straightforward: hit the customer’s Stripe with their OAuth refresh token, fetch transactions, customers, subscriptions, write to the per-customer warehouse. Use Stripe’s pagination and created[gt] parameter for incremental syncs.
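A sketch of the polling path with stripe-python, assuming the connected account's OAuth access token is passed as the per-call API key:

```python
import stripe

def poll_charges(connected_account_key: str, cursor_unix: int) -> list[dict]:
    """Fetch charges created after the stored cursor for one connected account."""
    charges = stripe.Charge.list(
        api_key=connected_account_key,  # the customer's OAuth access token
        created={"gt": cursor_unix},    # incremental: only records after the cursor
        limit=100,
    )
    # auto_paging_iter walks Stripe's cursor-based pagination transparently.
    return [c.to_dict_recursive() for c in charges.auto_paging_iter()]
```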
The twists:
Stripe events vs Stripe API polling. Stripe offers webhooks that fire when something changes (payment succeeded, subscription updated, customer deleted). Polling the API is what a sync pipeline does; subscribing to webhooks is what a real-time integration does. For Pro tier “live” sync, webhooks are the right answer — instant updates on changes. For Free tier monthly snapshot, polling is fine.
The architectural implication: Pro customers have a webhook listener path (Stripe sends events to MemberIntel, MemberIntel updates the customer’s warehouse). Free customers have a polling path (scheduled job runs once a month). Two different code paths feeding the same warehouse tables.
The webhook path is where most reliability work lives. Stripe webhooks have an at-least-once delivery guarantee with retries; you have to handle duplicate events. The standard pattern is idempotency: every webhook event has a unique ID, you track processed event IDs in a stripe_events_processed table per tenant, and you skip events you’ve already handled. Stripe also signs webhook payloads; verify the signature on every event before processing.
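A sketch of the webhook path with Flask, following Stripe's documented verify-then-process pattern; tenant_id_for, record_event_once, and apply_event are hypothetical helpers, with record_event_once doing an INSERT ... ON CONFLICT DO NOTHING against the stripe_events_processed table:

```python
import os

import stripe
from flask import Flask, abort, request

app = Flask(__name__)
ENDPOINT_SECRET = os.environ["STRIPE_WEBHOOK_SECRET"]

@app.post("/webhooks/stripe")
def stripe_webhook():
    try:
        # Verifies the payload signature; rejects tampered or forged events.
        event = stripe.Webhook.construct_event(
            request.data, request.headers.get("Stripe-Signature", ""), ENDPOINT_SECRET
        )
    except (ValueError, stripe.error.SignatureVerificationError):
        abort(400)

    # Connect webhooks carry the connected account ID; map it to a tenant.
    tenant_id = tenant_id_for(event["account"])  # hypothetical helper

    # Idempotency: returns False if this event ID was already recorded for this
    # tenant (Stripe delivers at-least-once, so duplicates are expected).
    if not record_event_once(tenant_id, event["id"]):
        return "", 200  # duplicate: acknowledge and do nothing

    apply_event(tenant_id, event)  # hypothetical: routes by event["type"] to warehouse writes
    return "", 200
```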
OAuth scope breadth. Stripe OAuth lets you request read-only or read-write access. MemberIntel V1 only reads, so the scope should be read_only. Don’t request write scopes you don’t need — privacy counsel will ask, and the answer “we have write access we don’t use” is worse than “we only have what we need.”
Multiple Stripe accounts per MemberIntel customer. Edge case worth thinking about: a customer runs two membership sites with separate Stripe accounts. Do they get one MemberIntel account with two Stripe connections, or two MemberIntel accounts? V1 answer: one Stripe connection per MemberIntel account, period. Multi-Stripe is a V2.1+ feature when multi-site arrives. Document this limit clearly so the customer understands it before connecting.
Pipeline 3: site analysis — the cheapest pipeline that needs the most caching.
The site analysis pipeline is the simplest mechanically: fetch the customer’s public site URL, run a Claude analysis with structured extraction, write the result to the per-customer brain.
Cheapest in operational complexity but also the most expensive per call, because each run is a Claude API call processing a fairly large input (the site’s homepage and a few key pages). The cost discipline matters here.
The pattern:
Weekly scheduled job for both Free and Pro. The job fetches the site (using a Cloud Run service with Cloud NAT for static egress IP, so customers can allowlist), parses the HTML, extracts text, sends a structured Claude call with a fixed prompt that returns a JSON object describing niche, audience, positioning, pricing patterns, content patterns. The output is written to customer_brain.site_profile (versioned — old versions retained for diffing).
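A sketch of the extraction call with the anthropic SDK; the model name is a placeholder to pin in config, the prompt and key list are illustrative, and the input is truncated to keep per-call cost bounded (the same cap that handles the enormous-site failure mode below):

```python
import json

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MAX_SITE_CHARS = 50_000  # hard cap on extracted text; bounds the per-call cost

PROMPT = """Analyze this membership site's public pages. Return ONLY a JSON object
with keys: niche, audience, positioning, pricing_patterns, content_patterns.

Site text:
{site_text}"""

def analyze_site(site_text: str) -> dict:
    message = client.messages.create(
        model="claude-sonnet-4-5",  # placeholder model name
        max_tokens=1024,
        messages=[
            {"role": "user", "content": PROMPT.format(site_text=site_text[:MAX_SITE_CHARS])}
        ],
    )
    # Assumes the model returns bare JSON; the parsed result is versioned into
    # customer_brain.site_profile for diffing against prior runs.
    return json.loads(message.content[0].text)
```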
Aggressive caching: the SPEC says weekly. Make it actually weekly. Don’t refetch on every dashboard load; that’s a path to runaway costs. The cached version is served until the next scheduled run completes.
On-demand re-analysis (Pro feature) bypasses the cache but is rate-limited per user per day (one or two per day max). The Pro user clicks “refresh site analysis” and triggers a fresh run; subsequent clicks within the rate limit return the existing cached result.
Failure modes for the site analysis:
The customer’s site is down → retry once, then mark the analysis stale and surface it in the dashboard (“site unreachable at last analysis”).
The customer changed their site significantly → the analysis is fine but doesn’t match the customer’s mental model of their site. The diffing of versions helps here; the brain tracks “site profile changed” and the AI can mention it: “I notice your site’s positioning shifted recently; let me know if you want me to update my understanding.”
The customer’s site is enormous → fetch only homepage + a handful of linked pages, not crawl the whole site. Set a hard limit on text-extracted volume to keep the Claude call cost bounded.
The customer has Cloudflare or similar bot protection → identify yourself honestly with a descriptive User-Agent (“MemberIntelBot/1.0”), respect robots.txt, and surface it to the customer if blocked. A customer who’s blocking you is fixable on their side (“please allowlist our crawler”), but only if they know.
The shared infrastructure: where the three pipelines converge.
All three pipelines feed the same per-customer warehouse and emit events to the same audit and metrics streams. The convergence layer:
A single sync_run table per tenant (RLS-isolated like everything else) tracks every run of every pipeline. Fields: pipeline_type, run_id, started_at, finished_at, status, error_message, records_processed. This is the customer-facing visibility from earlier — it powers the “last sync” indicator in their settings.
A single tenant_data_freshness view derives “what’s the freshest data we have for this customer” from the sync_run table. The dashboard uses this to render “data as of: 2 hours ago” or “data as of: 18 days ago.” The chat handler uses this to caveat answers grounded in stale data.
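A sketch of the convergence schema as a migration string, assuming the tenant_id / RLS conventions from the earlier conversation; exact types are illustrative:

```python
# Migration sketch; executed with any Postgres client (psycopg, SQLAlchemy, etc.).
SYNC_RUN_DDL = """
CREATE TABLE sync_run (
    tenant_id         uuid        NOT NULL,
    run_id            uuid        NOT NULL DEFAULT gen_random_uuid(),
    pipeline_type     text        NOT NULL
        CHECK (pipeline_type IN ('mp', 'stripe', 'site_analysis')),
    started_at        timestamptz NOT NULL DEFAULT now(),
    finished_at       timestamptz,
    status            text        NOT NULL,  -- 'running' | 'succeeded' | 'failed'
    error_message     text,
    records_processed integer,
    PRIMARY KEY (tenant_id, run_id)
);

-- Freshest successful run per pipeline, per tenant; powers the "data as of"
-- indicator and the chat handler's staleness caveats.
CREATE VIEW tenant_data_freshness AS
SELECT tenant_id,
       pipeline_type,
       max(finished_at) AS last_success_at
FROM sync_run
WHERE status = 'succeeded'
GROUP BY tenant_id, pipeline_type;
"""
```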
The transform layer is per-pipeline (MP transformer, Stripe transformer, site analysis transformer) but the destination schema is uniform — a member is a member regardless of whether the data came from MP. This matters for V2 when BuddyBoss data flows through. The MP transformer and the BuddyBoss-Memberships transformer write to the same members table, because BB Memberships is MP underneath. By V2.1 (PMP), there’s a third transformer writing the same shape.
This is the abstraction the SPEC v2 anticipates: a connector framework where each platform has a connector that produces the canonical data shape, and the rest of the system is platform-agnostic. V1 doesn’t have to build the abstraction explicitly — there’s only one connector — but the schema should be designed so the abstraction is easy to add later. Don’t put mp_membership_level_id on your members table; put source_platform and source_id. Future-you will thank present-you when V2.1 ships.
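Concretely, a minimal sketch of the canonical members table under that convention, with illustrative columns:

```python
MEMBERS_DDL = """
CREATE TABLE members (
    tenant_id       uuid        NOT NULL,
    member_id       uuid        NOT NULL DEFAULT gen_random_uuid(),
    source_platform text        NOT NULL,  -- 'memberpress' today; 'buddyboss', 'pmp' later
    source_id       text        NOT NULL,  -- the record's ID in the source system
    status          text        NOT NULL,
    level           text,
    signup_date     timestamptz,
    last_activity   timestamptz,
    PRIMARY KEY (tenant_id, member_id),
    UNIQUE (tenant_id, source_platform, source_id)  -- upsert target for every connector
);
"""
```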
Cost control on the sync side.
The sync pipelines themselves cost money: Cloud Tasks dispatches, Cloud Run worker time, network egress, Anthropic API calls for site analysis. At 50K free users monthly-syncing plus a few thousand Pro daily-syncing plus weekly site analyses, this adds up.
Rough budget at full V1 scale:
- MP and Stripe sync: a few cents per customer per month for compute, dominated by Cloud Run worker time. Probably $200-500/month in aggregate at 50K Free + 5K Pro.
- Site analysis: this is the real cost. ~$0.20/customer/month per the SPEC. At 50K customers that’s $10K/month for site analysis alone, dwarfing the sync infrastructure cost.
The site analysis cost is the one to watch. It’s why “weekly” matters and “on-demand for Pro only, rate-limited” matters. A bug that re-analyzes every customer every day instead of every week is a $70K/month bug. The kill switch from the previous conversation applies here too — site analysis should be one of the kill-switchable subsystems.
The decisions to make to move forward.
- Three separate pipelines (MP, Stripe, site analysis) with a shared convergence layer, not one monolithic sync. Recommendation: yes.
- MP-side dedicated MemberIntel REST endpoint (Option A) with versioning and incremental-sync support from day one. Recommendation: yes — coordinate with Paul Carter in May.
- Cloud Tasks for queueing, Cloud Scheduler for triggering, Cloud Run for workers. Recommendation: yes, this is the standard GCP pattern for this shape and Claude Code can scaffold it cleanly.
- Per-customer concurrency=1, adaptive timeouts, sync-state visibility in customer settings. Recommendation: yes. The visibility in particular is what turns sync failures from churn risk into support opportunity.
- Stripe webhooks for Pro tier real-time, polling for Free tier monthly. Read-only OAuth scopes only. Recommendation: yes.
- Site analysis weekly cached, Pro-only on-demand with daily rate limit. Recommendation: yes.
- Schema versioning between MP plugin and MemberIntel, with Paul Carter’s team on the same release cadence for sync-related changes. Recommendation: yes — and document this dependency in the cross-team agreement.
- Canonical data schema designed with platform-agnostic fields (source_platform, source_id) from V1, even though only MP exists. Recommendation: yes — costs nothing now, saves a refactor for V2.1.
The natural next threads from here: secrets management at depth (the OAuth refresh token storage, KMS key hierarchy, rotation strategy — this is the area privacy counsel will scrutinize most closely), the AI eval suite as architecture (the structure that makes evals into a real release gate, important for V1.5 agent safety), or pulling back to a higher level — for instance, how the pieces we’ve discussed should be sequenced across Phase 1 and Phase 2 of the actual build, given the team and timeline.