Data Feedback and Training

Problem overview

Every agent call to an upstream model API (Anthropic, OpenAI, etc.) sends customer prompt content and agent output through the provider’s servers. Unless the provider’s terms say otherwise, the provider may retain this data and use it to train future models. This is a direct threat to customer compliance and a recurring sticking point in vendor contracts and sales.

Concerns across compliance scenarios:

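GDPR: personal data may be processed only for the purposes it was collected for, and any other use needs its own lawful basis (in practice, usually explicit consent); training use without customer authorization is a violation
HIPAA: protected health information may not be used for secondary purposes such as model training without authorization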
Enterprise trade secrets: customer prompts may contain unpublished strategy, contract clauses, customer information — provider retention means potential leakage
Industry-specific: finance, legal, government customers have similar but stricter restrictions

Three major providers’ default terms

The table below reflects major providers’ default terms at the time of writing (early 2026). Terms change frequently; always verify the latest version before signing contracts.

Provider	Standard API default	Enterprise plan	Methods to disable training use
Anthropic	Default not used for training (revised)	Same + DPA + BAA	Default already satisfies; extra needs via Enterprise contract
OpenAI	API default not used for training (since 2023)	ZDR (Zero Data Retention) option + DPA + BAA	Apply for ZDR; 30-day default retention reducible to 0
Google (Vertex AI)	Default not used for training (enterprise accounts)	Data residency region options + detailed DPA	Default already satisfies
Self-hosted models	Fully controlled	Custom	No external feedback

Key trend changes:

In 2022-2023, default terms typically read “may be used for training”; pushback from the enterprise market led providers to revise them to “not used for training by default”
Terms can still change at any time, so a written commitment at contract signing is safer than relying on default settings
Consumer products such as ChatGPT have different terms from the API: user conversations may be used for training by default, while business tiers such as ChatGPT Enterprise exclude them by default (verify the current terms for each product)

Three layers to cut feedback

Layer 1: contract terms

Sign enterprise contracts with upstream model providers rather than relying on default API access. The contract should state explicitly:

“Provider shall not use Customer Data to train, fine-tune, or otherwise improve any AI models.”
“Provider shall not retain Customer Data beyond [X] hours for operational purposes.”
Add a DPA (Data Processing Agreement), which GDPR requires
For healthcare, add a BAA (Business Associate Agreement), which HIPAA requires

Layer 2: technical switches

API options provided by upstream:

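OpenAI: ZDR is granted on application and enabled at the account level rather than through a per-request parameter; confirm in account settings that it is active before relying on it
Anthropic: API data is not used for training by default, but inputs and outputs are retained for a limited period unless zero retention is agreed; enterprise accounts additionally sign a DPA confirming the terms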
Vertex AI: account-level data control settings

Use technical switches and contract terms together: the contract provides legal protection, and the technical switch enforces it in practice.
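
One way to keep Layers 1 and 2 in step is to record each provider’s contract and retention status in your own configuration and check it before routing traffic. The sketch below is illustrative only: ProviderPolicy, policy_violations, and every field name are hypothetical and do not correspond to any provider’s API. It assumes you maintain these values by hand from the signed contracts and account settings.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ProviderPolicy:
    """Data-handling status recorded for one upstream provider (hypothetical fields)."""
    name: str
    no_training_clause_signed: bool  # Layer 1: written no-training clause in the contract
    zero_retention_enabled: bool     # Layer 2: ZDR or equivalent confirmed active
    retention_days: int              # provider-side retention when ZDR is off


def policy_violations(policy: ProviderPolicy, max_retention_days: int = 0) -> List[str]:
    """Return the reasons this provider fails the required data-handling policy."""
    problems = []
    if not policy.no_training_clause_signed:
        problems.append(f"{policy.name}: no contractual no-training clause")
    # With zero retention enabled, the recorded retention period no longer applies.
    effective_days = 0 if policy.zero_retention_enabled else policy.retention_days
    if effective_days > max_retention_days:
        problems.append(
            f"{policy.name}: retention {effective_days}d exceeds {max_retention_days}d"
        )
    return problems


def assert_provider_allowed(policy: ProviderPolicy, max_retention_days: int = 0) -> None:
    """Raise before any request is sent to a provider that fails the policy."""
    problems = policy_violations(policy, max_retention_days)
    if problems:
        raise PermissionError("; ".join(problems))
```

Calling assert_provider_allowed at startup, or whenever a customer’s provider routing changes, turns a missing contract clause or a disabled switch into a hard failure instead of a silent gap.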

Layer 3: audit verification

How do you prove the contract terms and technical switches to the customer?

Upstream providers typically do not directly issue certificates to “your customers” (your customer is not their direct customer)
Practical approach: you (the vendor) hold the upstream provider’s contracts / certifications, and prove the sub-processor relationship to customers
Customer contracts include sub-processor list + data usage terms for each sub-processor
Provider compliance certifications (SOC 2, HIPAA eligibility, etc.) serve as indirect endorsement
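
The sub-processor list and its per-provider data usage terms can be kept as structured data, so that the contract appendix and the due diligence answers come from one source. This is a minimal sketch under that assumption; SubProcessor, render_subprocessor_list, and missing_agreements are hypothetical names, and the column layout is only an example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Tuple


@dataclass(frozen=True)
class SubProcessor:
    """One upstream provider as it would appear in a customer contract appendix."""
    name: str
    purpose: str
    data_usage_terms: str
    agreements: Tuple[str, ...] = ()      # e.g. ("DPA", "BAA")
    certifications: Tuple[str, ...] = ()  # e.g. ("SOC 2",)


def render_subprocessor_list(entries: Iterable[SubProcessor]) -> str:
    """Render a plain-text sub-processor list, sorted by provider name."""
    rows = ["Sub-processor | Purpose | Data usage terms | Agreements | Certifications"]
    for sp in sorted(entries, key=lambda s: s.name.lower()):
        rows.append(" | ".join([
            sp.name,
            sp.purpose,
            sp.data_usage_terms,
            ", ".join(sp.agreements) or "none",
            ", ".join(sp.certifications) or "none",
        ]))
    return "\n".join(rows)


def missing_agreements(
    entries: Iterable[SubProcessor], required: Tuple[str, ...] = ("DPA",)
) -> Dict[str, List[str]]:
    """Map each provider name to the required agreements it has not signed."""
    gaps = {}
    for sp in entries:
        missing = [a for a in required if a not in sp.agreements]
        if missing:
            gaps[sp.name] = missing
    return gaps
```

For a healthcare customer, missing_agreements(entries, ("DPA", "BAA")) would flag any provider that cannot be used until a BAA is in place.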

Standard response to customers

Enterprise customer due diligence checklists almost always include this question. Prepare a standard response to avoid answering from scratch in every sales conversation:

Q: Will our data be used to train AI models?

A: No. Specifically:

1. Our upstream model providers ([list, e.g., Anthropic / OpenAI / ...])
   commit by default not to use API input for model training. We have
   signed enterprise-grade contracts with providers ([contract name])
   that further clarify this clause.

2. We enable Zero Data Retention (ZDR) configuration at the API call
   layer (where the provider supports this option).

3. At our own infrastructure layer, prompt and output are retained only
   in audit logs (for compliance and troubleshooting), retention period
   [X] days, accessible only to authorized personnel.

4. We can provide the following upon customer due diligence request:
   - Upstream provider DPA / BAA
   - Our SOC 2 Type II report (if applicable)
   - Complete sub-processor list

For more details, contact compliance@[domain].
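
The bracketed placeholders in the response above can be filled programmatically so that no sales reply goes out with a blank left in it. The sketch below uses Python’s standard string.Template on a shortened version of the answer; render_answer and its parameter names are hypothetical, and the wording is abbreviated for illustration.

```python
from string import Template
from typing import Sequence

# Shortened version of the standard answer; $names replace the [bracketed] slots.
ANSWER = Template(
    "A: No. Our upstream model providers ($providers) commit by default not to "
    "use API input for model training, and we have signed $contract_name with "
    "them. Prompts and outputs are kept only in our audit logs for "
    "$retention_days days. For more details, contact compliance@$domain."
)


def render_answer(
    providers: Sequence[str], contract_name: str, retention_days: int, domain: str
) -> str:
    """Fill every placeholder; substitute() raises KeyError if one is missing."""
    if not providers:
        raise ValueError("at least one upstream provider must be listed")
    if retention_days < 0:
        raise ValueError("retention_days must be non-negative")
    return ANSWER.substitute(
        providers=" / ".join(providers),
        contract_name=contract_name,
        retention_days=retention_days,
        domain=domain,
    )
```

Using substitute rather than safe_substitute is deliberate: an unfilled slot fails loudly instead of reaching the customer.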

Self-hosted / private model trade-offs

Some customers (finance, government, healthcare) require fully private deployment — model weights inside customer infrastructure, prompts never leave the customer boundary.

Trade-offs:

Dimension	Public API (upstream provider)	Private deployment (self-hosted model / customer infra)
Compliance boundary	Cross-organization (multiple sub-processors)	Single boundary (customer’s own)
Cost	Low unit cost (shared infrastructure)	High unit cost (dedicated infra + model license)
Model capability	Always current (provider continuous upgrade)	Lagging (self-hosted updates slow)
Deployment complexity	Low (API call)	High (requires GPU cluster, model management)
Suitable customers	Most enterprise customers	Top-compliance customers, government, sensitive industries

Practical strategy:

Default product offering: public API with strict compliance configuration (suits roughly 80% of customers)
Enterprise upsell: private deployment option (suits the remaining 20%, priced at 5-10× the standard contract)
Do not proactively push private deployment to all customers: the cost and operational burden are large, so offer it only when a customer explicitly requires it

Cross-section connections

Data residency physical location choices: overview
Private deployment impact on unit economics: metrics/unit-economics
Compliance-customer pricing tier: pricing/tier-design