Data Feedback and Training

Whether upstream model providers retain prompts and outputs for training: the default terms of three major providers, how to disable data feedback, and how to prove it to customers.

Problem overview

Every agent call to an upstream model API (Anthropic, OpenAI, etc.) sends customer prompt content and agent output through the provider’s servers. Unless its terms say otherwise, the provider may retain this data to train future models. That is a direct threat to customer compliance and a recurring hard topic in vendor contracts and sales.

Concerns across compliance scenarios:

  • GDPR: personal data may not be reused for a new purpose without a lawful basis (typically explicit consent); training use without customer authorization is a violation
  • HIPAA: protected health information may not be used for secondary purposes such as model training without authorization
  • Enterprise trade secrets: customer prompts may contain unpublished strategy, contract clauses, or customer information, so provider retention means potential leakage
  • Industry-specific: finance, legal, and government customers have similar but stricter restrictions

Three major providers’ default terms

The table below reflects major providers’ default terms at the time of writing (early 2026). Terms change frequently; always verify the latest version before signing contracts.

Provider | Standard API default | Enterprise plan | Methods to disable training use
Anthropic | Default not used for training (revised) | Same + DPA + BAA | Default already satisfies; extra needs via Enterprise contract
OpenAI | API default not used for training (since 2023) | ZDR (Zero Data Retention) option + DPA + BAA | Apply for ZDR; 30-day default retention reducible to 0
Google (Vertex AI) | Default not used for training (enterprise accounts) | Data residency region options + detailed DPA | Default already satisfies
Self-hosted models | Fully controlled | Custom | No external feedback

Key trend changes:

  • In 2022-2023, default terms were “may be used for training”; pushback from the enterprise market led providers to revise them to “not used for training by default”
  • Terms can still change at any time, so a written commitment at contract signing is safer than relying on default settings
  • Consumer products such as ChatGPT follow different terms from the API: user conversations are used for training by default unless the user opts out, while business tiers such as ChatGPT Enterprise exclude customer data from training by default

Three layers to cut feedback

Layer 1: contract terms

Sign enterprise contracts with upstream model providers rather than relying on default API terms. The contract should explicitly cover:

  • “Provider shall not use Customer Data to train, fine-tune, or otherwise improve any AI models.”
  • “Provider shall not retain Customer Data beyond [X] hours for operational purposes.”
  • Add DPA (Data Processing Agreement) — GDPR requirement
  • For healthcare add BAA (Business Associate Agreement) — HIPAA requirement
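
The checklist above can be tracked mechanically. The following is a minimal sketch, assuming a hypothetical internal record of what each upstream contract contains; the clause labels (`no_training`, `retention_limit`) and the regime-to-agreement mapping are illustrative names, not a legal standard.

```python
from dataclasses import dataclass

# Hypothetical mapping from compliance regime to the agreement listed above.
REQUIRED_AGREEMENTS = {
    "gdpr": {"DPA"},
    "hipaa": {"BAA"},
}

# Clauses every upstream contract should carry (illustrative labels).
BASELINE_CLAUSES = {"no_training", "retention_limit"}


@dataclass(frozen=True)
class ProviderContract:
    provider: str
    clauses: frozenset     # clause labels present in the signed contract
    agreements: frozenset  # signed add-on agreements, e.g. {"DPA"}


def missing_items(contract, regimes):
    """Return the clauses and agreements still missing for the given regimes."""
    required = set()
    for regime in regimes:
        required |= REQUIRED_AGREEMENTS.get(regime.lower(), set())
    missing = (BASELINE_CLAUSES - contract.clauses) | (required - contract.agreements)
    return sorted(missing)
```

For example, a contract with the no-training clause and a DPA, checked against GDPR and HIPAA, reports that the BAA and the retention clause are still outstanding.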

Layer 2: technical switches

API options provided by upstream:

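  • OpenAI: apply for ZDR; once approved it is enabled at the account level rather than through a per-request parameter (confirm the current mechanism in OpenAI’s documentation)
  • Anthropic: API data is not used for training by default; retention limits are agreed in the enterprise contract, with a signed DPA confirming the terms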
  • Vertex AI: account-level data control settings

Use technical switches and contract terms together: the contract provides legal protection, and the switch provides technical enforcement.
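
One way to keep the two layers in step is a guard in your own code that refuses to call a provider whose settings have not been confirmed. This is a sketch under assumptions: `ProviderPolicy` is a hypothetical internal record that you maintain by hand from the contract and account settings, and it does not query any provider API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ProviderPolicy:
    name: str
    training_opt_out_confirmed: bool  # confirmed in contract or account settings
    retention_days: int               # retention the provider has committed to


class PolicyViolation(Exception):
    """Raised when a provider does not meet the required data-handling policy."""


def assert_allowed(policy, max_retention_days):
    """Raise PolicyViolation unless the provider meets both requirements."""
    if not policy.training_opt_out_confirmed:
        raise PolicyViolation(f"{policy.name}: training opt-out not confirmed")
    if policy.retention_days > max_retention_days:
        raise PolicyViolation(
            f"{policy.name}: retention {policy.retention_days}d exceeds "
            f"limit {max_retention_days}d"
        )
```

Calling `assert_allowed` immediately before each upstream request turns a contract gap into a visible failure instead of a silent one.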

Layer 3: audit verification

How do you prove the contract terms and technical switches to the customer?

  • Upstream providers typically do not directly issue certificates to “your customers” (your customer is not their direct customer)
  • Practical approach: you (the vendor) hold the upstream provider’s contracts / certifications, and prove the sub-processor relationship to customers
  • Customer contracts include sub-processor list + data usage terms for each sub-processor
  • Provider compliance certifications (SOC2, HIPAA-eligible, etc.) serve as indirect endorsement
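
The sub-processor list mentioned above can be kept as structured data and rendered into the contract appendix. The sketch below assumes a hypothetical `SubProcessor` record; the field names and output format are illustrative, and the entries in any real list must come from your signed agreements.

```python
from dataclasses import dataclass, field


@dataclass
class SubProcessor:
    name: str
    purpose: str
    used_for_training: bool
    retention: str
    certifications: list = field(default_factory=list)


def render_subprocessor_list(entries):
    """Render one line per sub-processor, sorted by name."""
    lines = []
    for e in sorted(entries, key=lambda e: e.name.lower()):
        certs = ", ".join(e.certifications) or "none listed"
        training = "yes" if e.used_for_training else "no"
        lines.append(
            f"{e.name}: purpose={e.purpose}; training use={training}; "
            f"retention={e.retention}; certifications={certs}"
        )
    return "\n".join(lines)
```

Keeping the list in one place means the customer-facing appendix and the internal record cannot drift apart.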

Standard response to customers

Enterprise customer due diligence checklists almost always include this question. Prepare a standard response to avoid answering from scratch in every sales conversation:

Q: Will our data be used to train AI models?

A: No. Specifically:

1. Our upstream model providers ([list, e.g., Anthropic / OpenAI / ...])
   commit by default not to use API input for model training. We have
   signed enterprise-grade contracts with providers ([contract name])
   that further clarify this clause.

2. We enable Zero Data Retention (ZDR) configuration at the API call
   layer (where the provider supports this option).

3. At our own infrastructure layer, prompt and output are retained only
   in audit logs (for compliance and troubleshooting), retention period
   [X] days, accessible only to authorized personnel.

4. We can provide the following upon customer due diligence request:
   - Upstream provider DPA / BAA
   - Our SOC2 Type II report (if applicable)
   - Complete sub-processor list

For more details, contact compliance@[domain].
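
Item 3 of the response commits to a fixed retention period for your own audit logs, which only holds if something enforces it. A minimal sketch, assuming audit records are dictionaries with a timezone-aware `created_at` field (a hypothetical schema); a real system would delete from its log store instead of filtering a list.

```python
from datetime import datetime, timedelta, timezone


def purge_expired(records, retention_days, now=None):
    """Return only the records that are still inside the retention window."""
    if now is None:
        now = datetime.now(timezone.utc)
    cutoff = now - timedelta(days=retention_days)
    # Keep records created at or after the cutoff; older ones are dropped.
    return [r for r in records if r["created_at"] >= cutoff]
```

Running this on a schedule, with `retention_days` set to the value quoted to customers, keeps the stated policy and the actual behavior aligned.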

Self-hosted / private model trade-offs

Some customers (finance, government, healthcare) require fully private deployment — model weights inside customer infrastructure, prompts never leave the customer boundary.

Trade-offs:

Dimension | Public API (upstream provider) | Private deployment (self-hosted model / customer infra)
Compliance boundary | Cross-organization (multiple sub-processors) | Single boundary (customer’s own)
Cost | Low unit cost (shared infrastructure) | High unit cost (dedicated infra + model license)
Model capability | Always current (provider continuous upgrade) | Lagging (self-hosted updates slow)
Deployment complexity | Low (API call) | High (requires GPU cluster, model management)
Suitable customers | Most enterprise customers | Top-compliance customers, government, sensitive industries

Practical strategy:

  • Default product offering: public API + strict compliance configuration (suits 80% of customers)
  • Enterprise upsell: private deployment option (suits remaining 20%, unit price 5-10× standard contract)
  • Do not proactively push private deployment to all customers; its cost and operational burden are large, so offer it only when a customer explicitly requires it
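
The strategy above reduces to a simple rule plus a price multiplier. The sketch below is illustrative only: the function names are hypothetical, and the 5-10× range is taken directly from the text rather than from any pricing data.

```python
from enum import Enum


class Deployment(Enum):
    PUBLIC_API = "public API + strict compliance configuration"
    PRIVATE = "private deployment"


def choose_offering(explicitly_requires_private):
    """Offer private deployment only when the customer explicitly requires it."""
    if explicitly_requires_private:
        return Deployment.PRIVATE
    return Deployment.PUBLIC_API


def private_price_range(standard_contract_price):
    """Return the (low, high) private-deployment price at 5-10x the standard contract."""
    return (5 * standard_contract_price, 10 * standard_contract_price)
```

For a standard contract priced at 100 units, `private_price_range(100)` gives a range of 500 to 1000 units.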
