# Where Does ChatGPT Get Its Data? LLM Training Data Sources Explained

ChatGPT's training data comes from Common Crawl, books, code, and licensed content. Learn what feeds into LLM responses and why it matters for your brand.

**Published:** April 10, 2026
**Author:** David Thomas

---

ChatGPT is trained on publicly available internet content, licensed third-party datasets, and data created or reviewed by human trainers. GPT-4, OpenAI's large multimodal model, was reportedly trained on roughly 13 trillion tokens of text and code (The Decoder, July 2023). To put that in perspective, if you read non-stop at 250 words per minute, it would take you over 70,000 years to get through it all. And that's just one model from one company.
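
That figure checks out on the back of an envelope, assuming the common rule of thumb of roughly 0.75 English words per token:

```python
# Back-of-the-envelope check: reading 13 trillion tokens at 250 wpm.
TOKENS = 13e12          # reported GPT-4 training-set size
WORDS_PER_TOKEN = 0.75  # rough rule of thumb for English text
WPM = 250               # brisk adult reading speed

minutes = TOKENS * WORDS_PER_TOKEN / WPM
years = minutes / (60 * 24 * 365)
print(f"{years:,.0f} years of non-stop reading")  # ~74,000 years
```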

So where does ChatGPT actually get its data? If you're a business owner or marketer trying to understand why AI recommends certain brands and ignores others, the training data is where the story starts. If your brand isn't represented in those training sources, you're essentially invisible to one of the fastest-growing discovery channels in the world.


## Why Training Data Matters for Your Brand

ChatGPT's training data determines which brands, products, and ideas the model "knows" about. If your business isn't well-represented across the trusted sources that feed into LLM training, AI tools are unlikely to recommend you, no matter how good your product is.

Research from Seer Interactive (2025) shows that LLMs tend to reflect information that appears consistently across multiple independent sources. A single glowing review on your own website carries little influence. What matters is whether your brand appears across third-party directories, publications, forums, and review platforms: the same types of content that form the backbone of LLM training datasets. This is the core of answer engine optimisation (AEO): getting into the answers that ChatGPT, Gemini, and Perplexity generate, not just ranking on Google.

## The Five Layers of ChatGPT's Knowledge


**Pre-training** is the model's education. Common Crawl, a non-profit archive of the public web, contributes over 250 billion pages totalling roughly 468 terabytes. OpenAI also draws from digitised books, academic papers, and GitHub code repositories. The original ChatGPT model was trained on approximately 570GB of text, though newer versions use vastly more.
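
Common Crawl publishes a queryable index of every crawl, so you can check whether your own pages have been captured. A minimal sketch against the public CDX index API; the crawl ID and domain below are examples, and current crawl IDs are listed at index.commoncrawl.org:

```python
# Ask Common Crawl's CDX index whether a domain's pages were captured.
import json
import urllib.request

CRAWL_ID = "CC-MAIN-2024-10"  # example crawl; see index.commoncrawl.org for current IDs
url = (f"https://index.commoncrawl.org/{CRAWL_ID}-index"
       "?url=example.com/*&output=json&limit=5")

with urllib.request.urlopen(url) as resp:
    for line in resp:
        record = json.loads(line)  # one JSON object per captured URL
        print(record["url"], record["status"])
```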

**Fine-tuning** uses much smaller, curated datasets to teach the model how to follow instructions and format responses. The goal isn't to add new knowledge; it's to turn a raw text predictor into something useful. **RLHF** (reinforcement learning from human feedback) then layers human judgement on top: reviewers rank responses by accuracy, clarity, and safety, and the model learns to favour those qualities. Neither stage adds new facts; both shape how the model communicates.
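
To make "curated datasets" concrete, this is the shape of a single chat-format fine-tuning example in the style OpenAI's fine-tuning API accepts; the content itself is invented:

```python
# One record of a chat-format fine-tuning file (content invented for illustration).
# The training file is JSON Lines: one object like this per line.
import json

example = {
    "messages": [
        {"role": "system", "content": "You are a concise support assistant."},
        {"role": "user", "content": "How do I reset my password?"},
        {"role": "assistant", "content": "Go to Settings > Security > Reset password."},
    ]
}
print(json.dumps(example))
```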

**Browsing** (available to Plus and Enterprise users) uses retrieval-augmented generation (RAG): the model searches the web, reads results, and incorporates them into its response, extending its reach beyond the training cutoff. **System prompts** are hidden instructions at the start of each conversation that shape the model's behaviour, allowing developers to specialise it for specific tasks.
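
In outline, a retrieval-augmented pipeline is only a few steps. A minimal sketch; `web_search` and `llm_complete` are hypothetical placeholders, not real library calls:

```python
# Minimal RAG loop: retrieve fresh documents, then generate with them in context.

def web_search(query: str, k: int = 3) -> list[str]:
    """Hypothetical placeholder: return the text of the top-k web results."""
    raise NotImplementedError

def llm_complete(prompt: str) -> str:
    """Hypothetical placeholder: call a language model of your choice."""
    raise NotImplementedError

def answer_with_rag(question: str) -> str:
    docs = web_search(question)                       # 1. retrieve
    context = "\n\n".join(docs)                       # 2. assemble context
    prompt = (f"Answer using only the sources below.\n\n"
              f"Sources:\n{context}\n\nQuestion: {question}")
    return llm_complete(prompt)                       # 3. generate
```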

## How Big Is ChatGPT's Training Dataset?

GPT-4 was trained on trillions of tokens drawn from web crawls, books, and code. Training it cost over $100 million, according to OpenAI CEO Sam Altman. Common Crawl's raw archive contains at least 100 trillion tokens, though heavy filtering brings usable data down significantly: NVIDIA's filtered version holds 6.3 trillion tokens; HuggingFace's FineWeb holds 15 trillion. The model itself has an estimated 1.76 trillion parameters in a Mixture of Experts architecture, where only a fraction activate for any given query.
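
Taking those figures at face value, the retention rates show how aggressive the cleanup is:

```python
# Share of Common Crawl's raw tokens that survives quality filtering (figures above).
RAW = 100e12        # >= 100 trillion raw tokens
FILTERED = {"NVIDIA": 6.3e12, "FineWeb": 15e12}

for name, kept in FILTERED.items():
    print(f"{name}: {kept / RAW:.1%} of raw tokens retained")
# NVIDIA: 6.3%, FineWeb: 15.0% -- most of the raw web never makes it in
```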


## What ChatGPT Doesn't Know

Despite its massive training set, ChatGPT has significant blind spots. It has no access to real-time information unless browsing is enabled, cannot read paywalled or private content, and has a hard knowledge cutoff beyond which it has no training data.

ChatGPT 5.2 (released December 2025) has a knowledge cutoff of 31 August 2025 (ALLMO, 2026). Older versions have earlier cutoffs; GPT-4o's training stopped in October 2023. When gaps exist, ChatGPT doesn't always say "I don't know." As researchers noted in a paper indexed on PubMed Central, the model "predicts the next token, not the next fact" (PMC, 2024), so it can fill gaps with plausible-sounding but fabricated information.


## The Copyright Question: Who Owns Training Data?

The legal ownership of AI training data is one of the most contested questions in technology right now. As of October 2025, there were 51 active copyright lawsuits against AI companies, and no court is expected to rule on the core fair use question before summer 2026 (ChatGPT Is Eating the World, October 2025).

The highest-profile case is The New York Times v. OpenAI, filed in December 2023. The Times alleges OpenAI used millions of its copyrighted articles to train GPT models without permission. In January 2026, a federal judge ordered OpenAI to produce 20 million anonymised ChatGPT logs as part of discovery.


OpenAI has also signed licensing deals directly with content platforms, including Reddit and Stack Overflow; the Stack Overflow agreement gives OpenAI access to 15 years of developer Q&A data. Not everyone was happy: some Stack Overflow users deleted their top-rated answers in protest, only to have those posts restored and their accounts banned. For businesses thinking about LLM visibility, these deals signal a clear trend: the era of scraping everything freely is closing, and which content makes it into future training datasets will increasingly depend on licensing arrangements.
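
One practical consequence: OpenAI's crawler, GPTBot, honours robots.txt, so you can check in a few lines whether your site is open to it. A quick sketch using Python's standard library, with example.com standing in for your own domain:

```python
# Check whether a site's robots.txt permits OpenAI's GPTBot crawler.
from urllib.robotparser import RobotFileParser

rp = RobotFileParser("https://example.com/robots.txt")  # your domain here
rp.read()
print("GPTBot allowed:", rp.can_fetch("GPTBot", "https://example.com/"))
```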

## How Training Data Shapes AI Search Recommendations

LLMs don't evaluate sources the way humans do. They look for patterns and consensus across their training data. When multiple independent, authoritative sources mention your brand in a positive context, the model learns that association.
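
As a toy illustration of that consensus effect, imagine tallying brand mentions across independent sources; the brands and sources below are invented:

```python
# Toy illustration of "consensus": brand mentions tallied across independent sources.
from collections import Counter

sources = {
    "industry_blog":   ["AcmeCRM", "ZetaCRM"],
    "review_platform": ["AcmeCRM", "NovaCRM"],
    "reddit_thread":   ["AcmeCRM"],
    "trade_magazine":  ["ZetaCRM", "AcmeCRM"],
}

mentions = Counter(brand for brands in sources.values() for brand in brands)
print(mentions.most_common())
# [('AcmeCRM', 4), ('ZetaCRM', 2), ('NovaCRM', 1)] -- the pattern a model absorbs
```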


This is the core principle behind AI search optimisation. It's not about gaming the system. It's about making sure your brand is genuinely well-represented across the open web, in the kinds of places that training data crawlers actually index.



## Conclusion

Where ChatGPT gets its data, from Common Crawl's 250 billion web pages to licensed deals with Reddit and Stack Overflow, directly shapes its answers to millions of user queries every day. For businesses, this creates both a risk and an opportunity. If your brand isn't well-represented across the open, crawlable web, you're leaving your AI visibility to chance.

The brands that show up in ChatGPT's recommendations aren't there by accident. They've built a consistent presence across trusted, independent sources: industry publications, review platforms, forums, and structured directories. Understanding where the data comes from is the first step. Making sure your brand appears in those sources is what actually moves the needle.


---

[Back to Blog](https://www.searchable.com/blog) | [Searchable Homepage](https://www.searchable.com)
