# We Built a Dual-Served Content System for AI Crawlers

How we reduced payload size by 97% for AI crawlers while serving the same URL. A technical deep-dive into serving markdown to GPTBot, ClaudeBot, and Perplexity while humans see the full React site.

**Published:** January 31, 2026
**Author:** Sam Hogan

---

AI agents are not rendering your UI. They are not interacting with components or layouts. They are parsing text, structure, and intent. When you send them a server-rendered React page with its full hydration payload, they have to sift through JavaScript, CSS, navigation, and layout wrappers to extract a few kilobytes of actual content.

We wanted to remove that friction.

Around the same time, we noticed a small group of teams experimenting with machine-first representations of web pages. [Parallel](https://parallel.ai) stood out. Their approach made a simple point clear: the web humans see and the web machines need do not have to be the same thing.

So we built a system that serves both.

- Humans see the full React site.
- AI crawlers receive clean, structured markdown.
- The URL never changes.

**By doing this, we reduced the payload a crawler receives by 97%. Our homepage went from 177KB of HTML to 4KB of markdown.**

---

## The Problem With Modern HTML for Crawlers

A typical modern page includes:

- Client-side JavaScript
- Layout wrappers
- Navigation and footer content
- Tracking scripts
- Styling and UI primitives

Out of roughly 177KB of HTML, only a small fraction contains meaningful content.

AI crawlers gain nothing from the rest. They need clear headings, structured sections, and explicit language. Everything else is overhead.

Rather than hoping crawlers infer the important parts correctly, we decided to deliver the content directly in a format designed for them.

---

## The Architecture

The system has four layers:

1. **Middleware** intercepts every request and checks the User-Agent
2. **Rewrite** (not redirect) the request to `/api/md/[path]` when the User-Agent is an AI bot
3. **Markdown API** resolves the path to content
4. **Content** is served from a colocated `machine.md` or cleaned MDX

No redirects. No duplicate URLs. No changes to how humans browse the site.

---

## Layer 1: Bot Detection in Middleware

Every request passes through middleware. We inspect the User-Agent and compare it against a known list of AI crawlers.

Examples include `GPTBot`, `ClaudeBot`, `PerplexityBot`, `Google-Extended`, and `Applebot-Extended`.

If the request targets a content page and the User-Agent matches an AI crawler, we rewrite the request to `/api/md/[path]`.

We also allow humans to request markdown explicitly with `?format=md`.

This is a **rewrite**, not a redirect. The URL stays the same. The response changes.
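
The detection itself reduces to a substring check against a known list. The sketch below is illustrative, not our exact production code: the crawler list mirrors the examples above, and the middleware wiring is shown in comments since it depends on Next.js request types.

```typescript
// Illustrative crawler list; real user-agent strings change over time,
// so this list needs periodic maintenance.
const AI_CRAWLERS = [
  "GPTBot",
  "ClaudeBot",
  "PerplexityBot",
  "Google-Extended",
  "Applebot-Extended",
];

export function isAiCrawler(userAgent: string | null): boolean {
  if (!userAgent) return false;
  return AI_CRAWLERS.some((bot) => userAgent.includes(bot));
}

// Inside Next.js middleware, a match triggers a rewrite (not a redirect):
//
//   export function middleware(req: NextRequest) {
//     const ua = req.headers.get("user-agent");
//     if (isAiCrawler(ua) || req.nextUrl.searchParams.get("format") === "md") {
//       return NextResponse.rewrite(new URL(`/api/md${req.nextUrl.pathname}`, req.url));
//     }
//   }
```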

---

## Layer 2: A Catch-All Markdown API

We created a single API route:

```
/api/md/[...path]
```

This route resolves markdown using a simple order:

1. Serve a colocated `machine.md` file if it exists
2. Handle dynamic content like blogs or guides
3. Generate a fallback for unknown pages

Responses are returned as `text/markdown`, cached aggressively, and marked `noindex` so the API endpoint itself is never indexed.
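
Concretely, the response headers might look like this. The values are a sketch of the policy described above, not our exact production config:

```typescript
// Headers for every markdown response: declare the type, cache hard,
// and keep the API route itself out of search indexes.
export function markdownHeaders(maxAgeSeconds = 86_400): Record<string, string> {
  return {
    "Content-Type": "text/markdown; charset=utf-8",
    // Long-lived browser and CDN cache; the markdown is effectively static.
    "Cache-Control": `public, max-age=${maxAgeSeconds}, s-maxage=${maxAgeSeconds}`,
    // noindex so /api/md/* never competes with the canonical HTML URL.
    "X-Robots-Tag": "noindex",
  };
}
```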

### Auto-Discovering machine.md Files

We did not want a large config mapping routes to files.

Instead, the API auto-discovers markdown colocated with pages.

For a route like `/pricing`, it checks multiple possible locations inside the app directory. If a `machine.md` file exists, it is served.

Adding a new page is simple. Create the page. Add a `machine.md`. Nothing else changes.
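
The lookup can be sketched as a pure function. The route-group names below are hypothetical; the real directory layout depends on how your app router is organized:

```typescript
// For a route like "/pricing", list the places a colocated machine.md
// could live inside the app directory.
export function machineMdCandidates(route: string): string[] {
  const segment = route.replace(/^\/+/, ""); // "" for the homepage
  const suffix = segment ? `${segment}/machine.md` : "machine.md";
  return [
    `app/${suffix}`,
    `app/(root)/${suffix}`,      // hypothetical route group
    `app/(marketing)/${suffix}`, // hypothetical route group
  ];
}

// The API serves the first candidate that exists on disk (e.g. via
// fs.existsSync), then falls through to MDX or the generated fallback.
```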

### Reusing Existing MDX Content

Blog posts and guides already lived in MDX.

We strip JSX at runtime by removing imports, exports, and custom components. What remains is clean markdown with normalized spacing.

This allows us to reuse existing content without maintaining parallel versions.
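
A rough sketch of that stripping step is below. Real MDX has edge cases (nested components, expressions spanning lines) that a regex pass like this will miss, so treat it as an illustration of the idea rather than a complete transform:

```typescript
// Strip MDX-specific syntax, leaving plain markdown with normalized spacing.
export function stripJsx(mdx: string): string {
  return mdx
    .replace(/^import\s.+$/gm, "")  // drop import statements
    .replace(/^export\s.+$/gm, "")  // drop export statements
    .replace(/<[A-Z][^>]*\/>/g, "") // drop self-closing custom components
    .replace(/<[A-Z][A-Za-z]*[^>]*>[\s\S]*?<\/[A-Z][A-Za-z]*>/g, "") // paired components
    .replace(/\n{3,}/g, "\n\n")     // collapse runs of blank lines
    .trim();
}
```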

---

## Layer 3: Writing Markdown for AI Consumption

Each important page has a `machine.md` file alongside its `page.tsx`.

```
app/
  (root)/
    page.tsx
    machine.md
    pricing/
      page.tsx
      machine.md
```

The markdown is intentionally plain.

**It includes:**

- A clear H1
- Structured H2 and H3 sections
- Explicit CTAs with full URLs
- Company and trust signals
- FAQ sections aligned with real queries

**It excludes:** navigation, styling, and interactive elements.

The goal is clarity, not presentation.
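
For illustration, a pricing `machine.md` following these rules might look like this (contents entirely hypothetical):

```markdown
# Pricing

Plain-language summary of what the product costs and who each plan is for.

## Plans

### Free
- What's included, stated explicitly

### Pro
- What's included, stated explicitly

## FAQ

### Can I cancel anytime?
A direct answer, written the way the question is actually asked.

Get started: https://www.searchable.com/pricing
```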

---

## Layer 4: Discovery Signals

We added two discovery signals for AI agents.

First, an `llms.txt` file at the root that lists canonical markdown endpoints for major pages.
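
The file follows the emerging llms.txt convention: an H1, a short blockquote summary, then sections of links. A minimal illustrative version:

```
# Searchable

> Machine-readable markdown versions of our key pages.

## Pages

- [Home](https://www.searchable.com/api/md/)
- [Pricing](https://www.searchable.com/api/md/pricing)
- [Blog](https://www.searchable.com/api/md/blog)
```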

Second, an alternate link tag in the HTML head:

```html
<link
  rel="alternate"
  type="text/markdown"
  href="/api/md/pricing"
  title="Markdown version for AI agents"
/>
```

This makes the markdown version explicit and discoverable.

---

## The Result

For a typical page:

- **HTML served to humans:** ~177KB
- **Markdown served to AI crawlers:** ~4KB

The content stays the same. The overhead disappears.

Crawlers ingest faster.
Structure is clearer.
Outcomes are more predictable.

---

## Optional: Let Humans Inspect the Markdown

We added a simple toggle that fetches content directly from `/api/md/[path]`.

This lets humans see exactly what AI crawlers receive. It also keeps us honest. If the markdown reads poorly, the structure needs work.

---

## Why This Matters Long Term

This pattern is not a one-off optimization. It is where the internet is heading.

We are moving toward a version of the web that agents can read efficiently. Not render. Not hydrate. Not execute. Just read.

Large language models operate under token constraints. Every extra byte of markup, script, or layout noise consumes budget without adding meaning. When a crawler has to process 177KB of HTML to extract 4KB of content, that inefficiency compounds across millions of pages.

A simpler, machine-readable layer solves that.

Markdown works because it is:

- **Explicit**
- **Structured**
- **Cheap to parse**
- **Cheap to reason over**
- **Close to the plain-text formats LLMs are trained on**

What we are effectively doing is separating presentation from meaning. Humans get the interface. Machines get the content.

That separation becomes more important as AI systems move from passive discovery to active recommendation.

### Preparing for AI Discovery and Ads

This also matters as monetization enters the picture.

OpenAI has been explicit about its direction with advertising. Their approach focuses on sponsored placements that coexist with organic answers rather than replacing them. You can read their [full position here](https://openai.com/index/our-approach-to-advertising-and-expanding-access/).

As systems like ChatGPT begin blending ads, citations, and organic recommendations, content clarity becomes a competitive advantage.

If your page is difficult to parse, ambiguous in structure, or bloated with noise, it becomes harder for AI systems to confidently reference, summarize, or recommend it. That affects both organic visibility and how your brand shows up alongside paid placements.

**Machine-friendly content does a few things well:**

- It makes intent unambiguous
- It reduces token waste
- It improves summarization accuracy
- It increases confidence when models choose what to cite or recommend

You are not optimizing for a crawler anymore. You are optimizing for a reasoning system.

Serving clean markdown alongside your human UI is one way to prepare for that shift without compromising the experience for either side.

---

## Key Takeaways

1. **Use middleware rewrites, not redirects** — Serve different representations at the same URL
2. **Colocate machine.md with page.tsx** — Scales naturally as you add pages
3. **Auto-discover files instead of hardcoding routes** — No config to maintain
4. **Strip JSX from MDX to reuse existing content** — Blog posts work automatically
5. **Expose markdown through explicit discovery signals** — `llms.txt` and `<link rel="alternate">`
6. **Cache markdown aggressively** — It's static; treat it that way

---

AI crawlers are now a primary interface to the web. Designing content delivery with that in mind changes how pages should be built.

---

[Back to Blog](https://www.searchable.com/blog) | [Searchable Homepage](https://www.searchable.com)
