What is the best Markdown flavor for AI?

GitHub Flavored Markdown (GFM) is the most widely used and best-supported flavor. It builds on CommonMark, which already includes fenced code blocks with language hints, and adds tables, task lists, strikethrough, and autolinks. Most LLMs have seen a great deal of GFM in their training data, so they handle it well.

Does it matter which headings I use?

Yes. Heading hierarchy gives models structural context for the content beneath each heading. An H2 under an H1 is a sub-topic of the H1 content. Keeping this hierarchy intact rather than normalizing all headings to the same level gives the model better context for retrieval and generation tasks.

How should I handle very long Markdown documents?

Split at heading boundaries rather than at arbitrary character limits. A chunk that starts at a heading usually contains a coherent unit of content. Add the page URL and heading path as metadata to each chunk so you can trace retrieved content back to its source. For very long single sections with no sub-headings, split at paragraph boundaries and aim for chunks of 500 to 1,500 tokens.
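For the long-section case, here is a minimal sketch of paragraph-boundary splitting. It assumes paragraphs are separated by blank lines and uses the rough four-characters-per-token estimate instead of a real tokenizer. The function name `split_paragraphs` and its parameters are illustrative, not part of any library.

```python
def split_paragraphs(section: str, max_tokens: int = 1500, chars_per_token: int = 4) -> list[str]:
    """Pack whole paragraphs into chunks of roughly max_tokens (estimated)."""
    max_chars = max_tokens * chars_per_token
    chunks: list[str] = []
    current: list[str] = []
    current_len = 0

    for paragraph in section.split("\n\n"):
        paragraph = paragraph.strip()
        if not paragraph:
            continue
        # Start a new chunk if adding this paragraph would exceed the budget.
        # A single paragraph longer than the budget is kept whole, not cut.
        if current and current_len + len(paragraph) > max_chars:
            chunks.append("\n\n".join(current))
            current, current_len = [], 0
        current.append(paragraph)
        current_len += len(paragraph)

    if current:
        chunks.append("\n\n".join(current))
    return chunks


# Usage: ten long paragraphs packed into chunks of about 500 estimated tokens
section = "\n\n".join(f"Paragraph {i}. " + "word " * 200 for i in range(10))
for chunk in split_paragraphs(section, max_tokens=500):
    print(len(chunk) // 4, "estimated tokens")
```

Because paragraphs are never cut, chunk sizes are uneven. For a hard upper bound, measure with the tokenizer that matches your model instead.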

Is there a token count formula I can use?

As a rough rule, one token is approximately four characters in English text. A 10,000 character Markdown document is roughly 2,500 tokens. For precise counts, use a tokenizer matching your model: tiktoken for OpenAI models, the Hugging Face tokenizers library for open-source models. Always measure on your actual content rather than relying on estimates.
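A minimal sketch of that rule of thumb. The function name `estimate_tokens` and the `chars_per_token` parameter are illustrative, and the result is only an estimate; a real tokenizer will give different counts, especially for code or non-English text.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Rough token estimate for English text: characters divided by four."""
    return round(len(text) / chars_per_token)


document = "x" * 10_000  # stand-in for a 10,000 character Markdown document
print(estimate_tokens(document))  # 2500
```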

June 12, 2026 · 11 min read

HTML vs Markdown for AI: which format is better for LLMs?

Joel Olawanle

If you are feeding web content into a large language model, whether for RAG, fine-tuning, or in-context retrieval, the format you use matters more than most people realise when they first start building.

The obvious choice is to grab the raw HTML from a page and pass it in. It is what the browser receives. It contains everything. Why not use it?

The problem is that "everything" includes a lot of things that are not the content you care about. Navigation menus, script tags, CSS classes, cookie banners, ad slots, footer links repeated across every page on the site. These take up tokens and contribute nothing to what the model learns or retrieves.

This article looks at what raw HTML actually contains versus what Markdown looks like for the same page, how the token count compares, what each format preserves and loses, and which one to use depending on what you are building.

What is actually in raw HTML

Take a typical news article page. Before any of the article text appears, a raw HTTP response from a modern website might look something like this:

<!DOCTYPE html>
<html lang="en">
<head>
  <meta charset="UTF-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0">
  <title>Article Title | Site Name</title>
  <meta name="description" content="...">
  <link rel="preconnect" href="https://fonts.googleapis.com">
  <link rel="stylesheet" href="/static/css/main.a4f2e1.css">
  <link rel="stylesheet" href="/static/css/vendor.9b3c4d.css">
  <script>
    window.__INITIAL_STATE__ = {"user":null,"theme":"light","locale":"en-US",...}
  </script>
  <script async src="https://www.googletagmanager.com/gtm.js?id=GTM-XXXXX"></script>
  <!-- 40 more lines of head content -->
</head>
<body>
  <div id="cookie-banner" class="fixed bottom-0 w-full bg-gray-800 text-white p-4 z-50">
    <p>We use cookies to improve your experience. By continuing, you agree to our
    <a href="/privacy">Privacy Policy</a></p>
    <button id="accept-cookies" class="btn btn-primary ml-4">Accept All</button>
    <button id="reject-cookies" class="btn btn-secondary ml-2">Reject</button>
  </div>

  <nav class="navbar" role="navigation" aria-label="main navigation">
    <div class="navbar-brand">
      <a class="navbar-item" href="/">
        <img src="/images/logo.svg" alt="Site Logo" width="120" height="40">
      </a>
    </div>
    <div class="navbar-menu">
      <div class="navbar-start">
        <a class="navbar-item" href="/news">News</a>
        <a class="navbar-item" href="/tech">Technology</a>
        <a class="navbar-item" href="/science">Science</a>
        <!-- many more nav items -->
      </div>
    </div>
  </nav>

  <main>
    <article class="post-content" itemscope itemtype="https://schema.org/Article">
      <h1 class="post-title" itemprop="headline">The actual article title</h1>
      <div class="post-meta">
        <span class="author" itemprop="author">Author Name</span>
        <time class="published" itemprop="datePublished" datetime="2026-05-01">
          May 1, 2026
        </time>
      </div>
      <div class="post-body" itemprop="articleBody">
        <p>The actual article content starts here...</p>
      </div>
    </article>
  </main>

  <aside class="sidebar">
    <!-- related articles, ads, newsletter signup -->
  </aside>

  <footer>
    <!-- site map, legal links, social icons repeated on every page -->
  </footer>

  <script src="/static/js/main.chunk.js"></script>
  <script src="/static/js/vendors.chunk.js"></script>
  <!-- analytics, tracking, a/b testing scripts -->
</body>
</html>

The actual article content is a small fraction of this. Everything else (navigation, cookie banner, schema markup, script tags, CSS class names, sidebar, footer, analytics scripts) is boilerplate that repeats across every page on the site. The model has to process all of those tokens before it reaches anything meaningful.

What Markdown looks like for the same page

After stripping the HTML and extracting just the content, the same page as Markdown looks like this:

# The actual article title

**Author Name** · May 1, 2026

The actual article content starts here. The paragraphs flow naturally
without any surrounding markup. Links stay as [link text](url).

## A section heading

Content under this heading. Lists stay as lists:

- First item
- Second item
- Third item

Code blocks keep their syntax:

    def example():
        return "clean and readable"

| Column 1 | Column 2 |
|----------|----------|
| Value    | Value    |

The navigation, the footer, the cookie banner, the script tags, and the CSS classes are all gone. What remains is the content and its structure.

The token difference

Research from the University of Science and Technology of China, studying a Common Crawl benchmark, found that raw HTML averaged around 74,000 tokens per document while Markdown averaged around 7,600. That is roughly a 90% reduction in token count.

To put that in practical terms: at those averages, a 128K-token context window holds one raw HTML page, with room for part of a second (128,000 / 74,000 ≈ 1.7). The same window holds about 16 pages as Markdown (128,000 / 7,600 ≈ 16.8). For RAG pipelines where you are trying to retrieve and reference as much relevant content as possible, that difference is significant.
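The arithmetic can be checked directly. The sketch below uses only the two averages quoted above and a 128K window. It ignores the tokens taken by the prompt and the response, and real pages vary widely around the average, so treat the results as orders of magnitude.

```python
CONTEXT_WINDOW = 128_000  # tokens
HTML_AVG = 74_000         # average tokens per raw HTML document (figure quoted above)
MARKDOWN_AVG = 7_600      # average tokens per Markdown document (figure quoted above)


def documents_that_fit(window: int, avg_tokens: int) -> int:
    """Number of whole average-sized documents that fit in the window."""
    return window // avg_tokens


reduction = 1 - MARKDOWN_AVG / HTML_AVG
print(f"Token reduction: {reduction:.0%}")                                     # Token reduction: 90%
print("Raw HTML documents:", documents_that_fit(CONTEXT_WINDOW, HTML_AVG))      # 1
print("Markdown documents:", documents_that_fit(CONTEXT_WINDOW, MARKDOWN_AVG))  # 16
```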

The token reduction comes from removing what HTML includes that Markdown does not:

<div>, <span>, and structural tags with no semantic meaning
CSS class names and inline style attributes
Script tags and their contents
Navigation menus, footers, sidebars repeated across every page
Schema markup (itemscope, itemtype, itemprop attributes)
Analytics, tracking, and A/B testing code
Cookie banners and modal overlays
Meta tags and head content

None of these contribute to what a model should learn from the page. They are engineering artifacts, not content.

What each format preserves

HTML preserves

The complete page structure including elements with no content value
All CSS class names and IDs (useful if you are analyzing UI patterns, not useful if you are analyzing content)
Schema.org markup and other structured metadata embedded in attributes
The exact rendering instructions for a browser
JavaScript code that runs on the page
Links including their href attributes, rel attributes, and any tracking parameters

Markdown preserves

Heading hierarchy (H1, H2, H3 mapped to #, ##, ###)
Paragraph structure and spacing
Lists, both ordered and unordered
Tables with header rows
Code blocks with language hints
Links with anchor text
Images with alt text
Bold and italic emphasis
Blockquotes

The key things Markdown keeps are the things that help a model understand the content: what is a heading, what is a list item, what is code, what is being emphasized. The things Markdown drops are the things that help a browser render a page but tell the model nothing about meaning.

When token count matters most

RAG pipelines

In a retrieval-augmented generation system, you typically embed chunks of content and retrieve the most relevant ones at query time. The retrieved chunks get injected into the model's context window before the question is answered.

If your chunks are made from raw HTML, a large portion of each chunk is boilerplate that is not relevant to any query. Navigation menus from a documentation site end up in your vector index. The model retrieves and reads these as if they might contain the answer.

Markdown chunks are denser. Each chunk contains more actual content per token. Retrieval is more accurate because the content signal is stronger relative to the noise. And you can fit more relevant chunks into the context window at query time without hitting the token limit.

Fine-tuning

When fine-tuning a model on web content, you want the model to learn from the text. Raw HTML trains it on HTML syntax as much as on content. A model fine-tuned on raw HTML learns that web pages have <nav> elements before <main> elements, that footers contain privacy policy links, and that articles are wrapped in <article class="post-content"> tags. That is not the knowledge you are after.

Markdown fine-tuning data has a much higher ratio of actual content to markup. The model learns from the text, not the scaffolding around it.

In-context retrieval

When you paste a webpage directly into an LLM prompt, every token counts toward the context limit. Raw HTML fills that limit quickly with content the model cannot use. Markdown lets you include substantially more of the actual page content within the same limit.

When HTML might be the better choice

HTML is the right format in specific circumstances:

Analyzing page structure itself. If you are building something that needs to understand how a page is laid out, which elements appear in which positions, how the navigation hierarchy is organized, then the HTML structure is the data. Markdown removes exactly the information you need.
Extracting metadata from attributes. Some structured data lives in HTML attributes rather than visible text. Schema.org itemprop values, Open Graph tags, canonical URLs, data attributes used by JavaScript. If you need this metadata, HTML preserves it and Markdown does not.
Legal or regulatory requirements. Some archiving and compliance use cases require the exact original document, not a converted representation.
Reproducing the rendered page. If your use case is about how the page looks in a browser rather than what it says, HTML is necessary.
DOM-level analysis. Any task that requires understanding the relationship between HTML elements, like accessibility auditing or layout analysis, needs the HTML.

Outside of these specific cases, Markdown is almost always the better choice for AI workloads.
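To make the metadata case concrete, here is a small sketch that reads attribute-level metadata with Python's standard `html.parser` module. The class name `MetadataExtractor` is illustrative. It only records Open Graph tags, the canonical URL, and which itemprop names appear; a real extractor would also need JSON-LD and the values attached to each itemprop.

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect metadata that lives in attributes rather than visible text."""

    def __init__(self) -> None:
        super().__init__()
        self.open_graph: dict[str, str] = {}
        self.canonical = None
        self.itemprops: list[str] = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        prop = attributes.get("property") or ""
        # Open Graph tags: <meta property="og:..." content="...">
        if tag == "meta" and prop.startswith("og:"):
            self.open_graph[prop] = attributes.get("content") or ""
        # Canonical URL: <link rel="canonical" href="...">
        if tag == "link" and attributes.get("rel") == "canonical":
            self.canonical = attributes.get("href")
        # Schema.org microdata: any element carrying an itemprop attribute
        if attributes.get("itemprop"):
            self.itemprops.append(attributes["itemprop"])


html = """
<head>
  <meta property="og:title" content="Article Title">
  <link rel="canonical" href="https://example.com/article">
</head>
<article itemscope itemtype="https://schema.org/Article">
  <h1 itemprop="headline">The actual article title</h1>
  <span itemprop="author">Author Name</span>
</article>
"""
extractor = MetadataExtractor()
extractor.feed(html)
extractor.close()
print(extractor.open_graph)  # {'og:title': 'Article Title'}
print(extractor.canonical)   # https://example.com/article
print(extractor.itemprops)   # ['headline', 'author']
```

None of these three values would survive a conversion to Markdown, which is why this kind of task needs the HTML.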

The quality of the Markdown matters

Not all Markdown converters produce equally clean output. The common failure modes are:

Not rendering JavaScript first. Many modern web pages render their content through JavaScript after the initial page load. A converter that makes a plain HTTP request and converts the HTML response gets an empty shell or partial content. The Markdown looks clean but is missing most of the page.
Including boilerplate that survived the conversion. A naive converter might keep all the link text from the navigation menu, all the footer links, and all the cookie banner text, just without the HTML tags around them. You get clean Markdown that still contains all the noise.
Losing table structure. HTML tables converted badly become flat lists of values with no indication of which column a value belongs to. Well-converted tables keep the header row and column alignment.
Losing code block language hints. <code class="language-python"> should become ```python, not just ``` or plain indented text.
Mangling heading hierarchy. An H3 that follows an H1 directly in the source should become ### in the Markdown. Converters that normalize all headings or rearrange the hierarchy make the structure harder for models to parse.

A good Markdown conversion renders the page in a real browser first, strips navigation and footers before converting, preserves heading hierarchy, keeps tables as proper Markdown tables, preserves code block language hints, and keeps link text while removing tracking parameters.
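The stripping step can be sketched with the standard library alone. The example below is an illustration, not how any particular converter works: it collects visible text while skipping everything nested inside a fixed set of tags. The class name `ContentTextExtractor` and the `SKIP_TAGS` set are assumptions made for the example. Boilerplate built from plain `<div>` elements, such as most cookie banners, needs site-specific rules on top of this.

```python
from html.parser import HTMLParser

# Elements whose contents are treated as boilerplate (an assumption for this sketch)
SKIP_TAGS = {"script", "style", "nav", "footer", "aside"}


class ContentTextExtractor(HTMLParser):
    """Collect visible text, ignoring anything nested inside SKIP_TAGS."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0  # greater than 0 while inside a skipped element
        self.parts: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        text = data.strip()
        if text and self.skip_depth == 0:
            self.parts.append(text)


html = """
<nav><a href="/news">News</a></nav>
<main><h1>The actual article title</h1><p>The actual article content starts here...</p></main>
<footer>Legal links</footer>
<script>window.tracking = true;</script>
"""
parser = ContentTextExtractor()
parser.feed(html)
parser.close()
print(parser.parts)
# ['The actual article title', 'The actual article content starts here...']
```

In a real pipeline you would remove these elements from the HTML and hand the remainder to a Markdown converter, so that headings, lists, and tables keep their structure instead of collapsing to plain text.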

Converting HTML to Markdown in practice

In Python

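import html2text
import requests

converter = html2text.HTML2Text()
converter.ignore_links = False   # keep links as [text](url)
converter.ignore_images = False  # keep images as ![alt](src)
converter.body_width = 0         # no line wrapping

# Plain HTTP fetch: returns only the initial HTML, with no JavaScript executed
response = requests.get("https://example.com/article", timeout=30)
response.raise_for_status()      # fail on 4xx/5xx instead of converting an error page
markdown = converter.handle(response.text)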

The problem with this approach is the one covered above. requests.get() only fetches the initial HTML response. On JavaScript-rendered sites you get an empty shell.

In Node.js

import TurndownService from 'turndown'

const turndownService = new TurndownService({
  headingStyle: 'atx',       // use # for headings
  codeBlockStyle: 'fenced',  // use ``` for code blocks
})

// htmlString holds the page HTML obtained elsewhere
const markdown = turndownService.turndown(htmlString)

The same limitation applies if the HTML string came from a plain HTTP request.

Using a browser-based converter

To get accurate Markdown from modern JavaScript-heavy pages, the conversion needs to happen after the page has rendered in a real browser. The simplest way is to use a tool that handles the browser part for you.

Spidra's Website to Markdown tool does this. Paste a URL, get back clean Markdown. The page loads in a real browser, JavaScript executes, and then the conversion runs on the actual rendered content. You can toggle off navigation, footers, ads, and cookie banners before converting.

For programmatic use, the API returns Markdown by default:

import time

import requests

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}

# Submit the scrape job
response = requests.post(
    f"{BASE_URL}/scrape",
    headers=HEADERS,
    json={
        "urls": [{"url": "https://example.com/article"}],
        "extractContentOnly": True,  # strips nav, footer, ads
    },
    timeout=30,
)
response.raise_for_status()
job = response.json()

# Poll every 2 seconds, giving up after 60 attempts instead of looping forever
for _ in range(60):
    status = requests.get(
        f"{BASE_URL}/scrape/{job['jobId']}",
        headers=HEADERS,
        timeout=30,
    ).json()
    if status["status"] == "completed":
        markdown = status["result"]["data"][0]["markdownContent"]
        break
    time.sleep(2)
else:
    raise TimeoutError("Scrape job did not complete in time")

print(markdown)

extractContentOnly is the important parameter here. It strips the boilerplate before returning the content, so the Markdown you get is the article text, not the article text plus the entire navigation and footer.

Chunking Markdown for vector databases

Once you have clean Markdown, splitting it for a vector database is more straightforward than splitting raw HTML.

The natural split points are heading boundaries. Content under an ## Introduction heading belongs together. Content under ## Installation belongs together. Splitting at these boundaries keeps semantically related content in the same chunk rather than arbitrarily cutting mid-paragraph.

import re

def chunk_by_headings(markdown: str, min_length: int = 200) -> list[dict]:
    """Split Markdown at H1-H3 heading boundaries.

    Sections shorter than min_length characters are dropped.
    """
    # Split before every line that starts with 1-3 '#' characters and a space.
    # Note: this also matches '# comment' lines inside code blocks.
    sections = re.split(r'\n(?=#{1,3} )', markdown)
    chunks = []

    for section in sections:
        section = section.strip()
        if len(section) < min_length:
            continue

        # extract the heading if present
        first_line = section.split('\n', 1)[0]
        heading = first_line.lstrip('#').strip() if first_line.startswith('#') else ''

        chunks.append({
            'heading': heading,
            'content': section,  # include heading in the chunk
            'length': len(section),
        })

    return chunks

chunks = chunk_by_headings(markdown)

Splitting at headings is better than splitting at arbitrary character limits because a heading-based chunk usually contains a coherent unit of content. Character-based splitting can cut mid-sentence or mid-paragraph, producing chunks that start or end with incomplete thoughts.
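A chunk is easier to trace back to its source when it carries the page URL and the path of headings above it. The sketch below is one possible way to record that, under a few assumptions: headings use the `#` style, code blocks are fenced with triple backticks, and the function name `chunk_with_heading_path` and the dictionary keys are illustrative.

```python
import re

HEADING = re.compile(r"^(#{1,6}) +(.*)$")


def chunk_with_heading_path(markdown: str, url: str) -> list[dict]:
    """Split at headings and attach the source URL and heading path to each chunk."""
    chunks: list[dict] = []
    path: list[tuple[int, str]] = []  # (level, title) from the top level down
    body: list[str] = []

    def flush() -> None:
        text = "\n".join(body).strip()
        if text:
            heading_path = " > ".join(title for _, title in path)
            chunks.append({"url": url, "heading_path": heading_path, "content": text})
        body.clear()

    in_code = False
    for line in markdown.splitlines():
        if line.startswith("```"):
            in_code = not in_code  # do not treat "# comment" lines in code as headings
        match = None if in_code else HEADING.match(line)
        if match:
            flush()
            level = len(match.group(1))
            # Drop headings at the same or a deeper level, then add this one.
            while path and path[-1][0] >= level:
                path.pop()
            path.append((level, match.group(2).strip()))
        else:
            body.append(line)
    flush()
    return chunks


doc = "# Guide\n\nIntro text.\n\n## Install\n\nRun the installer.\n\n## Usage\n\nCall the API."
for chunk in chunk_with_heading_path(doc, "https://example.com/guide"):
    print(chunk["heading_path"], "|", chunk["content"])
# Guide | Intro text.
# Guide > Install | Run the installer.
# Guide > Usage | Call the API.
```

Storing the heading path separately from the content lets you prepend it at embedding time or keep it purely as metadata, depending on what retrieves better for your data.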

Summary

	Raw HTML	Markdown
Token count	Very high (~74K avg per page)	Low (~7.6K avg per page)
Context window usage	Inefficient	Efficient
Content-to-noise ratio	Low	High
Heading structure	Preserved in tags	Preserved as # markers
Table support	Full	Partial (GFM tables)
Code blocks	`<code>` / `<pre>` tags	Fenced code blocks
Metadata in attributes	Preserved	Lost
Navigation / boilerplate	Included	Stripped (with good converter)
JavaScript content	Not captured by plain fetch	Captured with browser rendering
Best for	Structure analysis, metadata extraction, archiving	RAG, fine-tuning, in-context retrieval

For almost every AI workload that processes web content, Markdown is the better choice. The token reduction is significant, the content-to-noise ratio is much higher, and the structural information that matters for understanding content (headings, lists, tables, code) is preserved.

The exception is when the HTML structure itself is the thing you are analyzing, when you need metadata from attributes, or when you are doing DOM-level work. For everything else, convert to Markdown first.

Frequently asked questions

Does converting HTML to Markdown lose information?

Yes. Markdown does not preserve CSS class names, HTML attributes, schema markup in attributes, rel attributes, or anything else that lives in the HTML rather than the visible content. A good converter also strips tracking parameters from links. For most AI workloads this is desirable because that information is noise. For specific use cases like metadata extraction or DOM analysis, you need the HTML.
