If you are feeding web content into a large language model, whether for RAG, fine-tuning, or in-context retrieval, the format you use matters more than most people realise when they first start building.
The obvious choice is to grab the raw HTML from a page and pass it in. It is what the browser receives. It contains everything. Why not use it?
The problem is that "everything" includes a lot of things that are not the content you care about. Navigation menus, script tags, CSS classes, cookie banners, ad slots, footer links repeated across every page on the site. These take up tokens and contribute nothing to what the model learns or retrieves.
This article looks at what raw HTML actually contains versus what Markdown looks like for the same page, how the token count compares, what each format preserves and loses, and which one to use depending on what you are building.
What is actually in raw HTML
Take a typical news article page. Before any of the article text appears, a raw HTTP response from a modern website might look something like this:
<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<meta name="viewport" content="width=device-width, initial-scale=1.0">
<title>Article Title | Site Name</title>
<meta name="description" content="...">
<link rel="preconnect" href="https://fonts.googleapis.com">
<link rel="stylesheet" href="/static/css/main.a4f2e1.css">
<link rel="stylesheet" href="/static/css/vendor.9b3c4d.css">
<script>
window.__INITIAL_STATE__ = {"user":null,"theme":"light","locale":"en-US",...}
</script>
<script async src="https://www.googletagmanager.com/gtm.js?id=GTM-XXXXX"></script>
<!-- 40 more lines of head content -->
</head>
<body>
<div id="cookie-banner" class="fixed bottom-0 w-full bg-gray-800 text-white p-4 z-50">
<p>We use cookies to improve your experience. By continuing, you agree to our
<a href="/privacy">Privacy Policy</a></p>
<button id="accept-cookies" class="btn btn-primary ml-4">Accept All</button>
<button id="reject-cookies" class="btn btn-secondary ml-2">Reject</button>
</div>
<nav class="navbar" role="navigation" aria-label="main navigation">
<div class="navbar-brand">
<a class="navbar-item" href="/">
<img src="/images/logo.svg" alt="Site Logo" width="120" height="40">
</a>
</div>
<div class="navbar-menu">
<div class="navbar-start">
<a class="navbar-item" href="/news">News</a>
<a class="navbar-item" href="/tech">Technology</a>
<a class="navbar-item" href="/science">Science</a>
<!-- many more nav items -->
</div>
</div>
</nav>
<main>
<article class="post-content" itemscope itemtype="https://schema.org/Article">
<h1 class="post-title" itemprop="headline">The actual article title</h1>
<div class="post-meta">
<span class="author" itemprop="author">Author Name</span>
<time class="published" itemprop="datePublished" datetime="2026-05-01">
May 1, 2026
</time>
</div>
<div class="post-body" itemprop="articleBody">
<p>The actual article content starts here...</p>
</div>
</article>
</main>
<aside class="sidebar">
<!-- related articles, ads, newsletter signup -->
</aside>
<footer>
<!-- site map, legal links, social icons repeated on every page -->
</footer>
<script src="/static/js/main.chunk.js"></script>
<script src="/static/js/vendors.chunk.js"></script>
<!-- analytics, tracking, a/b testing scripts -->
</body>
</html>
The actual article content is a small fraction of this. Everything else (navigation, cookie banners, schema markup, script tags, CSS class names, the sidebar, the footer, the analytics scripts) is boilerplate, and most of it repeats across every page on the site. These are tokens the model has to process before it gets to anything meaningful.
What Markdown looks like for the same page
After stripping the HTML and extracting just the content, the same page as Markdown looks like this:
# The actual article title
**Author Name** · May 1, 2026
The actual article content starts here. The paragraphs flow naturally
without any surrounding markup. Links stay as [link text](url).
## A section heading
Content under this heading. Lists stay as lists:
- First item
- Second item
- Third item
Code blocks keep their syntax:
    def example():
        return "clean and readable"
| Column 1 | Column 2 |
|----------|----------|
| Value | Value |
The navigation, the footer, the cookie banner, the script tags, and the CSS classes are all gone. What remains is the content and its structure.
The token difference
Research from the University of Science and Technology of China, studying a Common Crawl benchmark, found that raw HTML averaged around 74,000 tokens per document while Markdown averaged around 7,600. That is roughly a 90% reduction in token count.
To put that in practical terms: at those averages, a 128K-token context window holds only one full raw HTML document (two or three if the pages are lighter than average). The same window holds roughly 16 average-sized Markdown documents. For RAG pipelines where you are trying to retrieve and reference as much relevant content as possible, that difference is significant.
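The arithmetic is easy to check. The sketch below uses the benchmark averages quoted above and assumes, for simplicity, that the whole window is available for documents; in practice the prompt and the model's answer need room too.

```python
# Averages quoted above; the window size is the 128K example from the text.
CONTEXT_WINDOW = 128_000
AVG_HTML_TOKENS = 74_000
AVG_MARKDOWN_TOKENS = 7_600


def docs_that_fit(window: int, tokens_per_doc: int) -> int:
    """Number of whole average-sized documents that fit in the window."""
    return window // tokens_per_doc


html_docs = docs_that_fit(CONTEXT_WINDOW, AVG_HTML_TOKENS)
markdown_docs = docs_that_fit(CONTEXT_WINDOW, AVG_MARKDOWN_TOKENS)
reduction = 1 - AVG_MARKDOWN_TOKENS / AVG_HTML_TOKENS

print(f"Raw HTML documents that fit: {html_docs}")
print(f"Markdown documents that fit: {markdown_docs}")
print(f"Token reduction: {reduction:.0%}")
```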
The token reduction comes from removing what HTML includes that Markdown does not:
- <div>, <span>, and structural tags with no semantic meaning
- CSS class names and inline style attributes
- Script tags and their contents
- Navigation menus, footers, sidebars repeated across every page
- Schema markup (itemscope, itemtype, itemprop attributes)
- Analytics, tracking, and A/B testing code
- Cookie banners and modal overlays
- Meta tags and head content
None of these contribute to what a model should learn from the page. They are engineering artifacts, not content.
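To get a feel for how much of a page is boilerplate, you can measure what share of the raw HTML is visible content text. The sketch below is a rough illustration using only Python's standard library. The set of skipped tags is an assumption and will miss things like cookie banners built from plain <div> elements, and character share is only a proxy for token share.

```python
from html.parser import HTMLParser

# Tags whose contents this sketch treats as boilerplate (an assumption;
# real sites need site-specific rules).
SKIP_TAGS = {"head", "script", "style", "nav", "footer", "aside"}


class ContentText(HTMLParser):
    """Collect text that sits outside the skipped tags."""

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_content_text(html: str) -> str:
    parser = ContentText()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


def content_share(html: str) -> float:
    """Fraction of the raw HTML's characters that are content text."""
    return len(extract_content_text(html)) / len(html) if html else 0.0


sample = """<html><head><title>T</title><script>var x = 1;</script></head>
<body><nav><a href="/">Home</a><a href="/news">News</a></nav>
<main><article><h1>Title</h1><p>Body text.</p></article></main>
<footer><a href="/privacy">Privacy</a></footer></body></html>"""

print(extract_content_text(sample))
print(f"Content share: {content_share(sample):.1%}")
```

Even on this tiny sample, the content text is a small fraction of the characters. Real pages with bundled scripts and long class lists skew further.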
What each format preserves
HTML preserves
- The complete page structure including elements with no content value
- All CSS class names and IDs (useful if you are analyzing UI patterns, not useful if you are analyzing content)
- Schema.org markup and other structured metadata embedded in attributes
- The exact rendering instructions for a browser
- JavaScript code that runs on the page
- Links including their href attributes, rel attributes, and any tracking parameters
Markdown preserves
- Heading hierarchy (H1, H2, H3 mapped to #, ##, ###)
- Paragraph structure and spacing
- Lists, both ordered and unordered
- Tables with header rows
- Code blocks with language hints
- Links with anchor text
- Images with alt text
- Bold and italic emphasis
- Blockquotes
The key things Markdown keeps are the things that help a model understand the content: what is a heading, what is a list item, what is code, what is being emphasized. The things Markdown drops are the things that help a browser render a page but tell the model nothing about meaning.
When token count matters most
RAG pipelines
In a retrieval-augmented generation system, you typically embed chunks of content and retrieve the most relevant ones at query time. The retrieved chunks get injected into the model's context window before the question is answered.
If your chunks are made from raw HTML, a large portion of each chunk is boilerplate that is not relevant to any query. Navigation menus from a documentation site end up in your vector index. The model retrieves and reads these as if they might contain the answer.
Markdown chunks are denser. Each chunk contains more actual content per token. Retrieval is more accurate because the content signal is stronger relative to the noise. And you can fit more relevant chunks into the context window at query time without hitting the token limit.
Fine-tuning
When fine-tuning a model on web content, you want the model to learn from the text. Raw HTML trains it on HTML syntax as much as on content. A model fine-tuned on raw HTML learns that web pages have <nav> elements before <main> elements, that footers contain privacy policy links, and that articles are wrapped in <article class="post-content"> tags. That is not the knowledge you are after.
Markdown fine-tuning data has a much higher ratio of actual content to markup. The model learns from the text, not the scaffolding around it.
In-context retrieval
When you paste a webpage directly into an LLM prompt, every token counts toward the context limit. Raw HTML fills that limit quickly with content the model cannot use. Markdown lets you include substantially more of the actual page content within the same limit.
When HTML might be the better choice
HTML is the right format in specific circumstances:
- Analyzing page structure itself. If you are building something that needs to understand how a page is laid out, which elements appear in which positions, how the navigation hierarchy is organized, then the HTML structure is the data. Markdown removes exactly the information you need.
- Extracting metadata from attributes. Some structured data lives in HTML attributes rather than visible text: Schema.org itemprop values, Open Graph tags, canonical URLs, data attributes used by JavaScript. If you need this metadata, HTML preserves it and Markdown does not.
- Legal or regulatory requirements. Some archiving and compliance use cases require the exact original document, not a converted representation.
- Reproducing the rendered page. If your use case is about how the page looks in a browser rather than what it says, HTML is necessary.
- DOM-level analysis. Any task that requires understanding the relationship between HTML elements, like accessibility auditing or layout analysis, needs the HTML.
Outside of these specific cases, Markdown is almost always the better choice for AI workloads.
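For the metadata case, a common pattern is to pull the attributes you need out of the HTML first and then convert the rest to Markdown. The sketch below shows the idea with Python's standard-library parser. MetadataExtractor and extract_metadata are hypothetical names, the Open Graph and canonical tags in the sample are assumed for illustration, and the itemprop handling is simplified (it takes a content or datetime attribute if present, otherwise the first text inside the element).

```python
from html.parser import HTMLParser


class MetadataExtractor(HTMLParser):
    """Collect Open Graph tags, the canonical URL, and itemprop values."""

    def __init__(self) -> None:
        super().__init__()
        self.metadata = {}
        self._pending = None  # itemprop name still waiting for its text

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        prop = a.get("property") or ""
        if tag == "meta" and prop.startswith("og:"):
            self.metadata[prop] = a.get("content") or ""
        elif tag == "link" and "canonical" in (a.get("rel") or "").split():
            self.metadata["canonical"] = a.get("href") or ""
        elif a.get("itemprop"):
            name = a["itemprop"]
            # Prefer a machine-readable attribute when one is present
            value = a.get("content") or a.get("datetime")
            if value:
                self.metadata[name] = value
            else:
                self._pending = name

    def handle_data(self, data):
        # Simplification: the first non-empty text after the tag is the value
        if self._pending and data.strip():
            self.metadata[self._pending] = data.strip()
            self._pending = None


def extract_metadata(html: str) -> dict:
    parser = MetadataExtractor()
    parser.feed(html)
    parser.close()
    return parser.metadata


sample = """
<head>
  <meta property="og:title" content="Article Title">
  <link rel="canonical" href="https://example.com/article">
</head>
<article itemscope itemtype="https://schema.org/Article">
  <h1 itemprop="headline">The actual article title</h1>
  <span itemprop="author">Author Name</span>
  <time itemprop="datePublished" datetime="2026-05-01">May 1, 2026</time>
</article>
"""

metadata = extract_metadata(sample)
print(metadata)
```

Storing the result alongside the Markdown (for example as chunk metadata) keeps the structured fields without paying for the full HTML in every prompt.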
The quality of the Markdown matters
Not all Markdown converters produce equally clean output. The common failure modes are:
- Not rendering JavaScript first. Many modern web pages render some or all of their content through JavaScript after the initial page load. A converter that makes a plain HTTP request and converts the HTML response gets an empty shell or partial content. The Markdown looks clean but is missing most of the page.
- Including boilerplate that survived the conversion. A naive converter might keep all the link text from the navigation menu, all the footer links, and all the cookie banner text, just without the HTML tags around them. You get clean Markdown that still contains all the noise.
- Losing table structure. HTML tables converted badly become flat lists of values with no indication of which column a value belongs to. Well-converted tables keep the header row and column alignment.
- Losing code block language hints. <code class="language-python"> should become ```python, not just ``` or plain indented text.
- Mangling heading hierarchy. An H3 that follows an H1 directly in the source should become ### in the Markdown. Converters that normalize all headings or rearrange the hierarchy make the structure harder for models to parse.
A good Markdown conversion: renders the page in a real browser before converting, strips navigation and footer before converting, preserves heading hierarchy, keeps tables as proper Markdown tables, preserves code block language hints, and keeps link text while removing tracking parameters.
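The last item, removing tracking parameters, is straightforward to do as a post-processing step. The sketch below uses only the standard library. The parameter list is illustrative rather than exhaustive, strip_tracking and clean_markdown_links are hypothetical helper names, and the link regex only handles simple inline links without titles or parentheses in the URL.

```python
import re
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative, not exhaustive: extend for the sites you crawl.
TRACKING_PREFIXES = ("utm_",)
TRACKING_NAMES = {"fbclid", "gclid", "mc_cid", "mc_eid"}


def strip_tracking(url: str) -> str:
    """Return the URL with known tracking query parameters removed."""
    parts = urlsplit(url)
    kept = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if not key.lower().startswith(TRACKING_PREFIXES)
        and key.lower() not in TRACKING_NAMES
    ]
    return urlunsplit(parts._replace(query=urlencode(kept)))


def clean_markdown_links(markdown: str) -> str:
    """Apply strip_tracking to every simple inline link target."""
    return re.sub(
        r"\]\((https?://[^)\s]+)\)",
        lambda match: f"]({strip_tracking(match.group(1))})",
        markdown,
    )


print(strip_tracking("https://example.com/post?id=7&utm_source=news&fbclid=abc#top"))
print(clean_markdown_links("Read [the post](https://example.com/x?utm_medium=email)."))
```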
Converting HTML to Markdown in practice
In Python
import html2text
import requests
converter = html2text.HTML2Text()
converter.ignore_links = False
converter.body_width = 0 # no line wrapping
converter.ignore_images = False
response = requests.get("https://example.com/article")
markdown = converter.handle(response.text)
The problem with this approach is the one covered above. requests.get() only fetches the initial HTML response. On JavaScript-rendered sites you get an empty shell.
In Node.js
import TurndownService from 'turndown'
const turndownService = new TurndownService({
  headingStyle: 'atx', // use # for headings
  codeBlockStyle: 'fenced', // use ``` for code blocks
})
// htmlString holds the page's HTML; the conversion method is turndown()
const markdown = turndownService.turndown(htmlString)
Same limitation applies if the HTML string came from a plain HTTP request.
Using a browser-based converter
To get accurate Markdown from modern JavaScript-heavy pages, the conversion needs to happen after the page has rendered in a real browser. The simplest way is to use a tool that handles the browser part for you.
Spidra's Website to Markdown tool does this. Paste a URL, get back clean Markdown. The page loads in a real browser, JavaScript executes, and then the conversion runs on the actual rendered content. You can toggle off navigation, footers, ads, and cookie banners before converting.
For programmatic use, the API returns Markdown by default:
import requests, time
API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}
# Submit the scrape job
job = requests.post(
    f"{BASE_URL}/scrape",
    headers=HEADERS,
    json={
        "urls": [{"url": "https://example.com/article"}],
        "extractContentOnly": True,  # strips nav, footer, ads
    },
).json()
# Poll until the job completes. Note: this loop has no timeout,
# so it runs forever if the job never reaches "completed".
while True:
    status = requests.get(
        f"{BASE_URL}/scrape/{job['jobId']}",
        headers=HEADERS,
    ).json()
    if status["status"] == "completed":
        markdown = status["result"]["data"][0]["markdownContent"]
        break
    time.sleep(2)
print(markdown)
extractContentOnly is the important parameter here. It strips the boilerplate before returning the content, so the Markdown you get is the article text, not the article text plus the entire navigation and footer.
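A polling loop like this is usually safer with a deadline. The sketch below is a generic helper, not part of any API: poll_until is a hypothetical name, and the "pending" status in the demo is a placeholder for whatever a job reports before it completes.

```python
import time
from typing import Any, Callable


def poll_until(
    fetch: Callable[[], Any],
    is_done: Callable[[Any], bool],
    timeout: float = 120.0,
    interval: float = 2.0,
) -> Any:
    """Call fetch() every `interval` seconds until is_done(result) is true.

    Raises TimeoutError if the deadline passes first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = fetch()
        if is_done(result):
            return result
        if time.monotonic() + interval > deadline:
            raise TimeoutError(f"job not finished after {timeout} seconds")
        time.sleep(interval)


# Stand-in for the status request: reports "completed" on the third call.
# "pending" is a placeholder value, not a documented status.
responses = iter([{"status": "pending"}, {"status": "pending"}, {"status": "completed"}])
final = poll_until(
    fetch=lambda: next(responses),
    is_done=lambda status: status["status"] == "completed",
    timeout=5,
    interval=0.01,
)
print(final)
```

With the request from the example above, fetch would be a lambda wrapping the requests.get(...).json() call and is_done the same "completed" check.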
Chunking Markdown for vector databases
Once you have clean Markdown, splitting it for a vector database is more straightforward than splitting raw HTML.
The natural split points are heading boundaries. Content under an ## Introduction heading belongs together. Content under ## Installation belongs together. Splitting at these boundaries keeps semantically related content in the same chunk rather than arbitrarily cutting mid-paragraph.
import re
def chunk_by_headings(markdown: str, min_length: int = 200) -> list[dict]:
    """Split Markdown at heading boundaries (levels 1 to 3)."""
    # Split before every line that starts with #, ## or ### and a space
    sections = re.split(r'\n(?=#{1,3} )', markdown)
    chunks = []
    for section in sections:
        section = section.strip()
        # Skip very short sections; their text is dropped, not merged
        if len(section) < min_length:
            continue
        # Extract the heading text if the section starts with one
        first_line = section.split('\n', 1)[0]
        heading = first_line.lstrip('#').strip() if first_line.startswith('#') else ''
        chunks.append({
            'heading': heading,
            'content': section,  # include heading in the chunk
            'length': len(section),
        })
    return chunks
chunks = chunk_by_headings(markdown)
Splitting at headings is usually better than splitting at arbitrary character limits because a heading-based chunk tends to hold one coherent unit of content. Character-based splitting can cut mid-sentence or mid-paragraph, producing chunks that start or end with incomplete thoughts.
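One caveat with a purely regex-based split: a line inside a fenced code block that starts with # and a space (a Python or shell comment, for example) looks like a heading to the pattern. The sketch below is one way to guard against that by tracking whether the scanner is inside a fence. split_outside_fences is a hypothetical helper, and the fence handling is simplified: it toggles on any line that starts with three backticks or tildes and does not handle nested fences of different lengths.

```python
import re

HEADING = re.compile(r"^#{1,3} ")
FENCE = re.compile(r"^(`{3}|~{3})")  # line that opens or closes a fenced block


def split_outside_fences(markdown: str) -> list[str]:
    """Split at level 1 to 3 headings, ignoring '#' lines inside code fences."""
    sections: list[str] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if FENCE.match(line):
            in_fence = not in_fence
        elif not in_fence and HEADING.match(line) and current:
            # A real heading starts a new section
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    return [section for section in sections if section]


fence = "`" * 3
sample = "\n".join([
    "# Title",
    "Intro text.",
    "## Install",
    fence + "python",
    "# this comment is not a heading",
    "print('hi')",
    fence,
    "## Usage",
    "Run it.",
])

sections = split_outside_fences(sample)
for section in sections:
    print(repr(section.splitlines()[0]), len(section))
```

The sections returned here can be fed through the same minimum-length filter and heading extraction as in chunk_by_headings.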
Summary
| | Raw HTML | Markdown |
|---|---|---|
| Token count | Very high (~74K avg per page) | Low (~7.6K avg per page) |
| Context window usage | Inefficient | Efficient |
| Content-to-noise ratio | Low | High |
| Heading structure | Preserved in tags | Preserved as # markers |
| Table support | Full | Partial (GFM tables) |
| Code blocks | <code> / <pre> tags | Fenced code blocks |
| Metadata in attributes | Preserved | Lost |
| Navigation / boilerplate | Included | Stripped (with good converter) |
| JavaScript content | Not captured by plain fetch | Captured with browser rendering |
| Best for | Structure analysis, metadata extraction, archiving | RAG, fine-tuning, in-context retrieval |
For almost every AI workload that processes web content, Markdown is the better choice. The token reduction is significant, the content-to-noise ratio is much higher, and the structural information that matters for understanding content (headings, lists, tables, code) is preserved.
The exception is when the HTML structure itself is the thing you are analyzing, when you need metadata from attributes, or when you are doing DOM-level work. For everything else, convert to Markdown first.
