May 22, 2026 · 16 min read

How to extract web data for AI training pipelines

Joel Olawanle

A model's training data determines what it knows and what it can do. Web pages make up a large share of that data for most large language models, which makes web extraction a core part of most AI training pipelines.

Collecting that data is not simple. Many sites use anti-bot systems to detect and block automated requests. Raw HTML is noisy and inflates token counts with markup that carries no training signal. And keeping a training corpus fresh as the web changes around you requires more than a one-time scrape.

In this guide, you will learn what AI data collection involves, why web data matters for training pipelines, how to turn raw pages into clean model-ready text, and how to build a scalable extraction pipeline using Spidra.

What is AI data collection?

AI data collection is the process of finding and gathering the data a machine learning system needs to be trained, fine-tuned, or evaluated. It involves selecting relevant sources, extracting data from them, loading that data into storage or processing systems, and assembling it into datasets the model can learn from.

Those sources can include websites, documents, databases, APIs, and more. The goal is to collect data that matches the system's intended task, whether that is text generation, question answering, summarization, classification, or something else. The collected data needs to be relevant, diverse, and of sufficient quality for the model to actually learn from it.

AI data types you can collect from the web

Web data comes in several forms. A single webpage can contain more than one at the same time.

  • Structured data follows clearly defined fields where every record has the same shape. Product specs, prices, ratings, job listings, directory entries, and metadata-rich pages all fall here.
  • Semi-structured data has some organization but not a strict row and column format. JSON-LD, embedded metadata, forum layouts, help center markup, and page metadata are common examples.
  • Unstructured data is long-form or loosely organized content written for human readers. Articles, documentation pages, reviews, comments, transcripts, and PDFs all fit here.
  • Multimodal data combines text with other media types like images or video, along with captions, transcripts, and metadata that describe them.

All of these types are scattered across websites built for human readers, not training pipelines. To use them for AI training you need to extract them and convert them into a format a model can learn from.

Why scraping web data is necessary for AI training pipelines

Web scraping is not just one way to collect training data. In many cases, it is the only way to keep datasets current, control what goes into them, and fill gaps left by public corpora that are becoming harder to depend on.

Public web corpora are becoming less dependable

The Data Provenance Initiative reports that restrictions on two major web-text datasets used for AI training, C4 and RefinedWeb, rose by more than 500% from April 2023 to April 2024. Around 45% of C4 is now restricted by Terms of Service.

These shifts are narrowing the range of sources in public training datasets and making those datasets harder to keep up to date. Relying on public corpora alone when you need fresh training data carries real risk.

AI pipelines need data that stays current

Models have a knowledge cutoff. As the web changes, training data pulled from it becomes stale and the model grows less relevant. For many AI use cases, a static snapshot from a public corpus is not enough. You need a pipeline that can refresh content as it changes.

Scraping gives you control over what goes in

When you scrape your own training data, you decide what goes into the dataset. You choose the domains, page types, topics, and quality thresholds that match your use case rather than inheriting the source mix of someone else's corpus. That control matters when you are building a domain-specific model or fine-tuning for a particular task.

Why extracting web data for AI training pipelines is hard

Web scraping for AI training tends to fail for the same reasons across most targets.

Anti-bot blocking

When sites use anti-bot systems, automated traffic triggers blocks, rate limits, or verification flows. The server responds with HTML, but what comes back is a CAPTCHA page, a JavaScript challenge, or a bot detection screen rather than the content you requested.

This makes extraction unstable at scale because a crawl can shift from collecting target content to collecting block pages as IP reputation and request patterns change.
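
One way to notice that drift is to screen each response before it enters the corpus. The sketch below is a heuristic only: the marker phrases, the length threshold, and the function names are illustrative assumptions, not a list published by any anti-bot vendor, and real targets will need their own markers.

```python
# Heuristic screen for block pages. The marker phrases and the length
# threshold are illustrative assumptions; tune them against real responses.
BLOCK_MARKERS = (
    "verify you are human",
    "checking your browser",
    "access denied",
    "captcha",
)


def looks_like_block_page(html: str, max_length: int = 10_000) -> bool:
    """Return True when a response is short and contains a block marker."""
    lowered = html.lower()
    has_marker = any(marker in lowered for marker in BLOCK_MARKERS)
    # Requiring both signals lowers the odds of flagging a long article
    # that merely mentions CAPTCHAs in its text.
    return has_marker and len(html) < max_length


def block_rate(pages: list[str]) -> float:
    """Share of responses in a crawl batch that look like block pages."""
    if not pages:
        return 0.0
    return sum(looks_like_block_page(page) for page in pages) / len(pages)
```

Tracking `block_rate` per batch gives an early signal that a crawl has started collecting challenge pages instead of content.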

JavaScript rendering

Many sites do not include page content in the initial HTML response. The response contains placeholder markup and JavaScript. The real data is fetched and inserted into the DOM after those scripts run. If you scrape with a plain HTTP client, you collect the placeholder HTML and miss the content you actually need. Browser-based scraping is required on these targets.
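
A quick way to tell whether a target falls into this category is to measure how much visible text the initial HTML actually contains. The following is a minimal sketch using only the standard library; the 200-character threshold and the function names are assumptions chosen for illustration.

```python
from html.parser import HTMLParser


class VisibleTextParser(HTMLParser):
    """Collects text that sits outside <script>, <style>, and <noscript>."""

    SKIPPED = {"script", "style", "noscript"}

    def __init__(self) -> None:
        super().__init__()
        self._skip_depth = 0
        self.chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def visible_text(html: str) -> str:
    parser = VisibleTextParser()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)


def probably_needs_rendering(html: str, min_chars: int = 200) -> bool:
    """Flag responses whose initial HTML carries almost no readable text."""
    # The threshold is an assumption; calibrate it per target.
    return len(visible_text(html)) < min_chars
```

If a plain HTTP response trips this check, it is a hint that the page should go through a browser-based fetch instead.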

Structural changes that break extraction logic

Websites change over time. A redesign can move fields into different sections, rename selectors, or change how content loads. Parsing logic that was working can start returning empty fields or wrong values without throwing an obvious error.
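
Because these failures are silent, it helps to validate extracted records rather than wait for an exception. The sketch below assumes records are plain dictionaries and that you know which fields a healthy extraction should always fill; the field names in the usage note are hypothetical.

```python
def missing_fields(record: dict, required: tuple[str, ...]) -> list[str]:
    """Names of required fields that are absent, None, or blank."""
    missing = []
    for name in required:
        value = record.get(name)
        if value is None or (isinstance(value, str) and not value.strip()):
            missing.append(name)
    return missing


def failure_rate(records: list[dict], required: tuple[str, ...]) -> float:
    """Share of records with at least one missing required field."""
    if not records:
        return 0.0
    failed = sum(1 for record in records if missing_fields(record, required))
    return failed / len(records)
```

For example, calling `failure_rate(records, ("title", "price"))` after each crawl and alerting when the rate jumps would surface a redesign that started returning empty fields.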

Boilerplate in raw HTML

Raw HTML includes navigation menus, headers, footers, ads, cookie banners, and scripts that repeat across every page on a site. If you store raw HTML as training data, that boilerplate makes up a large share of each sample, diluting the page-specific text you actually want.

The challenge of token efficiency and context windows

Boilerplate in raw HTML does more than lower data quality. It inflates token counts across the extracted dataset. Scripts, CSS, navigation elements, cookie banners, and repeated layout blocks all add tokens while contributing almost no training signal.

At scale, that means more text to clean, preprocess, and deduplicate before the data is usable.

It also affects training efficiency. LLM training datasets are commonly organized into fixed-length token sequences by concatenating documents and segmenting them into blocks. When a large share of those tokens comes from markup and repeated layout, each sequence carries less useful page content. The model sees less informative text per training example, making the dataset less efficient.
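
The concatenate-and-segment step can be sketched in a few lines. This is a simplified illustration, not any particular framework's implementation: whitespace splitting stands in for a real tokenizer, and the `<eos>` separator string is an assumption.

```python
def pack_sequences(docs: list[str], block_size: int, sep: str = "<eos>") -> list[list[str]]:
    """Concatenate documents and cut the token stream into fixed-size blocks."""
    stream: list[str] = []
    for doc in docs:
        stream.extend(doc.split())  # whitespace split stands in for a tokenizer
        stream.append(sep)          # mark the document boundary
    # Dropping the trailing partial block is one common choice; padding is another.
    full_blocks = len(stream) // block_size
    return [stream[i * block_size:(i + 1) * block_size] for i in range(full_blocks)]


blocks = pack_sequences(["a b c", "d e"], block_size=3)
print(blocks)  # [['a', 'b', 'c'], ['<eos>', 'd', 'e']]
```

Every slot in a block is filled by whatever the stream contains, so any markup token that survives cleaning displaces a content token in the same block.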

Optimizing extracted web data for model input

To reduce token overhead and increase the share of useful text in each training sequence, you need to convert extracted web data into cleaner text, preserve headings, lists, tables, and other structural elements, and keep metadata that makes each sample traceable.

Markdown vs. raw HTML

Markdown is a more efficient output format than raw HTML for AI training pipelines because it keeps useful structure while removing markup noise.

Research from the University of Science and Technology of China found that on their Common Crawl benchmark, raw HTML averaged around 74,000 tokens per document while Markdown averaged around 7,600. That is roughly a 90% reduction in token count while preserving the structural information that helps models learn from the content.
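
The arithmetic behind that figure is easy to reproduce from the two averages quoted above. The one-million-token budget in the second half is an arbitrary number chosen only to make the difference concrete.

```python
def token_reduction(before: int, after: int) -> float:
    """Fraction of tokens removed when going from `before` to `after`."""
    return 1 - after / before


html_tokens = 74_000      # average per document, raw HTML (figure quoted above)
markdown_tokens = 7_600   # average per document, Markdown (figure quoted above)

print(f"{token_reduction(html_tokens, markdown_tokens):.1%}")  # 89.7%

# How many whole average-sized documents fit in a 1,000,000-token budget
budget = 1_000_000
print(budget // html_tokens, budget // markdown_tokens)  # 13 131
```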

Document segmentation

Once a page is in cleaner text, it still needs to be split in a way that preserves structure. Cutting at arbitrary token limits separates headings from the sections they introduce and breaks related content across training sequences. Segmenting at headings and clear section boundaries keeps each unit coherent before it gets packed into fixed-length sequences.
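
A minimal sketch of heading-based segmentation is shown below. It assumes ATX-style Markdown headings (lines starting with `#`) and skips heading-like lines inside fenced code blocks; the function name is illustrative.

```python
import re

HEADING = re.compile(r"^#{1,6} ")
FENCE = "`" * 3  # Markdown code fence marker


def split_by_headings(markdown: str) -> list[str]:
    """Split Markdown into sections, each starting at a heading."""
    sections: list[list[str]] = []
    current: list[str] = []
    in_fence = False
    for line in markdown.splitlines():
        if line.startswith(FENCE):
            in_fence = not in_fence  # a '#' inside code is not a heading
        if HEADING.match(line) and not in_fence and current:
            sections.append(current)
            current = []
        current.append(line)
    if current:
        sections.append(current)
    joined = ("\n".join(section).strip() for section in sections)
    return [text for text in joined if text]
```

Each returned section keeps its heading attached to the text it introduces, so later packing never separates the two at the source.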

Metadata and provenance

Data provenance is the record of where a training sample came from and when it was collected. Keeping the source URL, timestamp, license, and attribution alongside each sample lets you filter stale or restricted content, rerun the same crawl later, and audit exactly what entered the corpus.
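
As a sketch of how those fields get used, the filter below keeps only records collected after a cutoff and carrying an allowed license. The record layout and the allow-list are assumptions for illustration; which licenses are acceptable depends on your own review.

```python
from datetime import datetime, timezone

# Hypothetical allow-list; the right set depends on your own legal review.
ALLOWED_LICENSES = {"CC-BY-4.0", "CC0-1.0", "MIT"}


def is_usable(record: dict, cutoff: datetime) -> bool:
    """Keep records that are recent enough and carry an allowed license."""
    collected = datetime.fromisoformat(record["collected_at"])
    return collected >= cutoff and record.get("license") in ALLOWED_LICENSES


def filter_corpus(records: list[dict], cutoff: datetime) -> list[dict]:
    return [record for record in records if is_usable(record, cutoff)]


cutoff = datetime(2026, 1, 1, tzinfo=timezone.utc)
```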

Research from EPFL found that prepending metadata like URLs and timestamps to training documents speeds up LLM pretraining, and that finer-grained metadata like quality signals and time markers performs better than coarse labels.

Pages often expose part of this metadata through JSON-LD, including fields like author, datePublished, and dateModified. Keeping those fields with each sample makes it easier to compare page versions, remove stale content, and keep the corpus auditable.
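
Pulling those fields out of a page needs nothing beyond the standard library. The sketch below reads `application/ld+json` script blocks and returns the three fields named above from the first top-level object that has any of them. It is deliberately simple: JSON-LD that wraps entities in a list or an `@graph` array would need extra handling.

```python
import json
from html.parser import HTMLParser


class JsonLdParser(HTMLParser):
    """Collects the parsed contents of application/ld+json script blocks."""

    def __init__(self) -> None:
        super().__init__()
        self._in_jsonld = False
        self._buffer: list[str] = []
        self.blocks: list = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            self._in_jsonld = False
            try:
                self.blocks.append(json.loads("".join(self._buffer)))
            except json.JSONDecodeError:
                pass  # skip malformed blocks rather than fail the whole page


def extract_provenance(html: str) -> dict:
    """Return author, datePublished, and dateModified when present."""
    parser = JsonLdParser()
    parser.feed(html)
    parser.close()
    wanted = ("author", "datePublished", "dateModified")
    for block in parser.blocks:
        if isinstance(block, dict):
            found = {key: block[key] for key in wanted if key in block}
            if found:
                return found
    return {}
```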

How to extract web data for AI training pipelines with Spidra

Two approaches are common in AI training pipelines. One is managing the browser and extraction stack yourself using open-source tools. The other is using a managed scraping API that handles rendering, anti-bot bypass, proxy rotation, and Markdown output for you.

We will cover both, starting with the self-managed approach so you understand the trade-offs, then moving to Spidra as the managed option.

Approach 1: Open-source stealth browsers

Open-source stealth browsers are browser automation setups that try to make scraper traffic look like normal user traffic during anti-bot checks. They use a real browser engine through tools like Playwright or Puppeteer but modify them to hide the automation signals that headless sessions normally expose.

Here is a minimal example using Playwright with the playwright-stealth package:

Step 1. Install the required packages

pip install playwright "playwright-stealth<2" html2text
playwright install chromium

Step 2. Write the scraper

from playwright.sync_api import sync_playwright
# stealth_sync is the playwright-stealth 1.x API; 2.x exposes a different interface
from playwright_stealth import stealth_sync
from datetime import datetime, timezone
import html2text

TARGET_URL = "https://www.scrapingcourse.com/javascript-rendering"

def scrape_to_markdown(url: str) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            stealth_sync(page)  # patch common automation signals before navigating
            page.goto(url, wait_until="networkidle", timeout=60000)
            html = page.content()
        finally:
            browser.close()  # release the browser even if navigation fails

    # convert HTML to markdown
    converter = html2text.HTML2Text()
    converter.ignore_links = False
    converter.body_width = 0
    markdown_body = converter.handle(html).strip()

    # add provenance metadata
    collected_at = datetime.now(timezone.utc).isoformat()
    document = "\n".join([
        "---",
        f'source_url: "{url}"',
        f'collected_at: "{collected_at}"',
        'extraction_method: "playwright-stealth + html2text"',
        "---",
        "",
        markdown_body,
    ])
    return document

content = scrape_to_markdown(TARGET_URL)

with open("output.md", "w", encoding="utf-8") as f:
    f.write(content)

print("Saved to output.md")

The output looks like this:

---
source_url: "https://www.scrapingcourse.com/javascript-rendering"
collected_at: "2026-03-19T06:09:23.187174+00:00"
extraction_method: "playwright-stealth + html2text"
---

# JS Rendering

[![Chaz Kangeroo Hoodie](...)
Chaz Kangeroo Hoodie
$52](https://scrapingcourse.com/ecommerce/product/...)

<!-- rest of output omitted for brevity -->

This is cleaner than raw HTML and includes provenance metadata. But running it at scale surfaces real limitations fast.

Limitations of the open-source approach

  • No guaranteed anti-bot bypass. Stealth plugins improve your odds but do not guarantee access. Anti-bot systems update their detection logic regularly, and setups that work today may not work next month without manual updates.
  • Per-target tuning required. Proxies, session handling, request pacing, and retry logic all need to be configured per site. What works on one domain can fail on another because the detection checks differ.
  • Markdown output requires extra work. Most open-source setups return raw HTML. Converting that to clean, training-ready Markdown is an additional processing step you build and maintain yourself.
  • Browser overhead. Each Chromium instance uses roughly 200 to 400 MB of RAM. At 20 concurrent sessions that is 4 to 8 GB just for the browsers, before accounting for memory leaks that force you to add restart logic.
  • Ongoing maintenance. Anti-bot measures evolve. Stealth plugins need to keep up with those changes, or bypass rates drop. That maintenance falls on you.
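
To give a sense of what the retry logic mentioned in the list involves, here is a generic exponential-backoff wrapper. It is a sketch under simple assumptions: `fetch` is any zero-argument callable you supply, every exception is treated as retryable, and the delays are not tuned for any particular site.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    fetch: Callable[[], T],
    attempts: int = 4,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call `fetch`, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fetch()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            sleep(base_delay * 2 ** attempt)  # 1s, 2s, 4s, ...
    raise ValueError("attempts must be at least 1")
```

In a self-managed setup this wrapper is only one piece. Deciding which errors are worth retrying, and when to switch proxies instead, is the per-target tuning the list describes.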

Open-source tools are a good fit for prototyping, debugging specific targets, or smaller research jobs. For production AI training pipelines that need consistent data at volume, the maintenance burden becomes the bottleneck.

Approach 2: Managed scraping API with Spidra

A managed scraping API moves the browser infrastructure and anti-bot handling out of your code. You send a URL, describe what you want, and get back clean Markdown. No browser to manage. No proxy rotation to configure. No stealth logic to maintain.

Spidra runs every request through a real headless browser, handles CAPTCHA solving and residential proxy rotation across 50 countries automatically, and returns AI-ready Markdown. The extractContentOnly option, written extract_content_only in the Python examples below, strips navigation, ads, cookie banners, and other boilerplate before returning the content, which is exactly what you want for training data.

Install the Python SDK

pip install spidra

Step 1. Scrape a single page to Markdown

from spidra import SpidraClient, ScrapeParams, ScrapeUrl
from datetime import datetime, timezone
import os

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

TARGET_URL = "https://www.scrapingcourse.com/javascript-rendering"

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url=TARGET_URL)],
    extract_content_only=True,  # strips nav, ads, and boilerplate
))

# add provenance metadata
collected_at = datetime.now(timezone.utc).isoformat()
document = "\n".join([
    "---",
    f'source_url: "{TARGET_URL}"',
    f'collected_at: "{collected_at}"',
    'extraction_method: "spidra"',
    "---",
    "",
    job.result.markdown_content or "",
])

with open("output.md", "w", encoding="utf-8") as f:
    f.write(document)

print("Saved to output.md")

Setting extract_content_only=True tells Spidra to strip boilerplate before returning the page. What comes back is the main content in Markdown, with no extra processing step needed.

Step 2. Handle anti-bot-protected pages

Anti-bot bypass is built in, so a protected page uses the same call as an open one. The example below also sets use_proxy and proxy_country to route the request through a residential proxy in a specific country:

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://www.scrapingcourse.com/antibot-challenge")],
    extract_content_only=True,
    use_proxy=True,       # route through residential proxy
    proxy_country="us",   # target country for geo-restricted content
))

print(job.result.markdown_content)
# Output: "## You bypassed the Antibot challenge!"

The same endpoint handles JavaScript rendering, proxy rotation, and CAPTCHA solving, so the shape of the request does not change between targets.

Step 3. Collect training data at scale with batch scraping

When you have a list of URLs to collect, the batch endpoint processes up to 50 at a time in parallel:

from spidra import SpidraClient, BatchScrapeParams
from datetime import datetime, timezone
import os, json

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

urls = [
    "https://example.com/docs/page-1",
    "https://example.com/docs/page-2",
    "https://example.com/docs/page-3",
    # up to 50 per request
]

batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=urls,
    extract_content_only=True,
))

corpus = []
collected_at = datetime.now(timezone.utc).isoformat()

for item in batch.items:
    if item.status == "completed" and item.result:
        corpus.append({
            "source_url": item.url,
            "collected_at": collected_at,
            "content": item.result.markdown_content or "",
        })
    else:
        print(f"Failed: {item.url} — {item.error}")

with open("corpus.jsonl", "w", encoding="utf-8") as f:
    for record in corpus:
        f.write(json.dumps(record) + "\n")

print(f"Collected {len(corpus)} pages")

Each record includes source URL, timestamp, and clean Markdown content. JSONL makes it easy to stream large datasets through downstream processing without loading everything into memory at once.
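
Lists longer than 50 URLs need to be split before they are submitted. A small helper like the one below does the splitting; the name `chunked` is illustrative, and each yielded batch would be passed to a batch call like the one above.

```python
from collections.abc import Iterator


def chunked(urls: list[str], size: int = 50) -> Iterator[list[str]]:
    """Yield consecutive slices of `urls`, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]


urls = [f"https://example.com/docs/page-{i}" for i in range(120)]
print([len(batch) for batch in chunked(urls)])  # [50, 50, 20]
```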

Step 4. Crawl an entire domain automatically

For building a corpus from a full documentation site, blog, or any multi-page source, the crawl endpoint discovers and processes pages automatically. You describe which pages to follow and what to extract from each one:

from spidra import SpidraClient, CrawlParams, PollOptions
import os, json

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://docs.example.com",
        crawl_instruction="Follow all documentation pages. Skip changelog, release notes, and login pages.",
        transform_instruction="Extract the page title and full body text as clean Markdown. Preserve all headings and code examples.",
        max_pages=100,
        use_proxy=True,
    ),
    PollOptions(timeout=600),
)

corpus = []
for page in job.result:
    if page.data:
        corpus.append({
            "source_url": page.url,
            "content": page.data,
        })

with open("docs_corpus.jsonl", "w", encoding="utf-8") as f:
    for record in corpus:
        f.write(json.dumps(record) + "\n")

print(f"Crawled {len(corpus)} pages")

The crawler follows links, handles pagination, bypasses any bot protection it encounters, and applies your transform_instruction to every page it visits. The result is a structured corpus ready for chunking and indexing.

Step 5. Extract structured fields for fine-tuning datasets

For fine-tuning you often need structured data rather than raw text. Pass a schema and Spidra returns JSON matching that exact shape from every page, which makes it straightforward to build instruction datasets:

from spidra import SpidraClient, ScrapeParams, ScrapeUrl
import os, json

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com/article")],
    prompt="Extract the article details for a fine-tuning dataset",
    output="json",
    schema={
        "type": "object",
        "required": ["title", "body", "author", "published_date", "topics"],
        "properties": {
            "title":          {"type": "string"},
            "body":           {"type": "string"},
            "author":         {"type": ["string", "null"]},
            "published_date": {"type": ["string", "null"]},
            "topics":         {"type": "array", "items": {"type": "string"}},
        },
    },
))

print(json.dumps(job.result.content, indent=2))

The output looks like this:

{
  "title": "How Transformers Changed NLP",
  "body": "The introduction of the attention mechanism...",
  "author": "Jane Smith",
  "published_date": "2025-11-14",
  "topics": ["NLP", "transformers", "deep learning"]
}

Every required field appears in every record, with null standing in when the page does not have that value. That consistency matters when the output feeds a downstream training script that expects a fixed schema.
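
From there, turning a record into a training example is a small mapping step. The sketch below builds a topic-tagging pair from the fields shown above. The prompt wording and the prompt/completion key names are assumptions; use whatever format your fine-tuning framework expects.

```python
from typing import Optional


def to_training_example(record: dict) -> Optional[dict]:
    """Build a topic-tagging pair, or None when the record is unusable."""
    body = record.get("body")
    topics = record.get("topics")
    if not body or not topics:
        return None  # skip records with nothing to learn from
    return {
        "prompt": f"List the main topics of the following article.\n\n{body}",
        "completion": ", ".join(topics),
    }
```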

Putting it together: A full training data pipeline

Here is an end-to-end data collection pipeline using Spidra:

from spidra import SpidraClient, CrawlParams, PollOptions
from datetime import datetime, timezone
import os, json, hashlib

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

def collect_corpus(base_url: str, max_pages: int = 200) -> list[dict]:
    job = spidra.crawl.run_sync(
        CrawlParams(
            base_url=base_url,
            crawl_instruction="Follow all article and documentation pages. Skip tag pages, author pages, and login pages.",
            transform_instruction="Extract the full page content as clean Markdown. Preserve all headings, lists, tables, and code blocks.",
            max_pages=max_pages,
            use_proxy=True,
        ),
        PollOptions(timeout=900),
    )

    collected_at = datetime.now(timezone.utc).isoformat()
    corpus = []
    seen = set()

    for page in job.result:
        content = page.data or ""
        if not content:
            continue

        # deduplicate by content hash
        content_hash = hashlib.md5(content.encode()).hexdigest()
        if content_hash in seen:
            continue
        seen.add(content_hash)

        corpus.append({
            "source_url": page.url,
            "collected_at": collected_at,
            "content_hash": content_hash,
            "content": content,
        })

    return corpus

corpus = collect_corpus("https://docs.example.com", max_pages=200)

with open("training_corpus.jsonl", "w", encoding="utf-8") as f:
    for record in corpus:
        f.write(json.dumps(record) + "\n")

print(f"Collected {len(corpus)} unique pages")

This crawls the domain up to the max_pages limit, extracts clean Markdown from every page, drops exact duplicates by content hash, attaches provenance metadata to each record, and saves the output as JSONL ready for chunking and indexing.
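
A hash of the raw content only catches byte-identical pages. If the same article shows up with different whitespace or casing, a normalized hash catches it. The sketch below is one possible normalization, offered as an assumption rather than part of the pipeline above, and it swaps MD5 for SHA-256 only as a matter of preference.

```python
import hashlib
import re


def normalized_hash(content: str) -> str:
    """Hash content after collapsing whitespace and lowercasing."""
    normalized = re.sub(r"\s+", " ", content).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def dedupe(records: list[dict]) -> list[dict]:
    """Keep the first record for each normalized content hash."""
    seen: set[str] = set()
    unique = []
    for record in records:
        digest = normalized_hash(record["content"])
        if digest not in seen:
            seen.add(digest)
            unique.append(record)
    return unique
```

Pages that differ in wording, even slightly, still hash differently; catching those requires near-duplicate techniques that are outside the scope of this sketch.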

Conclusion

In this guide, you learned what AI data collection involves, why web data is difficult to collect reliably, and how to turn raw pages into clean model-ready text using Markdown conversion, content extraction, and provenance metadata.

You also saw two ways to collect that data in practice. The open-source path gives you full control and works well for prototyping and smaller jobs, but requires you to maintain the browser stack, stealth logic, and HTML conversion layer yourself. For production AI training pipelines that need consistent data at volume, a managed API like Spidra handles rendering, anti-bot bypass, clean Markdown output, and full-site crawling through a single endpoint that does not need ongoing maintenance on your side.

Get started for free at spidra.io and explore the SDK docs at docs.spidra.io.

Frequently asked questions

Can I scrape web data for AI training?

Yes. Scraping publicly available web data for AI training is a common and legitimate practice. You are responsible for respecting each site's terms of service, robots.txt, and applicable laws. Focus on content you have the right to use, avoid personal data, and implement rate limiting so you do not overload target servers.
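
Python's standard library can evaluate robots.txt rules for you. The sketch below parses rules from a string so it runs offline; the user agent name and the rules themselves are made up for illustration, and in practice you would load the file from the target site.

```python
from urllib.robotparser import RobotFileParser

# Example rules, invented for illustration
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""


def build_parser(robots_txt: str) -> RobotFileParser:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)
agent = "my-crawler"  # hypothetical user agent name

print(parser.can_fetch(agent, "https://example.com/docs/page-1"))   # True
print(parser.can_fetch(agent, "https://example.com/private/data"))  # False
print(parser.crawl_delay(agent))                                    # 2
```

The crawl delay, when a site declares one, is a reasonable starting point for your own rate limit.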

What format should I store training data in?

Markdown with provenance metadata stored as JSONL is a practical choice for most text-based training pipelines. Markdown strips the token overhead from raw HTML while preserving structural information like headings, lists, and tables. JSONL makes it easy to stream large datasets through processing pipelines without loading everything into memory at once.
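
Streaming a JSONL file takes only a generator. This minimal sketch reads one record at a time, so memory use stays flat regardless of file size; the function names are illustrative.

```python
import json
from collections.abc import Iterable, Iterator


def write_jsonl(path: str, records: Iterable[dict]) -> None:
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")


def iter_jsonl(path: str) -> Iterator[dict]:
    """Yield records one at a time without loading the whole file."""
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # tolerate blank lines
                yield json.loads(line)
```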

What is the difference between fine-tuning data and RAG data?

Fine-tuning data is used to update a model's weights, changing how it behaves or what it knows. It typically takes the form of input and output pairs that teach the model a specific task or domain. RAG data is retrieved at inference time and injected into the model's context window. It does not change the model itself but gives it access to information beyond its training cutoff. Fine-tuning requires higher-quality, labeled data. RAG can work with a wider range of source quality but needs fast retrieval infrastructure.

Why is Markdown better than raw HTML for AI training?

Raw HTML includes navigation, scripts, CSS, ads, and layout markup that adds tokens without adding useful training signal. Converting to Markdown reduces token count by roughly 90% while keeping structural information that matters: headings, lists, tables, links, and code blocks. Less token overhead means more useful content packed into each training sequence, making the dataset more efficient.

How do I keep training data fresh?

Run the same crawl or scrape job on a schedule and compare content hashes to find pages that changed since the last run. Re-extract those pages and replace the old records in your corpus. Storing collected_at timestamps alongside each record makes it straightforward to filter out content older than a certain date or to prioritize fresher samples during training.
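
The hash comparison can be sketched as a diff between two runs. The example assumes each run is summarized as a mapping from source URL to content hash, which you could build from the source_url and content_hash fields of each record.

```python
def diff_runs(previous: dict[str, str], current: dict[str, str]) -> dict[str, list[str]]:
    """Compare two {url: content_hash} mappings from consecutive runs."""
    return {
        "new": sorted(url for url in current if url not in previous),
        "changed": sorted(
            url for url in current
            if url in previous and current[url] != previous[url]
        ),
        "removed": sorted(url for url in previous if url not in current),
    }
```

Only the URLs under "new" and "changed" need re-extraction; "removed" tells you which records to retire from the corpus.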

How does Spidra handle sites protected by Cloudflare or other anti-bot systems?

Anti-bot bypass is built into every Spidra request. Spidra uses residential proxy rotation, browser fingerprint randomization, and CAPTCHA solving automatically. You do not configure any of this. The same request you send to an open page works on a Cloudflare-protected page without any changes. Proxy usage is billed against your bandwidth quota rather than your credit balance, so there is no credit multiplier when bypass is needed.
