What if a required field is not present on some pages?

Spidra's schema enforcement returns null for required fields that are not found rather than omitting them. This means your output always has the same shape even when pages are missing data. Handle nulls explicitly in your database schema so filtering works correctly.

How do I handle pages that require login?

Pass session cookies in the cookies parameter on your ScrapeUrl. Spidra will use them when loading the page.

What embedding model should I use?

text-embedding-3-small from OpenAI is a good balance of quality and cost for most use cases. For higher retrieval accuracy text-embedding-3-large is better but costs more per token. For a self-hosted option, nomic-embed-text runs locally and performs well on most English-language retrieval tasks.

How do I handle very long page content that exceeds the embedding model's context limit?

Use chunking. Split the content at section boundaries (Markdown headings are natural split points) rather than at arbitrary character limits. Store each chunk as a separate record in the vector DB with the same structured metadata fields. At query time, retrieve chunks and deduplicate by URL before returning results.

Blog/ How to scrape unstructured website data and turn it into structured JSON for AI

June 8, 2026 · 11 min read

How to scrape unstructured website data and turn it into structured JSON for AI

Joel Olawanle

How to scrape unstructured website data and turn it into structured JSON for AI

Here is a problem a lot of developers run into when building AI search or RAG pipelines.

You have a website with hundreds of pages. The content is inconsistent. Product pages, article pages, listing pages, all laid out slightly differently from each other. You want to extract structured data from all of them, push it into a vector database, and make it searchable.

The obvious approach is to loop through every page, send the HTML to an LLM with a structured output prompt, and save what comes back. It works. Then you have 300 pages to process, rate limits kicking in every few minutes, a token bill that adds up faster than you expected, and you realize the HTML you are sending is mostly navigation, scripts, ads, and footer content rather than the actual page data you care about.

This guide covers the right way to approach this problem: how to minimize token overhead, enforce consistent output shapes across inconsistent pages, avoid hammering an LLM directly for every extraction, and build a pipeline that scales.

The core problem with scraping unstructured pages

Most web pages are not designed for machines. The same information appears in different positions on different pages. Class names change between sections. Some fields are present on some pages and absent on others. There is no guarantee that the title is always in an h1, the price is always in a .price element, or the category is labeled consistently.

Traditional scraping with CSS selectors breaks here. You cannot write a selector for "the price" when the price appears in a <span class="price"> on one page and a <div class="product-cost"> on another.

The three options developers usually land on are:

Write page-type-specific scrapers. Map out all the different page layouts, write a parser for each, and maintain them as the site changes. This works but is expensive in time and breaks every time the site redesigns.
Send raw HTML to an LLM. Let the model figure out where everything is. This works but the raw HTML includes navigation, scripts, ads, and boilerplate that can easily push a single page to 10,000+ tokens. At scale the cost and rate limit problem becomes significant.
Convert HTML to Markdown first, then send to an LLM. This is better. Research has shown that the same page in Markdown averages around 90% fewer tokens than the raw HTML while preserving the structural content that matters. Still involves one LLM call per page. Spidra's free Website to Markdown tool does this conversion if you want to see the difference on a real page before writing any code.

The better approach is to use a scraping layer that handles AI extraction natively so the LLM calls happen inside the scraping infrastructure rather than in your application code, and you only receive clean structured JSON back.

What good structured output looks like for a vector DB

Before writing any code, it helps to think about what the output should look like. For a searchable product database, you want two things in each record:

A structured fields object with the discrete data points: title, price, category, URL, date, and any other fields that are filterable or sortable.
A text content field for semantic search: the full cleaned body text of the page, stripped of boilerplate, that you will embed and store in the vector database.

Keeping these separate matters. Structured fields support exact filtering ("show me all cars under $30,000"). The text content supports semantic search ("find me something with good fuel economy for city driving"). You want both.

This is the pattern used in most data enrichment and AI search pipelines. A good record looks like this:

{
  "url": "https://www.toyota.com/camry",
  "title": "2026 Camry",
  "brand": "Toyota",
  "price": 29300,
  "currency": "USD",
  "categories": ["car"],
  "description": "Camry's refined performance and elevated design let you chase your daily thrills in style.",
  "content": "2026 Camry Starting MSRP $29,300. Up to 51 combined MPG. Up to 232 net combined horsepower...",
  "year": 2026,
  "mileage_km": null,
  "scraped_at": "2026-06-26T10:22:00Z"
}

The content field is what you embed and store in pgvector. The rest of the fields are stored as structured columns for filtering.

Why raw HTML is the wrong input format

If you send raw page HTML to an LLM, here is what you are actually sending:

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>2026 Camry | Toyota</title>
  <link rel="stylesheet" href="/css/main.css">
  <script src="/js/analytics.js"></script>
  <!-- 40 more lines of head content -->
</head>
<body>
  <nav class="main-nav">
    <ul>
      <li><a href="/cars">Cars</a></li>
      <li><a href="/trucks">Trucks</a></li>
      <!-- navigation items -->
    </ul>
  </nav>
  <!-- actual product content buried somewhere in here -->
  <footer>
    <!-- footer repeating across every page -->
  </footer>
</body>
</html>

For a typical product page, the markup that actually contains useful information is maybe 10 to 20% of the total HTML. The rest is navigation, scripts, CSS, analytics tags, cookie banners, and footers that repeat on every single page. You are paying to send all of that to the LLM on every request.

Stripping that boilerplate before extraction is not optional — it is the first thing you should do.

The right approach: structured extraction at the scraping layer

Instead of fetching raw HTML and then calling an LLM yourself, use a scraping tool that handles both steps. You provide a URL, a plain-English description of what to extract, and an output schema. The tool renders the page, strips boilerplate, extracts exactly what you described, and returns structured JSON that matches your schema.

This eliminates several problems at once:

No raw HTML in your application code
No LLM rate limit management on your side
No token overhead from boilerplate
Consistent output shape regardless of page structure differences
Extraction works on JavaScript-rendered pages

The Spidra API handles all of this through a single endpoint. Here is what the full pipeline looks like.

Building the pipeline with Spidra

Let's go over the steps.

Step 1: Install the SDK

Install the Python SDK:

pip install spidra

Step 2: Define your schema

Define the exact output shape you want. Spidra's AI extraction will match this schema on every page regardless of how the page is structured. Required fields always appear in the output, as null if the page does not have that value.

Note that "content" is in the schema here — this is the full cleaned page text the AI will extract, which is what you embed for semantic search in Step 5.

PRODUCT_SCHEMA = {
    "type": "object",
    "required": ["title", "price", "categories", "description", "content"],
    "properties": {
        "title":       {"type": "string"},
        "brand":       {"type": ["string", "null"]},
        "price":       {"type": ["number", "null"]},
        "currency":    {"type": ["string", "null"]},
        "categories":  {
            "type": "array",
            "items": {
                "type": "string",
                "enum": ["car", "bike", "truck", "van", "electric"]
            }
        },
        "description": {"type": "string"},
        "content":     {"type": "string"},
        "year":        {"type": ["integer", "null"]},
        "mileage_km":  {"type": ["number", "null"]},
    }
}

The categories field uses an enum so the AI matches each page to your predefined taxonomy rather than inventing new category names. The content field is where the AI puts the full cleaned body text, suitable for embedding.

If you have an existing JSON sample and want to generate a schema from it automatically, the free JSON Schema Generator does this in seconds — paste your JSON in and it infers the full schema structure for you.

Step 3: Scrape a single page

from spidra import SpidraClient, ScrapeParams, ScrapeUrl
from datetime import datetime, timezone
import os

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

def scrape_product_page(url: str) -> dict:
    job = spidra.scrape.run_sync(ScrapeParams(
        urls=[ScrapeUrl(url=url)],
        prompt="""
            Extract the product details from this page.
            Normalize price to a number without currency symbols.
            Map the vehicle type to the closest matching category.
            For the content field, include the full cleaned body text of the page
            with boilerplate removed, suitable for semantic search and embedding.
            Return null for any numeric fields not present on the page.
        """,
        output="json",
        schema=PRODUCT_SCHEMA,
        extract_content_only=True,  # strips nav, ads, boilerplate
    ))

    return {
        "url": url,
        **job.result.content,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }

record = scrape_product_page("https://www.toyota.com/camry")
print(record)

{
    "url": "https://www.toyota.com/camry",
    "title": "2026 Camry",
    "brand": "Toyota",
    "price": 29300,
    "currency": "USD",
    "categories": ["car"],
    "description": "Camry's refined performance and elevated design let you chase your daily thrills in style.",
    "content": "2026 Camry Starting MSRP $29,300. Up to 51 combined MPG. Up to 232 net combined horsepower...",
    "year": 2026,
    "mileage_km": null,
    "scraped_at": "2026-06-26T10:22:00Z"
}

Consistent output, every time, regardless of how the source page is laid out.

Step 4: Process 300 URLs without rate limits

Instead of calling an LLM 300 times yourself and managing rate limits, use Spidra's batch endpoint. It processes up to 50 URLs in parallel per request. For 300 URLs, send 6 batch requests:

from spidra import SpidraClient, BatchScrapeParams
from datetime import datetime, timezone
import os, json

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

def batch_scrape(urls: list[str], batch_size: int = 50) -> list[dict]:
    all_records = []

    for i in range(0, len(urls), batch_size):
        chunk = urls[i:i + batch_size]
        print(f"Processing batch {i // batch_size + 1} ({len(chunk)} URLs)...")

        batch = spidra.batch.run_sync(BatchScrapeParams(
            urls=chunk,
            prompt="""
                Extract the product details from this page.
                Normalize price to a number without currency symbols.
                Map the vehicle type to the closest matching category.
                For the content field, include the full cleaned body text of the page
                with boilerplate removed, suitable for semantic search and embedding.
                Return null for any numeric fields not present on the page.
            """,
            output="json",
            schema=PRODUCT_SCHEMA,
            extract_content_only=True,
        ))

        collected_at = datetime.now(timezone.utc).isoformat()

        for item in batch.items:
            if item.status == "completed" and item.result:
                # item.result is the extracted JSON dict directly
                record = {
                    "url": item.url,
                    **item.result,
                    "scraped_at": collected_at,
                }
                all_records.append(record)
            else:
                print(f"  Failed: {item.url} — {item.error}")

    return all_records

# load your URL list
with open("product_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

records = batch_scrape(urls)
print(f"Scraped {len(records)} records")

# save to JSONL for vector DB ingestion
with open("products.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

No rate limit handling. No token management. No boilerplate stripping. 300 URLs processed across 6 parallel batches.

Step 5: Embed and store in pgvector

With the records saved, generate embeddings for the content field and store everything in PostgreSQL with pgvector:

import psycopg2
import json
import os
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_embedding(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# connect to postgres
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

# create the table
cur.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id          SERIAL PRIMARY KEY,
        url         TEXT UNIQUE NOT NULL,
        title       TEXT,
        brand       TEXT,
        price       NUMERIC,
        currency    TEXT,
        categories  TEXT[],
        description TEXT,
        year        INTEGER,
        mileage_km  NUMERIC,
        content     TEXT,
        embedding   vector(1536),
        scraped_at  TIMESTAMPTZ
    )
""")

# load and insert records
with open("products.jsonl") as f:
    for line in f:
        record = json.loads(line)

        if not record.get("content"):
            continue

        embedding = get_embedding(record["content"])

        cur.execute("""
            INSERT INTO products
                (url, title, brand, price, currency, categories, description,
                 year, mileage_km, content, embedding, scraped_at)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (url) DO UPDATE SET
                title       = EXCLUDED.title,
                price       = EXCLUDED.price,
                content     = EXCLUDED.content,
                embedding   = EXCLUDED.embedding,
                scraped_at  = EXCLUDED.scraped_at
        """, (
            record["url"],
            record.get("title"),
            record.get("brand"),
            record.get("price"),
            record.get("currency"),
            record.get("categories", []),
            record.get("description"),
            record.get("year"),
            record.get("mileage_km"),
            record["content"],
            embedding,
            record["scraped_at"],
        ))

conn.commit()
cur.close()
conn.close()
print("Done")

Step 6: Semantic search with structured filtering

Now query with combined semantic search and structured filters:

def search_products(query: str, category: str = None, max_price: float = None, limit: int = 10):
    query_embedding = get_embedding(query)

    filters = []
    params = [query_embedding, limit]

    if category:
        filters.append("%s = ANY(categories)")
        params.insert(-1, category)

    if max_price:
        filters.append("price <= %s")
        params.insert(-1, max_price)

    where_clause = ("WHERE " + " AND ".join(filters)) if filters else ""

    sql = f"""
        SELECT url, title, brand, price, categories, description,
               1 - (embedding <=> %s::vector) AS similarity
        FROM products
        {where_clause}
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """

    params = [query_embedding] + params[:-1] + [query_embedding, limit]

    cur.execute(sql, params)
    return cur.fetchall()

# find fuel-efficient cars under $35,000
results = search_products(
    query="fuel efficient car easy to park",
    category="car",
    max_price=35000,
    limit=5
)

for row in results:
    print(f"{row[1]} ({row[4]}) — ${row[3]:,.0f} — similarity: {row[6]:.3f}")

# Output
Toyota Camry 2026 (['car']) — $29,300 — similarity: 0.891
Kia Sportage 2026 (['car']) — $28,790 — similarity: 0.876
Ford Mustang 2026 (['car']) — $30,658 — similarity: 0.863

Semantic search combined with exact price and category filtering, all from data that was structured consistently from pages that had nothing consistent about them.

Handling pages with bot protection

For sites with bot protection, add use_proxy=True:

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url=url)],
    prompt="Extract the product details...",
    output="json",
    schema=PRODUCT_SCHEMA,
    extract_content_only=True,
    use_proxy=True,
    proxy_country="us",
))

Same code, same output, with residential proxy rotation and CAPTCHA solving handled automatically.

Keeping the corpus fresh

Set up a refresh job that re-scrapes pages on a schedule and updates records where the content has changed:

import hashlib

def needs_update(url: str, new_content: str) -> bool:
    new_hash = hashlib.md5(new_content.encode()).hexdigest()
    cur.execute("SELECT content FROM products WHERE url = %s", (url,))
    row = cur.fetchone()
    if not row:
        return True
    old_hash = hashlib.md5(row[0].encode()).hexdigest()
    return new_hash != old_hash

Only re-embed and update records that actually changed. This keeps your vector index fresh without re-processing everything on every run.

The full pipeline summary

URL list
   ↓
Spidra batch endpoint (50 URLs in parallel)
   ↓
Real browser render + boilerplate stripped + AI extraction + schema enforced
   ↓
Clean structured JSON (consistent shape regardless of source page layout)
   ↓
products.jsonl
   ↓
Embed content field with text-embedding-3-small
   ↓
Insert into pgvector (structured fields + embedding)
   ↓
Semantic search + structured filters

No LLM rate limits to manage. No token overhead from boilerplate. No selectors to maintain. No parser to update when the site redesigns. The schema enforces consistent output even when the source pages are completely inconsistent with each other.

Frequently asked questions

You need both. Embedding the full page content gives you semantic search capability. Storing structured fields separately gives you exact filtering (price ranges, categories, dates). If you only embed raw text you lose the ability to filter, and if you only store structured fields you lose semantic understanding. The pattern above does both.

You can, but you will hit rate limits at scale and pay for a lot of tokens that come from boilerplate HTML rather than actual page content. Handling boilerplate stripping, retry logic, rate limit backoff, and schema enforcement in your application code adds significant complexity. Using a scraping layer that handles all of this and returns clean structured JSON reduces that to a few lines of code.

Share this article

Tutorials

How to scrape web data with Beautiful Soup: step-by-step guide in 2026

How to scrape web data with Beautiful Soup in Python. Covers requests, HTML parsing, CSS selectors, XPath, pagination, CSV and JSON export, and how to handle dynamic content.

June 27, 2026 · 12 min read

Tutorials

How to scrape eBay with JavaScript and Node.js in 2026

How to scrape eBay with JavaScript and Node.js using the Spidra API. Covers item pages, search results, batch scraping, price monitoring, and CSV export with real tested output.

June 27, 2026 · 14 min read

Tutorials

How to scrape eBay with the Spidra API (2026)

Complete guide to scraping eBay with the Spidra API. Covers item pages, search results, pagination, batch scraping 50 items in parallel, price monitoring, and error handling with real output.

June 27, 2026 · 17 min read

Start scraping for free.

Get 300 free credits to explore Spidra. Build your first scraper in minutes, not hours. Upgrade anytime as you scale.

We build features around real workflows. Usually within days.

How to scrape unstructured website data and turn it into structured JSON for AI

The core problem with scraping unstructured pages

What good structured output looks like for a vector DB

Why raw HTML is the wrong input format

The right approach: structured extraction at the scraping layer

Building the pipeline with Spidra

Step 1: Install the SDK

Step 2: Define your schema

Step 3: Scrape a single page

Step 4: Process 300 URLs without rate limits

Step 5: Embed and store in pgvector

Step 6: Semantic search with structured filtering

Handling pages with bot protection

Keeping the corpus fresh

The full pipeline summary

Frequently asked questions

Share this article

Related posts

How to scrape web data with Beautiful Soup: step-by-step guide in 2026

How to scrape eBay with JavaScript and Node.js in 2026

How to scrape eBay with the Spidra API (2026)

Start scraping for free.