Skip to main content
Blog/ How to scrape unstructured website data and turn it into structured JSON for AI
June 8, 2026 · 11 min read

How to scrape unstructured website data and turn it into structured JSON for AI

Joel Olawanle
Joel Olawanle
How to scrape unstructured website data and turn it into structured JSON for AI

Here is a problem a lot of developers run into when building AI search or RAG pipelines.

You have a website with hundreds of pages. The content is inconsistent. Product pages, article pages, listing pages, all laid out slightly differently from each other. You want to extract structured data from all of them, push it into a vector database, and make it searchable.

The obvious approach is to loop through every page, send the HTML to an LLM with a structured output prompt, and save what comes back. It works. Then you have 300 pages to process, rate limits kicking in every few minutes, a token bill that adds up faster than you expected, and you realize the HTML you are sending is mostly navigation, scripts, ads, and footer content rather than the actual page data you care about.

This guide covers the right way to approach this problem: how to minimize token overhead, enforce consistent output shapes across inconsistent pages, avoid hammering an LLM directly for every extraction, and build a pipeline that scales.

The core problem with scraping unstructured pages

Most web pages are not designed for machines. The same information appears in different positions on different pages. Class names change between sections. Some fields are present on some pages and absent on others. There is no guarantee that the title is always in an h1, the price is always in a .price element, or the category is labeled consistently.

Traditional scraping with CSS selectors breaks here. You cannot write a selector for "the price" when the price appears in a <span class="price"> on one page and a <div class="product-cost"> on another.

The three options developers usually land on are:

  • Write page-type-specific scrapers. Map out all the different page layouts, write a parser for each, and maintain them as the site changes. This works but is expensive in time and breaks every time the site redesigns.
  • Send raw HTML to an LLM. Let the model figure out where everything is. This works but the raw HTML includes navigation, scripts, ads, and boilerplate that can easily push a single page to 10,000+ tokens. At scale the cost and rate limit problem becomes significant.
  • Convert HTML to Markdown first, then send to an LLM. This is better. Research has shown that the same page in Markdown averages around 90% fewer tokens than the raw HTML while preserving the structural content that matters. Still involves one LLM call per page.

The better approach is to use a scraping layer that handles AI extraction natively so the LLM calls happen inside the scraping infrastructure rather than in your application code, and you only receive clean structured JSON back.

What good structured output looks like for a vector DB

Before writing any code, it helps to think about what the output should look like. For a searchable product database, you want two things in each record:

A structured fields object with the discrete data points: title, price, category, URL, date, and any other fields that are filterable or sortable.

A text content field for semantic search: the full cleaned body text of the page, stripped of boilerplate, that you will embed and store in the vector database.

Keeping these separate matters. Structured fields support exact filtering ("show me all cars under $20,000"). The text content supports semantic search ("find me something with good fuel economy for city driving"). You want both.

A good record looks something like this:

{
  "url": "https://example.com/products/fiat-500",
  "title": "Fiat 500 Hatchback 2024",
  "price": 18990,
  "currency": "USD",
  "categories": ["car", "hatchback", "city car"],
  "brand": "Fiat",
  "description": "The Fiat 500 is a compact city car with a 1.0L mild hybrid engine...",
  "content": "Full cleaned page text for embedding goes here...",
  "scraped_at": "2026-05-01T10:22:00Z"
}

The content field is what you embed and store in pgvector. The rest of the fields are stored as structured columns for filtering.

Why raw HTML is the wrong input format

If you send raw page HTML to an LLM, here is what you are actually sending:

<!DOCTYPE html>
<html>
<head>
  <meta charset="UTF-8">
  <title>Fiat 500 | Cars for Sale</title>
  <link rel="stylesheet" href="/css/main.css">
  <script src="/js/analytics.js"></script>
  <!-- 40 more lines of head content -->
</head>
<body>
  <nav class="main-nav">
    <ul>
      <li><a href="/cars">Cars</a></li>
      <li><a href="/bikes">Bikes</a></li>
      <!-- navigation items -->
    </ul>
  </nav>
  <header class="site-header">
    <!-- header content -->
  </header>
  <!-- actual product content buried somewhere in here -->
  <footer>
    <!-- footer repeating across every page -->
  </footer>
</body>
</html>

For a typical e-commerce product page, the markup that actually contains useful product information is maybe 10 to 20% of the total HTML. The rest is navigation, scripts, CSS, analytics tags, cookie banners, and footers that repeat on every single page. You are paying to send all of that to the LLM on every request.

Stripping that boilerplate before extraction is not optional, it is the first thing you should do.

The right approach: Structured extraction at the scraping layer

Instead of fetching raw HTML and then calling an LLM yourself, use a scraping tool that handles both steps. You provide a URL, a plain-English description of what to extract, and an output schema. The tool renders the page, strips boilerplate, extracts exactly what you described, and returns structured JSON that matches your schema.

This eliminates several problems at once:

  • No raw HTML in your application code
  • No LLM rate limit management on your side
  • No token overhead from boilerplate
  • Consistent output shape regardless of page structure differences
  • Extraction works even on JavaScript-rendered pages

Spidra handles all of this through a single API. Here is what the full pipeline looks like.

Building the pipeline with Spidra

Step 1: Install the SDK

pip install spidra

Step 2: Define your schema

Define the exact output shape you want. Spidra's AI extraction will match this schema on every page regardless of how the page is structured. Required fields always appear in the output, as null if the page does not have that value.

PRODUCT_SCHEMA = {
    "type": "object",
    "required": ["title", "price", "categories", "description"],
    "properties": {
        "title":       {"type": "string"},
        "brand":       {"type": ["string", "null"]},
        "price":       {"type": ["number", "null"]},
        "currency":    {"type": ["string", "null"]},
        "categories":  {
            "type": "array",
            "items": {
                "type": "string",
                "enum": ["car", "bike", "truck", "van", "electric"]
            }
        },
        "description": {"type": "string"},
        "year":        {"type": ["integer", "null"]},
        "mileage_km":  {"type": ["number", "null"]},
    }
}

The categories field uses an enum so the AI matches each page to your predefined taxonomy rather than inventing new category names. This is the right way to handle classification across inconsistent pages.

Step 3: Scrape a single page

from spidra import SpidraClient, ScrapeParams, ScrapeUrl
from datetime import datetime, timezone
import os

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

def scrape_product_page(url: str) -> dict:
    job = spidra.scrape.run_sync(ScrapeParams(
        urls=[ScrapeUrl(url=url)],
        prompt="""
            Extract the product details from this page.
            Normalize price to a number without currency symbols.
            Map the vehicle type to the closest matching category.
            Return null for any fields not present on the page.
        """,
        output="json",
        schema=PRODUCT_SCHEMA,
        extract_content_only=True,  # strips nav, ads, boilerplate
    ))

    structured = job.result.content
    content = job.result.markdown_content or ""

    return {
        "url": url,
        **structured,
        "content": content,
        "scraped_at": datetime.now(timezone.utc).isoformat(),
    }

record = scrape_product_page("https://example.com/products/fiat-500")
print(record)
{
    "url": "https://example.com/products/fiat-500",
    "title": "Fiat 500 Hatchback 2024",
    "brand": "Fiat",
    "price": 18990,
    "currency": "USD",
    "categories": ["car"],
    "description": "The Fiat 500 is a compact city car with a 1.0L mild hybrid engine...",
    "year": 2024,
    "mileage_km": null,
    "content": "Full cleaned page text...",
    "scraped_at": "2026-05-01T10:22:00Z"
}

Consistent output, every time, regardless of how the source page is laid out.

Step 4: Process 300 URLs without rate limits

Instead of calling an LLM 300 times yourself and managing rate limits, use Spidra's batch endpoint. It processes up to 50 URLs in parallel per request. For 300 URLs, send 6 batch requests:

from spidra import SpidraClient, BatchScrapeParams
from datetime import datetime, timezone
import os, json, time

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

def batch_scrape(urls: list[str], batch_size: int = 50) -> list[dict]:
    all_records = []

    for i in range(0, len(urls), batch_size):
        chunk = urls[i:i + batch_size]
        print(f"Processing batch {i // batch_size + 1} ({len(chunk)} URLs)...")

        batch = spidra.batch.run_sync(BatchScrapeParams(
            urls=chunk,
            prompt="""
                Extract the product details from this page.
                Normalize price to a number without currency symbols.
                Map the vehicle type to the closest matching category.
                Return null for any fields not present on the page.
            """,
            output="json",
            schema=PRODUCT_SCHEMA,
            extract_content_only=True,
        ))

        collected_at = datetime.now(timezone.utc).isoformat()

        for item in batch.items:
            if item.status == "completed" and item.result:
                record = {
                    "url": item.url,
                    **item.result.content,
                    "content": item.result.markdown_content or "",
                    "scraped_at": collected_at,
                }
                all_records.append(record)
            else:
                print(f"  Failed: {item.url} — {item.error}")

    return all_records

# load your URL list
with open("product_urls.txt") as f:
    urls = [line.strip() for line in f if line.strip()]

records = batch_scrape(urls)
print(f"Scraped {len(records)} records")

# save to JSONL for vector DB ingestion
with open("products.jsonl", "w") as f:
    for record in records:
        f.write(json.dumps(record) + "\n")

No rate limit handling. No token management. No boilerplate stripping. 300 URLs processed across 6 parallel batches.

Step 5: Embed and store in pgvector

With the records saved, generate embeddings for the content field and store everything in PostgreSQL with pgvector:

import psycopg2
import json
from openai import OpenAI

openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])

def get_embedding(text: str) -> list[float]:
    response = openai_client.embeddings.create(
        input=text,
        model="text-embedding-3-small"
    )
    return response.data[0].embedding

# connect to postgres
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()

# create the table
cur.execute("""
    CREATE TABLE IF NOT EXISTS products (
        id          SERIAL PRIMARY KEY,
        url         TEXT UNIQUE NOT NULL,
        title       TEXT,
        brand       TEXT,
        price       NUMERIC,
        currency    TEXT,
        categories  TEXT[],
        description TEXT,
        year        INTEGER,
        mileage_km  NUMERIC,
        content     TEXT,
        embedding   vector(1536),
        scraped_at  TIMESTAMPTZ
    )
""")

# load and insert records
with open("products.jsonl") as f:
    for line in f:
        record = json.loads(line)

        if not record.get("content"):
            continue

        embedding = get_embedding(record["content"])

        cur.execute("""
            INSERT INTO products
                (url, title, brand, price, currency, categories, description,
                 year, mileage_km, content, embedding, scraped_at)
            VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
            ON CONFLICT (url) DO UPDATE SET
                title       = EXCLUDED.title,
                price       = EXCLUDED.price,
                content     = EXCLUDED.content,
                embedding   = EXCLUDED.embedding,
                scraped_at  = EXCLUDED.scraped_at
        """, (
            record["url"],
            record.get("title"),
            record.get("brand"),
            record.get("price"),
            record.get("currency"),
            record.get("categories", []),
            record.get("description"),
            record.get("year"),
            record.get("mileage_km"),
            record["content"],
            embedding,
            record["scraped_at"],
        ))

conn.commit()
cur.close()
conn.close()
print("Done")

Step 6: Semantic search with structured filtering

Now query with combined semantic search and structured filters:

def search_products(query: str, category: str = None, max_price: float = None, limit: int = 10):
    # embed the search query
    query_embedding = get_embedding(query)

    # build the query with optional filters
    filters = []
    params = [query_embedding, limit]

    if category:
        filters.append(f"%s = ANY(categories)")
        params.insert(-1, category)

    if max_price:
        filters.append(f"price <= %s")
        params.insert(-1, max_price)

    where_clause = ("WHERE " + " AND ".join(filters)) if filters else ""

    sql = f"""
        SELECT url, title, brand, price, categories, description,
               1 - (embedding <=> %s::vector) AS similarity
        FROM products
        {where_clause}
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """

    params = [query_embedding] + params[:-1] + [query_embedding, limit]

    cur.execute(sql, params)
    return cur.fetchall()

# find fuel-efficient city cars under $25,000
results = search_products(
    query="fuel efficient city car easy to park",
    category="car",
    max_price=25000,
    limit=5
)

for row in results:
    print(f"{row[1]} ({row[4]}) — ${row[3]:,.0f} — similarity: {row[6]:.3f}")
# Output
Fiat 500 Hatchback 2024 (['car']) — $18,990 — similarity: 0.891
Toyota Yaris 2024 (['car']) — $22,450 — similarity: 0.876
Mini Cooper 3-Door 2023 (['car']) — $24,100 — similarity: 0.863

Semantic search combined with exact price and category filtering, all from data that was structured consistently from pages that had nothing consistent about them.

Handling pages with no JavaScript

For sites where content is server-rendered and there is no bot protection, you can scrape faster without using a full browser. Spidra detects this automatically and uses the appropriate rendering strategy per page. You do not need to configure anything.

For sites with bot protection, add use_proxy=True:

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url=url)],
    prompt="Extract the product details...",
    output="json",
    schema=PRODUCT_SCHEMA,
    extract_content_only=True,
    use_proxy=True,
    proxy_country="us",
))

Same code, same output, with residential proxy rotation and CAPTCHA solving handled automatically.

Keeping the corpus fresh

Set up a refresh job that re-scrapes pages on a schedule and updates records where the content hash has changed:

import hashlib

def needs_update(url: str, new_content: str) -> bool:
    new_hash = hashlib.md5(new_content.encode()).hexdigest()
    cur.execute("SELECT content FROM products WHERE url = %s", (url,))
    row = cur.fetchone()
    if not row:
        return True
    old_hash = hashlib.md5(row[0].encode()).hexdigest()
    return new_hash != old_hash

Only re-embed and update records that actually changed. This keeps your vector index fresh without re-processing everything on every run.

The full pipeline summary

URL list
   ↓
Spidra batch endpoint (50 URLs in parallel)
   ↓
Real browser render + boilerplate stripped + AI extraction + schema enforced
   ↓
Clean structured JSON (consistent shape regardless of source page layout)
   ↓
products.jsonl
   ↓
Embed content field with text-embedding-3-small
   ↓
Insert into pgvector (structured fields + embedding)
   ↓
Semantic search + structured filters

No LLM rate limits to manage. No token overhead from boilerplate. No selectors to maintain. No parser to update when the site redesigns. The schema enforces consistent output even when the source pages are completely inconsistent with each other.

Frequently asked questions

You need both. Embedding the full page content gives you semantic search capability. Storing structured fields separately gives you exact filtering (price ranges, categories, dates). If you only embed raw text you lose the ability to filter, and if you only store structured fields you lose semantic understanding. The pattern above does both.

Share this article

Start scraping for free.

Get 300 free credits to explore Spidra. Build your first scraper in minutes, not hours. Upgrade anytime as you scale.

We build features around real workflows. Usually within days.