Here is a problem a lot of developers run into when building AI search or RAG pipelines.
You have a website with hundreds of pages. The content is inconsistent. Product pages, article pages, listing pages, all laid out slightly differently from each other. You want to extract structured data from all of them, push it into a vector database, and make it searchable.
The obvious approach is to loop through every page, send the HTML to an LLM with a structured output prompt, and save what comes back. It works. Then you have 300 pages to process, rate limits kicking in every few minutes, a token bill that adds up faster than you expected, and you realize the HTML you are sending is mostly navigation, scripts, ads, and footer content rather than the actual page data you care about.
This guide covers the right way to approach this problem: how to minimize token overhead, enforce consistent output shapes across inconsistent pages, avoid hammering an LLM directly for every extraction, and build a pipeline that scales.
The core problem with scraping unstructured pages
Most web pages are not designed for machines. The same information appears in different positions on different pages. Class names change between sections. Some fields are present on some pages and absent on others. There is no guarantee that the title is always in an h1, the price is always in a .price element, or the category is labeled consistently.
Traditional scraping with CSS selectors breaks here. You cannot write a selector for "the price" when the price appears in a <span class="price"> on one page and a <div class="product-cost"> on another.
The three options developers usually land on are:
- Write page-type-specific scrapers. Map out all the different page layouts, write a parser for each, and maintain them as the site changes. This works but is expensive in time and breaks every time the site redesigns.
- Send raw HTML to an LLM. Let the model figure out where everything is. This works but the raw HTML includes navigation, scripts, ads, and boilerplate that can easily push a single page to 10,000+ tokens. At scale the cost and rate limit problem becomes significant.
- Convert HTML to Markdown first, then send to an LLM. This is better. Research has shown that the same page in Markdown averages around 90% fewer tokens than the raw HTML while preserving the structural content that matters. Still involves one LLM call per page.
The better approach is to use a scraping layer that handles AI extraction natively so the LLM calls happen inside the scraping infrastructure rather than in your application code, and you only receive clean structured JSON back.
What good structured output looks like for a vector DB
Before writing any code, it helps to think about what the output should look like. For a searchable product database, you want two things in each record:
A structured fields object with the discrete data points: title, price, category, URL, date, and any other fields that are filterable or sortable.
A text content field for semantic search: the full cleaned body text of the page, stripped of boilerplate, that you will embed and store in the vector database.
Keeping these separate matters. Structured fields support exact filtering ("show me all cars under $20,000"). The text content supports semantic search ("find me something with good fuel economy for city driving"). You want both.
A good record looks something like this:
{
"url": "https://example.com/products/fiat-500",
"title": "Fiat 500 Hatchback 2024",
"price": 18990,
"currency": "USD",
"categories": ["car", "hatchback", "city car"],
"brand": "Fiat",
"description": "The Fiat 500 is a compact city car with a 1.0L mild hybrid engine...",
"content": "Full cleaned page text for embedding goes here...",
"scraped_at": "2026-05-01T10:22:00Z"
}The content field is what you embed and store in pgvector. The rest of the fields are stored as structured columns for filtering.
Why raw HTML is the wrong input format
If you send raw page HTML to an LLM, here is what you are actually sending:
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>Fiat 500 | Cars for Sale</title>
<link rel="stylesheet" href="/css/main.css">
<script src="/js/analytics.js"></script>
<!-- 40 more lines of head content -->
</head>
<body>
<nav class="main-nav">
<ul>
<li><a href="/cars">Cars</a></li>
<li><a href="/bikes">Bikes</a></li>
<!-- navigation items -->
</ul>
</nav>
<header class="site-header">
<!-- header content -->
</header>
<!-- actual product content buried somewhere in here -->
<footer>
<!-- footer repeating across every page -->
</footer>
</body>
</html>For a typical e-commerce product page, the markup that actually contains useful product information is maybe 10 to 20% of the total HTML. The rest is navigation, scripts, CSS, analytics tags, cookie banners, and footers that repeat on every single page. You are paying to send all of that to the LLM on every request.
Stripping that boilerplate before extraction is not optional, it is the first thing you should do.
The right approach: Structured extraction at the scraping layer
Instead of fetching raw HTML and then calling an LLM yourself, use a scraping tool that handles both steps. You provide a URL, a plain-English description of what to extract, and an output schema. The tool renders the page, strips boilerplate, extracts exactly what you described, and returns structured JSON that matches your schema.
This eliminates several problems at once:
- No raw HTML in your application code
- No LLM rate limit management on your side
- No token overhead from boilerplate
- Consistent output shape regardless of page structure differences
- Extraction works even on JavaScript-rendered pages
Spidra handles all of this through a single API. Here is what the full pipeline looks like.
Building the pipeline with Spidra
Step 1: Install the SDK
pip install spidraStep 2: Define your schema
Define the exact output shape you want. Spidra's AI extraction will match this schema on every page regardless of how the page is structured. Required fields always appear in the output, as null if the page does not have that value.
PRODUCT_SCHEMA = {
"type": "object",
"required": ["title", "price", "categories", "description"],
"properties": {
"title": {"type": "string"},
"brand": {"type": ["string", "null"]},
"price": {"type": ["number", "null"]},
"currency": {"type": ["string", "null"]},
"categories": {
"type": "array",
"items": {
"type": "string",
"enum": ["car", "bike", "truck", "van", "electric"]
}
},
"description": {"type": "string"},
"year": {"type": ["integer", "null"]},
"mileage_km": {"type": ["number", "null"]},
}
}The categories field uses an enum so the AI matches each page to your predefined taxonomy rather than inventing new category names. This is the right way to handle classification across inconsistent pages.
Step 3: Scrape a single page
from spidra import SpidraClient, ScrapeParams, ScrapeUrl
from datetime import datetime, timezone
import os
spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])
def scrape_product_page(url: str) -> dict:
job = spidra.scrape.run_sync(ScrapeParams(
urls=[ScrapeUrl(url=url)],
prompt="""
Extract the product details from this page.
Normalize price to a number without currency symbols.
Map the vehicle type to the closest matching category.
Return null for any fields not present on the page.
""",
output="json",
schema=PRODUCT_SCHEMA,
extract_content_only=True, # strips nav, ads, boilerplate
))
structured = job.result.content
content = job.result.markdown_content or ""
return {
"url": url,
**structured,
"content": content,
"scraped_at": datetime.now(timezone.utc).isoformat(),
}
record = scrape_product_page("https://example.com/products/fiat-500")
print(record){
"url": "https://example.com/products/fiat-500",
"title": "Fiat 500 Hatchback 2024",
"brand": "Fiat",
"price": 18990,
"currency": "USD",
"categories": ["car"],
"description": "The Fiat 500 is a compact city car with a 1.0L mild hybrid engine...",
"year": 2024,
"mileage_km": null,
"content": "Full cleaned page text...",
"scraped_at": "2026-05-01T10:22:00Z"
}Consistent output, every time, regardless of how the source page is laid out.
Step 4: Process 300 URLs without rate limits
Instead of calling an LLM 300 times yourself and managing rate limits, use Spidra's batch endpoint. It processes up to 50 URLs in parallel per request. For 300 URLs, send 6 batch requests:
from spidra import SpidraClient, BatchScrapeParams
from datetime import datetime, timezone
import os, json, time
spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])
def batch_scrape(urls: list[str], batch_size: int = 50) -> list[dict]:
all_records = []
for i in range(0, len(urls), batch_size):
chunk = urls[i:i + batch_size]
print(f"Processing batch {i // batch_size + 1} ({len(chunk)} URLs)...")
batch = spidra.batch.run_sync(BatchScrapeParams(
urls=chunk,
prompt="""
Extract the product details from this page.
Normalize price to a number without currency symbols.
Map the vehicle type to the closest matching category.
Return null for any fields not present on the page.
""",
output="json",
schema=PRODUCT_SCHEMA,
extract_content_only=True,
))
collected_at = datetime.now(timezone.utc).isoformat()
for item in batch.items:
if item.status == "completed" and item.result:
record = {
"url": item.url,
**item.result.content,
"content": item.result.markdown_content or "",
"scraped_at": collected_at,
}
all_records.append(record)
else:
print(f" Failed: {item.url} — {item.error}")
return all_records
# load your URL list
with open("product_urls.txt") as f:
urls = [line.strip() for line in f if line.strip()]
records = batch_scrape(urls)
print(f"Scraped {len(records)} records")
# save to JSONL for vector DB ingestion
with open("products.jsonl", "w") as f:
for record in records:
f.write(json.dumps(record) + "\n")No rate limit handling. No token management. No boilerplate stripping. 300 URLs processed across 6 parallel batches.
Step 5: Embed and store in pgvector
With the records saved, generate embeddings for the content field and store everything in PostgreSQL with pgvector:
import psycopg2
import json
from openai import OpenAI
openai_client = OpenAI(api_key=os.environ["OPENAI_API_KEY"])
def get_embedding(text: str) -> list[float]:
response = openai_client.embeddings.create(
input=text,
model="text-embedding-3-small"
)
return response.data[0].embedding
# connect to postgres
conn = psycopg2.connect(os.environ["DATABASE_URL"])
cur = conn.cursor()
# create the table
cur.execute("""
CREATE TABLE IF NOT EXISTS products (
id SERIAL PRIMARY KEY,
url TEXT UNIQUE NOT NULL,
title TEXT,
brand TEXT,
price NUMERIC,
currency TEXT,
categories TEXT[],
description TEXT,
year INTEGER,
mileage_km NUMERIC,
content TEXT,
embedding vector(1536),
scraped_at TIMESTAMPTZ
)
""")
# load and insert records
with open("products.jsonl") as f:
for line in f:
record = json.loads(line)
if not record.get("content"):
continue
embedding = get_embedding(record["content"])
cur.execute("""
INSERT INTO products
(url, title, brand, price, currency, categories, description,
year, mileage_km, content, embedding, scraped_at)
VALUES (%s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s, %s)
ON CONFLICT (url) DO UPDATE SET
title = EXCLUDED.title,
price = EXCLUDED.price,
content = EXCLUDED.content,
embedding = EXCLUDED.embedding,
scraped_at = EXCLUDED.scraped_at
""", (
record["url"],
record.get("title"),
record.get("brand"),
record.get("price"),
record.get("currency"),
record.get("categories", []),
record.get("description"),
record.get("year"),
record.get("mileage_km"),
record["content"],
embedding,
record["scraped_at"],
))
conn.commit()
cur.close()
conn.close()
print("Done")Step 6: Semantic search with structured filtering
Now query with combined semantic search and structured filters:
def search_products(query: str, category: str = None, max_price: float = None, limit: int = 10):
# embed the search query
query_embedding = get_embedding(query)
# build the query with optional filters
filters = []
params = [query_embedding, limit]
if category:
filters.append(f"%s = ANY(categories)")
params.insert(-1, category)
if max_price:
filters.append(f"price <= %s")
params.insert(-1, max_price)
where_clause = ("WHERE " + " AND ".join(filters)) if filters else ""
sql = f"""
SELECT url, title, brand, price, categories, description,
1 - (embedding <=> %s::vector) AS similarity
FROM products
{where_clause}
ORDER BY embedding <=> %s::vector
LIMIT %s
"""
params = [query_embedding] + params[:-1] + [query_embedding, limit]
cur.execute(sql, params)
return cur.fetchall()
# find fuel-efficient city cars under $25,000
results = search_products(
query="fuel efficient city car easy to park",
category="car",
max_price=25000,
limit=5
)
for row in results:
print(f"{row[1]} ({row[4]}) — ${row[3]:,.0f} — similarity: {row[6]:.3f}")# Output
Fiat 500 Hatchback 2024 (['car']) — $18,990 — similarity: 0.891
Toyota Yaris 2024 (['car']) — $22,450 — similarity: 0.876
Mini Cooper 3-Door 2023 (['car']) — $24,100 — similarity: 0.863Semantic search combined with exact price and category filtering, all from data that was structured consistently from pages that had nothing consistent about them.
Handling pages with no JavaScript
For sites where content is server-rendered and there is no bot protection, you can scrape faster without using a full browser. Spidra detects this automatically and uses the appropriate rendering strategy per page. You do not need to configure anything.
For sites with bot protection, add use_proxy=True:
job = spidra.scrape.run_sync(ScrapeParams(
urls=[ScrapeUrl(url=url)],
prompt="Extract the product details...",
output="json",
schema=PRODUCT_SCHEMA,
extract_content_only=True,
use_proxy=True,
proxy_country="us",
))Same code, same output, with residential proxy rotation and CAPTCHA solving handled automatically.
Keeping the corpus fresh
Set up a refresh job that re-scrapes pages on a schedule and updates records where the content hash has changed:
import hashlib
def needs_update(url: str, new_content: str) -> bool:
new_hash = hashlib.md5(new_content.encode()).hexdigest()
cur.execute("SELECT content FROM products WHERE url = %s", (url,))
row = cur.fetchone()
if not row:
return True
old_hash = hashlib.md5(row[0].encode()).hexdigest()
return new_hash != old_hashOnly re-embed and update records that actually changed. This keeps your vector index fresh without re-processing everything on every run.
The full pipeline summary
URL list
↓
Spidra batch endpoint (50 URLs in parallel)
↓
Real browser render + boilerplate stripped + AI extraction + schema enforced
↓
Clean structured JSON (consistent shape regardless of source page layout)
↓
products.jsonl
↓
Embed content field with text-embedding-3-small
↓
Insert into pgvector (structured fields + embedding)
↓
Semantic search + structured filters
No LLM rate limits to manage. No token overhead from boilerplate. No selectors to maintain. No parser to update when the site redesigns. The schema enforces consistent output even when the source pages are completely inconsistent with each other.
