May 21, 2026 · 14 min read

How to scrape any website and get structured data with a single API call

Joel Olawanle

Sometimes you do not just need the HTML. You need the actual data inside it, like the product names, prices, article text, and contact details, formatted cleanly, predictably, and ready to use without writing a custom parser for every site.

The gap between "I have a URL" and "I have clean structured JSON" is a lot wider than most developers expect the first time they try to cross it.

This guide walks through why that gap exists, what the DIY path actually looks like in practice, and how a single API call can take care of all of it.

Why "just pull the data" is harder than it sounds

Scraping a webpage sounds like a solved problem. It is not. Here is what you are actually dealing with.

JavaScript-rendered pages

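Many modern sites do not deliver their actual content in the first HTML response. Client-side rendered apps built with React, Vue, or Angular typically send back a nearly empty shell and then populate the real content with JavaScript after the page loads. Frameworks such as Next.js can render on the server, but even those pages often fetch part of their data in the browser.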

<!DOCTYPE html>
<html>
  <body>
    <div id="app"></div>
    <script src="/main.js"></script>
  </body>
</html>

Run requests.get(url) on a page like this, and that shell is all you get. Not the product listings. Not the pricing table. Not the article text. To get the actual rendered content, you need a headless browser that runs the JavaScript, waits for all the async calls to finish, and hands you the DOM in its final state.
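
Before reaching for a headless browser, it can help to check whether a response is one of these empty shells. The sketch below uses only the standard library and a crude heuristic: count the text that sits outside script and style tags. The function name and the 50-character threshold are illustrative assumptions, not a standard.

```python
from html.parser import HTMLParser


class VisibleTextCounter(HTMLParser):
    """Counts characters of text that sit outside script/style-like tags."""

    SKIPPED_TAGS = {"script", "style", "noscript", "template"}

    def __init__(self) -> None:
        super().__init__()
        self.skip_depth = 0      # greater than 0 while inside a skipped tag
        self.visible_chars = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.visible_chars += len(data.strip())


def looks_like_js_shell(html: str, min_chars: int = 50) -> bool:
    """Return True when the HTML carries almost no visible text of its own.

    min_chars is an arbitrary threshold chosen for illustration.
    """
    counter = VisibleTextCounter()
    counter.feed(html)
    counter.close()
    return counter.visible_chars < min_chars
```

Run against the shell shown above, the check returns True. A page that ships its content in the initial HTML will usually clear the threshold easily.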

Anti-bot protections

A large share of websites now actively block automated requests. Cloudflare, DataDome, PerimeterX, and similar systems look at how your requests are structured and either serve you a CAPTCHA, return a 403, or silently give you a broken version of the page.

Getting past these reliably means rotating residential proxies, randomizing browser fingerprints, matching TLS fingerprints, and keeping all of it updated as detection methods improve. None of that is a one-time setup you do and forget.
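
Because a block can arrive as a 403, a CAPTCHA page, or a quietly broken page, a scraper needs to classify responses rather than trust the status code alone. Here is a minimal sketch of that idea. The marker strings, the 429 check, and the 500-character threshold are assumptions picked for illustration; real anti-bot pages vary widely.

```python
# Assumed marker strings; real challenge pages differ by vendor and change often.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "access denied")


def classify_response(status_code: int, body: str, min_body_chars: int = 500) -> str:
    """Roughly sort a response into "blocked", "challenge", "suspicious", or "ok"."""
    # 403 is the case named above; 429 (Too Many Requests) is added as a common sibling.
    if status_code in (403, 429):
        return "blocked"
    lowered = body.lower()
    if any(marker in lowered for marker in CHALLENGE_MARKERS):
        return "challenge"
    # A 200 with a tiny body is a hint that a stripped-down page was served.
    if len(body) < min_body_chars:
        return "suspicious"
    return "ok"
```

A heuristic like this only tells you that you were blocked. It does nothing to get you past the block, which is where the proxy and fingerprint work comes in.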

Data buried in inconsistent HTML

Even when you do get the rendered HTML, pulling structured data out of it is its own problem. CSS selectors and XPath are brittle. A class name change, a layout update, or an A/B test on the target site silently breaks your scraper. Regex on HTML is even worse. You end up writing a lot of defensive code that still fails on edge cases.

Pages that need interaction before you can scrape them

Some data only shows up after the user does something first. Clicking a "Load More" button. Dismissing a cookie banner. Picking a filter. Scrolling down far enough to trigger lazy loading. A basic HTTP fetch does not touch any of this. You need a browser that can interact with the page first and then scrape it.

The DIY approach: headless browsers and custom parsers

The standard path most developers take is combining a headless browser with a parsing library. Playwright or Puppeteer handles the rendering, and then BeautifulSoup, Cheerio, or custom code handles pulling out the fields you want.

A basic Playwright example for extracting product data looks something like this:

from playwright.sync_api import sync_playwright
from bs4 import BeautifulSoup

def scrape_products(url: str) -> list[dict]:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")
        html = page.content()
        browser.close()

    soup = BeautifulSoup(html, "html.parser")
    products = []

    for card in soup.select(".product-card"):
        products.append({
            "name": card.select_one(".product-title").get_text(strip=True),
            "price": card.select_one(".product-price").get_text(strip=True),
            "available": "out-of-stock" not in card.get("class", [])
        })

    return products

This works fine for a demo or a one-off script. It falls apart in production. Here are the real problems you will run into.

  • Browser memory usage. Each Chromium instance chews through 200 to 400 MB of RAM. Running 20 concurrent scrapes means 4 to 8 GB just for the browser processes, plus memory leaks that will require you to build restart logic.
  • Selectors break constantly. That .product-card selector is one site redesign away from returning nothing. You will spend as much time maintaining your selectors as you spend on actual product work.
  • Headless Chrome is detectable. Sites check navigator.webdriver, the absence of GPU fingerprints, timing patterns, and TLS fingerprints. You will need playwright-extra with stealth plugins, and those plugins need updates every time detection techniques improve.
  • You need proxy infrastructure. At any real scale you will hit IP bans. That means signing up with proxy providers, building rotation logic, detecting dead proxies, and dealing with the billing.
  • It is slow. Launching a browser, loading a page, waiting for JavaScript to finish, and then parsing the result takes 5 to 15 seconds per URL in normal conditions. That adds up fast when you have any kind of volume.
  • No schema enforcement. Your parser returns whatever it finds. If a field is missing or formatted differently on one page, you get None, an empty string, or an exception somewhere downstream. The bug often surfaces far from where the extraction happened.

All of that infrastructure, all of that maintenance, just to get a few fields out of a webpage.
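
To put rough numbers on the memory and speed points, here is a back-of-the-envelope sketch using the figures quoted above (200 to 400 MB per browser, 5 to 15 seconds per URL). The 10,000-URL workload is a hypothetical example, and the helper names are made up for this illustration.

```python
def browser_memory_gb(concurrency: int, mb_per_browser: int) -> float:
    """Total RAM for the browser processes alone, using 1 GB = 1000 MB."""
    return concurrency * mb_per_browser / 1000


def crawl_hours(url_count: int, seconds_per_url: float, concurrency: int) -> float:
    """Wall-clock hours, assuming perfect parallelism and no retries."""
    return url_count * seconds_per_url / concurrency / 3600


if __name__ == "__main__":
    # 20 concurrent browsers at 200 to 400 MB each
    print(browser_memory_gb(20, 200), browser_memory_gb(20, 400))
    # Hypothetical 10,000 URLs at 5 to 15 seconds each, 20 at a time
    print(round(crawl_hours(10_000, 5, 20), 1), round(crawl_hours(10_000, 15, 20), 1))
```

With these inputs the memory estimate comes out at 4 to 8 GB, and the hypothetical 10,000-URL run takes roughly 0.7 to 2.1 hours before counting retries, blocks, or browser restarts.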

The API approach: describe what you want, get it back as JSON

Spidra takes a completely different approach. Instead of writing selectors or parsing logic, you describe what you want in plain English. The API loads the page in a real browser, runs an AI extraction pass over the rendered content, and returns clean structured JSON.

Spidra has official SDKs for ten languages: Node.js, Python, Go, PHP, Ruby, Elixir, .NET, Swift, Java, and Rust. They all follow the same pattern, so you only have to learn it once. Every SDK handles job submission, polling, retry logic, and error mapping to typed exceptions. You never write a polling loop by hand.

Install the one you need:

# Node.js / TypeScript
npm install spidra-js

# Python
pip install spidra

# Go
go get github.com/spidra-io/spidra-go

# PHP
composer require spidra/spidra-php

# Ruby
gem install spidra

# .NET
dotnet add package Spidra

# Rust
cargo add spidra

# Java (Gradle)
implementation 'io.spidra:spidra-java-sdk:0.1.0'

Node.js / TypeScript

import { SpidraClient } from 'spidra-js'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })

const job = await spidra.scrape.run({
  urls: [{ url: 'https://example.com/products' }],
  prompt: 'Extract all product names, prices, and availability',
  output: 'json',
})

console.log(job.result.content)
// [{ name: 'Wireless Headphones', price: '$79.99', available: true }, ...]

run() submits the job and waits until it completes. If you want to fire and come back later, use submit() to get a job ID and get() to check on it.

Python

import asyncio

from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key="spd_YOUR_API_KEY")


async def main() -> None:
    job = await spidra.scrape.run(ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com/products")],
        prompt="Extract all product names, prices, and availability",
        output="json",
    ))
    print(job.result.content)


# await is only valid inside a coroutine, so a plain script runs it through asyncio
asyncio.run(main())

The SDK is async by default, but every method has a _sync counterpart, so it works in regular scripts, Django views, Flask routes, or Jupyter notebooks without any event loop setup:

# Works anywhere without async/await
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com/products")],
    prompt="Extract all product names, prices, and availability",
    output="json",
))

Go

package main

import (
    "context"
    "fmt"
    "log"
    "os"

    "github.com/spidra-io/spidra-go"
)

func main() {
    client := spidra.NewClient(os.Getenv("SPIDRA_API_KEY"))

    job, err := client.Scrape.Run(context.Background(), spidra.ScrapeParams{
        URLs: []spidra.ScrapeURL{
            {URL: "https://example.com/products"},
        },
        Prompt: "Extract all product names, prices, and availability",
        Output: "json",
    })
    if err != nil {
        log.Fatal(err)
    }

    fmt.Println(job.Result.Content)
}

PHP

use Spidra\SpidraClient;
use Spidra\ScrapeParams;
use Spidra\ScrapeUrl;

$spidra = new SpidraClient($_ENV['SPIDRA_API_KEY']);

$job = $spidra->scrape->run(new ScrapeParams(
    urls: [new ScrapeUrl(url: 'https://example.com/products')],
    prompt: 'Extract all product names, prices, and availability',
    output: 'json',
));

print_r($job->result->content);

Ruby

require 'spidra'

spidra = Spidra::Client.new(api_key: ENV['SPIDRA_API_KEY'])

job = spidra.scrape.run(
  urls: [{ url: 'https://example.com/products' }],
  prompt: 'Extract all product names, prices, and availability',
  output: 'json'
)

puts job.result.content

cURL (raw API)

If you prefer to work directly against the REST API without an SDK, it is just a POST and then a GET to poll:

# Submit the job
curl -X POST "https://api.spidra.io/api/scrape" \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [{"url": "https://example.com/products", "actions": []}],
    "prompt": "Extract all product names, prices, and availability status",
    "output": "json"
  }'

# Returns: { "jobId": "job_abc123", "status": "waiting" }

# Poll until status is "completed"
curl "https://api.spidra.io/api/scrape/job_abc123" \
  -H "x-api-key: YOUR_API_KEY"
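
If you go the raw REST route, the polling step is the part you have to write yourself. The sketch below mirrors the two cURL commands using only the standard library. The endpoint, the x-api-key header, and the "completed" status come from the example above. The "failed" status, the function names, and the interval and timeout defaults are assumptions made for illustration.

```python
import json
import time
import urllib.request
from typing import Callable

API_BASE = "https://api.spidra.io/api/scrape"  # endpoint from the cURL example


def fetch_job(job_id: str, api_key: str) -> dict:
    """GET the job document, mirroring the second cURL command."""
    request = urllib.request.Request(
        f"{API_BASE}/{job_id}",
        headers={"x-api-key": api_key},
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)


def wait_for_job(
    get_job: Callable[[], dict],
    interval_seconds: float = 2.0,
    timeout_seconds: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call get_job repeatedly until the job reports "completed"."""
    deadline = time.monotonic() + timeout_seconds
    while True:
        job = get_job()
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":  # assumed failure status, not confirmed by the article
            raise RuntimeError(f"Job failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"Job still {status!r} after {timeout_seconds}s")
        sleep(interval_seconds)
```

A call would look like wait_for_job(lambda: fetch_job("job_abc123", api_key)). Passing the fetch step in as a callable keeps the loop testable without touching the network.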

Locking down the output shape with JSON schema

A plain-language prompt is flexible, but sometimes you need a guaranteed structure. If you are inserting results into a database, passing them to another service, or working with typed code, you want to know exactly what shape is coming back every single time.

The schema parameter enforces that. Required fields always appear in the output, as null if the page does not have that value. The AI cannot invent fields or skip required ones.

job = await spidra.scrape.run(ScrapeParams(
    urls=[ScrapeUrl(url="https://jobs.example.com/senior-engineer")],
    prompt="Extract the job listing details. Normalize salary to USD annual.",
    output="json",
    schema={
        "type": "object",
        "required": ["title", "company", "remote"],
        "properties": {
            "title":           {"type": "string"},
            "company":         {"type": "string"},
            "location":        {"type": ["string", "null"]},
            "remote":          {"type": ["boolean", "null"]},
            "salary_min":      {"type": ["number", "null"]},
            "salary_max":      {"type": ["number", "null"]},
            "skills":          {"type": "array", "items": {"type": "string"}},
            "employment_type": {
                "type": ["string", "null"],
                "enum": ["full_time", "part_time", "contract", None]
            },
        },
    },
))
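
Even with server-side enforcement, some teams like to assert the shape again before writing to a database. The following is a small local check, not Spidra's validator: it covers only required, type (including type lists such as ["string", "null"]), and enum on a flat object. The function names are hypothetical.

```python
from typing import Any

# JSON Schema type names mapped to Python types.
JSON_TYPES = {
    "string": (str,),
    "number": (int, float),
    "integer": (int,),
    "boolean": (bool,),
    "array": (list,),
    "object": (dict,),
    "null": (type(None),),
}


def matches_type(value: Any, type_spec: Any) -> bool:
    """True if value matches a type name or any name in a list of type names."""
    names = type_spec if isinstance(type_spec, list) else [type_spec]
    for name in names:
        # bool is a subclass of int in Python, but True is not a JSON number.
        if name in ("number", "integer") and isinstance(value, bool):
            continue
        if isinstance(value, JSON_TYPES.get(name, ())):
            return True
    return False


def check_record(record: dict, schema: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the record passes."""
    problems = []
    properties = schema.get("properties", {})
    for key in schema.get("required", []):
        if key not in record:
            problems.append(f"missing required field: {key}")
    for key, value in record.items():
        if key not in properties:
            problems.append(f"unexpected field: {key}")
            continue
        spec = properties[key]
        if "type" in spec and not matches_type(value, spec["type"]):
            problems.append(f"wrong type for {key}: {value!r}")
        if "enum" in spec and value not in spec["enum"]:
            problems.append(f"value not in enum for {key}: {value!r}")
    return problems
```

A record with "remote" set to None passes when the schema allows ["boolean", "null"], which matches the behavior described above for values the page does not contain.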

If you are already using Zod or Pydantic to define your data models, you can generate the schema directly from those instead of writing it by hand. One schema definition in your codebase, used in both your application logic and your scraping requests.

import { z } from 'zod'
import { zodToJsonSchema } from 'zod-to-json-schema'

const JobSchema = z.object({
  title: z.string(),
  company: z.string(),
  location: z.string().nullable(),
  remote: z.boolean().nullable(),
  salary_min: z.number().nullable(),
  salary_max: z.number().nullable(),
  employment_type: z.enum(['full_time', 'part_time', 'contract']).nullable(),
})

const job = await spidra.scrape.run({
  urls: [{ url: 'https://jobs.example.com/senior-engineer' }],
  prompt: 'Extract the job listing details',
  output: 'json',
  schema: zodToJsonSchema(JobSchema),
})

What actually happens when you make the request

When you call the scrape endpoint, Spidra runs a full browser pipeline before any extraction happens.

The URL loads in a real browser, not a lightweight HTTP client. All the JavaScript runs. Async data calls resolve. The DOM gets to its final state. If the site is behind Cloudflare or another anti-bot system, residential proxy rotation and browser fingerprinting handle it automatically. You do not configure any of that.

Once the page is fully rendered, an AI model reads the content and pulls out exactly what your prompt or schema describes. It understands context the way a person reading the page would, rather than pattern-matching on class names. If you passed a schema, the output gets validated against it before it comes back to you.

The result is the same data you would get if you opened the page yourself, read through it, and typed the fields into a form. Delivered as a response.

Scraping pages that need interaction first

Some content only appears after the user does something on the page. Cookie banners that block everything underneath. Load More buttons. Search forms. Tabs that hide content until clicked. A plain HTTP fetch does not reach any of it.

Pass browser actions inside the URL object, and they run in order inside the browser before the extraction step:

from spidra import BrowserAction, ScrapeParams, ScrapeUrl

job = await spidra.scrape.run(ScrapeParams(
    urls=[
        ScrapeUrl(
            url="https://example.com/search",
            actions=[
                BrowserAction(type="click", value="Accept all cookies"),
                BrowserAction(type="type", selector="input[name='q']", value="wireless headphones"),
                BrowserAction(type="click", value="Search button"),
                BrowserAction(type="wait", duration=1500),
                BrowserAction(type="scroll", to="80%"),
            ],
        ),
    ],
    prompt="Extract all product names and prices from the search results",
    output="json",
))

For the selector field you can use a CSS selector or XPath. If you would rather just describe the element in plain English, use value and Spidra locates it with AI.

The most powerful action is forEach. It finds every matching element on the page and processes each one individually, then combines everything into a single output. Pair it with mode: "navigate" and it follows each element as a link, scrapes the destination page, and comes back. Add automatic pagination and you can scrape an entire category, directory, or search result set across multiple pages in one API call:

BrowserAction(
    type="forEach",
    observe="Find all product listing cards",
    mode="navigate",
    max_items=50,
    item_prompt="Extract product name, price, rating, and full description",
    pagination={
        "nextSelector": "a.next-page",
        "maxPages": 5
    }
)

Few tools handle this natively in a single API call. Normally you would have to build that loop and the pagination logic yourself.

Handling errors properly

Every SDK maps API errors to typed exceptions so you can catch exactly what you care about:

from spidra import (
    SpidraAuthenticationError,
    SpidraInsufficientCreditsError,
    SpidraRateLimitError,
    SpidraServerError,
    SpidraError,
)

try:
    job = await spidra.scrape.run(ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com")],
        prompt="Extract the main headline",
    ))
except SpidraAuthenticationError:
    print("Check your API key")
except SpidraInsufficientCreditsError:
    print("Account is out of credits")
except SpidraRateLimitError:
    print("Slow down, hitting rate limits")
except SpidraServerError as e:
    print(f"Server error {e.status}: {e.message}")
except SpidraError as e:
    print(f"API error {e.status}: {e.message}")

What people actually use this for

  • Price monitoring. Track competitor pricing without maintaining a separate scraper for every site. The AI pulls the right fields even when page layouts change between runs.
  • Lead generation. Pull structured contact and company info from business directories or industry databases. Use the batch endpoint to process up to 50 URLs at the same time.
  • Job board aggregation. Collect listings from individual company career pages and normalize them into one format. Schema enforcement means every record comes back with the same fields regardless of how the source page was structured.
  • Feeding AI pipelines. Structured JSON is cleaner for most LLM and RAG workflows than raw HTML. The extraction step also normalizes inconsistent formatting from different sources, which saves a lot of cleaning work downstream.
  • Market research. Pull pricing, features, and positioning from competitor sites at scale. Spidra's crawl endpoint can discover and process every page on a domain automatically, applying your extraction instructions to each one.
  • Change monitoring. Watch pages for specific updates by comparing structured data snapshots over time. Comparing two JSON objects is a lot simpler than diffing raw HTML.
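
As a rough illustration of the change-monitoring case, comparing two flat JSON snapshots takes only a few lines. The field names below are sample data in the shape of the product example earlier in this article, and diff_snapshots is a hypothetical helper.

```python
def diff_snapshots(old: dict, new: dict) -> dict:
    """Compare two flat snapshots and report added, removed, and changed fields."""
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = {key: old[key] for key in old.keys() - new.keys()}
    changed = {
        key: {"old": old[key], "new": new[key]}
        for key in old.keys() & new.keys()
        if old[key] != new[key]
    }
    return {"added": added, "removed": removed, "changed": changed}


if __name__ == "__main__":
    yesterday = {"name": "Wireless Headphones", "price": "$79.99", "available": True}
    today = {"name": "Wireless Headphones", "price": "$69.99", "available": True, "rating": 4.5}
    print(diff_snapshots(yesterday, today))
```

Here the diff reports the price change and the new rating field and nothing else. Nested objects or lists would need a recursive version of the same idea.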

DIY vs. API: when to use which

| Situation | What to do |
| --- | --- |
| Static page, no bot protection, one-off use | requests.get or fetch is fine |
| JS-rendered page, running locally, occasional use | Playwright or Puppeteer |
| JS-rendered page with anti-bot protection | Spidra |
| You need structured fields, not just HTML | Spidra with a prompt or schema |
| Dozens to thousands of URLs in a pipeline | Spidra batch endpoint |
| Pages that need clicks or form input first | Spidra with actions |
| Paginating through lists automatically | Spidra with forEach and pagination |
| Login-protected pages | Spidra with cookie passthrough |

The DIY path is fine for local scripts, research, or sites you control. As soon as reliability matters, you are hitting protected sites, or you need to run at any real volume, the cost of maintaining browser infrastructure is almost always higher than just using an API.

Getting started

Spidra has a free tier, and no credit card is required to sign up. Install the SDK for your language, grab your API key from app.spidra.io under Settings, and run your first extraction:

from spidra import SpidraClient, ScrapeParams, ScrapeUrl
import os

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
    prompt="Extract the top 10 post titles and their point scores",
    output="json",
))

print(job.result.content)

The free plan gives you 300 credits to start. Paid plans begin at $19 per month.

Full SDK documentation is available at docs.spidra.io/sdks/overview.

Frequently asked questions

Does Spidra execute JavaScript before scraping?

Yes. Every scrape runs inside a full browser. The page renders completely, including JavaScript, dynamic content, and async API calls, before the extraction step runs. You get the same content a real user would see.

Which languages does Spidra have official SDKs for?

Node.js (TypeScript included), Python, Go, PHP, Ruby, Elixir, .NET, Swift, Java, and Rust. All of them are open source at github.com/spidra-io. You can also call the REST API directly from any language if you prefer.

How does Spidra handle sites that block bots?

Anti-bot bypass is built in and runs automatically. Spidra rotates residential proxies across 50 countries, randomizes browser fingerprints, and solves CAPTCHAs without you doing anything. Proxy usage is billed against your bandwidth quota separately, so there is no credit multiplier applied when bypass is needed.

What is the difference between a prompt and a schema?

A prompt lets the AI decide which fields to return based on what it finds relevant on the page. A schema locks the output to an exact shape you define upfront. Required fields always appear in the result, as null if the page does not have that value, and the structure stays consistent across every run. Use prompts when you are exploring. Use schemas in production pipelines where other systems depend on consistent output.

Can I scrape pages behind a login?

Yes. Pass your session cookies in the cookies parameter and Spidra will use them when loading the page. You get access to authenticated content without needing to automate the login flow itself.

What happens if the AI extraction fails?

Spidra falls back to returning the raw page content as Markdown in the markdownContent field, and sets an ai_extraction_failed flag so your code can detect and handle it.
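
Downstream code should branch on that flag rather than assume JSON. Here is a minimal sketch, assuming the result arrives as a plain dict with content, markdownContent, and ai_extraction_failed at the same level. That layout is a guess for illustration, so check the API reference for the real response shape.

```python
def unpack_result(result: dict) -> tuple[str, object]:
    """Return ("json", data) on success or ("markdown", text) when extraction failed.

    Assumes all three keys live at the top level of the result dict.
    """
    if result.get("ai_extraction_failed"):
        return "markdown", result.get("markdownContent", "")
    return "json", result.get("content")
```

Returning a tag alongside the payload lets the caller route Markdown fallbacks to a retry queue or a manual review step instead of the normal pipeline.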

How do I scrape a lot of URLs at once?

Use spidra.batch.run(). You can submit up to 50 URLs in one request and they all run in parallel. Same AI extraction, same proxy options, same schema support as single-URL scraping. Each item in the response has its own status so you can see which ones completed and which ones failed.
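
If your list is longer than the 50-URL limit, you need to split it before submitting. The chunking step can be sketched with the standard library alone. chunk_urls is a hypothetical helper, and each chunk would then go to its own batch request.

```python
from typing import Iterator

MAX_BATCH_SIZE = 50  # per-request URL limit stated above


def chunk_urls(urls: list[str], size: int = MAX_BATCH_SIZE) -> Iterator[list[str]]:
    """Yield consecutive slices of at most `size` URLs, preserving order."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(urls), size):
        yield urls[start:start + size]
```

For example, a list of 120 URLs yields three chunks of 50, 50, and 20.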

Are there limits on the JSON schema?

The schema supports type, properties, required, items, enum, nullable, and description. Maximum nesting depth is 5 levels and maximum schema size is 10 KB. References using $ref, conditional keywords like if/then/else, and string constraint keywords like minLength and pattern are not currently supported.
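
A quick local pre-flight check can catch these limits before a request is sent. The sketch below is an illustration, not Spidra's own validation. It assumes depth is counted as levels of nested properties and items, and that 10 KB means 10 × 1024 bytes of serialized JSON. The helper names are made up.

```python
import json

MAX_DEPTH = 5            # from the limits listed above
MAX_BYTES = 10 * 1024    # 10 KB, assuming 1 KB = 1024 bytes
UNSUPPORTED = {"$ref", "if", "then", "else", "minLength", "pattern"}


def child_schemas(schema: dict) -> list[dict]:
    """Sub-schemas reachable through properties and items."""
    children = list(schema.get("properties", {}).values())
    if isinstance(schema.get("items"), dict):
        children.append(schema["items"])
    return children


def nesting_depth(schema: dict) -> int:
    """Depth of the schema tree, counting the root as level 1 (assumed definition)."""
    return 1 + max((nesting_depth(child) for child in child_schemas(schema)), default=0)


def unsupported_keywords(schema: dict) -> set[str]:
    """Unsupported keywords used anywhere in the schema (property names are ignored)."""
    found = {key for key in schema if key in UNSUPPORTED}
    for child in child_schemas(schema):
        found |= unsupported_keywords(child)
    return found


def preflight(schema: dict) -> list[str]:
    """Return a list of problems; an empty list means the schema looks acceptable."""
    problems = []
    size = len(json.dumps(schema).encode("utf-8"))
    if size > MAX_BYTES:
        problems.append(f"schema is {size} bytes, limit is {MAX_BYTES}")
    depth = nesting_depth(schema)
    if depth > MAX_DEPTH:
        problems.append(f"nesting depth is {depth}, limit is {MAX_DEPTH}")
    for keyword in sorted(unsupported_keywords(schema)):
        problems.append(f"unsupported keyword: {keyword}")
    return problems
```

The check only inspects keywords on schema objects themselves, so a property that happens to be named "pattern" is not flagged.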
