June 22, 2026 · 10 min read

Spidra batch scraping API: how to scrape 50 URLs in parallel

Joel Olawanle

When you need data from more than a handful of pages, scraping one URL at a time stops making sense quickly. You are waiting for each request to complete before starting the next one, and the time adds up fast. If each page takes five seconds to scrape and you have two hundred URLs, that is over sixteen minutes of waiting.

The Spidra batch scraping API processes up to 50 URLs in parallel in a single request. Each URL runs in its own independent worker. You get all the results back together when the batch finishes, with a per-item status so you can see exactly which URLs succeeded and which did not.
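As a rough back-of-the-envelope comparison, the sketch below contrasts the two approaches. It assumes an idealized case in which a full batch finishes in about the time of one page, and it ignores queueing, polling delay, and slow outliers, so the batched figure is best read as a lower bound. The function names are illustrative only.

```python
import math


def sequential_seconds(n_urls: int, seconds_per_page: float) -> float:
    """Total wait when each request starts only after the previous one ends."""
    return n_urls * seconds_per_page


def batched_seconds(n_urls: int, seconds_per_page: float, batch_size: int = 50) -> float:
    """Idealized wait: batches run one after another, URLs inside a batch in parallel."""
    batches = math.ceil(n_urls / batch_size)
    return batches * seconds_per_page


print(f"sequential: {sequential_seconds(200, 5) / 60:.1f} min")
print(f"batched:    {batched_seconds(200, 5):.0f} s")
```

With the numbers from the introduction (200 URLs at five seconds each), that is roughly 16.7 minutes one at a time against about 20 seconds in four batches of 50.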

This guide walks through everything about the batch API: how it works, how to use it in Python and Node.js, how to handle failures, and how to process URL lists larger than 50.

How batch scraping works

The batch endpoint follows the same async job pattern as single scraping. You submit a list of URLs and get a batchId back immediately. The URLs start processing in parallel right away. You then poll the status endpoint every few seconds until the batch is complete and the results are ready.

POST /api/batch/scrape  →  { batchId: "abc123" }
GET  /api/batch/scrape/abc123  →  { status: "active", ... }
GET  /api/batch/scrape/abc123  →  { status: "completed", items: [...] }

The status moves through these stages:

  • queued — the batch has been accepted and workers are being allocated
  • active — URLs are being processed in parallel
  • completed — all URLs finished successfully
  • partial — some URLs succeeded and some failed
  • failed — the batch itself failed (not individual URL failures — see partial)
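The status list above maps naturally onto a small polling helper. The following is a minimal sketch, not part of any SDK: `wait_for_batch` and its parameters are hypothetical names, and it assumes that completed, partial, and failed are the only terminal statuses. The HTTP call is passed in as a function so the loop can be exercised without network access.

```python
import time
from typing import Callable

# Assumption: these three statuses mean the batch will not change any further.
TERMINAL_STATUSES = {"completed", "partial", "failed"}


def wait_for_batch(
    fetch_status: Callable[[], dict],
    interval: float = 3.0,
    timeout: float = 600.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> dict:
    """Call fetch_status() until the batch is terminal or the timeout elapses."""
    deadline = clock() + timeout
    while True:
        body = fetch_status()
        if body.get("status") in TERMINAL_STATUSES:
            return body
        if clock() >= deadline:
            raise TimeoutError(f"batch still {body.get('status')!r} after {timeout} s")
        sleep(interval)


# Demo with canned responses instead of real HTTP calls.
canned = iter([{"status": "queued"}, {"status": "active"}, {"status": "partial"}])
final = wait_for_batch(lambda: next(canned), interval=0)
print(final["status"])
```

In real use, `fetch_status` would wrap the GET request to the status endpoint. The explicit timeout keeps a stuck batch from blocking the caller forever.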

On credits: when you submit a batch, 2 credits per URL are reserved upfront. If a URL fails for any reason, the credits for that item are returned. AI token costs, CAPTCHA solving, and proxy usage are calculated per item once it finishes, the same as a regular scrape.
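To make the reservation arithmetic concrete, here is a small sketch of the base credits only. It deliberately leaves out AI token, CAPTCHA, and proxy costs, which vary per item, and the helper names are illustrative.

```python
CREDITS_PER_URL = 2  # base reservation per URL, as described above


def reserved_credits(n_urls: int) -> int:
    """Credits held upfront when the batch is submitted."""
    return CREDITS_PER_URL * n_urls


def base_credits_after_failures(n_urls: int, n_failed: int) -> int:
    """Base credits left charged once the reservation for failed items is returned."""
    if not 0 <= n_failed <= n_urls:
        raise ValueError("n_failed must be between 0 and n_urls")
    return reserved_credits(n_urls) - CREDITS_PER_URL * n_failed


print(reserved_credits(50))
print(base_credits_after_failures(50, 3))
```

A full batch of 50 URLs therefore reserves 100 credits, and if three items fail, 94 base credits remain charged.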

Your first batch request

The only required field is urls, which takes an array of plain strings. Unlike the single scrape endpoint where URLs are objects, batch URLs are just strings.
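Because the endpoint expects plain strings and caps a request at 50 URLs, a quick client-side check can catch malformed input before any credits are reserved. The sketch below is one possible check. `build_batch_payload` is a hypothetical helper, and the http/https test is an extra assumption rather than a documented API rule.

```python
MAX_BATCH_URLS = 50  # per-request cap described in this guide


def build_batch_payload(urls) -> dict:
    """Return the JSON body for a batch submission, or raise on obviously bad input."""
    if not isinstance(urls, (list, tuple)) or not urls:
        raise ValueError("urls must be a non-empty list")
    if len(urls) > MAX_BATCH_URLS:
        raise ValueError(f"at most {MAX_BATCH_URLS} URLs per batch, got {len(urls)}")
    for url in urls:
        if not isinstance(url, str):
            raise TypeError(f"batch URLs must be plain strings, got {type(url).__name__}")
        if not url.startswith(("http://", "https://")):
            raise ValueError(f"not an http(s) URL: {url!r}")
    return {"urls": list(urls)}


payload = build_batch_payload([
    "https://example.com/product/1",
    "https://example.com/product/2",
])
print(payload)
```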

cURL

# Submit the batch
curl -X POST https://api.spidra.io/api/batch/scrape \
  -H "x-api-key: YOUR_API_KEY" \
  -H "Content-Type: application/json" \
  -d '{
    "urls": [
      "https://example.com/product/1",
      "https://example.com/product/2",
      "https://example.com/product/3"
    ]
  }'

Response:

{
  "status": "queued",
  "batchId": "550e8400-e29b-41d4-a716-446655440000",
  "message": "Batch job queued. Poll /api/batch/scrape/550e8400... for results."
}

Poll until complete:

curl https://api.spidra.io/api/batch/scrape/550e8400-e29b-41d4-a716-446655440000 \
  -H "x-api-key: YOUR_API_KEY"

When the batch finishes:

{
  "status": "completed",
  "batchId": "550e8400-e29b-41d4-a716-446655440000",
  "totalUrls": 3,
  "completedCount": 3,
  "failedCount": 0,
  "items": [
    {
      "url": "https://example.com/product/1",
      "status": "completed",
      "result": {
        "content": "...",
        "data": [{ "url": "...", "markdownContent": "...", "success": true }]
      },
      "error": null
    }
  ]
}

Each item in the items array has its own status, result, and error. A failure on one URL does not affect the others. They all run independently.

Python: raw REST API

import requests
import time

API_KEY = "YOUR_API_KEY"
BASE_URL = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}

urls = [
    "https://shop.example.com/product/wireless-headphones",
    "https://shop.example.com/product/mechanical-keyboard",
    "https://shop.example.com/product/usb-microphone",
]

# Submit the batch
response = requests.post(
    f"{BASE_URL}/batch/scrape",
    headers=HEADERS,
    json={"urls": urls}
)
response.raise_for_status()
batch_id = response.json()["batchId"]
print(f"Batch submitted: {batch_id}")

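# Poll until the batch reaches a terminal status
while True:
    poll = requests.get(
        f"{BASE_URL}/batch/scrape/{batch_id}",
        headers=HEADERS,
        timeout=30,
    )
    poll.raise_for_status()  # stop on HTTP errors rather than on a missing key below
    status = poll.json()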

    print(f"Status: {status['status']} ({status.get('completedCount', 0)}/{status.get('totalUrls', 0)})")

    if status["status"] in ("completed", "partial", "failed"):
        break

    time.sleep(3)

# Process results (.get guards against a response that carries no items list)
for item in status.get("items", []):
    if item["status"] == "completed":
        markdown = item["result"]["data"][0].get("markdownContent", "")
        print(f"\n✓ {item['url']}")
        print(markdown[:300])
    else:
        print(f"\n✗ {item['url']}: {item['error']}")

Python SDK

The Python SDK's batch.run_sync() handles submission and polling for you. Note that batch URLs are plain strings, not ScrapeUrl objects.

from spidra import SpidraClient, BatchScrapeParams
import os

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=[
        "https://shop.example.com/product/wireless-headphones",
        "https://shop.example.com/product/mechanical-keyboard",
        "https://shop.example.com/product/usb-microphone",
    ]
))

print(f"{batch.completed_count}/{batch.total_urls} completed")

for item in batch.items:
    if item.status == "completed":
        print(f"✓ {item.url}")
    else:
        print(f"✗ {item.url} — {item.error}")

Node.js SDK

import { SpidraClient } from 'spidra-js'

// The trailing "!" is a TypeScript non-null assertion; drop it in plain JavaScript.
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

const batch = await spidra.batch.run({
  urls: [
    'https://shop.example.com/product/wireless-headphones',
    'https://shop.example.com/product/mechanical-keyboard',
    'https://shop.example.com/product/usb-microphone',
  ]
})

console.log(`${batch.completedCount}/${batch.totalUrls} completed`)

for (const item of batch.items) {
  if (item.status === 'completed') {
    console.log(`✓ ${item.url}`)
  } else {
    console.error(`✗ ${item.url} — ${item.error}`)
  }
}

AI extraction across a batch

Without a prompt, the batch returns raw Markdown for each URL. Add a prompt and Spidra runs AI extraction on every page in the batch.

batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=[
        "https://shop.example.com/product/wireless-headphones",
        "https://shop.example.com/product/mechanical-keyboard",
        "https://shop.example.com/product/usb-microphone",
    ],
    prompt="Extract the product name, current price, and whether it is in stock",
    output="json",
))

for item in batch.items:
    if item.status == "completed":
        print(item.url, item.result)

Each item returns its own extracted JSON. The AI runs independently on each page.

Enforcing output shape with JSON schema

When you need every item in the batch to come back in exactly the same shape, use a schema. Required fields always appear in the output, as null if the page does not have that value. The structure never varies between items regardless of how different the source pages are.

PRODUCT_SCHEMA = {
    "type": "object",
    "required": ["name", "price", "in_stock"],
    "properties": {
        "name":       {"type": "string"},
        "price":      {"type": ["number", "null"]},
        "currency":   {"type": ["string", "null"]},
        "in_stock":   {"type": ["boolean", "null"]},
        "sku":        {"type": ["string", "null"]},
        "rating":     {"type": ["number", "null"]},
        "review_count": {"type": ["number", "null"]},
    }
}

# product_urls is assumed to be a list of up to 50 plain URL strings defined earlier
batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=product_urls,
    prompt="Extract the product details",
    output="json",
    schema=PRODUCT_SCHEMA,
))

for item in batch.items:
    if item.status == "completed":
        product = item.result  # always matches PRODUCT_SCHEMA
        print(product["name"], product["price"], product["in_stock"])

This is the pattern to use when batch results feed directly into a database or downstream service. No validation step required. No surprises on field names.
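Pipelines that would rather verify the shape than rely on it can still add a cheap key check before writing to a database. This is an optional, defensive sketch. `missing_required` is a hypothetical helper that only looks at the top-level "required" list and is not a full JSON Schema validator.

```python
def missing_required(result: dict, schema: dict) -> list[str]:
    """Return the required top-level keys that are absent from result."""
    return [key for key in schema.get("required", []) if key not in result]


schema = {
    "type": "object",
    "required": ["name", "price", "in_stock"],
    "properties": {
        "name": {"type": "string"},
        "price": {"type": ["number", "null"]},
        "in_stock": {"type": ["boolean", "null"]},
    },
}

# A null value still counts as present; only an absent key is reported.
print(missing_required({"name": "USB microphone", "price": None, "in_stock": None}, schema))
print(missing_required({"name": "USB microphone"}, schema))
```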

Proxy routing and geo-targeting

Pass use_proxy and proxy_country to route every URL in the batch through a residential proxy. Useful when scraping e-commerce sites that show different prices by region, or when the target sites block cloud IP ranges.

# amazon_de_urls is assumed to be a list of up to 50 plain URL strings defined earlier
batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=amazon_de_urls,
    prompt="Extract the product name and price in EUR",
    output="json",
    use_proxy=True,
    proxy_country="de",
))

All proxy options that work in single scraping work in batch: two-letter country codes ("us", "de", "gb"), "eu" for rotating across EU countries, or omit for global rotation.

Authenticated pages

Batch scraping supports session cookies for pages behind a login. Pass the cookies as a raw cookie header string, the same format as single scraping.

batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=[
        "https://app.example.com/invoices/2026-01",
        "https://app.example.com/invoices/2026-02",
        "https://app.example.com/invoices/2026-03",
    ],
    prompt="Extract invoice number, amount, and status",
    output="json",
    cookies="session=abc123; auth_token=xyz789",
))

The cookies are applied to every URL in the batch.
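If the cookies live in a dictionary (for example, copied from a browser session), they first need to be joined into that header-style string. A minimal sketch follows. `cookie_header` is an illustrative helper, and it assumes names and values contain no semicolons or other characters that would need escaping.

```python
def cookie_header(cookies: dict[str, str]) -> str:
    """Join name/value pairs into a 'name=value; name=value' cookie string."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


print(cookie_header({"session": "abc123", "auth_token": "xyz789"}))
```

The result can be passed as the cookies value in the batch parameters.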

Screenshots

Enable screenshot or full_page_screenshot to capture every page in the batch. Screenshot URLs are returned in each item's result.

# competitor_pages is assumed to be a list of up to 50 plain URL strings defined earlier
batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=competitor_pages,
    screenshot=True,
    full_page_screenshot=True,
))

for item in batch.items:
    if item.status == "completed":
        print(item.url, item.result.get("screenshots"))

Handling failures and retrying

Real-world batches have failures. A site times out. A page throws a 403. A CAPTCHA does not resolve in time. The batch API handles this gracefully: a failed item does not affect the others, and you can retry just the failures without resubmitting the whole batch.

Checking what failed

batch = spidra.batch.run_sync(BatchScrapeParams(urls=urls))

succeeded = [i for i in batch.items if i.status == "completed"]
failed = [i for i in batch.items if i.status != "completed"]

print(f"{len(succeeded)} succeeded, {len(failed)} failed")
for item in failed:
    print(f"  ✗ {item.url}: {item.error}")
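When a batch is large, it can help to group failures by error message before deciding whether a retry is worthwhile. The sketch below works on items in the REST response shape (dictionaries with url, status, and error). It assumes the error field is a string on failure, and the sample error messages are made up for illustration.

```python
from collections import Counter


def failure_summary(items: list[dict]) -> Counter:
    """Count non-completed items per error message."""
    return Counter(
        str(item.get("error") or "unknown error")
        for item in items
        if item.get("status") != "completed"
    )


# Hypothetical items; real error strings will differ.
sample_items = [
    {"url": "https://example.com/a", "status": "completed", "error": None},
    {"url": "https://example.com/b", "status": "failed", "error": "timeout"},
    {"url": "https://example.com/c", "status": "failed", "error": "timeout"},
    {"url": "https://example.com/d", "status": "failed", "error": None},
]

for message, count in failure_summary(sample_items).most_common():
    print(f"{count} x {message}")
```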

Retrying failed items

If some items fail, the retry call (retry_sync() in the Python SDK, retry() in Node.js) re-queues only those items without touching the ones that already succeeded. You do not pay again for the successful ones.

if batch.failed_count > 0:
    print(f"Retrying {batch.failed_count} failed items...")
    spidra.batch.retry_sync(batch.batch_id)

In Node.js:

if (batch.failedCount > 0) {
  await spidra.batch.retry(batch.batchId)
}

Cancelling a batch

If you need to stop a running batch, cancelling it returns credits for any items that have not started processing yet.

result = spidra.batch.cancel_sync(batch_id)
print(f"Cancelled {result.cancelled_items} items, refunded {result.credits_refunded} credits")

Processing URL lists larger than 50

The batch endpoint caps at 50 URLs per request. For larger lists, chunk them and process sequentially.

import os, json
from spidra import SpidraClient, BatchScrapeParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

def batch_scrape_all(
    urls: list[str],
    prompt: str,
    schema: dict | None = None,
    chunk_size: int = 50,
) -> list[dict]:
    results = []
    total_batches = -(-len(urls) // chunk_size)  # ceiling division

    for i in range(0, len(urls), chunk_size):
        chunk = urls[i:i + chunk_size]
        batch_num = i // chunk_size + 1
        print(f"Batch {batch_num}/{total_batches}: {len(chunk)} URLs")

        params = BatchScrapeParams(
            urls=chunk,
            prompt=prompt,
            output="json",
        )
        if schema:
            params.schema = schema

        batch = spidra.batch.run_sync(params)

        for item in batch.items:
            if item.status == "completed":
                results.append({"url": item.url, "data": item.result})
            else:
                print(f"  ✗ Failed: {item.url}")

        print(f"  {batch.completed_count}/{batch.total_urls} succeeded")

    return results

# Use it
urls = [f"https://example.com/product/{i}" for i in range(1, 201)]

results = batch_scrape_all(
    urls=urls,
    prompt="Extract product name, price, and availability",
    schema={
        "type": "object",
        "required": ["name", "price"],
        "properties": {
            "name":     {"type": "string"},
            "price":    {"type": ["number", "null"]},
            "available": {"type": ["boolean", "null"]},
        }
    }
)

# Save as JSONL
with open("products.jsonl", "w") as f:
    for record in results:
        f.write(json.dumps(record) + "\n")

print(f"\nSaved {len(results)} products to products.jsonl")
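Since every submitted URL reserves credits, it may be worth dropping blank lines and duplicates before chunking. The generator below is one way to do that. `unique_chunks` is a hypothetical helper, and it treats two URLs as duplicates only when the strings match exactly after trimming whitespace.

```python
from typing import Iterable, Iterator


def unique_chunks(urls: Iterable[str], size: int = 50) -> Iterator[list[str]]:
    """Yield lists of at most `size` URLs, blanks and exact duplicates removed."""
    if size < 1:
        raise ValueError("size must be at least 1")
    # dict.fromkeys keeps the first occurrence of each URL and preserves order.
    unique = list(dict.fromkeys(u.strip() for u in urls if u.strip()))
    for start in range(0, len(unique), size):
        yield unique[start:start + size]


demo_urls = [f"https://example.com/product/{i}" for i in range(1, 121)]
demo_urls += ["https://example.com/product/1", "   "]  # one duplicate, one blank

for number, chunk in enumerate(unique_chunks(demo_urls), start=1):
    print(f"chunk {number}: {len(chunk)} URLs")
```

Each yielded chunk can be passed as the urls value of one batch request.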

Listing batch history

from spidra import BatchListParams

page = spidra.batch.list_sync(BatchListParams(page=1, limit=20))

for job in page.jobs:
    print(f"{job.uuid}  {job.status}  {job.completed_count}/{job.total_urls}")

Via the REST API directly:

curl https://api.spidra.io/api/batch/scrape \
  -H "x-api-key: YOUR_API_KEY"

A complete real-world example: e-commerce price monitoring

This pulls current prices from a list of competitor product pages, compares them to the previous run, and reports what changed.

import os, json
from pathlib import Path
from spidra import SpidraClient, BatchScrapeParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

PRICE_SCHEMA = {
    "type": "object",
    "required": ["name", "price"],
    "properties": {
        "name":     {"type": "string"},
        "price":    {"type": ["number", "null"]},
        "currency": {"type": ["string", "null"]},
        "in_stock": {"type": ["boolean", "null"]},
    }
}

def load_previous(path="data/prices.json") -> dict:
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}

def save_current(data: dict, path="data/prices.json"):
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(json.dumps(data, indent=2))

def check_prices(urls: list[str]) -> dict:
    results = {}

    for i in range(0, len(urls), 50):
        chunk = urls[i:i + 50]
        batch = spidra.batch.run_sync(BatchScrapeParams(
            urls=chunk,
            prompt="Extract the product name, current price as a number, currency code, and whether it is in stock",
            output="json",
            schema=PRICE_SCHEMA,
            use_proxy=True,
        ))

        for item in batch.items:
            if item.status == "completed" and item.result:
                results[item.url] = item.result

    return results

def find_changes(previous: dict, current: dict, threshold_pct=3.0) -> list[dict]:
    changes = []

    for url, data in current.items():
        if not data or data.get("price") is None:
            continue

        prev = previous.get(url, {})
        prev_price = prev.get("price")
        curr_price = data["price"]

        if prev_price is None or prev_price == 0:
            continue

        change_pct = ((curr_price - prev_price) / prev_price) * 100

        if abs(change_pct) >= threshold_pct:
            changes.append({
                "url":        url,
                "name":       data.get("name") or "",
                "prev_price": prev_price,
                "curr_price": curr_price,
                "change_pct": round(change_pct, 2),
                # The schema allows a null currency, so coerce None to "" for printing
                "currency":   data.get("currency") or "",
                "direction":  "up" if change_pct > 0 else "down",
            })

    return sorted(changes, key=lambda x: abs(x["change_pct"]), reverse=True)

# Load URLs from config
urls = Path("config/competitor_urls.txt").read_text().splitlines()
urls = [u.strip() for u in urls if u.strip()]

print(f"Checking {len(urls)} URLs...")
previous = load_previous()
current = check_prices(urls)
save_current(current)

changes = find_changes(previous, current)

if changes:
    print(f"\n{len(changes)} significant price changes:")
    for c in changes:
        direction = "↑" if c["direction"] == "up" else "↓"
        print(
            f"  {direction} {c['name']}: "
            f"{c['currency']}{c['prev_price']:.2f} → "
            f"{c['currency']}{c['curr_price']:.2f} "
            f"({c['change_pct']:+.1f}%)"
        )
else:
    print("\nNo significant price changes.")

print(f"\nDone. {len(current)} of {len(urls)} URLs returned data.")

Batch API reference

| Endpoint | Method | What it does |
|---|---|---|
| /api/batch/scrape | POST | Submit a batch job. Returns batchId. |
| /api/batch/scrape/{batchId} | GET | Poll for status and per-item results. |
| /api/batch/scrape | GET | List all your past batch jobs. |
| /api/batch/scrape/{batchId} | DELETE | Cancel a running batch. Credits refunded for unprocessed items. |
| /api/batch/scrape/{batchId}/retry | POST | Re-queue only failed items without resubmitting succeeded ones. |

All batch parameters

| Parameter | Type | Description |
|---|---|---|
| urls | array of strings | Up to 50 URLs to process. Plain strings, not URL objects. |
| prompt | string | What to extract from each page, in plain English |
| output | string | "markdown" (default) or "json" |
| schema | object | JSON Schema for a guaranteed output shape on every item |
| use_proxy | boolean | Route through residential proxies |
| proxy_country | string | Country code, "eu", or "global" |
| cookies | string | Session cookies for authenticated pages |
| screenshot | boolean | Viewport screenshot of each page |
| full_page_screenshot | boolean | Full-page screenshot of each page |
| extract_content_only | boolean | Strip nav, footer, and ads before extraction |

Frequently asked questions

How many URLs can I submit in one batch request?

Up to 50. For larger lists, chunk them into groups of 50 and process sequentially as shown in the examples above.
