Web scraping with Python has a well-worn path. You start with requests and BeautifulSoup for simple static pages. Then you hit a JavaScript-rendered site and reach for Playwright. Then you hit Cloudflare and spend two hours debugging stealth plugins. Then your selectors break because the site redesigned.
Spidra's Python SDK cuts across that whole progression. You install one package, describe what you want in plain English, and get back structured data from any website. The browser rendering, anti-bot bypass, CAPTCHA solving, and AI extraction all happen on Spidra's infrastructure. You get clean results back.
This tutorial walks through the entire Python SDK from installation to crawling a full website. All code examples come directly from the SDK and will work as written.
Prerequisites
- Python 3.9 or higher
- A Spidra API key (get one free at app.spidra.io under Settings → API Keys)
Installation
```shell
pip install spidra
```

Once installed, store your API key as an environment variable. Never hardcode it in your scripts.

```shell
export SPIDRA_API_KEY="spd_YOUR_API_KEY"
```

Setting up the client
Everything in the SDK flows through a single SpidraClient instance. You initialise it once and then access all functionality through its namespaced attributes.
```python
from spidra import SpidraClient

spidra = SpidraClient(api_key="spd_YOUR_API_KEY")
```

In practice, pull the key from your environment:

```python
import os
from spidra import SpidraClient

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])
```

The client exposes five namespaces:
| Namespace | What it does |
|---|---|
| `spidra.scrape` | Scrape one to three URLs with browser automation and AI extraction |
| `spidra.batch` | Process up to 50 URLs in parallel |
| `spidra.crawl` | Discover and scrape pages across an entire site |
| `spidra.logs` | Access the history of every scrape your API key has made |
| `spidra.usage` | Check credit and request consumption |
Async by default, sync anywhere
The SDK is async-first. Every method is an async function that you await inside an async context.
```python
import asyncio
import os

from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])


async def main():
    job = await spidra.scrape.run(ScrapeParams(
        urls=[ScrapeUrl(url="https://news.ycombinator.com")],
        prompt="Extract the top 5 post titles and their point scores",
        output="json",
    ))
    print(job.result.content)


asyncio.run(main())
```

If you are working in a regular script, a Django view, a Flask route, or a Jupyter notebook, use the `_sync` counterpart instead. It handles the event loop automatically, including environments like Jupyter where calling `asyncio.run()` directly would fail.
```python
import os

from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

# Works anywhere without async/await
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
    prompt="Extract the top 5 post titles and their point scores",
    output="json",
))
print(job.result.content)
```

Every method in the SDK has both versions. The rest of this tutorial uses `_sync` in the examples for simplicity, but the async versions work identically: drop the `_sync` suffix and add `await`.
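One reason to reach for the async versions is concurrency: several awaitable calls can be in flight at once. The sketch below shows the general pattern with `asyncio.gather`. To keep it runnable without an API key, it uses `fake_scrape`, a hypothetical stand-in coroutine that is not part of the SDK; it assumes the real `spidra.scrape.run(...)` calls could be awaited in the same position.

```python
import asyncio


async def fake_scrape(url: str) -> dict:
    # Hypothetical stand-in for `await spidra.scrape.run(...)`.
    # It only simulates a short network delay.
    await asyncio.sleep(0.01)
    return {"url": url, "content": f"content of {url}"}


async def scrape_many(urls: list[str]) -> list[dict]:
    # gather() runs the coroutines concurrently and returns
    # their results in the same order as the inputs.
    return await asyncio.gather(*(fake_scrape(u) for u in urls))


results = asyncio.run(scrape_many([
    "https://a.example",
    "https://b.example",
]))
for r in results:
    print(r["url"], len(r["content"]))
```

Because `gather()` preserves input order, each result can be matched back to its URL by position.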
Part 1: Scraping a page
The scrape namespace handles single-page scraping. You can pass up to three URLs per request and they run in parallel.
Your first scrape
```python
import os

from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
))
print(job.result.content)
```

Without a prompt, Spidra returns the raw page content as Markdown. The page loads in a real browser, JavaScript executes, and the full rendered content is converted to clean Markdown. That is what ends up in `job.result.content`.
How the job lifecycle works
When you call run_sync(), the SDK submits the job, then polls in the background every 3 seconds until it is done. From your side it looks synchronous. Under the hood, the job moves through these states:
waiting → active → completed (or failed)

`waiting` means the job is queued. `active` means the browser is running. `completed` means the result is ready. `failed` means something went wrong.
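To make the lifecycle concrete, here is a minimal sketch of a polling loop over those four states. It is not the SDK's implementation. `wait_for_job` and its `get_status` callable are hypothetical names, and the 120-second default timeout matches the figure given later in this tutorial.

```python
import time
from typing import Callable

# States after which polling should stop
TERMINAL_STATES = {"completed", "failed"}


def wait_for_job(
    get_status: Callable[[], str],
    poll_interval: float = 3.0,
    timeout: float = 120.0,
) -> str:
    """Call get_status() until it reports a terminal state or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        status = get_status()
        if status in TERMINAL_STATES:
            return status
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still '{status}' after {timeout} seconds")
        time.sleep(poll_interval)


# Demo with a scripted sequence of states instead of a real API call
states = iter(["waiting", "active", "active", "completed"])
final = wait_for_job(lambda: next(states), poll_interval=0.01, timeout=1.0)
print(final)
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock changes while the loop is running.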
If you want to submit a job and check on it later rather than waiting for it to finish, use submit() and get() separately:
```python
import os
import time

from spidra import SpidraClient, ScrapeParams, ScrapeUrl

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

# Submit and get a job ID immediately
queued = spidra.scrape.submit_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    prompt="Extract the main headline",
))
print(f"Job submitted: {queued.job_id}")

# Come back later and check
time.sleep(5)
status = spidra.scrape.get_sync(queued.job_id)
if status.status == "completed":
    print(status.result.content)
elif status.status == "failed":
    print(f"Failed: {status.error}")
```

Part 2: Extracting data with prompts
The prompt field is what makes Spidra different from a plain headless browser scraper. Instead of writing CSS selectors to find elements, you describe what you want in plain English and the AI figures out where it is on the page.
```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://news.ycombinator.com")],
    prompt="Extract the top 10 post titles and their point scores",
    output="json",
))
print(job.result.content)
# [{"title": "Show HN: I built a thing", "points": 342}, ...]
```

Setting `output="json"` tells the AI to return structured JSON rather than formatted text. The default is `"markdown"`.
The AI reads the rendered page the way a person would. It knows a number next to a currency symbol is a price, that a short bold line at the top of a product page is probably the title, and that a longer block of text is probably a description. You do not need to know the class names or DOM structure of the page.
That said, Spidra also fully supports CSS selectors and XPath for browser actions if you prefer to be explicit about where to find things. We will cover that in the browser actions section.
Part 3: Enforcing output shape with JSON schema
Plain prompts are flexible but not predictable. The AI decides what fields to return and what to name them. That works for exploration but it is a problem in production where a database or downstream service expects a specific shape every time.
The schema field solves this. Pass a JSON Schema object and the AI must return data matching it exactly. Fields marked as required always appear in the output. If the page does not have a value for a required field, it comes back as None rather than being silently omitted.
```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://jobs.example.com/senior-engineer")],
    prompt="Extract the job listing details. Normalize salary to a USD number.",
    output="json",
    schema={
        "type": "object",
        "required": ["title", "company", "remote"],
        "properties": {
            "title": {"type": "string"},
            "company": {"type": "string"},
            "remote": {"type": ["boolean", "null"]},
            "salary_min": {"type": ["number", "null"]},
            "salary_max": {"type": ["number", "null"]},
            "employment_type": {
                "type": ["string", "null"],
                "enum": ["full_time", "part_time", "contract", None],
            },
            "skills": {"type": "array", "items": {"type": "string"}},
        },
    },
))
print(job.result.content)
# {
#     "title": "Senior Software Engineer",
#     "company": "Acme Corp",
#     "remote": True,
#     "salary_min": 120000,
#     "salary_max": 160000,
#     "employment_type": "full_time",
#     "skills": ["Python", "PostgreSQL", "AWS"]
# }
```

When you provide a schema, `output` is automatically set to `"json"`. You do not need to set it yourself.
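Before writing extracted records to a database, it can be worth double-checking the `required` keys on your side as well. The sketch below is a stdlib-only illustration; `missing_required` is a hypothetical helper, not an SDK function. It only checks that the keys are present, so a value of `None` still counts, and it does not validate types the way a full JSON Schema validator would.

```python
def missing_required(record: dict, schema: dict) -> list[str]:
    """Return the keys listed under schema["required"] that are absent from record."""
    return [key for key in schema.get("required", []) if key not in record]


schema = {
    "type": "object",
    "required": ["title", "company", "remote"],
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "remote": {"type": ["boolean", "null"]},
    },
}

# A required field with no value on the page arrives as None, which is still present
complete = {"title": "Senior Software Engineer", "company": "Acme Corp", "remote": None}
partial = {"title": "Senior Software Engineer"}

print(missing_required(complete, schema))
print(missing_required(partial, schema))
```

An empty list means the record has every required key and is safe to pass downstream.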
If you use Pydantic for data validation in your application, you can generate the schema from your existing models rather than writing it by hand:
```python
from typing import Optional

from pydantic import BaseModel


class JobListing(BaseModel):
    title: str
    company: str
    remote: Optional[bool] = None
    salary_min: Optional[float] = None
    salary_max: Optional[float] = None


job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://jobs.example.com/senior-engineer")],
    prompt="Extract the job listing details",
    schema=JobListing.model_json_schema(),
))
```

One schema definition in your codebase works in both your application logic and your scraping requests.
Part 4: Browser actions
Some pages require interaction before the content you want is visible. A cookie banner blocking everything. A search form that needs filling. Lazy-loaded content that only appears after scrolling. Tabs that hide data until clicked.
The actions list inside each ScrapeUrl lets you interact with the page before extraction runs. Actions execute in order inside the browser.
```python
from spidra import BrowserAction

job = spidra.scrape.run_sync(ScrapeParams(
    urls=[
        ScrapeUrl(
            url="https://example.com/products",
            actions=[
                BrowserAction(type="click", selector="#accept-cookies"),
                BrowserAction(type="wait", duration=1000),
                BrowserAction(type="scroll", to="80%"),
            ],
        ),
    ],
    prompt="Extract all product names and prices visible on the page",
))
```

For `click`, `check`, and `uncheck` actions, you have two options for targeting elements:
- `selector` for a CSS selector or XPath expression like `"#accept-cookies"` or `".submit-btn"`
- `value` for a plain English description like `"Accept cookies button"`, and Spidra locates the element using AI

Both are valid and you can mix them in the same actions list:

```python
actions=[
    BrowserAction(type="click", selector="#accept-cookies"),  # CSS selector
    BrowserAction(type="click", value="Search button"),       # plain English
]
```

Use whichever is more convenient for the page you are working with.
All available actions
| Action | What it does | Key fields |
|---|---|---|
| `click` | Clicks a button, link, or any element | `selector` or `value` |
| `type` | Types text into an input field | `selector`, `value` |
| `check` | Checks a checkbox | `selector` or `value` |
| `uncheck` | Unchecks a checkbox | `selector` or `value` |
| `wait` | Pauses for a number of milliseconds | `duration` |
| `scroll` | Scrolls to a percentage of the page height | `to` (e.g. `"80%"`) |
| `forEach` | Finds matching elements and processes each one | `value`, `mode` |
The forEach action
forEach is the most powerful action in the SDK. It finds a set of matching elements on the page and processes each one individually, then combines all the results into a single output.
It works in three modes:
- `inline` reads the content of each matched element directly. Use this for product cards, table rows, or any content that lives inside the element.
- `navigate` follows each element as a link, loads the destination page, and scrapes it. Use this when the data you want is on detail pages you need to click into.
- `click` clicks each element to expand or reveal content, then scrapes what appears. Use this for accordions, modals, or expandable sections.
```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[
        ScrapeUrl(
            url="https://directory.example.com/companies",
            actions=[
                BrowserAction(type="click", value="Accept cookies"),
                BrowserAction(
                    type="forEach",
                    value="Find all company listing cards",
                    mode="navigate",
                    max_items=20,
                    item_prompt="Extract company name, website, and industry",
                    pagination={
                        "nextSelector": "a.next-page",
                        "maxPages": 3,
                    },
                ),
            ],
        ),
    ],
    output="json",
))
```

This dismisses the cookie banner, finds every company card on the page, navigates into each company's profile page, extracts the company details, and repeats across three pages of pagination, all in a single request.
Part 5: Proxy and geo-targeting
Some sites block requests from cloud infrastructure IP ranges. Others show different content depending on where you are browsing from. Setting use_proxy=True routes the request through a residential proxy.
```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://www.amazon.de/gp/bestsellers")],
    prompt="List the top 10 products with name and price",
    use_proxy=True,
    proxy_country="de",
))
```

`proxy_country` accepts:
- A two-letter ISO country code like `"us"`, `"de"`, `"gb"`, `"fr"`, `"jp"`
- `"eu"` to rotate randomly across all 27 EU member states
- `"global"` or omit it for no country preference
Proxy usage is billed from your bandwidth quota, not your credits. There is no credit multiplier for enabling proxy routing.
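If the country value comes from user input or a config file, a small pre-flight check can catch typos before a request is sent. The sketch below is a hypothetical helper, not part of the SDK. It assumes lowercase values as in the examples here, and it only checks the shape of the value, not whether Spidra actually supports that country.

```python
import re


def normalize_proxy_country(value: str) -> str:
    """Lowercase and shape-check a proxy_country value."""
    code = value.strip().lower()
    # "global" is the one accepted value that is not two letters long;
    # "eu" already matches the two-letter pattern.
    if code == "global" or re.fullmatch(r"[a-z]{2}", code):
        return code
    raise ValueError(f"unexpected proxy_country value: {value!r}")


print(normalize_proxy_country(" DE "))
print(normalize_proxy_country("global"))
```

A value such as `"germany"` raises `ValueError`, which is usually easier to debug than a rejected API request.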
Part 6: Scraping pages behind a login
To access content that requires authentication, pass your session cookies as a raw cookie header string. Log in through your browser, open DevTools, copy the Cookie header from any authenticated request, and pass it here.
```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://app.example.com/dashboard")],
    prompt="Extract the monthly revenue and active user count",
    cookies="session=abc123; auth_token=xyz789",
))
```

Both standard cookie format (`name=value; name2=value2`) and Chrome DevTools paste format work. Cookies are passed ephemerally to the browser worker and never stored by Spidra.
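If your cookies live in a dictionary, for example after being exported from another tool, they can be joined into the standard `name=value; name2=value2` form with one line. `build_cookie_header` below is a hypothetical helper for illustration, not an SDK function, and it assumes the values need no extra escaping.

```python
def build_cookie_header(cookies: dict[str, str]) -> str:
    """Join cookie name/value pairs into a single Cookie header string."""
    return "; ".join(f"{name}={value}" for name, value in cookies.items())


header = build_cookie_header({"session": "abc123", "auth_token": "xyz789"})
print(header)
```

The resulting string can be passed as the `cookies` argument in place of a hand-copied header.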
Part 7: Stripping boilerplate with extract_content_only
By default Spidra returns the full page content including navigation, headers, footers, and sidebars. If you only want the main content, turn on extract_content_only. It strips the noise before the AI sees the page, which reduces token usage and keeps the result focused.
```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://blog.example.com/long-article")],
    prompt="Summarize this article in three sentences",
    extract_content_only=True,
))
```

This is particularly useful for article pages, documentation, and any page where the main content is surrounded by heavy navigation.
Part 8: Screenshots
Capture screenshots of scraped pages for debugging, monitoring, or archival.
```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    screenshot=True,
    full_page_screenshot=True,
))

# Screenshot URLs are in the result
print(job.result.screenshots)  # list of URLs
```

`screenshot=True` captures the visible viewport. `full_page_screenshot=True` captures the entire scrollable page.
Part 9: Controlling polling behaviour
By default run_sync() polls every 3 seconds and gives up after 120 seconds. For complex pages or large crawls that take longer, pass a PollOptions object to override both.
```python
from spidra import PollOptions

job = spidra.scrape.run_sync(
    ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com")],
        prompt="Extract all content from this page",
    ),
    PollOptions(poll_interval=5, timeout=180),
)
```

`PollOptions` works on `batch.run_sync()` and `crawl.run_sync()` too.
Part 10: Batch scraping
When you have a list of URLs to process, the batch endpoint handles up to 50 at a time in parallel. Each URL runs in its own independent worker.
Note that batch URLs are plain strings, not ScrapeUrl objects. Per-URL browser actions are not supported in batch mode.
```python
import os

from spidra import SpidraClient, BatchScrapeParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=[
        "https://shop.example.com/product/1",
        "https://shop.example.com/product/2",
        "https://shop.example.com/product/3",
    ],
    prompt="Extract product name, price, and whether it is in stock",
    output="json",
))

print(f"{batch.completed_count}/{batch.total_urls} completed")
for item in batch.items:
    if item.status == "completed":
        print(item.url, item.result)
    else:
        print(f"Failed: {item.url}: {item.error}")
```

Batch with schema
The same schema enforcement that works in single scraping works in batch. Every item returns data matching the same shape:
```python
# `urls` is a list of URL strings, as in the previous example
batch = spidra.batch.run_sync(BatchScrapeParams(
    urls=urls,
    prompt="Extract the product details",
    schema={
        "type": "object",
        "required": ["name", "price"],
        "properties": {
            "name": {"type": "string"},
            "price": {"type": ["number", "null"]},
            "currency": {"type": ["string", "null"]},
            "available": {"type": ["boolean", "null"]},
        },
    },
))
```

Managing batches
Once a batch is running, you have a few additional operations available:
Retrying failures. If some items fail due to transient errors, retry just those without re-running the ones that already succeeded:
```python
# `queued` is the object you got back when the batch was submitted;
# its batch_id identifies the batch to retry
if batch.failed_count > 0:
    spidra.batch.retry_sync(queued.batch_id)
```

Cancelling a batch. Stop a running batch and get credits refunded for anything that has not started yet:

```python
# `batch_id` is the ID of a batch that is still running
response = spidra.batch.cancel_sync(batch_id)
print(f"Cancelled {response.cancelled_items} items, refunded {response.credits_refunded} credits")
```

Listing past batches:

```python
from spidra import BatchListParams

page = spidra.batch.list_sync(BatchListParams(page=1, limit=20))
for job in page.jobs:
    print(job.uuid, job.status, f"{job.completed_count}/{job.total_urls}")
```

Processing large URL lists
The batch endpoint caps at 50 URLs per request. For larger lists, chunk them and process in batches:
```python
import json
import os

from spidra import SpidraClient, BatchScrapeParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])


def scrape_url_list(urls: list[str], prompt: str, batch_size: int = 50) -> list:
    all_results = []
    # -(-a // b) is ceiling division: the total number of batches
    total_batches = -(-len(urls) // batch_size)

    for i in range(0, len(urls), batch_size):
        chunk = urls[i:i + batch_size]
        print(f"Processing batch {i // batch_size + 1} of {total_batches}...")

        batch = spidra.batch.run_sync(BatchScrapeParams(
            urls=chunk,
            prompt=prompt,
            output="json",
        ))

        for item in batch.items:
            if item.status == "completed":
                all_results.append({
                    "url": item.url,
                    "data": item.result,
                })
            else:
                print(f"  Failed: {item.url}")

    return all_results


urls = [f"https://example.com/product/{i}" for i in range(1, 201)]
results = scrape_url_list(urls, "Extract product name and price")

# One JSON object per line
with open("results.jsonl", "w") as f:
    for record in results:
        f.write(json.dumps(record) + "\n")

print(f"Saved {len(results)} results")
```

Part 11: Crawling entire websites
Batch scraping works when you already have a list of URLs. Crawling is for when you want Spidra to discover pages for you.
You give it a starting URL, describe which pages to follow, and describe what to extract from each one. Spidra loads the base URL, finds links matching your crawl instruction, visits each one, and applies your transform instruction to every page it visits.
```python
import os

from spidra import SpidraClient, CrawlParams, PollOptions

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

job = spidra.crawl.run_sync(
    CrawlParams(
        base_url="https://competitor.com/blog",
        crawl_instruction="Follow links to blog posts only. Skip tag pages, category pages, and the homepage.",
        transform_instruction="Extract the post title, author name, publish date, and a one-sentence summary.",
        max_pages=20,  # the documented maximum
        use_proxy=True,
    ),
    PollOptions(timeout=360),
)

for page in job.result:
    print(page.url, page.data)
```

Three fields are required: `base_url`, `crawl_instruction`, and `transform_instruction`.
crawl_instruction tells the crawler which links to follow. transform_instruction tells the AI what to extract from each page it visits. max_pages defaults to 5 and goes up to 20. Pass a higher timeout in PollOptions for larger crawls since the default 120 seconds may not be enough.
The same use_proxy, proxy_country, and cookies options from single scraping all work here too.
Downloading the raw content
Once a crawl completes, you can fetch the raw HTML and Markdown for every page that was crawled. The URLs are signed and expire after an hour.
```python
# `job_id` is the ID of a completed crawl job
response = spidra.crawl.pages_sync(job_id)
for page in response.pages:
    print(page.url, page.status)
    # page.html_url: download the raw HTML
    # page.markdown_url: download the cleaned Markdown
```

Re-extracting with a different prompt
If you crawled a site and later want to pull out different information, you do not have to re-crawl. extract() runs a new AI pass over the already-crawled content and only charges transformation credits.
```python
# `completed_job_id` is the ID of a crawl that has already finished
queued = spidra.crawl.extract_sync(
    completed_job_id,
    "Extract only product SKUs and prices as structured JSON",
)

# This creates a new job; check it like any other
result = spidra.crawl.get_sync(queued.job_id)
```

Browsing crawl history

```python
from spidra import CrawlHistoryParams

response = spidra.crawl.history_sync(CrawlHistoryParams(page=1, limit=10))
print(f"Total crawl jobs: {response.total}")

stats = spidra.crawl.stats_sync()
print(f"All-time crawls: {stats.total}")
```

Part 12: Logs and usage
Browsing your scrape logs
Every request your API key makes is logged automatically. You can filter by status, URL, date range, and more.
```python
from spidra import ScrapeLogsParams

response = spidra.logs.list_sync(ScrapeLogsParams(
    status="failed",
    search_term="amazon.com",
    date_start="2025-01-01",
    date_end="2025-12-31",
    page=1,
    limit=20,
))

for log in response.logs:
    print(log.urls[0].get("url"), log.status, log.credits_used)
```

To get full details of a single log entry including the extraction output:

```python
# `log_uuid` is the identifier of one entry from the log list
log = spidra.logs.get_sync(log_uuid)
print(log.result_data)
```

Checking usage
Track your credit and request consumption over time:
```python
rows = spidra.usage.get_sync("30d")  # "7d" | "30d" | "weekly"
for row in rows:
    print(row.date, row.requests, row.credits)
```

`"7d"` gives one row per day for the last week. `"30d"` gives the last 30 days. `"weekly"` gives one row per week for the last seven weeks.
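Once you have the rows, totals and averages are a short aggregation away. The sketch below uses a stand-in `UsageRow` dataclass so it runs without the SDK. The class, the numeric field types, and the `summarize` helper are assumptions for illustration; only the attribute names `date`, `requests`, and `credits` come from the example above.

```python
from dataclasses import dataclass


@dataclass
class UsageRow:
    # Stand-in for the rows returned by the usage call (assumed shape)
    date: str
    requests: int
    credits: int


def summarize(rows: list[UsageRow]) -> dict:
    """Total the requests and credits and compute credits per request."""
    total_requests = sum(r.requests for r in rows)
    total_credits = sum(r.credits for r in rows)
    per_request = total_credits / total_requests if total_requests else 0.0
    return {
        "requests": total_requests,
        "credits": total_credits,
        "credits_per_request": per_request,
    }


rows = [UsageRow("2025-01-01", 10, 25), UsageRow("2025-01-02", 30, 55)]
summary = summarize(rows)
print(summary)
```

The guard on `total_requests` avoids a division by zero for periods with no activity.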
Part 13: Error handling
Every API error maps to a typed exception class. Catch exactly what you care about and let everything else bubble up.
```python
from spidra import (
    SpidraError,
    SpidraAuthenticationError,
    SpidraInsufficientCreditsError,
    SpidraRateLimitError,
    SpidraServerError,
)

try:
    job = spidra.scrape.run_sync(ScrapeParams(
        urls=[ScrapeUrl(url="https://example.com")],
        prompt="Extract the main headline",
    ))
    print(job.result.content)
except SpidraAuthenticationError:
    print("API key is missing or invalid. Check your SPIDRA_API_KEY.")
except SpidraInsufficientCreditsError:
    print("Account is out of credits. Top up at app.spidra.io.")
except SpidraRateLimitError:
    print("Rate limit hit. Wait before retrying.")
except SpidraServerError as e:
    print(f"Server error ({e.status}): {e.message}. Retry is usually safe.")
except SpidraError as e:
    # Base class last, so the more specific handlers above get first pick
    print(f"API error {e.status}: {e.message}")
```

| Exception | HTTP status | When it fires |
|---|---|---|
| `SpidraAuthenticationError` | 401 | API key missing or invalid |
| `SpidraInsufficientCreditsError` | 403 | No credits remaining |
| `SpidraRateLimitError` | 429 | Too many requests |
| `SpidraServerError` | 500 | Unexpected error on Spidra's side |
| `SpidraError` | any | Base class for all Spidra exceptions |
All exceptions expose .status for the HTTP code and .message for a human-readable explanation.
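For rate limits and server errors, a common pattern is to retry with exponential backoff. The sketch below is a generic helper, not an SDK feature. `with_retries` and the `RateLimited` exception are hypothetical names, with `RateLimited` standing in for `SpidraRateLimitError` so the example runs without the SDK installed.

```python
import time


class RateLimited(Exception):
    """Stand-in for SpidraRateLimitError (hypothetical, for this sketch only)."""


def with_retries(call, retry_on=(RateLimited,), attempts=4, base_delay=1.0):
    """Run call(); on a retryable error wait base_delay * 2**attempt and try again."""
    for attempt in range(attempts):
        try:
            return call()
        except retry_on:
            if attempt == attempts - 1:
                raise  # out of attempts: let the caller see the error
            time.sleep(base_delay * 2 ** attempt)


# Demo: a function that fails twice, then succeeds
calls = {"n": 0}


def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RateLimited()
    return "ok"


result = with_retries(flaky, base_delay=0.01)
print(result, calls["n"])
```

With the real SDK you would presumably pass `retry_on=(SpidraRateLimitError, SpidraServerError)` and wrap the scrape call in a lambda. Authentication and credit errors are left out on purpose, since retrying them will not help.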
Also check the ai_extraction_failed flag in the result. If AI extraction fails for any reason, Spidra falls back to returning the raw page Markdown and sets this flag so your code can detect it:
```python
job = spidra.scrape.run_sync(ScrapeParams(
    urls=[ScrapeUrl(url="https://example.com")],
    prompt="Extract the main headline",
))

if job.result.ai_extraction_failed:
    # AI extraction failed: the raw Markdown fallback is in the data array
    raw = job.result.data[0].markdown_content
    print("Extraction failed, falling back to raw content")
else:
    print(job.result.content)
```

Putting it all together: a complete pipeline
Here is a full example that uses browser actions with forEach to collect job listings from a directory, enforces a schema on the output, handles errors properly, and saves results to JSONL:
```python
import json
import os

from spidra import (
    SpidraClient,
    ScrapeParams,
    ScrapeUrl,
    BrowserAction,
    SpidraError,
    SpidraInsufficientCreditsError,
)

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

JOB_SCHEMA = {
    "type": "object",
    "required": ["title", "company", "location"],
    "properties": {
        "title": {"type": "string"},
        "company": {"type": "string"},
        "location": {"type": ["string", "null"]},
        "remote": {"type": ["boolean", "null"]},
        "salary_min": {"type": ["number", "null"]},
        "salary_max": {"type": ["number", "null"]},
        "employment_type": {
            "type": ["string", "null"],
            "enum": ["full_time", "part_time", "contract", None],
        },
    },
}


def collect_listings(board_url: str) -> list:
    try:
        job = spidra.scrape.run_sync(ScrapeParams(
            urls=[
                ScrapeUrl(
                    url=board_url,
                    actions=[
                        BrowserAction(type="click", value="Accept cookies"),
                        BrowserAction(
                            type="forEach",
                            value="Find all job listing cards",
                            mode="navigate",
                            max_items=50,
                            item_prompt="Extract job title, company, location, remote status, salary range, and employment type",
                            pagination={
                                "nextSelector": "a.next-page",
                                "maxPages": 3,
                            },
                        ),
                    ],
                )
            ],
            output="json",
            schema=JOB_SCHEMA,
        ))

        if job.result.ai_extraction_failed:
            print(f"Warning: AI extraction failed for {board_url}")
            return []

        # Normalise to a list so callers can always extend() with it
        content = job.result.content
        return content if isinstance(content, list) else [content]

    except SpidraInsufficientCreditsError:
        print("Out of credits. Stopping.")
        return []
    except SpidraError as e:
        print(f"Error scraping {board_url}: {e.message}")
        return []


boards = [
    "https://jobs.example.com/engineering",
    "https://careers.anothersite.com/remote",
]

all_jobs = []
for board in boards:
    print(f"Collecting from {board}...")
    listings = collect_listings(board)
    all_jobs.extend(listings)
    print(f"  Got {len(listings)} listings")

# One JSON object per line
with open("jobs.jsonl", "w") as f:
    for job in all_jobs:
        f.write(json.dumps(job) + "\n")

print(f"\nDone. {len(all_jobs)} jobs saved to jobs.jsonl")
```

All scrape parameters
For reference, here is the full list of parameters you can pass to ScrapeParams:
| Parameter | Type | Description |
|---|---|---|
| `urls` | list | Up to 3 `ScrapeUrl` objects. Each takes a `url` and optional `actions`. |
| `prompt` | str | What to extract, in plain English |
| `output` | str | `"markdown"` (default) or `"json"` |
| `schema` | dict | JSON Schema for a guaranteed output shape |
| `use_proxy` | bool | Route through a residential proxy |
| `proxy_country` | str | Two-letter country code or `"eu"` / `"global"` |
| `extract_content_only` | bool | Strip nav, ads, and boilerplate before AI extraction |
| `screenshot` | bool | Capture a viewport screenshot |
| `full_page_screenshot` | bool | Capture a full-page screenshot |
| `cookies` | str | Raw Cookie header string for authenticated pages |
What to read next
If you want to go deeper on any part of the SDK:
- Browser actions guide covers every option for each action type including all `forEach` parameters
- Structured output guide covers schemas in depth including Pydantic integration and schema limits
- Stealth mode guide has the full country list and proxy options
- Authenticated scraping guide covers how to get cookies from your browser and the formats Spidra accepts
Get your API key at app.spidra.io. The free plan has 300 credits and no card required.
