Web scraping in Node.js has a familiar progression. You start with axios or node-fetch for static pages. Then a modern site returns an empty HTML shell and you reach for Puppeteer. Then Cloudflare blocks you and you spend an evening on stealth plugins. Then the page structure changes and your selectors are worthless again.
Spidra's Node.js SDK (spidra-js) cuts across all of that. You describe what you want from a page in plain English, and the SDK returns structured data. The browser rendering, anti-bot bypass, CAPTCHA solving, and AI extraction all run on Spidra's infrastructure. Your code just handles the result.
This tutorial covers the full SDK, from installation through crawling an entire website. The SDK is TypeScript-native, so you get complete type safety out of the box. Every example runs as-is once your API key is set.
Prerequisites
- Node.js 18 or higher
- A Spidra API key from app.spidra.io under Settings → API Keys
Installation
npm install spidra-js
The package includes TypeScript types. You do not need a separate @types/spidra-js package.
Store your API key as an environment variable. Never hardcode it in source files.
export SPIDRA_API_KEY="spd_YOUR_API_KEY"
Setting up the client
Import SpidraClient and initialise it with your API key.
TypeScript / ESM:
import { SpidraClient } from 'spidra-js'
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })
CommonJS:
const { SpidraClient } = require('spidra-js')
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })
The client exposes five namespaces:
| Namespace | What it handles |
|---|---|
| spidra.scrape | Scraping one to three URLs with browser automation and AI extraction |
| spidra.batch | Processing up to 50 URLs in parallel |
| spidra.crawl | Discovering and scraping pages across an entire website |
| spidra.logs | History of every scrape your API key has made |
| spidra.usage | Credit and request consumption statistics |
Every method is async and returns a Promise. The examples below use top-level await for clarity. If your project does not support top-level await, wrap the calls in an async function.
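The non-null assertion (!) in the TypeScript example above only silences the compiler; if SPIDRA_API_KEY is unset, the client still receives undefined. A minimal sketch of a fail-fast check follows. requireApiKey and EnvLike are hypothetical names for illustration, not part of the SDK.

```typescript
// Hypothetical helper (not part of spidra-js): fail fast when the key is missing.
type EnvLike = Record<string, string | undefined>

function requireApiKey(env: EnvLike, name = 'SPIDRA_API_KEY'): string {
  // Treat an unset, empty, or whitespace-only value as missing.
  const value = env[name]?.trim()
  if (!value) {
    throw new Error(`${name} is not set. Export it before running this script.`)
  }
  return value
}

// Usage with the client from the examples above:
//   const spidra = new SpidraClient({ apiKey: requireApiKey(process.env) })
```

Taking the environment as a parameter keeps the helper easy to test and lets the same function work with Bun.env.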
Scraping a page
Your first scrape
import { SpidraClient } from 'spidra-js'
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })
const job = await spidra.scrape.run({
urls: [{ url: 'https://news.ycombinator.com' }],
})
console.log(job.result.content)
Without a prompt, Spidra loads the page in a real browser, executes all JavaScript, and returns the full rendered content as Markdown. That is what ends up in job.result.content.
How the job lifecycle works
run() submits the job and polls in the background until it completes. From your side it looks like a single await. Under the hood, the job moves through these states:
waiting → active → completed (or failed)
If you want to submit a job and check on it yourself rather than waiting, use submit() and get() separately:
// Submit and get a job ID immediately
const queued = await spidra.scrape.submit({
urls: [{ url: 'https://example.com' }],
prompt: 'Extract the main headline',
})
console.log(`Job submitted: ${queued.jobId}`)
// Check later
await new Promise(r => setTimeout(r, 5000))
const status = await spidra.scrape.get(queued.jobId)
if (status.status === 'completed') {
console.log(status.result.content)
} else if (status.status === 'failed') {
console.error(`Failed: ${status.error}`)
}
Extracting data with prompts
Add a prompt and Spidra uses AI to extract exactly what you described from the rendered page. You do not need to know the page structure or write any selectors.
const job = await spidra.scrape.run({
urls: [{ url: 'https://news.ycombinator.com' }],
prompt: 'Extract the top 10 post titles and their point scores',
output: 'json',
})
console.log(job.result.content)
// [{ "title": "Show HN: I built a thing", "points": 342 }, ...]
Setting output: 'json' tells the AI to return structured JSON. The default is 'markdown'.
The AI understands context. It knows a number next to a currency symbol is a price, a short bold line at the top of a product page is probably the title, and a longer block of text is likely a description. You describe the result you want and it finds it on the page.
That said, the SDK also fully supports CSS selectors and XPath for browser interactions when you want to be precise. We will cover that in the browser actions section.
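With a prompt alone, the AI chooses the field names, so it can help to check the shape at runtime before using it. The sketch below assumes the content looks like the Hacker News example comment above (an array of objects with title and points). HnPost, isHnPost, and parsePosts are hypothetical names, and treating content as unknown is an assumption about how you receive it.

```typescript
// Illustrative shape, taken from the example output above.
interface HnPost {
  title: string
  points: number
}

// Type guard: true only when the value has a string title and numeric points.
function isHnPost(value: unknown): value is HnPost {
  if (typeof value !== 'object' || value === null) return false
  const record = value as Record<string, unknown>
  return typeof record.title === 'string' && typeof record.points === 'number'
}

// Keeps the well-formed items and drops everything else.
// Returns an empty array when the content is not an array (for example, Markdown).
function parsePosts(content: unknown): HnPost[] {
  if (!Array.isArray(content)) return []
  return content.filter(isHnPost)
}

// Usage: const posts = parsePosts(job.result.content)
```

Dropping malformed items silently is a design choice for this sketch; a production pipeline may prefer to log or reject them.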
Enforcing output shape with JSON schema
Plain prompts are flexible but not predictable. The AI decides what fields to return and what to call them. That works for exploration but causes problems in production when a database or another service expects a consistent shape every single time.
The schema field solves this. Pass a JSON Schema object and the AI must match it exactly. Fields in required always appear in the output, as null if the page does not have that value.
const job = await spidra.scrape.run({
urls: [{ url: 'https://jobs.example.com/senior-engineer' }],
prompt: 'Extract the job listing details. Normalize salary to a USD number.',
output: 'json',
schema: {
type: 'object',
required: ['title', 'company', 'remote'],
properties: {
title: { type: 'string' },
company: { type: 'string' },
remote: { type: ['boolean', 'null'] },
salary_min: { type: ['number', 'null'] },
salary_max: { type: ['number', 'null'] },
employment_type: {
type: ['string', 'null'],
enum: ['full_time', 'part_time', 'contract', null],
},
skills: { type: 'array', items: { type: 'string' } },
},
},
})
console.log(job.result.content)
// {
// title: "Senior Software Engineer",
// company: "Acme Corp",
// remote: true,
// salary_min: 120000,
// salary_max: 160000,
// employment_type: "full_time",
// skills: ["TypeScript", "PostgreSQL", "AWS"]
// }
Since the SDK is TypeScript-native, you can type the result directly:
interface JobListing {
title: string
company: string
remote: boolean | null
salary_min: number | null
salary_max: number | null
employment_type: 'full_time' | 'part_time' | 'contract' | null
skills: string[]
}
const content = job.result.content as JobListing
console.log(`${content.title} at ${content.company}`)
If you use Zod for runtime validation, generate the schema from your existing Zod type and pass it directly:
import { z } from 'zod'
import { zodToJsonSchema } from 'zod-to-json-schema'
const JobListingSchema = z.object({
title: z.string(),
company: z.string(),
remote: z.boolean().nullable(),
salary_min: z.number().nullable(),
salary_max: z.number().nullable(),
employment_type: z.enum(['full_time', 'part_time', 'contract']).nullable(),
skills: z.array(z.string()),
})
const job = await spidra.scrape.run({
urls: [{ url: 'https://jobs.example.com/senior-engineer' }],
prompt: 'Extract the job listing details',
schema: zodToJsonSchema(JobListingSchema),
})
const listing = JobListingSchema.parse(job.result.content)
A single schema definition in your codebase now drives both runtime validation and the shape of the scraped output.
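If you would rather not add Zod, a lighter check is possible. The sketch below only verifies the rule stated earlier: every field listed in required appears in the output, and null counts as present. missingRequiredFields is a hypothetical helper; it assumes a flat object schema like the one above and does not validate value types.

```typescript
// Only the part of a JSON Schema this sketch needs.
interface RequiredFieldsSchema {
  required?: readonly string[]
}

// Hypothetical helper: lists keys from schema.required that are absent in the output.
function missingRequiredFields(schema: RequiredFieldsSchema, output: unknown): string[] {
  const required = schema.required ?? []
  if (typeof output !== 'object' || output === null) {
    // Nothing usable came back, so every required key is missing.
    return [...required]
  }
  // A key whose value is null still counts as present.
  return required.filter(key => !(key in output))
}

// Usage: const gaps = missingRequiredFields({ required: ['title', 'company', 'remote'] }, job.result.content)
```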
Browser actions
Some pages require interaction before the content you want is visible. A cookie banner blocking everything. A search form that needs filling. Lazy-loaded content that only appears after scrolling. Tabs that hide data by default.
Pass an actions array inside the URL object and those actions execute in order inside a real browser before extraction runs.
const job = await spidra.scrape.run({
urls: [
{
url: 'https://example.com/products',
actions: [
{ type: 'click', selector: '#accept-cookies' },
{ type: 'wait', duration: 1000 },
{ type: 'scroll', to: '80%' },
],
},
],
prompt: 'Extract all product names and prices visible on the page',
})
For click, check, and uncheck actions, you have two options for targeting an element:
- selector for a CSS selector or XPath expression like '#accept-cookies' or '.submit-btn'
- value for a plain English description like 'Accept cookies button', which Spidra locates using AI
Both are valid, and you can mix them in the same actions array:
actions: [
{ type: 'click', selector: '#accept-cookies' }, // CSS selector
{ type: 'click', value: 'Search button' }, // plain English
]
Use whichever is more convenient. If the element has a clean, stable ID or class, use selector. If the page is complex or you want the action to survive layout changes, use value.
All available actions
| Action | What it does | Key fields |
|---|---|---|
| click | Clicks a button, link, or any element | selector or value |
| type | Types text into an input field | selector, value |
| check | Checks a checkbox | selector or value |
| uncheck | Unchecks a checkbox | selector or value |
| wait | Pauses for a number of milliseconds | duration |
| scroll | Scrolls to a percentage of the page height | to (e.g. '80%') |
| forEach | Finds matching elements and processes each one | value, mode |
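As a reading aid, the table can be restated as a TypeScript discriminated union. This is a sketch of the shapes described in this tutorial, not the SDK's exported types (the package ships its own). ElementTarget, SketchAction, and describeAction are hypothetical names, and forEach is reduced to the two fields listed in the table.

```typescript
// Illustrative shapes only, derived from the table above. Not the SDK's real types.
type ElementTarget = { selector: string } | { value: string }

type SketchAction =
  | ({ type: 'click' | 'check' | 'uncheck' } & ElementTarget)
  | { type: 'type'; selector: string; value: string }
  | { type: 'wait'; duration: number }
  | { type: 'scroll'; to: string }
  | { type: 'forEach'; value: string; mode: 'inline' | 'navigate' | 'click' }

// Produces a one-line summary of an action, handy for logging a plan before running it.
function describeAction(action: SketchAction): string {
  switch (action.type) {
    case 'click':
    case 'check':
    case 'uncheck': {
      // Selectors are printed as-is; plain English targets are quoted.
      const target = 'selector' in action ? action.selector : `"${action.value}"`
      return `${action.type} ${target}`
    }
    case 'type':
      return `type "${action.value}" into ${action.selector}`
    case 'wait':
      return `wait ${action.duration}ms`
    case 'scroll':
      return `scroll to ${action.to}`
    case 'forEach':
      return `forEach (${action.mode}): ${action.value}`
  }
}
```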
The forEach action
forEach is the most powerful action in the SDK. It finds a set of matching elements on the page and processes each one individually, combining all the results into a single output.
Three modes:
- inline reads the content of each matched element directly. For product cards, table rows, or content that lives inside the element itself.
- navigate follows each element as a link, loads the destination page, and scrapes it. For detail pages you need to click into.
- click clicks each element to expand or reveal content, then scrapes what appears. For accordions, modals, or expandable sections.
const job = await spidra.scrape.run({
urls: [
{
url: 'https://directory.example.com/companies',
actions: [
{ type: 'click', value: 'Accept cookies' },
{
type: 'forEach',
value: 'Find all company listing cards',
mode: 'navigate',
maxItems: 20,
itemPrompt: 'Extract company name, website, and industry',
pagination: {
nextSelector: 'a.next-page',
maxPages: 3,
},
},
],
},
],
output: 'json',
})
This dismisses the cookie banner, finds every company card on the page, navigates into each company profile, extracts the company details, and repeats across three pages of pagination. One request, one await.
Proxy and geo-targeting
Some sites block cloud infrastructure IP ranges or serve different content based on location. Set useProxy: true to route through a residential proxy.
const job = await spidra.scrape.run({
urls: [{ url: 'https://www.amazon.de/gp/bestsellers' }],
prompt: 'List the top 10 products with name and price',
useProxy: true,
proxyCountry: 'de',
})
proxyCountry accepts:
- A two-letter ISO country code like 'us', 'de', 'gb', 'fr', 'jp'
- 'eu' to rotate randomly across all 27 EU member states
- 'global', or omit the field, for no country preference
Proxy usage is billed from your bandwidth quota, not your credits.
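A small guard can catch malformed country values before a request is sent. This sketch assumes the accepted values are exactly those listed above. buildProxyOptions and ProxyOptions are hypothetical names, and the pattern only checks the shape of the code, not whether Spidra has proxies in that country.

```typescript
// Hypothetical shape for the two proxy fields used in this section.
interface ProxyOptions {
  useProxy: true
  proxyCountry?: string
}

// Hypothetical helper: builds the proxy fields for a scrape request.
function buildProxyOptions(country?: string): ProxyOptions {
  if (country === undefined) {
    return { useProxy: true } // no country preference
  }
  // Normalise to the lowercase form used in the examples above.
  const code = country.trim().toLowerCase()
  // 'eu' already matches the two-letter pattern; 'global' is the only longer value.
  if (code !== 'global' && !/^[a-z]{2}$/.test(code)) {
    throw new Error(`Unsupported proxyCountry: "${country}"`)
  }
  return { useProxy: true, proxyCountry: code }
}

// Usage: spidra.scrape.run({ urls, prompt, ...buildProxyOptions('de') })
```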
Scraping pages behind a login
Pass session cookies to access authenticated content. Log in through your browser, open DevTools, copy the Cookie header from any authenticated request, and pass it as a string.
const job = await spidra.scrape.run({
urls: [{ url: 'https://app.example.com/dashboard' }],
prompt: 'Extract the monthly revenue and active user count',
cookies: 'session=abc123; auth_token=xyz789',
})
Standard cookie format (name=value; name2=value2) and Chrome DevTools paste format both work.
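If your cookies live in an object, for example after being exported by a script, they can be joined into the standard name=value; name2=value2 format. toCookieHeader is a hypothetical helper, not part of the SDK.

```typescript
// Hypothetical helper: joins cookie pairs into a single Cookie header string.
// Values are passed through unchanged, so encode them first if they contain ';'.
function toCookieHeader(cookies: Record<string, string>): string {
  return Object.keys(cookies)
    .map(cookieName => `${cookieName}=${cookies[cookieName]}`)
    .join('; ')
}

// Usage: cookies: toCookieHeader({ session: 'abc123', auth_token: 'xyz789' })
```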
Stripping boilerplate
extractContentOnly strips navigation, headers, footers, and sidebars before extraction runs. Useful for articles, documentation pages, and any page where the main content is surrounded by heavy navigation.
const job = await spidra.scrape.run({
urls: [{ url: 'https://blog.example.com/long-article' }],
prompt: 'Summarize this article in three sentences',
extractContentOnly: true,
})
Screenshots
Capture screenshots of pages for debugging, monitoring, or archival.
const job = await spidra.scrape.run({
urls: [{ url: 'https://example.com' }],
screenshot: true,
fullPageScreenshot: true,
})
console.log(job.result.screenshots) // array of URLs
screenshot: true captures the visible viewport. fullPageScreenshot: true captures the entire scrollable page.
Batch scraping
When you have a list of URLs, the batch endpoint processes up to 50 at a time in parallel. Each URL runs in its own independent worker.
const batch = await spidra.batch.run({
urls: [
'https://shop.example.com/product/1',
'https://shop.example.com/product/2',
'https://shop.example.com/product/3',
],
prompt: 'Extract the product name, price, and whether it is in stock',
output: 'json',
})
console.log(`${batch.completedCount}/${batch.totalUrls} completed`)
for (const item of batch.items) {
if (item.status === 'completed') {
console.log(item.url, item.result)
} else {
console.error(`Failed: ${item.url} — ${item.error}`)
}
}
Processing large URL lists
The batch endpoint caps at 50 URLs per request. For larger lists, chunk them:
async function scrapeAll(urls: string[], prompt: string) {
const results: Array<{ url: string; data: unknown }> = []
const chunkSize = 50
for (let i = 0; i < urls.length; i += chunkSize) {
const chunk = urls.slice(i, i + chunkSize)
const batchNum = Math.floor(i / chunkSize) + 1
const totalBatches = Math.ceil(urls.length / chunkSize)
console.log(`Processing batch ${batchNum} of ${totalBatches}...`)
const batch = await spidra.batch.run({
urls: chunk,
prompt,
output: 'json',
})
for (const item of batch.items) {
if (item.status === 'completed') {
results.push({ url: item.url, data: item.result })
} else {
console.warn(`Failed: ${item.url}`)
}
}
}
return results
}
const urls = Array.from(
{ length: 200 },
(_, i) => `https://example.com/product/${i + 1}`
)
const results = await scrapeAll(urls, 'Extract product name and price')
Managing batches
Retry failed items without resubmitting the ones that already succeeded:
if (batch.failedCount > 0) {
await spidra.batch.retry(batch.batchId)
}
Cancel a running batch and get credits refunded for items that have not started yet:
const response = await spidra.batch.cancel(batchId)
console.log(`Cancelled ${response.cancelledItems} items, refunded ${response.creditsRefunded} credits`)
Crawling entire websites
Batch scraping works when you already know the URLs. Crawling is for when you want Spidra to discover them for you.
Give it a starting URL, describe which links to follow, and describe what to extract from each page. Spidra loads the base URL, finds matching links, visits each one up to your maxPages limit, and applies your transformInstruction to every page it visits.
import { SpidraClient } from 'spidra-js'
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })
const job = await spidra.crawl.run({
baseUrl: 'https://competitor.com/blog',
crawlInstruction: 'Follow links to blog posts only. Skip tag pages, category pages, and the homepage.',
transformInstruction: 'Extract the post title, author name, publish date, and a one-sentence summary.',
maxPages: 20,
useProxy: true,
})
for (const page of job.result) {
console.log(page.url, page.data)
}
Three fields are required: baseUrl, crawlInstruction, and transformInstruction. maxPages defaults to 5 and can be set up to 20.
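Because a value above the ceiling is easy to pass by accident, a tiny guard can keep maxPages within the documented range. The sketch assumes the default of 5 and the ceiling of 20 stated above. clampMaxPages and both constants are hypothetical names.

```typescript
// Limits as stated in this tutorial; adjust if your plan differs.
const DEFAULT_MAX_PAGES = 5
const MAX_PAGES_LIMIT = 20

// Hypothetical helper: returns a whole number between 1 and MAX_PAGES_LIMIT.
function clampMaxPages(requested?: number): number {
  if (requested === undefined || isNaN(requested)) return DEFAULT_MAX_PAGES
  return Math.min(MAX_PAGES_LIMIT, Math.max(1, Math.floor(requested)))
}

// Usage: maxPages: clampMaxPages(userSuppliedValue)
```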
For larger crawls that take more time, the default 120-second timeout may not be enough. If you are hitting timeouts, fire the crawl with submit() and poll with get() yourself:
const queued = await spidra.crawl.submit({
baseUrl: 'https://docs.example.com',
crawlInstruction: 'Follow all documentation pages. Skip changelog and login pages.',
transformInstruction: 'Extract the page title and full body text.',
maxPages: 20,
})
// Poll every 10 seconds
let status = await spidra.crawl.get(queued.jobId)
while (status.status !== 'completed' && status.status !== 'failed') {
await new Promise(r => setTimeout(r, 10000))
status = await spidra.crawl.get(queued.jobId)
console.log(`Status: ${status.status}`)
}
for (const page of status.result ?? []) {
console.log(page.url, page.data)
}
Re-extracting with a different prompt
If you crawled a site and want to pull out different information, use extract() to run a new AI pass over the already-crawled content without making new browser requests:
const queued = await spidra.crawl.extract(
completedJobId,
'Extract only product SKUs and prices as structured JSON',
)
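// extract() only queues the new job; poll get() until status is 'completed' before reading the result
const result = await spidra.crawl.get(queued.jobId)
Using the SDK in different environments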
Next.js API route
// app/api/scrape/route.ts
import { SpidraClient } from 'spidra-js'
import { NextResponse } from 'next/server'
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })
export async function POST(request: Request) {
const { url, prompt } = await request.json()
try {
const job = await spidra.scrape.run({
urls: [{ url }],
prompt,
output: 'json',
})
return NextResponse.json({ data: job.result.content })
} catch (error) {
return NextResponse.json({ error: 'Scrape failed' }, { status: 500 })
}
}
Express
import express from 'express'
import { SpidraClient } from 'spidra-js'
const app = express()
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })
app.use(express.json())
app.post('/scrape', async (req, res) => {
const { url, prompt } = req.body
try {
const job = await spidra.scrape.run({
urls: [{ url }],
prompt,
output: 'json',
})
res.json({ data: job.result.content })
} catch (err) {
res.status(500).json({ error: 'Scrape failed' })
}
})
app.listen(3000)
Bun
The SDK works with Bun out of the box. No changes needed.
bun add spidra-js
import { SpidraClient } from 'spidra-js'
const spidra = new SpidraClient({ apiKey: Bun.env.SPIDRA_API_KEY! })
const job = await spidra.scrape.run({
urls: [{ url: 'https://example.com' }],
prompt: 'Extract the main headline',
})
console.log(job.result.content)
Error handling
Every API error maps to a typed exception class. Catch exactly what you care about and let everything else propagate.
import {
SpidraError,
SpidraAuthenticationError,
SpidraInsufficientCreditsError,
SpidraRateLimitError,
SpidraServerError,
} from 'spidra-js'
try {
const job = await spidra.scrape.run({
urls: [{ url: 'https://example.com' }],
prompt: 'Extract the main headline',
})
console.log(job.result.content)
} catch (err) {
if (err instanceof SpidraAuthenticationError) {
console.error('API key is missing or invalid. Check your SPIDRA_API_KEY.')
} else if (err instanceof SpidraInsufficientCreditsError) {
console.error('Account is out of credits. Top up at app.spidra.io.')
} else if (err instanceof SpidraRateLimitError) {
console.warn('Rate limit hit. Slow down and retry.')
} else if (err instanceof SpidraServerError) {
console.error(`Server error (${err.status}): ${err.message}. Retry is usually safe.`)
} else if (err instanceof SpidraError) {
console.error(`API error ${err.status}: ${err.message}`)
} else {
throw err
}
}
| Exception | HTTP status | When it fires |
|---|---|---|
| SpidraAuthenticationError | 401 | API key missing or invalid |
| SpidraInsufficientCreditsError | 403 | No credits remaining |
| SpidraRateLimitError | 429 | Too many requests |
| SpidraServerError | 500 | Unexpected error on Spidra's side |
| SpidraError | any | Base class for all exceptions |
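The handler above suggests retrying after rate-limit and server errors. One common way to do that is exponential backoff. The sketch below is generic and uses no SDK internals: withRetry, backoffDelayMs, and the default delays are hypothetical choices, and the shouldRetry callback is where you would test for SpidraRateLimitError or SpidraServerError.

```typescript
// Delay before retry number `attempt` (0-based): 1s, 2s, 4s, ... capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * Math.pow(2, attempt))
}

// Runs `task`, retrying with backoff while `shouldRetry` approves the error.
// Gives up and rethrows after `maxAttempts` total attempts.
async function withRetry<T>(
  task: () => Promise<T>,
  shouldRetry: (err: unknown) => boolean,
  maxAttempts = 4,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await task()
    } catch (err) {
      const isLastAttempt = attempt >= maxAttempts - 1
      if (isLastAttempt || !shouldRetry(err)) throw err
      await new Promise<void>(resolve => setTimeout(resolve, backoffDelayMs(attempt)))
    }
  }
}

// Usage, with the error classes imported as in the example above:
//   const job = await withRetry(
//     () => spidra.scrape.run({ urls: [{ url: 'https://example.com' }] }),
//     err => err instanceof SpidraRateLimitError || err instanceof SpidraServerError,
//   )
```

Authentication and credit errors are deliberately left out of the usage example, since retrying them cannot succeed without a change on your side.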
Also check the ai_extraction_failed flag in the result. If AI extraction fails for any reason, Spidra falls back to raw Markdown and sets this flag:
const job = await spidra.scrape.run({
urls: [{ url: 'https://example.com' }],
prompt: 'Extract the main headline',
})
if (job.result.ai_extraction_failed) {
// Raw Markdown fallback is in the data array
const raw = job.result.data[0]?.markdownContent
console.warn('AI extraction failed, using raw content')
} else {
console.log(job.result.content)
}
Putting it all together: a complete pipeline
A full example that uses forEach with pagination to collect job listings from a directory, enforces a schema on the output, handles errors, and saves results to a JSONL file:
import { SpidraClient, SpidraError, SpidraInsufficientCreditsError } from 'spidra-js'
import { writeFileSync } from 'fs'
import * as os from 'os'
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })
const JOB_SCHEMA = {
type: 'object',
required: ['title', 'company', 'location'],
properties: {
title: { type: 'string' },
company: { type: 'string' },
location: { type: ['string', 'null'] },
remote: { type: ['boolean', 'null'] },
salary_min: { type: ['number', 'null'] },
salary_max: { type: ['number', 'null'] },
employment_type: {
type: ['string', 'null'],
enum: ['full_time', 'part_time', 'contract', null],
},
},
}
async function collectListings(boardUrl: string) {
try {
const job = await spidra.scrape.run({
urls: [
{
url: boardUrl,
actions: [
{ type: 'click', value: 'Accept cookies' },
{
type: 'forEach',
value: 'Find all job listing cards',
mode: 'navigate',
maxItems: 50,
itemPrompt: 'Extract job title, company, location, remote status, salary range, and employment type',
pagination: {
nextSelector: 'a.next-page',
maxPages: 3,
},
},
],
},
],
output: 'json',
schema: JOB_SCHEMA,
})
if (job.result.ai_extraction_failed) {
console.warn(`AI extraction failed for ${boardUrl}`)
return []
}
const content = job.result.content
return Array.isArray(content) ? content : [content]
} catch (err) {
if (err instanceof SpidraInsufficientCreditsError) {
throw err // bubble up — stop processing
}
if (err instanceof SpidraError) {
console.error(`Error scraping ${boardUrl}: ${err.message}`)
return []
}
throw err
}
}
const boards = [
'https://jobs.example.com/engineering',
'https://careers.anothersite.com/remote',
]
const allJobs: unknown[] = []
for (const board of boards) {
console.log(`Collecting from ${board}...`)
const listings = await collectListings(board)
allJobs.push(...listings)
console.log(` Got ${listings.length} listings`)
}
const jsonl = allJobs.map(job => JSON.stringify(job)).join(os.EOL)
writeFileSync('jobs.jsonl', jsonl)
console.log(`\nDone. ${allJobs.length} jobs saved to jobs.jsonl`)
All scrape options
| Option | Type | Description |
|---|---|---|
| urls | array | Up to 3 URL objects. Each takes a url and optional actions. |
| prompt | string | What to extract, in plain English |
| output | string | 'markdown' (default) or 'json' |
| schema | object | JSON Schema for a guaranteed output shape |
| useProxy | boolean | Route through a residential proxy |
| proxyCountry | string | Two-letter country code or 'eu' / 'global' |
| extractContentOnly | boolean | Strip nav, ads, and boilerplate before extraction |
| screenshot | boolean | Capture a viewport screenshot |
| fullPageScreenshot | boolean | Capture a full-page screenshot |
| cookies | string | Raw Cookie header string for authenticated pages |
What to read next
- Browser actions guide covers every option for each action type, including all forEach parameters
- Structured output guide covers schemas in depth, including Zod integration and schema limits
- Stealth mode guide has the full country list and proxy options
- Python SDK tutorial if you are working in Python
- Full API reference if you want to use the REST API directly
Get your API key at app.spidra.io. The free plan includes 300 credits and requires no card.
