June 12, 2026 · 16 min read

Spidra API Node.js tutorial: scrape any website with JavaScript and TypeScript

Joel Olawanle

Web scraping in Node.js has a familiar progression. You start with axios or node-fetch for static pages. Then a modern site returns an empty HTML shell and you reach for Puppeteer. Then Cloudflare blocks you and you spend an evening on stealth plugins. Then the page structure changes and your selectors are worthless again.

Spidra's Node.js SDK cuts across all of that. You describe what you want from a page in plain English, and the SDK returns structured data. The browser rendering, anti-bot bypass, CAPTCHA solving, and AI extraction all run on Spidra's infrastructure. Your code just handles the result.

This tutorial covers the full SDK, from installation through crawling an entire website. The SDK is TypeScript-native so you get complete type safety out of the box. Every example works as-is with no additional configuration.

Prerequisites

Node.js 18 or higher
A Spidra API key from app.spidra.io under Settings → API Keys

Installation

npm install spidra

The package includes TypeScript types. You do not need a separate @types/spidra package.

Store your API key as an environment variable. Never hardcode it in source files.

export SPIDRA_API_KEY="spd_YOUR_API_KEY"

Setting up the client

Import SpidraClient and initialise it with your API key.

TypeScript / ESM:

import { SpidraClient } from 'spidra'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

CommonJS:

const { SpidraClient } = require('spidra')

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })

The client exposes five namespaces:

Namespace	What it handles
`spidra.scrape`	Scraping one to three URLs with browser automation and AI extraction
`spidra.batch`	Processing up to 50 URLs in parallel
`spidra.crawl`	Discovering and scraping pages across an entire website
`spidra.logs`	History of every scrape your API key has made
`spidra.usage`	Credit and request consumption statistics

Every method is async and returns a Promise. The examples below use top-level await for clarity. If your project does not support top-level await, wrap the calls in an async function.
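If top-level await is not available, the wrapper can look something like the sketch below. `fakeScrape` is a hypothetical stand-in for any SDK call so the snippet runs on its own; in real code you would call `spidra.scrape.run()` inside `main` instead.

```typescript
// Hypothetical stand-in for an SDK call; not part of the spidra package.
async function fakeScrape(url: string): Promise<{ content: string }> {
  return { content: `scraped ${url}` }
}

// All awaits live inside one async entry point.
async function main(): Promise<string> {
  const job = await fakeScrape('https://example.com')
  return job.content
}

// Start the program and surface any rejection instead of leaving it unhandled.
main()
  .then(content => console.log(content))
  .catch(err => {
    console.error('Scrape failed:', err)
  })
```

Keeping a single `main` function also gives you one place to catch errors for the whole script.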

Part 1: Scraping a page

Your first scrape

import { SpidraClient } from 'spidra'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

const job = await spidra.scrape.run({
  urls: [{ url: 'https://news.ycombinator.com' }],
})

console.log(job.result.content)

Without a prompt, Spidra still applies AI extraction to the page, so job.result.content contains structured data the AI extracted automatically. For readable output, use JSON.stringify:

console.log(JSON.stringify(job.result.content, null, 2))

To control exactly what gets extracted, add a prompt. To enforce structured JSON output, set output: 'json'.

How the job lifecycle works

run() submits the job and polls in the background until it completes. From your side it looks like a single await. Under the hood, the job moves through these states:

waiting → active → completed (or failed)

If you want to submit a job and check on it yourself rather than waiting, use submit() and get() separately:

// Submit and get a job ID immediately
const queued = await spidra.scrape.submit({
  urls: [{ url: 'https://example.com' }],
  prompt: 'Extract the main headline',
})

console.log(`Job submitted: ${queued.jobId}`)

// Check later
await new Promise(r => setTimeout(r, 5000))
const status = await spidra.scrape.get(queued.jobId)

if (status.status === 'completed') {
  console.log(status.result.content)
} else if (status.status === 'failed') {
  console.error(`Failed: ${status.error}`)
}
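A single check after five seconds can miss a job that is still running. One way to handle that is a small polling helper that keeps asking until the job reaches completed or failed. This is a sketch, not part of the SDK: `pollUntilDone`, `JobSnapshot`, and the default interval and attempt budget are all assumptions chosen for illustration.

```typescript
// Job states from the lifecycle described above.
type JobStatus = 'waiting' | 'active' | 'completed' | 'failed'

// Minimal shape the helper needs; real SDK responses carry more fields.
interface JobSnapshot {
  status: JobStatus
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

// Calls `getJob` repeatedly until the job is completed or failed,
// or throws once the attempt budget is used up.
async function pollUntilDone<T extends JobSnapshot>(
  getJob: () => Promise<T>,
  intervalMs = 2000,
  maxAttempts = 30,
): Promise<T> {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    const job = await getJob()
    if (job.status === 'completed' || job.status === 'failed') {
      return job
    }
    // Only wait if another attempt is coming.
    if (attempt < maxAttempts) await sleep(intervalMs)
  }
  throw new Error(`Job still not finished after ${maxAttempts} attempts`)
}
```

With the client from earlier, the call would be along the lines of `pollUntilDone(() => spidra.scrape.get(queued.jobId))`. Passing the getter as a function keeps the helper independent of which namespace the job came from.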

Part 2: Extracting data with prompts

Add a prompt and Spidra uses AI to extract exactly what you described from the rendered page. You do not need to know the page structure or write any selectors.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://news.ycombinator.com' }],
  prompt: 'Extract the top 10 post titles and their point scores',
  output: 'json',
})

console.log(job.result.content)
// [{ "title": "Show HN: I built a thing", "points": 342 }, ...]

Setting output: 'json' tells the AI to return structured JSON. The default is 'markdown'.

The AI understands context. It knows a number next to a currency symbol is a price, a short bold line at the top of a product page is probably the title, and a longer block of text is likely a description. You describe the result you want and it finds it on the page.

That said, the SDK also fully supports CSS selectors and XPath for browser interactions when you want to be precise. We will cover that in the browser actions section.

Part 3: Enforcing output shape with JSON schema

Plain prompts are flexible but not predictable. The AI decides what fields to return and what to call them. That works for exploration but causes problems in production when a database or another service expects a consistent shape every single time.

The schema field solves this. Pass a JSON Schema object and the AI must match it exactly. Fields listed in required always appear in the output, as null if the page does not have that value. If you want to generate a schema from an existing JSON sample, the free JSON Schema Generator infers the full structure for you.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://jobs.example.com/senior-engineer' }],
  prompt: 'Extract the job listing details. Normalize salary to a USD number.',
  output: 'json',
  schema: {
    type: 'object',
    required: ['title', 'company', 'remote'],
    properties: {
      title:           { type: 'string' },
      company:         { type: 'string' },
      remote:          { type: ['boolean', 'null'] },
      salary_min:      { type: ['number', 'null'] },
      salary_max:      { type: ['number', 'null'] },
      employment_type: {
        type: ['string', 'null'],
        enum: ['full_time', 'part_time', 'contract', null],
      },
      skills: { type: 'array', items: { type: 'string' } },
    },
  },
})

console.log(job.result.content)
// {
//   title: "Senior Software Engineer",
//   company: "Acme Corp",
//   remote: true,
//   salary_min: 120000,
//   salary_max: 160000,
//   employment_type: "full_time",
//   skills: ["TypeScript", "PostgreSQL", "AWS"]
// }

Since the SDK is TypeScript-native, you can type the result directly:

interface JobListing {
  title: string
  company: string
  remote: boolean | null
  salary_min: number | null
  salary_max: number | null
  employment_type: 'full_time' | 'part_time' | 'contract' | null
  skills: string[]
}

const content = job.result.content as JobListing
console.log(`${content.title} at ${content.company}`)
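Note that `as JobListing` is a compile-time assertion only and does not check the data at runtime. If you do not want to pull in a validation library, a hand-written type guard is one option. The sketch below follows the interface above rather than the schema's required list, so it is stricter than the schema; `isJobListing` is a hypothetical helper, not something the SDK provides.

```typescript
interface JobListing {
  title: string
  company: string
  remote: boolean | null
  salary_min: number | null
  salary_max: number | null
  employment_type: 'full_time' | 'part_time' | 'contract' | null
  skills: string[]
}

const EMPLOYMENT_TYPES: string[] = ['full_time', 'part_time', 'contract']

// Runtime check that an unknown value has the JobListing shape.
function isJobListing(value: unknown): value is JobListing {
  if (typeof value !== 'object' || value === null) return false
  const v = value as Record<string, unknown>
  const { employment_type: employment, skills } = v

  // Accepts null or a value of the given primitive type.
  const nullable = (x: unknown, type: 'number' | 'boolean') =>
    x === null || typeof x === type

  return (
    typeof v.title === 'string' &&
    typeof v.company === 'string' &&
    nullable(v.remote, 'boolean') &&
    nullable(v.salary_min, 'number') &&
    nullable(v.salary_max, 'number') &&
    (employment === null ||
      (typeof employment === 'string' && EMPLOYMENT_TYPES.includes(employment))) &&
    Array.isArray(skills) &&
    skills.every(s => typeof s === 'string')
  )
}
```

Call `isJobListing(job.result.content)` before using the value; inside the `if` branch TypeScript narrows it to `JobListing` without a cast.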

If you use Zod for runtime validation, generate the schema from your existing Zod type and pass it directly:

import { z } from 'zod'
import { zodToJsonSchema } from 'zod-to-json-schema'

const JobListingSchema = z.object({
  title:           z.string(),
  company:         z.string(),
  remote:          z.boolean().nullable(),
  salary_min:      z.number().nullable(),
  salary_max:      z.number().nullable(),
  employment_type: z.enum(['full_time', 'part_time', 'contract']).nullable(),
  skills:          z.array(z.string()),
})

const job = await spidra.scrape.run({
  urls: [{ url: 'https://jobs.example.com/senior-engineer' }],
  prompt: 'Extract the job listing details',
  schema: zodToJsonSchema(JobListingSchema),
})

const listing = JobListingSchema.parse(job.result.content)

That gives you one schema definition in your codebase that handles both runtime validation and the shape of the scraping output.

Part 4: Browser actions

Some pages require interaction before the content you want is visible. A cookie banner blocking everything. A search form that needs filling. Lazy-loaded content that only appears after scrolling. Tabs that hide data by default.

Pass an actions array inside the URL object and those actions execute in order inside a real browser before extraction runs. The full list of available actions is in the browser actions docs.

const job = await spidra.scrape.run({
  urls: [
    {
      url: 'https://example.com/products',
      actions: [
        { type: 'click', selector: '#accept-cookies' },
        { type: 'wait', duration: 1000 },
        { type: 'scroll', to: '80%' },
      ],
    },
  ],
  prompt: 'Extract all product names and prices visible on the page',
})

For click, check, and uncheck actions, you have two options for targeting an element:

selector for a CSS selector (like '#accept-cookies' or '.submit-btn') or an XPath expression
value for a plain English description like 'Accept cookies button', which Spidra locates using AI

Both are valid and you can mix them in the same actions array:

actions: [
  { type: 'click', selector: '#accept-cookies' },  // CSS selector
  { type: 'click', value: 'Search button' },         // plain English
]

Use whichever is more convenient. If the element has a clean, stable ID or class, use selector. If the page is complex or you want the action to survive layout changes, use value.

All available actions

Action	What it does	Key fields
`click`	Clicks a button, link, or any element	`selector` or `value`
`type`	Types text into an input field	`selector`, `value`
`check`	Checks a checkbox	`selector` or `value`
`uncheck`	Unchecks a checkbox	`selector` or `value`
`wait`	Pauses for a number of milliseconds	`duration`
`scroll`	Scrolls to a percentage of the page height	`to` (e.g. `'80%'`)
`forEach`	Finds matching elements and processes each one	`value`, `mode`
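If you build action arrays dynamically, it can help to model the table above as a discriminated union so the compiler catches a missing field. The types below are hypothetical local types written from the table, not the ones exported by the spidra package, whose names and shapes may differ. Two assumptions to note: `Target` forbids passing both selector and value, which the table does not state, and forEach is left out because it takes more fields than the table lists.

```typescript
// Hypothetical local types mirroring the actions table; not the SDK's own types.
// Assumption: an element is targeted by exactly one of `selector` or `value`.
type Target =
  | { selector: string; value?: never }
  | { value: string; selector?: never }

type BrowserAction =
  | ({ type: 'click' | 'check' | 'uncheck' } & Target)
  | { type: 'type'; selector: string; value: string }
  | { type: 'wait'; duration: number }
  | { type: 'scroll'; to: string }

// Turns an action into a one-line description, handy for logging a plan before running it.
function describeAction(action: BrowserAction): string {
  switch (action.type) {
    case 'click':
    case 'check':
    case 'uncheck': {
      // Plain English targets are quoted so they stand out from selectors.
      const target = action.selector ?? `"${action.value}"`
      return `${action.type} ${target}`
    }
    case 'type':
      return `type "${action.value}" into ${action.selector}`
    case 'wait':
      return `wait ${action.duration}ms`
    case 'scroll':
      return `scroll to ${action.to}`
  }
}
```

With these types, `{ type: 'wait' }` without a duration fails to compile, and `describeAction({ type: 'click', value: 'Search button' })` returns `click "Search button"`.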

The forEach action

forEach is the most powerful action in the SDK. It finds a set of matching elements on the page and processes each one individually, combining all the results into a single output.

Three modes:

inline reads the content of each matched element directly. For product cards, table rows, or content that lives inside the element itself.
navigate follows each element as a link, loads the destination page, and scrapes it. For detail pages you need to click into.
click clicks each element to expand or reveal content, then scrapes what appears. For accordions, modals, or expandable sections.

const job = await spidra.scrape.run({
  urls: [
    {
      url: 'https://directory.example.com/companies',
      actions: [
        { type: 'click', value: 'Accept cookies' },
        {
          type: 'forEach',
          value: 'Find all company listing cards',
          mode: 'navigate',
          maxItems: 20,
          itemPrompt: 'Extract company name, website, and industry',
          pagination: {
            nextSelector: 'a.next-page',
            maxPages: 3,
          },
        },
      ],
    },
  ],
  output: 'json',
})

This dismisses the cookie banner, finds every company card on the page, navigates into each company profile, extracts the company details, and repeats across up to three pages of pagination, stopping at 20 items. One request, one await.

Part 5: Proxy and geo-targeting

Some sites block cloud infrastructure IP ranges or serve different content based on location. Set useProxy: true to route through a residential proxy.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://www.amazon.de/gp/bestsellers' }],
  prompt: 'List the top 10 products with name and price',
  useProxy: true,
  proxyCountry: 'de',
})

proxyCountry accepts:

A two-letter ISO country code like 'us', 'de', 'gb', 'fr', 'jp'
'eu' to rotate randomly across all 27 EU member states
'global', or omit the field entirely, for no country preference

Proxy usage is billed from your bandwidth quota, not your credits.
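When proxyCountry comes from user input or a config file, a quick shape check before the request can save a failed call. The sketch below only covers the three forms listed above. It assumes lowercase values, as in the examples, and it does not know whether a given two-letter code is a real country or one Spidra supports. `normalizeProxyCountry` is a hypothetical helper name.

```typescript
// Returns a normalized proxyCountry value, or null if the input does not look valid.
// Assumption: the API expects lowercase values, as in the examples above.
function normalizeProxyCountry(input: string): string | null {
  const code = input.trim().toLowerCase()
  if (code === 'eu' || code === 'global') return code
  // Any two ASCII letters pass; this checks the shape only, not the ISO registry.
  return /^[a-z]{2}$/.test(code) ? code : null
}

console.log(normalizeProxyCountry(' DE '))    // de
console.log(normalizeProxyCountry('germany')) // null
```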

Part 6: Scraping pages behind a login

Pass session cookies to access authenticated content. Log in through your browser, open DevTools, copy the Cookie header from any authenticated request, and pass it as a string.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://app.example.com/dashboard' }],
  prompt: 'Extract the monthly revenue and active user count',
  cookies: 'session=abc123; auth_token=xyz789',
})

Standard cookie format (name=value; name2=value2) and Chrome DevTools paste format both work.
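If your cookies live in a config object rather than a pasted header, you can build the standard name=value; name2=value2 string yourself. A minimal sketch follows; `toCookieHeader` is a hypothetical helper, and the check for `;` and line breaks is a precaution added here, not a documented requirement.

```typescript
// Builds a standard Cookie header string from name/value pairs.
function toCookieHeader(cookies: Record<string, string>): string {
  return Object.entries(cookies)
    .map(([name, value]) => {
      // A ';' or line break inside a name or value would corrupt the header.
      if (/[;\r\n]/.test(name) || /[;\r\n]/.test(value)) {
        throw new Error(`Invalid character in cookie "${name}"`)
      }
      return `${name}=${value}`
    })
    .join('; ')
}

console.log(toCookieHeader({ session: 'abc123', auth_token: 'xyz789' }))
// session=abc123; auth_token=xyz789
```

The result can be passed as the cookies option shown above.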

Part 7: Stripping boilerplate

extractContentOnly strips navigation, headers, footers, and sidebars before extraction runs. Useful for articles, documentation pages, and any page where the main content is surrounded by heavy navigation.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://blog.example.com/long-article' }],
  prompt: 'Summarize this article in three sentences',
  extractContentOnly: true,
})

Part 8: Screenshots

Capture screenshots of pages for debugging, monitoring, or archival.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://example.com' }],
  screenshot: true,
  fullPageScreenshot: true,
})

console.log(job.result.screenshots)  // array of URLs

screenshot: true captures the visible viewport. fullPageScreenshot: true captures the entire scrollable page.

Part 9: Batch scraping

When you have a list of URLs, the batch endpoint processes up to 50 at a time in parallel. Each URL runs in its own independent worker.

const batch = await spidra.batch.run({
  urls: [
    'https://www.toyota.com/camry',
    'https://www.ford.com/cars/mustang/',
    'https://www.kia.com/us/en/sportage',
  ],
  prompt: 'Extract the vehicle name and starting price',
  output: 'json',
})

console.log(`${batch.completedCount}/${batch.totalUrls} completed`)

for (const item of batch.items) {
  if (item.status === 'completed') {
    console.log(item.url, item.result)
  } else {
    console.error(`Failed: ${item.url} — ${item.error}`)
  }
}

Processing large URL lists

The batch endpoint caps at 50 URLs per request. For larger lists, chunk them:

async function scrapeAll(urls: string[], prompt: string) {
  const results: Array<{ url: string; data: unknown }> = []
  const chunkSize = 50

  for (let i = 0; i < urls.length; i += chunkSize) {
    const chunk = urls.slice(i, i + chunkSize)
    const batchNum = Math.floor(i / chunkSize) + 1
    const totalBatches = Math.ceil(urls.length / chunkSize)

    console.log(`Processing batch ${batchNum} of ${totalBatches}...`)

    const batch = await spidra.batch.run({
      urls: chunk,
      prompt,
      output: 'json',
    })

    for (const item of batch.items) {
      if (item.status === 'completed') {
        results.push({ url: item.url, data: item.result })
      } else {
        console.warn(`Failed: ${item.url}`)
      }
    }
  }

  return results
}

const urls = Array.from(
  { length: 200 },
  (_, i) => `https://example.com/product/${i + 1}`
)

const results = await scrapeAll(urls, 'Extract product name and price')

Managing batches

To retry failed items you need the batchId, which only exists on the object returned by submit(), not on the response from run(). Use submit() when you need retry control:

import { setTimeout as sleep } from 'node:timers/promises'

const queued = await spidra.batch.submit({
  urls: [
    'https://www.toyota.com/camry',
    'https://www.ford.com/cars/mustang/',
    'https://www.kia.com/us/en/sportage',
  ],
  prompt: 'Extract the vehicle name and starting price',
  output: 'json',
})

// Poll every 3 seconds until the batch reaches a terminal state
let batch
do {
  await sleep(3000)
  batch = await spidra.batch.get(queued.batchId)
} while (!['completed', 'failed', 'cancelled'].includes(batch.status))

if (batch.failedCount > 0) {
  await spidra.batch.retry(queued.batchId)
}

Cancel a running batch and get credits refunded for items that have not started yet:

const response = await spidra.batch.cancel(batchId)
console.log(`Cancelled ${response.cancelledItems} items, refunded ${response.creditsRefunded} credits`)

Part 10: Crawling entire websites

Batch scraping works when you already know the URLs. Crawling is for when you want Spidra to discover them for you.

Give it a starting URL, describe which links to follow, and describe what to extract from each page. Spidra loads the base URL, finds matching links, visits each one up to your maxPages limit, and applies your transformInstruction to every page it visits.

import { SpidraClient } from 'spidra'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

const job = await spidra.crawl.run({
  baseUrl: 'https://competitor.com/blog',
  crawlInstruction: 'Follow links to blog posts only. Skip tag pages, category pages, and the homepage.',
  transformInstruction: 'Extract the post title, author name, publish date, and a one-sentence summary.',
  maxPages: 30,
  useProxy: true,
})

for (const page of job.result) {
  console.log(page.url, page.data)
}

Three fields are required: baseUrl, crawlInstruction, and transformInstruction. maxPages defaults to 5 and can be set up to 50. If you need more, reach out to the team.

For larger crawls that take more time, the default 120-second timeout may not be enough. If you are hitting timeouts, fire the crawl with submit() and poll with get() yourself:

const queued = await spidra.crawl.submit({
  baseUrl: 'https://docs.example.com',
  crawlInstruction: 'Follow all documentation pages. Skip changelog and login pages.',
  transformInstruction: 'Extract the page title and full body text.',
  maxPages: 20,
})

// Poll every 10 seconds
let status = await spidra.crawl.get(queued.jobId)

while (status.status !== 'completed' && status.status !== 'failed') {
  await new Promise(r => setTimeout(r, 10000))
  status = await spidra.crawl.get(queued.jobId)
  console.log(`Status: ${status.status}`)
}

for (const page of status.result ?? []) {
  console.log(page.url, page.data)
}

Re-extracting with a different prompt

If you crawled a site and want to pull out different information, use extract() to run a new AI pass over the already-crawled content without making new browser requests:

const queued = await spidra.crawl.extract(
  completedJobId,
  'Extract only product SKUs and prices as structured JSON',
)

const result = await spidra.crawl.get(queued.jobId)

Part 11: Using the SDK in different environments

Next.js API route

// app/api/scrape/route.ts
import { SpidraClient } from 'spidra'
import { NextResponse } from 'next/server'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

export async function POST(request: Request) {
  const { url, prompt } = await request.json()

  try {
    const job = await spidra.scrape.run({
      urls: [{ url }],
      prompt,
      output: 'json',
    })

    return NextResponse.json({ data: job.result.content })
  } catch (error) {
    return NextResponse.json({ error: 'Scrape failed' }, { status: 500 })
  }
}

Express

import express from 'express'
import { SpidraClient } from 'spidra'

const app = express()
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

app.use(express.json())

app.post('/scrape', async (req, res) => {
  const { url, prompt } = req.body

  try {
    const job = await spidra.scrape.run({
      urls: [{ url }],
      prompt,
      output: 'json',
    })
    res.json({ data: job.result.content })
  } catch (err) {
    res.status(500).json({ error: 'Scrape failed' })
  }
})

app.listen(3000)
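Both handlers above pass url and prompt from the request body straight to the SDK. In a public endpoint you will probably want to validate that input first. The sketch below is one way to do it using the standard `URL` class; `parseScrapeRequest` and `ScrapeRequest` are hypothetical names, and restricting the protocol to http and https is a choice made here, not a Spidra requirement.

```typescript
interface ScrapeRequest {
  url: string
  prompt?: string
}

// Returns a validated request, or null if the body is not usable.
function parseScrapeRequest(body: unknown): ScrapeRequest | null {
  if (typeof body !== 'object' || body === null) return null
  const { url, prompt } = body as Record<string, unknown>

  if (typeof url !== 'string') return null
  if (prompt !== undefined && typeof prompt !== 'string') return null

  try {
    // new URL() throws on malformed input.
    const parsed = new URL(url)
    if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') return null
  } catch {
    return null
  }

  return typeof prompt === 'string' ? { url, prompt } : { url }
}
```

In either handler, call it on the parsed body and respond with a 400 status when it returns null, before any credits are spent on the scrape.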

Bun

The SDK works with Bun out of the box. No changes needed.

bun add spidra

import { SpidraClient } from 'spidra'

const spidra = new SpidraClient({ apiKey: Bun.env.SPIDRA_API_KEY! })

const job = await spidra.scrape.run({
  urls: [{ url: 'https://example.com' }],
  prompt: 'Extract the main headline',
})

console.log(job.result.content)

Part 12: Error handling

Every API error maps to a typed exception class. Catch exactly what you care about and let everything else propagate.

import {
  SpidraError,
  SpidraAuthenticationError,
  SpidraInsufficientCreditsError,
  SpidraRateLimitError,
  SpidraServerError,
} from 'spidra'

try {
  const job = await spidra.scrape.run({
    urls: [{ url: 'https://example.com' }],
    prompt: 'Extract the main headline',
  })
  console.log(job.result.content)

} catch (err) {
  if (err instanceof SpidraAuthenticationError) {
    console.error('API key is missing or invalid. Check your SPIDRA_API_KEY.')
  } else if (err instanceof SpidraInsufficientCreditsError) {
    // 403 covers both exhausted credits AND invalid API key
    if (err.message.includes('Invalid token') || err.message.includes('API key')) {
      console.error('API key is invalid. Check your SPIDRA_API_KEY.')
    } else {
      console.error('Account is out of credits. Top up at app.spidra.io.')
    }
  } else if (err instanceof SpidraRateLimitError) {
    console.warn('Rate limit hit. Slow down and retry.')
  } else if (err instanceof SpidraServerError) {
    console.error(`Server error (${err.status}): ${err.message}. Retry is usually safe.`)
  } else if (err instanceof SpidraError) {
    console.error(`API error ${err.status}: ${err.message}`)
  } else {
    throw err
  }
}

Exception	HTTP status	When it fires
`SpidraAuthenticationError`	401	API key missing (no key sent)
`SpidraInsufficientCreditsError`	403	No credits remaining, or API key is invalid
`SpidraRateLimitError`	429	Too many requests
`SpidraServerError`	500	Unexpected error on Spidra's side
`SpidraError`	any	Base class for all exceptions
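For the two retryable cases in the table, rate limits and server errors, a small backoff wrapper avoids hand-written retry loops. The sketch below is generic and does not import the SDK. It decides what to retry from a numeric `status` property, which is an assumption based on the `err.status` field used in the example above; checking `err instanceof SpidraRateLimitError` or `SpidraServerError` would be the equivalent with the real classes. `withRetry`, `isRetryableStatus`, and the default delays are hypothetical.

```typescript
// Treats 429 and 5xx as retryable, based on the statuses in the table above.
function isRetryableStatus(err: unknown): boolean {
  if (typeof err !== 'object' || err === null) return false
  const status = (err as { status?: unknown }).status
  return typeof status === 'number' && (status === 429 || status >= 500)
}

// Runs `fn`, retrying retryable failures with exponential backoff.
async function withRetry<T>(
  fn: () => Promise<T>,
  isRetryable: (err: unknown) => boolean = isRetryableStatus,
  maxAttempts = 3,
  baseDelayMs = 1000,
): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // Give up on the last attempt or on errors that retrying cannot fix.
      if (attempt >= maxAttempts || !isRetryable(err)) throw err
      // Delay doubles each time: baseDelayMs, then 2x, then 4x.
      const delay = baseDelayMs * 2 ** (attempt - 1)
      await new Promise<void>(resolve => setTimeout(resolve, delay))
    }
  }
}
```

Usage would look like `withRetry(() => spidra.scrape.run({ ... }))`. Authentication and credit errors are rethrown immediately, since waiting does not resolve them.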

Also check the ai_extraction_failed flag in the result. If AI extraction fails for any reason, Spidra falls back to raw Markdown and sets this flag:

const job = await spidra.scrape.run({
  urls: [{ url: 'https://example.com' }],
  prompt: 'Extract the main headline',
})

if (job.result.ai_extraction_failed) {
  // AI extraction failed — job.result.content contains the raw fallback text
  const raw = job.result.content
  console.warn('AI extraction failed, falling back to raw content')
  console.log(raw)
} else {
  console.log(job.result.content)
}

Putting it all together: a complete pipeline

The following full example uses forEach with pagination to collect job listings from a directory, enforces a schema on the output, handles errors, and saves the results to a JSONL file. The same pattern can serve as the foundation of a lead generation pipeline:

import { SpidraClient, SpidraError, SpidraInsufficientCreditsError } from 'spidra'
import { writeFileSync } from 'fs'
import * as os from 'os'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

const JOB_SCHEMA = {
  type: 'object',
  required: ['title', 'company', 'location'],
  properties: {
    title:           { type: 'string' },
    company:         { type: 'string' },
    location:        { type: ['string', 'null'] },
    remote:          { type: ['boolean', 'null'] },
    salary_min:      { type: ['number', 'null'] },
    salary_max:      { type: ['number', 'null'] },
    employment_type: {
      type: ['string', 'null'],
      enum: ['full_time', 'part_time', 'contract', null],
    },
  },
}

async function collectListings(boardUrl: string) {
  try {
    const job = await spidra.scrape.run({
      urls: [
        {
          url: boardUrl,
          actions: [
            { type: 'click', value: 'Accept cookies' },
            {
              type: 'forEach',
              value: 'Find all job listing cards',
              mode: 'navigate',
              maxItems: 50,
              itemPrompt: 'Extract job title, company, location, remote status, salary range, and employment type',
              pagination: {
                nextSelector: 'a.next-page',
                maxPages: 3,
              },
            },
          ],
        },
      ],
      output: 'json',
      schema: JOB_SCHEMA,
    })

    if (job.result.ai_extraction_failed) {
      console.warn(`AI extraction failed for ${boardUrl}`)
      return []
    }

    const content = job.result.content
    return Array.isArray(content) ? content : [content]

  } catch (err) {
    if (err instanceof SpidraInsufficientCreditsError) {
      throw err // bubble up — stop processing
    }
    if (err instanceof SpidraError) {
      console.error(`Error scraping ${boardUrl}: ${err.message}`)
      return []
    }
    throw err
  }
}

const boards = [
  'https://jobs.example.com/engineering',
  'https://careers.anothersite.com/remote',
]

const allJobs: unknown[] = []

for (const board of boards) {
  console.log(`Collecting from ${board}...`)
  const listings = await collectListings(board)
  allJobs.push(...listings)
  console.log(`  Got ${listings.length} listings`)
}

const jsonl = allJobs.map(job => JSON.stringify(job)).join(os.EOL)
writeFileSync('jobs.jsonl', jsonl)

console.log(`\nDone. ${allJobs.length} jobs saved to jobs.jsonl`)

All scrape options

Option	Type	Description
`urls`	array	Up to 3 URL objects. Each takes a `url` and optional `actions`.
`prompt`	string	What to extract, in plain English
`output`	string	`'markdown'` (default) or `'json'`
`schema`	object	JSON Schema for a guaranteed output shape
`useProxy`	boolean	Route through a residential proxy
`proxyCountry`	string	Two-letter country code or `'eu'` / `'global'`
`extractContentOnly`	boolean	Strip navigation, headers, footers, and sidebars before extraction
`screenshot`	boolean	Capture a viewport screenshot
`fullPageScreenshot`	boolean	Capture a full-page screenshot
`cookies`	string	Raw `Cookie` header string for authenticated pages
