June 12, 2026 · 15 min read

Spidra API Node.js tutorial: scrape any website with JavaScript and TypeScript

Joel Olawanle

Web scraping in Node.js has a familiar progression. You start with axios or node-fetch for static pages. Then a modern site returns an empty HTML shell and you reach for Puppeteer. Then Cloudflare blocks you and you spend an evening on stealth plugins. Then the page structure changes and your selectors are worthless again.

Spidra's Node.js SDK (spidra-js) cuts across all of that. You describe what you want from a page in plain English, and the SDK returns structured data. The browser rendering, anti-bot bypass, CAPTCHA solving, and AI extraction all run on Spidra's infrastructure. Your code just handles the result.

This tutorial covers the full SDK, from installation through crawling an entire website. The SDK is TypeScript-native so you get complete type safety out of the box. Every example works as-is with no additional configuration.

Prerequisites

  • Node.js 18 or higher
  • A Spidra API key from app.spidra.io under Settings → API Keys

Installation

npm install spidra-js

The package includes TypeScript types. You do not need a separate @types/spidra-js package.

Store your API key as an environment variable. Never hardcode it in source files.

export SPIDRA_API_KEY="spd_YOUR_API_KEY"

Setting up the client

Import SpidraClient and initialise it with your API key.

TypeScript / ESM:

import { SpidraClient } from 'spidra-js'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

CommonJS:

const { SpidraClient } = require('spidra-js')

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY })

The client exposes five namespaces:

| Namespace | What it handles |
| --- | --- |
| spidra.scrape | Scraping one to three URLs with browser automation and AI extraction |
| spidra.batch | Processing up to 50 URLs in parallel |
| spidra.crawl | Discovering and scraping pages across an entire website |
| spidra.logs | History of every scrape your API key has made |
| spidra.usage | Credit and request consumption statistics |

Every method is async and returns a Promise. The examples below use top-level await for clarity. If your project does not support top-level await, wrap the calls in an async function.

Scraping a page

Your first scrape

import { SpidraClient } from 'spidra-js'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

const job = await spidra.scrape.run({
  urls: [{ url: 'https://news.ycombinator.com' }],
})

console.log(job.result.content)

Without a prompt, Spidra loads the page in a real browser, executes all JavaScript, and returns the full rendered content as Markdown. That is what ends up in job.result.content.

How the job lifecycle works

run() submits the job and polls in the background until it completes. From your side it looks like a single await. Under the hood, the job moves through these states:

waiting → active → completed (or failed)

If you want to submit a job and check on it yourself rather than waiting, use submit() and get() separately:

// Submit and get a job ID immediately
const queued = await spidra.scrape.submit({
  urls: [{ url: 'https://example.com' }],
  prompt: 'Extract the main headline',
})

console.log(`Job submitted: ${queued.jobId}`)

// Check later
await new Promise(r => setTimeout(r, 5000))
const status = await spidra.scrape.get(queued.jobId)

if (status.status === 'completed') {
  console.log(status.result.content)
} else if (status.status === 'failed') {
  console.error(`Failed: ${status.error}`)
}
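
The snippet above checks once after a fixed delay. A reusable polling loop with an overall deadline could look roughly like the sketch below. pollUntilSettled, JobSnapshot, and the option names are hypothetical helpers written for this illustration and are not part of spidra-js; the status strings are the ones listed in the lifecycle above, and the sketch assumes the object returned by get() carries that status field.

```typescript
// Hypothetical helper, not part of spidra-js.
type JobStatus = 'waiting' | 'active' | 'completed' | 'failed'

interface JobSnapshot {
  status: JobStatus
}

interface PollOptions {
  intervalMs: number // delay between checks
  timeoutMs: number  // give up once this much time has passed
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

async function pollUntilSettled<T extends JobSnapshot>(
  fetchJob: () => Promise<T>,
  { intervalMs, timeoutMs }: PollOptions,
): Promise<T> {
  const deadline = Date.now() + timeoutMs

  while (true) {
    const job = await fetchJob()

    // 'completed' and 'failed' are the two terminal states
    if (job.status === 'completed' || job.status === 'failed') {
      return job
    }

    // Stop if the next check would land past the deadline
    if (Date.now() + intervalMs > deadline) {
      throw new Error(`Job still ${job.status} after ${timeoutMs} ms`)
    }

    await sleep(intervalMs)
  }
}
```

With the queued job from the example above, the call would be along the lines of pollUntilSettled(() => spidra.scrape.get(queued.jobId), { intervalMs: 5000, timeoutMs: 120000 }). Taking the fetcher as a function keeps the loop independent of the SDK, so the same helper should work for crawl jobs too.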

Extracting data with prompts

Add a prompt and Spidra uses AI to extract exactly what you described from the rendered page. You do not need to know the page structure or write any selectors.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://news.ycombinator.com' }],
  prompt: 'Extract the top 10 post titles and their point scores',
  output: 'json',
})

console.log(job.result.content)
// [{ "title": "Show HN: I built a thing", "points": 342 }, ...]

Setting output: 'json' tells the AI to return structured JSON. The default is 'markdown'.

The AI understands context. It knows a number next to a currency symbol is a price, a short bold line at the top of a product page is probably the title, and a longer block of text is likely a description. You describe the result you want and it finds it on the page.

That said, the SDK also fully supports CSS selectors and XPath for browser interactions when you want to be precise. We will cover that in the browser actions section.

Enforcing output shape with JSON schema

Plain prompts are flexible but not predictable. The AI decides what fields to return and what to call them. That works for exploration but causes problems in production when a database or another service expects a consistent shape every single time.

The schema field solves this. Pass a JSON Schema object and the AI must match it exactly. Fields in required always appear in the output, as null if the page does not have that value.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://jobs.example.com/senior-engineer' }],
  prompt: 'Extract the job listing details. Normalize salary to a USD number.',
  output: 'json',
  schema: {
    type: 'object',
    required: ['title', 'company', 'remote'],
    properties: {
      title:           { type: 'string' },
      company:         { type: 'string' },
      remote:          { type: ['boolean', 'null'] },
      salary_min:      { type: ['number', 'null'] },
      salary_max:      { type: ['number', 'null'] },
      employment_type: {
        type: ['string', 'null'],
        enum: ['full_time', 'part_time', 'contract', null],
      },
      skills: { type: 'array', items: { type: 'string' } },
    },
  },
})

console.log(job.result.content)
// {
//   title: "Senior Software Engineer",
//   company: "Acme Corp",
//   remote: true,
//   salary_min: 120000,
//   salary_max: 160000,
//   employment_type: "full_time",
//   skills: ["TypeScript", "PostgreSQL", "AWS"]
// }

Since the SDK is TypeScript-native, you can type the result directly:

interface JobListing {
  title: string
  company: string
  remote: boolean | null
  salary_min: number | null
  salary_max: number | null
  employment_type: 'full_time' | 'part_time' | 'contract' | null
  skills: string[]
}

const content = job.result.content as JobListing
console.log(`${content.title} at ${content.company}`)
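
Keep in mind that a type assertion with as only informs the compiler; nothing is checked at runtime. If you are not using Zod, a hand-written type guard is one way to verify the shape before relying on it. The sketch below is an illustration only: isJobListing is a hypothetical helper, and the interface is repeated so the block stands on its own. It is deliberately strict, so a field that is missing from the output (rather than null) fails the check.

```typescript
interface JobListing {
  title: string
  company: string
  remote: boolean | null
  salary_min: number | null
  salary_max: number | null
  employment_type: 'full_time' | 'part_time' | 'contract' | null
  skills: string[]
}

const EMPLOYMENT_TYPES = ['full_time', 'part_time', 'contract']

// True when the value is null or has the given primitive type
const isNullOr = (value: unknown, type: 'boolean' | 'number'): boolean =>
  value === null || typeof value === type

// Hypothetical helper: narrows unknown data to JobListing at runtime.
function isJobListing(value: unknown): value is JobListing {
  if (typeof value !== 'object' || value === null) return false

  const { title, company, remote, salary_min, salary_max, employment_type, skills } =
    value as Record<string, unknown>

  const employmentOk =
    employment_type === null ||
    (typeof employment_type === 'string' && EMPLOYMENT_TYPES.includes(employment_type))

  return (
    typeof title === 'string' &&
    typeof company === 'string' &&
    isNullOr(remote, 'boolean') &&
    isNullOr(salary_min, 'number') &&
    isNullOr(salary_max, 'number') &&
    employmentOk &&
    Array.isArray(skills) &&
    skills.every(skill => typeof skill === 'string')
  )
}
```

You would then branch on isJobListing(job.result.content) and only read the fields inside the branch where the check passed.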

If you use Zod for runtime validation, generate the schema from your existing Zod type and pass it directly:

import { z } from 'zod'
import { zodToJsonSchema } from 'zod-to-json-schema'

const JobListingSchema = z.object({
  title:           z.string(),
  company:         z.string(),
  remote:          z.boolean().nullable(),
  salary_min:      z.number().nullable(),
  salary_max:      z.number().nullable(),
  employment_type: z.enum(['full_time', 'part_time', 'contract']).nullable(),
  skills:          z.array(z.string()),
})

const job = await spidra.scrape.run({
  urls: [{ url: 'https://jobs.example.com/senior-engineer' }],
  prompt: 'Extract the job listing details',
  output: 'json',
  schema: zodToJsonSchema(JobListingSchema),
})

const listing = JobListingSchema.parse(job.result.content)

That gives you one schema definition in your codebase that handles both runtime validation and the shape of the scraping output.

Browser actions

Some pages require interaction before the content you want is visible. A cookie banner blocking everything. A search form that needs filling. Lazy-loaded content that only appears after scrolling. Tabs that hide data by default.

Pass an actions array inside the URL object and those actions execute in order inside a real browser before extraction runs.

const job = await spidra.scrape.run({
  urls: [
    {
      url: 'https://example.com/products',
      actions: [
        { type: 'click', selector: '#accept-cookies' },
        { type: 'wait', duration: 1000 },
        { type: 'scroll', to: '80%' },
      ],
    },
  ],
  prompt: 'Extract all product names and prices visible on the page',
})

For click, check, and uncheck actions, you have two options for targeting an element:

  • selector for a CSS selector or XPath expression like '#accept-cookies' or '.submit-btn'
  • value for a plain English description like 'Accept cookies button', which Spidra locates using AI

Both are valid, and you can mix them in the same actions array:

actions: [
  { type: 'click', selector: '#accept-cookies' },  // CSS selector
  { type: 'click', value: 'Search button' },         // plain English
]

Use whichever is more convenient. If the element has a clean, stable ID or class, use selector. If the page is complex or you want the action to survive layout changes, use value.
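
That rule of thumb can be captured in a small helper that prefers a known selector and falls back to a description. This is only a sketch: ClickAction, Target, and clickAction are hypothetical names for illustration, and the SDK ships its own types for actions.

```typescript
// Hypothetical shapes for illustration; the SDK ships its own types.
type ClickAction =
  | { type: 'click'; selector: string }
  | { type: 'click'; value: string }

interface Target {
  selector?: string    // stable CSS selector or XPath, if one is known
  description?: string // plain English fallback
}

// Prefer the precise selector and fall back to the description.
function clickAction({ selector, description }: Target): ClickAction {
  if (selector) return { type: 'click', selector }
  if (description) return { type: 'click', value: description }
  throw new Error('A click action needs a selector or a description')
}

const actions = [
  clickAction({ selector: '#accept-cookies' }),
  clickAction({ description: 'Search button' }),
]
```

The resulting array has the same shape as the mixed example above, so it can go straight into the actions field of a URL object.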

All available actions

| Action | What it does | Key fields |
| --- | --- | --- |
| click | Clicks a button, link, or any element | selector or value |
| type | Types text into an input field | selector, value |
| check | Checks a checkbox | selector or value |
| uncheck | Unchecks a checkbox | selector or value |
| wait | Pauses for a number of milliseconds | duration |
| scroll | Scrolls to a percentage of the page height | to (e.g. '80%') |
| forEach | Finds matching elements and processes each one | value, mode |

The forEach action

forEach is the most powerful action in the SDK. It finds a set of matching elements on the page and processes each one individually, combining all the results into a single output.

Three modes:

  • inline reads the content of each matched element directly. For product cards, table rows, or content that lives inside the element itself.
  • navigate follows each element as a link, loads the destination page, and scrapes it. For detail pages you need to click into.
  • click clicks each element to expand or reveal content, then scrapes what appears. For accordions, modals, or expandable sections.

const job = await spidra.scrape.run({
  urls: [
    {
      url: 'https://directory.example.com/companies',
      actions: [
        { type: 'click', value: 'Accept cookies' },
        {
          type: 'forEach',
          value: 'Find all company listing cards',
          mode: 'navigate',
          maxItems: 20,
          itemPrompt: 'Extract company name, website, and industry',
          pagination: {
            nextSelector: 'a.next-page',
            maxPages: 3,
          },
        },
      ],
    },
  ],
  output: 'json',
})

This dismisses the cookie banner, finds every company card on the page, navigates into each company profile, extracts the company details, and repeats across three pages of pagination. One request, one await.

Proxy and geo-targeting

Some sites block cloud infrastructure IP ranges or serve different content based on location. Set useProxy: true to route through a residential proxy.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://www.amazon.de/gp/bestsellers' }],
  prompt: 'List the top 10 products with name and price',
  useProxy: true,
  proxyCountry: 'de',
})

proxyCountry accepts:

  • A two-letter ISO country code like 'us', 'de', 'gb', 'fr', 'jp'
  • 'eu' to rotate randomly across all 27 EU member states
  • 'global' or omit it for no country preference

Proxy usage is billed from your bandwidth quota, not your credits.

Scraping pages behind a login

Pass session cookies to access authenticated content. Log in through your browser, open DevTools, copy the Cookie header from any authenticated request, and pass it as a string.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://app.example.com/dashboard' }],
  prompt: 'Extract the monthly revenue and active user count',
  cookies: 'session=abc123; auth_token=xyz789',
})

Standard cookie format (name=value; name2=value2) and Chrome DevTools paste format both work.
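
Session cookies are credentials, so the same advice as for the API key applies: keep them out of source files. One way to do that is to load the values from wherever you store secrets and assemble the header string at runtime. A minimal sketch, where toCookieHeader is a hypothetical helper and the cookie names are the placeholders from the example above:

```typescript
// Hypothetical helper: joins cookie name/value pairs into a Cookie header string.
function toCookieHeader(cookies: Record<string, string>): string {
  return Object.entries(cookies)
    .map(([name, value]) => `${name}=${value}`)
    .join('; ')
}

// In a real project these values would come from environment variables
// or a secrets manager rather than string literals.
const header = toCookieHeader({ session: 'abc123', auth_token: 'xyz789' })

console.log(header) // session=abc123; auth_token=xyz789
```

The result is the standard name=value; name2=value2 string and can be passed as the cookies option.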

Stripping boilerplate

extractContentOnly strips navigation, headers, footers, and sidebars before extraction runs. Useful for articles, documentation pages, and any page where the main content is surrounded by heavy navigation.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://blog.example.com/long-article' }],
  prompt: 'Summarize this article in three sentences',
  extractContentOnly: true,
})

Screenshots

Capture screenshots of pages for debugging, monitoring, or archival.

const job = await spidra.scrape.run({
  urls: [{ url: 'https://example.com' }],
  screenshot: true,
  fullPageScreenshot: true,
})

console.log(job.result.screenshots)  // array of URLs

screenshot: true captures the visible viewport. fullPageScreenshot: true captures the entire scrollable page.

Batch scraping

When you have a list of URLs, the batch endpoint processes up to 50 at a time in parallel. Each URL runs in its own independent worker.

const batch = await spidra.batch.run({
  urls: [
    'https://shop.example.com/product/1',
    'https://shop.example.com/product/2',
    'https://shop.example.com/product/3',
  ],
  prompt: 'Extract the product name, price, and whether it is in stock',
  output: 'json',
})

console.log(`${batch.completedCount}/${batch.totalUrls} completed`)

for (const item of batch.items) {
  if (item.status === 'completed') {
    console.log(item.url, item.result)
  } else {
    console.error(`Failed: ${item.url} — ${item.error}`)
  }
}
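
If you need the two groups separately, for example to store the successes and log or resubmit the rest, a small partition step keeps the loop above tidy. This is a sketch: BatchItem is a hypothetical shape that mirrors only the fields used in the loop above, and partitionItems is not part of the SDK.

```typescript
// Hypothetical item shape, mirroring only the fields used in the loop above.
interface BatchItem {
  url: string
  status: string
  result?: unknown
  error?: string
}

// Splits items into completed ones and everything else.
function partitionItems(items: BatchItem[]) {
  const succeeded: BatchItem[] = []
  const failed: BatchItem[] = []

  for (const item of items) {
    if (item.status === 'completed') {
      succeeded.push(item)
    } else {
      failed.push(item)
    }
  }

  return { succeeded, failed }
}
```

Calling partitionItems(batch.items) would then give you failed.map(item => item.url) as a ready-made list of URLs to inspect.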

Processing large URL lists

The batch endpoint caps at 50 URLs per request. For larger lists, chunk them:

async function scrapeAll(urls: string[], prompt: string) {
  const results: Array<{ url: string; data: unknown }> = []
  const chunkSize = 50

  for (let i = 0; i < urls.length; i += chunkSize) {
    const chunk = urls.slice(i, i + chunkSize)
    const batchNum = Math.floor(i / chunkSize) + 1
    const totalBatches = Math.ceil(urls.length / chunkSize)

    console.log(`Processing batch ${batchNum} of ${totalBatches}...`)

    const batch = await spidra.batch.run({
      urls: chunk,
      prompt,
      output: 'json',
    })

    for (const item of batch.items) {
      if (item.status === 'completed') {
        results.push({ url: item.url, data: item.result })
      } else {
        console.warn(`Failed: ${item.url}`)
      }
    }
  }

  return results
}

const urls = Array.from(
  { length: 200 },
  (_, i) => `https://example.com/product/${i + 1}`
)

const results = await scrapeAll(urls, 'Extract product name and price')

Managing batches

Retry failed items without resubmitting the ones that already succeeded:

if (batch.failedCount > 0) {
  await spidra.batch.retry(batch.batchId)
}

Cancel a running batch and get credits refunded for items that have not started yet:

// batchId identifies a batch that is still running
const response = await spidra.batch.cancel(batchId)
console.log(`Cancelled ${response.cancelledItems} items, refunded ${response.creditsRefunded} credits`)

Crawling entire websites

Batch scraping works when you already know the URLs. Crawling is for when you want Spidra to discover them for you.

Give it a starting URL, describe which links to follow, and describe what to extract from each page. Spidra loads the base URL, finds matching links, visits each one up to your maxPages limit, and applies your transformInstruction to every page it visits.

import { SpidraClient } from 'spidra-js'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

const job = await spidra.crawl.run({
  baseUrl: 'https://competitor.com/blog',
  crawlInstruction: 'Follow links to blog posts only. Skip tag pages, category pages, and the homepage.',
  transformInstruction: 'Extract the post title, author name, publish date, and a one-sentence summary.',
  maxPages: 20,
  useProxy: true,
})

for (const page of job.result) {
  console.log(page.url, page.data)
}

Three fields are required: baseUrl, crawlInstruction, and transformInstruction. maxPages defaults to 5 and can be set up to 20.

For larger crawls that take more time, the default 120-second timeout may not be enough. If you are hitting timeouts, fire the crawl with submit() and poll with get() yourself:

const queued = await spidra.crawl.submit({
  baseUrl: 'https://docs.example.com',
  crawlInstruction: 'Follow all documentation pages. Skip changelog and login pages.',
  transformInstruction: 'Extract the page title and full body text.',
  maxPages: 20,
})

// Poll every 10 seconds
let status = await spidra.crawl.get(queued.jobId)

while (status.status !== 'completed' && status.status !== 'failed') {
  await new Promise(r => setTimeout(r, 10000))
  status = await spidra.crawl.get(queued.jobId)
  console.log(`Status: ${status.status}`)
}

for (const page of status.result ?? []) {
  console.log(page.url, page.data)
}

Re-extracting with a different prompt

If you crawled a site and want to pull out different information, use extract() to run a new AI pass over the already-crawled content without making new browser requests:

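// completedJobId is the jobId of a crawl that has already finished
const queued = await spidra.crawl.extract(
  completedJobId,
  'Extract only product SKUs and prices as structured JSON',
)

// extract() queues a new job, so check its status before reading the result
const result = await spidra.crawl.get(queued.jobId)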

Using the SDK in different environments

Next.js API route

// app/api/scrape/route.ts
import { SpidraClient } from 'spidra-js'
import { NextResponse } from 'next/server'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

export async function POST(request: Request) {
  const { url, prompt } = await request.json()

  try {
    const job = await spidra.scrape.run({
      urls: [{ url }],
      prompt,
      output: 'json',
    })

    return NextResponse.json({ data: job.result.content })
  } catch (error) {
    return NextResponse.json({ error: 'Scrape failed' }, { status: 500 })
  }
}

Express

import express from 'express'
import { SpidraClient } from 'spidra-js'

const app = express()
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

app.use(express.json())

app.post('/scrape', async (req, res) => {
  const { url, prompt } = req.body

  try {
    const job = await spidra.scrape.run({
      urls: [{ url }],
      prompt,
      output: 'json',
    })
    res.json({ data: job.result.content })
  } catch (err) {
    res.status(500).json({ error: 'Scrape failed' })
  }
})

app.listen(3000)
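
Both handlers above pass the request body straight to scrape.run(). In a public-facing endpoint you would probably want to validate that input first. The sketch below shows one way to do it with the built-in URL class; parseScrapeRequest and ScrapeRequest are hypothetical names for this illustration.

```typescript
interface ScrapeRequest {
  url: string
  prompt: string
}

// Hypothetical helper: returns a validated request or throws with a reason.
function parseScrapeRequest(body: unknown): ScrapeRequest {
  if (typeof body !== 'object' || body === null) {
    throw new Error('Body must be a JSON object')
  }

  const { url, prompt } = body as Record<string, unknown>

  if (typeof url !== 'string' || typeof prompt !== 'string' || prompt.trim() === '') {
    throw new Error('Both "url" and "prompt" must be non-empty strings')
  }

  // new URL() throws on malformed input
  let parsed: URL
  try {
    parsed = new URL(url)
  } catch {
    throw new Error('"url" is not a valid URL')
  }

  // Only allow web pages, not file:, ftp:, and similar schemes
  if (parsed.protocol !== 'http:' && parsed.protocol !== 'https:') {
    throw new Error('"url" must use http or https')
  }

  return { url: parsed.href, prompt }
}
```

In either handler you could call this helper before the try block and answer with a 400 status when it throws, which keeps malformed requests from costing credits.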

Bun

The SDK works with Bun out of the box. No changes needed.

bun add spidra-js

import { SpidraClient } from 'spidra-js'

const spidra = new SpidraClient({ apiKey: Bun.env.SPIDRA_API_KEY! })

const job = await spidra.scrape.run({
  urls: [{ url: 'https://example.com' }],
  prompt: 'Extract the main headline',
})

console.log(job.result.content)

Error handling

Every API error maps to a typed exception class. Catch exactly what you care about and let everything else propagate.

import {
  SpidraError,
  SpidraAuthenticationError,
  SpidraInsufficientCreditsError,
  SpidraRateLimitError,
  SpidraServerError,
} from 'spidra-js'

try {
  const job = await spidra.scrape.run({
    urls: [{ url: 'https://example.com' }],
    prompt: 'Extract the main headline',
  })
  console.log(job.result.content)

} catch (err) {
  if (err instanceof SpidraAuthenticationError) {
    console.error('API key is missing or invalid. Check your SPIDRA_API_KEY.')
  } else if (err instanceof SpidraInsufficientCreditsError) {
    console.error('Account is out of credits. Top up at app.spidra.io.')
  } else if (err instanceof SpidraRateLimitError) {
    console.warn('Rate limit hit. Slow down and retry.')
  } else if (err instanceof SpidraServerError) {
    console.error(`Server error (${err.status}): ${err.message}. Retry is usually safe.`)
  } else if (err instanceof SpidraError) {
    console.error(`API error ${err.status}: ${err.message}`)
  } else {
    throw err
  }
}

| Exception | HTTP status | When it fires |
| --- | --- | --- |
| SpidraAuthenticationError | 401 | API key missing or invalid |
| SpidraInsufficientCreditsError | 403 | No credits remaining |
| SpidraRateLimitError | 429 | Too many requests |
| SpidraServerError | 500 | Unexpected error on Spidra's side |
| SpidraError | any | Base class for all exceptions |
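
Rate limit and server errors are the two cases where trying again makes sense. A generic retry wrapper with exponential backoff is one common way to handle them. The sketch below is not part of spidra-js: withRetry and RetryOptions are hypothetical, and the caller decides which errors count as retryable.

```typescript
// Hypothetical helper, not part of spidra-js.
interface RetryOptions {
  retries: number     // how many times to retry after the first attempt
  baseDelayMs: number // delay before the first retry, doubled on each further retry
  shouldRetry: (err: unknown) => boolean
}

async function withRetry<T>(fn: () => Promise<T>, opts: RetryOptions): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn()
    } catch (err) {
      // Give up when retries are exhausted or the error is not retryable
      if (attempt >= opts.retries || !opts.shouldRetry(err)) throw err

      const delay = opts.baseDelayMs * 2 ** attempt
      await new Promise<void>(resolve => setTimeout(resolve, delay))
    }
  }
}
```

For Spidra calls, shouldRetry could be err => err instanceof SpidraRateLimitError || err instanceof SpidraServerError, with fn wrapping the scrape.run() call. Authentication and credit errors would then still fail immediately, since waiting does not fix them.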

Also check the ai_extraction_failed flag in the result. If AI extraction fails for any reason, Spidra falls back to raw Markdown and sets this flag:

const job = await spidra.scrape.run({
  urls: [{ url: 'https://example.com' }],
  prompt: 'Extract the main headline',
})

if (job.result.ai_extraction_failed) {
  // Raw Markdown fallback is in the data array
  const raw = job.result.data[0]?.markdownContent
  console.warn('AI extraction failed, using raw content')
} else {
  console.log(job.result.content)
}

Putting it all together: a complete pipeline

The following example uses forEach with pagination to collect job listings from a directory, enforces a schema on the output, handles errors, and saves the results to a JSONL file:

import { SpidraClient, SpidraError, SpidraInsufficientCreditsError } from 'spidra-js'
import { writeFileSync } from 'fs'
import * as os from 'os'

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })

const JOB_SCHEMA = {
  type: 'object',
  required: ['title', 'company', 'location'],
  properties: {
    title:           { type: 'string' },
    company:         { type: 'string' },
    location:        { type: ['string', 'null'] },
    remote:          { type: ['boolean', 'null'] },
    salary_min:      { type: ['number', 'null'] },
    salary_max:      { type: ['number', 'null'] },
    employment_type: {
      type: ['string', 'null'],
      enum: ['full_time', 'part_time', 'contract', null],
    },
  },
}

async function collectListings(boardUrl: string) {
  try {
    const job = await spidra.scrape.run({
      urls: [
        {
          url: boardUrl,
          actions: [
            { type: 'click', value: 'Accept cookies' },
            {
              type: 'forEach',
              value: 'Find all job listing cards',
              mode: 'navigate',
              maxItems: 50,
              itemPrompt: 'Extract job title, company, location, remote status, salary range, and employment type',
              pagination: {
                nextSelector: 'a.next-page',
                maxPages: 3,
              },
            },
          ],
        },
      ],
      output: 'json',
      schema: JOB_SCHEMA,
    })

    if (job.result.ai_extraction_failed) {
      console.warn(`AI extraction failed for ${boardUrl}`)
      return []
    }

    const content = job.result.content
    return Array.isArray(content) ? content : [content]

  } catch (err) {
    if (err instanceof SpidraInsufficientCreditsError) {
      throw err // bubble up — stop processing
    }
    if (err instanceof SpidraError) {
      console.error(`Error scraping ${boardUrl}: ${err.message}`)
      return []
    }
    throw err
  }
}

const boards = [
  'https://jobs.example.com/engineering',
  'https://careers.anothersite.com/remote',
]

const allJobs: unknown[] = []

for (const board of boards) {
  console.log(`Collecting from ${board}...`)
  const listings = await collectListings(board)
  allJobs.push(...listings)
  console.log(`  Got ${listings.length} listings`)
}

const jsonl = allJobs.map(job => JSON.stringify(job)).join(os.EOL)
writeFileSync('jobs.jsonl', jsonl)

console.log(`\nDone. ${allJobs.length} jobs saved to jobs.jsonl`)
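
JSONL stores one JSON value per line, so reading the file back is a matter of splitting on line breaks and parsing each line. The sketch below pairs a writer with a reader; toJsonl and parseJsonl are hypothetical helpers for illustration. The writer uses '\n' as the separator, which is the usual convention for JSONL, and the reader accepts both '\n' and '\r\n' so it also copes with files written using os.EOL on Windows.

```typescript
// Serializes records as one JSON document per line, ending with a newline.
function toJsonl(records: unknown[]): string {
  if (records.length === 0) return ''
  return records.map(record => JSON.stringify(record)).join('\n') + '\n'
}

// Parses JSONL text, tolerating \r\n line endings and blank lines.
function parseJsonl(text: string): unknown[] {
  return text
    .split(/\r?\n/)
    .filter(line => line.trim() !== '')
    .map(line => JSON.parse(line))
}
```

Reading the saved file would then be parseJsonl(readFileSync('jobs.jsonl', 'utf8')), with readFileSync imported from fs.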

All scrape options

| Option | Type | Description |
| --- | --- | --- |
| urls | array | Up to 3 URL objects. Each takes a url and optional actions. |
| prompt | string | What to extract, in plain English |
| output | string | 'markdown' (default) or 'json' |
| schema | object | JSON Schema for a guaranteed output shape |
| useProxy | boolean | Route through a residential proxy |
| proxyCountry | string | Two-letter country code or 'eu' / 'global' |
| extractContentOnly | boolean | Strip nav, ads, and boilerplate before extraction |
| screenshot | boolean | Capture a viewport screenshot |
| fullPageScreenshot | boolean | Capture a full-page screenshot |
| cookies | string | Raw Cookie header string for authenticated pages |

What to read next

Get your API key at app.spidra.io. The free plan has 300 credits and no card required.
