There's a version of web scraping that looks deceptively simple. You write twenty lines of code, pull some HTML, and the data comes back clean. You feel like you've solved it.
Then you run the same script the next morning, and nothing works. The site added a CAPTCHA. The product prices are loaded by JavaScript after the initial HTML response, so your parser sees an empty div. The IP you're scraping from has been quietly blocked. The CSS class you were targeting got renamed in a deploy.
This is the actual experience of web scraping, and most tutorials skip right past it.
This guide won't. We'll start with the basics of how to scrape a static HTML page using nothing but Node's built-in fetch and a parsing library, then work up through JavaScript-rendered pages, anti-bot systems, and finally to the point where DIY infrastructure stops being worth the trouble. By the end, you'll have a clear picture of every tool in the Node.js scraping ecosystem, when to reach for each one, and what a production-ready scraping workflow actually looks like.
What you'll need
- Node.js 18 or later (we'll use the native fetch API throughout)
- A basic familiarity with async/await
- npm for installing packages
That's genuinely it. No browser required, no servers to spin up.
Scraping static HTML with fetch and Cheerio
Some websites, like documentation sites, blogs, and simple listing pages, serve their content directly in the HTML response. No JavaScript required to render the data, no login walls, no CAPTCHA. These are the easy ones, and they're a good place to start because they strip the problem down to its core.
The pipeline is simple: make an HTTP request, get the HTML back as a string, parse it to find the elements you care about, and extract the text.
Setting up the project
Start by creating a folder and then installing Cheerio.
mkdir scraper && cd scraper
npm init -y
npm install cheerio
Cheerio is a server-side HTML parser that uses a jQuery-like API. If you've ever written $('.product-title').text() in a browser, Cheerio will feel immediately familiar.
Your first scrape
Let's scrape the Hacker News front page. It's static HTML, publicly accessible, and has a reasonably clean structure.
import * as cheerio from "cheerio";
async function scrapeHackerNews() {
const response = await fetch("https://news.ycombinator.com");
const html = await response.text();
const $ = cheerio.load(html);
const stories = [];
$(".athing").each((index, element) => {
const titleEl = $(element).find(".titleline > a");
const title = titleEl.text();
const url = titleEl.attr("href");
// The score and metadata live in the td.subtext cell of the next sibling row
const subtext = $(element).next().find(".subtext");
const score = subtext.find(".score").text();
const author = subtext.find(".hnuser").text();
if (title) {
stories.push({ title, url, score, author });
}
});
return stories;
}
const stories = await scrapeHackerNews();
console.log(JSON.stringify(stories.slice(0, 5), null, 2));
Run it with node index.js (add "type": "module" to your package.json, or name the file index.mjs) and you'll get something like:
[
{
"title": "Show HN: I built a terminal-based spreadsheet",
"url": "https://example.com/spreadsheet",
"score": "312 points",
"author": "username123"
}
]
This works until it doesn't. The selectors .athing, .titleline, .subtext are structural details of Hacker News's HTML as it exists right now. If the site's developers rename a class, restructure the DOM, or switch to a different templating approach in a refactor, your scraper silently starts returning empty arrays. You won't know until something downstream complains about missing data.
This brittleness is a fundamental property of selector-based scraping, not a bug in your code. We'll come back to it.
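One cheap mitigation is to treat an empty result as an error, so a broken selector fails loudly instead of quietly feeding empty data downstream. A minimal sketch of a guard you could add after calling scrapeHackerNews:
const stories = await scrapeHackerNews();

// If the selectors stop matching, the scrape "succeeds" with zero items.
// Fail loudly instead of passing an empty array downstream.
if (stories.length === 0) {
  throw new Error("scrapeHackerNews returned no stories -- the page structure has probably changed");
}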
JavaScript-rendered pages and Puppeteer
A large portion of the modern web doesn't send you content in the HTML response. It sends you a near-empty shell and a bundle of JavaScript. The browser runs that JavaScript, which makes API calls, builds the DOM, and renders the content you actually see. When your fetch request hits one of these pages, you get the shell (not the content).
This includes most e-commerce sites (product listings loaded dynamically), social platforms, single-page applications built with React or Vue, and any site that uses infinite scroll instead of traditional pagination.
To scrape these, you need a real browser or something that behaves like one, such as Puppeteer.
Puppeteer
Puppeteer is a Node.js library that gives you programmatic control over a headless Chromium browser. It can navigate pages, click buttons, fill out forms, scroll, wait for elements to appear, and give you the final rendered HTML after JavaScript has run.
npm install puppeteer
Here's a simple example of scraping a product listing from a JavaScript-heavy page:
import puppeteer from "puppeteer";
async function scrapeProducts(url) {
const browser = await puppeteer.launch({ headless: true });
const page = await browser.newPage();
await page.goto(url, { waitUntil: "networkidle2" });
// Wait until the product grid is actually in the DOM
await page.waitForSelector(".product-card");
const products = await page.evaluate(() => {
return Array.from(document.querySelectorAll(".product-card")).map((card) => ({
name: card.querySelector(".product-name")?.textContent?.trim(),
price: card.querySelector(".product-price")?.textContent?.trim(),
}));
});
await browser.close();
return products;
}
The waitUntil: "networkidle2" option tells Puppeteer to consider navigation finished once there have been no more than two in-flight network connections for at least 500ms, a decent heuristic for "the page has probably finished loading its data."
You can also use waitForSelector to wait for a specific element to appear, which is more reliable when you know exactly what you're looking for.
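If the element never appears, waitForSelector throws after its default 30-second timeout. A minimal sketch of tightening that budget and failing with a useful error (the selector and screenshot path are just examples):
try {
  await page.waitForSelector(".product-card", { timeout: 10_000 });
} catch (err) {
  // The element never showed up -- capture the page state for debugging
  await page.screenshot({ path: "timeout-debug.png" });
  throw new Error(`Product grid never rendered on ${page.url()}`);
}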
Interacting with the page before scraping
Sometimes the data you need is locked behind a user action like an "Accept cookies" dialog you have to dismiss, a "Load more" button you need to click, or a dropdown filter you need to change. Puppeteer handles all of this.
// Dismiss a cookie consent dialog
await page.click("#accept-cookies");
// Wait a beat for the animation to finish (page.waitForTimeout was removed
// in newer Puppeteer versions, so use a plain setTimeout-based delay)
await new Promise((resolve) => setTimeout(resolve, 500));
// Click "Load more" until it disappears
while (await page.$(".load-more-button")) {
await page.click(".load-more-button");
await new Promise((resolve) => setTimeout(resolve, 1000));
}
// Now scrape the fully loaded content
const items = await page.evaluate(() => {
// ...
});
This gets the job done. It's also where things start getting complicated in practice.
Where DIY scraping breaks down
Puppeteer solves the JavaScript rendering problem. It does not solve the infrastructure problem. And at any meaningful scale, infrastructure is what actually breaks things.
Anti-bot detection
Sites spend significant engineering effort detecting and blocking automated traffic. They look for signals like:
- Request headers. A raw fetch request doesn't send the same headers a real browser does. It's missing the Accept-Language, Accept-Encoding, and Sec-Fetch-* headers, as well as the specific User-Agent fingerprint of a real Chrome installation. Detection systems notice.
- Behavioral patterns. Humans scroll unevenly, hover over elements, and move the mouse before clicking. Scripts move directly and instantly. Machine learning classifiers trained on real user sessions can identify this pattern with high accuracy.
- TLS fingerprinting. Even before your request reaches the application layer, the TLS handshake leaves a fingerprint. Puppeteer's fingerprint is well-known. Services like Cloudflare check it.
- IP reputation. Data center IPs, such as the kind you get from a typical cloud server, are flagged by default on many sites. You need residential proxy IPs that look like regular home users to avoid automatic blocks.
Handling all of this yourself means maintaining a pool of proxy IPs, rotating user agents, randomizing timing, managing cookies and sessions, patching browser fingerprints, and keeping up with detection as it evolves. It's a genuine engineering problem, and it's continuous: techniques that work today get patched next month.
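To make that concrete, here is roughly what the first layer of DIY evasion looks like with Puppeteer: routing traffic through a proxy, rotating user agents, and randomizing timing. This is a sketch only; the proxy host, credentials, and user-agent list are placeholders, and on its own it does nothing about TLS fingerprints or behavioral detection.
import puppeteer from "puppeteer";

// Placeholder values -- in practice you'd load real proxy credentials and a
// much larger user-agent pool from configuration
const PROXY = "http://proxy.example.com:8000";
const USER_AGENTS = [
  "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
  "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36",
];

const browser = await puppeteer.launch({
  headless: true,
  args: [`--proxy-server=${PROXY}`],
});
const page = await browser.newPage();

// Authenticate against the proxy and pick a user agent for this session
await page.authenticate({ username: "proxy-user", password: "proxy-pass" });
await page.setUserAgent(USER_AGENTS[Math.floor(Math.random() * USER_AGENTS.length)]);

// Randomize timing so requests don't arrive at machine-perfect intervals
await new Promise((resolve) => setTimeout(resolve, 1000 + Math.random() * 2000));
await page.goto("https://example.com", { waitUntil: "networkidle2" });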
CAPTCHAs
Some sites will let the first few requests through and then present a CAPTCHA. Solving CAPTCHAs programmatically means using a model capable of solving visual challenges, or in the case of newer CAPTCHAs like Cloudflare Turnstile, having infrastructure that can pass the behavioral checks that trigger them in the first place.
Maintenance
The selector-based brittleness we mentioned earlier gets worse at scale. The more selectors you maintain and the more sites you scrape, the more frequently something breaks. A production scraping system built on raw selectors requires active maintenance. Someone needs to notice when a scraper silently starts returning empty results and fix it before the downstream system that depends on the data breaks.
None of this means you shouldn't use Puppeteer or Cheerio. For small projects, one-off data pulls, or sites where you control the source, they're excellent tools. But for anything you need to run reliably and repeatedly, against sites that actively resist being scraped, the DIY approach has a real cost that compounds over time.
The Spidra Node.js SDK
This is where the Spidra SDK fits into the picture. Spidra is a managed scraping API that handles the browser infrastructure, proxy rotation, CAPTCHA solving, and anti-bot evasion for you. You describe what you want in plain language or with a JSON schema for strict structure, and get clean data back.
The Node.js SDK wraps the Spidra API with a clean TypeScript-first interface:
npm install spidra
You'll need an API key from app.spidra.io. There's a free tier with 300 credits to get started.
Basic scraping
Here is a very basic example:
import { SpidraClient } from "spidra";
const spidra = new SpidraClient({ apiKey: "spd_YOUR_API_KEY" });
const job = await spidra.scrape.run({
urls: [{ url: "https://news.ycombinator.com" }],
prompt: "Extract the title, URL, score, and author of each story on the front page",
output: "json",
});
console.log(job.result.content);
The run() method submits the job and polls for completion automatically. The result comes back as structured JSON matching what you described. No selectors, no parsing, no fragile DOM traversal.
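If you want more control over how long that takes, run() accepts an options object as its second argument with pollInterval and timeout values (the same options used in the complete example at the end of this guide):
// Poll every 3 seconds and give up after 60 seconds
const job = await spidra.scrape.run(
  {
    urls: [{ url: "https://news.ycombinator.com" }],
    prompt: "Extract the title and URL of each front-page story",
    output: "json",
  },
  { pollInterval: 3000, timeout: 60_000 }
);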
Structured output with a JSON schema
Natural language prompts are convenient, but sometimes you need a guaranteed shape, especially when the output feeds into a database or another API that expects specific fields. Pass a schema, and Spidra will enforce it, returning null for any field it can't find rather than inventing a value.
const job = await spidra.scrape.run({
urls: [{ url: "https://jobs.example.com/senior-engineer" }],
prompt: "Extract the job listing details",
output: "json",
schema: {
type: "object",
required: ["title", "company", "location"],
properties: {
title: { type: "string" },
company: { type: "string" },
location: { type: "string" },
remote: { type: ["boolean", "null"] },
salary_min: { type: ["number", "null"] },
salary_max: { type: ["number", "null"] },
skills: { type: "array", items: { type: "string" } },
},
},
});
This is particularly useful for scraping at scale, where inconsistent output structure makes downstream processing painful.
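Because the schema guarantees the shape, you can mirror it with a TypeScript type on the consuming side. A sketch, assuming job.result.content holds the parsed object as in the earlier example (the interface is our own, not part of the SDK):
// Mirrors the JSON schema above; optional fields come back as null when missing
interface JobListing {
  title: string;
  company: string;
  location: string;
  remote: boolean | null;
  salary_min: number | null;
  salary_max: number | null;
  skills: string[];
}

const listing = job.result.content as JobListing;
if (listing.salary_min !== null && listing.salary_max !== null) {
  console.log(`${listing.title} at ${listing.company}: ${listing.salary_min}-${listing.salary_max}`);
}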
Browser actions
Not everything is scrape-and-done. Some pages require interaction before the content you want is visible. The SDK supports browser actions such as click, type, scroll, and wait that run before extraction, using either CSS selectors or plain English descriptions (Spidra will locate the element using AI).
const job = await spidra.scrape.run({
urls: [
{
url: "https://example.com/products",
actions: [
{ type: "click", value: "Accept cookies button" },
{ type: "wait", duration: 1000 },
{ type: "scroll", to: "80%" },
],
},
],
prompt: "Extract all product names and prices",
});
Processing every item on a page with forEach
One of the more powerful features is forEach. It finds a set of elements on the page and processes each one individually. This is the right tool when you're scraping a product listing and need to click into each product's detail page to get the full description, or when a page has more than 50 items and you need per-item extraction to stay consistent.
const job = await spidra.scrape.run({
urls: [
{
url: "https://books.toscrape.com/catalogue/mystery_3/index.html",
actions: [
{
type: "forEach",
observe: "Find all book cards in the product grid",
mode: "navigate", // click into each book's page
maxItems: 20,
waitAfterClick: 800,
itemPrompt: "Extract the title, price, star rating, and availability. Return as JSON.",
},
],
},
],
prompt: "Return a clean JSON array of all books",
output: "json",
});
The mode: "navigate" setting follows each item's link to its destination page and captures content there. Use mode: "inline" to read content directly from the listing without navigating, or mode: "click" for UIs where clicking opens a modal or expands a section.
You can also add pagination to walk through multiple pages automatically:
{
type: "forEach",
observe: "Find all book title links",
mode: "navigate",
maxItems: 60,
pagination: {
nextSelector: "li.next > a",
maxPages: 3,
},
}
Batch scraping
When you have a list of URLs, such as product pages, job listings, and company profiles, you can submit up to 50 at once for parallel processing.
const batch = await spidra.batch.run({
urls: [
"https://shop.example.com/product/wireless-headphones",
"https://shop.example.com/product/smart-watch",
"https://shop.example.com/product/laptop-stand",
],
prompt: "Extract product name, current price, and whether it's in stock",
output: "json",
});
for (const item of batch.items) {
if (item.status === "completed") {
console.log(item.url, item.result);
} else {
console.error(`Failed: ${item.url}`, item.error);
}
}
Each item in the response has its own status and result, so partial failures don't break the whole job.
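Because failures are reported per item, a simple follow-up pass can retry only the URLs that didn't complete. A sketch built on the same batch.run call (the single retry pass is our own pattern, not an SDK feature):
// Collect the URLs that failed and resubmit them once
const failedUrls = batch.items
  .filter((item) => item.status !== "completed")
  .map((item) => item.url);

if (failedUrls.length > 0) {
  const retry = await spidra.batch.run({
    urls: failedUrls,
    prompt: "Extract product name, current price, and whether it's in stock",
    output: "json",
  });
  const recovered = retry.items.filter((item) => item.status === "completed").length;
  console.log(`Recovered ${recovered} of ${failedUrls.length} failed URLs`);
}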
Crawling an entire site
If you need data from across an entire domain rather than from specific URLs, the crawl API handles discovery for you. You provide a starting URL and specify which pages to visit and what to extract. Spidra follows links, handles pagination, and returns structured data from every matching page.
const job = await spidra.crawl.run({
baseUrl: "https://competitor.com/blog",
crawlInstruction: "Find all blog posts published in 2024 and 2025",
transformInstruction: "Extract the title, author, publish date, and a two-sentence summary",
maxPages: 50,
});
for (const page of job.result) {
console.log(page.url, page.data);
}
Error handling
The SDK exports typed error classes for each failure mode, so you can handle them specifically:
import {
SpidraClient,
SpidraAuthenticationError,
SpidraInsufficientCreditsError,
SpidraRateLimitError,
} from "spidra";
try {
await spidra.scrape.run({ urls: [{ url: "https://example.com" }], prompt: "..." });
} catch (err) {
if (err instanceof SpidraAuthenticationError) {
console.error("Invalid API key");
} else if (err instanceof SpidraInsufficientCreditsError) {
console.error("Out of credits -- top up your account");
} else if (err instanceof SpidraRateLimitError) {
console.error("Rate limited -- back off and retry");
}
}
A complete real-world example
Let's pull everything together into something concrete. We'll scrape the top 30 mystery books from Books to Scrape, including their titles, prices, and star ratings, using forEach with pagination to walk through multiple pages of results.
import { SpidraClient } from "spidra";
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! });
async function getMysteryBooks() {
const job = await spidra.scrape.run(
{
urls: [
{
url: "https://books.toscrape.com/catalogue/category/books/mystery_3/index.html",
actions: [
{
type: "forEach",
observe: "Find all book cards in the product grid",
mode: "inline",
captureSelector: "article.product_pod",
maxItems: 30,
itemPrompt: "Extract title, price, and star rating. Return JSON: { title, price, star_rating }",
pagination: {
nextSelector: "li.next > a",
maxPages: 2,
},
},
],
},
],
prompt: "Return a clean JSON array of all books with title, price, and star_rating fields",
output: "json",
},
{
pollInterval: 3000,
timeout: 60_000,
}
);
return job.result.content;
}
const books = await getMysteryBooks();
console.log(`Found ${(books as any[]).length} books`);
console.log(JSON.stringify(books, null, 2));
Store your API key in a .env file (or your environment) and load it with dotenv or Node's --env-file flag. Running this will return a clean array of book objects without you having to write a single selector.
Choosing the right tool
There's no single answer that fits every situation. Here's a practical way to think about it:
- Use fetch + cheerio when the content is in the HTML source (right-click → View Source, and the data is there), the site doesn't block basic requests, and you're building something simple or one-off.
- Use Puppeteer or Playwright when the content is rendered by JavaScript, you need to interact with the page (click, scroll, fill forms), and you're scraping a small number of sites that you can maintain selectors for over time.
- Use the Spidra SDK when you need reliability at any real scale, when the sites you're scraping use CAPTCHAs or anti-bot systems, when your selectors keep breaking because sites change, or when you want to describe what you want in plain language rather than maintain fragile DOM traversal code.
For most production use cases, the managed API approach saves more engineering time than it costs. The DIY path has genuine learning value and is the right choice for small or experimental work. Knowing both puts you in a position to make that call deliberately.
What's next
From here, the natural directions to explore are:
- Authenticated scraping — Spidra supports passing cookies for pages behind login walls, which opens up scraping of dashboards, paywalled content, and internal tools
- Webhooks and integrations — scraped data can be piped directly to Slack, Discord, Airtable, or Google Sheets through Spidra's integration layer without an intermediary server
- AI agent pipelines — the SDK plugs naturally into frameworks like the Vercel AI SDK as a tool, letting an LLM decide what to scrape and what to do with the results (a rough sketch follows below)
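To make the last point concrete, here's a rough sketch of exposing Spidra as a tool in the Vercel AI SDK. It assumes an AI SDK 4-style tool definition (the tool() helper with a Zod parameters schema) and an OpenAI model; adapt it to whichever framework and model you actually use.
import { generateText, tool } from "ai";
import { openai } from "@ai-sdk/openai";
import { z } from "zod";
import { SpidraClient } from "spidra";

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! });

const result = await generateText({
  model: openai("gpt-4o"),
  tools: {
    // The model calls this tool whenever it decides a page needs scraping
    scrapePage: tool({
      description: "Scrape a web page and return structured data",
      parameters: z.object({
        url: z.string().describe("The page to scrape"),
        prompt: z.string().describe("What to extract, in plain language"),
      }),
      execute: async ({ url, prompt }) => {
        const job = await spidra.scrape.run({ urls: [{ url }], prompt, output: "json" });
        return job.result.content;
      },
    }),
  },
  maxSteps: 3,
  prompt: "Get the top five stories from https://news.ycombinator.com and summarize the common themes",
});

console.log(result.text);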
The Spidra docs cover all of these in detail, and the API playground lets you test any scrape interactively before writing code.
