June 1, 2026 · 9 min read

How to use Puppeteer Cluster to scale up web scraping

Joel Olawanle

Running a single Puppeteer browser instance is fine for one page. Running one instance per URL is not. Each Chromium process uses 200 to 400 MB of RAM, and launching a fresh browser for every URL in a list burns through memory fast, adds startup latency to every request, and caps how much work you can do in parallel.

Browser clustering solves this by sharing a fixed pool of browser workers across a queue of URLs. Instead of spinning up a new browser for each task, you define how many workers you want running at once and let them pull from the queue continuously until everything is processed.

In this tutorial you will learn how puppeteer-cluster works, the three concurrency models it supports, how to set up basic and advanced cluster configurations, and where the self-managed approach hits its limits when you need to scale further.

What is browser clustering in Puppeteer?

Browser clustering in Puppeteer is a pattern for managing multiple browser workers that share a task queue and run concurrently. Each worker independently picks up a URL from the queue, processes it, and returns for the next one without waiting for other workers to finish.

Instead of scraping 12 URLs one at a time with a single browser, a cluster of 3 workers processes 3 URLs simultaneously. When one finishes, it picks up the next available URL. All 12 get done in roughly the time it takes to process 4 in sequence.
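The arithmetic above can be checked with a small simulation. This is a minimal sketch that models only the scheduling, not real browsers: each task is a made-up duration, and `simulatePool` is a hypothetical helper, not part of puppeteer-cluster.

```javascript
// Simulate a fixed pool of workers draining a shared queue.
// Each task is just a duration in arbitrary time units; no browser is involved.
function simulatePool(durations, workerCount) {
    // freeAt[i] is the time at which worker i becomes available again.
    const freeAt = new Array(workerCount).fill(0);

    for (const duration of durations) {
        // The next task goes to whichever worker frees up first.
        const next = freeAt.indexOf(Math.min(...freeAt));
        freeAt[next] += duration;
    }

    // The run ends when the last worker finishes.
    return Math.max(...freeAt);
}

// 12 URLs, each assumed to take 1 time unit to process.
const pageLoads = new Array(12).fill(1);

const sequentialTime = simulatePool(pageLoads, 1); // one browser, one URL at a time
const clusteredTime = simulatePool(pageLoads, 3);  // three workers sharing the queue

console.log({ sequentialTime, clusteredTime });
```

With equal task durations the 3-worker run finishes in 4 time units against 12 for the sequential run. Real pages load at different speeds, so the actual speed-up is usually a little lower than the ideal ratio.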

The puppeteer-cluster library handles this coordination. It manages worker creation, task queuing, retries, error handling, and cluster teardown so you do not have to build that infrastructure yourself.

Concurrency models in puppeteer-cluster

puppeteer-cluster supports three concurrency models. The right one depends on how isolated you need each worker to be.

  • CONCURRENCY_BROWSER runs multiple independent browser instances. Each worker is a completely separate Chromium process with its own cookies, user agent, and fingerprint. This is the most isolated option and the right choice when you need each worker to look like completely separate traffic, especially when pairing workers with different proxies.
  • CONCURRENCY_CONTEXT runs multiple browser contexts within a single browser instance. Workers share some browser resources like user agent but maintain separate cookie jars and storage. More memory-efficient than CONCURRENCY_BROWSER but carries higher detection risk on the same domain since contexts share fingerprint data. Assigning a different proxy per context helps reduce that risk.
  • CONCURRENCY_PAGE runs multiple pages within the same browser context. This is the most memory-efficient option. All pages share cookies, session storage, and caches. Best suited for scraping different domains where shared session data is not a problem. For the same domain it carries the highest detection risk since all pages are indistinguishable from each other.
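One way to encode the trade-offs above is a small rule-of-thumb selector. This is only a sketch: `pickConcurrency` and its option names (`sameDomain`, `memoryConstrained`) are hypothetical, and the rules simply restate the guidance in the list rather than anything the library enforces.

```javascript
// Rule-of-thumb selector for a puppeteer-cluster concurrency model.
// Returns the name of the constant, so it can be looked up on the Cluster class.
function pickConcurrency({ sameDomain, memoryConstrained = false }) {
    // Different domains: shared cookies and caches are harmless, so use the lightest model.
    if (!sameDomain) return 'CONCURRENCY_PAGE';

    // Same domain on a tight memory budget: separate cookie jars, one browser process.
    if (memoryConstrained) return 'CONCURRENCY_CONTEXT';

    // Same domain with memory to spare: fully separate browser processes.
    return 'CONCURRENCY_BROWSER';
}

const modelName = pickConcurrency({ sameDomain: true, memoryConstrained: false });
console.log(modelName);

// Usage with the library (requires puppeteer-cluster to be installed):
// const { Cluster } = require('puppeteer-cluster');
// const cluster = await Cluster.launch({ concurrency: Cluster[modelName], maxConcurrency: 3 });
```

Treat the result as a starting point. Detection risk also depends on proxies and on how the target site fingerprints traffic.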

How to use Puppeteer Cluster

Step 1: Install dependencies

npm install puppeteer puppeteer-cluster

Step 2: Set up the Cluster

Import puppeteer-cluster and launch a cluster. This configuration runs 3 browser instances concurrently:

// npm install puppeteer puppeteer-cluster
const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: 3,
    });

    // ...
})();

Step 3: Define a task and queue URLs

Define the scraping logic as a task, then queue each URL. Workers pick up URLs from the queue as they become available:

const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: 3,
    });

    const URLs = [
        'https://www.scrapingcourse.com/ecommerce/',
        'https://www.scrapingcourse.com/ecommerce/page/2/',
        'https://www.scrapingcourse.com/ecommerce/page/3/',
        'https://www.scrapingcourse.com/ecommerce/page/4/',
        'https://www.scrapingcourse.com/ecommerce/page/5/',
        'https://www.scrapingcourse.com/ecommerce/page/6/',
    ];

    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        console.log(await page.title());
    });

    URLs.forEach((url) => cluster.queue(url));

    await cluster.idle();
    await cluster.close();
})();
// Output (order varies since tasks run concurrently)
Ecommerce Test Site to Learn Web Scraping - Page 3 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 2 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 5 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 4 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 6 - ScrapingCourse.com

The output order is non-deterministic because tasks run in parallel. All 6 URLs processed with 3 workers.
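For longer paginated lists, the URL array can be generated instead of typed out. The sketch below assumes the pattern visible in the list above: page 1 is the base URL and later pages live under `page/N/`. `buildPageUrls` is a hypothetical helper name.

```javascript
// Build the paginated URL list used in the example instead of hard-coding it.
function buildPageUrls(baseUrl, pageCount) {
    const urls = [];
    for (let page = 1; page <= pageCount; page++) {
        // Page 1 is the base URL itself; later pages append "page/N/".
        urls.push(page === 1 ? baseUrl : `${baseUrl}page/${page}/`);
    }
    return urls;
}

const pageUrls = buildPageUrls('https://www.scrapingcourse.com/ecommerce/', 6);
console.log(pageUrls);

// Each entry can then be queued exactly as before:
// pageUrls.forEach((url) => cluster.queue(url));
```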

Step 4: Queue different tasks for different URLs

When URLs require different scraping logic, pass a task function directly to cluster.queue instead of using a shared task:

const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: 3,
    });

    // task: extract product data
    const scrapeProducts = async ({ page, data: url }) => {
        await page.goto(url);
        const products = await page.$$eval('.product', (items) =>
            items.map((item) => ({
                name:  item.querySelector('.product-name')?.textContent.trim() || '',
                price: item.querySelector('.price')?.textContent.trim() || '',
            }))
        );
        console.log(products);
    };

    // task: extract page title only
    const scrapeTitle = async ({ page, data: url }) => {
        await page.goto(url);
        console.log(await page.title());
    };

    cluster.queue('https://www.scrapingcourse.com/ecommerce/', scrapeProducts);
    cluster.queue('https://www.scrapingcourse.com/ecommerce/page/2/', scrapeTitle);

    await cluster.idle();
    await cluster.close();
})();
// Output
[
    { name: 'Abominable Hoodie', price: '$69.00' },
    { name: 'Adrienne Trek Jacket', price: '$57.00' },
    // ...
    { name: 'Artemis Running Short', price: '$45.00' }
]
Ecommerce Test Site to Learn Web Scraping - Page 2 - ScrapingCourse.com

Each URL runs its own task function. The cluster still manages concurrency and queuing for both.

Advanced cluster configuration

puppeteer-cluster exposes several configuration options inside Cluster.launch for controlling retry behavior, timeouts, and worker creation pacing:

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_BROWSER,
    maxConcurrency: 3,
    retryLimit: 3,            // retry failed tasks up to 3 times
    retryDelay: 2000,         // wait 2 seconds between retries
    skipDuplicateUrls: true,  // ignore URLs already in the queue
    timeout: 60000,           // task timeout in ms (default 30s)
    workerCreationDelay: 100, // pause between creating each worker (ms)
});

| Option | Description |
| --- | --- |
| retryLimit | Maximum retries per task before marking it failed |
| retryDelay | Milliseconds to wait between retry attempts |
| skipDuplicateUrls | Prevents the same URL being queued more than once |
| timeout | Maximum time allowed for a single task before it is killed |
| workerCreationDelay | Spacing between worker startups to avoid resource spikes |

workerCreationDelay is worth paying attention to. Launching all workers simultaneously spikes memory and CPU at startup. Spacing them out by 100 to 200 ms gives the system time to stabilize between each browser launch.
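It helps to see how these options add up in time. The sketch below is a rough estimate under two assumptions: a retryLimit of 3 means one initial attempt plus up to 3 retries, and every attempt runs all the way to the timeout. Both helper functions are hypothetical and ignore time a task spends waiting in the queue.

```javascript
// The same values as the configuration above.
const options = {
    maxConcurrency: 3,
    retryLimit: 3,
    retryDelay: 2000,
    timeout: 60000,
    workerCreationDelay: 100,
};

// Worst case for one task: every attempt times out, with a delay before each retry.
function worstCaseTaskMs({ timeout, retryLimit, retryDelay }) {
    const attempts = retryLimit + 1; // initial attempt plus retries
    return attempts * timeout + retryLimit * retryDelay;
}

// Minimum time spent staggering worker launches: one gap between each pair of workers.
function startupStaggerMs({ maxConcurrency, workerCreationDelay }) {
    return Math.max(0, maxConcurrency - 1) * workerCreationDelay;
}

const worstCaseMs = worstCaseTaskMs(options);
const staggerMs = startupStaggerMs(options);

console.log(`Worst case per task: ${worstCaseMs / 1000}s`);
console.log(`Startup stagger: ${staggerMs}ms`);
```

With these values a single stubborn URL can occupy a worker for about four minutes in total, which is worth keeping in mind when choosing timeout and retryLimit together.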

Error handling

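Errors thrown inside a queued task do not propagate to your code. puppeteer-cluster catches them and emits a taskerror event, so without a listener a failed task can go unnoticed. Add an error handler to log failures per task while the rest of the queue keeps running: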

cluster.on('taskerror', (err, data) => {
    console.error(`Task failed for ${data}: ${err.message}`);
});

This fires for every task that throws or times out. Combined with retryLimit, it gives you full control over how failures are handled without manual try-catch inside every task function.
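A handler can also collect final failures for reporting or re-queuing later. The sketch below assumes the handler receives a third boolean argument that says whether the task will be retried, as the library's README describes; verify this against your installed version. `createTaskErrorHandler` is a hypothetical helper, and the two calls at the bottom only simulate events so the snippet runs without a browser.

```javascript
// Build a taskerror handler that records only final failures.
function createTaskErrorHandler(failed) {
    return (err, data, willRetry) => {
        if (willRetry) {
            // The cluster will try this task again, so only log a warning.
            console.warn(`Retrying ${data}: ${err.message}`);
            return;
        }
        // No retries left: remember the URL and the reason.
        failed.push({ url: data, reason: err.message });
    };
}

const failedTasks = [];
const onTaskError = createTaskErrorHandler(failedTasks);

// With a real cluster:
// cluster.on('taskerror', onTaskError);

// Simulated events for illustration: one retryable failure, then a final one.
onTaskError(new Error('Navigation timeout'), 'https://example.com/a', true);
onTaskError(new Error('Navigation timeout'), 'https://example.com/a', false);

console.log(failedTasks);
```

After cluster.idle() resolves, failedTasks holds every URL that exhausted its retries.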

The limits of self-managed clustering

puppeteer-cluster is a well-designed library for what it does. The limits are not in the library. They are in what running browser clusters on local hardware actually costs.

  • Memory caps your concurrency. Each CONCURRENCY_BROWSER worker is a full Chromium process. Three workers at 300 MB each use about 900 MB of RAM before your task logic runs. Past roughly 5 workers, a typical server tends to slow down rather than speed up. The ceiling on how fast you can scrape is your hardware.
  • Startup latency adds up. Launching browser instances takes time. On a cold start with 5 workers, you are waiting several seconds before any scraping happens. For large URL lists this adds meaningful overhead.
  • Anti-bot bypass is your problem. puppeteer-cluster handles concurrency. It does not handle Cloudflare, DataDome, or any other protection. Adding proxy rotation, CAPTCHA solving, and browser fingerprinting on top of a cluster setup means building and maintaining a significant amount of additional infrastructure yourself.
  • The cluster still runs on one machine. Even a well-tuned local cluster is a single point of failure. No redundancy. No distributed processing. No auto-scaling when your URL list grows.
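The memory ceiling described above can be estimated before launching anything. This is a back-of-the-envelope sketch: the 200 to 400 MB per-worker range comes from the introduction, while the 1024 MB reserve for the operating system and Node.js is an assumption, and `maxWorkersForMemory` is a hypothetical helper.

```javascript
const os = require('os');

// Estimate how many browser workers fit in memory.
function maxWorkersForMemory(totalMb, perWorkerMb, reservedMb = 1024) {
    const usableMb = totalMb - reservedMb;
    return Math.max(0, Math.floor(usableMb / perWorkerMb));
}

// Example: a 4 GB server at both ends of the per-worker range.
const pessimistic = maxWorkersForMemory(4096, 400); // heavy pages
const optimistic = maxWorkersForMemory(4096, 200);  // light pages
console.log({ pessimistic, optimistic });

// The same estimate for the machine running this script.
const thisMachineMb = Math.floor(os.totalmem() / (1024 * 1024));
console.log(`This machine: up to ${maxWorkersForMemory(thisMachineMb, 300)} workers at 300 MB each`);
```

The result is an upper bound on memory alone. CPU contention usually limits useful concurrency well before this number is reached.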

For processing dozens of URLs in a script, puppeteer-cluster is a clean and practical solution. For production pipelines processing hundreds or thousands of URLs reliably, these limits matter.

Scaling further with Spidra's batch scraping

When local cluster limits become the bottleneck, the alternative is to move the browser infrastructure out of your machine entirely and into a managed API that handles concurrency, proxies, and anti-bot bypass as a service.

Spidra's batch scraping processes up to 50 URLs in parallel per request, runs each one through a real browser in the cloud, handles anti-bot bypass automatically, and returns AI-extracted structured data rather than raw HTML that still needs parsing.

Here is the same multi-page product scraping task using Spidra's Node.js SDK instead of puppeteer-cluster:

npm install spidra-js

import { SpidraClient } from 'spidra-js';

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY });

const urls = [
    'https://www.scrapingcourse.com/ecommerce/',
    'https://www.scrapingcourse.com/ecommerce/page/2/',
    'https://www.scrapingcourse.com/ecommerce/page/3/',
    'https://www.scrapingcourse.com/ecommerce/page/4/',
    'https://www.scrapingcourse.com/ecommerce/page/5/',
    'https://www.scrapingcourse.com/ecommerce/page/6/',
];

const batch = await spidra.batch.run({
    urls,
    prompt: 'Extract all product names and prices',
    output: 'json',
});

for (const item of batch.items) {
    if (item.status === 'completed') {
        console.log(`${item.url}:`, item.result.content);
    } else {
        console.error(`Failed: ${item.url} — ${item.error}`);
    }
}
// Output (each page processed in parallel)
"https://www.scrapingcourse.com/ecommerce/": [
    { "name": "Abominable Hoodie", "price": "$69.00" },
    { "name": "Adrienne Trek Jacket", "price": "$57.00" }
],
"https://www.scrapingcourse.com/ecommerce/page/2/": [
    { "name": "Beaumont Summit Kit", "price": "$36.00" }
]

All 6 pages run in parallel in the cloud. No browser to launch. No memory to manage. No selectors to write. Each item in the response has its own status so you can see exactly which ones completed and which ones failed.
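Since a batch request is described as handling up to 50 URLs, a longer list needs to be split into groups first. The sketch below shows only the splitting step in plain JavaScript. `chunk` is a hypothetical helper, the example.com URLs are placeholders, and sending each group through the batch call shown above is left as a comment because it needs an API key.

```javascript
// Split a list into consecutive groups of at most `size` items.
function chunk(items, size) {
    const groups = [];
    for (let i = 0; i < items.length; i += size) {
        groups.push(items.slice(i, i + size));
    }
    return groups;
}

// 120 placeholder URLs for illustration.
const manyUrls = Array.from({ length: 120 }, (_, i) => `https://example.com/item/${i + 1}`);

const urlGroups = chunk(manyUrls, 50);
const groupSizes = urlGroups.map((group) => group.length);
console.log(groupSizes);

// Each group would then be passed as `urls` to one batch request,
// in the same way as the six-URL example above.
```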

For protected pages, add useProxy: true and anti-bot bypass is handled automatically:

const batch = await spidra.batch.run({
    urls,
    prompt: 'Extract all product names and prices',
    output: 'json',
    useProxy: true,
    proxyCountry: 'us',
});

No change to the rest of the code. The same batch request works on open pages and protected ones.

For even larger jobs where you need to crawl an entire domain rather than a known list of URLs, Spidra's crawl endpoint discovers pages automatically:

const job = await spidra.crawl.run({
    baseUrl: 'https://www.scrapingcourse.com/ecommerce/',
    crawlInstruction: 'Follow all paginated product pages',
    transformInstruction: 'Extract all product names and prices from each page',
    maxPages: 50,
});

for (const page of job.result) {
    console.log(page.url, page.data);
}

puppeteer-cluster vs. Spidra batch

| Feature | puppeteer-cluster | Spidra Batch |
| --- | --- | --- |
| Concurrency model | 3 models: browser, context, page | Parallel cloud workers |
| Max concurrent workers | Limited by local RAM | Up to 50 URLs per request |
| Browser infrastructure | You manage it | Fully managed |
| Anti-bot bypass | Not included | Built in, automatic |
| Proxy rotation | Not included | Built in, 50 countries |
| Structured output | Raw HTML, you write the parser | AI extraction, optional JSON schema |
| Retry logic | retryLimit + retryDelay | Automatic per item |
| Error handling | taskerror event | Per-item status in response |
| Scales across machines | No, single node | Yes, cloud infrastructure |
| Best for | Medium-scale local scraping | Production pipelines, protected sites |

Conclusion

puppeteer-cluster is the right tool for running a Puppeteer scraper across multiple pages concurrently without launching a separate browser for each URL. The concurrency model options give you control over the isolation and memory trade-offs, and the advanced configuration handles retries, timeouts, and duplicate skipping cleanly.

The ceiling is hardware. Memory limits how many workers you can run, startup latency adds overhead at scale, and everything above the clustering layer including proxies, anti-bot bypass, and fingerprinting is still your responsibility to build and maintain.

When you need to go beyond what local infrastructure can support, Spidra's batch endpoint handles the concurrency, anti-bot bypass, and structured extraction in the cloud so you can process more URLs with less code and no infrastructure to manage.

Get started free at spidra.io. No credit card required.

Frequently asked questions

What is the difference between CONCURRENCY_BROWSER, CONCURRENCY_CONTEXT, and CONCURRENCY_PAGE?

CONCURRENCY_BROWSER gives each worker a completely isolated Chromium process with its own fingerprint. It is the most isolated option but also the most memory-intensive. CONCURRENCY_CONTEXT shares one browser process across multiple contexts, which reduces memory use but shares some browser-level fingerprint data. CONCURRENCY_PAGE is the lightest option, sharing a full browser context, including cookies and storage, across all pages. It is a good fit for scraping different domains and risky for scraping a single domain.
