How many workers should I run in a puppeteer-cluster?

Start with 3 to 5. Each CONCURRENCY_BROWSER worker is a full Chromium process using 200 to 400 MB of RAM. Going above 5 workers on a typical server starts reducing performance rather than improving it because memory pressure causes the OS to swap. Monitor RAM usage and find the point where adding workers stops decreasing total job time.

Does puppeteer-cluster handle anti-bot protection?

No. It handles concurrency and task queuing. Anti-bot bypass, proxy rotation, and browser fingerprinting are separate concerns you need to handle yourself, typically by adding puppeteer-extra with stealth plugins and a proxy provider on top of your cluster setup.

What happens if a task fails in puppeteer-cluster?

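Errors thrown inside a queued task are not rethrown to your code. puppeteer-cluster reports them through the taskerror event, so without a listener a failure can pass unnoticed. Use cluster.on('taskerror', handler) to log or collect errors per task while the rest of the queue keeps running. Combine it with retryLimit and retryDelay in the launch config to automatically retry failed tasks before marking them as failed.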

How does Spidra handle URL lists larger than 50?

Each batch request handles up to 50 URLs in parallel. If you need a higher limit, reach out to us and we can increase it for you. You can also send multiple batch requests: for very large lists, loop through them in chunks of 50 and await each batch before sending the next. For full domain crawls where you do not know all the URLs upfront, use the crawl endpoint and let Spidra discover and process pages automatically.
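
The chunked loop described above could be sketched as follows. This is only an illustration: chunk, processInChunks, and the runBatch callback are hypothetical names, and runBatch is assumed to take an array of URLs and resolve to an array of per-URL items.

```javascript
// Split a list into consecutive chunks of at most `size` items.
function chunk(list, size) {
    const chunks = [];
    for (let i = 0; i < list.length; i += size) {
        chunks.push(list.slice(i, i + size));
    }
    return chunks;
}

// Send chunks one at a time, awaiting each batch before starting the next.
// `runBatch` is a hypothetical callback: (urls) => Promise<array of items>.
async function processInChunks(urls, runBatch, size = 50) {
    const items = [];
    for (const group of chunk(urls, size)) {
        const batchItems = await runBatch(group);
        items.push(...batchItems);
    }
    return items;
}
```

With the batch call used in this article, runBatch could be something like (group) => spidra.batch.run({ urls: group, prompt, output: 'json' }).then((batch) => batch.items).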

June 1, 2026 · 9 min read

How to use Puppeteer Cluster to scale up web scraping

Joel Olawanle

Running a single Puppeteer browser instance is fine for one page. Running one instance per URL is not. Each Chromium process uses 200 to 400 MB of RAM, and launching a fresh browser for every URL in a list burns through memory fast, adds startup latency to every request, and caps how much work you can do in parallel.

Browser clustering solves this by sharing a fixed pool of browser workers across a queue of URLs. Instead of spinning up a new browser for each task, you define how many workers you want running at once and let them pull from the queue continuously until everything is processed.

In this tutorial you will learn how puppeteer-cluster works, the three concurrency models it supports, how to set up basic and advanced cluster configurations, and where the self-managed approach hits its limits when you need to scale further.

What is browser clustering in Puppeteer?

Browser clustering in Puppeteer is a pattern for managing multiple browser workers that share a task queue and run concurrently. Each worker independently picks up a URL from the queue, processes it, and returns for the next one without waiting for other workers to finish.

Instead of scraping 12 URLs one at a time with a single browser, a cluster of 3 workers processes 3 URLs simultaneously. When one finishes, it picks up the next available URL. All 12 get done in roughly the time it takes to process 4 in sequence.
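
To make the pattern concrete, here is a minimal worker pool in plain JavaScript with no browser involved. It is not how puppeteer-cluster is implemented internally; runPool and its parameters are hypothetical names used only to show a fixed number of workers pulling from one shared queue.

```javascript
// Run `handler` over `items` with at most `workerCount` calls in flight.
async function runPool(items, workerCount, handler) {
    const results = new Array(items.length);
    let next = 0; // index of the next unclaimed item in the shared queue

    async function worker() {
        // Each worker keeps claiming items until the queue is empty.
        while (next < items.length) {
            const index = next++;
            results[index] = await handler(items[index]);
        }
    }

    const size = Math.min(workerCount, items.length);
    await Promise.all(Array.from({ length: size }, () => worker()));
    return results; // same order as `items`
}
```

With 12 items and 3 workers, each worker ends up handling about 4 items, which is where the "roughly 4 in sequence" estimate comes from when every item takes a similar amount of time.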

The puppeteer-cluster library handles this coordination. It manages worker creation, task queuing, retries, error handling, and cluster teardown so you do not have to build that infrastructure yourself.

Concurrency models in puppeteer-cluster

puppeteer-cluster supports three concurrency models. The right one depends on how isolated you need each worker to be.
CONCURRENCY_BROWSER runs multiple independent browser instances. Each worker is a completely separate Chromium process with its own cookies, user agent, and fingerprint. This is the most isolated option and the right choice when you need each worker to look like completely separate traffic, especially when pairing workers with different proxies.
CONCURRENCY_CONTEXT runs multiple browser contexts within a single browser instance. Workers share some browser resources like user agent but maintain separate cookie jars and storage. More memory-efficient than CONCURRENCY_BROWSER but carries higher detection risk on the same domain since contexts share fingerprint data. Assigning a different proxy per context helps reduce that risk.
CONCURRENCY_PAGE runs multiple pages within the same browser context. This is the most memory-efficient option. All pages share cookies, session storage, and caches. Best suited for scraping different domains where shared session data is not a problem. For the same domain it carries the highest detection risk since all pages are indistinguishable from each other.
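
As a rough way to encode these trade-offs, the helper below returns the name of a concurrency constant from two yes/no questions. pickConcurrency and its option names are hypothetical, and the rules are a simplification of the guidance above rather than anything the library prescribes.

```javascript
// Returns the name of a puppeteer-cluster concurrency constant.
// `separateFingerprints` and `sameDomain` are illustrative option names.
function pickConcurrency({ separateFingerprints = false, sameDomain = false } = {}) {
    if (separateFingerprints) {
        return 'CONCURRENCY_BROWSER'; // one Chromium process per worker
    }
    if (sameDomain) {
        return 'CONCURRENCY_CONTEXT'; // shared browser, separate cookies and storage
    }
    return 'CONCURRENCY_PAGE';        // shared context, lowest memory use
}
```

The returned name can then be looked up on the Cluster class, for example concurrency: Cluster[pickConcurrency({ sameDomain: true })].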

How to use Puppeteer Cluster

Step 1: Install dependencies

npm install puppeteer puppeteer-cluster

Step 2: Set up the Cluster

Import puppeteer-cluster and launch a cluster. This configuration runs 3 browser instances concurrently:

// npm install puppeteer puppeteer-cluster
const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: 3,
    });

    // ...
})();

Step 3: Define a task and queue URLs

Define the scraping logic as a task, then queue each URL. Workers pick up URLs from the queue as they become available:

const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: 3,
    });

    const URLs = [
        'https://www.scrapingcourse.com/ecommerce/',
        'https://www.scrapingcourse.com/ecommerce/page/2/',
        'https://www.scrapingcourse.com/ecommerce/page/3/',
        'https://www.scrapingcourse.com/ecommerce/page/4/',
        'https://www.scrapingcourse.com/ecommerce/page/5/',
        'https://www.scrapingcourse.com/ecommerce/page/6/',
    ];

    await cluster.task(async ({ page, data: url }) => {
        await page.goto(url);
        console.log(await page.title());
    });

    URLs.forEach((url) => cluster.queue(url));

    await cluster.idle();
    await cluster.close();
})();

// Output (order varies since tasks run concurrently)
Ecommerce Test Site to Learn Web Scraping - Page 3 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 2 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 5 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 4 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 6 - ScrapingCourse.com

The output order is non-deterministic because tasks run in parallel. All 6 URLs are processed by the 3 workers, each picking up the next URL as soon as it finishes the previous one.
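
If you need the results in your code rather than in the console, puppeteer-cluster also provides cluster.execute, which resolves with the task's return value. The sketch below assumes the task returns the data you want (for example, return page.title() instead of logging it); collectResults is a hypothetical helper name. Errors from execute are thrown to the caller rather than emitted as events, so the helper uses Promise.allSettled to keep one failure from discarding the rest.

```javascript
// Run every URL through the cluster and collect the task's return values.
// `cluster` is expected to be a puppeteer-cluster instance with a task set.
async function collectResults(cluster, urls) {
    const settled = await Promise.allSettled(
        urls.map((url) => cluster.execute(url))
    );
    // Results are returned in input order, even though tasks finish in any order.
    return settled.map((outcome, i) => ({
        url: urls[i],
        ok: outcome.status === 'fulfilled',
        value: outcome.status === 'fulfilled' ? outcome.value : undefined,
        error: outcome.status === 'rejected'
            ? String(outcome.reason?.message ?? outcome.reason)
            : undefined,
    }));
}
```

Call it after cluster.task(...) and before cluster.close(), for example const rows = await collectResults(cluster, URLs).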

Step 4: Queue different tasks for different URLs

When URLs require different scraping logic, pass a task function directly to cluster.queue instead of using a shared task:

const { Cluster } = require('puppeteer-cluster');

(async () => {
    const cluster = await Cluster.launch({
        concurrency: Cluster.CONCURRENCY_BROWSER,
        maxConcurrency: 3,
    });

    // task: extract product data
    const scrapeProducts = async ({ page, data: url }) => {
        await page.goto(url);
        const products = await page.$$eval('.product', (items) =>
            items.map((item) => ({
                name:  item.querySelector('.product-name')?.textContent.trim() || '',
                price: item.querySelector('.price')?.textContent.trim() || '',
            }))
        );
        console.log(products);
    };

    // task: extract page title only
    const scrapeTitle = async ({ page, data: url }) => {
        await page.goto(url);
        console.log(await page.title());
    };

    cluster.queue('https://www.scrapingcourse.com/ecommerce/', scrapeProducts);
    cluster.queue('https://www.scrapingcourse.com/ecommerce/page/2/', scrapeTitle);

    await cluster.idle();
    await cluster.close();
})();

// Output
[
    { name: 'Abominable Hoodie', price: '$69.00' },
    { name: 'Adrienne Trek Jacket', price: '$57.00' },
    // ...
    { name: 'Artemis Running Short', price: '$45.00' }
]
Ecommerce Test Site to Learn Web Scraping - Page 2 - ScrapingCourse.com

Each URL runs its own task function. The cluster still manages concurrency and queuing for both.
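
When the list of URLs is long, choosing the task by hand for each one gets tedious. One possible approach is a small routing helper that picks a task function by URL pattern. pickTask and the route list are hypothetical and not part of puppeteer-cluster.

```javascript
// Pick a task function for a URL from an ordered list of [pattern, task] routes.
// The first matching pattern wins; `fallback` is used when nothing matches.
function pickTask(url, routes, fallback) {
    for (const [pattern, task] of routes) {
        if (pattern.test(url)) {
            return task;
        }
    }
    return fallback;
}
```

With the two tasks above, the routes could be [[/\/ecommerce\/$/, scrapeProducts]] with scrapeTitle as the fallback, and each URL would be queued with cluster.queue(url, pickTask(url, routes, scrapeTitle)).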

Advanced cluster configuration

puppeteer-cluster exposes several configuration options inside Cluster.launch for controlling retry behavior, timeouts, and worker creation pacing:

const cluster = await Cluster.launch({
    concurrency: Cluster.CONCURRENCY_BROWSER,
    maxConcurrency: 3,
    retryLimit: 3,            // retry failed tasks up to 3 times
    retryDelay: 2000,         // wait 2 seconds between retries
    skipDuplicateUrls: true,  // ignore URLs already in the queue
    timeout: 60000,           // task timeout in ms (default 30s)
    workerCreationDelay: 100, // pause between creating each worker (ms)
});

Option	Description
`retryLimit`	Maximum retries per task before marking it failed
`retryDelay`	Milliseconds to wait between retry attempts
`skipDuplicateUrls`	Prevents the same URL being queued more than once
`timeout`	Maximum time allowed for a single task before it is killed
`workerCreationDelay`	Spacing between worker startups to avoid resource spikes

workerCreationDelay is worth paying attention to. Launching all workers simultaneously spikes memory and CPU at startup. Spacing them out by 100 to 200 ms gives the system time to stabilize between each browser launch.
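
It can help to work out what the retry and timeout settings imply for a single stubborn URL. The function below is a back-of-the-envelope estimate, not something the library exposes: worstCaseTaskMs is a hypothetical name, it ignores time spent waiting in the queue, and the default values are assumed to match the library's defaults.

```javascript
// Rough upper bound on the time one URL can consume across all attempts.
function worstCaseTaskMs({ timeout = 30000, retryLimit = 0, retryDelay = 0 } = {}) {
    const attempts = retryLimit + 1; // the first try plus each retry
    return attempts * timeout + retryLimit * retryDelay;
}

// The configuration above: 4 attempts of up to 60 s, plus 3 waits of 2 s.
console.log(worstCaseTaskMs({ timeout: 60000, retryLimit: 3, retryDelay: 2000 }));
// prints 246000
```

In other words, a page that always times out can tie up a worker for about four minutes in total under that configuration, which is worth keeping in mind when choosing timeout and retryLimit together.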

Error handling

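Errors thrown inside queued tasks are not rethrown to your script. puppeteer-cluster emits them as taskerror events, so without a listener failures can go unnoticed. Add an error handler to record failures per task while the rest of the queue keeps running: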

cluster.on('taskerror', (err, data) => {
    console.error(`Task failed for ${data}: ${err.message}`);
});

This fires for every task that throws or times out. Combined with retryLimit, it gives you full control over how failures are handled without manual try-catch inside every task function.
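
When retries are enabled, the listener also receives a third argument, a boolean indicating whether the job will be retried. A minimal sketch that records only final failures, so a URL that succeeds on a later attempt is not reported, might look like this. createFailureLog is a hypothetical helper name.

```javascript
// Build a 'taskerror' listener that records only final failures.
function createFailureLog() {
    const failures = [];
    const handler = (err, data, willRetry) => {
        if (willRetry) {
            return; // the cluster will requeue this job, so it is not final yet
        }
        failures.push({ data, message: err.message });
    };
    return { failures, handler };
}
```

Attach it with const log = createFailureLog(); cluster.on('taskerror', log.handler); and inspect log.failures after await cluster.idle().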

The limits of self-managed clustering

puppeteer-cluster is a well-designed library for what it does. The limits are not in the library. They are in what running browser clusters on local hardware actually costs.

Memory caps your concurrency. Each CONCURRENCY_BROWSER worker is a full Chromium process. Three workers at 300 MB each use roughly 900 MB of RAM before your task logic runs. Running more than 5 workers on a typical server starts degrading performance rather than improving it. The ceiling on how fast you can scrape is your hardware.
Startup latency adds up. Launching browser instances takes time. On a cold start with 5 workers, you are waiting several seconds before any scraping happens. For large URL lists this adds meaningful overhead.
Anti-bot bypass is your problem. puppeteer-cluster handles concurrency. It does not handle Cloudflare, DataDome, or any other protection. Adding proxy rotation, CAPTCHA solving, and browser fingerprinting on top of a cluster setup means building and maintaining a significant amount of additional infrastructure yourself.
The cluster still runs on one machine. Even a well-tuned local cluster is a single point of failure. No redundancy. No distributed processing. No auto-scaling when your URL list grows.
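
The memory limit above can be turned into a quick starting estimate for maxConcurrency. This is a sketch under stated assumptions: estimateMaxWorkers is a hypothetical helper, 400 MB per worker is the upper end of the range quoted in this article, the 1 GB reserve for the rest of the system is an arbitrary choice, and the cap of 5 follows the rule of thumb above. Treat the result as a first guess to validate by measuring.

```javascript
const os = require('node:os');

// Estimate how many CONCURRENCY_BROWSER workers fit in the given free memory.
function estimateMaxWorkers(freeMb, { perWorkerMb = 400, reserveMb = 1024, cap = 5 } = {}) {
    const usableMb = Math.max(0, freeMb - reserveMb); // leave headroom for the OS and Node
    const fit = Math.floor(usableMb / perWorkerMb);
    return Math.max(1, Math.min(cap, fit));           // always at least 1, never above the cap
}

const freeMb = os.freemem() / (1024 * 1024);
console.log(`Suggested maxConcurrency: ${estimateMaxWorkers(freeMb)}`);
```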

For processing dozens of URLs in a script, puppeteer-cluster is a clean and practical solution. For production pipelines processing hundreds or thousands of URLs reliably, these limits matter.

Scaling further with Spidra's batch scraping

When local cluster limits become the bottleneck, the alternative is to move the browser infrastructure out of your machine entirely and into a managed API that handles concurrency, proxies, and anti-bot bypass as a service.

Spidra's batch scraping processes up to 50 URLs in parallel per request, runs each one through a real browser in the cloud, handles anti-bot bypass automatically, and returns AI-extracted structured data rather than raw HTML that still needs parsing.

Here is the same multi-page product scraping task using Spidra's Node.js SDK instead of puppeteer-cluster:

npm install spidra-js

import { SpidraClient } from 'spidra-js';

const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY });

const urls = [
    'https://www.scrapingcourse.com/ecommerce/',
    'https://www.scrapingcourse.com/ecommerce/page/2/',
    'https://www.scrapingcourse.com/ecommerce/page/3/',
    'https://www.scrapingcourse.com/ecommerce/page/4/',
    'https://www.scrapingcourse.com/ecommerce/page/5/',
    'https://www.scrapingcourse.com/ecommerce/page/6/',
];

const batch = await spidra.batch.run({
    urls,
    prompt: 'Extract all product names and prices',
    output: 'json',
});

for (const item of batch.items) {
    if (item.status === 'completed') {
        console.log(`${item.url}:`, item.result.content);
    } else {
        console.error(`Failed: ${item.url} — ${item.error}`);
    }
}

// Output (each page processed in parallel)
"https://www.scrapingcourse.com/ecommerce/": [
    { "name": "Abominable Hoodie", "price": "$69.00" },
    { "name": "Adrienne Trek Jacket", "price": "$57.00" }
],
"https://www.scrapingcourse.com/ecommerce/page/2/": [
    { "name": "Beaumont Summit Kit", "price": "$36.00" }
]

All 6 pages run in parallel in the cloud. No browser to launch. No memory to manage. No selectors to write. Each item in the response has its own status so you can see exactly which ones completed and which ones failed.
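
Because every item carries its own status, failed URLs can be separated out and resubmitted. The helper below is a sketch that assumes items have the same shape as in the loop above (status and url fields); partitionItems is a hypothetical name, not part of the SDK.

```javascript
// Split batch items into successes and the URLs that need another attempt.
function partitionItems(items) {
    const completed = [];
    const failedUrls = [];
    for (const item of items) {
        if (item.status === 'completed') {
            completed.push(item);
        } else {
            failedUrls.push(item.url);
        }
    }
    return { completed, failedUrls };
}
```

After a batch finishes, const { completed, failedUrls } = partitionItems(batch.items) gives you a list you can pass as urls to a follow-up batch request.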

For protected pages, add useProxy: true and anti-bot bypass is handled automatically:

const batch = await spidra.batch.run({
    urls,
    prompt: 'Extract all product names and prices',
    output: 'json',
    useProxy: true,
    proxyCountry: 'us',
});

No change to the rest of the code. The same batch request works on open pages and protected ones.

For even larger jobs where you need to crawl an entire domain rather than a known list of URLs, Spidra's crawl endpoint discovers pages automatically:

const job = await spidra.crawl.run({
    baseUrl: 'https://www.scrapingcourse.com/ecommerce/',
    crawlInstruction: 'Follow all paginated product pages',
    transformInstruction: 'Extract all product names and prices from each page',
    maxPages: 50,
});

for (const page of job.result) {
    console.log(page.url, page.data);
}

puppeteer-cluster vs. Spidra batch

Feature	puppeteer-cluster	Spidra Batch
Concurrency model	3 models: browser, context, page	Parallel cloud workers
Max concurrent workers	Limited by local RAM	Up to 50 URLs per request
Browser infrastructure	You manage it	Fully managed
Anti-bot bypass	Not included	Built in, automatic
Proxy rotation	Not included	Built in, 50 countries
Structured output	Raw HTML, you write the parser	AI extraction, optional JSON schema
Retry logic	`retryLimit` + `retryDelay`	Automatic per item
Error handling	`taskerror` event	Per-item status in response
Scales across machines	No, single node	Yes, cloud infrastructure
Best for	Medium-scale local scraping	Production pipelines, protected sites

Conclusion

puppeteer-cluster is the right tool for running a Puppeteer scraper across multiple pages concurrently without launching a separate browser for each URL. The concurrency model options give you control over the isolation and memory trade-offs, and the advanced configuration handles retries, timeouts, and duplicate skipping cleanly.

The ceiling is hardware. Memory limits how many workers you can run, startup latency adds overhead at scale, and everything above the clustering layer including proxies, anti-bot bypass, and fingerprinting is still your responsibility to build and maintain.

When you need to go beyond what local infrastructure can support, Spidra's batch endpoint handles the concurrency, anti-bot bypass, and structured extraction in the cloud so you can process more URLs with less code and no infrastructure to manage.

Get started free at spidra.io. No credit card required.

Frequently asked questions

CONCURRENCY_BROWSER gives each worker a completely isolated Chromium process with its own fingerprint. Most isolated but most memory-intensive. CONCURRENCY_CONTEXT shares one browser process across multiple contexts, reducing memory but sharing some browser-level fingerprint data. CONCURRENCY_PAGE is the lightest option, sharing a full browser context including cookies and storage across all pages. Right for different domains, risky for the same domain.

For most production scraping use cases, Spidra can replace puppeteer-cluster. Its batch endpoint processes up to 50 URLs in parallel per request, handles JavaScript rendering, anti-bot bypass, and proxy rotation automatically, and returns structured JSON without you writing a parser. If you have very specific browser control requirements that an API cannot meet, puppeteer-cluster on a well-resourced server is still a valid option.
