Running a single Puppeteer browser instance is fine for one page. Running one instance per URL is not. Each Chromium process typically uses 200 to 400 MB of RAM, so launching a fresh browser for every URL in a list burns through memory fast, adds startup latency to every request, and caps how much work you can do in parallel.
Browser clustering solves this by sharing a fixed pool of browser workers across a queue of URLs. Instead of spinning up a new browser for each task, you define how many workers you want running at once and let them pull from the queue continuously until everything is processed.
In this tutorial you will learn how puppeteer-cluster works, the three concurrency models it supports, how to set up basic and advanced cluster configurations, and where the self-managed approach hits its limits when you need to scale further.
What is browser clustering in Puppeteer?
Browser clustering in Puppeteer is a pattern for managing multiple browser workers that share a task queue and run concurrently. Each worker independently picks up a URL from the queue, processes it, and returns for the next one without waiting for other workers to finish.
Instead of scraping 12 URLs one at a time with a single browser, a cluster of 3 workers processes 3 URLs simultaneously. When one finishes, it picks up the next available URL. All 12 get done in roughly the time it takes to process 4 in sequence.
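The arithmetic can be checked with a small simulation. This is only a sketch: it assumes every URL takes the same time to process, which real pages will not, and `simulateMakespan` is a hypothetical helper rather than anything puppeteer-cluster provides.

```javascript
// Hypothetical helper: models N workers pulling tasks from a shared queue.
// durations: how many time units each task takes, in queue order.
// Returns the elapsed time until the last task finishes.
function simulateMakespan(durations, workerCount) {
  // freeAt[i] is the moment worker i becomes idle again.
  const freeAt = new Array(workerCount).fill(0);
  for (const duration of durations) {
    // The next queued task goes to whichever worker frees up first.
    const nextWorker = freeAt.indexOf(Math.min(...freeAt));
    freeAt[nextWorker] += duration;
  }
  return Math.max(...freeAt);
}

const twelveEqualTasks = new Array(12).fill(1);
console.log(simulateMakespan(twelveEqualTasks, 1)); // 12 units, one at a time
console.log(simulateMakespan(twelveEqualTasks, 3)); // 4 units with 3 workers
```

With uneven task durations the speedup is usually less clean, because the slowest task assigned last determines when the whole run ends.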
The puppeteer-cluster library handles this coordination. It manages worker creation, task queuing, retries, error handling, and cluster teardown so you do not have to build that infrastructure yourself.
Concurrency models in puppeteer-cluster
puppeteer-cluster supports three concurrency models. The right one depends on how isolated you need each worker to be.
- CONCURRENCY_BROWSER runs multiple independent browser instances. Each worker is a completely separate Chromium process with its own cookies, user agent, and fingerprint. This is the most isolated option and the right choice when you need each worker to look like completely separate traffic, especially when pairing workers with different proxies.
- CONCURRENCY_CONTEXT runs multiple browser contexts within a single browser instance. Workers share some browser resources like user agent but maintain separate cookie jars and storage. More memory-efficient than CONCURRENCY_BROWSER but carries higher detection risk on the same domain since contexts share fingerprint data. Assigning a different proxy per context helps reduce that risk.
- CONCURRENCY_PAGE runs multiple pages within the same browser context. This is the most memory-efficient option. All pages share cookies, session storage, and caches. Best suited for scraping different domains where shared session data is not a problem. For the same domain it carries the highest detection risk since all pages are indistinguishable from each other.
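The choice between the three models can be written down as a small decision function. This is a sketch of the reasoning above, not a rule from the library: `chooseConcurrency` and its two option names are hypothetical, and only the three `Cluster.CONCURRENCY_*` constants come from puppeteer-cluster.

```javascript
// Hypothetical helper: maps isolation needs to one of the three
// puppeteer-cluster concurrency constants. The Cluster class is passed in
// so the decision logic stays separate from the library import.
function chooseConcurrency(Cluster, { separateProcess = false, separateCookies = false } = {}) {
  if (separateProcess) return Cluster.CONCURRENCY_BROWSER; // one Chromium per worker
  if (separateCookies) return Cluster.CONCURRENCY_CONTEXT; // one context per worker
  return Cluster.CONCURRENCY_PAGE; // shared context, lowest memory use
}

// Usage (assumes puppeteer-cluster is installed):
// const { Cluster } = require('puppeteer-cluster');
// const cluster = await Cluster.launch({
//   concurrency: chooseConcurrency(Cluster, { separateCookies: true }),
//   maxConcurrency: 3,
// });
```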
How to use Puppeteer Cluster
Step 1: Install dependencies
npm install puppeteer puppeteer-cluster
Step 2: Set up the Cluster
Import puppeteer-cluster and launch a cluster. This configuration runs 3 browser instances concurrently:
// npm install puppeteer puppeteer-cluster
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_BROWSER,
maxConcurrency: 3,
});
// ...
})();
Step 3: Define a task and queue URLs
Define the scraping logic as a task, then queue each URL. Workers pick up URLs from the queue as they become available:
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_BROWSER,
maxConcurrency: 3,
});
const URLs = [
'https://www.scrapingcourse.com/ecommerce/',
'https://www.scrapingcourse.com/ecommerce/page/2/',
'https://www.scrapingcourse.com/ecommerce/page/3/',
'https://www.scrapingcourse.com/ecommerce/page/4/',
'https://www.scrapingcourse.com/ecommerce/page/5/',
'https://www.scrapingcourse.com/ecommerce/page/6/',
];
await cluster.task(async ({ page, data: url }) => {
await page.goto(url);
console.log(await page.title());
});
URLs.forEach((url) => cluster.queue(url));
await cluster.idle();
await cluster.close();
})();
# Output (order varies since tasks run concurrently)
Ecommerce Test Site to Learn Web Scraping - Page 3 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 2 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 5 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 4 - ScrapingCourse.com
Ecommerce Test Site to Learn Web Scraping - Page 6 - ScrapingCourse.com
The output order is non-deterministic because tasks run in parallel. All 6 URLs were processed by 3 workers.
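Logging is enough for a demo, but a scraper usually needs the values back. One option is `cluster.execute`, which returns a promise that resolves with whatever the task function returns. The sketch below assumes the shared task returns a value (for example `return page.title()`); `collectResults` is a hypothetical helper. According to the library's README, tasks run through `execute` reject their promise on failure rather than emitting `taskerror`, and they ignore `retryLimit`, so failures are captured per URL here instead.

```javascript
// Hypothetical helper: runs every URL through the cluster's shared task and
// gathers the return values. `cluster` is anything with an execute(data)
// method that returns a promise, such as a puppeteer-cluster instance.
async function collectResults(cluster, urls) {
  const settled = await Promise.allSettled(urls.map((url) => cluster.execute(url)));
  return settled.map((outcome, index) =>
    outcome.status === 'fulfilled'
      ? { url: urls[index], ok: true, value: outcome.value }
      : { url: urls[index], ok: false, error: String(outcome.reason) }
  );
}

// Usage (inside the async function from Step 3):
// const results = await collectResults(cluster, URLs);
// console.log(results.filter((r) => r.ok).map((r) => r.value));
```

Results come back in the same order as the input list, even though the pages themselves finish in arbitrary order.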
Step 4: Queue different tasks for different URLs
When URLs require different scraping logic, pass a task function directly to cluster.queue instead of using a shared task:
const { Cluster } = require('puppeteer-cluster');
(async () => {
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_BROWSER,
maxConcurrency: 3,
});
// task: extract product data
const scrapeProducts = async ({ page, data: url }) => {
await page.goto(url);
const products = await page.$$eval('.product', (items) =>
items.map((item) => ({
name: item.querySelector('.product-name')?.textContent.trim() || '',
price: item.querySelector('.price')?.textContent.trim() || '',
}))
);
console.log(products);
};
// task: extract page title only
const scrapeTitle = async ({ page, data: url }) => {
await page.goto(url);
console.log(await page.title());
};
cluster.queue('https://www.scrapingcourse.com/ecommerce/', scrapeProducts);
cluster.queue('https://www.scrapingcourse.com/ecommerce/page/2/', scrapeTitle);
await cluster.idle();
await cluster.close();
})();
// Output
[
{ name: 'Abominable Hoodie', price: '$69.00' },
{ name: 'Adrienne Trek Jacket', price: '$57.00' },
// ...
{ name: 'Artemis Running Short', price: '$45.00' }
]
Ecommerce Test Site to Learn Web Scraping - Page 2 - ScrapingCourse.com
Each URL runs its own task function. The cluster still manages concurrency and queuing for both.
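With more than a handful of URLs, choosing the task by hand for each `cluster.queue` call gets tedious. A possible approach is to match each URL against a list of patterns. `queueByRoute` and the route shape are hypothetical; the only library call used is `cluster.queue(url, taskFunction)`.

```javascript
// Hypothetical helper: picks a task function per URL from a list of
// { pattern, task } routes and queues it. The first matching route wins;
// URLs that match nothing go to fallbackTask.
function queueByRoute(cluster, urls, routes, fallbackTask) {
  for (const url of urls) {
    const route = routes.find(({ pattern }) => pattern.test(url));
    cluster.queue(url, route ? route.task : fallbackTask);
  }
}

// Usage with the two tasks from Step 4:
// queueByRoute(
//   cluster,
//   ['https://www.scrapingcourse.com/ecommerce/', 'https://www.scrapingcourse.com/ecommerce/page/2/'],
//   [{ pattern: /\/page\/\d+\/$/, task: scrapeTitle }],
//   scrapeProducts
// );
```

Avoid the `g` flag on route patterns: a global regex keeps state between `test` calls and would match inconsistently across URLs.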
Advanced cluster configuration
puppeteer-cluster exposes several configuration options inside Cluster.launch for controlling retry behavior, timeouts, and worker creation pacing:
const cluster = await Cluster.launch({
concurrency: Cluster.CONCURRENCY_BROWSER,
maxConcurrency: 3,
retryLimit: 3, // retry failed tasks up to 3 times
retryDelay: 2000, // wait 2 seconds between retries
skipDuplicateUrls: true, // ignore URLs already in the queue
timeout: 60000, // task timeout in ms (default 30s)
workerCreationDelay: 100, // pause between creating each worker (ms)
});
| Option | Description |
|---|---|
| retryLimit | Maximum retries per task before marking it failed |
| retryDelay | Milliseconds to wait between retry attempts |
| skipDuplicateUrls | Prevents the same URL being queued more than once |
| timeout | Maximum time allowed for a single task before it is killed |
| workerCreationDelay | Spacing between worker startups to avoid resource spikes |
workerCreationDelay is worth paying attention to. Launching all workers simultaneously spikes memory and CPU at startup. Spacing them out by 100 to 200 ms gives the system time to stabilize between each browser launch.
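These options interact, so it helps to work out what they imply in the worst case. The two functions below are rough upper-bound estimates under simplifying assumptions (every attempt runs to the full timeout, and time spent waiting in the queue is ignored); they are hypothetical helpers, not values reported by puppeteer-cluster.

```javascript
// Longest a single URL can stay in flight: every attempt runs to the
// timeout, and each retry waits retryDelay first.
function worstCaseTaskMs({ timeout, retryLimit, retryDelay }) {
  const attempts = retryLimit + 1; // first try plus retries
  return attempts * timeout + retryLimit * retryDelay;
}

// Time until the last worker starts launching when launches are spaced out.
function staggeredStartupMs({ maxConcurrency, workerCreationDelay }) {
  return Math.max(0, maxConcurrency - 1) * workerCreationDelay;
}

// Same values as the configuration above.
const options = {
  maxConcurrency: 3,
  retryLimit: 3,
  retryDelay: 2000,
  timeout: 60000,
  workerCreationDelay: 100,
};
console.log(worstCaseTaskMs(options)); // 246000 ms, a little over 4 minutes
console.log(staggeredStartupMs(options)); // 200 ms
```

A single stubborn URL can therefore tie up worker time for minutes with these settings, which is worth keeping in mind before raising `timeout` and `retryLimit` together.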
Error handling
A queued task that throws or times out does not stop the cluster, but unless you listen for the failure it can pass unnoticed. Add a taskerror handler to record failures per task while the rest of the queue keeps running:
cluster.on('taskerror', (err, data) => {
console.error(`Task failed for ${data}: ${err.message}`);
});
This fires for every task that throws or times out. Combined with retryLimit, it gives you full control over how failures are handled without manual try-catch inside every task function.
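The `taskerror` handler also receives a third argument that indicates whether the task will be retried. That makes it possible to separate temporary failures from URLs that have run out of retries. The sketch below is one way to use it; `trackFailures` is a hypothetical helper, and it accepts any Node.js `EventEmitter` that emits `taskerror` with the same `(error, data, willRetry)` arguments.

```javascript
// Sketch: record only the tasks that have run out of retries.
function trackFailures(cluster) {
  const failed = [];
  cluster.on('taskerror', (err, data, willRetry) => {
    if (willRetry) {
      // The cluster will requeue this task, so just note the attempt.
      console.warn(`Retrying ${data}: ${err.message}`);
    } else {
      failed.push({ data, message: err.message });
    }
  });
  return failed; // filled in as the cluster runs
}

// Usage:
// const failed = trackFailures(cluster);
// ...queue URLs...
// await cluster.idle();
// console.log(`${failed.length} URLs never succeeded`, failed);
```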
The limits of self-managed clustering
puppeteer-cluster is a well-designed library for what it does. The limits are not in the library. They are in what running browser clusters on local hardware actually costs.
- Memory caps your concurrency. Each CONCURRENCY_BROWSER worker is a full Chromium process. Three workers at 300 MB each use nearly 1 GB of RAM before your task logic runs. Running more than 5 workers on a typical server starts degrading performance rather than improving it. The ceiling on how fast you can scrape is your hardware.
- Startup latency adds up. Launching browser instances takes time. On a cold start with 5 workers, you are waiting several seconds before any scraping happens. For large URL lists this adds meaningful overhead.
- Anti-bot bypass is your problem. puppeteer-cluster handles concurrency. It does not handle Cloudflare, DataDome, or any other protection. Adding proxy rotation, CAPTCHA solving, and browser fingerprinting on top of a cluster setup means building and maintaining a significant amount of additional infrastructure yourself.
- The cluster still runs on one machine. Even a well-tuned local cluster is a single point of failure. No redundancy. No distributed processing. No auto-scaling when your URL list grows.
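The memory ceiling in the first point can be turned into a rough sizing rule for `maxConcurrency`. This is a back-of-the-envelope sketch: `maxWorkersForMemory` is a hypothetical helper, and the 300 MB per worker and 512 MB reserve are assumptions to tune against your own pages.

```javascript
const os = require('os');

// Hypothetical sizing helper. perWorkerMb and reserveMb are assumed values;
// 300 MB matches the per-worker figure used in the list above.
function maxWorkersForMemory(availableMb, perWorkerMb = 300, reserveMb = 512) {
  const usableMb = availableMb - reserveMb; // leave headroom for Node and the OS
  return Math.max(1, Math.floor(usableMb / perWorkerMb));
}

const freeMb = os.freemem() / (1024 * 1024);
console.log(`Suggested maxConcurrency: ${maxWorkersForMemory(freeMb)}`);
console.log(maxWorkersForMemory(2048)); // 5 workers fit in 2 GB with these defaults
```

Treat the result as a starting point and watch actual memory use under load, since heavy pages can push a single worker well past the assumed figure.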
For processing dozens of URLs in a script, puppeteer-cluster is a clean and practical solution. For production pipelines processing hundreds or thousands of URLs reliably, these limits matter.
Scaling further with Spidra's batch scraping
When local cluster limits become the bottleneck, the alternative is to move the browser infrastructure out of your machine entirely and into a managed API that handles concurrency, proxies, and anti-bot bypass as a service.
Spidra's batch scraping processes up to 50 URLs in parallel per request, runs each one through a real browser in the cloud, handles anti-bot bypass automatically, and returns AI-extracted structured data rather than raw HTML that still needs parsing.
Here is the same multi-page product scraping task using Spidra's Node.js SDK instead of puppeteer-cluster:
npm install spidra-js
import { SpidraClient } from 'spidra-js';
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY });
const urls = [
'https://www.scrapingcourse.com/ecommerce/',
'https://www.scrapingcourse.com/ecommerce/page/2/',
'https://www.scrapingcourse.com/ecommerce/page/3/',
'https://www.scrapingcourse.com/ecommerce/page/4/',
'https://www.scrapingcourse.com/ecommerce/page/5/',
'https://www.scrapingcourse.com/ecommerce/page/6/',
];
const batch = await spidra.batch.run({
urls,
prompt: 'Extract all product names and prices',
output: 'json',
});
for (const item of batch.items) {
if (item.status === 'completed') {
console.log(`${item.url}:`, item.result.content);
} else {
console.error(`Failed: ${item.url} — ${item.error}`);
}
}
// Output (each page processed in parallel)
"https://www.scrapingcourse.com/ecommerce/": [
{ "name": "Abominable Hoodie", "price": "$69.00" },
{ "name": "Adrienne Trek Jacket", "price": "$57.00" }
],
"https://www.scrapingcourse.com/ecommerce/page/2/": [
{ "name": "Beaumont Summit Kit", "price": "$36.00" }
]
All 6 pages run in parallel in the cloud. No browser to launch. No memory to manage. No selectors to write. Each item in the response has its own status so you can see exactly which ones completed and which ones failed.
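Because a batch request takes up to 50 URLs, a longer list has to be split before it is submitted. The sketch below only does the splitting and does not call the Spidra SDK; `chunkUrls` is a hypothetical helper and the example.com URLs are placeholders.

```javascript
// Hypothetical helper: splits a URL list into groups of at most `size`,
// so each group fits in one batch request. It makes no API calls itself.
function chunkUrls(urls, size = 50) {
  if (!Number.isInteger(size) || size < 1) {
    throw new RangeError('size must be a positive integer');
  }
  const chunks = [];
  for (let start = 0; start < urls.length; start += size) {
    chunks.push(urls.slice(start, start + size));
  }
  return chunks;
}

// Placeholder URLs for illustration only.
const manyUrls = Array.from({ length: 120 }, (_, i) => `https://example.com/page/${i + 1}`);
console.log(chunkUrls(manyUrls).map((group) => group.length)); // [ 50, 50, 20 ]
```

Each group can then be passed as the `urls` value of its own batch request.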
For protected pages, add useProxy: true and anti-bot bypass is handled automatically:
const batch = await spidra.batch.run({
urls,
prompt: 'Extract all product names and prices',
output: 'json',
useProxy: true,
proxyCountry: 'us',
});
No change to the rest of the code. The same batch request works on open pages and protected ones.
For even larger jobs where you need to crawl an entire domain rather than a known list of URLs, Spidra's crawl endpoint discovers pages automatically:
const job = await spidra.crawl.run({
baseUrl: 'https://www.scrapingcourse.com/ecommerce/',
crawlInstruction: 'Follow all paginated product pages',
transformInstruction: 'Extract all product names and prices from each page',
maxPages: 50,
});
for (const page of job.result) {
console.log(page.url, page.data);
}
puppeteer-cluster vs. Spidra batch
| | puppeteer-cluster | Spidra Batch |
|---|---|---|
| Concurrency model | 3 models: browser, context, page | Parallel cloud workers |
| Max concurrent workers | Limited by local RAM | Up to 50 URLs per request |
| Browser infrastructure | You manage it | Fully managed |
| Anti-bot bypass | Not included | Built in, automatic |
| Proxy rotation | Not included | Built in, 50 countries |
| Structured output | Raw HTML, you write the parser | AI extraction, optional JSON schema |
| Retry logic | retryLimit + retryDelay | Automatic per item |
| Error handling | taskerror event | Per-item status in response |
| Scales across machines | No, single node | Yes, cloud infrastructure |
| Best for | Medium-scale local scraping | Production pipelines, protected sites |
Conclusion
puppeteer-cluster is the right tool for running a Puppeteer scraper across multiple pages concurrently without launching a separate browser for each URL. The concurrency model options give you control over the isolation and memory trade-offs, and the advanced configuration handles retries, timeouts, and duplicate skipping cleanly.
The ceiling is hardware. Memory limits how many workers you can run, startup latency adds overhead at scale, and everything above the clustering layer including proxies, anti-bot bypass, and fingerprinting is still your responsibility to build and maintain.
When you need to go beyond what local infrastructure can support, Spidra's batch endpoint handles the concurrency, anti-bot bypass, and structured extraction in the cloud so you can process more URLs with less code and no infrastructure to manage.
Get started free at spidra.io. No credit card required.
