Web scraping is one of those tasks that exposes the real costs of your language choice fast. Slow HTTP libraries, leaky memory, callback hell, race conditions when you try to parallelize: these problems compound quickly at scale. Rust sidesteps most of them by design.
This guide covers everything from basic HTML parsing with reqwest and scraper, through headless browser automation with chromiumoxide, to using the Spidra Rust SDK when you want to offload the infrastructure entirely. Whether you are building a one-shot data extraction script or a production crawler that runs continuously, there is a Rust approach that fits.
Why Rust for web scraping?
Three properties make Rust genuinely useful for scraping work, not just academically interesting.
- Performance. Rust HTTP clients benchmark close to raw C. When you are making thousands of concurrent requests, the overhead of a garbage-collected runtime adds up. Rust's zero-cost abstractions mean your hot path stays fast without tuning a JVM heap or a Python event loop.
- Memory safety without a GC. Long-running crawlers leak memory in languages that rely on reference counting or conservative collection. Rust's ownership model prevents whole categories of leaks at compile time. A crawler that runs for days without ballooning in memory is not something you have to fight for in Rust. It is the default.
- First-class async with Tokio. The tokio runtime gives you M:N threading that is production-grade and well-maintained. Spawning thousands of concurrent scraping tasks, rate-limiting with semaphores, and streaming results through channels are all patterns that feel natural once you get used to them.
Can Rust scrape JavaScript-rendered pages?
Yes. The options are roughly the same as in any other language: run a headless browser, call an external rendering API, or intercept the underlying API calls the JavaScript makes. Rust has solid bindings to Chromium through chromiumoxide, and external services like Spidra handle rendering transparently so your code never has to think about it.
The section on chromiumoxide below walks through the headless browser path. The Spidra SDK section shows the API approach.
The basics: reqwest and scraper
For static HTML, you need two crates:
- reqwest: async HTTP client, built on hyper and tokio
- scraper: CSS selector-based HTML parsing, similar to BeautifulSoup in Python
Add them to your project:
[dependencies]
reqwest = { version = "0.12", features = ["json"] }
scraper = "0.19"
tokio = { version = "1", features = ["full"] }Here is a complete working example that fetches a page and extracts all article titles:

use reqwest::Client;
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::builder()
        .user_agent("Mozilla/5.0 (compatible; my-scraper/1.0)")
        .build()?;

    let response = client
        .get("https://news.ycombinator.com")
        .send()
        .await?;

    let body = response.text().await?;
    let document = Html::parse_document(&body);
    let selector = Selector::parse(".titleline > a").unwrap();

    for element in document.select(&selector) {
        let title = element.text().collect::<String>();
        let href = element.value().attr("href").unwrap_or("");
        println!("{} → {}", title, href);
    }

    Ok(())
}

This is enough for many use cases: news aggregators, documentation sites, any server-rendered HTML. The Selector::parse call accepts standard CSS selectors. The .text() iterator collects all text nodes within the element.
Extracting structured data
Real scraping usually means you want structured records, not printed strings. Here is the same example returning a Vec of typed structs:

use reqwest::Client;
use scraper::{Html, Selector};
use serde::Serialize;

#[derive(Debug, Serialize)]
struct Story {
    title: String,
    url: String,
}

async fn scrape_hn(client: &Client) -> Result<Vec<Story>, Box<dyn std::error::Error>> {
    let body = client
        .get("https://news.ycombinator.com")
        .send()
        .await?
        .text()
        .await?;

    let document = Html::parse_document(&body);
    let title_sel = Selector::parse(".titleline > a").unwrap();

    let stories = document
        .select(&title_sel)
        .map(|el| Story {
            title: el.text().collect(),
            url: el.value().attr("href").unwrap_or("").to_string(),
        })
        .collect();

    Ok(stories)
}

Adding serde lets you serialize directly to JSON with serde_json::to_string_pretty(&stories)?, useful when you want to pipe results to another system.
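As a quick sketch of that, here is a small main that calls scrape_hn and prints the stories as JSON. It assumes serde (with the derive feature) and serde_json have been added to Cargo.toml alongside the dependencies above:

use reqwest::Client;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();

    // scrape_hn is the function defined above
    let stories = scrape_hn(&client).await?;

    // Serialize the typed records straight to pretty-printed JSON
    println!("{}", serde_json::to_string_pretty(&stories)?);

    Ok(())
}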
Real problems you will hit
Static HTML scraping works until it doesn't. Here are the wall-and-ladder moments.
JavaScript rendering
A growing percentage of the web renders content client-side. If you curl a page and see a skeleton with no data, or a loading spinner in the source, the actual content is injected by JavaScript after load. reqwest + scraper cannot help you here because they never execute JavaScript.
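One cheap diagnostic before reaching for a browser: fetch the raw HTML with the tools you already have and count matches for the selector you expect. Zero matches on a page that clearly shows data in a browser is a strong hint the content arrives via JavaScript. A minimal sketch, with a placeholder URL and selector:

use reqwest::Client;
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::new();
    let body = client
        .get("https://example.com/app") // placeholder: a page you suspect is JS-rendered
        .send()
        .await?
        .text()
        .await?;

    let document = Html::parse_document(&body);
    // Hypothetical selector for the data you expect to find
    let selector = Selector::parse(".listing-item").unwrap();

    let matches = document.select(&selector).count();
    if matches == 0 {
        println!("No matches in the raw HTML; the content is likely rendered by JavaScript.");
    } else {
        println!("Found {} elements in the static HTML.", matches);
    }

    Ok(())
}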
Anti-bot measures
Sites that care about scraping deploy CAPTCHAs, browser fingerprinting, IP rate limiting, and JavaScript challenge pages (Cloudflare, DataDome, etc.). A bare reqwest client with a spoofed user agent gets past the simplest defenses. Anything more sophisticated requires either rotating real residential proxies, a browser with realistic fingerprints, or an external service that handles this for you.
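For those simplest defenses, giving reqwest a browser-like set of default headers is usually enough. This is a sketch of that idea, not a fingerprinting bypass; the header values are just plausible browser defaults:

use reqwest::Client;
use reqwest::header::{HeaderMap, HeaderValue, ACCEPT, ACCEPT_LANGUAGE};

fn browser_like_client() -> Result<Client, reqwest::Error> {
    // Headers a real browser would send alongside the user agent
    let mut headers = HeaderMap::new();
    headers.insert(
        ACCEPT,
        HeaderValue::from_static("text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8"),
    );
    headers.insert(ACCEPT_LANGUAGE, HeaderValue::from_static("en-US,en;q=0.9"));

    Client::builder()
        .user_agent("Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0 Safari/537.36")
        .default_headers(headers)
        .build()
}

This raises the floor, nothing more; fingerprinting systems look at TLS and JavaScript signals that no header set can fake.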
Proxies at scale
If you are making thousands of requests per day to the same domain, you will get IP-blocked. Managing a proxy pool in Rust means building rotation logic, health-checking proxies, handling retry/backoff, and monitoring block rates. This is infrastructure work that has nothing to do with the data you actually want.
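If you do go down this road, reqwest can route a client through a proxy. A minimal round-robin sketch looks like this; the proxy URLs are placeholders, and this omits the health-checking and backoff a real pool needs:

use reqwest::{Client, Proxy};

// Build one client per proxy so requests can be spread across the pool.
fn proxied_clients(proxy_urls: &[&str]) -> Result<Vec<Client>, Box<dyn std::error::Error>> {
    proxy_urls
        .iter()
        .map(|url| -> Result<Client, Box<dyn std::error::Error>> {
            let client = Client::builder()
                .proxy(Proxy::all(*url)?) // route all traffic for this client through the proxy
                .build()?;
            Ok(client)
        })
        .collect()
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let clients = proxied_clients(&[
        "http://user:pass@proxy-1.example:8080",
        "http://user:pass@proxy-2.example:8080",
    ])?;

    for (i, url) in ["https://example.com/a", "https://example.com/b"].iter().enumerate() {
        // Naive round-robin; production code would also track block rates per proxy
        let client = &clients[i % clients.len()];
        let status = client.get(*url).send().await?.status();
        println!("{url} -> {status}");
    }

    Ok(())
}

Even this toy version hints at the real work: the pool needs health checks, per-proxy backoff, and block-rate tracking before it is useful in production.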
Rate limiting and concurrency
Rust makes it easy to launch 10,000 concurrent tasks. Most websites do not appreciate that. You need to implement rate limiting, usually with a semaphore or a token bucket. Here is a pattern:

use std::sync::Arc;
use tokio::sync::Semaphore;
use tokio::time::{sleep, Duration};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let semaphore = Arc::new(Semaphore::new(10)); // max 10 concurrent
    let urls = vec![
        "https://example.com/page/1",
        "https://example.com/page/2",
        // ...
    ];

    let mut handles = vec![];
    for url in urls {
        let sem = semaphore.clone();
        let handle = tokio::spawn(async move {
            let _permit = sem.acquire().await.unwrap();
            sleep(Duration::from_millis(200)).await; // 5 req/s per slot
            // fetch and parse `url` here
        });
        handles.push(handle);
    }

    for handle in handles {
        handle.await?;
    }

    Ok(())
}

Headless browsers in Rust
chromiumoxide is a Rust library that speaks the Chrome DevTools Protocol. It launches and controls a real Chromium instance, which means JavaScript executes, network requests fire, and you get the fully rendered DOM.
Add it:
[dependencies]
chromiumoxide = { version = "0.7", features = ["tokio", "async-std-runtime"] }
tokio = { version = "1", features = ["full"] }
futures = "0.3"Basic usage:

use chromiumoxide::Browser;
use chromiumoxide::browser::BrowserConfig;
use futures::StreamExt;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let (browser, mut handler) = Browser::launch(
        BrowserConfig::builder()
            .arg("--no-sandbox")
            .build()?,
    )
    .await?;

    // The handler drives the browser event loop
    tokio::spawn(async move {
        while let Some(h) = handler.next().await {
            if h.is_err() {
                break;
            }
        }
    });

    let page = browser.new_page("https://example.com").await?;

    // Wait for a specific element to confirm the page has rendered
    page.find_element("h1").await?;

    let content = page.content().await?;
    println!("{}", &content[..500]);

    // You can also evaluate JavaScript directly
    let title: String = page
        .evaluate("document.title")
        .await?
        .into_value()?;
    println!("Title: {}", title);

    Ok(())
}

chromiumoxide also supports intercepting network requests, injecting scripts, handling authentication flows, and taking screenshots. For most JavaScript-rendering problems, this is the right Rust-native tool.
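As a taste of that, a full-page screenshot from an already-open page looks roughly like this. The builder and method names are taken from the chromiumoxide 0.7 docs and the output path is arbitrary, so treat it as a sketch rather than gospel:

use chromiumoxide::page::{Page, ScreenshotParams};

// Capture a full-page screenshot of a page that is already loaded.
// Sketch against chromiumoxide 0.7; check the crate docs if names differ in your version.
async fn screenshot(page: &Page) -> Result<(), Box<dyn std::error::Error>> {
    page.save_screenshot(
        ScreenshotParams::builder().full_page(true).build(),
        "page.png", // hypothetical output path
    )
    .await?;
    Ok(())
}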
The cost is operational: Chromium consumes significant memory (roughly 100-200 MB per instance), startup is slow relative to a plain HTTP request, and you need Chromium installed in your environment. In a Docker container or CI pipeline this is manageable. In a distributed scraper running thousands of concurrent jobs, it becomes a resource problem.
Where DIY breaks down
Building your own scraping infrastructure feels productive at first. Then you hit a site with a sophisticated fingerprinting system and spend a week on it. You get IP-blocked and build a proxy rotator. You add retry logic. You add monitoring. You write tooling to detect when a site's layout changes and your selectors break.
None of this work produces the data you wanted. It produces infrastructure that supports getting data. For exploratory work, prototypes, or one-time extractions, the DIY path is fine. For production systems where the data itself is the product, the infrastructure is a distraction.
The specific failure modes:
- Selector rot: Sites redesign their HTML. Your .product-price-v2 selector stops working silently. You need change detection and alerting.
- CAPTCHA escalation: A site notices your traffic pattern and adds more aggressive challenges. You are now in an arms race.
- IP burn rate: Residential proxies cost money. You need pool management logic to avoid burning proxies on non-scraping requests.
- Scale limits: A headless browser pool that works for 1,000 pages/day needs significant re-architecture at 1,000,000 pages/day.
Using the Spidra Rust SDK
Spidra handles rendering, proxy rotation, CAPTCHA solving, and structured extraction. Your Rust code sends URLs and optionally a prompt or schema; you get back clean data.
Install:
cargo add spidra

Basic scrape

use spidra::{SpidraClient, types::scrape::ScrapeParams};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = SpidraClient::new("your-api-key");

    let mut params = ScrapeParams::new("https://example.com");
    params.prompt = Some("Extract the headline and summary".to_string());

    let result = client.scrape().run(&params).await?;
    println!("{:?}", result.content);

    Ok(())
}

The prompt field instructs the AI extraction layer. You get back the data you asked for, not raw HTML you still have to parse.
Batch scraping
When you have a list of URLs, batch mode runs them concurrently on Spidra's infrastructure:

use spidra::{SpidraClient, types::batch::BatchScrapeParams, types::scrape::ScrapeUrl};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = SpidraClient::new("your-api-key");

    let urls = vec![
        ScrapeUrl::new("https://example.com/product/1"),
        ScrapeUrl::new("https://example.com/product/2"),
        ScrapeUrl::new("https://example.com/product/3"),
    ];

    let response = client.batch().run(&BatchScrapeParams::new(urls)).await?;

    println!(
        "Completed: {} / Failed: {}",
        response.completed_count, response.failed_count
    );
    for item in &response.items {
        println!("{}: {:?}", item.url, item.result);
    }

    Ok(())
}

Crawling a site
For following links across a domain:

use spidra::{SpidraClient, types::crawl::CrawlParams};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = SpidraClient::new("your-api-key");

    let mut params = CrawlParams::new("https://docs.example.com");
    params.max_pages = Some(100);
    params.max_depth = Some(3);

    let pages = client.crawl().run(&params).await?;

    println!("Crawled {} pages", pages.len());
    for page in &pages {
        println!("  {} → {}", page.status, page.url);
    }

    Ok(())
}

Decision table: when to use each approach
| Scenario | Best approach |
|---|---|
| Static HTML, low volume, known structure | reqwest + scraper |
| JavaScript-rendered pages, low volume | chromiumoxide |
| JS-rendered pages, high volume | Spidra SDK |
| Sites with CAPTCHAs or aggressive anti-bot | Spidra SDK |
| Unknown page structure, need AI extraction | Spidra SDK |
| Crawling entire domains | Spidra SDK |
| Prototype or one-off extraction | reqwest + scraper |
| Need full browser interaction (login, forms) | chromiumoxide |
The honest split: reqwest + scraper is the right default for anything where the HTML is simple and static. chromiumoxide covers the JavaScript rendering gap when you want to stay entirely self-hosted. Spidra is the right choice when the operational cost of managing proxies, CAPTCHAs, and infrastructure exceeds the cost of an API.
Wrapping up
Rust is a genuinely good fit for web scraping. The async story with Tokio is mature, the HTTP and parsing libraries are production-ready, and you get memory safety that makes long-running crawlers reliable without tuning. reqwest and scraper handle most static HTML work. chromiumoxide handles JavaScript rendering. The Spidra Rust SDK handles the cases where the infrastructure problem is bigger than the scraping problem.
Pick the tool that matches what you are actually building. A one-off script and a production data pipeline have different requirements. Rust gives you good options at every level.
Spidra is a web scraping API with AI-powered extraction, proxy rotation, and CAPTCHA handling. Try it free at spidra.io.
