April 30, 2026 · 7 min read

AI-powered web scraping with Rust

Elijah Asaolu

Traditional web scraping treats extraction as a parsing problem: find the right CSS selector, pull the text, repeat. That model works until the site redesigns, adds dynamic loading, or starts obfuscating class names. In 2026, AI-powered scraping replaces selectors with natural language prompts and returns structured output constrained to a schema you define, a better fit for LLM pipelines, agent workflows, and any system that needs typed data rather than raw HTML.

This guide covers how to extract structured data from websites in Rust using AI: the shift from selector-based to prompt-based extraction, JSON schema output for LLM pipelines, wiring scraped data into Rust AI agents, and production patterns for async batch pipelines.

Quick answer: To extract structured data from websites in Rust using AI, describe what you want in plain English, define a JSON schema for the output shape, and let an AI scraping layer handle parsing and normalization. The Spidra Rust SDK (cargo add spidra) exposes this as a single async call that returns a typed result.

How AI changes web scraping

Selector-based scrapers are tightly coupled to page structure. A div.product-price > span.amount selector breaks the moment a developer renames a class or restructures the DOM. Maintaining dozens of these across different sites is a significant ongoing cost.

LLMs change the equation. Instead of encoding page structure into your code, you describe the data you want in plain English. The model reads the rendered content, understands context, and returns structured output regardless of how the underlying HTML is organized.

For Rust AI agent scraping, this matters because:

  • Typed pipelines. When your scraper returns a JSON object that maps to a Rust struct, the compiler catches shape mismatches before they reach production.
  • Resilience to layout changes. Prompt-based scrapers degrade gracefully. A redesign that breaks a CSS selector often has no impact on a well-written prompt.
  • Less code to maintain. Replacing fifty selectors with one prompt eliminates a whole class of breakage.

The old way vs the new way

Here is the same extraction task done both ways: pulling job listing details from a page.

Selector-based (old way)

use reqwest::Client;
use scraper::{Html, Selector};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = Client::builder()
        .user_agent("Mozilla/5.0")
        .build()?;

    let html = client
        .get("https://jobs.example.com/senior-engineer")
        .send()
        .await?
        .text()
        .await?;

    let document = Html::parse_document(&html);

    // Fragile: breaks on any DOM change
    let title_sel = Selector::parse("h1.job-title").unwrap();
    let company_sel = Selector::parse("span.company-name").unwrap();
    let salary_sel = Selector::parse("div.salary-range > span.value").unwrap();

    let title = document.select(&title_sel)
        .next()
        .map(|e| e.text().collect::<String>())
        .unwrap_or_default();

    let company = document.select(&company_sel)
        .next()
        .map(|e| e.text().collect::<String>())
        .unwrap_or_default();

    let salary = document.select(&salary_sel)
        .next()
        .map(|e| e.text().collect::<String>())
        .unwrap_or_default();

    println!("{} at {} โ€” {}", title, company, salary);
    Ok(())
}

This works until the site changes. And sites change.

Prompt-based with JSON schema (new way)

use spidra::{SpidraClient, types::scrape::{ScrapeParams, OutputFormat}};
use serde::Deserialize;
use serde_json::json;

#[derive(Deserialize, Debug)]
struct JobListing {
    title: String,
    company: String,
    location: Option<String>,
    salary_min: Option<f64>,
    salary_max: Option<f64>,
    skills: Vec<String>,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = SpidraClient::new("your-api-key");

    let mut params = ScrapeParams::new("https://jobs.example.com/senior-engineer");
    params.prompt = Some("Extract the job title, company name, location, salary range, and required skills".to_string());
    params.output_format = Some(OutputFormat::Json);
    params.schema = Some(json!({
        "type": "object",
        "required": ["title", "company"],
        "properties": {
            "title": { "type": "string" },
            "company": { "type": "string" },
            "location": { "type": ["string", "null"] },
            "salary_min": { "type": ["number", "null"] },
            "salary_max": { "type": ["number", "null"] },
            "skills": {
                "type": "array",
                "items": { "type": "string" }
            }
        }
    }));

    let result = client.scrape().run(&params).await?;

    // Deserialize directly into your Rust struct
    let listing: JobListing = serde_json::from_value(result.data.unwrap())?;
    println!("{} at {} ({:?})", listing.title, listing.company, listing.location);
    println!("Skills: {:?}", listing.skills);

    Ok(())
}

The schema enforces output shape. Fields in required always appear, as null if the data is not found. Optional fields are omitted when unavailable. Your Rust struct maps directly to the schema, so deserialization is clean.
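To see that contract in action, here is a small standalone sketch. The JSON literal is a hypothetical response shaped like the schema above, not real Spidra output; it just shows how null fields land in Option<T>:

use serde::Deserialize;
use serde_json::json;

#[derive(Deserialize, Debug)]
struct JobListing {
    title: String,
    company: String,
    location: Option<String>,
    salary_min: Option<f64>,
    salary_max: Option<f64>,
    skills: Vec<String>,
}

fn main() -> Result<(), serde_json::Error> {
    // Hypothetical response: required fields present, nullable fields as null.
    let response = json!({
        "title": "Senior Engineer",
        "company": "Example Corp",
        "location": null,
        "salary_min": 140000.0,
        "salary_max": null,
        "skills": ["Rust", "Tokio"]
    });

    let listing: JobListing = serde_json::from_value(response)?;
    assert_eq!(listing.location, None);   // null deserializes to None
    assert_eq!(listing.salary_max, None);
    println!("{:#?}", listing);
    Ok(())
}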

Structured output with JSON schema for LLM pipelines

The schema field is the most important feature for AI agent scraping in Rust. It turns unpredictable HTML into a typed value you can deserialize directly into a Rust struct, pass to an LLM as structured context, or store in a database.

A few rules worth knowing:

  • Mark anything that might be missing as nullable, for example "type": ["string", "null"]. The API returns null rather than omitting the field, so your Option<T> fields map cleanly.
  • Put all fields you always need in required. Put optional enrichment fields outside it.
  • Enum fields work: "enum": ["full_time", "part_time", "contract", null]. A sketch of the Rust-side mapping follows the product schema below.

Example schema for a product listing:

let schema = json!({
    "type": "object",
    "required": ["name", "price", "in_stock"],
    "properties": {
        "name": { "type": "string" },
        "price": { "type": "number" },
        "currency": { "type": ["string", "null"] },
        "in_stock": { "type": "boolean" },
        "rating": { "type": ["number", "null"] },
        "review_count": { "type": ["number", "null"] }
    }
});
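For that enum rule, here is a minimal sketch of how a nullable JSON enum can map onto a Rust enum through serde. The EmploymentType name, the JobType struct, and the snake_case values are illustrative, not part of the Spidra API:

use serde::Deserialize;

// Mirrors "enum": ["full_time", "part_time", "contract", null] in the schema.
#[derive(Deserialize, Debug)]
#[serde(rename_all = "snake_case")]
enum EmploymentType {
    FullTime,
    PartTime,
    Contract,
}

#[derive(Deserialize, Debug)]
struct JobType {
    // Option<_> absorbs the null case.
    employment_type: Option<EmploymentType>,
}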

Wiring scraped data into a Rust AI agent

Here is a complete example of a Rust AI agent that uses async-openai and Spidra together. The agent scrapes a URL for context, then passes the structured result to an LLM.

[dependencies]
spidra = "0.1"
async-openai = "0.23"
tokio = { version = "1", features = ["full"] }
serde_json = "1"

use async_openai::{Client as OpenAIClient, types::{ChatCompletionRequestUserMessageArgs, CreateChatCompletionRequestArgs}};
use spidra::{SpidraClient, types::scrape::ScrapeParams};

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let spidra = SpidraClient::new("your-spidra-key");
    let openai = OpenAIClient::new(); // reads OPENAI_API_KEY from the environment

    // Step 1: scrape the page
    let mut params = ScrapeParams::new("https://news.ycombinator.com");
    params.prompt = Some("List the top 10 stories with title, URL, and point count".to_string());

    let scrape_result = spidra.scrape().run(&params).await?;

    let context = serde_json::to_string_pretty(&scrape_result.content)?;

    // Step 2: pass to LLM
    let request = CreateChatCompletionRequestArgs::default()
        .model("gpt-4o")
        .messages(vec![
            ChatCompletionRequestUserMessageArgs::default()
                .content(format!(
                    "Here are the current top stories on Hacker News:\n\n{}\n\nWhich story would be most relevant to a Rust developer and why?",
                    context
                ))
                .build()?
                .into(),
        ])
        .build()?;

    let response = openai.chat().create(request).await?;
    let answer = &response.choices[0].message.content;
    println!("{:?}", answer);

    Ok(())
}

The key pattern: scrape first, structure the data, pass it as context. The LLM gets clean typed information rather than raw HTML noise.

Batch extraction at scale

For processing many URLs in parallel (competitor pages, product listings, directory entries), use the batch endpoint, which handles up to 50 URLs concurrently:

use spidra::types::batch::BatchScrapeParams;
use spidra::types::scrape::ScrapeUrl;

let urls = vec![
    ScrapeUrl::new("https://shop.example.com/product/1"),
    ScrapeUrl::new("https://shop.example.com/product/2"),
    ScrapeUrl::new("https://shop.example.com/product/3"),
];

let batch = client.batch().run(&BatchScrapeParams::new(urls)).await?;

let successful: Vec<_> = batch.items.iter()
    .filter(|item| item.status == "completed")
    .collect();

println!("{}/{} succeeded", successful.len(), batch.items.len());

for item in successful {
    println!("{}: {:?}", item.url, item.result);
}
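For lists longer than 50 URLs, split the work into chunks and submit one batch per chunk. A minimal sketch, assuming the same client and BatchScrapeParams API shown above (the scrape_all helper and its return value are illustrative):

use spidra::{SpidraClient, types::{batch::BatchScrapeParams, scrape::ScrapeUrl}};

// Hypothetical helper: submits any number of URLs, 50 at a time, and
// returns how many items completed successfully.
async fn scrape_all(
    client: &SpidraClient,
    urls: Vec<String>,
) -> Result<usize, Box<dyn std::error::Error>> {
    let mut completed = 0;
    for chunk in urls.chunks(50) {
        let batch_urls: Vec<ScrapeUrl> = chunk
            .iter()
            .map(|u| ScrapeUrl::new(u.as_str()))
            .collect();
        let batch = client.batch().run(&BatchScrapeParams::new(batch_urls)).await?;
        completed += batch.items.iter().filter(|item| item.status == "completed").count();
    }
    Ok(completed)
}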

Production patterns for async Rust scraping pipelines

Retry with exponential backoff

use std::time::Duration;
use tokio::time::sleep;
use spidra::{SpidraClient, types::scrape::ScrapeParams};

async fn scrape_with_retry(
    client: &SpidraClient,
    params: ScrapeParams,
    max_retries: u32,
) -> Result<spidra::types::scrape::ScrapeResult, spidra::error::SpidraError> {
    let mut attempt = 0;
    loop {
        match client.scrape().run(&params).await {
            Ok(result) => return Ok(result),
            Err(e) if attempt < max_retries => {
                let wait = Duration::from_secs(2u64.pow(attempt));
                eprintln!("Attempt {} failed: {}. Retrying in {:?}", attempt + 1, e, wait);
                sleep(wait).await;
                attempt += 1;
            }
            Err(e) => return Err(e),
        }
    }
}
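Calling the helper looks like any other scrape; for example, with the same client and a params value built as in the earlier examples:

let client = SpidraClient::new("your-api-key");
let params = ScrapeParams::new("https://jobs.example.com/senior-engineer");
let result = scrape_with_retry(&client, params, 3).await?;
println!("{:?}", result.content);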

Processing a URL queue concurrently

use tokio::task::JoinSet;
use std::sync::Arc;
use spidra::{SpidraClient, types::scrape::ScrapeParams};

async fn process_urls(urls: Vec<String>, api_key: &str) {
    let client = Arc::new(SpidraClient::new(api_key));
    let mut set = JoinSet::new();

    for url in urls {
        let client = Arc::clone(&client);
        set.spawn(async move {
            let mut params = ScrapeParams::new(&url);
            params.prompt = Some("Extract the page title and main content summary".to_string());
            client.scrape().run(&params).await
        });
    }

    while let Some(result) = set.join_next().await {
        match result {
            Ok(Ok(scrape_result)) => println!("Success: {:?}", scrape_result.content),
            Ok(Err(e)) => eprintln!("Scrape error: {}", e),
            Err(e) => eprintln!("Task panicked: {}", e),
        }
    }
}
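The JoinSet version above spawns every request at once. For large queues you may want to bound the number of in-flight scrapes; here is a minimal sketch using a tokio Semaphore, with the Spidra calls unchanged from the function above (the concurrency limit is whatever the caller passes in):

use std::sync::Arc;
use tokio::{sync::Semaphore, task::JoinSet};
use spidra::{SpidraClient, types::scrape::ScrapeParams};

async fn process_urls_bounded(urls: Vec<String>, api_key: &str, max_in_flight: usize) {
    let client = Arc::new(SpidraClient::new(api_key));
    let semaphore = Arc::new(Semaphore::new(max_in_flight));
    let mut set = JoinSet::new();

    for url in urls {
        let client = Arc::clone(&client);
        let semaphore = Arc::clone(&semaphore);
        set.spawn(async move {
            // Hold a permit for the duration of the request; at most
            // max_in_flight scrapes run at the same time.
            let _permit = semaphore.acquire_owned().await.expect("semaphore closed");
            let mut params = ScrapeParams::new(&url);
            params.prompt = Some("Extract the page title and main content summary".to_string());
            client.scrape().run(&params).await
        });
    }

    while let Some(result) = set.join_next().await {
        match result {
            Ok(Ok(scrape_result)) => println!("Success: {:?}", scrape_result.content),
            Ok(Err(e)) => eprintln!("Scrape error: {}", e),
            Err(e) => eprintln!("Task panicked: {}", e),
        }
    }
}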

Full end-to-end example: competitive pricing agent

A complete agent that monitors competitor prices and generates a report:

use spidra::{SpidraClient, types::{batch::BatchScrapeParams, scrape::ScrapeUrl}};
use serde::Deserialize;

#[derive(Deserialize, Debug)]
struct Product {
    name: String,
    price: f64,
    currency: Option<String>,
    on_sale: Option<bool>,
}

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    let client = SpidraClient::new("your-api-key");

    let competitor_urls = vec![
        "https://competitor-a.com/widgets".to_string(),
        "https://competitor-b.com/widgets".to_string(),
        "https://competitor-c.com/widgets".to_string(),
    ];

    let urls = competitor_urls.into_iter().map(ScrapeUrl::new).collect();
    let batch = client.batch().run(&BatchScrapeParams::new(urls)).await?;

    for item in &batch.items {
        if item.status == "completed" {
            println!("--- {} ---", item.url);
            if let Some(result) = &item.result {
                // If the batch request carried a Product-shaped schema (as in the
                // single-URL examples above), each result deserializes straight into
                // the struct; otherwise fall back to printing the raw JSON.
                match serde_json::from_value::<Vec<Product>>(result.clone()) {
                    Ok(products) => println!("{:#?}", products),
                    Err(_) => println!("{}", serde_json::to_string_pretty(result)?),
                }
            }
        }
    }

    Ok(())
}

Wrapping up

AI-powered web scraping in Rust replaces brittle selectors with durable prompts and delivers typed structured output that integrates cleanly with Rust's type system. The JSON schema contract means your Option<T> fields and enums map directly to what the API returns, with no post-processing or defensive null checks sprinkled throughout your code.

For production Rust AI agent scraping, the Spidra SDK handles the browser automation, proxy rotation, and CAPTCHA infrastructure. Your code stays focused on what to do with the data.

Install: cargo add spidra

Spidra is a web scraping API with AI-powered extraction, proxy rotation, and CAPTCHA handling. Try it free at spidra.io.
