April 30, 2026 · 7 min read

Web scraping with .NET: a practical guide

Elijah Asaolu

.NET is a mature platform for backend systems, data pipelines, and scheduled jobs, exactly the kinds of workloads where web scraping tends to show up. If your stack is already C#, adding a Python scraping service just to collect data introduces operational overhead that is rarely justified.

This guide covers the full .NET web scraping stack: static HTML extraction with HtmlAgilityPack and AngleSharp, JavaScript rendering with Playwright for .NET, anti-bot challenges, and when a managed scraping API makes more sense than maintaining your own browser infrastructure.

Can .NET scrape JavaScript-rendered sites?

Yes. Microsoft's Playwright has first-class .NET support, and Selenium WebDriver has had a .NET binding for years. Both drive a real browser, execute JavaScript, and let you extract content from the fully rendered DOM.

For static pages such as documentation, news articles, and data exports, HtmlAgilityPack or AngleSharp paired with HttpClient is sufficient and significantly faster. The sections below cover both paths.

The basics: HttpClient and HtmlAgilityPack

HtmlAgilityPack is the standard .NET library for HTML parsing. It handles malformed HTML gracefully and provides XPath queries for element selection; CSS selectors are available through extension packages.

Install via NuGet:

dotnet add package HtmlAgilityPack

Fetch and parse a page:

using HtmlAgilityPack;

var web = new HtmlWeb();
web.UserAgent = "Mozilla/5.0 (compatible; DotNetScraper/1.0)";

var doc = web.Load("https://news.ycombinator.com");

var stories = doc.DocumentNode.SelectNodes("//span[@class='titleline']/a");

if (stories != null)
{
    foreach (var story in stories)
    {
        var title = story.InnerText;
        var href = story.GetAttributeValue("href", "");
        Console.WriteLine($"{title} -> {href}");
    }
}

HtmlAgilityPack uses XPath for queries. If you prefer CSS selectors, AngleSharp provides a more modern API:

dotnet add package AngleSharp

using AngleSharp;
using AngleSharp.Html.Parser;

var config = Configuration.Default.WithDefaultLoader();
var context = BrowsingContext.New(config);
var document = await context.OpenAsync("https://news.ycombinator.com");

var stories = document.QuerySelectorAll("span.titleline > a");

foreach (var story in stories)
{
    var title = story.TextContent;
    var href = story.GetAttribute("href") ?? "";
    Console.WriteLine($"{title} -> {href}");
}

AngleSharp implements the full CSS selector specification, and the separate AngleSharp.Js package offers limited JavaScript execution without a full browser.
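For instance, if you already hold the HTML as a string (say, fetched with HttpClient), AngleSharp's HtmlParser can parse it directly without a browsing context:

```csharp
using System;
using AngleSharp.Html.Parser;

// Parse an in-memory HTML fragment; no network or browsing context needed
var parser = new HtmlParser();
var document = parser.ParseDocument(
    "<ul><li class='item'>alpha</li><li class='item'>beta</li></ul>");

foreach (var item in document.QuerySelectorAll("li.item"))
{
    Console.WriteLine(item.TextContent);
}
```

This is handy in tests, where you feed the parser saved fixtures instead of hitting the live site.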

Using HttpClient directly

For fine-grained control over headers, timeouts, and connection pooling:

using System.Net.Http;
using HtmlAgilityPack;

// Reuse HttpClient across requests; do not instantiate per-request
var httpClient = new HttpClient();
httpClient.DefaultRequestHeaders.Add("User-Agent",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36");
httpClient.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.9");

var html = await httpClient.GetStringAsync("https://example.com/data");

var doc = new HtmlDocument();
doc.LoadHtml(html);

var nodes = doc.DocumentNode.SelectNodes("//table[@class='data-table']//tr");
foreach (var row in nodes ?? Enumerable.Empty<HtmlNode>())
{
    var cells = row.SelectNodes("td");
    if (cells?.Count >= 2)
    {
        Console.WriteLine($"{cells[0].InnerText.Trim()} | {cells[1].InnerText.Trim()}");
    }
}

Real problems: what breaks in production

JavaScript-rendered content in .NET

Most modern web applications (e-commerce sites, SaaS dashboards, social feeds) render content with JavaScript after the initial HTML loads. An HttpClient request returns only the shell page with empty containers.

You can verify this quickly: curl the URL and check whether the response contains the data you want. If it is mostly empty <div> tags, the page requires browser execution.

Anti-bot detection

HttpClient requests are easy to fingerprint as bot traffic: the TLS handshake, HTTP/2 settings, and header ordering all differ from a real browser's. Sites behind Cloudflare, DataDome, or similar services return a challenge page or a 403 before your code sees any content.

Basic mitigations:

var handler = new HttpClientHandler
{
    AllowAutoRedirect = true,
    UseCookies = true,
};

var client = new HttpClient(handler);
client.DefaultRequestHeaders.Add("User-Agent",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 " +
    "(KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36");
client.DefaultRequestHeaders.Add("Accept",
    "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8");
client.DefaultRequestHeaders.Add("Accept-Language", "en-US,en;q=0.5");
client.DefaultRequestHeaders.Add("Referer", "https://www.google.com/");

This defeats basic header checks but does not defeat browser fingerprinting or CAPTCHA challenges.
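When a site answers with a transient 403 or 429, backing off and retrying is usually more productive than hammering it. A minimal sketch of a hypothetical retry helper (the attempt count and delays are illustrative):

```csharp
using System;
using System.Net.Http;
using System.Threading.Tasks;

// Hypothetical helper: retries transient HTTP failures with exponential backoff
static async Task<T> RetryAsync<T>(Func<Task<T>> action, int maxAttempts = 4)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return await action();
        }
        catch (HttpRequestException) when (attempt < maxAttempts)
        {
            // Back off 1s, 2s, 4s, ... before the next try
            await Task.Delay(TimeSpan.FromSeconds(Math.Pow(2, attempt - 1)));
        }
    }
}

// Usage: wrap any fetch in the helper
// var html = await RetryAsync(() => client.GetStringAsync(url));
```

On the last attempt the exception filter no longer matches, so the final failure propagates to the caller instead of being swallowed.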

Parallel scraping in .NET

.NET's async/await and Task.WhenAll make concurrent scraping clean:

using System.Collections.Concurrent;

var urls = new[]
{
    "https://example.com/page/1",
    "https://example.com/page/2",
    "https://example.com/page/3",
};

var results = new ConcurrentBag<string>();

await Task.WhenAll(urls.Select(async url =>
{
    try
    {
        var html = await httpClient.GetStringAsync(url);
        results.Add(html);
    }
    catch (Exception ex)
    {
        Console.Error.WriteLine($"Failed {url}: {ex.Message}");
    }
}));

Console.WriteLine($"Collected {results.Count} pages");

This works well for cooperative sites but does nothing for anti-bot detection. Sending many parallel requests from the same IP is a reliable way to trigger rate limiting.
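A SemaphoreSlim gives you a concurrency cap without restructuring the Task.WhenAll pattern. A sketch, assuming illustrative URLs and a limit of three in-flight requests:

```csharp
using System;
using System.Collections.Concurrent;
using System.Linq;
using System.Net.Http;
using System.Threading;
using System.Threading.Tasks;

var httpClient = new HttpClient();
var urls = Enumerable.Range(1, 10).Select(i => $"https://example.com/page/{i}");

var results = new ConcurrentBag<string>();
var throttle = new SemaphoreSlim(3); // at most 3 requests in flight at a time

await Task.WhenAll(urls.Select(async url =>
{
    await throttle.WaitAsync();
    try
    {
        results.Add(await httpClient.GetStringAsync(url));
    }
    catch (HttpRequestException ex)
    {
        Console.Error.WriteLine($"Failed {url}: {ex.Message}");
    }
    finally
    {
        throttle.Release();
    }
}));
```

Adding a small random delay inside the loop spreads requests out further and looks less like a burst to rate limiters.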

Scraping JavaScript-rendered pages with Playwright for .NET

Playwright is the best option for .NET browser automation. It supports Chromium, Firefox, and WebKit, has a clean async API, and is actively maintained by Microsoft.

dotnet add package Microsoft.Playwright
dotnet build
pwsh bin/Debug/net8.0/playwright.ps1 install chromium

The playwright.ps1 script is generated at build time; the bin path depends on your target framework.

Basic usage:

using Microsoft.Playwright;

using var playwright = await Playwright.CreateAsync();
await using var browser = await playwright.Chromium.LaunchAsync(new BrowserTypeLaunchOptions
{
    Headless = true
});

var page = await browser.NewPageAsync();
await page.GotoAsync("https://example.com/dynamic-content");

// Wait for content to load
await page.WaitForSelectorAsync("div.product-grid");

// Extract elements
var products = await page.QuerySelectorAllAsync("div.product-card");

foreach (var product in products)
{
    var name = await product.QuerySelectorAsync("h3.product-name");
    var price = await product.QuerySelectorAsync("span.price");

    if (name is null || price is null)
        continue; // skip cards missing either field

    Console.WriteLine($"{await name.InnerTextAsync()}: {await price.InnerTextAsync()}");
}
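Recent Playwright versions recommend the Locator API over QuerySelectorAllAsync: locators auto-wait and re-resolve, which makes scrapers less flaky on dynamic pages. The same extraction sketched with locators (the selectors are illustrative):

```csharp
// Locators are lazy; they resolve against the live DOM on each operation
var cards = page.Locator("div.product-card");
var count = await cards.CountAsync();

for (var i = 0; i < count; i++)
{
    var card = cards.Nth(i);
    var name = await card.Locator("h3.product-name").InnerTextAsync();
    var price = await card.Locator("span.price").InnerTextAsync();
    Console.WriteLine($"{name}: {price}");
}
```

InnerTextAsync on a locator waits for the element to appear, so the explicit WaitForSelectorAsync call becomes unnecessary in most cases.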

Interacting with the page before scraping

await page.GotoAsync("https://example.com/search");

// Accept cookie banner
var cookieButton = await page.QuerySelectorAsync("button#accept-cookies");
if (cookieButton != null)
    await cookieButton.ClickAsync();

// Fill a search form
await page.FillAsync("input[name='q']", "laptop");
await page.ClickAsync("button[type='submit']");

// Wait for results
await page.WaitForLoadStateAsync(LoadState.NetworkIdle);

var results = await page.QuerySelectorAllAsync("div.search-result");
Console.WriteLine($"Found {results.Count} results");

Running Playwright in containerized environments

Playwright in Docker requires the browser dependencies:

FROM mcr.microsoft.com/dotnet/sdk:8.0

# Install Playwright dependencies
RUN apt-get update && apt-get install -y \
    libglib2.0-0 libnss3 libnspr4 libdbus-1-3 \
    libatk1.0-0 libatk-bridge2.0-0 libcups2 \
    libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 \
    libxfixes3 libxrandr2 libgbm1 libasound2

WORKDIR /app
COPY . .
RUN dotnet publish -c Release -o out
# Install the browser via the script included in the publish output
RUN pwsh out/playwright.ps1 install chromium

CMD ["dotnet", "out/YourApp.dll"]
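Alternatively, the official Playwright .NET base image ships Chromium and its system libraries preinstalled, which avoids hand-maintaining the apt-get list. Pin the tag to the Playwright version you depend on; the tag below is an example:

```dockerfile
FROM mcr.microsoft.com/playwright/dotnet:v1.44.0-jammy

WORKDIR /app
COPY . .
RUN dotnet publish -c Release -o out

CMD ["dotnet", "out/YourApp.dll"]
```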

Where DIY breaks down

Running Playwright at scale means managing:

  • Browser process overhead: each Chromium instance uses 200-500MB of RAM. Ten parallel scrapers means gigabytes just for browser processes.
  • CAPTCHA solving: requires a third-party CAPTCHA solving service and integration code.
  • Proxy rotation: building a reliable proxy pool is a separate infrastructure project.
  • Stealth mode: Playwright's default fingerprint is detectable by sophisticated anti-bot systems. Staying undetected requires patching browser launch arguments and keeping up with detection updates.
  • Version management: Playwright and Chromium update independently. Breaking changes appear without warning.

For occasional scraping of cooperative sites, this is manageable. For production workloads hitting modern protected sites continuously, the infrastructure overhead dominates.

Using the Spidra .NET SDK

Spidra handles browser automation, proxy rotation, and CAPTCHA solving as a managed API. The .NET SDK wraps the async submit-and-poll pattern into clean Task-based calls.

dotnet add package Spidra

Basic scrape

using Spidra;

var client = new SpidraClient("your-api-key");

var job = await client.Scrape.RunAsync(new ScrapeParams
{
    Urls = [new ScrapeUrl("https://news.ycombinator.com")],
    Prompt = "List the top 5 stories with title, points, and comment count",
    UseProxy = true
});

Console.WriteLine(job.Result.Content);

Structured output with JSON schema

using System.Text.Json;

var job = await client.Scrape.RunAsync(new ScrapeParams
{
    Urls = [new ScrapeUrl("https://jobs.example.com/senior-engineer")],
    Prompt = "Extract the job title, company, location, salary range, and required skills",
    Output = OutputFormat.Json,
    UseProxy = true,
    Schema = JsonSerializer.SerializeToElement(new
    {
        type = "object",
        required = new[] { "title", "company" },
        properties = new
        {
            title = new { type = "string" },
            company = new { type = "string" },
            location = new { type = new[] { "string", "null" } },
            salaryMin = new { type = new[] { "number", "null" } },
            salaryMax = new { type = new[] { "number", "null" } },
            skills = new { type = "array", items = new { type = "string" } }
        }
    })
});

var listing = job.Result.Content.Deserialize<JobListing>();
Console.WriteLine($"{listing!.Title} at {listing.Company}");

record JobListing(string Title, string Company, string? Location,
    double? SalaryMin, double? SalaryMax, List<string> Skills);

Batch scraping

var batch = await client.Batch.RunAsync(new BatchScrapeParams
{
    Urls =
    [
        "https://shop.example.com/product/1",
        "https://shop.example.com/product/2",
        "https://shop.example.com/product/3"
    ],
    Prompt = "Extract product name, price, and availability",
    Output = OutputFormat.Json,
    UseProxy = true
});

foreach (var item in batch.Items.Where(i => i.Status == "completed"))
{
    Console.WriteLine($"{item.Url}: {item.Result}");
}

Crawling an entire site

var job = await client.Crawl.RunAsync(new CrawlParams
{
    BaseUrl = "https://competitor.com/blog",
    CrawlInstruction = "Find all blog posts published in 2024",
    TransformInstruction = "Extract title, author, publish date, and summary",
    MaxPages = 30,
    UseProxy = true
});

foreach (var page in job.Result)
{
    Console.WriteLine($"{page.Url}: {page.Data}");
}

When to use each approach

Scenario | Approach
Static pages, cooperative sites | HttpClient + HtmlAgilityPack / AngleSharp
JSON API available | HttpClient + System.Text.Json
JS-rendered, low volume | Playwright for .NET
Production scale, anti-bot sites | Scraping API (Spidra)
LLM / agent pipeline input | Scraping API with structured output
One-off research | Browser DevTools + copy as cURL

Wrapping up

.NET web scraping starts with HtmlAgilityPack or AngleSharp for static pages and steps up to Playwright when JavaScript rendering is required. The operational cost of running headless browsers at scale (memory, CAPTCHA handling, proxy rotation, stealth maintenance) pushes most production use cases toward a managed scraping API.

The Spidra .NET SDK handles that infrastructure and returns structured data that maps directly to C# records and POCOs via System.Text.Json.

Install: dotnet add package Spidra

Spidra is a web scraping API with AI-powered extraction, proxy rotation, and CAPTCHA handling. Try it free at spidra.io.
