Traditional .NET web scraping means HtmlAgilityPack selectors, Selenium click sequences, and a growing collection of XPath expressions that break every time a site redesigns. The maintenance cost is real: selectors are fragile, JavaScript-rendered pages need browser automation, and the gap between what you scraped yesterday and what the site serves today widens constantly.
AI-powered web scraping changes this. Instead of encoding page structure into selectors, you describe the data you want in plain English. The AI layer handles parsing, adapts to layout changes, and returns structured output that conforms to a schema you define, which makes it a clean fit for C# type systems, System.Text.Json deserialization, and LLM agent pipelines.
This guide covers AI-powered web scraping in .NET: the shift from selector-based to prompt-based extraction, JSON schema output for C# pipelines, wiring scraped data into a .NET AI agent, batch extraction at scale with IAsyncEnumerable streaming, and production patterns for error handling and retries.
Quick answer: To extract structured data from websites in .NET using AI, describe what you want in plain English, define a JSON schema for the output shape, and let an AI scraping layer handle parsing and normalization. The Spidra .NET SDK (dotnet add package Spidra) exposes this as an awaitable Task-based call whose JSON result deserializes directly into your C# records.
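A minimal sketch of that call shape, condensed from the full examples later in this guide (the URL, prompt, and fields here are placeholders):
using Spidra;
using System.Text.Json;

var client = new SpidraClient("your-api-key");
var job = await client.Scrape.RunAsync(new ScrapeParams
{
    Urls = [new ScrapeUrl("https://example.com/listing")],
    Prompt = "Extract the product name and price",
    Output = OutputFormat.Json,
    Schema = JsonSerializer.SerializeToElement(new
    {
        type = "object",
        required = new[] { "name", "price" },
        properties = new
        {
            name = new { type = "string" },
            price = new { type = "number" }
        }
    })
});

// The JSON result deserializes straight into a C# record.
var item = job.Result.Content.Deserialize<Item>(
    new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
Console.WriteLine($"{item!.Name}: {item.Price}");

record Item(string Name, double Price);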
How AI changes web scraping for .NET developers
Selector-based scrapers in .NET are tightly coupled to DOM structure. A //div[@class='product-price']/span XPath expression breaks the moment a frontend developer renames a class. In enterprise environments, that maintenance falls to backend developers who would rather be building features.
AI-powered scraping changes the model:
- Prompts over selectors. You describe the data once in plain English. The model reads the rendered content and returns a typed result regardless of how the page is structured.
- Resilience to redesigns. A prompt asking for "product name and price" continues working after a CSS class rename. An XPath expression does not.
- C# type safety end-to-end. When the scraper returns a JSON object conforming to a schema, you deserialize it directly into a C# record. The compiler enforces correctness at the boundary.
The old way vs the new way
The same task, extracting job listing details, done both ways.
Selector-based (old way)
using HtmlAgilityPack;
var web = new HtmlWeb();
var doc = web.Load("https://jobs.example.com/senior-engineer");
// Fragile: breaks on any DOM change
var titleNode = doc.DocumentNode.SelectSingleNode("//h1[@class='job-title']");
var companyNode = doc.DocumentNode.SelectSingleNode("//span[@class='company-name']");
var salaryNode = doc.DocumentNode.SelectSingleNode("//div[@class='salary-range']/span");
// Manual null checks everywhere
var title = titleNode?.InnerText.Trim() ?? "";
var company = companyNode?.InnerText.Trim() ?? "";
var salary = salaryNode?.InnerText.Trim() ?? "";
// What if the page requires JavaScript to render?
// All three variables are empty strings.
Console.WriteLine($"{title} at {company} โ {salary}");
Problems: it breaks on DOM changes, fails silently on JavaScript-rendered pages, and requires defensive null checks throughout.
Prompt-based with JSON schema (new way)
using Spidra;
using System.Text.Json;
using System.Text.Json.Serialization;
var client = new SpidraClient("your-api-key");
var job = await client.Scrape.RunAsync(new ScrapeParams
{
Urls = [new ScrapeUrl("https://jobs.example.com/senior-engineer")],
Prompt = "Extract the job title, company name, location, salary range, and required skills",
Output = OutputFormat.Json,
UseProxy = true,
Schema = JsonSerializer.SerializeToElement(new
{
type = "object",
required = new[] { "title", "company" },
properties = new
{
title = new { type = "string" },
company = new { type = "string" },
location = new { type = new[] { "string", "null" } },
salaryMin = new { type = new[] { "number", "null" } },
salaryMax = new { type = new[] { "number", "null" } },
skills = new { type = "array", items = new { type = "string" } }
}
})
});
// Deserialize directly: no null checking, no string parsing
var listing = job.Result.Content.Deserialize<JobListing>(new JsonSerializerOptions
{
PropertyNameCaseInsensitive = true
});
Console.WriteLine($"{listing!.Title} at {listing.Company}");
Console.WriteLine($"Skills: {string.Join(", ", listing.Skills)}");
record JobListing(
string Title,
string Company,
string? Location,
[property: JsonPropertyName("salaryMin")] double? SalaryMin,
[property: JsonPropertyName("salaryMax")] double? SalaryMax,
List<string> Skills
);
The schema contract means:
- Fields in required always appear, as null if the data is not found (see the sketch after this list).
- Optional fields are omitted when unavailable.
- Your C# record maps directly to the schema. No defensive handling needed.
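Here is a sketch of that contract at deserialization time, reusing the JobListing record above with a hypothetical response payload:
// Hypothetical payload: required fields present, location returned as null,
// optional salary fields omitted entirely.
var payload = """
{
    "title": "Senior Engineer",
    "company": "Acme",
    "location": null,
    "skills": ["C#", "ASP.NET Core"]
}
""";
var parsed = JsonSerializer.Deserialize<JobListing>(payload,
    new JsonSerializerOptions { PropertyNameCaseInsensitive = true });
Console.WriteLine(parsed!.Location is null);  // True: present but null
Console.WriteLine(parsed.SalaryMin is null);  // True: omitted, maps to null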
JSON schema for structured output: why it matters for .NET pipelines
The schema field turns unpredictable HTML into a typed value you can deserialize directly into C# records, persist to a database, or pass as structured context to an LLM.
Rules worth knowing:
- Use new[] { "string", "null" } for nullable fields. The API returns null rather than omitting the field, so your string? and double? types map cleanly.
- Put all fields you always need in required. Optional enrichment fields go outside it.
- Enum constraints work: new[] { "full_time", "part_time", "contract", null }. A sketch follows the record below.
Example schema for a product listing:
var schema = JsonSerializer.SerializeToElement(new
{
type = "object",
required = new[] { "name", "price", "inStock" },
properties = new
{
name = new { type = "string" },
price = new { type = "number" },
currency = new { type = new[] { "string", "null" } },
inStock = new { type = "boolean" },
rating = new { type = new[] { "number", "null" } },
reviewCount = new { type = new[] { "number", "null" } }
}
});
The corresponding C# record:
record Product(
string Name,
double Price,
string? Currency,
bool InStock,
double? Rating,
int? ReviewCount
);
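The enum constraint from the rules above drops into a schema property the same way. A sketch with a hypothetical employmentType field (written @enum because enum is a C# keyword; it still serializes as "enum"):
var employmentSchema = JsonSerializer.SerializeToElement(new
{
    type = "object",
    properties = new
    {
        // Hypothetical field illustrating the enum constraint.
        employmentType = new
        {
            type = new[] { "string", "null" },
            @enum = new object?[] { "full_time", "part_time", "contract", null }
        }
    }
});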
Wiring scraped data into a .NET AI agent
Using Microsoft Semantic Kernel
Semantic Kernel is Microsoft's framework for building LLM-powered agents in .NET. Here is a complete example that uses Spidra as a scraping tool within a Semantic Kernel agent:
dotnet add package Microsoft.SemanticKernel
dotnet add package Spidra
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.Anthropic;
using Spidra;
using System.ComponentModel;
// Define the scraping tool
public class ScrapingPlugin
{
private readonly SpidraClient _client;
public ScrapingPlugin(SpidraClient client) => _client = client;
[KernelFunction("scrape_url")]
[Description("Fetch and extract structured data from a URL. Use when you need current information from a website.")]
public async Task<string> ScrapeUrlAsync(
[Description("The URL to scrape")] string url,
[Description("What data to extract, in plain English")] string prompt)
{
var job = await _client.Scrape.RunAsync(new ScrapeParams
{
Urls = [new ScrapeUrl(url)],
Prompt = prompt,
UseProxy = true
});
return System.Text.Json.JsonSerializer.Serialize(job.Result.Content);
}
}
// Wire it into a Semantic Kernel agent
var spidra = new SpidraClient(Environment.GetEnvironmentVariable("SPIDRA_API_KEY")!);
var kernel = Kernel.CreateBuilder()
.AddAnthropicChatCompletion("claude-opus-4-6",
Environment.GetEnvironmentVariable("ANTHROPIC_API_KEY")!)
.Build();
kernel.Plugins.AddFromObject(new ScrapingPlugin(spidra));
var settings = new AnthropicPromptExecutionSettings
{
FunctionChoiceBehavior = FunctionChoiceBehavior.Auto()
};
var result = await kernel.InvokePromptAsync(
"What are the top 3 trending repositories on GitHub today, and what do they do?",
new KernelArguments(settings)
);
Console.WriteLine(result);
Using the Anthropic SDK directly
For a lighter approach without Semantic Kernel:
dotnet add package Anthropic.SDK
dotnet add package Spidra
using Anthropic.SDK;
using Anthropic.SDK.Messaging;
using Spidra;
using System.Text.Json;
var spidra = new SpidraClient(Environment.GetEnvironmentVariable("SPIDRA_API_KEY")!);
var anthropic = new AnthropicClient(Environment.GetEnvironmentVariable("ANTHROPIC_API_KEY")!);
// Step 1: scrape and structure the data
var scrapeJob = await spidra.Scrape.RunAsync(new ScrapeParams
{
Urls = [new ScrapeUrl("https://news.ycombinator.com")],
Prompt = "List the top 10 stories with title, URL, and point count",
UseProxy = true
});
var context = JsonSerializer.Serialize(scrapeJob.Result.Content,
new JsonSerializerOptions { WriteIndented = true });
// Step 2: pass structured data to the LLM
var messages = new List<Message>
{
new()
{
Role = RoleType.User,
Content = $"""
Here are the current top stories on Hacker News:
{context}
Which story would be most relevant to a .NET developer and why?
"""
}
};
var response = await anthropic.Messages.GetClaudeMessageAsync(
new MessageParameters
{
Model = AnthropicModels.Claude3Opus,
MaxTokens = 1024,
Messages = messages
});
Console.WriteLine(response.Content.First().Text);
Batch extraction at scale
For processing many URLs in parallel (competitor pages, product catalogs, job listings):
using Spidra;
using System.Text.Json;
var client = new SpidraClient("your-api-key");
var batch = await client.Batch.RunAsync(new BatchScrapeParams
{
Urls =
[
"https://competitor-a.com/pricing",
"https://competitor-b.com/pricing",
"https://competitor-c.com/pricing"
],
Prompt = "Extract all pricing plans with name, monthly price, and key features",
Output = OutputFormat.Json,
UseProxy = true,
Schema = JsonSerializer.SerializeToElement(new
{
type = "object",
required = new[] { "plans" },
properties = new
{
plans = new
{
type = "array",
items = new
{
type = "object",
properties = new
{
name = new { type = "string" },
monthlyPrice = new { type = new[] { "number", "null" } },
features = new { type = "array", items = new { type = "string" } }
}
}
}
}
})
});
var succeeded = batch.Items.Where(i => i.Status == "completed").ToList();
Console.WriteLine($"{succeeded.Count}/{batch.Items.Count} succeeded");
foreach (var item in succeeded)
{
Console.WriteLine($"\n=== {item.Url} ===");
Console.WriteLine(item.Result);
}
// Retry any failures
var failed = batch.Items.Where(i => i.Status == "failed").ToList();
if (failed.Count > 0)
{
Console.WriteLine($"\nRetrying {failed.Count} failed items...");
await client.Batch.RetryAsync(batch.BatchId);
}
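If you want to consume results as they finish rather than waiting for the whole batch, one option is to fan out individual scrapes and stream them with IAsyncEnumerable. A minimal sketch, using only the Scrape.RunAsync call shown earlier and assuming Result.Content is a JsonElement as in the examples above:
using Spidra;
using System.Text.Json;

var client = new SpidraClient("your-api-key");

string[] urls =
[
    "https://competitor-a.com/pricing",
    "https://competitor-b.com/pricing",
    "https://competitor-c.com/pricing"
];

// Results arrive in completion order, not request order.
await foreach (var (url, content) in ScrapeStreamAsync(client, urls, "Extract all pricing plans"))
{
    Console.WriteLine($"{url}: {content}");
}

// Fan out one scrape per URL and yield each result as soon as it completes.
static async IAsyncEnumerable<(string Url, JsonElement Content)> ScrapeStreamAsync(
    SpidraClient client, IEnumerable<string> urls, string prompt)
{
    var pending = urls.Select(async url =>
    {
        var job = await client.Scrape.RunAsync(new ScrapeParams
        {
            Urls = [new ScrapeUrl(url)],
            Prompt = prompt,
            UseProxy = true
        });
        return (Url: url, Content: job.Result.Content);
    }).ToList();

    while (pending.Count > 0)
    {
        var finished = await Task.WhenAny(pending);
        pending.Remove(finished);
        yield return await finished;
    }
}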
Production patterns: error handling and retries
Typed exception handling
using Spidra;
using Spidra.Exceptions;
// "params" is a C# keyword, so the request object is named scrapeParams here.
try
{
    var job = await client.Scrape.RunAsync(scrapeParams);
return job.Result.Content;
}
catch (SpidraAuthenticationException)
{
logger.LogError("Invalid API key. Check your SPIDRA_API_KEY.");
throw;
}
catch (SpidraInsufficientCreditsException)
{
logger.LogWarning("Out of scraping credits. Upgrade your plan.");
await NotifyOpsAsync();
throw;
}
catch (SpidraRateLimitException ex)
{
logger.LogWarning("Rate limited. Retrying after backoff.");
await Task.Delay(ex.RetryAfter ?? TimeSpan.FromSeconds(5));
    var retried = await client.Scrape.RunAsync(scrapeParams);
    return retried.Result.Content;
}
catch (SpidraServerException)
{
    logger.LogError("Spidra server error. Retrying once after a short wait.");
    await Task.Delay(TimeSpan.FromSeconds(10));
    var retried = await client.Scrape.RunAsync(scrapeParams);
    return retried.Result.Content;
}
Retry policy with Polly
dotnet add package Polly
using Polly;
using Spidra;
using Spidra.Exceptions;
// Define a retry policy with exponential backoff
var retryPolicy = Policy
.Handle<SpidraRateLimitException>()
.Or<SpidraServerException>()
.WaitAndRetryAsync(
retryCount: 3,
sleepDurationProvider: attempt => TimeSpan.FromSeconds(Math.Pow(2, attempt)),
onRetry: (exception, delay, attempt, context) =>
{
Console.Error.WriteLine($"Attempt {attempt} failed: {exception.Message}. Retrying in {delay}.");
}
);
var result = await retryPolicy.ExecuteAsync(() =>
client.Scrape.RunAsync(new ScrapeParams
{
Urls = [new ScrapeUrl(url)],
Prompt = prompt
})
);
Full end-to-end example: news research agent
A complete .NET agent that scrapes current news, extracts structured data, and answers questions about it:
using Anthropic.SDK;
using Anthropic.SDK.Messaging;
using Spidra;
using System.Text.Json;
class NewsResearchAgent
{
private readonly SpidraClient _spidra;
private readonly AnthropicClient _anthropic;
public NewsResearchAgent(string spidraKey, string anthropicKey)
{
_spidra = new SpidraClient(spidraKey);
_anthropic = new AnthropicClient(anthropicKey);
}
public async Task<string> ResearchAsync(string question)
{
// Scrape multiple news sources in parallel
var sources = new[]
{
"https://news.ycombinator.com",
"https://lobste.rs"
};
var batch = await _spidra.Batch.RunAsync(new BatchScrapeParams
{
Urls = [.. sources],
Prompt = "List the top 10 stories with title, URL, and a one-sentence summary",
Output = OutputFormat.Json,
UseProxy = true
});
var contextBuilder = new System.Text.StringBuilder();
foreach (var item in batch.Items.Where(i => i.Status == "completed"))
{
contextBuilder.AppendLine($"## Source: {item.Url}");
contextBuilder.AppendLine(JsonSerializer.Serialize(item.Result,
new JsonSerializerOptions { WriteIndented = true }));
contextBuilder.AppendLine();
}
var response = await _anthropic.Messages.GetClaudeMessageAsync(
new MessageParameters
{
Model = AnthropicModels.Claude3Opus,
MaxTokens = 2048,
Messages =
[
new Message
{
Role = RoleType.User,
Content = $"Here is today's tech news:\n\n{contextBuilder}\n\nQuestion: {question}"
}
]
});
return response.Content.First().Text;
}
public static async Task Main()
{
var agent = new NewsResearchAgent(
Environment.GetEnvironmentVariable("SPIDRA_API_KEY")!,
Environment.GetEnvironmentVariable("ANTHROPIC_API_KEY")!
);
var answer = await agent.ResearchAsync(
"What are the most significant AI announcements in tech news today?"
);
Console.WriteLine(answer);
}
}
Wrapping up
AI-powered web scraping in .NET replaces brittle XPath and CSS selectors with durable natural language prompts and delivers typed structured output that deserializes cleanly into C# records via System.Text.Json. The JSON schema contract eliminates defensive null handling and shape mismatches. Your types and the schema enforce correctness together.
For production .NET AI agent scraping, the Spidra SDK handles browser automation, proxy rotation, and CAPTCHA infrastructure. Your code focuses on what to do with the data.
Install: dotnet add package Spidra
Spidra is a web scraping API with AI-powered extraction, proxy rotation, and CAPTCHA handling. Try it free at spidra.io.
