If you have tried to pull product data from Amazon, you know how the first attempt goes. You write a quick requests call, add a User-Agent header, and it works for maybe three pages before you start getting block pages.
You add proxy rotation. You add delays. You get through more pages, but the HTML keeps shifting on you. The price is in .a-price-whole until it is not. The rating is in .AverageCustomerReviews until Amazon runs an A/B test and it moves somewhere else.
This guide covers how to scrape Amazon product pages and search results reliably, without the selector maintenance problem. All code examples use the Spidra REST API, with Python SDK and Node.js SDK alternatives for each section.
What makes Amazon hard to scrape in 2026
The core problem is that Amazon runs AWS WAF with Bot Control, which blocks datacenter IP ranges before requests even reach the application layer.
A plain requests call without residential proxies fails on the first attempt. Add residential proxies and you get through, but then you are managing proxy rotation, detecting when you get a CAPTCHA or a "dog page," and handling retries.
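Detecting a block page is the part of that loop people tend to skip. A minimal sketch follows; the marker phrases are assumptions based on text commonly seen on Amazon's CAPTCHA and error pages, not an official list, and the function names are hypothetical.

```python
# Assumed markers: phrases commonly seen on Amazon's CAPTCHA and error ("dog") pages.
# They are not an official list and will need updating as those pages change.
BLOCK_MARKERS = (
    "enter the characters you see below",
    "to discuss automated access to amazon data",
    "sorry! something went wrong",
)


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True when a response looks like a block page rather than real content."""
    if status_code in (403, 429, 503):
        return True
    text = html.lower()
    return any(marker in text for marker in BLOCK_MARKERS)


def backoff_delays(retries: int, base: float = 2.0) -> list[float]:
    """Exponential backoff schedule in seconds: base, 2*base, 4*base, ..."""
    return [base * (2 ** attempt) for attempt in range(retries)]
```

In a hand-rolled scraper you would call looks_blocked on every response, switch to a different proxy when it returns True, and sleep for the next value from backoff_delays before retrying.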
Assuming you solve the network layer, the data extraction is its own problem. Here is what scraping a product price looks like with BeautifulSoup after you get a real HTML response:
from bs4 import BeautifulSoup

# `response` is the requests.Response you got back through your proxy
soup = BeautifulSoup(response.text, "html.parser")

# Price is split across two separate elements
whole = soup.find(class_="a-price-whole")
fraction = soup.find(class_="a-price-fraction")
if whole and fraction:
    price = f"${whole.text.strip('.')}.{fraction.text.strip()}"
else:
    # Out of stock, or Amazon changed the element structure again
    price = None

# Rating text comes back as "4.6 out of 5 stars4.6 out of 5"
# so you split on the first occurrence
rating_div = soup.find(class_="AverageCustomerReviews")
rating = rating_div.text.strip().split(" out of")[0] if rating_div else None

# Images are not in a simple src attribute.
# They are serialised in JSON inside a script tag
# and the format changes between product types.
This works until it does not. Amazon changes class names and restructures product sections when it A/B tests page layouts. Any scraper built on Amazon selectors needs ongoing maintenance as a baseline assumption.
Note: Scraping publicly visible Amazon product data is generally considered legal in the United States based on the hiQ Labs v. LinkedIn ruling (Ninth Circuit 2022), which held that scraping publicly accessible data does not constitute unauthorised access under the Computer Fraud and Abuse Act. Amazon's Terms of Service prohibit automated access, which is a contractual restriction, and Amazon enforces it technically rather than legally in most cases.
The two Amazon page types you need to understand
Two Amazon page types matter here, and they serve different purposes in a scraping pipeline.
A product detail page lives at amazon.com/dp/{ASIN}. The ASIN is a 10-character alphanumeric identifier that uniquely identifies every product on Amazon, and it is the key piece of infrastructure for everything else.
This is where you get the full picture: price, original price, discount percentage, star rating, review count, bullet-point features, images, seller, Prime status, Best Seller Rank with full category path, and specifications.
A search results page lives at amazon.com/s?k={keyword}. It returns 20-25 product cards per page. Each card has a title, ASIN, price, and review count.
The practical pattern is to use search results pages to collect ASINs at scale, then batch-scrape the product detail pages for full data. We build exactly this pipeline below.
Prerequisites
Sign up free at app.spidra.io. The free plan gives you 300 credits and no card is required. Get your API key from Settings → API Keys once you are in.
# Python SDK
pip install spidra
# Node.js SDK
npm install spidra
export SPIDRA_API_KEY="YOUR_API_KEY"
Scraping a product detail page
The clean URL format for any Amazon product is https://www.amazon.com/dp/{ASIN}. You will often see product URLs with long ref= query strings from clicking through search results. These work fine, but the /dp/ASIN format is cleaner to store and construct programmatically.
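If you already have a pile of long product URLs, reducing them to the clean form is a small regex job. This is a sketch, assuming the ASIN appears after /dp/ or /gp/product/ in the path; extract_asin and canonical_product_url are hypothetical helper names.

```python
import re
from typing import Optional

# ASINs are 10 alphanumeric characters; this covers the /dp/ and /gp/product/ URL shapes.
ASIN_PATTERN = re.compile(r"/(?:dp|gp/product)/([A-Z0-9]{10})(?=[/?#]|$)")


def extract_asin(url: str) -> Optional[str]:
    """Return the ASIN embedded in a product URL, or None if there is none."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None


def canonical_product_url(url: str, domain: str = "www.amazon.com") -> Optional[str]:
    """Rebuild a product URL as https://{domain}/dp/{ASIN}, dropping slugs and ref= noise."""
    asin = extract_asin(url)
    return f"https://{domain}/dp/{asin}" if asin else None
```

Storing only the ASIN and rebuilding the URL on demand keeps your database free of tracking parameters.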
Building the schema
Passing a JSON Schema to the API tells it exactly what shape to return. Fields in required always appear in the output even if the page does not have a value for them — they come back as null. This matters for production pipelines because it means you can write to a database or pass the result downstream without defensive handling for missing fields.
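To make that guarantee concrete, here is a small sketch of the kind of check you might run on a returned record before writing it to storage. The helper names are hypothetical, and it only looks at the top-level required list rather than doing full JSON Schema validation.

```python
def missing_required(record: dict, schema: dict) -> list[str]:
    """Return required keys absent from the record (a null value still counts as present)."""
    return [key for key in schema.get("required", []) if key not in record]


def null_fields(record: dict) -> list[str]:
    """Return the keys whose value came back as null."""
    return [key for key, value in record.items() if value is None]
```

If missing_required ever returns a non-empty list, the response did not honor the schema and is worth logging; null_fields tells you which values the page simply did not have.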
If you want to generate a schema from an existing JSON sample rather than writing it by hand, the free JSON Schema Generator at spidra.io/tools does this in seconds.
Paste in any JSON output or use the [...] and it infers the full schema. Here is the schema for a product detail page:
{
"type": "object",
"required": ["title", "asin", "price", "availability"],
"properties": {
"title": {"type": "string"},
"brand": {"type": ["string", "null"]},
"asin": {"type": "string"},
"price": {"type": ["number", "null"]},
"original_price": {"type": ["number", "null"]},
"currency": {"type": ["string", "null"]},
"discount_percentage": {"type": ["integer", "null"]},
"availability": {"type": "string"},
"rating": {"type": ["number", "null"]},
"review_count": {"type": ["integer", "null"]},
"features": {"type": "array", "items": {"type": "string"}},
"images": {"type": "array", "items": {"type": "string"}},
"seller": {"type": ["string", "null"]},
"ships_from": {"type": ["string", "null"]},
"prime": {"type": ["boolean", "null"]},
"bsr_rank": {"type": ["integer", "null"]},
"bsr_category": {"type": ["string", "null"]},
"color": {"type": ["string", "null"]},
"model_number": {"type": ["string", "null"]}
}
}
REST API
The API follows an async job pattern. You POST a scrape request, receive a jobId, and poll GET /api/scrape/{jobId} until the status is completed.
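Before the raw curl calls, here is a sketch of the polling half of that pattern in Python. It takes the status lookup as a callable so it stays independent of any HTTP library; poll_until_done is a hypothetical helper, and the "failed" status name is an assumption (only "queued" and "completed" appear in this guide).

```python
import time
from typing import Callable


def poll_until_done(
    fetch_status: Callable[[], dict],
    interval: float = 2.0,
    timeout: float = 120.0,
    sleep: Callable[[float], None] = time.sleep,
) -> dict:
    """Call fetch_status() until the job reports "completed" or the timeout passes."""
    deadline = time.monotonic() + timeout
    while True:
        job = fetch_status()
        status = job.get("status")
        if status == "completed":
            return job
        if status == "failed":  # assumed terminal status name, not confirmed by this guide
            raise RuntimeError(f"job failed: {job}")
        if time.monotonic() >= deadline:
            raise TimeoutError(f"job still {status!r} after {timeout}s")
        sleep(interval)
```

In practice fetch_status would wrap the GET /api/scrape/{jobId} request with your x-api-key header and return the parsed JSON body.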
# Submit
curl -X POST https://api.spidra.io/api/scrape \
-H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [{"url": "https://www.amazon.com/dp/B0G8G4SQXQ"}],
"prompt": "Extract the full product details",
"output": "json",
"useProxy": true,
"proxyCountry": "us",
"schema": { ... }
}'
# Returns: {"jobId": "abc-123", "status": "queued"}
# Poll
curl https://api.spidra.io/api/scrape/abc-123 \
-H "x-api-key: YOUR_API_KEY"
Python SDK
import os
from spidra import SpidraClient, ScrapeParams, ScrapeUrl
spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])
PRODUCT_SCHEMA = {
"type": "object",
"required": ["title", "asin", "price", "availability"],
"properties": {
"title": {"type": "string"},
"brand": {"type": ["string", "null"]},
"asin": {"type": "string"},
"price": {"type": ["number", "null"]},
"original_price": {"type": ["number", "null"]},
"currency": {"type": ["string", "null"]},
"discount_percentage": {"type": ["integer", "null"]},
"availability": {"type": "string"},
"rating": {"type": ["number", "null"]},
"review_count": {"type": ["integer", "null"]},
"features": {"type": "array", "items": {"type": "string"}},
"images": {"type": "array", "items": {"type": "string"}},
"seller": {"type": ["string", "null"]},
"ships_from": {"type": ["string", "null"]},
"prime": {"type": ["boolean", "null"]},
"bsr_rank": {"type": ["integer", "null"]},
"bsr_category": {"type": ["string", "null"]},
"color": {"type": ["string", "null"]},
"model_number": {"type": ["string", "null"]}
}
}
job = spidra.scrape.run_sync(ScrapeParams(
urls=[ScrapeUrl(url="https://www.amazon.com/dp/B0G8G4SQXQ")],
prompt="Extract the full product details",
output="json",
schema=PRODUCT_SCHEMA,
use_proxy=True,
proxy_country="us",
))
print(job.result.content)
Node.js SDK
import { SpidraClient } from 'spidra'
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })
const job = await spidra.scrape.run({
urls: [{ url: 'https://www.amazon.com/dp/B0G8G4SQXQ' }],
prompt: 'Extract the full product details',
output: 'json',
schema: PRODUCT_SCHEMA, // the product schema shown above, as a plain object
useProxy: true,
proxyCountry: 'us',
})
console.log(job.result.content)
You can also run the same request in the Spidra playground.
What comes back
Scraping https://www.amazon.com/dp/B0G8G4SQXQ with proxyCountry: "us" returned this:
{
"title": "Mopchnic Wireless Headset with Noise Cancelling Microphone",
"brand": "Mopchnic",
"asin": "B0G8G4SQXQ",
"price": 29.99,
"original_price": 46.99,
"currency": "$",
"discount_percentage": 36,
"availability": "In Stock",
"rating": 4.5,
"review_count": 5000,
"features": [
"Active Noise Cancelling",
"Built-in Microphone",
"Bluetooth 5.3",
"Hi-Fi Stereo Sound",
"USB-C Charging"
],
"images": [
"https://m.media-amazon.com/images/I/61+j+lJ6eJL._AC_SL1500_.jpg",
"https://m.media-amazon.com/images/I/61c+5p-3pNL._AC_SL1500_.jpg",
"https://m.media-amazon.com/images/I/61Xg8p1k73L._AC_SL1500_.jpg"
],
"seller": "Mopchnic Official Store",
"ships_from": "Amazon",
"prime": true,
"bsr_rank": 15,
"bsr_category": "Electronics > Headphones > Over-Ear Headphones",
"color": "Black",
"model_number": "MCH-ANC-BT53-BLK"
}
A few things to notice: bsr_category returns the full path, not just the leaf category; discount_percentage comes from what is shown on the page rather than being computed from the two prices; and currency returns the symbol as rendered ("$" rather than "USD").
If your pipeline needs ISO codes, either add that instruction to the prompt ("return currency as a 3-letter ISO code like USD, GBP, EUR") or normalize in code:
CURRENCY_MAP = {"$": "USD", "£": "GBP", "€": "EUR", "¥": "JPY", "CA$": "CAD"}
product["currency"] = CURRENCY_MAP.get(product.get("currency", ""), product.get("currency"))
Why proxyCountry matters
Without a proxyCountry preference, the request goes out through whichever residential proxy has available capacity, which can be in any country. Amazon serves localized pricing based on the connecting IP's country.
In testing, a request without proxyCountry returned a price of 250 SEK for the same product that costs $29.99 in the US. Setting proxyCountry: "us" gives you consistent USD pricing on amazon.com.
If you are scraping a regional marketplace, match the country:
| Marketplace | Domain | proxyCountry |
|---|---|---|
| United States | amazon.com | "us" |
| United Kingdom | amazon.co.uk | "gb" |
| Germany | amazon.de | "de" |
| Canada | amazon.ca | "ca" |
| Japan | amazon.co.jp | "jp" |
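If you scrape several marketplaces from one codebase, the table can live in code so the proxy country is never set by hand. A minimal sketch, with proxy_country_for as a hypothetical helper covering only the five marketplaces listed above:

```python
from urllib.parse import urlparse

# Mirrors the table above: marketplace domain -> proxyCountry value.
MARKETPLACE_PROXY_COUNTRY = {
    "amazon.com": "us",
    "amazon.co.uk": "gb",
    "amazon.de": "de",
    "amazon.ca": "ca",
    "amazon.co.jp": "jp",
}


def proxy_country_for(url: str) -> str:
    """Pick the proxyCountry that matches the marketplace in the URL."""
    host = (urlparse(url).hostname or "").lower()
    if host.startswith("www."):
        host = host[4:]
    try:
        return MARKETPLACE_PROXY_COUNTRY[host]
    except KeyError:
        raise ValueError(f"no proxy country configured for {host!r}") from None
```

Raising on an unknown domain is deliberate: silently falling back to a default country is how you end up with prices in the wrong currency.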
Scraping Amazon search results
Search results pages work on the same async pattern. The main difference is the output shape: instead of one product object, you get a list of product cards. Because the API requires the root schema type to be "object", you wrap the list in a named key:
{
"type": "object",
"required": ["products"],
"properties": {
"products": {
"type": "array",
"items": {
"type": "object",
"required": ["title", "asin"],
"properties": {
"title": {"type": "string"},
"asin": {"type": "string"},
"url": {"type": ["string", "null"]},
"price": {"type": ["number", "null"]},
"currency": {"type": ["string", "null"]},
"rating": {"type": ["number", "null"]},
"review_count": {"type": ["integer", "null"]},
"prime": {"type": ["boolean", "null"]},
"sponsored": {"type": ["boolean", "null"]},
"thumbnail": {"type": ["string", "null"]}
}
}
}
}
}
Amazon lazy-loads thumbnails as the page is scrolled. Without scroll actions, image URLs come back null on most cards. Adding a scroll sequence before extraction triggers the lazy load:
REST API
curl -X POST https://api.spidra.io/api/scrape \
-H "x-api-key: YOUR_API_KEY" \
-H "Content-Type: application/json" \
-d '{
"urls": [
{
"url": "https://www.amazon.com/s?k=wireless+headphones",
"actions": [
{"type": "scroll", "to": "50%"},
{"type": "wait", "duration": 1000},
{"type": "scroll", "to": "100%"},
{"type": "wait", "duration": 1000}
]
}
],
"prompt": "Extract all product listings. For rating check aria-label attributes containing out of 5 stars. For prime check for Prime badge images or prime in class names. For thumbnail use the m.media-amazon.com image src. Clean product URLs to https://www.amazon.com/dp/ASIN format.",
"output": "json",
"useProxy": true,
"proxyCountry": "us",
"schema": { ... }
}'
Python SDK
from spidra import BrowserAction
job = spidra.scrape.run_sync(ScrapeParams(
urls=[
ScrapeUrl(
url="https://www.amazon.com/s?k=wireless+headphones",
actions=[
BrowserAction(type="scroll", to="50%"),
BrowserAction(type="wait", duration=1000),
BrowserAction(type="scroll", to="100%"),
BrowserAction(type="wait", duration=1000),
]
)
],
prompt="Extract all product listings. For rating check aria-label attributes containing out of 5 stars. For prime check for Prime badge images or prime in class names. For thumbnail use the m.media-amazon.com image src. Clean product URLs to https://www.amazon.com/dp/ASIN format.",
output="json",
schema=SEARCH_SCHEMA,  # the search results schema shown above, as a Python dict
use_proxy=True,
proxy_country="us",
))
products = job.result.content["products"]
print(f"Got {len(products)} products")
Node.js SDK
const job = await spidra.scrape.run({
urls: [{
url: 'https://www.amazon.com/s?k=wireless+headphones',
actions: [
{ type: 'scroll', to: '50%' },
{ type: 'wait', duration: 1000 },
{ type: 'scroll', to: '100%' },
{ type: 'wait', duration: 1000 },
],
}],
prompt: 'Extract all product listings. For rating check aria-label attributes containing out of 5 stars. For prime check for Prime badge images or prime in class names. For thumbnail use the m.media-amazon.com image src. Clean product URLs to https://www.amazon.com/dp/ASIN format.',
output: 'json',
schema: SEARCH_SCHEMA,
useProxy: true,
proxyCountry: 'us',
})
const { products } = job.result.content as any
What the results look like
From a real test against https://www.amazon.com/headset/s?k=headset:
{
"products": [
{
"title": "HyperX Cloud II Gaming Headset - 7.1 Surround Sound",
"asin": "B00SAYCXWG",
"url": "https://www.amazon.com/HyperX-Cloud-Gaming-Headset-KHX-HSCP-RD/dp/B00SAYCXWG",
"price": 49.99,
"currency": "USD",
"rating": null,
"review_count": 2000,
"prime": false,
"sponsored": false,
"thumbnail": "https://m.media-amazon.com/images/I/71631Jb-dZL._AC_SX679_.jpg"
},
{
"title": "Bose QuietComfort Headphones - Wireless Bluetooth, Active Noise Cancelling",
"asin": "B0CCZ26B5V",
"url": "https://www.amazon.com/Bose-QuietComfort-Cancelling-Headphones-Bluetooth/dp/B0CCZ26B5V",
"price": 249.00,
"currency": "USD",
"rating": null,
"review_count": 7000,
"prime": false,
"sponsored": false,
"thumbnail": "https://m.media-amazon.com/images/I/71kW6VVY1cL._AC_SX679_.jpg"
}
]
}
In the returned data above, rating is null on all products. This is because Amazon renders star ratings in search results as SVG graphics, not as readable text in the DOM.
There is no reliable way to extract a number from a star image at this stage. You can collect ratings from the product detail page, which always has the rating as clear text.
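The url values in this sample still carry Amazon's title slug before /dp/{ASIN}, but every card also returns the bare asin, which is all you need to build the clean /dp/{ASIN} URL for the next stage.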
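Joining the two stages is a lookup by ASIN. The sketch below assumes you have the search cards as a list of dicts and the detail-page results keyed by ASIN; backfill_ratings is a hypothetical helper, not part of the SDK.

```python
def backfill_ratings(cards: list[dict], details_by_asin: dict[str, dict]) -> list[dict]:
    """Fill null search-card ratings from detail-page data and normalize each URL."""
    merged = []
    for card in cards:
        detail = details_by_asin.get(card["asin"], {})
        row = dict(card)  # copy so the original card is left untouched
        if row.get("rating") is None:
            row["rating"] = detail.get("rating")
        # Rebuild the URL from the ASIN so it is in /dp/{ASIN} form regardless of the input
        row["url"] = f"https://www.amazon.com/dp/{card['asin']}"
        merged.append(row)
    return merged
```

Cards with no matching detail record keep a null rating, so a partial batch failure does not break the merge.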
Collecting at scale: search to batch PDPs
One search page gives you roughly 20-25 ASINs. Three pages gives you 60-75. From there, a single batch request can scrape all of them in parallel, up to 50 at a time. The full pipeline looks like this:
import json
import os
from urllib.parse import quote_plus

from spidra import SpidraClient, ScrapeParams, ScrapeUrl, BrowserAction, BatchScrapeParams

# SEARCH_SCHEMA and PRODUCT_SCHEMA are the two schemas shown earlier, as Python dicts.
spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])


def collect_asins(keyword: str, pages: int = 3) -> list[str]:
    asins = []
    seen = set()
    for page in range(1, pages + 1):
        # quote_plus turns spaces into "+" and escapes characters such as "&"
        url = f"https://www.amazon.com/s?k={quote_plus(keyword)}&page={page}"
        job = spidra.scrape.run_sync(ScrapeParams(
            urls=[
                ScrapeUrl(
                    url=url,
                    actions=[
                        BrowserAction(type="scroll", to="50%"),
                        BrowserAction(type="wait", duration=1000),
                        BrowserAction(type="scroll", to="100%"),
                        BrowserAction(type="wait", duration=1000),
                    ]
                )
            ],
            prompt="Extract all product listings. Clean product URLs to https://www.amazon.com/dp/ASIN format.",
            output="json",
            schema=SEARCH_SCHEMA,
            use_proxy=True,
            proxy_country="us",
        ))
        if job.result.ai_extraction_failed:
            print(f"Page {page}: extraction failed, skipping")
            continue
        # Deduplicate across pages while preserving first-seen order
        for p in job.result.content.get("products", []):
            asin = p.get("asin")
            if asin and asin not in seen:
                asins.append(asin)
                seen.add(asin)
        print(f"Page {page}: {len(asins)} unique ASINs collected")
    return asins


def scrape_products(asins: list[str]) -> list[dict]:
    results = []
    urls = [f"https://www.amazon.com/dp/{asin}" for asin in asins]
    # Batch requests take up to 50 URLs, so work through the list in chunks
    for i in range(0, len(urls), 50):
        chunk = urls[i:i + 50]
        batch_num = i // 50 + 1
        total = -(-len(urls) // 50)  # ceiling division
        batch = spidra.batch.run_sync(BatchScrapeParams(
            urls=chunk,
            prompt="Extract the full product details",
            output="json",
            schema=PRODUCT_SCHEMA,
            use_proxy=True,
            proxy_country="us",
        ))
        for item in batch.items:
            if item.status == "completed" and item.result:
                results.append(item.result)
            else:
                print(f"  Failed: {item.url}")
        print(f"Batch {batch_num}/{total}: {batch.completed_count}/{batch.total_urls} succeeded")
    return results


asins = collect_asins("wireless headphones", pages=2)
products = scrape_products(asins)
with open("amazon_products.jsonl", "w") as f:
    for product in products:
        f.write(json.dumps(product) + "\n")
print(f"Saved {len(products)} products to amazon_products.jsonl")
The Node.js version of the same pipeline:
import { SpidraClient } from 'spidra'
import { writeFileSync } from 'fs'
import * as os from 'os'
const spidra = new SpidraClient({ apiKey: process.env.SPIDRA_API_KEY! })
async function collectAsins(keyword: string, pages = 3): Promise<string[]> {
const asins: string[] = []
const seen = new Set<string>()
for (let page = 1; page <= pages; page++) {
const url = `https://www.amazon.com/s?k=${keyword.replace(/ /g, '+')}&page=${page}`
const job = await spidra.scrape.run({
urls: [{
url,
actions: [
{ type: 'scroll', to: '50%' },
{ type: 'wait', duration: 1000 },
{ type: 'scroll', to: '100%' },
{ type: 'wait', duration: 1000 },
],
}],
prompt: 'Extract all product listings. Clean product URLs to https://www.amazon.com/dp/ASIN format.',
output: 'json',
schema: SEARCH_SCHEMA,
useProxy: true,
proxyCountry: 'us',
})
const products = (job.result.content as any)?.products ?? []
for (const p of products) {
if (p.asin && !seen.has(p.asin)) {
asins.push(p.asin)
seen.add(p.asin)
}
}
console.log(`Page ${page}: ${asins.length} unique ASINs`)
}
return asins
}
async function scrapeProducts(asins: string[]): Promise<unknown[]> {
const results: unknown[] = []
const urls = asins.map(a => `https://www.amazon.com/dp/${a}`)
for (let i = 0; i < urls.length; i += 50) {
const chunk = urls.slice(i, i + 50)
const batch = await spidra.batch.run({
urls: chunk,
prompt: 'Extract the full product details',
output: 'json',
schema: PRODUCT_SCHEMA,
useProxy: true,
proxyCountry: 'us',
})
for (const item of batch.items) {
if (item.status === 'completed' && item.result) results.push(item.result)
}
console.log(`Batch: ${batch.completedCount}/${batch.totalUrls}`)
}
return results
}
const asins = await collectAsins('wireless headphones', 2)
const products = await scrapeProducts(asins)
writeFileSync('amazon_products.jsonl', products.map(p => JSON.stringify(p)).join(os.EOL))
console.log(`Saved ${products.length} products`)
Price monitoring
Once you have a list of ASINs you track regularly, checking them for price changes is a straightforward batch job. Run it daily, compare to the previous snapshot, and surface anything that moved by more than your threshold.
import json
import os
from pathlib import Path

from spidra import SpidraClient, BatchScrapeParams

spidra = SpidraClient(api_key=os.environ["SPIDRA_API_KEY"])

PRICE_SCHEMA = {
    "type": "object",
    "required": ["asin", "title", "price", "availability"],
    "properties": {
        "asin": {"type": "string"},
        "title": {"type": "string"},
        "price": {"type": ["number", "null"]},
        "original_price": {"type": ["number", "null"]},
        "currency": {"type": ["string", "null"]},
        "availability": {"type": "string"},
    }
}

# Real ASINs tested and confirmed working
WATCHED_ASINS = [
    "B0G8G4SQXQ",  # Mopchnic Wireless Headset
    "B00SAYCXWG",  # HyperX Cloud II
    "B0C3BV19Q3",  # HyperX Cloud III
    "B0BS1RT9S2",  # Sony WH-CH520
    "B0CCZ26B5V",  # Bose QuietComfort
]


def load_previous(path="data/prices.json") -> dict:
    p = Path(path)
    return json.loads(p.read_text()) if p.exists() else {}


def save_current(data: dict, path="data/prices.json"):
    Path(path).parent.mkdir(parents=True, exist_ok=True)
    Path(path).write_text(json.dumps(data, indent=2))


def check_prices(asins: list[str]) -> dict:
    batch = spidra.batch.run_sync(BatchScrapeParams(
        urls=[f"https://www.amazon.com/dp/{a}" for a in asins],
        prompt="Extract the product ASIN, title, price, and availability",
        output="json",
        schema=PRICE_SCHEMA,
        use_proxy=True,
        proxy_country="us",
    ))
    # Key the snapshot by ASIN so it can be compared with the previous run
    results = {}
    for item in batch.items:
        if item.status == "completed" and item.result:
            asin = item.result.get("asin")
            if asin:
                results[asin] = item.result
    return results


def find_changes(previous: dict, current: dict, threshold: float = 3.0) -> list[dict]:
    changes = []
    for asin, data in current.items():
        curr = data.get("price")
        prev = previous.get(asin, {}).get("price")
        # Skip missing or zero prices (this also avoids dividing by zero)
        if not curr or not prev:
            continue
        pct = ((curr - prev) / prev) * 100
        if abs(pct) >= threshold:
            changes.append({
                "asin": asin,
                "title": (data.get("title") or "")[:60],
                "prev_price": prev,
                "curr_price": curr,
                "change_pct": round(pct, 1),
                "direction": "up" if pct > 0 else "down",
            })
    # Largest moves first
    return sorted(changes, key=lambda x: abs(x["change_pct"]), reverse=True)


previous = load_previous()
current = check_prices(WATCHED_ASINS)
save_current(current)
changes = find_changes(previous, current)
if changes:
    print(f"{len(changes)} price changes:")
    for c in changes:
        sign = "+" if c["direction"] == "up" else ""
        print(f"  {c['title']}: ${c['prev_price']} to ${c['curr_price']} ({sign}{c['change_pct']}%)")
else:
    print("No significant price changes")
This covers the market research and data enrichment pattern for e-commerce. You can swap in the full product schema to access more detailed data on each monitoring run without changing the pipeline structure.
