Amazon product data is useful for price monitoring, competitive research, trend tracking, and building product catalogs. The data is publicly visible on every product page. In this tutorial you will build a Python scraper that extracts it.
We will use the Mopchnic Wireless Headset as our target throughout:
https://www.amazon.com/dp/B0G8G4SQXQ
By the end you will have a scraper that extracts the following from a product page:
- Product title
- Price
- Rating and review count
- Main product image
- Bullet-point features
We will then scale it to scrape search result pages and handle pagination. At the end you will see how to do all of this without writing or maintaining any selectors.
Understanding Amazon's anti-scraping measures
Before writing a line of code, it is worth understanding what you are dealing with. Amazon is one of the most aggressively protected sites on the internet.
- CAPTCHA challenges appear when Amazon detects automated behaviour — repeated requests from the same IP, missing browser headers, or unusual request timing. They block the scraping process entirely.
- Rate limiting kicks in when too many requests hit the servers within a short window. Amazon temporarily blocks the IP or surfaces a CAPTCHA as a checkpoint.
- IP blocking happens at the network layer. Datacenter IP ranges are on a deny list by default. A plain requests call from a cloud server or your home IP will often get a block page rather than a product page, especially under any sustained load.
For a single test request with a good User-Agent you can often get through. At any real scale, you need proxy rotation and browser rendering.
Prerequisites
Make sure you have Python 3.9 or higher. Install the required libraries:
pip install requests beautifulsoup4
We will use requests as the HTTP client and BeautifulSoup to parse the HTML.
Step 1: retrieve the page HTML
Start with a basic request to confirm you can reach the page. Create a file called scraper.py and add the following:
import requests
target_url = "https://www.amazon.com/dp/B0G8G4SQXQ"
response = requests.get(target_url)
if response.status_code != 200:
    print(f"Request failed: {response.status_code}")
else:
    print(response.text[:500])
Run this and you will likely get a 503, a CAPTCHA page, or a "Something went wrong" block page rather than the product. The request goes out with the default python-requests User-Agent and none of the headers a browser normally sends, which makes it easy for Amazon to flag as automated.
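A status check alone can miss a block, because a CAPTCHA interstitial may come back with a 200. The sketch below shows one way to test the body as well. The helper name looks_blocked and the marker strings are assumptions based on wording commonly seen on Amazon's robot-check page, so verify them against a blocked response you actually receive.

```python
# Hypothetical helper. The marker strings are assumptions, not an official list.
BLOCK_MARKERS = (
    "enter the characters you see below",
    "api-services-support@amazon.com",
    "/errors/validatecaptcha",
)


def looks_blocked(status_code: int, html: str) -> bool:
    """Return True if the response is probably a block or CAPTCHA page."""
    if status_code != 200:
        return True
    lowered = html.lower()
    return any(marker in lowered for marker in BLOCK_MARKERS)


print(looks_blocked(503, ""))
print(looks_blocked(200, "<form action='/errors/validateCaptcha'>"))
print(looks_blocked(200, "<span id='productTitle'>Headset</span>"))
```

In the scraper you would call looks_blocked(response.status_code, response.text) in place of the bare status comparison.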
Adding a User-Agent header makes the request look more like a real browser:
import requests
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}
target_url = "https://www.amazon.com/dp/B0G8G4SQXQ"
response = requests.get(target_url, headers=custom_headers)
if response.status_code != 200:
    print(f"Request failed: {response.status_code}")
else:
    print("Success")
    print(response.text[:500])
This can get you a 200 for a small number of requests. Once you start scraping at volume, Amazon flags the IP. If you are sending more than a handful of requests, add a proxy:
proxies = {
    "http": "http://YOUR_PROXY_HOST:PORT",
    "https": "http://YOUR_PROXY_HOST:PORT",
}
response = requests.get(target_url, headers=custom_headers, proxies=proxies)
Free proxy lists are unreliable and short-lived. Residential proxies from a paid provider are the only option that holds up at scale. We will cover the managed approach at the end of this guide.
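If your provider gives you several endpoints rather than one rotating gateway, you can rotate through them yourself. This is a minimal sketch, assuming a small fixed pool. The hosts in PROXY_POOL are placeholders and next_proxies is a hypothetical helper, not part of requests.

```python
from itertools import cycle

# Placeholder endpoints (assumed). Replace with your provider's hosts.
PROXY_POOL = [
    "http://proxy-1.example.com:8000",
    "http://proxy-2.example.com:8000",
    "http://proxy-3.example.com:8000",
]
_rotation = cycle(PROXY_POOL)


def next_proxies() -> dict:
    """Return a requests-style proxies dict using the next proxy in the pool."""
    proxy = next(_rotation)
    return {"http": proxy, "https": proxy}


# Each call hands back the next proxy, wrapping around at the end of the pool.
first = next_proxies()
second = next_proxies()
print(first["https"], second["https"])
```

Pass the result straight to the request, for example requests.get(target_url, headers=custom_headers, proxies=next_proxies()).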
Your complete setup code before any parsing:
import requests
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}
target_url = "https://www.amazon.com/dp/B0G8G4SQXQ"
response = requests.get(target_url, headers=custom_headers)
if response.status_code != 200:
    print(f"Request failed: {response.status_code}")
else:
    print("Got the page")
Step 2: scrape the product details
Once you have the HTML, you use BeautifulSoup to find specific elements. The general approach is: open the product page in Chrome, right-click the element you want, select Inspect, find a stable ID or class name, then use soup.find() or soup.select_one() to extract it.
Add BeautifulSoup to your imports and parse the HTML:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")
Locate and extract the product title
Right-click the product title and select Inspect. The title lives in a <span> tag with the ID productTitle. IDs are the most stable selectors on Amazon because they identify a single element rather than a class shared across many:
title_element = soup.select_one("#productTitle")
title = title_element.text.strip() if title_element else None
print("Title:", title)
Title: Mopchnic Wireless Headset with Noise Cancelling Microphone, On-Ear Bluetooth Headset with USB Dongle, Mute Function & Charging Base
Locate and extract the price
Amazon's price element has changed a few times and now uses multiple container structures depending on whether the product is on sale, has a Prime deal, or is a third-party listing.
The selector #priceblock_ourprice that appears in older tutorials is gone. The reliable approach in 2026 is to try multiple fallbacks:
price = None
for selector in [
    "#corePriceDisplay_desktop_feature_div .a-offscreen",
    "span.priceToPay .a-offscreen",
    ".apexPriceToPay .a-offscreen",
    ".a-price .a-offscreen",
]:
    el = soup.select_one(selector)
    if el:
        price = el.text.strip()
        break
print("Price:", price)
Price: $29.99
The fallback loop is necessary because Amazon uses different price containers depending on the product type, the seller, and which A/B test you land in. A single selector will work until it does not.
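The selectors return the price as display text such as "$29.99". If you plan to compare or store prices, a numeric value is easier to work with. The sketch below is one way to convert it. parse_price is a hypothetical helper, and it assumes US formatting (comma as thousands separator, dot as decimal point).

```python
import re
from decimal import Decimal
from typing import Optional


def parse_price(text: Optional[str]) -> Optional[Decimal]:
    """Convert a display price such as '$1,299.99' to a Decimal.

    Assumes US formatting: comma as thousands separator, dot as decimal point.
    Returns None when the text is missing or contains no number.
    """
    if not text:
        return None
    match = re.search(r"\d[\d,]*(?:\.\d+)?", text)
    if not match:
        return None
    return Decimal(match.group().replace(",", ""))


print(parse_price("$29.99"))
print(parse_price("$1,299.99"))
print(parse_price(None))
```

Decimal is used rather than float so that values like 29.99 are stored exactly, which matters once you start computing price differences.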
Locate and extract the rating
The star rating in 2026 is most reliably in the #acrPopover element, which contains a span with the full "4.5 out of 5 stars" text:
rating_element = soup.select_one("#acrPopover span.a-icon-alt")
rating = rating_element.text.strip() if rating_element else None
print("Rating:", rating)
Rating: 4.5 out of 5 stars
Locate and extract the review count
The review count sits in a <span> with the ID acrCustomerReviewText:
review_count_element = soup.select_one("#acrCustomerReviewText")
review_count = review_count_element.text.strip() if review_count_element else None
print("Reviews:", review_count)
Reviews: 5,000 ratings
If you need a number rather than a formatted string for storage or comparison, parse it:
import re
review_count_raw = review_count_element.text.strip() if review_count_element else ""
# Keep only the digits, and fall back to 0 when none are present
digits = re.sub(r"[^\d]", "", review_count_raw)
review_count_num = int(digits) if digits else 0
Locate and extract the main product image
The main product image has a stable ID of landingImage. Extract the src attribute:
image_element = soup.select_one("#landingImage")
main_image = image_element.get("src") if image_element else None
print("Image:", main_image)
Image: https://m.media-amazon.com/images/I/61+j+lJ6eJL._AC_SL1500_.jpg
Locate and extract the bullet-point features
The "About this item" bullet points live inside #feature-bullets. Each point is a <li> containing a <span> with the text:
feature_bullets = []
bullets = soup.select("#feature-bullets .a-list-item")
for bullet in bullets:
    text = bullet.text.strip()
    if text:
        feature_bullets.append(text)
print("Features:", feature_bullets[:2])
Features: ['ENC Noise Cancelling & Clear Calls: This wireless headset blocks most distracting background noise...', 'Dual-Pairing & Wide Compatibility: Bluetooth 5.3 supports connection to two devices at once...']
Step 3: complete product scraper code
import requests
import re
from bs4 import BeautifulSoup
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}
target_url = "https://www.amazon.com/dp/B0G8G4SQXQ"
response = requests.get(target_url, headers=custom_headers)
if response.status_code != 200:
    print(f"Request failed: {response.status_code}")
else:
    soup = BeautifulSoup(response.text, "html.parser")
    # Title
    title_el = soup.select_one("#productTitle")
    title = title_el.text.strip() if title_el else None
    # Price: try multiple selectors, Amazon changes these regularly
    price = None
    for selector in [
        "#corePriceDisplay_desktop_feature_div .a-offscreen",
        "span.priceToPay .a-offscreen",
        ".apexPriceToPay .a-offscreen",
        ".a-price .a-offscreen",
    ]:
        el = soup.select_one(selector)
        if el:
            price = el.text.strip()
            break
    # Rating
    rating_el = soup.select_one("#acrPopover span.a-icon-alt")
    rating = rating_el.text.strip() if rating_el else None
    # Review count: keep only the digits, fall back to 0 when none are present
    review_count_el = soup.select_one("#acrCustomerReviewText")
    review_count_raw = review_count_el.text.strip() if review_count_el else ""
    digits = re.sub(r"[^\d]", "", review_count_raw)
    review_count = int(digits) if digits else 0
    # Main image
    image_el = soup.select_one("#landingImage")
    main_image = image_el.get("src") if image_el else None
    # Feature bullets
    feature_bullets = [
        b.text.strip()
        for b in soup.select("#feature-bullets .a-list-item")
        if b.text.strip()
    ]
    data = {
        "title": title,
        "price": price,
        "rating": rating,
        "review_count": review_count,
        "main_image": main_image,
        "features": feature_bullets,
    }
    print(data)
Output:
{
    'title': 'Mopchnic Wireless Headset with Noise Cancelling Microphone...',
    'price': '$29.99',
    'rating': '4.5 out of 5 stars',
    'review_count': 5000,
    'main_image': 'https://m.media-amazon.com/images/I/61+j+lJ6eJL._AC_SL1500_.jpg',
    'features': ['ENC Noise Cancelling & Clear Calls...', 'Dual-Pairing & Wide Compatibility...', ...]
}
Step 4: export to CSV
import csv
csv_file = "product.csv"
with open(csv_file, mode="w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Title", "Price", "Rating", "Review Count", "Main Image"])
    writer.writerow([
        data["title"],
        data["price"],
        data["rating"],
        data["review_count"],
        data["main_image"],
    ])
print(f"Saved to {csv_file}")
Scraping Amazon search results
A single product page is a good start, but most use cases require collecting data across many products. Amazon search results pages list 20-25 products per page and are the starting point for any bulk extraction pipeline.
The target URL for a keyword search follows a consistent pattern:
https://www.amazon.com/s?k=wireless+headphones
Inspect any product listing. The product link sits in an <a> tag inside an <h2> element within a container that has a data-asin attribute. That attribute is the most reliable anchor because it uniquely identifies each product card:
import requests
from bs4 import BeautifulSoup
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}
base_url = "https://www.amazon.com"
target_url = "https://www.amazon.com/s?k=wireless+headphones"
response = requests.get(target_url, headers=custom_headers)
if response.status_code != 200:
    print(f"Request failed: {response.status_code}")
else:
    soup = BeautifulSoup(response.text, "html.parser")
    product_links = []
    for link in soup.select("[data-asin] h2 a"):
        href = link.get("href", "")
        if href and "/dp/" in href:
            full_url = base_url + href if not href.startswith("https") else href
            product_links.append(full_url)
    print(f"Found {len(product_links)} products on this page")
    for url in product_links[:5]:
        print(url)
The [data-asin] h2 a selector finds every product title link by looking for anchor tags inside <h2> elements that live within a container with a data-asin attribute. This is more reliable than class-based selectors on Amazon search pages because data-asin is used structurally rather than for styling.
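Links collected from search pages typically carry a title slug and tracking parameters, so the same product can show up under several different URLs. One way to normalise them is to pull out the ASIN and rebuild a short /dp/ URL. The sketch below assumes ASINs are ten uppercase alphanumeric characters. The helper names and the sample URLs are illustrative only.

```python
import re
from typing import Optional

# Assumption: an ASIN is 10 uppercase alphanumeric characters following /dp/.
ASIN_PATTERN = re.compile(r"/dp/([A-Z0-9]{10})(?:[/?]|$)")


def extract_asin(url: str) -> Optional[str]:
    """Return the ASIN embedded in a product URL, or None if there is none."""
    match = ASIN_PATTERN.search(url)
    return match.group(1) if match else None


def canonical_url(url: str) -> Optional[str]:
    """Rebuild a short product URL so duplicates compare equal."""
    asin = extract_asin(url)
    return f"https://www.amazon.com/dp/{asin}" if asin else None


# Illustrative inputs: two differently decorated links to the same product.
links = [
    "https://www.amazon.com/Mopchnic-Wireless-Headset/dp/B0G8G4SQXQ/ref=sr_1_1?keywords=wireless",
    "https://www.amazon.com/dp/B0G8G4SQXQ?th=1",
]
# dict.fromkeys removes duplicates while keeping first-seen order.
unique = list(dict.fromkeys(canonical_url(u) for u in links if extract_asin(u)))
print(unique)
```

Deduplicating on the canonical URL rather than the raw href keeps the link list from growing with near-identical entries.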
Handling pagination
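Amazon breaks search results across multiple pages. The Next button at the bottom of the listing has a stable class of s-pagination-next. On the last page there is no Next link left to follow: the control is either absent or rendered as a disabled, non-link element, so a selector for a.s-pagination-next returns nothing.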
Wrap the scraper in a loop that follows the Next link until it is gone:
import csv
import time
import requests
from bs4 import BeautifulSoup
custom_headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36",
    "Accept-Language": "en-US,en;q=0.9",
    "Referer": "https://www.google.com/",
}
base_url = "https://www.amazon.com"
target_url = "https://www.amazon.com/s?k=wireless+headphones"
all_links = []
while True:
    response = requests.get(target_url, headers=custom_headers)
    if response.status_code != 200:
        print(f"Request failed: {response.status_code}")
        break
    soup = BeautifulSoup(response.text, "html.parser")
    for link in soup.select("[data-asin] h2 a"):
        href = link.get("href", "")
        if href and "/dp/" in href:
            full_url = base_url + href if not href.startswith("https") else href
            if full_url not in all_links:
                all_links.append(full_url)
    print(f"Page scraped: {len(all_links)} total links so far")
    # Find the next page link
    next_page = soup.select_one("a.s-pagination-next")
    if not next_page:
        print("No more pages")
        break
    next_href = next_page.get("href", "")
    if not next_href:
        break
    target_url = base_url + next_href if not next_href.startswith("https") else next_href
    # Pause between requests to avoid rate limiting
    time.sleep(3)
# Save all links to CSV
with open("product_links.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["Product URL"])
    for url in all_links:
        writer.writerow([url])
print(f"Saved {len(all_links)} links to product_links.csv")
This crawls all pages for the keyword, stops when the Next button disappears, and saves the collected links to a CSV.
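If you would rather request a fixed number of pages than follow the Next link until it runs out, you can build each page URL directly. The sketch below uses the k parameter from the search URL shown earlier. The page parameter is an assumption about how Amazon numbers result pages, and build_search_url is a hypothetical helper, so confirm the URL shape in your browser first.

```python
from urllib.parse import urlencode


def build_search_url(keyword: str, page: int = 1) -> str:
    """Build a search URL for a keyword and an optional page number."""
    params = {"k": keyword}
    if page > 1:
        params["page"] = page  # assumed parameter name for pagination
    # urlencode escapes spaces as '+', matching the URL pattern shown earlier.
    return "https://www.amazon.com/s?" + urlencode(params)


# Build URLs for the first three pages only, which caps the crawl size.
for page in range(1, 4):
    print(build_search_url("wireless headphones", page))
```

Capping the page count this way also bounds how many requests a single run can send.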
The challenges you will run into at scale
Getting blocked
The User-Agent header and 3-second delays help, but they are not enough for sustained scraping. Amazon's bot detection goes beyond headers — it looks at TLS fingerprints, browser behaviour signals, and IP reputation.
Datacenter IPs get blocked almost immediately under volume. Residential proxies with rotation are the baseline requirement for reliable Amazon scraping.
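A fixed 3-second pause produces very regular request timing. A common alternative is exponential backoff with random jitter, where the wait grows after each failed attempt. This is a minimal sketch. backoff_delay and its default values are assumptions chosen for illustration, not figures published by Amazon.

```python
import random


def backoff_delay(attempt: int, base: float = 3.0, cap: float = 60.0) -> float:
    """Return the seconds to wait before retry number `attempt` (0-based).

    The ceiling doubles with each attempt up to `cap`. The actual delay is
    drawn at random from the upper half of that range.
    """
    ceiling = min(cap, base * (2 ** attempt))
    return random.uniform(ceiling / 2, ceiling)


for attempt in range(4):
    print(f"attempt {attempt}: wait {backoff_delay(attempt):.1f}s")
```

In the pagination loop you would pass the result to time.sleep() after a failed or blocked response instead of stopping immediately.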
Selectors breaking without warning
Amazon updates its frontend continuously. The most famous casualty is #priceblock_ourprice — it appeared in hundreds of scraping tutorials for years before Amazon removed it. Any code depending on that ID silently returned nothing. The price selector in this guide already uses multiple fallbacks because a single selector is not sustainable. Amazon does not announce these changes and does not treat your selectors as a contract.
The further implication is that every selector in this tutorial may need updating by the time you read it. Check each element in DevTools before assuming the selector will work.
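Because a broken selector returns None instead of raising an error, it helps to check each scraped record before saving it. The sketch below flags empty fields in a dictionary shaped like the data dict from Step 3. Which fields count as required is an assumption you should adjust, and missing_fields is a hypothetical helper.

```python
# Assumed set of fields every product page should yield.
REQUIRED_FIELDS = ("title", "price", "rating", "main_image")


def missing_fields(data: dict) -> list:
    """Return the names of required fields that are empty or absent."""
    return [name for name in REQUIRED_FIELDS if not data.get(name)]


# Example record where the price and image selectors found nothing.
sample = {
    "title": "Mopchnic Wireless Headset",
    "price": None,
    "rating": "4.5 out of 5 stars",
    "main_image": "",
}
problems = missing_fields(sample)
if problems:
    print("Selectors may be out of date for:", ", ".join(problems))
```

Logging this on every run gives you an early signal that a selector has stopped matching.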
The easier approach: Spidra
Instead of managing selectors, proxies, and CAPTCHA handling separately, Spidra handles all three.
You describe what you want from the page and define an output schema. Spidra loads the page in a real browser, routes through residential proxies automatically, and returns structured JSON matching your schema regardless of how Amazon's HTML is structured.
Sign up free at app.spidra.io — 300 credits, no card required.
import requests, time, os
API_KEY = os.environ["SPIDRA_API_KEY"]
BASE = "https://api.spidra.io/api"
HEADERS = {"x-api-key": API_KEY, "Content-Type": "application/json"}
PRODUCT_SCHEMA = {
    "type": "object",
    "required": ["title", "asin", "price", "availability"],
    "properties": {
        "title": {"type": "string"},
        "brand": {"type": ["string", "null"]},
        "asin": {"type": "string"},
        "price": {"type": ["number", "null"]},
        "original_price": {"type": ["number", "null"]},
        "currency": {"type": ["string", "null"]},
        "discount_percentage": {"type": ["integer", "null"]},
        "availability": {"type": "string"},
        "rating": {"type": ["number", "null"]},
        "review_count": {"type": ["integer", "null"]},
        "features": {"type": "array", "items": {"type": "string"}},
        "images": {"type": "array", "items": {"type": "string"}},
        "seller": {"type": ["string", "null"]},
        "prime": {"type": ["boolean", "null"]},
        "bsr_rank": {"type": ["integer", "null"]},
        "bsr_category": {"type": ["string", "null"]},
    }
}
resp = requests.post(f"{BASE}/scrape", headers=HEADERS, json={
    "urls": [{"url": "https://www.amazon.com/dp/B0G8G4SQXQ"}],
    "prompt": "Extract the full product details",
    "output": "json",
    "useProxy": True,
    "proxyCountry": "us",
    "schema": PRODUCT_SCHEMA,
})
job_id = resp.json()["jobId"]
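# Poll until the job completes, giving up after about 3 minutes
for _ in range(60):
    result = requests.get(f"{BASE}/scrape/{job_id}", headers=HEADERS).json()
    if result["status"] == "completed":
        break
    time.sleep(3)
else:
    raise TimeoutError("Scrape job did not complete in time")
print(result["result"]["content"])
Output from the actual request: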
{
    "title": "Mopchnic Wireless Headset with Noise Cancelling Microphone",
    "brand": "Mopchnic",
    "asin": "B0G8G4SQXQ",
    "price": 29.99,
    "original_price": 46.99,
    "currency": "$",
    "discount_percentage": 36,
    "availability": "In Stock",
    "rating": 4.5,
    "review_count": 5000,
    "features": [
        "ENC Noise Cancelling & Clear Calls",
        "Dual-Pairing & Wide Compatibility",
        "Extended Battery & Easy Charging",
        "Crystal Stereo Sound for Calls & Music",
        "Ultra Comfortable & All-Day Wear"
    ],
    "images": [
        "https://m.media-amazon.com/images/I/61+j+lJ6eJL._AC_SL1500_.jpg",
        "https://m.media-amazon.com/images/I/61c+5p-3pNL._AC_SL1500_.jpg"
    ],
    "seller": "Mopchnic Official Store",
    "prime": true,
    "bsr_rank": 15,
    "bsr_category": "Electronics > Headphones > Over-Ear Headphones"
}
No selector maintenance. No proxy setup. No CAPTCHA handling. The schema definition replaces all of it, and when Amazon changes its HTML the prompt keeps working because it describes the data, not its location in the DOM.
For scraping many products in parallel, the batch scraping endpoint processes up to 50 ASINs in a single request. For collecting ASINs from search results first, see the Amazon product data guide which covers the full search-to-PDP pipeline.
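Given the 50-ASIN limit per batch request, a longer ASIN list has to be split before it is sent. The sketch below covers only that preparation step. The batch endpoint's path and payload are not shown in this guide, so it simply builds urls lists in the same shape as the single-page request above. The chunked helper and the ASINs are made up for illustration.

```python
from typing import Iterator, List


def chunked(items: List[str], size: int = 50) -> Iterator[List[str]]:
    """Yield consecutive slices of `items`, each with at most `size` entries."""
    for start in range(0, len(items), size):
        yield items[start:start + size]


# Made-up ASINs for illustration only.
asins = [f"B0{n:08d}" for n in range(120)]

# One list of {"url": ...} entries per batch, mirroring the "urls" field above.
batches = [
    [{"url": f"https://www.amazon.com/dp/{asin}"} for asin in chunk]
    for chunk in chunked(asins)
]
print([len(batch) for batch in batches])
```

Each entry in batches can then be submitted as the urls value of one request, following whatever payload the batch endpoint documents.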
Conclusion
You have now seen how to scrape Amazon product pages and search results with Python's Requests and BeautifulSoup. Here is a summary of what this guide covered:
- Getting the full HTML of an Amazon product page, including handling User-Agents and the blocking problem
- Extracting product title, price, rating, review count, images, and features field by field
- Exporting scraped data to a CSV file
- Collecting product links from search result pages
- Handling pagination to crawl across multiple pages
- The maintenance challenges that come with selector-based Amazon scraping
- Using Spidra as the alternative that removes selector and proxy management entirely
Scraping Amazon at any real scale is genuinely difficult. The BeautifulSoup approach works for small, occasional scrapes, but if you are building a monitoring pipeline or collecting data regularly, managing the infrastructure adds up quickly.
