There is a lot of data on the internet that is not available through an API. Product prices, news articles, job listings, research data. Most of it lives in HTML pages rather than structured endpoints. Web scraping is how you get to it programmatically.
In this guide you will learn how to scrape web pages using Python's requests library and Beautiful Soup. We use a real scraping practice site throughout so every code example produces actual output.
What is Beautiful Soup and why use it
Beautiful Soup is a Python library that parses HTML and XML documents and gives you a clean interface to navigate and search the document tree. It does not fetch web pages. That is the job of requests. Once you have the HTML, Beautiful Soup makes it straightforward to pull out the data you need.
It is a good first tool for web scraping because the learning curve is gentle, it handles messy HTML gracefully, and it works well with other libraries. Pair it with requests for static pages, with Selenium or Playwright when you need JavaScript rendering, and with Scrapy when you need to crawl an entire site.
The limitation to understand early: Beautiful Soup is a parser, not a browser. It reads whatever HTML your requests call returns. If a page loads its content via JavaScript after the initial HTML is delivered, Beautiful Soup sees only that initial HTML, and the elements you are looking for will be missing. This is a common source of confusion, and we cover it in the challenges section below.
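To make the limitation concrete, here is a minimal sketch that parses a small hand-written HTML string (a made-up page for illustration, not taken from any real site). A browser would run the script and fill the container, but a parser only sees the markup as delivered. The sketch uses Python's built-in html.parser so it runs with nothing more than beautifulsoup4 installed.

```python
from bs4 import BeautifulSoup

# Hypothetical page: a browser would execute the script and fill #products,
# but Beautiful Soup never runs JavaScript.
html = """
<html>
  <body>
    <div id="products"></div>
    <script>
      document.getElementById('products').innerHTML = '<p class="item">Book</p>';
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")
container = soup.find("div", id="products")
items = soup.find_all("p", class_="item")

# The container exists but is empty, and no .item elements were found.
print(f"items found: {len(items)}")
print(f"container is empty: {container.get_text(strip=True) == ''}")
```

An empty container next to a script tag is usually the first hint that the data you want is rendered client-side.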
Prerequisites
You need Python 3.8 or higher. Check your version:
python --version
Create and activate a virtual environment before installing anything:
# macOS / Linux
python3 -m venv venv
source venv/bin/activate
# Windows
python -m venv venv
venv\Scripts\activate
Install the required libraries:
pip install requests beautifulsoup4 lxml
requests handles HTTP. beautifulsoup4 is the parser. lxml is a fast C-based parser that Beautiful Soup can use under the hood, and it also provides XPath support when you need it.
The target page
We scrape books.toscrape.com throughout this guide. It is a sandbox site built specifically for scraping practice, publicly accessible and stable, designed to let you experiment without worrying about rate limits or legal concerns. Each page lists 20 books with a title, price, star rating, and availability.
https://books.toscrape.com/catalogue/page-1.html
Step 1: download the page with requests
Start by fetching the HTML. The get() method sends an HTTP GET request and returns a response object. The text attribute on that object contains the raw HTML:
import requests
url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url)
print(response.status_code)
print(response.text[:500])
A 200 status code means the request succeeded. Anything in the 4xx or 5xx range is an error. The text[:500] just prints the first 500 characters so you can confirm you got real HTML back rather than a block page.
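Checking the status code by hand works, but requests can also raise an exception for you on 4xx and 5xx responses. The sketch below wraps that in a small helper. The name fetch_html and its get parameter are hypothetical, introduced only for illustration; get exists so the helper can be exercised without network access. The 10-second timeout is an arbitrary choice.

```python
import requests


def fetch_html(url, timeout=10, get=requests.get):
    """Return the page HTML as text, or None if the request fails.

    `fetch_html` and `get` are illustrative names, not part of requests.
    """
    try:
        # Without a timeout, requests can wait indefinitely on a stalled server.
        response = get(url, timeout=timeout)
        # Raises requests.HTTPError for any 4xx or 5xx status code.
        response.raise_for_status()
    except requests.RequestException as exc:
        print(f"Request failed: {exc}")
        return None
    return response.text
```

You would call it as fetch_html('https://books.toscrape.com/catalogue/page-1.html') and check for None before parsing.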
Adding a realistic User-Agent header reduces the chance of getting blocked on sites that check for it:
import requests
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}
url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers)
Step 2: parse the HTML with Beautiful Soup
Once you have the HTML, pass it to the BeautifulSoup constructor along with the parser you want to use. We use lxml because it is fast and handles malformed HTML well:
from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, 'lxml')
soup is now a navigable tree of Python objects representing the entire HTML document. You can search it, navigate it, and extract text and attributes from it.
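As a rough illustration of what "navigable" means, the sketch below parses a small hand-written fragment (an assumption for illustration, not the real page) and reads text, attributes, and a parent element from it. It uses the built-in html.parser so it runs without lxml; swap in 'lxml' to match the rest of this guide.

```python
from bs4 import BeautifulSoup

# Hand-written fragment used only for illustration.
html = """
<html>
  <head><title>All products</title></head>
  <body>
    <ul class="breadcrumb">
      <li><a href="index.html">Home</a></li>
      <li class="active">All products</li>
    </ul>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

page_title = soup.title.string        # text inside the <title> tag
first_link = soup.find("a")           # first <a> in document order
link_target = first_link["href"]      # attributes are read like dictionary keys
parent_tag = first_link.parent.name   # move one level up the tree
active_text = soup.find("li", class_="active").get_text(strip=True)

print(page_title, link_target, parent_tag, active_text, sep=" | ")
```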
Step 3: inspect the page to find your selectors
Before writing extraction code, you need to know what selectors to use. Open the target page in Chrome, right-click the element you want to extract, and select Inspect. The browser's DevTools panel opens with the HTML for that element highlighted.
On Books to Scrape, each book is an <article> element with the class product_pod. Inside it you will find:
<h3><a title="Book Title">: the book title lives in the title attribute
<p class="price_color">: the price including the currency symbol
<p class="star-rating {One|Two|Three|Four|Five}">: the rating encoded in the class name
<p class="instock availability">: stock status
Knowing this structure is what you need to write the extraction code.
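One way to check your selectors before pointing them at the live site is to run them against a saved fragment. The sketch below does that with a simplified, hand-written approximation of one book card, based only on the structure described above; the real markup contains more elements. The selectors dictionary is an illustrative name.

```python
from bs4 import BeautifulSoup

# Simplified approximation of a single book card, written by hand.
sample = """
<article class="product_pod">
  <p class="star-rating Three"></p>
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="price_color">£51.77</p>
  <p class="instock availability">In stock</p>
</article>
"""

card = BeautifulSoup(sample, "html.parser").find("article", class_="product_pod")

# One CSS selector per field we plan to extract.
selectors = {
    "title": "h3 a[title]",
    "price": "p.price_color",
    "rating": "p.star-rating",
    "availability": "p.instock.availability",
}

# True means the selector found an element inside the card.
matches = {name: card.select_one(css) is not None for name, css in selectors.items()}
print(matches)
```

If any value comes back False, the selector needs another look in DevTools before you build the scraper around it.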
Step 4: extract data from the page
Get an element by HTML tag
find() returns the first matching element. find_all() returns a list of all matching elements. To get all book containers:
books = soup.find_all('article', class_='product_pod')
print(f'Found {len(books)} books')
Found 20 books
Get an element by CSS class
Prefix the class name with a dot in select() calls, or pass it as class_= in find() calls. To get the price of the first book:
first_book = books[0]
price = first_book.find('p', class_='price_color').text.strip()
print(price)
£51.77
Get an element by ID
When an element has a unique ID, select_one() with a # prefix is the cleanest way to find it. This page has an #default wrapper on the main content:
main = soup.select_one('#default')
print(main.name)
Get an element by attribute
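Square bracket notation in CSS selectors matches on attribute values, for example a[title] in a select() call. To read an attribute's value from a tag you have already found, index the tag like a dictionary. The book title is stored in the title attribute of the <a> tag inside each <h3>: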
title = first_book.find('h3').find('a')['title']
print(title)
A Light in the Attic
Get an element using XPath
For XPath queries you need lxml directly alongside Beautiful Soup. Right-click an element in DevTools, choose Copy, then Copy XPath to get the path. Here is how to use it:
from bs4 import BeautifulSoup
from lxml import etree
dom = etree.HTML(str(soup))
first_title = dom.xpath('//article[@class="product_pod"]/h3/a/@title')[0]
print(first_title)
A Light in the Attic
XPath is more powerful than CSS selectors for complex traversals, but CSS selectors are usually cleaner for straightforward extraction.
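For comparison, here is a sketch of the same title lookup written with CSS selectors through select() and select_one(), including the square bracket attribute syntax. The HTML is a hand-written two-book fragment that mirrors the structure described earlier, not the live page, and html.parser is used so the snippet runs without lxml.

```python
from bs4 import BeautifulSoup

# Hand-written fragment with two simplified book cards.
html = """
<section>
  <article class="product_pod">
    <p class="star-rating Three"></p>
    <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  </article>
  <article class="product_pod">
    <p class="star-rating One"></p>
    <h3><a href="#" title="Tipping the Velvet">Tipping the Velvet</a></h3>
  </article>
</section>
"""

soup = BeautifulSoup(html, "html.parser")

# a[title] keeps only <a> tags that carry a title attribute.
titles = [a["title"] for a in soup.select("article.product_pod h3 a[title]")]

# a[title="..."] narrows the match to one exact attribute value.
velvet = soup.select_one('a[title="Tipping the Velvet"]')

# .star-rating matches even though each element has a second class.
ratings = [p["class"][1] for p in soup.select("p.star-rating")]

print(titles)
print(velvet.get_text())
print(ratings)
```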
Step 5: scrape the full page
Now that you know how to find each field, combine everything into a single scraper:
import requests
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}
url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers)
if response.status_code != 200:
    print(f'Request failed: {response.status_code}')
else:
    soup = BeautifulSoup(response.text, 'lxml')
    books_data = []
    for article in soup.find_all('article', class_='product_pod'):
        title = article.find('h3').find('a')['title']
        price = article.find('p', class_='price_color').text.strip()
        rating = article.find('p', class_='star-rating')['class'][1]
        in_stock = 'In stock' in article.find('p', class_='instock').text
        books_data.append({
            'title': title,
            'price': price,
            'rating': rating,
            'in_stock': in_stock,
        })
    for book in books_data[:3]:
        print(book)
{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True}
{'title': 'Tipping the Velvet', 'price': '£53.74', 'rating': 'One', 'in_stock': True}
{'title': 'Soumission', 'price': '£50.10', 'rating': 'One', 'in_stock': True}
The rating field comes from the class name. star-rating Three has two classes, so ['class'][1] picks the second one, which is the word that represents the rating. You would map it to a number in post-processing: {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}.
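That post-processing step could look something like the sketch below, which also converts the price string to a number. The helper name clean_book is hypothetical. It strips everything except digits and the decimal point from the price, which is an assumption that suits prices formatted like £51.77 and would need adjusting for other formats.

```python
import re
from decimal import Decimal

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}


def clean_book(book):
    """Return a copy of a scraped record with a numeric price and rating."""
    # Drop the currency symbol and any other stray characters.
    price_digits = re.sub(r"[^\d.]", "", book["price"])
    return {
        **book,
        "price": Decimal(price_digits),
        # .get() yields None for an unexpected word instead of raising KeyError.
        "rating": RATING_WORDS.get(book["rating"]),
    }


raw = {"title": "A Light in the Attic", "price": "£51.77", "rating": "Three", "in_stock": True}
cleaned = clean_book(raw)
print(cleaned)
```

Decimal avoids the rounding surprises that floats can introduce when you later sum or compare prices.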
Step 6: export to CSV
Python's built-in csv module writes the data to a file you can open in Excel or Google Sheets:
import csv
csv_file = 'books.csv'
with open(csv_file, mode='w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating', 'in_stock'])
    writer.writeheader()
    writer.writerows(books_data)
print(f'Saved {len(books_data)} books to {csv_file}')
Step 7: export to JSON
For pipelines that consume JSON or need to store the data in a format that preserves the types:
import json
with open('books.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False writes the £ sign as-is instead of escaping it as \u00a3
    json.dump(books_data, f, indent=4, ensure_ascii=False)
print('Saved to books.json')
[
    {
        "title": "A Light in the Attic",
        "price": "£51.77",
        "rating": "Three",
        "in_stock": true
    },
    ...
]
Step 8: handle pagination
Books to Scrape spreads its catalogue across 50 pages. The page URLs follow a predictable pattern, incrementing from page-1.html through page-50.html, and every page except the last has a Next button at the bottom. This loop walks through every page and collects all the books:
import requests, time
from bs4 import BeautifulSoup
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}
base_url = 'https://books.toscrape.com/catalogue/page-{}.html'
all_books = []
page = 1
while True:
    response = requests.get(base_url.format(page), headers=headers)
    if response.status_code != 200:
        break
    soup = BeautifulSoup(response.text, 'lxml')
    articles = soup.find_all('article', class_='product_pod')
    if not articles:
        break
    for article in articles:
        title = article.find('h3').find('a')['title']
        price = article.find('p', class_='price_color').text.strip()
        rating = article.find('p', class_='star-rating')['class'][1]
        in_stock = 'In stock' in article.find('p', class_='instock').text
        all_books.append({'title': title, 'price': price, 'rating': rating, 'in_stock': in_stock})
    print(f'Page {page}: {len(all_books)} books collected so far')
    # Check if there is a next page
    next_btn = soup.find('li', class_='next')
    if not next_btn:
        break
    page += 1
    time.sleep(1)  # Be respectful of the server
print(f'\nTotal: {len(all_books)} books')
The time.sleep(1) between requests gives the server breathing room. Sending requests as fast as possible is what gets scrapers blocked or rate-limited.
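Incrementing a page number works when the URL pattern is known in advance. On sites where it is not, an alternative is to follow the Next link itself. The sketch below assumes the li with class next wraps an <a> tag whose href is relative to the current page; that detail is an assumption for illustration, and next_page_url is a hypothetical helper name.

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(current_url, html):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, "html.parser")
    link = soup.select_one("li.next a[href]")
    if link is None:
        return None
    # urljoin resolves a relative href against the page it was found on.
    return urljoin(current_url, link["href"])


# Hand-written pager fragment used only for illustration.
pager_html = '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>'
following = next_page_url("https://books.toscrape.com/catalogue/page-1.html", pager_html)
print(following)
```

In a crawl loop you would keep requesting the returned URL until the helper returns None.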
Common challenges
Dynamic content
Beautiful Soup reads the HTML that requests returns. If a page uses JavaScript to load its data after the initial HTML arrives, requests gets the empty skeleton and Beautiful Soup finds nothing to parse. You will see this on modern e-commerce sites, single-page applications, and any page where the content only appears after scrolling or clicking.
The fix is to use a headless browser like Playwright or Selenium to render the page first, then pass the rendered HTML to Beautiful Soup. This adds significant overhead because a full browser instance runs for every page, and sites that use JavaScript-rendered content are also typically the ones with more aggressive bot detection.
Getting blocked
Sites detect scrapers through a combination of signals: missing or suspicious User-Agent headers, requests arriving too fast, datacenter IP addresses, and missing browser fingerprints. A plain requests call with no headers is the most obvious signal you can send.
Common mitigations are adding realistic headers, introducing random delays between requests, and routing through residential proxies. Each adds complexity and maintenance burden. Free proxy lists in particular are unreliable. They have short lifespans and many are already flagged by the time you use them.
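As a rough sketch of the random-delay idea, the helper below computes a wait time that grows after each failed attempt and adds a small random offset so requests do not arrive on a fixed rhythm. The function names and the default values are illustrative assumptions, not recommendations for any particular site.

```python
import random
import time


def polite_delay(attempt=0, base=1.0, jitter=0.5, cap=30.0):
    """Return a wait time in seconds: capped exponential backoff plus jitter."""
    backoff = min(cap, base * (2 ** attempt))
    return backoff + random.uniform(0, jitter)


def sleep_between_requests(attempt=0):
    """Pause for a randomized interval before the next request."""
    time.sleep(polite_delay(attempt))


# attempt=0 is the normal case; raise it after each blocked or failed request.
for attempt in range(4):
    print(f"attempt {attempt}: wait about {polite_delay(attempt):.2f}s")
```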
Selectors breaking
Site owners update their HTML without any concern for external scrapers. Class names change, element nesting shifts, and IDs get renamed. Any selector you write is a dependency on the current state of a page that you do not control. This is the hidden cost of selector-based scraping: it works until it does not, and it breaks silently.
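One way to make that kind of breakage loud rather than silent is to validate the scraped records before saving them. The sketch below is a minimal check under assumed rules (at least one record, and a non-empty title and price in each). The function name check_records and its parameters are hypothetical.

```python
def check_records(records, required=("title", "price"), min_count=1):
    """Raise ValueError if a scrape looks broken; otherwise return the records."""
    if len(records) < min_count:
        raise ValueError(f"expected at least {min_count} records, got {len(records)}")
    for index, record in enumerate(records):
        # A missing key and an empty or None value both count as missing.
        missing = [field for field in required if not record.get(field)]
        if missing:
            raise ValueError(f"record {index} is missing {missing}")
    return records


good = [{"title": "A Light in the Attic", "price": "£51.77"}]
print(len(check_records(good)))

try:
    check_records([])  # what a renamed class typically produces: zero matches
except ValueError as exc:
    print(f"scrape looks broken: {exc}")
```

Running a check like this after every scrape turns a renamed class into an immediate error instead of an empty file you notice days later.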
Error handling
Web scraping produces a lot of edge cases. An element you expect to be present is missing on one page. A find() call returns None, and the next attribute access raises an AttributeError. Always write defensive code when accessing nested elements:
price_elem = article.find('p', class_='price_color')
price = price_elem.text.strip() if price_elem else None
A try-except inside the extraction loop keeps one malformed article from stopping the rest:
for article in articles:
    try:
        title = article.find('h3').find('a')['title']
        price = article.find('p', class_='price_color').text.strip()
        books_data.append({'title': title, 'price': price})
    # AttributeError/TypeError: find() returned None; KeyError: attribute missing
    except (AttributeError, KeyError, TypeError) as e:
        print(f'Failed to parse article: {e}')
        continue
The easier approach for more complex targets
Books to Scrape is a scraping sandbox built to be easy. Real targets like Amazon, eBay, news sites, and job boards have bot detection, JavaScript-rendered content, rotating layouts, and CAPTCHA challenges. The selector maintenance, proxy management, and rendering overhead add up quickly.
The Spidra API takes a different approach. You describe what you want from a page in plain English and define the output shape in a JSON Schema. Spidra loads the page in a real browser, routes through residential proxies, handles CAPTCHA automatically, and returns structured JSON matching your schema. When the site changes its layout, the prompt keeps working because it describes the data, not where it sits in the DOM.
import requests, time, os, json
API_KEY = os.environ['SPIDRA_API_KEY']
BASE = 'https://api.spidra.io/api'
HEADERS = {'x-api-key': API_KEY, 'Content-Type': 'application/json'}
SCHEMA = {
    'type': 'object',
    'required': ['books'],
    'properties': {
        'books': {
            'type': 'array',
            'items': {
                'type': 'object',
                'properties': {
                    'title': {'type': 'string'},
                    'price': {'type': 'string'},
                    'rating': {'type': 'string'},
                    'in_stock': {'type': 'boolean'},
                }
            }
        }
    }
}
resp = requests.post(f'{BASE}/scrape', headers=HEADERS, json={
    'urls': [{'url': 'https://books.toscrape.com/catalogue/page-1.html'}],
    'prompt': 'Extract all books on this page with their title, price, star rating as a word, and whether they are in stock',
    'output': 'json',
    'schema': SCHEMA,
})
job_id = resp.json()['jobId']
while True:
    result = requests.get(f'{BASE}/scrape/{job_id}', headers=HEADERS).json()
    if result['status'] == 'completed':
        break
    time.sleep(3)
books = result['result']['content']['books']
print(f'Got {len(books)} books')
print(json.dumps(books[:2], indent=2))
Sign up free at app.spidra.io. The free plan gives you 300 credits with no card required. If you want to generate a schema from real output rather than writing it by hand, the free JSON Schema Generator builds the structure for you from any JSON sample.
