June 27, 2026 · 10 min read

How to scrape web data with Beautiful Soup: step-by-step guide in 2026

Joel Olawanle

There is a lot of data on the internet that is not available through an API. Product prices, news articles, job listings, research data. Most of it lives in HTML pages rather than structured endpoints. Web scraping is how you get to it programmatically.

In this guide you will learn how to scrape web pages using Python's requests library and Beautiful Soup. We use a real scraping practice site throughout so every code example produces actual output.

What is Beautiful Soup and why use it

Beautiful Soup is a Python library that parses HTML and XML documents and gives you a clean interface to navigate and search the document tree. It does not fetch web pages. That is the job of requests. Once you have the HTML, Beautiful Soup makes it straightforward to pull out the data you need.

It is a good first tool for web scraping because the learning curve is gentle, it handles messy HTML gracefully, and it works well with other libraries. Pair it with requests for static pages, with Selenium or Playwright when you need JavaScript rendering, and with Scrapy when you need to crawl an entire site.

The limitation to understand early: Beautiful Soup is a parser, not a browser. It reads whatever HTML your requests call returns. If a page loads its content via JavaScript after the initial HTML is delivered, Beautiful Soup sees only that initial HTML, and searches for the missing content come back empty. This is a common source of confusion and we cover it in the challenges section below.
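To make the limitation concrete, here is a minimal sketch using a hand-written HTML string (hypothetical markup, not taken from a real site) in which a script would fill the product list in a browser. It uses Python's built-in html.parser so it runs without lxml:

```python
from bs4 import BeautifulSoup

# Hypothetical page: the <ul> stays empty until the script runs in a browser.
html = """
<html>
  <body>
    <ul id="products"></ul>
    <script>
      fetch('/api/products').then(r => r.json()).then(render);
    </script>
  </body>
</html>
"""

soup = BeautifulSoup(html, 'html.parser')
items = soup.select('#products li')

# Beautiful Soup never executes the script, so no <li> elements exist.
print(len(items))                       # 0
print(soup.find('script') is not None)  # True
```

The script tag is present in the tree, but only as text. Nothing it would have rendered is available to search.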

Prerequisites

You need Python 3.8 or higher. Check your version:

python --version

Create and activate a virtual environment before installing anything:

# macOS / Linux
python3 -m venv venv
source venv/bin/activate

# Windows
python -m venv venv
venv\Scripts\activate

Install the required libraries:

pip install requests beautifulsoup4 lxml

requests handles HTTP. beautifulsoup4 is the parser. lxml is a fast C-based parser that Beautiful Soup can use under the hood and also provides XPath support when you need it.

The target page

We scrape books.toscrape.com throughout this guide. It is a sandbox site built specifically for scraping practice, publicly accessible and stable, designed to let you experiment without worrying about rate limits or legal concerns. Each page lists 20 books with a title, price, star rating, and availability.

https://books.toscrape.com/catalogue/page-1.html

Step 1: download the page with requests

Start by fetching the HTML. The get() method sends an HTTP GET request and returns a response object. The text attribute on that object contains the raw HTML:

import requests

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url)

print(response.status_code)
print(response.text[:500])

A 200 status code means the request succeeded. Anything in the 4xx or 5xx range is an error. The slice response.text[:500] prints only the first 500 characters, which is enough to confirm you got real HTML back rather than a block page.
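Rather than comparing status codes by hand, you can let requests raise an exception for error responses. The sketch below is one way to wrap that up. The fetch_html helper name, the optional session parameter, and the 10-second timeout are illustrative assumptions, not part of the original walkthrough:

```python
import requests


def fetch_html(url, session=None, timeout=10):
    """Return the page HTML, or raise if the request fails."""
    http = session or requests.Session()
    # Without a timeout, a stalled server can hang the script indefinitely.
    response = http.get(url, timeout=timeout)
    # Raises requests.HTTPError for any 4xx or 5xx status code.
    response.raise_for_status()
    return response.text


# Example call (needs network access):
# html = fetch_html('https://books.toscrape.com/catalogue/page-1.html')
```

Accepting a session argument keeps the helper easy to reuse with custom headers and easy to exercise with a stand-in object.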

Adding a realistic User-Agent header reduces the chance of getting blocked on sites that check for it:

import requests

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/126.0.0.0 Safari/537.36',
    'Accept-Language': 'en-US,en;q=0.9',
}

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers)

Step 2: parse the HTML with Beautiful Soup

Once you have the HTML, pass it to the BeautifulSoup constructor along with the parser you want to use. We use lxml because it is fast and handles malformed HTML well:

from bs4 import BeautifulSoup

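# Pass the raw bytes so Beautiful Soup can detect the encoding from the page itself;
# response.text can mis-decode characters such as '£' when the server omits a charset.
soup = BeautifulSoup(response.content, 'lxml')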

soup is now a navigable tree of Python objects representing the entire HTML document. You can search it, navigate it, and extract text and attributes from it.

Step 3: inspect the page to find your selectors

Before writing extraction code, you need to know what selectors to use. Open the target page in Chrome, right-click the element you want to extract, and select Inspect. The browser's DevTools panel opens with the HTML for that element highlighted.

On Books to Scrape, each book is an <article> element with the class product_pod. Inside it you will find:

  • <h3><a title="Book Title"> : the book title lives in the title attribute
  • <p class="price_color"> : the price including the currency symbol
  • <p class="star-rating {One|Two|Three|Four|Five}"> : the rating encoded in the class name
  • <p class="instock availability"> : stock status

Knowing this structure is what you need to write the extraction code.

Step 4: extract data from the page

Get an element by HTML tag

find() returns the first matching element. find_all() returns a list of all matching elements. To get all book containers:

books = soup.find_all('article', class_='product_pod')
print(f'Found {len(books)} books')
Found 20 books

Get an element by CSS class

Prefix the class name with a dot in select() calls, or pass it as class_= in find() calls. To get the price of the first book:

first_book = books[0]
price = first_book.find('p', class_='price_color').text.strip()
print(price)
£51.77
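The select() form of the same lookup can be sketched as follows. The HTML string is a hand-written stand-in for one book card (the real page has more markup), and html.parser is used so the snippet runs on its own:

```python
from bs4 import BeautifulSoup

# Hand-written approximation of a single book card.
html = """
<article class="product_pod">
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="star-rating Three"></p>
  <p class="price_color">£51.77</p>
  <p class="instock availability">In stock</p>
</article>
"""
book = BeautifulSoup(html, 'html.parser')

by_find = book.find('p', class_='price_color')
by_select = book.select_one('p.price_color')

# Both calls locate the same element.
print(by_find == by_select)
print(by_select.get_text(strip=True))

# Chaining dots requires every listed class to be present on the element.
both_classes = book.select_one('p.instock.availability')
print(both_classes is not None)
```

Which form you use is mostly a matter of taste. select() tends to read better once the selector involves nesting or several classes.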

Get an element by ID

When an element has a unique ID, select_one() with a # prefix is the cleanest way to find it. On this page the ID default is set on the <body> element, so the example prints body:

main = soup.select_one('#default')
print(main.name)

Get an element by attribute

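A tag behaves like a dictionary for its attributes, so square brackets on a tag read an attribute's value. In select() calls, square brackets inside the CSS selector match on attributes instead. The book title is stored in the title attribute of the <a> tag inside each <h3>: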

title = first_book.find('h3').find('a')['title']
print(title)
A Light in the Attic
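For the CSS-selector side of attribute matching, here is a short sketch on hand-written markup (the second link is a hypothetical addition so there is something to filter out):

```python
from bs4 import BeautifulSoup

html = """
<article class="product_pod">
  <h3><a href="#" title="A Light in the Attic">A Light in the ...</a></h3>
  <a href="#">Add to basket</a>
</article>
"""
book = BeautifulSoup(html, 'html.parser')

# a[title] matches only <a> tags that carry a title attribute.
link = book.select_one('a[title]')
print(link['title'])

# a[title^="A Light"] matches on the start of the attribute value.
prefix_matches = book.select('a[title^="A Light"]')
print(len(prefix_matches))

# .get() returns None for a missing attribute instead of raising KeyError.
basket = book.select('a')[1]
print(basket.get('title'))
```

Using .get() is the safer choice whenever an attribute might be absent on some elements.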

Get an element using XPath

For XPath queries you need lxml directly alongside Beautiful Soup. Right-click an element in DevTools, choose Copy, then Copy XPath to get the path. Here is how to use it:

from bs4 import BeautifulSoup
from lxml import etree

dom = etree.HTML(str(soup))
first_title = dom.xpath('//article[@class="product_pod"]/h3/a/@title')[0]
print(first_title)
A Light in the Attic

XPath is more powerful than CSS selectors for complex traversals, but CSS selectors are usually cleaner for straightforward extraction.
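As a sketch of a more involved traversal, the query below selects titles of books based on a property of a sibling element. The two-book HTML string is hand-written for illustration. Note that @class="..." in XPath compares the whole attribute string, so matching one class among several needs the contains() idiom shown here:

```python
from lxml import etree

html = """
<html><body>
<article class="product_pod">
  <h3><a title="A Light in the Attic">A Light in the ...</a></h3>
  <p class="star-rating Three"></p>
  <p class="price_color">£51.77</p>
</article>
<article class="product_pod">
  <h3><a title="Tipping the Velvet">Tipping the Velvet</a></h3>
  <p class="star-rating One"></p>
  <p class="price_color">£53.74</p>
</article>
</body></html>
"""
dom = etree.HTML(html)

# Exact comparison fails: the attribute is "star-rating Three", not "star-rating".
exact = dom.xpath('//p[@class="star-rating"]')
print(len(exact))

# Titles of articles that contain a <p> whose class list includes "One".
one_star = dom.xpath(
    '//article[p[contains(concat(" ", normalize-space(@class), " "), " One ")]]'
    '/h3/a/@title'
)
print(one_star)
```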

Step 5: scrape the full page

Now that you know how to find each field, combine everything into a single scraper:

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}

url = 'https://books.toscrape.com/catalogue/page-1.html'
response = requests.get(url, headers=headers)

if response.status_code != 200:
    print(f'Request failed: {response.status_code}')
else:
    # Raw bytes let Beautiful Soup detect the encoding, so '£' decodes correctly
    soup = BeautifulSoup(response.content, 'lxml')
    books_data = []

    for article in soup.find_all('article', class_='product_pod'):
        title    = article.find('h3').find('a')['title']
        price    = article.find('p', class_='price_color').text.strip()
        rating   = article.find('p', class_='star-rating')['class'][1]
        in_stock = 'In stock' in article.find('p', class_='instock').text

        books_data.append({
            'title':    title,
            'price':    price,
            'rating':   rating,
            'in_stock': in_stock,
        })

    for book in books_data[:3]:
        print(book)
{'title': 'A Light in the Attic', 'price': '£51.77', 'rating': 'Three', 'in_stock': True}
{'title': 'Tipping the Velvet', 'price': '£53.74', 'rating': 'One', 'in_stock': True}
{'title': 'Soumission', 'price': '£50.10', 'rating': 'One', 'in_stock': True}

The rating field comes from the class name. star-rating Three has two classes, so ['class'][1] picks the second one, which is the word that represents the rating. You would map it to a number in post-processing: {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}.
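That post-processing step might look like the sketch below. The normalize_book name is hypothetical, and converting the price to a float is an extra assumption added for illustration (use decimal.Decimal if exact currency arithmetic matters):

```python
RATING_WORDS = {'One': 1, 'Two': 2, 'Three': 3, 'Four': 4, 'Five': 5}


def normalize_book(book):
    """Return a copy of the record with a numeric rating and price."""
    cleaned = dict(book)
    # Unknown rating words map to None rather than raising KeyError.
    cleaned['rating'] = RATING_WORDS.get(book['rating'])
    # Keep digits and the decimal point; this drops the currency symbol.
    digits = ''.join(ch for ch in book['price'] if ch.isdigit() or ch == '.')
    cleaned['price'] = float(digits)
    return cleaned


sample = {'title': 'A Light in the Attic', 'price': '£51.77',
          'rating': 'Three', 'in_stock': True}
print(normalize_book(sample))
```

Working on a copy keeps the scraped record intact, which helps when you need to debug a bad conversion later.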

Step 6: export to CSV

Python's built-in csv module writes the data to a file you can open in Excel or Google Sheets:

import csv

csv_file = 'books.csv'

with open(csv_file, mode='w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['title', 'price', 'rating', 'in_stock'])
    writer.writeheader()
    writer.writerows(books_data)

print(f'Saved {len(books_data)} books to {csv_file}')

Step 7: export to JSON

JSON suits pipelines that consume it directly, and unlike CSV it preserves types such as the in_stock boolean:

import json

with open('books.json', 'w', encoding='utf-8') as f:
    # ensure_ascii=False writes '£' as-is instead of the escape \u00a3
    json.dump(books_data, f, indent=4, ensure_ascii=False)

print('Saved to books.json')
[
    {
        "title": "A Light in the Attic",
        "price": "£51.77",
        "rating": "Three",
        "in_stock": true
    },
    ...
]
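The difference in type handling between the two export formats can be checked with an in-memory round trip. This sketch uses a single sample record and io.StringIO instead of real files:

```python
import csv
import io
import json

books_data = [{'title': 'A Light in the Attic', 'price': '£51.77',
               'rating': 'Three', 'in_stock': True}]

# Round trip through JSON: the boolean comes back as a boolean.
from_json = json.loads(json.dumps(books_data))
print(type(from_json[0]['in_stock']).__name__)

# Round trip through CSV: every value comes back as a string.
buffer = io.StringIO(newline='')
writer = csv.DictWriter(buffer, fieldnames=list(books_data[0]))
writer.writeheader()
writer.writerows(books_data)
buffer.seek(0)
from_csv = list(csv.DictReader(buffer))
print(type(from_csv[0]['in_stock']).__name__, from_csv[0]['in_stock'])
```

If downstream code reads the CSV, it has to convert the string 'True' back to a boolean itself.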

Step 8: handle pagination

Books to Scrape spreads its catalogue across 50 pages. The page URLs follow a predictable pattern, incrementing from page-1.html through page-50.html, and every page except the last has a Next link at the bottom. This loop walks through every page and collects all the books, stopping when a request fails, a page contains no books, or the Next link is missing:

import time

import requests
from bs4 import BeautifulSoup

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36',
}

base_url = 'https://books.toscrape.com/catalogue/page-{}.html'
all_books = []
page = 1

while True:
    response = requests.get(base_url.format(page), headers=headers)

    if response.status_code != 200:
        break

    # Raw bytes let Beautiful Soup detect the encoding, so '£' decodes correctly
    soup = BeautifulSoup(response.content, 'lxml')
    articles = soup.find_all('article', class_='product_pod')

    if not articles:
        break

    for article in articles:
        title    = article.find('h3').find('a')['title']
        price    = article.find('p', class_='price_color').text.strip()
        rating   = article.find('p', class_='star-rating')['class'][1]
        in_stock = 'In stock' in article.find('p', class_='instock').text
        all_books.append({'title': title, 'price': price, 'rating': rating, 'in_stock': in_stock})

    print(f'Page {page}: {len(all_books)} books collected so far')

    # Check if there is a next page
    next_btn = soup.find('li', class_='next')
    if not next_btn:
        break

    page += 1
    time.sleep(1)  # Be respectful of the server

print(f'\nTotal: {len(all_books)} books')

The time.sleep(1) between requests gives the server breathing room. Sending requests as fast as possible is what gets scrapers blocked or rate-limited.
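On sites where page URLs are not predictable, an alternative is to follow the Next link's href instead of incrementing a counter. A minimal sketch, assuming the link sits inside an li element with the class next as on this site. The next_page_url name and the pager markup are hand-written for illustration:

```python
from urllib.parse import urljoin

from bs4 import BeautifulSoup


def next_page_url(html, current_url):
    """Return the absolute URL of the next page, or None on the last page."""
    soup = BeautifulSoup(html, 'html.parser')
    link = soup.select_one('li.next a[href]')
    if link is None:
        return None
    # The href is relative, so resolve it against the page it came from.
    return urljoin(current_url, link['href'])


# Hand-written pager markup standing in for the real page footer.
with_next = '<ul class="pager"><li class="next"><a href="page-2.html">next</a></li></ul>'
last_page = '<ul class="pager"><li class="previous"><a href="page-49.html">previous</a></li></ul>'

print(next_page_url(with_next, 'https://books.toscrape.com/catalogue/page-1.html'))
print(next_page_url(last_page, 'https://books.toscrape.com/catalogue/page-50.html'))
```

A crawl loop would keep calling this helper and fetching the returned URL until it gets None.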

Common challenges

Dynamic content

Beautiful Soup reads the HTML that requests returns. If a page uses JavaScript to load its data after the initial HTML arrives, requests gets the empty skeleton and Beautiful Soup finds nothing to parse. You will see this on modern e-commerce sites, single-page applications, and any page where the content only appears after scrolling or clicking.

The fix is to use a headless browser like Playwright or Selenium to render the page first, then pass the rendered HTML to Beautiful Soup. This adds significant overhead because a full browser instance runs for every page, and sites that use JavaScript-rendered content are also typically the ones with more aggressive bot detection.

Getting blocked

Sites detect scrapers through a combination of signals: missing or suspicious User-Agent headers, requests arriving too fast, datacenter IP addresses, and missing browser fingerprints. A plain requests call with no headers is the most obvious signal you can send.

Common mitigations are adding realistic headers, introducing random delays between requests, and routing through residential proxies. Each adds complexity and maintenance burden. Free proxy lists in particular are unreliable. They have short lifespans and many are already flagged by the time you use them.
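The first two mitigations can be sketched in a few lines. The helper names, the header values, and the delay ranges below are illustrative assumptions rather than recommended settings, and proxies are left out entirely:

```python
import random
import time

import requests


def polite_sleep(base=1.0, jitter=0.5):
    """Sleep for base plus a random extra of up to jitter seconds."""
    delay = base + random.uniform(0, jitter)
    time.sleep(delay)
    return delay


def build_session(user_agent):
    """Create a Session that sends the same headers on every request."""
    session = requests.Session()
    session.headers.update({
        'User-Agent': user_agent,
        'Accept-Language': 'en-US,en;q=0.9',
    })
    return session


session = build_session('Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36')
# Short values keep this demo quick; a real crawl would wait longer.
delay = polite_sleep(base=0.1, jitter=0.1)
print(f'Slept for {delay:.2f}s')
```

In a crawl loop you would call polite_sleep() between session.get() calls so the gaps between requests are not perfectly regular.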

Selectors breaking

Site owners update their HTML without any concern for external scrapers. Class names change, element nesting shifts, and IDs get renamed. Any selector you write is a dependency on the current state of a page that you do not control. This is the hidden cost of selector-based scraping: it works until it does not, and it breaks silently.
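One way to turn silent breakage into a visible failure is to check that a selector returns at least as many results as you expect. The sketch below assumes that approach. SelectorError, extract_titles, and the renamed class in the second sample are all hypothetical:

```python
from bs4 import BeautifulSoup


class SelectorError(RuntimeError):
    """Raised when the page no longer matches the selectors we expect."""


def extract_titles(html, min_expected=1):
    soup = BeautifulSoup(html, 'html.parser')
    links = soup.select('article.product_pod h3 a[title]')
    if len(links) < min_expected:
        # Fail loudly rather than return an empty list that looks valid.
        raise SelectorError(
            f'expected at least {min_expected} titles, found {len(links)}'
        )
    return [link['title'] for link in links]


good = '<article class="product_pod"><h3><a title="Soumission">Soumission</a></h3></article>'
renamed = '<article class="product-card"><h3><a title="Soumission">Soumission</a></h3></article>'

print(extract_titles(good))
try:
    extract_titles(renamed)
except SelectorError as exc:
    print(f'Layout changed? {exc}')
```

For Books to Scrape, a min_expected close to 20 per page would flag a layout change on the first run after it happens.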

Error handling

Web scraping produces a lot of edge cases. An element you expect to be present is missing on one page. A find() returns None and the next attribute access throws an AttributeError. Always write defensive code when accessing nested elements:

price_elem = article.find('p', class_='price_color')
price = price_elem.text.strip() if price_elem else None

A try-except inside the extraction loop keeps one malformed article from stopping the rest:

for article in articles:
    try:
        title = article.find('h3').find('a')['title']
        price = article.find('p', class_='price_color').text.strip()
        books_data.append({'title': title, 'price': price})
    except (AttributeError, KeyError, TypeError) as e:
        # AttributeError or TypeError: a find() returned None; KeyError: the attribute is missing
        print(f'Failed to parse article: {e}')

The easier approach for more complex targets

Books to Scrape is a scraping sandbox built to be easy. Real targets like Amazon, eBay, news sites, and job boards have bot detection, JavaScript-rendered content, rotating layouts, and CAPTCHA challenges. The selector maintenance, proxy management, and rendering overhead add up quickly.

The Spidra API takes a different approach. You describe what you want from a page in plain English and define the output shape in a JSON Schema. Spidra loads the page in a real browser, routes through residential proxies, handles CAPTCHA automatically, and returns structured JSON matching your schema. When the site changes its layout, the prompt keeps working because it describes the data, not where it sits in the DOM.

import json
import os
import time

import requests

API_KEY = os.environ['SPIDRA_API_KEY']
BASE    = 'https://api.spidra.io/api'
HEADERS = {'x-api-key': API_KEY, 'Content-Type': 'application/json'}

SCHEMA = {
    'type': 'object',
    'required': ['books'],
    'properties': {
        'books': {
            'type': 'array',
            'items': {
                'type': 'object',
                'properties': {
                    'title':    {'type': 'string'},
                    'price':    {'type': 'string'},
                    'rating':   {'type': 'string'},
                    'in_stock': {'type': 'boolean'},
                }
            }
        }
    }
}

resp = requests.post(f'{BASE}/scrape', headers=HEADERS, json={
    'urls':   [{'url': 'https://books.toscrape.com/catalogue/page-1.html'}],
    'prompt': 'Extract all books on this page with their title, price, star rating as a word, and whether they are in stock',
    'output': 'json',
    'schema': SCHEMA,
})
job_id = resp.json()['jobId']

while True:
    result = requests.get(f'{BASE}/scrape/{job_id}', headers=HEADERS).json()
    if result['status'] == 'completed':
        break
    time.sleep(3)

books = result['result']['content']['books']
print(f'Got {len(books)} books')
print(json.dumps(books[:2], indent=2))

Sign up free at app.spidra.io. The free plan gives you 300 credits with no card required. If you want to generate a schema from real output rather than writing it by hand, the free JSON Schema Generator builds the structure for you from any JSON sample.

Frequently asked questions

Is web scraping legal?

Scraping publicly accessible web pages is generally considered legal in the United States. In hiQ Labs v. LinkedIn (Ninth Circuit, 2022), the court held that scraping publicly available data likely does not violate the Computer Fraud and Abuse Act. Site terms of service may still prohibit automated access contractually. Always check a site's robots.txt and terms before scraping at scale.
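Checking robots.txt can be automated with Python's standard urllib.robotparser module. In this sketch the rules are made up, and example.com and the my-scraper user agent are placeholders. Against a real site you would call set_url() and read() to load its actual file:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content for illustration only.
robots_txt = """\
User-agent: *
Disallow: /private/
Crawl-delay: 2
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch('my-scraper', 'https://example.com/catalogue/page-1.html')
blocked = parser.can_fetch('my-scraper', 'https://example.com/private/report.html')
print(allowed, blocked)

# Crawl-delay, when present, suggests how long to wait between requests.
print(parser.crawl_delay('my-scraper'))
```

A scraper can call can_fetch() before each request and use the crawl delay, when one is set, as the minimum pause between requests.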
