July 25, 2025

Web Scraping with Python and BeautifulSoup

You've probably wondered how data scientists and developers collect information from websites at scale. Manually copying and pasting data from hundreds of pages is obviously not the answer, and yet so many valuable datasets live inside ordinary web pages rather than neatly packaged APIs. The solution that bridges that gap is web scraping, and Python is the best tool in the business for getting it done.

In this guide, we're going to go deeper than a surface-level introduction. We'll cover how web scraping actually works under the hood, how HTML structure determines your scraping strategy, and how to build scrapers that are reliable, ethical, and maintainable. By the end, you'll understand not just how to use BeautifulSoup, but why certain approaches work better than others and when you should reach for a different tool entirely.

Web scraping sits at the intersection of networking, HTML parsing, and data engineering, all skills that will serve you well throughout your Python journey, whether you're heading toward data science, backend development, or AI/ML. The techniques here scale from quick one-off scripts all the way to production pipelines that run continuously. We'll build up that understanding systematically, starting from the ground up, so whether you're new to the concept or just looking to sharpen your approach, you'll find real value in every section.

One thing to know before we dive in: web scraping is a skill that rewards curiosity. Every website is a puzzle. You inspect its structure, figure out how its data is organized, and write code that extracts exactly what you need. It's surprisingly satisfying when it clicks. So let's dig in.

Table of Contents
  1. What Is Web Scraping, Really?
  2. Why You Might Scrape
  3. Why You Might NOT Scrape
  4. Setting Up Your Environment
  5. HTML Structure for Scrapers
  6. Your First Scraping Script
  7. CSS Selectors and find_all()
  8. Using find() and find_all()
  9. Using CSS Selectors
  10. Extracting Text, Attributes, and Nested Elements
  11. Handling Pagination
  12. Ethical Scraping Practices
  13. Handling Dynamic Content
  14. Common Scraping Pitfalls
  15. Storing Data in SQLite
  16. Setting Up SQLAlchemy
  17. Inserting Scraped Data
  18. Querying the Data
  19. Complete Scraping + Storage Example
  20. Error Handling for Network and Layout Changes
  21. Network Errors
  22. Handling Missing Elements
  23. Detecting Layout Changes
  24. Putting It All Together: A Real Scraper
  25. Key Takeaways
  26. Wrapping Up

What Is Web Scraping, Really?

Web scraping is the practice of automatically fetching and extracting data from websites. Instead of manually copying and pasting, you write code that:

  1. Fetches the HTML from a URL
  2. Parses the structure
  3. Extracts relevant data
  4. Saves or processes it

Think of it like a web browser, but automated and focused on data extraction.

Why You Might Scrape

  • No API available: Sometimes a website doesn't offer an API, but the data is publicly visible
  • Real-time data: You need to monitor prices, job listings, or news across multiple sites
  • Research: Collecting datasets for analysis or machine learning
  • Integration: Pulling data into your application from external sources

Why You Might NOT Scrape

  • An API exists: Use it. APIs are faster, more reliable, and respect the server
  • Terms of Service forbid it: Many sites explicitly prohibit scraping in their ToS
  • The site uses JavaScript rendering: You'll need Selenium or Playwright, which is heavier
  • You're scraping personal data: Always consider privacy and GDPR/CCPA implications

Setting Up Your Environment

Before we dive in, let's install the tools you'll need. These four packages cover everything from HTTP requests to database storage, and they install cleanly via pip with no external dependencies required.

bash
pip install requests beautifulsoup4 lxml sqlalchemy

Here's what each package does:

  • requests: Makes HTTP requests to fetch web pages
  • beautifulsoup4: Parses HTML and extracts data
  • lxml: A fast HTML/XML parser (BeautifulSoup's engine)
  • sqlalchemy: ORM for storing data in databases

Once installed, it's worth doing a quick sanity check before writing your actual scraper; catching import errors now saves you from mysterious failures later. Run this short verification script to confirm everything is wired up correctly.

python
import requests
from bs4 import BeautifulSoup
from sqlalchemy import create_engine
import sqlite3
 
print("All imports successful!")

If that runs without errors, you're ready to go.

HTML Structure for Scrapers

Before you write a single line of scraping code, you need to understand what you're scraping. Web scraping is fundamentally the act of navigating an HTML tree, so the better you understand HTML structure, the faster and more accurately you can extract the data you want.

HTML is organized as a hierarchy of nested elements, often called the DOM (Document Object Model). Every element sits inside a parent element, may contain children, and lives alongside sibling elements at the same level. When you scrape, you're essentially writing instructions for traversing that tree and pulling out the nodes that contain your target data.

The most important HTML concepts for scrapers are tags (like <div>, <p>, <a>, <table>), classes (the class attribute, which can be shared by many elements), and IDs (the id attribute, which should be unique per page). Classes are how most modern websites style groups of elements, which makes them your primary targeting mechanism. IDs are powerful when they're present because they uniquely identify a single element. Attributes like href on links or src on images carry the actual values you often want to extract.

The practical skill you need to develop is using your browser's developer tools (right-click → Inspect) to examine the DOM. Find the element that contains your target data, look at what class names or IDs it has, and trace upward to understand what container it lives in. That mental model translates directly into BeautifulSoup selectors. The more precisely you can identify a target element in the DOM, the less fragile your scraper will be when the page gets minor style updates.

Here is the kind of structure you'll encounter constantly when scraping real sites:

html
<html>
  <head>
    <title>Example Page</title>
  </head>
  <body>
    <div class="container">
      <h1>Welcome</h1>
      <p class="description">This is a paragraph.</p>
      <a href="/page">Link</a>
    </div>
  </body>
</html>

When scraping, you'll target elements by:

  • Tag name: <p>, <a>, <div>
  • Class: class="description"
  • ID: id="main-content"
  • Attributes: href="/page"

BeautifulSoup lets you navigate this tree with CSS selectors or methods like find() and find_all().
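To make that concrete, here's a minimal, self-contained sketch that parses the snippet above and pulls out a target by tag name, by class, and by attribute:

```python
from bs4 import BeautifulSoup

html = """
<html>
  <head><title>Example Page</title></head>
  <body>
    <div class="container">
      <h1>Welcome</h1>
      <p class="description">This is a paragraph.</p>
      <a href="/page">Link</a>
    </div>
  </body>
</html>
"""

soup = BeautifulSoup(html, "html.parser")

# Target by tag name
print(soup.find("h1").get_text())                       # Welcome

# Target by class
print(soup.find("p", class_="description").get_text())  # This is a paragraph.

# Target by attribute
print(soup.find("a")["href"])                           # /page
```

The same three lookups written as CSS selectors would be soup.select_one("h1"), soup.select_one("p.description"), and soup.select_one("a")["href"]; both styles are covered in detail below.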

Your First Scraping Script

Let's start simple. We'll fetch a page and extract headlines. This is the fundamental pattern you'll build on for every scraper you ever write: understand it deeply and the rest follows naturally.

python
import requests
from bs4 import BeautifulSoup
 
# Fetch the page
url = "https://example.com"
response = requests.get(url)
 
# Check if the request was successful
if response.status_code == 200:
    # Parse the HTML
    soup = BeautifulSoup(response.content, "html.parser")
 
    # Extract all h1 tags
    headlines = soup.find_all("h1")
 
    for headline in headlines:
        print(headline.get_text())
else:
    print(f"Error: {response.status_code}")

Expected output:

Main Headline
Secondary Headline

Let's break down what happened:

  1. requests.get(url) fetches the HTML
  2. response.status_code tells us if the request succeeded (200 = OK)
  3. BeautifulSoup(response.content, "html.parser") parses the HTML into a navigable tree
  4. soup.find_all("h1") finds all h1 elements
  5. .get_text() extracts the text content

Notice we're checking response.status_code: always do this. A 404, 403, or 500 means something went wrong, and you shouldn't try to parse the response. Beyond that simple guard, this pattern shows you the fundamental three-step cycle of all scraping: fetch, parse, extract. Every scraper you build, no matter how complex, is a variation on exactly that sequence.

CSS Selectors and find_all()

BeautifulSoup gives you two main ways to extract data: find/find_all and CSS selectors. Let's explore both. Knowing when to use each approach will make your scraping code cleaner and easier to maintain, especially when you need to target elements nested several levels deep in the DOM.

Using find() and find_all()

The find() and find_all() methods are BeautifulSoup's native API. They're explicit, readable, and work well when you're targeting elements primarily by tag name or a single attribute. Think of find() as "give me the first match" and find_all() as "give me every match": straightforward and predictable.

python
soup = BeautifulSoup(html, "html.parser")
 
# Find the first matching element
first_link = soup.find("a")
 
# Find all matching elements
all_links = soup.find_all("a")
 
# Find with attributes
div_with_class = soup.find("div", class_="container")
 
# Find by ID
main_content = soup.find("div", id="main-content")
 
# Combine tag and multiple attributes
element = soup.find("a", class_="button", href="/submit")

Using CSS Selectors

CSS selectors are more flexible and often cleaner when you need to express complex relationships between elements. If you've written any frontend CSS, these will feel familiar. They shine particularly when you need to target elements based on their position in the DOM hierarchy or combine multiple conditions elegantly.

python
# By class
elements = soup.select(".container")
 
# By ID
element = soup.select_one("#main-content")
 
# By tag and class
links = soup.select("a.button")
 
# Nested selectors
paragraphs = soup.select(".container > p")
 
# Attribute selectors
external_links = soup.select('a[href^="http"]')

Now let's apply both techniques to a realistic scraping scenario. A product listing page is one of the most common scraping targets in the wild, and the pattern here applies to job boards, real estate listings, news archives, and more.

python
import requests
from bs4 import BeautifulSoup
 
url = "https://example-ecommerce.com/products"
response = requests.get(url)
soup = BeautifulSoup(response.content, "html.parser")
 
# Extract product information
products = soup.select(".product-card")
 
for product in products:
    name = product.select_one(".product-name").get_text(strip=True)
    price = product.select_one(".product-price").get_text(strip=True)
    link = product.select_one("a")["href"]
 
    print(f"Name: {name}")
    print(f"Price: {price}")
    print(f"Link: {link}")
    print("---")

Expected output:

Name: Blue Widget
Price: $19.99
Link: /products/blue-widget
---
Name: Red Widget
Price: $24.99
Link: /products/red-widget
---

Notice the .get_text(strip=True): strip=True removes leading/trailing whitespace, which is usually what you want. HTML source often has extra indentation and newlines baked in, and stripping them automatically keeps your extracted data clean without any post-processing.

Extracting Text, Attributes, and Nested Elements

Sometimes you need more than just text. Let's extract different parts of an element. Understanding what data lives in text nodes versus attributes is fundamental: links store their destination in href, images store their source in src, and custom data often lives in data-* attributes. Once you know where to look, extracting it is trivial.

python
link = soup.find("a")
 
# Get text content
text = link.get_text()
 
# Get an attribute
href = link["href"]
 
# Or use .get() with a default
title = link.get("title", "No title")
 
# Get all attributes as a dictionary
all_attrs = link.attrs
 
# Navigate to parent or siblings
parent = link.parent
next_sibling = link.next_sibling

Here's a real-world example: scraping a blog post. Notice how we extract both text content and a data-date attribute, which is a common pattern websites use to store machine-readable values alongside human-readable display text.

python
post = soup.find("article")
 
# Extract metadata
title = post.select_one("h1").get_text(strip=True)
author = post.select_one(".author-name").get_text(strip=True)
date = post.select_one(".publish-date")["data-date"]
 
# Extract content paragraphs
paragraphs = post.select("p")
content = "\n\n".join([p.get_text(strip=True) for p in paragraphs])
 
# Extract comments (nested structure)
comments = post.select(".comment")
comment_data = []
for comment in comments:
    comment_data.append({
        "author": comment.select_one(".comment-author").get_text(strip=True),
        "text": comment.select_one(".comment-text").get_text(strip=True)
    })
 
print(f"Title: {title}")
print(f"Author: {author}")
print(f"Date: {date}")
print(f"Content:\n{content}")
print(f"Comments: {len(comment_data)}")

The key insight: BeautifulSoup lets you navigate the tree like you're exploring a directory structure. Each element is an object with parent, siblings, children, and attributes. Once you internalize that mental model, you stop thinking of HTML as raw text and start seeing it as structured data that's just waiting to be extracted.

Handling Pagination

Most real scraping projects involve multiple pages. Let's build a paginated scraper. Pagination is where scraping projects get interesting: you're no longer just parsing a single page but orchestrating a sequence of requests that need to stay in sync with your storage layer, respect rate limits, and know when to stop.

python
import requests
from bs4 import BeautifulSoup
import time
 
def scrape_paginated_data(base_url, max_pages=5):
    all_items = []
 
    for page_num in range(1, max_pages + 1):
        # Construct the page URL (adjust based on site structure)
        url = f"{base_url}?page={page_num}"
 
        print(f"Scraping page {page_num}...")
        response = requests.get(url)
 
        if response.status_code != 200:
            print(f"Failed to fetch page {page_num}")
            break
 
        soup = BeautifulSoup(response.content, "html.parser")
 
        # Extract items
        items = soup.select(".item")
 
        if not items:
            print("No items found. Stopping.")
            break
 
        for item in items:
            all_items.append({
                "title": item.select_one(".title").get_text(strip=True),
                "price": item.select_one(".price").get_text(strip=True)
            })
 
        # Be respectful: wait between requests
        time.sleep(2)
 
    return all_items
 
# Usage
items = scrape_paginated_data("https://example.com/products", max_pages=5)
print(f"Collected {len(items)} items")

Key considerations for pagination:

  1. URL structure varies: Some sites use ?page=2, others use /page/2/ or /products?offset=20. Inspect the site first.
  2. Stop conditions: Always check if there are items on the page before moving to the next one.
  3. Rate limiting: Add time.sleep() between requests to avoid hammering the server.
  4. Error handling: Network requests fail sometimes. Catch exceptions and retry gracefully.

Since network reliability is never guaranteed, especially when you're making dozens or hundreds of requests, a retry mechanism with exponential backoff is essential for any scraper that runs unattended. Here's a robust version that handles transient failures gracefully.

python
def scrape_with_retry(url, max_retries=3):
    for attempt in range(max_retries):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()  # Raise exception for bad status codes
            return response
        except requests.exceptions.RequestException as e:
            print(f"Attempt {attempt + 1} failed: {e}")
            if attempt < max_retries - 1:
                time.sleep(2 ** attempt)  # Exponential backoff
            else:
                raise

Ethical Scraping Practices

Web scraping occupies an interesting legal and ethical gray area, and as a developer you need to understand both dimensions. On the legal side, courts in various jurisdictions have ruled differently on scraping public data, but many websites' Terms of Service explicitly prohibit it, and violating those terms can result in IP bans, legal threats, or worse. On the ethical side, even when scraping is technically permissible, you have a responsibility to avoid causing harm to the websites you access.

The most fundamental rule is to check robots.txt first. This standard protocol, found at example.com/robots.txt, tells automated bots which pages they're allowed to visit and at what speed. Respecting it is both the ethical and the professional thing to do. Beyond robots.txt, always set a meaningful crawl delay between requests; hitting a server with hundreds of requests per second is functionally identical to a denial-of-service attack, even if that's not your intent.

Set a descriptive User-Agent header that identifies your bot and includes contact information. This allows site operators to reach out if there's an issue instead of just banning your IP. Consider what time of day you scrape: if you're running a large crawl, doing it during off-peak hours reduces the impact on the site's real users. Cache your results aggressively so you never need to re-fetch pages you've already captured. And finally, always ask yourself whether an API exists before reaching for a scraper. APIs are explicitly designed for programmatic access, they're more stable than scraped HTML, and using them is the right choice whenever available.

A site's /robots.txt file tells scrapers what they can and can't access. Always check it before your first request.

https://example.com/robots.txt

This file might look like:

User-agent: *
Disallow: /admin/
Disallow: /private/
Crawl-delay: 5

Here's what it means:

  • User-agent: *: Rules for all bots
  • Disallow: /admin/: Don't scrape /admin/ pages
  • Crawl-delay: 5: Wait 5 seconds between requests

You can parse this automatically rather than reading it manually every time. Python's standard library includes a RobotFileParser class that does exactly this, and integrating it into your scraper takes less than ten lines of code.

python
from urllib.parse import urlparse
from urllib.robotparser import RobotFileParser

def is_allowed_to_scrape(url, user_agent="*"):
    # Build the robots.txt URL from the page URL
    parsed = urlparse(url)
    rp = RobotFileParser()
    rp.set_url(f"{parsed.scheme}://{parsed.netloc}/robots.txt")
    rp.read()

    # Check if we can fetch this URL
    return rp.can_fetch(user_agent, url)
 
# Usage
if is_allowed_to_scrape("https://example.com/products"):
    print("OK to scrape")
else:
    print("Not allowed")

The Ethics of Scraping:

  1. Check the Terms of Service: Many sites prohibit scraping. Respect that.
  2. Respect robots.txt: It's there for a reason.
  3. Rate limit your requests: Don't hammer the server. Use time.sleep() between requests.
  4. Use a descriptive User-Agent: Instead of the default, identify yourself:
python
headers = {
    "User-Agent": "MyDataCollector/1.0 (+http://mydomain.com/bot)"
}
response = requests.get(url, headers=headers)

Setting a proper User-Agent costs you nothing and gives site operators the information they need to make decisions about your bot. It's the minimum professional courtesy of the web scraping world, and it distinguishes you from malicious scrapers who deliberately hide their identity.

  5. Consider the server's load: If you're scraping 10,000 pages, do it at night or spread it out over days.
  6. Cache responses: Don't re-scrape the same page twice. Save it locally.

When to use an API instead:

If the site offers an API (Twitter, GitHub, Reddit, etc.), use it. APIs are:

  • Faster and more reliable
  • Less likely to break
  • Explicitly permitted by the site
  • Often provide better data (metadata, verified info)

Check if a site has an API before scraping. A simple Google search usually reveals it.

Handling Dynamic Content

Here's the problem: some websites render content with JavaScript. When you fetch the HTML with requests, you get the empty shell: the JavaScript hasn't run yet, so there's no data to extract. This is increasingly common as the web has shifted toward React, Vue, Angular, and other JavaScript frameworks that build the DOM client-side rather than serving pre-rendered HTML.

How do you know if a site uses JavaScript? Open the page in your browser, then right-click → View Page Source (not Inspect; Page Source shows the raw HTML the server sent). If the elements you need are absent from that source view but visible in the browser, JavaScript is building them dynamically. The Inspect panel shows the live DOM after JavaScript has run; Page Source shows what the server actually sent.

For these cases, you need a headless browser that actually executes JavaScript. The two main tools are Selenium and Playwright. Playwright is generally preferred for new projects because it's faster, has a cleaner API, and handles modern browser features more reliably. Both tools spin up a real browser engine behind the scenes, navigate to your target URL, wait for the JavaScript to execute, and then give you access to the fully rendered DOM.

bash
pip install playwright
playwright install
python
from playwright.sync_api import sync_playwright
 
def scrape_with_playwright(url):
    with sync_playwright() as p:
        # Launch a browser
        browser = p.chromium.launch()
        page = browser.new_page()
 
        # Navigate to the URL
        page.goto(url)
 
        # Wait for content to load (adjust selector as needed)
        page.wait_for_selector(".product-list")
 
        # Get the rendered HTML
        content = page.content()
 
        # Now parse with BeautifulSoup
        soup = BeautifulSoup(content, "html.parser")
        products = soup.select(".product")
 
        browser.close()
        return products
 
# Usage
products = scrape_with_playwright("https://example-js-heavy.com")
print(f"Found {len(products)} products")

Notice that once you have the rendered HTML from Playwright, you still use BeautifulSoup for the actual parsing: the two tools complement each other. Playwright handles the JavaScript execution problem, and BeautifulSoup handles the data extraction problem.

When to use Playwright instead of requests:

  • The page uses React, Vue, Angular, or other frameworks
  • Content loads after a button click or scroll
  • The page shows a "loading spinner"
  • Elements are missing in the page source (right-click → View Page Source)

Trade-off: Playwright is slower and heavier. Use it only when necessary. If a site has an API, use the API.

Common Scraping Pitfalls

Even experienced developers run into the same recurring traps when scraping. Knowing these in advance will save you hours of debugging. The most common issue is assuming the HTML structure is stable. Websites redesign, A/B test, and update their markup constantly. A scraper that worked perfectly last month may return empty results today because a class name changed from .product-card to .product-item. Build your scrapers defensively: validate that expected selectors exist before processing, log warnings when extraction returns empty, and set up monitoring so you know when something breaks.

The second most common pitfall is not handling missing elements. Even on a stable page, individual records may have incomplete data, a product without a price, an article without an author byline. If you call .get_text() on None (what select_one() returns when nothing matches), you get an AttributeError and your entire scrape crashes. Use defensive extraction patterns that return a default value instead of raising.

Third is ignoring rate limits and getting banned. If your requests come too fast, you'll hit HTTP 429 (Too Many Requests) or get your IP blocked entirely. Some sites use sophisticated bot detection that goes beyond rate limiting: they check for missing headers, too-regular timing patterns, or browser fingerprinting. Adding small random delays between requests, rotating user agents, and using residential proxies (for legitimate large-scale scraping) are all tools in the advanced scraper's toolkit.

Finally, watch out for encoding issues. HTML pages can use UTF-8, ISO-8859-1, or other encodings. Always use response.content (bytes) passed to BeautifulSoup rather than response.text (string), and let BeautifulSoup detect the encoding from the HTML's meta tags. This avoids a whole class of mysterious character corruption bugs that are extremely frustrating to diagnose.
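As a quick illustration of the encoding point, here's a sketch with a simulated Latin-1 response that declares its charset in a meta tag. Handing BeautifulSoup the raw bytes (as response.content would give you) lets it honor that declaration:

```python
from bs4 import BeautifulSoup

# Simulated server response: Latin-1 bytes with a charset declaration
raw_bytes = (
    '<html><head>'
    '<meta http-equiv="Content-Type" content="text/html; charset=ISO-8859-1">'
    '</head><body><p>Café crème</p></body></html>'
).encode("latin-1")

# Passing bytes lets BeautifulSoup read the meta tag and decode correctly
soup = BeautifulSoup(raw_bytes, "html.parser")
print(soup.find("p").get_text())  # Café crème
```

Had you decoded those bytes yourself with the wrong codec before parsing, the accented characters would already be corrupted by the time BeautifulSoup saw them.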

Storing Data in SQLite

Now that you've scraped data, you need to store it. SQLite is perfect for this: it's lightweight, requires no server, and integrates beautifully with SQLAlchemy. For most scraping projects, SQLite is all you need: it handles millions of rows comfortably, supports full SQL queries for analysis, and lives in a single file that you can move, share, or back up trivially.

Setting Up SQLAlchemy

SQLAlchemy's ORM approach gives you Python classes that map directly to database tables. You define your schema in Python, and SQLAlchemy handles the SQL. This means your scraper code reads cleanly as Python, not as a mix of Python and SQL strings.

python
from sqlalchemy import create_engine, Column, Integer, String, Float, DateTime
from sqlalchemy.orm import DeclarativeBase, sessionmaker
from datetime import datetime
 
# Create database engine
engine = create_engine("sqlite:///products.db")
 
# Define the table structure
class Base(DeclarativeBase):
    pass
 
class Product(Base):
    __tablename__ = "products"
 
    id = Column(Integer, primary_key=True)
    name = Column(String)
    price = Column(Float)
    url = Column(String, unique=True)
    scraped_at = Column(DateTime, default=datetime.utcnow)
 
# Create the table
Base.metadata.create_all(engine)
 
print("Database created!")

Inserting Scraped Data

Inserting data is as simple as creating instances of your model class and adding them to the session. The unique=True constraint on url above is important: it prevents duplicate entries if you re-run your scraper, which will happen inevitably when you're iterating on your code or resuming a crawl that was interrupted.

python
Session = sessionmaker(bind=engine)
session = Session()
 
# Insert a single product
product = Product(
    name="Blue Widget",
    price=19.99,
    url="https://example.com/blue-widget"
)
session.add(product)
session.commit()
 
# Insert multiple products
products = [
    Product(name="Red Widget", price=24.99, url="https://example.com/red"),
    Product(name="Green Widget", price=22.99, url="https://example.com/green")
]
session.add_all(products)
session.commit()
 
print("Data inserted!")

Querying the Data

Once the data is in SQLite, you have the full power of SQL at your disposal through SQLAlchemy's query API. This is where storing scraped data in a proper database pays off: you can filter, sort, aggregate, and join your scraped data just like any other structured dataset.

python
# Get all products
all_products = session.query(Product).all()
 
# Filter by price
expensive = session.query(Product).filter(Product.price > 20).all()
 
# Order by price
sorted_products = session.query(Product).order_by(Product.price).all()
 
# Display results
for product in all_products:
    print(f"{product.name}: ${product.price}")

Complete Scraping + Storage Example

Let's put it all together. This end-to-end example shows how the fetching, parsing, and storage layers connect in a real scraper. Pay attention to the error handling inside the item loop: wrapping individual item extraction in a try/except means one bad record won't crash the entire page's worth of data.

python
import requests
from bs4 import BeautifulSoup
from sqlalchemy import create_engine, Column, Integer, String, Float, DateTime
from sqlalchemy.orm import DeclarativeBase, sessionmaker
from datetime import datetime
import time
 
# Setup database
engine = create_engine("sqlite:///products.db")
 
class Base(DeclarativeBase):
    pass
 
class Product(Base):
    __tablename__ = "products"
    id = Column(Integer, primary_key=True)
    name = Column(String)
    price = Column(Float)
    url = Column(String, unique=True)
    scraped_at = Column(DateTime, default=datetime.utcnow)
 
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
session = Session()
 
# Scraping function
def scrape_and_store(base_url, max_pages=3):
    for page_num in range(1, max_pages + 1):
        url = f"{base_url}?page={page_num}"
        print(f"Scraping page {page_num}...")
 
        response = requests.get(url)
        if response.status_code != 200:
            break
 
        soup = BeautifulSoup(response.content, "html.parser")
        items = soup.select(".product")
 
        if not items:
            break
 
        for item in items:
            try:
                product = Product(
                    name=item.select_one(".title").get_text(strip=True),
                    price=float(item.select_one(".price").get_text(strip=True).replace("$", "")),
                    url=item.select_one("a")["href"]
                )
                session.add(product)
            except Exception as e:
                print(f"Error parsing item: {e}")
 
        session.commit()
        time.sleep(2)
 
# Run the scraper
scrape_and_store("https://example.com/products")
print("Scraping complete!")

Error Handling for Network and Layout Changes

Real-world scraping is messy. Servers go down, websites change their HTML structure, and networks are unreliable. Here's how to build resilience. The key insight is that a scraper running in production is different from a scraper you're testing interactively: in production, no one is watching, and failures need to be detected, logged, and recovered from automatically.

Network Errors

The requests.adapters module gives you built-in retry logic with exponential backoff. Configure it once on a Session object and every request you make through that session automatically benefits from the retry policy; there's no need to wrap every individual request in retry logic.

python
import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
 
def create_session_with_retries():
    session = requests.Session()
 
    # Retry strategy
    retry_strategy = Retry(
        total=3,
        backoff_factor=1,
        status_forcelist=[429, 500, 502, 503, 504]
    )
    adapter = HTTPAdapter(max_retries=retry_strategy)
    session.mount("http://", adapter)
    session.mount("https://", adapter)
 
    return session
 
# Usage
session = create_session_with_retries()
response = session.get("https://example.com")

Handling Missing Elements

The safe_extract pattern below is one you'll want in every scraper you write. Instead of crashing when a selector returns None, it returns a sensible default and optionally logs the failure. This is the difference between a scraper that dies on the first malformed record and one that collects 99% of its target data despite occasional inconsistencies.

python
def safe_extract(element, selector, default="N/A"):
    """Safely extract text from an element with a default fallback."""
    try:
        found = element.select_one(selector)
        return found.get_text(strip=True) if found else default
    except Exception as e:
        print(f"Error extracting {selector}: {e}")
        return default
 
# Usage
product = soup.select_one(".product")
name = safe_extract(product, ".title")
price = safe_extract(product, ".price", default="0.00")

Detecting Layout Changes

If the scraper suddenly returns empty data, the website structure probably changed. Rather than silently returning an empty dataset that might look like a successful run, validate that your key selectors are present before proceeding. This turns a silent failure into a loud, diagnosable one.

python
def scrape_with_validation(url, expected_selectors):
    response = requests.get(url)
    soup = BeautifulSoup(response.content, "html.parser")
 
    # Validate that expected elements exist
    for selector in expected_selectors:
        if not soup.select_one(selector):
            raise ValueError(f"Expected selector '{selector}' not found. Layout may have changed.")
 
    # If validation passes, proceed with scraping
    return soup
 
# Usage
try:
    soup = scrape_with_validation(
        "https://example.com",
        expected_selectors=[".product-list", ".product-item"]
    )
except ValueError as e:
    print(f"Validation failed: {e}")
    # Alert the user, send email, etc.

Putting It All Together: A Real Scraper

Here's a complete, production-ready scraper that brings together every concept from this guide. Notice how the class-based design keeps each concern (session management, robots.txt checking, page fetching, item extraction, and database storage) cleanly separated. This isn't just good software design; it makes the scraper much easier to debug and extend when requirements change.

  • Respects robots.txt
  • Handles pagination
  • Stores data in SQLite
  • Retries on network errors
  • Handles missing elements gracefully
python
import requests
from bs4 import BeautifulSoup
from sqlalchemy import create_engine, Column, Integer, String, Float, DateTime
from sqlalchemy.orm import DeclarativeBase, sessionmaker
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry
from urllib.robotparser import RobotFileParser
from datetime import datetime
import time
import logging
 
# Setup logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)
 
# Setup database
engine = create_engine("sqlite:///scraped_data.db")
 
class Base(DeclarativeBase):
    pass
 
class ScrapedItem(Base):
    __tablename__ = "items"
    id = Column(Integer, primary_key=True)
    title = Column(String)
    description = Column(String)
    url = Column(String, unique=True)
    source = Column(String)
    scraped_at = Column(DateTime, default=datetime.utcnow)
 
Base.metadata.create_all(engine)
Session = sessionmaker(bind=engine)
 
class WebScraper:
    def __init__(self, domain):
        self.domain = domain
        self.session = self._create_session()
        self.session.headers.update({
            "User-Agent": "DataCollector/1.0 (+http://example.com/bot)"
        })
 
    def _create_session(self):
        session = requests.Session()
        retry = Retry(total=3, backoff_factor=1, status_forcelist=[429, 500, 502, 503, 504])
        adapter = HTTPAdapter(max_retries=retry)
        session.mount("http://", adapter)
        session.mount("https://", adapter)
        return session
 
    def is_allowed(self, url):
        """Check robots.txt"""
        try:
            rp = RobotFileParser()
            rp.set_url(f"https://{self.domain}/robots.txt")
            rp.read()
            return rp.can_fetch("*", url)
        except Exception as e:
            logger.warning(f"Could not read robots.txt: {e}")
            return True
 
    def fetch_page(self, url):
        """Fetch a page with error handling"""
        try:
            response = self.session.get(url, timeout=10)
            response.raise_for_status()
            return response
        except requests.exceptions.RequestException as e:
            logger.error(f"Failed to fetch {url}: {e}")
            return None
 
    def scrape_items(self, page_url):
        """Extract items from a page"""
        response = self.fetch_page(page_url)
        if not response:
            return []
 
        soup = BeautifulSoup(response.content, "html.parser")
        items = []
 
        for item_elem in soup.select(".item"):
            try:
                link = item_elem.select_one("a")
                item = {
                    "title": self._safe_extract(item_elem, ".item-title"),
                    "description": self._safe_extract(item_elem, ".item-desc"),
                    "url": link["href"] if link else None,
                    "source": self.domain
                }
                items.append(item)
            except Exception as e:
                logger.warning(f"Error parsing item: {e}")
 
        return items
 
    @staticmethod
    def _safe_extract(element, selector):
        """Safely extract text"""
        try:
            found = element.select_one(selector)
            return found.get_text(strip=True) if found else None
        except Exception:
            return None
 
    def scrape_all(self, base_url, max_pages=5):
        """Scrape multiple pages and store results"""
        session = Session()
 
        for page_num in range(1, max_pages + 1):
            url = f"{base_url}?page={page_num}"
 
            if not self.is_allowed(url):
                logger.info(f"Robots.txt forbids scraping {url}")
                break
 
            logger.info(f"Scraping page {page_num}...")
            items = self.scrape_items(url)
 
            if not items:
                logger.info("No items found. Stopping.")
                break
 
            for item in items:
                if not item["url"]:
                    continue
                # Skip URLs already stored: the column is unique, and a
                # duplicate would abort the whole commit
                if session.query(ScrapedItem).filter_by(url=item["url"]).first():
                    continue
                session.add(ScrapedItem(**item))
 
            try:
                session.commit()
            except Exception as e:
                session.rollback()
                logger.error(f"Database error: {e}")
 
            time.sleep(2)  # Rate limiting
 
        session.close()
        logger.info("Scraping complete!")
 
# Usage
scraper = WebScraper("example.com")
scraper.scrape_all("https://example.com/items", max_pages=10)

This scraper includes:

  • robots.txt checking to respect site guidelines
  • Session with retries for network resilience
  • Pagination handling
  • Safe element extraction with fallbacks
  • Database storage via SQLAlchemy
  • Logging for debugging
  • Rate limiting via time.sleep()
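
Once a run completes, the results are an ordinary SQLite file you can query from any tool. A minimal sketch of reading the data back with the standard library (an in-memory database with one sample row stands in for the real scraped_data.db here):

```python
import sqlite3

# Use ":memory:" with a sample row so the snippet runs standalone;
# point this at "scraped_data.db" to query real scraper output.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE IF NOT EXISTS items (
    id INTEGER PRIMARY KEY, title TEXT, description TEXT,
    url TEXT UNIQUE, source TEXT, scraped_at TEXT)""")
conn.execute(
    "INSERT INTO items (title, url, source) VALUES (?, ?, ?)",
    ("Sample item", "https://example.com/items/1", "example.com"),
)

# A typical post-run sanity check: how many items came from each source?
rows = conn.execute(
    "SELECT source, COUNT(*) FROM items GROUP BY source"
).fetchall()
print(rows)
```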

Key Takeaways

  1. BeautifulSoup + requests is the perfect duo for static HTML scraping
  2. Use CSS selectors for cleaner, more flexible element targeting
  3. Always respect robots.txt, rate limits, and Terms of Service
  4. Use Playwright or Selenium only for JavaScript-heavy sites
  5. Store data in SQLite for easy access and analysis
  6. Add error handling and retries for resilience
  7. Check whether an API exists before scraping; an official API is almost always the better choice when available
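
On that last takeaway: before writing any selectors, it's often worth a single request to check whether the site already serves JSON. Here's one way to sketch that check; the helper names are our own, not part of requests:

```python
import requests

def looks_like_json(content_type):
    """True if a Content-Type header value indicates a JSON payload."""
    return "json" in (content_type or "").split(";")[0]

def fetch_api_or_none(url, timeout=10):
    """Return parsed JSON if the URL serves it; None means fall back to scraping."""
    response = requests.get(url, headers={"Accept": "application/json"}, timeout=timeout)
    response.raise_for_status()
    if looks_like_json(response.headers.get("Content-Type")):
        return response.json()
    return None
```

If `fetch_api_or_none` returns data, you can skip HTML parsing entirely; if it returns None, proceed with BeautifulSoup as usual.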

Wrapping Up

Web scraping is one of those skills that compounds over time. Your first scraper might be five lines of BeautifulSoup targeting a single page. Your tenth will be a class-based system with robots.txt checking, retry logic, database storage, and validation. By the time you're writing your twentieth, you'll have a personal toolkit of patterns and utilities that let you stand up a new scraper in minutes for almost any target.

The discipline we covered here (understanding HTML structure before writing a single selector, building in error handling from the start, respecting the sites you access, and reaching for Playwright only when static scraping genuinely won't work) is what separates brittle one-off scripts from maintainable scrapers that you can run and trust. That foundation is worth internalizing deeply, because web scraping unlocks access to data that would otherwise be completely out of reach for your projects.

As you continue through this series toward data science and AI/ML, you'll find that many interesting real-world datasets live on web pages rather than in neat CSV files or APIs. The scraping skills you've built here will let you collect that data yourself, on demand, for whatever project you're working on. Go apply them to something real, inspect the DOM on a site you're curious about, and write your first scraper. The best way to learn is to build.

Now go scrape responsibly.
