Parsing Webpages (HTML) for RAG¶
Webpages are a great data source for RAG, but they’re also one of the messiest:
- Boilerplate: navigation, footers, cookie banners, sidebars
- Ads / UI text: content mixed with buttons and UI labels
- JavaScript rendering: the HTML you see in the browser may not exist in the initial response
- Duplication: tag pages, print views, query params, and mirrored content
Your goal is to extract the main content plus enough structure/metadata to chunk and retrieve well later.
Choose your approach¶
| Approach | Best for | Pros | Cons |
|---|---|---|---|
| `requests` + `BeautifulSoup` | Simple/static pages | Minimal deps, full control | You must handle boilerplate removal yourself |
| `trafilatura` (recommended default) | Articles, docs pages, blogs | Great "main text" extraction | Another dependency; not perfect for every site |
| `playwright` | JS-heavy apps | Renders like a browser | Heavier setup and slower |
In practice, a good default is:

1. Try `trafilatura` extraction.
2. Fall back to BeautifulSoup.
3. Only use `playwright` if the page is JS-rendered.
A simple parsing pipeline¶
Install¶
```
pip install requests beautifulsoup4 trafilatura
```
If you need JS rendering:
```
pip install playwright
playwright install
```
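If the content only appears after JavaScript runs, a minimal sketch with Playwright's sync API looks like this. The `fetch_rendered_html` name and the `networkidle` wait strategy are illustrative choices, not part of the reference implementation below; swap it in for `fetch_url` when the initial HTML response is missing the content you see in the browser.

```python
# Minimal sketch: fetch HTML after JavaScript has run
# (assumes Chromium was installed via `playwright install`).
from playwright.sync_api import sync_playwright


def fetch_rendered_html(url: str, *, timeout_ms: int = 20_000) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits for network activity to settle; adjust per site
            page.goto(url, timeout=timeout_ms, wait_until="networkidle")
            return page.content()  # rendered HTML, ready for the same extraction step
        finally:
            browser.close()
```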
Reference implementation¶
The `parse_webpage` function below returns a single dict per page:

```json
{
  "url": "...",
  "title": "...",
  "text": "...",
  "headings": ["...", "..."],
  "fetched_at": "..."
}
```
```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any

import requests
from bs4 import BeautifulSoup


def fetch_url(url: str, *, timeout_s: int = 20) -> str:
    headers = {
        "User-Agent": "BuildRagBot/1.0 (+https://buildrag.com)",
        "Accept": "text/html,application/xhtml+xml",
    }
    resp = requests.get(url, headers=headers, timeout=timeout_s)
    resp.raise_for_status()
    return resp.text


def extract_title_and_headings(html: str) -> tuple[str | None, list[str]]:
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    headings: list[str] = []
    for tag in soup.find_all(["h1", "h2", "h3"]):
        text = tag.get_text(" ", strip=True)
        if text:
            headings.append(text)
    # Keep the first ~20 headings to avoid excessive noise
    return title, headings[:20]


def extract_main_text(html: str) -> str:
    # Recommended default: trafilatura for "main content" extraction
    try:
        import trafilatura  # type: ignore
    except ImportError:
        trafilatura = None

    if trafilatura is not None:
        extracted = trafilatura.extract(
            html,
            include_comments=False,
            include_tables=False,
            favor_recall=True,
        )
        if extracted and extracted.strip():
            return extracted.strip()

    # Fallback: basic BeautifulSoup text extraction (no boilerplate removal)
    soup = BeautifulSoup(html, "html.parser")
    # Remove common non-content tags
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text("\n", strip=True)
    # Collapse too many blank lines
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)


def parse_webpage(url: str) -> dict[str, Any]:
    html = fetch_url(url)
    title, headings = extract_title_and_headings(html)
    text = extract_main_text(html)
    return {
        "url": url,
        "title": title,
        "headings": headings,
        "text": text,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }
```
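A quick usage sketch (the URL is a placeholder):

```python
doc = parse_webpage("https://example.com/blog/some-article")
print(doc["title"])
print(doc["headings"][:5])
print(len(doc["text"]), "characters of main text")
```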
Best practices (what matters in production)¶
Respect robots.txt and rate limits¶
Even if you can fetch a page, you often shouldn’t fetch it aggressively.
- Check `robots.txt` (Python's `urllib.robotparser`); see the sketch after this list
- Add a small delay between requests (e.g. 0.5–2s)
- Use caching and avoid re-fetching unchanged pages
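A minimal politeness sketch using the standard library's `urllib.robotparser`. The `CRAWL_DELAY_S` value, the example URLs, and the reuse of the `fetch_url` user agent are illustrative assumptions:

```python
import time
from urllib import robotparser

USER_AGENT = "BuildRagBot/1.0 (+https://buildrag.com)"  # same agent as fetch_url()
CRAWL_DELAY_S = 1.0  # illustrative; stay in the 0.5-2s range

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for url in ["https://example.com/docs/a", "https://example.com/docs/b"]:
    if not rp.can_fetch(USER_AGENT, url):
        continue  # disallowed by robots.txt
    html = fetch_url(url)
    time.sleep(CRAWL_DELAY_S)  # small delay between requests
```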
Normalize URLs to reduce duplicates¶
Store and compare canonicalized URLs (a minimal sketch follows this list):
- Remove tracking query params (e.g. utm_*)
- Prefer canonical URLs when the page provides <link rel="canonical" ...>
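A minimal normalization sketch. The `normalize_url` helper and its exact policy (which params to strip, dropping the fragment) are illustrative, not the only reasonable choice:

```python
from __future__ import annotations

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

from bs4 import BeautifulSoup


def normalize_url(url: str, html: str | None = None) -> str:
    # Prefer the canonical URL if the page declares one
    if html:
        soup = BeautifulSoup(html, "html.parser")
        canonical = soup.find("link", rel="canonical", href=True)
        if canonical:
            url = canonical["href"]
    # Drop tracking params (utm_*) and the fragment
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))
```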
Store metadata for retrieval + citations¶
At minimum store:
- source_url
- title
- fetched_at
- optionally: headings, section_path, content_type
This makes it easier to generate citations later and debug retrieval failures.
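For example, a per-chunk metadata record might look like this (the field values and the extra `chunk_index` field are illustrative):

```python
chunk_metadata = {
    "source_url": "https://example.com/docs/pricing",
    "title": "Pricing",
    "fetched_at": "2024-05-01T12:00:00+00:00",
    "headings": ["Pricing", "Plans", "FAQ"],  # optional
    "section_path": "Pricing > FAQ",          # optional
    "content_type": "text/html",              # optional
    "chunk_index": 3,                         # illustrative extra field
}
```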
Keep headings if you can¶
Headings are extremely useful for:

- structure-aware chunking (see the sketch below)
- filtering ("retrieve only from FAQ section")
- better citations ("Section: Pricing")
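A rough sketch of heading-aware splitting over the raw HTML (paragraph text only; the `split_by_headings` helper is illustrative and usually needs per-site tuning):

```python
from bs4 import BeautifulSoup


def split_by_headings(html: str) -> list[dict[str, str]]:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    sections: list[dict[str, str]] = []
    current = {"heading": "", "text": ""}
    # Walk headings and paragraphs in document order, starting a new
    # section whenever a heading appears
    for el in soup.find_all(["h1", "h2", "h3", "p"]):
        if el.name in ("h1", "h2", "h3"):
            if current["text"].strip():
                sections.append(current)
            current = {"heading": el.get_text(" ", strip=True), "text": ""}
        else:
            current["text"] += el.get_text(" ", strip=True) + "\n"
    if current["text"].strip():
        sections.append(current)
    return sections
```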
Next steps¶
- Chunk your extracted text: Chunking Strategies
- Store and retrieve it efficiently: Vector Stores for RAG (Postgres + pgvector)