Parsing Webpages (HTML) for RAG¶
Webpages are a great data source for RAG, but they’re also one of the messiest:
- Boilerplate: navigation, footers, cookie banners, sidebars
- Ads / UI text: content mixed with buttons and UI labels
- JavaScript rendering: the HTML you see in the browser may not exist in the initial response
- Duplication: tag pages, print views, query params, and mirrored content
Your goal is to extract the main content plus enough structure/metadata to chunk and retrieve well later.
Choose your approach¶
| Approach | Best for | Pros | Cons |
|---|---|---|---|
| `requests` + `BeautifulSoup` | Simple/static pages | Minimal deps, full control | You must handle boilerplate removal yourself |
| `trafilatura` (recommended default) | Articles, docs pages, blogs | Great "main text" extraction | Another dependency; not perfect for every site |
| `playwright` | JS-heavy apps | Renders like a browser | Heavier setup and slower |
In practice, a good default is:

1. Try `trafilatura` extraction.
2. Fall back to BeautifulSoup.
3. Only use `playwright` if the page is JS-rendered.
A simple parsing pipeline¶
Install¶
```
pip install requests beautifulsoup4 trafilatura
```
If you need JS rendering:
```
pip install playwright
playwright install
```
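If the content only appears after JavaScript runs, a minimal sketch with Playwright's sync API looks like this. The `fetch_rendered_html` name and the `networkidle` wait strategy are illustrative choices, not part of the reference implementation below; swap it in for `fetch_url` when the initial HTML response is missing the content you see in the browser.

```python
# Minimal sketch: fetch HTML after JavaScript has run
# (assumes Chromium was installed via `playwright install`).
from playwright.sync_api import sync_playwright


def fetch_rendered_html(url: str, *, timeout_ms: int = 20_000) -> str:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        try:
            page = browser.new_page()
            # "networkidle" waits for network activity to settle; adjust per site
            page.goto(url, timeout=timeout_ms, wait_until="networkidle")
            return page.content()  # rendered HTML, ready for the same extraction step
        finally:
            browser.close()
```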
Reference implementation¶
The `parse_webpage` function below returns a single dict per page:

```json
{
  "url": "...",
  "title": "...",
  "text": "...",
  "headings": ["...", "..."],
  "fetched_at": "..."
}
```
```python
from __future__ import annotations

from datetime import datetime, timezone
from typing import Any

import requests
from bs4 import BeautifulSoup


def fetch_url(url: str, *, timeout_s: int = 20) -> str:
    headers = {
        "User-Agent": "BuildRagBot/1.0 (+https://buildrag.com)",
        "Accept": "text/html,application/xhtml+xml",
    }
    resp = requests.get(url, headers=headers, timeout=timeout_s)
    resp.raise_for_status()
    return resp.text


def extract_title_and_headings(html: str) -> tuple[str | None, list[str]]:
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else None
    headings: list[str] = []
    for tag in soup.find_all(["h1", "h2", "h3"]):
        text = tag.get_text(" ", strip=True)
        if text:
            headings.append(text)
    # Keep the first ~20 headings to avoid excessive noise
    return title, headings[:20]


def extract_main_text(html: str) -> str:
    # Recommended default: trafilatura for "main content" extraction
    try:
        import trafilatura  # type: ignore
    except ImportError:
        trafilatura = None

    if trafilatura is not None:
        extracted = trafilatura.extract(
            html,
            include_comments=False,
            include_tables=False,
            favor_recall=True,
        )
        if extracted and extracted.strip():
            return extracted.strip()

    # Fallback: basic BeautifulSoup text extraction (no boilerplate removal)
    soup = BeautifulSoup(html, "html.parser")
    # Remove common non-content tags
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()
    text = soup.get_text("\n", strip=True)
    # Collapse too many blank lines
    lines = [line.strip() for line in text.splitlines() if line.strip()]
    return "\n".join(lines)


def parse_webpage(url: str) -> dict[str, Any]:
    html = fetch_url(url)
    title, headings = extract_title_and_headings(html)
    text = extract_main_text(html)
    return {
        "url": url,
        "title": title,
        "headings": headings,
        "text": text,
        "fetched_at": datetime.now(timezone.utc).isoformat(),
    }
```
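A quick usage sketch (the URL is a placeholder):

```python
doc = parse_webpage("https://example.com/blog/some-article")
print(doc["title"])
print(doc["headings"][:5])
print(len(doc["text"]), "characters of main text")
```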
Best practices (what matters in production)¶
Respect robots.txt and rate limits¶
Even if you can fetch a page, you often shouldn’t fetch it aggressively.
- Check `robots.txt` (Python's `urllib.robotparser`); see the sketch after this list
- Add a small delay between requests (e.g. 0.5–2s)
- Use caching and avoid re-fetching unchanged pages
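A minimal politeness sketch using the standard library's `urllib.robotparser`. The `CRAWL_DELAY_S` value, the example URLs, and the reuse of the `fetch_url` user agent are illustrative assumptions:

```python
import time
from urllib import robotparser

USER_AGENT = "BuildRagBot/1.0 (+https://buildrag.com)"  # same agent as fetch_url()
CRAWL_DELAY_S = 1.0  # illustrative; stay in the 0.5-2s range

rp = robotparser.RobotFileParser()
rp.set_url("https://example.com/robots.txt")
rp.read()

for url in ["https://example.com/docs/a", "https://example.com/docs/b"]:
    if not rp.can_fetch(USER_AGENT, url):
        continue  # disallowed by robots.txt
    html = fetch_url(url)
    time.sleep(CRAWL_DELAY_S)  # small delay between requests
```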
Normalize URLs to reduce duplicates¶
Store and compare canonicalized URLs (a minimal sketch follows this list):
- Remove tracking query params (e.g. utm_*)
- Prefer canonical URLs when the page provides <link rel="canonical" ...>
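A minimal normalization sketch. The `normalize_url` helper and its exact policy (which params to strip, dropping the fragment) are illustrative, not the only reasonable choice:

```python
from __future__ import annotations

from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

from bs4 import BeautifulSoup


def normalize_url(url: str, html: str | None = None) -> str:
    # Prefer the canonical URL if the page declares one
    if html:
        soup = BeautifulSoup(html, "html.parser")
        canonical = soup.find("link", rel="canonical", href=True)
        if canonical:
            url = canonical["href"]
    # Drop tracking params (utm_*) and the fragment
    parts = urlsplit(url)
    query = [(k, v) for k, v in parse_qsl(parts.query) if not k.startswith("utm_")]
    return urlunsplit((parts.scheme, parts.netloc, parts.path, urlencode(query), ""))
```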
Store metadata for retrieval + citations¶
At minimum store:
- source_url
- title
- fetched_at
- optionally: headings, section_path, content_type
This makes it easier to generate citations later and debug retrieval failures.
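For example, a per-chunk metadata record might look like this (the field values and the extra `chunk_index` field are illustrative):

```python
chunk_metadata = {
    "source_url": "https://example.com/docs/pricing",
    "title": "Pricing",
    "fetched_at": "2024-05-01T12:00:00+00:00",
    "headings": ["Pricing", "Plans", "FAQ"],  # optional
    "section_path": "Pricing > FAQ",          # optional
    "content_type": "text/html",              # optional
    "chunk_index": 3,                         # illustrative extra field
}
```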
Keep headings if you can¶
Headings are extremely useful for:

- structure-aware chunking (see the sketch below)
- filtering ("retrieve only from FAQ section")
- better citations ("Section: Pricing")
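A rough sketch of heading-aware splitting over the raw HTML (paragraph text only; the `split_by_headings` helper is illustrative and usually needs per-site tuning):

```python
from bs4 import BeautifulSoup


def split_by_headings(html: str) -> list[dict[str, str]]:
    soup = BeautifulSoup(html, "html.parser")
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    sections: list[dict[str, str]] = []
    current = {"heading": "", "text": ""}
    # Walk headings and paragraphs in document order, starting a new
    # section whenever a heading appears
    for el in soup.find_all(["h1", "h2", "h3", "p"]):
        if el.name in ("h1", "h2", "h3"):
            if current["text"].strip():
                sections.append(current)
            current = {"heading": el.get_text(" ", strip=True), "text": ""}
        else:
            current["text"] += el.get_text(" ", strip=True) + "\n"
    if current["text"].strip():
        sections.append(current)
    return sections
```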
Next steps¶
- Chunk your extracted text: Chunking Strategies
- Store and retrieve it efficiently: Vector Stores for RAG (Postgres + pgvector)