Parsing PDF Documents¶
PDFs are one of the most common document formats you'll encounter when building RAG systems. Unlike plain text files, PDFs can contain complex layouts, tables, images, and multiple columns—making text extraction challenging.
In this tutorial, you'll learn how to parse PDFs using two powerful libraries:
- Unstructured - A versatile library with multiple parsing strategies
- Docling - IBM's layout-aware document parser designed for RAG
Why PDF Parsing is Challenging¶
PDFs store text as positioned elements on a page, not as a flowing document. This means:
- Text order isn't always obvious (especially with multi-column layouts)
- Tables often get scrambled
- Headers and footers can mix with body content
- Scanned PDFs require OCR (Optical Character Recognition)
Both Unstructured and Docling solve these problems using layout detection models that understand document structure.
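To see concretely why reading order breaks, here is a toy simulation (not a real parser): PDF text spans carry (x, y) page positions, and sorting purely by vertical position interleaves two columns, while grouping spans into columns first recovers the intended order. The `column_split` threshold is an assumption for this example.

```python
# Toy illustration of why naive PDF text extraction scrambles
# multi-column layouts. Each span is (x, y, text): x grows rightward,
# y grows downward, roughly as in PDF page coordinates.
spans = [
    (50, 100, "Column 1, line 1"),
    (300, 100, "Column 2, line 1"),
    (50, 120, "Column 1, line 2"),
    (300, 120, "Column 2, line 2"),
]

def naive_order(spans):
    """Sort by vertical position only -- interleaves the two columns."""
    return [text for _, _, text in sorted(spans, key=lambda s: (s[1], s[0]))]

def column_aware_order(spans, column_split=200):
    """Group spans into columns first, then read each top to bottom."""
    left = [s for s in spans if s[0] < column_split]
    right = [s for s in spans if s[0] >= column_split]
    ordered = sorted(left, key=lambda s: s[1]) + sorted(right, key=lambda s: s[1])
    return [text for _, _, text in ordered]

print(naive_order(spans))         # lines from the two columns alternate
print(column_aware_order(spans))  # all of column 1, then all of column 2
```

Layout detection models do the column grouping automatically, without needing a hand-tuned split threshold.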
Why Unstructured and Docling?¶
There are many PDF parsing libraries in Python. Here's why we focus on these two:
| Library | Layout Detection | Table Extraction | OCR | RAG-Ready | Maintenance |
|---|---|---|---|---|---|
| Unstructured | ✅ AI-powered | ✅ Excellent | ✅ | ✅ | Active |
| Docling | ✅ AI-powered | ✅ Best-in-class | ✅ | ✅ | Active (IBM) |
| PyPDF2/pypdf | ❌ | ❌ | ❌ | ❌ | Active |
| pdfplumber | ❌ | ✅ Basic | ❌ | ❌ | Active |
| PyMuPDF (fitz) | ❌ | ❌ | ❌ | ❌ | Active |
Why not PyPDF2, pdfplumber, or PyMuPDF?
These libraries are great for simple text extraction, but they lack layout understanding:
- PyPDF2/pypdf - Extracts raw text but loses reading order in multi-column documents
- pdfplumber - Good for basic table extraction, but struggles with complex layouts
- PyMuPDF - Fast text extraction, but no semantic understanding of document structure
Why Unstructured and Docling are better for RAG:
- Layout Detection - Both use AI models to understand page structure, correctly handling multi-column text, headers, and sidebars
- Semantic Elements - They classify content into meaningful types (titles, paragraphs, tables, lists) rather than just raw text
- Table Preservation - Tables maintain their row/column structure instead of becoming garbled text
- Framework Integration - Both integrate directly with LangChain, LlamaIndex, and other RAG frameworks
- Production-Ready - Actively maintained with enterprise support options
Method 1: Using Unstructured¶
Unstructured is a widely-used library that breaks documents into elements like Title, NarrativeText, Table, and ListItem.
Installation¶
# Basic installation
pip install unstructured
# For PDF support with all features (OCR, layout detection)
pip install "unstructured[pdf]"
System Dependencies
For advanced PDF parsing, you may need to install system dependencies:
- Tesseract (for OCR): `brew install tesseract` (macOS) or `apt-get install tesseract-ocr` (Linux)
- Poppler (for PDF rendering): `brew install poppler` (macOS)
Basic Usage¶
from unstructured.partition.pdf import partition_pdf
# Parse a PDF file
elements = partition_pdf("path/to/document.pdf")
# Print each extracted element
for element in elements:
    print(f"[{type(element).__name__}]: {element.text[:100]}...")
Parsing Strategies¶
Unstructured offers four parsing strategies, each with different trade-offs:
| Strategy | Speed | Accuracy | Best For |
|---|---|---|---|
| `"fast"` | ⚡ Fastest | Basic | Simple PDFs with extractable text |
| `"auto"` | Medium | Smart | General use (default) |
| `"hi_res"` | Slowest | Highest | Complex layouts, tables |
| `"ocr_only"` | Slow | OCR-based | Scanned documents |
from unstructured.partition.pdf import partition_pdf
# Use high-resolution strategy for complex documents
elements = partition_pdf(
    filename="complex-report.pdf",
    strategy="hi_res",  # uses the layout detection model
)

# Use OCR for scanned documents
elements = partition_pdf(
    filename="scanned-document.pdf",
    strategy="ocr_only",
    languages=["eng"],  # specify the OCR language
)
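The trade-offs in the table above can be encoded as a small helper. This is a hypothetical heuristic of our own (`choose_strategy` is not part of Unstructured's API); it simply maps document traits to a strategy name that can be passed to `partition_pdf`.

```python
def choose_strategy(is_scanned: bool, has_complex_layout: bool) -> str:
    """Pick an Unstructured parsing strategy from document traits.

    A hypothetical heuristic mirroring the trade-off table: scanned
    pages need OCR, complex layouts need the layout model, and
    everything else can rely on "auto" to decide per document.
    """
    if is_scanned:
        return "ocr_only"
    if has_complex_layout:
        return "hi_res"
    return "auto"

# The returned value plugs straight into partition_pdf(strategy=...)
print(choose_strategy(is_scanned=False, has_complex_layout=True))  # hi_res
```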
Extracting Tables and Images¶
With the hi_res strategy, you can extract tables and images:
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
    filename="report-with-tables.pdf",
    strategy="hi_res",
    extract_images_in_pdf=True,
    extract_image_block_output_dir="./extracted_images",
)

# Filter for specific element types
tables = [el for el in elements if el.category == "Table"]
for table in tables:
    print(table.metadata.text_as_html)  # get the table as HTML
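If you need the table as structured data rather than HTML, the string from `text_as_html` can be parsed with the standard library's `html.parser`. A minimal sketch, assuming simple `<tr>`/`<td>`/`<th>` markup without nested tables:

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect a simple HTML table into a list of row lists."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def html_table_to_rows(html: str) -> list[list[str]]:
    parser = TableRows()
    parser.feed(html)
    return parser.rows

rows = html_table_to_rows(
    "<table><tr><th>Year</th><th>Revenue</th></tr>"
    "<tr><td>2023</td><td>$1.2M</td></tr></table>"
)
print(rows)  # [['Year', 'Revenue'], ['2023', '$1.2M']]
```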
Complete RAG-Ready Example¶
from unstructured.partition.pdf import partition_pdf
def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF for RAG ingestion."""
    elements = partition_pdf(
        filename=pdf_path,
        strategy="auto",
        include_page_breaks=True,
    )

    # Combine all text elements
    text_content = []
    for element in elements:
        if hasattr(element, "text") and element.text:
            text_content.append(element.text)

    return "\n\n".join(text_content)
# Usage
text = extract_text_from_pdf("annual-report.pdf")
print(f"Extracted {len(text)} characters")
Method 2: Using Docling¶
Docling is IBM's open-source document parser specifically designed for RAG applications. It uses computer vision models to understand document layout, reading order, and table structures.
Installation¶
pip install docling
GPU Acceleration
Docling benefits from GPU acceleration via PyTorch. For CPU-only Linux:
pip install docling --extra-index-url https://download.pytorch.org/whl/cpu
Basic Usage¶
from docling.document_converter import DocumentConverter
# Create a converter instance
converter = DocumentConverter()
# Convert a PDF (from file path or URL)
result = converter.convert("path/to/document.pdf")
# Export to Markdown
markdown = result.document.export_to_markdown()
print(markdown)
Converting from URL¶
Docling can directly fetch and parse PDFs from URLs:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
# Convert directly from URL (no need to download first)
result = converter.convert("https://arxiv.org/pdf/2408.09869")
print(result.document.export_to_markdown())
Export Formats¶
Docling supports multiple export formats:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document
# Export to Markdown (great for LLMs)
markdown = doc.export_to_markdown()
# Export to JSON (lossless, preserves structure)
json_data = doc.export_to_dict()
# Export to HTML
html = doc.export_to_html()
Using the CLI¶
Docling provides a convenient command-line interface:
# Convert a PDF to Markdown
docling document.pdf
# Convert from URL
docling https://arxiv.org/pdf/2206.01062
# Use Visual Language Model for better accuracy
docling --pipeline vlm --vlm-model granite_docling document.pdf
# See all options
docling --help
Complete RAG-Ready Example¶
from docling.document_converter import DocumentConverter
from pathlib import Path
def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF using Docling for RAG ingestion."""
    converter = DocumentConverter()
    result = converter.convert(pdf_path)

    # Export to Markdown (clean, structured output)
    return result.document.export_to_markdown()
# Usage
text = extract_text_from_pdf("technical-manual.pdf")
print(f"Extracted {len(text)} characters")
Comparison: Unstructured vs Docling¶
| Feature | Unstructured | Docling |
|---|---|---|
| Ease of Setup | Easy | Easy |
| PDF Layout Detection | ✅ (hi_res mode) | ✅ (default) |
| Table Extraction | ✅ | ✅ (better accuracy) |
| OCR Support | ✅ (Tesseract) | ✅ (built-in) |
| Output Formats | Elements, JSON, HTML | Markdown, JSON, HTML |
| LLM Framework Integration | LangChain, LlamaIndex | LangChain, LlamaIndex, Haystack |
| Visual Language Models | ❌ | ✅ (GraniteDocling) |
| CLI | Limited | ✅ Full featured |
| Best For | Multi-format pipelines | Production RAG systems |
When to Choose Each¶
Choose Unstructured when:
- You need to process many different file formats in one pipeline
- You want fine-grained control over parsing strategies
- You're already using it for other document types
Choose Docling when:
- PDF parsing accuracy is critical
- You need the best table extraction
- You're building a production RAG system
- You want clean Markdown output for LLMs
Best Practices for RAG¶
- Choose the right strategy: Use `hi_res` (Unstructured) or Docling's default pipeline for complex documents
- Handle OCR carefully: Always specify the correct language for scanned documents
- Preserve structure: Markdown output (both libraries) works great with LLMs
- Batch processing: For large document sets, use Docling's CLI or Unstructured's batch processing
- Clean the output: Remove headers, footers, and page numbers if they're not useful
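The last point can be sketched as a small cleanup pass. This is a heuristic of our own, not part of either library: drop lines that are bare page numbers, and drop short lines that repeat across many pages (likely running headers or footers). The threshold of 3 repeats is an assumption to tune per corpus.

```python
from collections import Counter

def clean_extracted_text(text: str, repeat_threshold: int = 3) -> str:
    """Remove bare page numbers and frequently repeated short lines.

    Heuristic: a line that is only digits is treated as a page number;
    a short line appearing `repeat_threshold` or more times is treated
    as a running header or footer.
    """
    lines = text.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if stripped.isdigit():
            continue  # bare page number
        if stripped and len(stripped) < 60 and counts[stripped] >= repeat_threshold:
            continue  # likely running header/footer
        cleaned.append(line)
    return "\n".join(cleaned)

sample = (
    "ACME Annual Report\nIntro text.\n1\n"
    "ACME Annual Report\nMore text.\n2\n"
    "ACME Annual Report\nEnd.\n3"
)
print(clean_extracted_text(sample))  # Intro text. / More text. / End.
```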
Next Steps¶
Now that you can parse PDFs, learn how to:
- Chunk your parsed text for optimal retrieval
- Create embeddings from your chunks
- Store in a vector database for semantic search