Parsing PDF Documents¶
PDFs are one of the most common document formats you'll encounter when building RAG systems. Unlike plain text files, PDFs can contain complex layouts, tables, images, and multiple columns—making text extraction challenging.
In this tutorial, you'll learn how to parse PDFs using two powerful libraries:
- Unstructured - A versatile library with multiple parsing strategies
- Docling - IBM's layout-aware document parser designed for RAG
Why PDF Parsing is Challenging¶
PDFs store text as positioned elements on a page, not as a flowing document. This means:
- Text order isn't always obvious (especially with multi-column layouts)
- Tables often get scrambled
- Headers and footers can mix with body content
- Scanned PDFs require OCR (Optical Character Recognition)
Both Unstructured and Docling solve these problems using layout detection models that understand document structure.
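To see concretely why reading order breaks, here is a toy simulation (not a real parser): PDF text spans carry (x, y) page positions, and sorting purely by vertical position interleaves two columns, while grouping spans into columns first recovers the intended order. The `column_split` threshold is an assumption for this example.

```python
# Toy illustration of why naive PDF text extraction scrambles
# multi-column layouts. Each span is (x, y, text): x grows rightward,
# y grows downward, roughly as in PDF page coordinates.
spans = [
    (50, 100, "Column 1, line 1"),
    (300, 100, "Column 2, line 1"),
    (50, 120, "Column 1, line 2"),
    (300, 120, "Column 2, line 2"),
]

def naive_order(spans):
    """Sort by vertical position only -- interleaves the two columns."""
    return [text for _, _, text in sorted(spans, key=lambda s: (s[1], s[0]))]

def column_aware_order(spans, column_split=200):
    """Group spans into columns first, then read each top to bottom."""
    left = [s for s in spans if s[0] < column_split]
    right = [s for s in spans if s[0] >= column_split]
    ordered = sorted(left, key=lambda s: s[1]) + sorted(right, key=lambda s: s[1])
    return [text for _, _, text in ordered]

print(naive_order(spans))         # lines from the two columns alternate
print(column_aware_order(spans))  # all of column 1, then all of column 2
```

Layout detection models do the column grouping automatically, without needing a hand-tuned split threshold.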
Why Unstructured and Docling?¶
There are many PDF parsing libraries in Python. Here's why we focus on these two:
| Library | Layout Detection | Table Extraction | OCR | RAG-Ready | Maintenance |
|---|---|---|---|---|---|
| Unstructured | ✅ AI-powered | ✅ Excellent | ✅ | ✅ | Active |
| Docling | ✅ AI-powered | ✅ Best-in-class | ✅ | ✅ | Active (IBM) |
| PyPDF2/pypdf | ❌ | ❌ | ❌ | ❌ | Active |
| pdfplumber | ❌ | ✅ Basic | ❌ | ❌ | Active |
| PyMuPDF (fitz) | ❌ | ❌ | ❌ | ❌ | Active |
Why not PyPDF2, pdfplumber, or PyMuPDF?
These libraries are great for simple text extraction, but they lack layout understanding:
- PyPDF2/pypdf - Extracts raw text but loses reading order in multi-column documents
- pdfplumber - Good for basic table extraction, but struggles with complex layouts
- PyMuPDF - Fast text extraction, but no semantic understanding of document structure
Why Unstructured and Docling are better for RAG:
- Layout Detection - Both use AI models to understand page structure, correctly handling multi-column text, headers, and sidebars
- Semantic Elements - They classify content into meaningful types (titles, paragraphs, tables, lists) rather than just raw text
- Table Preservation - Tables maintain their row/column structure instead of becoming garbled text
- Framework Integration - Both integrate directly with LangChain, LlamaIndex, and other RAG frameworks
- Production-Ready - Actively maintained with enterprise support options
Method 1: Using Unstructured¶
Unstructured is a widely-used library that breaks documents into elements like Title, NarrativeText, Table, and ListItem.
Installation¶
# Basic installation
pip install unstructured
# For PDF support with all features (OCR, layout detection)
pip install "unstructured[pdf]"
System Dependencies
For advanced PDF parsing, you may need to install system dependencies:
- Tesseract (for OCR): `brew install tesseract` (macOS) or `apt-get install tesseract-ocr` (Linux)
- Poppler (for PDF rendering): `brew install poppler` (macOS)
Basic Usage¶
from unstructured.partition.pdf import partition_pdf
# Parse a PDF file
elements = partition_pdf("path/to/document.pdf")
# Print each extracted element
for element in elements:
    print(f"[{type(element).__name__}]: {element.text[:100]}...")
Parsing Strategies¶
Unstructured offers four parsing strategies, each with different trade-offs:
| Strategy | Speed | Accuracy | Best For |
|---|---|---|---|
| `"fast"` | ⚡ Fastest | Basic | Simple PDFs with extractable text |
| `"auto"` | Medium | Smart | General use (default) |
| `"hi_res"` | Slowest | Highest | Complex layouts, tables |
| `"ocr_only"` | Slow | OCR-based | Scanned documents |
from unstructured.partition.pdf import partition_pdf
# Use high-resolution strategy for complex documents
elements = partition_pdf(
    filename="complex-report.pdf",
    strategy="hi_res",  # uses the layout detection model
)

# Use OCR for scanned documents
elements = partition_pdf(
    filename="scanned-document.pdf",
    strategy="ocr_only",
    languages=["eng"],  # specify the OCR language
)
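The trade-offs in the table above can be encoded as a small helper. This is a hypothetical heuristic of our own (`choose_strategy` is not part of Unstructured's API); it simply maps document traits to a strategy name that can be passed to `partition_pdf`.

```python
def choose_strategy(is_scanned: bool, has_complex_layout: bool) -> str:
    """Pick an Unstructured parsing strategy from document traits.

    A hypothetical heuristic mirroring the trade-off table: scanned
    pages need OCR, complex layouts need the layout model, and
    everything else can rely on "auto" to decide per document.
    """
    if is_scanned:
        return "ocr_only"
    if has_complex_layout:
        return "hi_res"
    return "auto"

# The returned value plugs straight into partition_pdf(strategy=...)
print(choose_strategy(is_scanned=False, has_complex_layout=True))  # hi_res
```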
Extracting Tables and Images¶
With the hi_res strategy, you can extract tables and images:
from unstructured.partition.pdf import partition_pdf
elements = partition_pdf(
    filename="report-with-tables.pdf",
    strategy="hi_res",
    extract_images_in_pdf=True,
    extract_image_block_output_dir="./extracted_images",
)

# Filter for specific element types
tables = [el for el in elements if el.category == "Table"]
for table in tables:
    print(table.metadata.text_as_html)  # get the table as HTML
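If you need the table as structured data rather than HTML, the string from `text_as_html` can be parsed with the standard library's `html.parser`. A minimal sketch, assuming simple `<tr>`/`<td>`/`<th>` markup without nested tables:

```python
from html.parser import HTMLParser

class TableRows(HTMLParser):
    """Collect a simple HTML table into a list of row lists."""

    def __init__(self):
        super().__init__()
        self.rows, self._row, self._cell = [], None, None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th"):
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._row is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None

def html_table_to_rows(html: str) -> list[list[str]]:
    parser = TableRows()
    parser.feed(html)
    return parser.rows

rows = html_table_to_rows(
    "<table><tr><th>Year</th><th>Revenue</th></tr>"
    "<tr><td>2023</td><td>$1.2M</td></tr></table>"
)
print(rows)  # [['Year', 'Revenue'], ['2023', '$1.2M']]
```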
Complete RAG-Ready Example¶
from unstructured.partition.pdf import partition_pdf
def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF for RAG ingestion."""
    elements = partition_pdf(
        filename=pdf_path,
        strategy="auto",
        include_page_breaks=True,
    )

    # Combine all text elements
    text_content = []
    for element in elements:
        if hasattr(element, "text") and element.text:
            text_content.append(element.text)

    return "\n\n".join(text_content)
# Usage
text = extract_text_from_pdf("annual-report.pdf")
print(f"Extracted {len(text)} characters")
Method 2: Using Docling¶
Docling is IBM's open-source document parser specifically designed for RAG applications. It uses computer vision models to understand document layout, reading order, and table structures.
Installation¶
pip install docling
GPU Acceleration
Docling benefits from GPU acceleration via PyTorch. For CPU-only Linux:
pip install docling --extra-index-url https://download.pytorch.org/whl/cpu
Basic Usage¶
from docling.document_converter import DocumentConverter
# Create a converter instance
converter = DocumentConverter()
# Convert a PDF (from file path or URL)
result = converter.convert("path/to/document.pdf")
# Export to Markdown
markdown = result.document.export_to_markdown()
print(markdown)
Converting from URL¶
Docling can directly fetch and parse PDFs from URLs:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
# Convert directly from URL (no need to download first)
result = converter.convert("https://arxiv.org/pdf/2408.09869")
print(result.document.export_to_markdown())
Export Formats¶
Docling supports multiple export formats:
from docling.document_converter import DocumentConverter
converter = DocumentConverter()
result = converter.convert("document.pdf")
doc = result.document
# Export to Markdown (great for LLMs)
markdown = doc.export_to_markdown()
# Export to JSON (lossless, preserves structure)
json_data = doc.export_to_dict()
# Export to HTML
html = doc.export_to_html()
Using the CLI¶
Docling provides a convenient command-line interface:
# Convert a PDF to Markdown
docling document.pdf
# Convert from URL
docling https://arxiv.org/pdf/2206.01062
# Use Visual Language Model for better accuracy
docling --pipeline vlm --vlm-model granite_docling document.pdf
# See all options
docling --help
Complete RAG-Ready Example¶
from docling.document_converter import DocumentConverter
from pathlib import Path
def extract_text_from_pdf(pdf_path: str) -> str:
    """Extract text from a PDF using Docling for RAG ingestion."""
    converter = DocumentConverter()
    result = converter.convert(pdf_path)

    # Export to Markdown (clean, structured output)
    return result.document.export_to_markdown()
# Usage
text = extract_text_from_pdf("technical-manual.pdf")
print(f"Extracted {len(text)} characters")
Comparison: Unstructured vs Docling¶
| Feature | Unstructured | Docling |
|---|---|---|
| Ease of Setup | Easy | Easy |
| PDF Layout Detection | ✅ (hi_res mode) | ✅ (default) |
| Table Extraction | ✅ | ✅ (better accuracy) |
| OCR Support | ✅ (Tesseract) | ✅ (built-in) |
| Output Formats | Elements, JSON, HTML | Markdown, JSON, HTML |
| LLM Framework Integration | LangChain, LlamaIndex | LangChain, LlamaIndex, Haystack |
| Visual Language Models | ❌ | ✅ (GraniteDocling) |
| CLI | Limited | ✅ Full featured |
| Best For | Multi-format pipelines | Production RAG systems |
When to Choose Each¶
Choose Unstructured when:
- You need to process many different file formats in one pipeline
- You want fine-grained control over parsing strategies
- You're already using it for other document types
Choose Docling when:
- PDF parsing accuracy is critical
- You need the best table extraction
- You're building a production RAG system
- You want clean Markdown output for LLMs
Best Practices for RAG¶
- Choose the right strategy: Use `hi_res` (Unstructured) or Docling's default pipeline for complex documents
- Handle OCR carefully: Always specify the correct language for scanned documents
- Preserve structure: Markdown output (both libraries) works great with LLMs
- Batch processing: For large document sets, use Docling's CLI or Unstructured's batch processing
- Clean the output: Remove headers, footers, and page numbers if they're not useful
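The last point can be sketched as a small cleanup pass. This is a heuristic of our own, not part of either library: drop lines that are bare page numbers, and drop short lines that repeat across many pages (likely running headers or footers). The threshold of 3 repeats is an assumption to tune per corpus.

```python
from collections import Counter

def clean_extracted_text(text: str, repeat_threshold: int = 3) -> str:
    """Remove bare page numbers and frequently repeated short lines.

    Heuristic: a line that is only digits is treated as a page number;
    a short line appearing `repeat_threshold` or more times is treated
    as a running header or footer.
    """
    lines = text.splitlines()
    counts = Counter(line.strip() for line in lines if line.strip())
    cleaned = []
    for line in lines:
        stripped = line.strip()
        if stripped.isdigit():
            continue  # bare page number
        if stripped and len(stripped) < 60 and counts[stripped] >= repeat_threshold:
            continue  # likely running header/footer
        cleaned.append(line)
    return "\n".join(cleaned)

sample = (
    "ACME Annual Report\nIntro text.\n1\n"
    "ACME Annual Report\nMore text.\n2\n"
    "ACME Annual Report\nEnd.\n3"
)
print(clean_extracted_text(sample))  # Intro text. / More text. / End.
```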
Next Steps¶
Now that you can parse PDFs, learn how to:
- Chunk your parsed text for optimal retrieval
- Create embeddings from your chunks
- Store in a vector database for semantic search