Parsing Text Documents¶
Plain text files (.txt, .md, .rst) are the simplest document format, but proper parsing is still important for RAG systems. Unlike binary formats, text files are human-readable, yet they can still carry inconsistent encodings, irregular formatting, and implicit structure that needs to be identified.
In this tutorial, you'll learn how to parse text files using:
- Native Python - Direct file handling for simple cases
- Unstructured - Intelligent parsing that identifies document structure
Why Parse Text Files?¶
While text files seem straightforward, there are several challenges:
- Character encoding - Files may use UTF-8, Latin-1, or other encodings
- Line endings - Different systems use `\n`, `\r\n`, or `\r`
- Structure detection - Identifying headers, paragraphs, and lists
- Cleaning - Removing extra whitespace, special characters, or formatting
For simple text extraction, native Python is sufficient. For structure-aware parsing (identifying titles, sections, lists), Unstructured provides more intelligent processing.
Why These Two Methods?¶
There are many ways to read text files in Python. Here's why we focus on these two approaches:
| Method | Structure Detection | Dependencies | Speed | Best Use Case |
|---|---|---|---|---|
| Native Python | ❌ Manual | None | ⚡ Fastest | Simple extraction, logs, config files |
| Unstructured | ✅ Automatic | unstructured | Fast | Documents with headers, lists, sections |
| Regex parsing | ❌ Manual | None | Fast | Pattern-based extraction |
| NLTK/spaCy | ⚠️ Sentence-level | Heavy | Slow | NLP tasks, not document parsing |
Why not just use regex or simple `file.read()`?
For basic text files, `file.read()` works fine. However, as your RAG system grows, you'll encounter:
- Mixed content - Documents with headers, paragraphs, and lists that need semantic understanding
- Consistency - Unstructured provides uniform element types across all document formats (text, PDF, HTML)
- Pipeline integration - Using the same library for text as you do for PDFs simplifies your codebase
When Native Python is the right choice:
- Log files, configuration files, or simple notes
- Maximum speed is critical (processing millions of files)
- You need zero external dependencies
- The text structure is consistent and predictable
When Unstructured is the right choice:
- Documents with varying structure (headers, lists, paragraphs)
- You're already using Unstructured for PDFs or other formats
- You want automatic element classification for downstream processing
- Building a production RAG system with multiple document types
Method 1: Using Native Python¶
Native Python is the best choice when you need speed, simplicity, or zero dependencies. It gives you full control over how text is processed, but you must handle encoding detection and text cleaning yourself.
For straightforward text files, Python's built-in file handling is efficient and requires no additional dependencies.
Basic Usage¶
```python
def read_text_file(file_path: str) -> str:
    """Read a text file with proper encoding handling."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

# Usage
text = read_text_file("document.txt")
print(f"Read {len(text)} characters")
```
Handling Different Encodings¶
Real-world files may not always be UTF-8. Here's a robust approach:
```python
import chardet

def read_text_with_encoding_detection(file_path: str) -> str:
    """Read a text file with automatic encoding detection."""
    # First, detect the encoding
    with open(file_path, "rb") as f:
        raw_data = f.read()
    detected = chardet.detect(raw_data)
    # chardet returns {"encoding": None, ...} when detection fails,
    # so fall back explicitly rather than relying on a dict default
    encoding = detected["encoding"] or "utf-8"

    # Then read with the detected encoding
    with open(file_path, "r", encoding=encoding) as f:
        return f.read()

# Usage
text = read_text_with_encoding_detection("legacy-document.txt")
```
Installing chardet
Install the encoding detection library with:
```bash
pip install chardet
```
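If you'd rather avoid the extra dependency, a simple fallback loop over a few likely encodings is often good enough. This is a minimal sketch, not part of the chardet approach above; the choice and ordering of encodings is an assumption you should adjust for your own data:

```python
def read_text_with_fallback(file_path: str) -> str:
    """Try common encodings in order of likelihood."""
    # latin-1 maps every possible byte, so the loop always succeeds eventually
    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            with open(file_path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise RuntimeError(f"Could not decode {file_path}")  # unreachable, kept for clarity

# Usage
# text = read_text_with_fallback("legacy-document.txt")
```

This trades accuracy for simplicity: unlike chardet, it can silently mis-decode text that happens to be valid in an earlier encoding.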
Cleaning and Normalizing Text¶
```python
import re

def clean_text(text: str) -> str:
    """Clean and normalize text for RAG ingestion."""
    # Normalize line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Remove excessive whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)  # Max 2 newlines
    text = re.sub(r"[ \t]+", " ", text)  # Collapse spaces

    # Strip leading/trailing whitespace from lines
    lines = [line.strip() for line in text.split("\n")]
    text = "\n".join(lines)

    return text.strip()

# Usage
raw_text = read_text_file("messy-document.txt")
clean = clean_text(raw_text)
```
Complete RAG-Ready Example (Native Python)¶
```python
import re
from pathlib import Path

import chardet

def extract_text_from_file(file_path: str) -> str:
    """Extract and clean text from a file for RAG ingestion."""
    path = Path(file_path)

    # Detect encoding, falling back to UTF-8 if detection fails
    with open(path, "rb") as f:
        raw_data = f.read()
    detected = chardet.detect(raw_data)
    encoding = detected["encoding"] or "utf-8"

    # Read with detected encoding
    with open(path, "r", encoding=encoding, errors="replace") as f:
        text = f.read()

    # Clean the text
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()

# Usage
text = extract_text_from_file("report.txt")
print(f"Extracted {len(text)} characters")
```
Method 2: Using Unstructured¶
Why upgrade from Native Python?
While native Python handles basic text extraction well, it treats all text the same—it doesn't understand that a line in all caps might be a title, or that lines starting with `-` are list items. This matters for RAG because:
- Better chunking - Knowing that something is a title helps you keep it with its content when chunking
- Richer metadata - Element types can be filtered or weighted differently in retrieval
- Unified pipeline - If you're using Unstructured for PDFs, using it for text files means one consistent API
Unstructured provides intelligent parsing that identifies document structure within text files—detecting titles, headers, paragraphs, and lists automatically.
Installation¶
```bash
pip install unstructured
```
Basic Usage¶
```python
from unstructured.partition.text import partition_text

# Parse a text file
elements = partition_text(filename="document.txt")

# Print each extracted element
for element in elements:
    print(f"[{type(element).__name__}]: {element.text[:100]}...")
```
Understanding Element Types¶
Unstructured classifies text into semantic elements:
| Element Type | Description |
|---|---|
| `Title` | Section headers and titles |
| `NarrativeText` | Regular paragraphs |
| `ListItem` | Bulleted or numbered list items |
| `Header` | Document headers |
| `Footer` | Document footers |
| `UncategorizedText` | Text that doesn't fit other categories |
```python
from unstructured.partition.text import partition_text

elements = partition_text(filename="structured-document.txt")

# Filter by element type
titles = [el for el in elements if el.category == "Title"]
paragraphs = [el for el in elements if el.category == "NarrativeText"]

print(f"Found {len(titles)} titles and {len(paragraphs)} paragraphs")
```
Parsing from String Instead of File¶
You can also parse text directly without a file:
```python
from unstructured.partition.text import partition_text

raw_text = """
# Introduction

This is the first paragraph of the document.
It contains important information.

## Features

- Feature one
- Feature two
- Feature three
"""

elements = partition_text(text=raw_text)

for element in elements:
    print(f"[{element.category}]: {element.text}")
```
Parsing Markdown Files¶
Unstructured can also parse Markdown files with better structure awareness:
```python
from unstructured.partition.md import partition_md

# Parse a Markdown file
elements = partition_md(filename="README.md")

# Headers in Markdown become Titles
for element in elements:
    print(f"[{element.category}]: {element.text[:80]}")
```
Accessing Metadata¶
Each element includes useful metadata:
```python
from unstructured.partition.text import partition_text

elements = partition_text(filename="long-document.txt")

for element in elements:
    print(f"Type: {element.category}")
    print(f"Text: {element.text[:100]}...")

    # Access metadata
    if hasattr(element, "metadata"):
        print(f"Filename: {element.metadata.filename}")
        print(f"Filetype: {element.metadata.filetype}")
    print("---")
```
Complete RAG-Ready Example (Unstructured)¶
```python
from unstructured.partition.text import partition_text

def extract_text_with_structure(file_path: str) -> str:
    """Extract text from file with structure awareness."""
    elements = partition_text(filename=file_path)

    # Combine all text elements with proper spacing
    text_parts = []
    for element in elements:
        if hasattr(element, "text") and element.text:
            # Add extra newline before titles for readability
            if element.category == "Title":
                text_parts.append(f"\n{element.text}")
            else:
                text_parts.append(element.text)

    return "\n\n".join(text_parts)

# Usage
text = extract_text_with_structure("technical-doc.txt")
print(f"Extracted {len(text)} characters")
```
Extracting Structured Data¶
For more control, you can extract elements as structured data:
```python
from unstructured.partition.text import partition_text

def extract_as_structured_data(file_path: str) -> list[dict]:
    """Extract text as a list of structured elements."""
    elements = partition_text(filename=file_path)

    structured = []
    for element in elements:
        structured.append({
            "type": element.category,
            "text": element.text,
            "id": element.id,
        })
    return structured

# Usage
data = extract_as_structured_data("document.txt")
for item in data:
    print(f"[{item['type']}] {item['text'][:60]}...")
```
Comparison: Native Python vs Unstructured¶
| Feature | Native Python | Unstructured |
|---|---|---|
| Dependencies | None (or chardet) | unstructured |
| Speed | ⚡ Fastest | Fast |
| Structure Detection | ❌ Manual | ✅ Automatic |
| Element Classification | ❌ | ✅ |
| Markdown Support | Basic | ✅ Full |
| Best For | Simple text extraction | Structure-aware parsing |
When to Choose Each¶
Choose Native Python when:
- You need the fastest possible processing
- The text files are simple and well-formatted
- You don't need to identify document structure
- You want minimal dependencies
Choose Unstructured when:
- You need to identify titles, headers, and lists
- You're processing Markdown or structured text
- You want consistent element classification
- You're building a production RAG pipeline
Handling Common Text Formats¶
RST (reStructuredText) Files¶
```python
from unstructured.partition.rst import partition_rst

elements = partition_rst(filename="documentation.rst")

for element in elements:
    print(f"[{element.category}]: {element.text[:60]}...")
```
Plain Log Files¶
For log files, native Python is often better:
```python
def parse_log_file(file_path: str) -> list[str]:
    """Parse a log file into individual entries."""
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # Filter and clean log entries
    entries = [line.strip() for line in lines if line.strip()]
    return entries

# Usage
logs = parse_log_file("application.log")
print(f"Found {len(logs)} log entries")
```
Best Practices for RAG¶
- Handle encoding properly - Always detect or specify the correct encoding
- Clean excessive whitespace - Too many blank lines can affect chunking
- Preserve structure - Use Unstructured when document structure matters
- Normalize line endings - Ensure consistent
\ncharacters across platforms - Consider file size - For very large files, process in chunks to avoid memory issues
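The last point deserves a quick sketch: calling `read()` on a multi-gigabyte file loads it all into memory at once, while iterating in fixed-size pieces keeps memory usage flat. This is a minimal example; the 1 MB chunk size is an arbitrary assumption you can tune:

```python
from typing import Iterator

def iter_text_chunks(file_path: str, chunk_size: int = 1 << 20) -> Iterator[str]:
    """Yield a large text file in fixed-size pieces instead of loading it whole."""
    with open(file_path, "r", encoding="utf-8", errors="replace") as f:
        while True:
            chunk = f.read(chunk_size)  # reads up to chunk_size characters
            if not chunk:
                break
            yield chunk

# Usage: count characters without holding the whole file in memory
# total = sum(len(chunk) for chunk in iter_text_chunks("huge-corpus.txt"))
```

Note that fixed-size chunks can split a word or sentence at the boundary, so for RAG ingestion you would typically feed these pieces into a proper chunking step afterward.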
Processing Multiple Files¶
Batch Processing with Native Python¶
```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def process_text_files(directory: str) -> dict[str, str]:
    """Process all text files in a directory."""
    path = Path(directory)
    results = {}

    def read_file(file_path):
        with open(file_path, "r", encoding="utf-8", errors="replace") as f:
            return file_path.name, f.read()

    text_files = list(path.glob("*.txt"))
    with ThreadPoolExecutor(max_workers=4) as executor:
        for name, content in executor.map(read_file, text_files):
            results[name] = content

    return results

# Usage
all_texts = process_text_files("./documents")
print(f"Processed {len(all_texts)} files")
```
Batch Processing with Unstructured¶
```python
from pathlib import Path

from unstructured.partition.auto import partition

def process_documents(directory: str) -> list[dict]:
    """Process multiple documents with automatic type detection."""
    path = Path(directory)
    all_elements = []

    for file_path in path.glob("*"):
        if file_path.suffix in [".txt", ".md", ".rst"]:
            elements = partition(filename=str(file_path))
            for element in elements:
                all_elements.append({
                    "source": file_path.name,
                    "type": element.category,
                    "text": element.text,
                })

    return all_elements

# Usage
documents = process_documents("./docs")
print(f"Extracted {len(documents)} elements from all documents")
```
Next Steps¶
Now that you can parse text documents, learn how to:
- Chunk your parsed text for optimal retrieval
- Parse PDF documents for more complex formats
- Parse web pages for HTML content