Parsing Text Documents

Plain text files (.txt, .md, .rst) are the simplest document format, but parsing them properly still matters for RAG systems. Unlike binary formats, text files are human-readable, yet they can still arrive with inconsistent encodings, mixed line endings, or implicit structure that needs to be identified.

In this tutorial, you'll learn how to parse text files using:

  • Native Python - Direct file handling for simple cases
  • Unstructured - Intelligent parsing that identifies document structure

Why Parse Text Files?

While text files seem straightforward, there are several challenges:

  • Character encoding - Files may use UTF-8, Latin-1, or other encodings
  • Line endings - Different systems use \n, \r\n, or \r
  • Structure detection - Identifying headers, paragraphs, and lists
  • Cleaning - Removing extra whitespace, special characters, or formatting

For simple text extraction, native Python is sufficient. For structure-aware parsing (identifying titles, sections, lists), Unstructured provides more intelligent processing.

Why These Two Methods?

There are many ways to read text files in Python. Here's why we focus on these two approaches:

| Method | Structure Detection | Dependencies | Speed | Best Use Case |
| --- | --- | --- | --- | --- |
| Native Python | ❌ Manual | None | ⚡ Fastest | Simple extraction, logs, config files |
| Unstructured | ✅ Automatic | unstructured | Fast | Documents with headers, lists, sections |
| Regex parsing | ❌ Manual | None | Fast | Pattern-based extraction |
| NLTK/spaCy | ⚠️ Sentence-level | Heavy | Slow | NLP tasks, not document parsing |

Why not just use regex or simple file.read()?

For basic text files, file.read() works fine. However, as your RAG system grows, you'll encounter:

  1. Mixed content - Documents with headers, paragraphs, and lists that need semantic understanding
  2. Consistency - Unstructured provides uniform element types across all document formats (text, PDF, HTML)
  3. Pipeline integration - Using the same library for text as you do for PDFs simplifies your codebase

When Native Python is the right choice:

  • Log files, configuration files, or simple notes
  • Maximum speed is critical (processing millions of files)
  • You need zero external dependencies
  • The text structure is consistent and predictable

When Unstructured is the right choice:

  • Documents with varying structure (headers, lists, paragraphs)
  • You're already using Unstructured for PDFs or other formats
  • You want automatic element classification for downstream processing
  • Building a production RAG system with multiple document types

Method 1: Using Native Python

Native Python is the best choice when you need speed, simplicity, or zero dependencies. It gives you full control over how text is processed, but you must handle encoding detection and text cleaning yourself.

For straightforward text files, Python's built-in file handling is efficient and requires no additional dependencies.

Basic Usage

def read_text_file(file_path: str) -> str:
    """Read a text file with proper encoding handling."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

# Usage
text = read_text_file("document.txt")
print(f"Read {len(text)} characters")

Handling Different Encodings

Real-world files may not always be UTF-8. Here's a robust approach:

import chardet

def read_text_with_encoding_detection(file_path: str) -> str:
    """Read a text file with automatic encoding detection."""
    # First, detect the encoding
    with open(file_path, "rb") as f:
        raw_data = f.read()
        detected = chardet.detect(raw_data)
        encoding = detected.get("encoding") or "utf-8"  # chardet may return None

    # Then read with the detected encoding
    with open(file_path, "r", encoding=encoding) as f:
        return f.read()

# Usage
text = read_text_with_encoding_detection("legacy-document.txt")

Installing chardet

Install the encoding detection library with:

pip install chardet

Cleaning and Normalizing Text

import re

def clean_text(text: str) -> str:
    """Clean and normalize text for RAG ingestion."""
    # Normalize line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Remove excessive whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)  # Max 2 newlines
    text = re.sub(r"[ \t]+", " ", text)      # Collapse spaces

    # Strip leading/trailing whitespace from lines
    lines = [line.strip() for line in text.split("\n")]
    text = "\n".join(lines)

    return text.strip()

# Usage
raw_text = read_text_file("messy-document.txt")
clean = clean_text(raw_text)

Complete RAG-Ready Example (Native Python)

import chardet
import re
from pathlib import Path

def extract_text_from_file(file_path: str) -> str:
    """Extract and clean text from a file for RAG ingestion."""
    path = Path(file_path)

    # Detect encoding
    with open(path, "rb") as f:
        raw_data = f.read()
        detected = chardet.detect(raw_data)
        encoding = detected.get("encoding") or "utf-8"  # chardet may return None

    # Read with detected encoding
    with open(path, "r", encoding=encoding, errors="replace") as f:
        text = f.read()

    # Clean the text
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()

# Usage
text = extract_text_from_file("report.txt")
print(f"Extracted {len(text)} characters")

Method 2: Using Unstructured

Why upgrade from Native Python?

While native Python handles basic text extraction well, it treats all text the same: it doesn't understand that a line in all caps might be a title, or that lines starting with "-" are list items. This matters for RAG because:

  1. Better chunking - Knowing that something is a title helps you keep it with its content when chunking
  2. Richer metadata - Element types can be filtered or weighted differently in retrieval
  3. Unified pipeline - If you're using Unstructured for PDFs, using it for text files means one consistent API

Unstructured provides intelligent parsing that identifies document structure within text files—detecting titles, headers, paragraphs, and lists automatically.

Installation

pip install unstructured

Basic Usage

from unstructured.partition.text import partition_text

# Parse a text file
elements = partition_text(filename="document.txt")

# Print each extracted element
for element in elements:
    print(f"[{type(element).__name__}]: {element.text[:100]}...")

Understanding Element Types

Unstructured classifies text into semantic elements:

| Element Type | Description |
| --- | --- |
| Title | Section headers and titles |
| NarrativeText | Regular paragraphs |
| ListItem | Bulleted or numbered list items |
| Header | Document headers |
| Footer | Document footers |
| UncategorizedText | Text that doesn't fit other categories |

You can filter elements by type:

from unstructured.partition.text import partition_text

elements = partition_text(filename="structured-document.txt")

# Filter by element type
titles = [el for el in elements if el.category == "Title"]
paragraphs = [el for el in elements if el.category == "NarrativeText"]

print(f"Found {len(titles)} titles and {len(paragraphs)} paragraphs")
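Categories can also drive retrieval downstream, for example by boosting titles so section headings rank higher. A minimal, library-free sketch, assuming elements have already been converted to plain dicts with a "category" key; the weight values are illustrative, not recommendations:

```python
# Illustrative weights -- tune these for your own retrieval setup
CATEGORY_WEIGHTS = {"Title": 2.0, "ListItem": 1.2}

def add_retrieval_weights(elements: list[dict]) -> list[dict]:
    """Attach a retrieval weight to each element based on its category."""
    return [
        {**el, "weight": CATEGORY_WEIGHTS.get(el["category"], 1.0)}
        for el in elements
    ]

# Usage
elements = [
    {"category": "Title", "text": "Introduction"},
    {"category": "NarrativeText", "text": "This is a paragraph."},
]
weighted = add_retrieval_weights(elements)
```

How the weights are consumed depends on your retrieval stack: some vector stores support score multipliers, while others only support metadata filtering.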

Parsing from String Instead of File

You can also parse text directly without a file:

from unstructured.partition.text import partition_text

raw_text = """
# Introduction

This is the first paragraph of the document.
It contains important information.

## Features

- Feature one
- Feature two
- Feature three
"""

elements = partition_text(text=raw_text)

for element in elements:
    print(f"[{element.category}]: {element.text}")

Parsing Markdown Files

Unstructured can also parse Markdown files with better structure awareness:

from unstructured.partition.md import partition_md

# Parse a Markdown file
elements = partition_md(filename="README.md")

# Headers in Markdown become Titles
for element in elements:
    print(f"[{element.category}]: {element.text[:80]}")

Accessing Metadata

Each element includes useful metadata:

from unstructured.partition.text import partition_text

elements = partition_text(filename="long-document.txt")

for element in elements:
    print(f"Type: {element.category}")
    print(f"Text: {element.text[:100]}...")

    # Access metadata
    if hasattr(element, 'metadata'):
        print(f"Filename: {element.metadata.filename}")
        print(f"Filetype: {element.metadata.filetype}")
    print("---")

Complete RAG-Ready Example (Unstructured)

from unstructured.partition.text import partition_text

def extract_text_with_structure(file_path: str) -> str:
    """Extract text from file with structure awareness."""
    elements = partition_text(filename=file_path)

    # Combine all text elements with proper spacing
    text_parts = []
    for element in elements:
        if hasattr(element, 'text') and element.text:
            # Add extra newline before titles for readability
            if element.category == "Title":
                text_parts.append(f"\n{element.text}")
            else:
                text_parts.append(element.text)

    return "\n\n".join(text_parts)

# Usage
text = extract_text_with_structure("technical-doc.txt")
print(f"Extracted {len(text)} characters")

Extracting Structured Data

For more control, you can extract elements as structured data:

from unstructured.partition.text import partition_text

def extract_as_structured_data(file_path: str) -> list[dict]:
    """Extract text as a list of structured elements."""
    elements = partition_text(filename=file_path)

    structured = []
    for element in elements:
        structured.append({
            "type": element.category,
            "text": element.text,
            "id": element.id,
        })

    return structured

# Usage
data = extract_as_structured_data("document.txt")
for item in data:
    print(f"[{item['type']}] {item['text'][:60]}...")
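One payoff of structured extraction is title-aware chunking: starting a new chunk at each Title keeps a heading attached to the content it introduces. A minimal sketch over element dicts shaped like the output above:

```python
def chunk_by_title(elements: list[dict]) -> list[str]:
    """Group elements into chunks, starting a new chunk at each Title."""
    chunks: list[list[str]] = []
    for el in elements:
        if el["type"] == "Title" or not chunks:
            chunks.append([])  # a Title (or the very first element) opens a chunk
        chunks[-1].append(el["text"])
    return ["\n\n".join(parts) for parts in chunks]

# Usage
elements = [
    {"type": "Title", "text": "Setup"},
    {"type": "NarrativeText", "text": "Install the package first."},
    {"type": "Title", "text": "Usage"},
    {"type": "NarrativeText", "text": "Call the function."},
]
chunks = chunk_by_title(elements)
```

In production you would also cap chunk size, but this illustrates why element types make chunking decisions easier than raw text does.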

Comparison: Native Python vs Unstructured

| Feature | Native Python | Unstructured |
| --- | --- | --- |
| Dependencies | None (or chardet) | unstructured |
| Speed | ⚡ Fastest | Fast |
| Structure Detection | ❌ Manual | ✅ Automatic |
| Element Classification | ❌ | ✅ |
| Markdown Support | Basic | ✅ Full |
| Best For | Simple text extraction | Structure-aware parsing |

When to Choose Each

Choose Native Python when:

  • You need the fastest possible processing
  • The text files are simple and well-formatted
  • You don't need to identify document structure
  • You want minimal dependencies

Choose Unstructured when:

  • You need to identify titles, headers, and lists
  • You're processing Markdown or structured text
  • You want consistent element classification
  • You're building a production RAG pipeline

Handling Common Text Formats

RST (reStructuredText) Files

from unstructured.partition.rst import partition_rst

elements = partition_rst(filename="documentation.rst")
for element in elements:
    print(f"[{element.category}]: {element.text[:60]}...")

Plain Log Files

For log files, native Python is often better:

def parse_log_file(file_path: str) -> list[str]:
    """Parse a log file into individual entries."""
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # Filter and clean log entries
    entries = [line.strip() for line in lines if line.strip()]
    return entries

# Usage
logs = parse_log_file("application.log")
print(f"Found {len(logs)} log entries")

Best Practices for RAG

  1. Handle encoding properly - Always detect or specify the correct encoding
  2. Clean excessive whitespace - Too many blank lines can affect chunking
  3. Preserve structure - Use Unstructured when document structure matters
  4. Normalize line endings - Ensure consistent \n characters across platforms
  5. Consider file size - For very large files, process in chunks to avoid memory issues

Processing Multiple Files

Batch Processing with Native Python

from pathlib import Path
from concurrent.futures import ThreadPoolExecutor

def process_text_files(directory: str) -> dict[str, str]:
    """Process all text files in a directory."""
    path = Path(directory)
    results = {}

    def read_file(file_path):
        with open(file_path, "r", encoding="utf-8", errors="replace") as f:
            return file_path.name, f.read()

    text_files = list(path.glob("*.txt"))

    with ThreadPoolExecutor(max_workers=4) as executor:
        for name, content in executor.map(read_file, text_files):
            results[name] = content

    return results

# Usage
all_texts = process_text_files("./documents")
print(f"Processed {len(all_texts)} files")

Batch Processing with Unstructured

from unstructured.partition.auto import partition
from pathlib import Path

def process_documents(directory: str) -> list[dict]:
    """Process multiple documents with automatic type detection."""
    path = Path(directory)
    all_elements = []

    for file_path in path.glob("*"):
        if file_path.suffix in [".txt", ".md", ".rst"]:
            elements = partition(filename=str(file_path))
            for element in elements:
                all_elements.append({
                    "source": file_path.name,
                    "type": element.category,
                    "text": element.text
                })

    return all_elements

# Usage
documents = process_documents("./docs")
print(f"Extracted {len(documents)} elements from all documents")

Next Steps

Now that you can parse text documents, learn how to: