Parsing Text Documents¶
Plain text files (.txt, .md, .rst) are the simplest document format, but proper parsing is still important for RAG systems. Unlike binary formats, text files are human-readable, yet they can still carry inconsistent encodings, irregular formatting, and implicit structure that needs to be identified.
In this tutorial, you'll learn how to parse text files using:
- Native Python - Direct file handling for simple cases
- Unstructured - Intelligent parsing that identifies document structure
Why Parse Text Files?¶
While text files seem straightforward, there are several challenges:
- Character encoding - Files may use UTF-8, Latin-1, or other encodings
- Line endings - Different systems use `\n`, `\r\n`, or `\r`
- Structure detection - Identifying headers, paragraphs, and lists
- Cleaning - Removing extra whitespace, special characters, or formatting
For simple text extraction, native Python is sufficient. For structure-aware parsing (identifying titles, sections, lists), Unstructured provides more intelligent processing.
Why These Two Methods?¶
There are many ways to read text files in Python. Here's why we focus on these two approaches:
| Method | Structure Detection | Dependencies | Speed | Best Use Case |
|---|---|---|---|---|
| Native Python | ❌ Manual | None | ⚡ Fastest | Simple extraction, logs, config files |
| Unstructured | ✅ Automatic | unstructured | Fast | Documents with headers, lists, sections |
| Regex parsing | ❌ Manual | None | Fast | Pattern-based extraction |
| NLTK/spaCy | ⚠️ Sentence-level | Heavy | Slow | NLP tasks, not document parsing |
Why not just use regex or simple `file.read()`?
For basic text files, `file.read()` works fine. However, as your RAG system grows, you'll encounter:
- Mixed content - Documents with headers, paragraphs, and lists that need semantic understanding
- Consistency - Unstructured provides uniform element types across all document formats (text, PDF, HTML)
- Pipeline integration - Using the same library for text as you do for PDFs simplifies your codebase
When Native Python is the right choice:
- Log files, configuration files, or simple notes
- Maximum speed is critical (processing millions of files)
- You need zero external dependencies
- The text structure is consistent and predictable
When Unstructured is the right choice:
- Documents with varying structure (headers, lists, paragraphs)
- You're already using Unstructured for PDFs or other formats
- You want automatic element classification for downstream processing
- Building a production RAG system with multiple document types
Method 1: Using Native Python¶
Native Python is the best choice when you need speed, simplicity, or zero dependencies. It gives you full control over how text is processed, but you must handle encoding detection and text cleaning yourself.
For straightforward text files, Python's built-in file handling is efficient and requires no additional dependencies.
Basic Usage¶
```python
def read_text_file(file_path: str) -> str:
    """Read a text file with proper encoding handling."""
    with open(file_path, "r", encoding="utf-8") as f:
        return f.read()

# Usage
text = read_text_file("document.txt")
print(f"Read {len(text)} characters")
```
Handling Different Encodings¶
Real-world files may not always be UTF-8. Here's a robust approach:
```python
import chardet

def read_text_with_encoding_detection(file_path: str) -> str:
    """Read a text file with automatic encoding detection."""
    # First, detect the encoding
    with open(file_path, "rb") as f:
        raw_data = f.read()
    detected = chardet.detect(raw_data)
    # chardet returns {"encoding": None, ...} when detection fails,
    # so fall back explicitly rather than relying on a dict default
    encoding = detected["encoding"] or "utf-8"

    # Then read with the detected encoding
    with open(file_path, "r", encoding=encoding) as f:
        return f.read()

# Usage
text = read_text_with_encoding_detection("legacy-document.txt")
```
Installing chardet
Install the encoding detection library with:
```bash
pip install chardet
```
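If you'd rather avoid the extra dependency, a simple fallback loop over a few likely encodings is often good enough. This is a minimal sketch, not part of the chardet approach above; the choice and ordering of encodings is an assumption you should adjust for your own data:

```python
def read_text_with_fallback(file_path: str) -> str:
    """Try common encodings in order of likelihood."""
    # latin-1 maps every possible byte, so the loop always succeeds eventually
    for encoding in ("utf-8", "cp1252", "latin-1"):
        try:
            with open(file_path, "r", encoding=encoding) as f:
                return f.read()
        except UnicodeDecodeError:
            continue
    raise RuntimeError(f"Could not decode {file_path}")  # unreachable, kept for clarity

# Usage
# text = read_text_with_fallback("legacy-document.txt")
```

This trades accuracy for simplicity: unlike chardet, it can silently mis-decode text that happens to be valid in an earlier encoding.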
Cleaning and Normalizing Text¶
```python
import re

def clean_text(text: str) -> str:
    """Clean and normalize text for RAG ingestion."""
    # Normalize line endings
    text = text.replace("\r\n", "\n").replace("\r", "\n")

    # Remove excessive whitespace
    text = re.sub(r"\n{3,}", "\n\n", text)  # Max 2 newlines
    text = re.sub(r"[ \t]+", " ", text)  # Collapse spaces

    # Strip leading/trailing whitespace from lines
    lines = [line.strip() for line in text.split("\n")]
    text = "\n".join(lines)

    return text.strip()

# Usage
raw_text = read_text_file("messy-document.txt")
clean = clean_text(raw_text)
```
Complete RAG-Ready Example (Native Python)¶
```python
import re
from pathlib import Path

import chardet

def extract_text_from_file(file_path: str) -> str:
    """Extract and clean text from a file for RAG ingestion."""
    path = Path(file_path)

    # Detect encoding, falling back to UTF-8 if detection fails
    with open(path, "rb") as f:
        raw_data = f.read()
    detected = chardet.detect(raw_data)
    encoding = detected["encoding"] or "utf-8"

    # Read with detected encoding
    with open(path, "r", encoding=encoding, errors="replace") as f:
        text = f.read()

    # Clean the text
    text = text.replace("\r\n", "\n").replace("\r", "\n")
    text = re.sub(r"\n{3,}", "\n\n", text)
    text = re.sub(r"[ \t]+", " ", text)

    return text.strip()

# Usage
text = extract_text_from_file("report.txt")
print(f"Extracted {len(text)} characters")
```
Method 2: Using Unstructured¶
Why upgrade from Native Python?
While native Python handles basic text extraction well, it treats all text the same—it doesn't understand that a line in all caps might be a title, or that lines starting with `-` are list items. This matters for RAG because:
- Better chunking - Knowing that something is a title helps you keep it with its content when chunking
- Richer metadata - Element types can be filtered or weighted differently in retrieval
- Unified pipeline - If you're using Unstructured for PDFs, using it for text files means one consistent API
Unstructured provides intelligent parsing that identifies document structure within text files—detecting titles, headers, paragraphs, and lists automatically.
Installation¶
```bash
pip install unstructured
```
Basic Usage¶
```python
from unstructured.partition.text import partition_text

# Parse a text file
elements = partition_text(filename="document.txt")

# Print each extracted element
for element in elements:
    print(f"[{type(element).__name__}]: {element.text[:100]}...")
```
Understanding Element Types¶
Unstructured classifies text into semantic elements:
| Element Type | Description |
|---|---|
| `Title` | Section headers and titles |
| `NarrativeText` | Regular paragraphs |
| `ListItem` | Bulleted or numbered list items |
| `Header` | Document headers |
| `Footer` | Document footers |
| `UncategorizedText` | Text that doesn't fit other categories |
```python
from unstructured.partition.text import partition_text

elements = partition_text(filename="structured-document.txt")

# Filter by element type
titles = [el for el in elements if el.category == "Title"]
paragraphs = [el for el in elements if el.category == "NarrativeText"]

print(f"Found {len(titles)} titles and {len(paragraphs)} paragraphs")
```
Parsing from String Instead of File¶
You can also parse text directly without a file:
```python
from unstructured.partition.text import partition_text

raw_text = """
# Introduction

This is the first paragraph of the document.
It contains important information.

## Features

- Feature one
- Feature two
- Feature three
"""

elements = partition_text(text=raw_text)

for element in elements:
    print(f"[{element.category}]: {element.text}")
```
Parsing Markdown Files¶
Unstructured can also parse Markdown files with better structure awareness:
```python
from unstructured.partition.md import partition_md

# Parse a Markdown file
elements = partition_md(filename="README.md")

# Headers in Markdown become Titles
for element in elements:
    print(f"[{element.category}]: {element.text[:80]}")
```
Accessing Metadata¶
Each element includes useful metadata:
```python
from unstructured.partition.text import partition_text

elements = partition_text(filename="long-document.txt")

for element in elements:
    print(f"Type: {element.category}")
    print(f"Text: {element.text[:100]}...")

    # Access metadata
    if hasattr(element, "metadata"):
        print(f"Filename: {element.metadata.filename}")
        print(f"Filetype: {element.metadata.filetype}")
    print("---")
```
Complete RAG-Ready Example (Unstructured)¶
```python
from unstructured.partition.text import partition_text

def extract_text_with_structure(file_path: str) -> str:
    """Extract text from file with structure awareness."""
    elements = partition_text(filename=file_path)

    # Combine all text elements with proper spacing
    text_parts = []
    for element in elements:
        if hasattr(element, "text") and element.text:
            # Add extra newline before titles for readability
            if element.category == "Title":
                text_parts.append(f"\n{element.text}")
            else:
                text_parts.append(element.text)

    return "\n\n".join(text_parts)

# Usage
text = extract_text_with_structure("technical-doc.txt")
print(f"Extracted {len(text)} characters")
```
Extracting Structured Data¶
For more control, you can extract elements as structured data:
```python
from unstructured.partition.text import partition_text

def extract_as_structured_data(file_path: str) -> list[dict]:
    """Extract text as a list of structured elements."""
    elements = partition_text(filename=file_path)

    structured = []
    for element in elements:
        structured.append({
            "type": element.category,
            "text": element.text,
            "id": element.id,
        })
    return structured

# Usage
data = extract_as_structured_data("document.txt")
for item in data:
    print(f"[{item['type']}] {item['text'][:60]}...")
```
Comparison: Native Python vs Unstructured¶
| Feature | Native Python | Unstructured |
|---|---|---|
| Dependencies | None (or chardet) | unstructured |
| Speed | ⚡ Fastest | Fast |
| Structure Detection | ❌ Manual | ✅ Automatic |
| Element Classification | ❌ | ✅ |
| Markdown Support | Basic | ✅ Full |
| Best For | Simple text extraction | Structure-aware parsing |
When to Choose Each¶
Choose Native Python when:
- You need the fastest possible processing
- The text files are simple and well-formatted
- You don't need to identify document structure
- You want minimal dependencies
Choose Unstructured when:
- You need to identify titles, headers, and lists
- You're processing Markdown or structured text
- You want consistent element classification
- You're building a production RAG pipeline
Handling Common Text Formats¶
RST (reStructuredText) Files¶
```python
from unstructured.partition.rst import partition_rst

elements = partition_rst(filename="documentation.rst")

for element in elements:
    print(f"[{element.category}]: {element.text[:60]}...")
```
Plain Log Files¶
For log files, native Python is often better:
```python
def parse_log_file(file_path: str) -> list[str]:
    """Parse a log file into individual entries."""
    with open(file_path, "r", encoding="utf-8") as f:
        lines = f.readlines()

    # Filter and clean log entries
    entries = [line.strip() for line in lines if line.strip()]
    return entries

# Usage
logs = parse_log_file("application.log")
print(f"Found {len(logs)} log entries")
```
Best Practices for RAG¶
- Handle encoding properly - Always detect or specify the correct encoding
- Clean excessive whitespace - Too many blank lines can affect chunking
- Preserve structure - Use Unstructured when document structure matters
- Normalize line endings - Ensure consistent
\ncharacters across platforms - Consider file size - For very large files, process in chunks to avoid memory issues
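The last point deserves a quick sketch: calling `read()` on a multi-gigabyte file loads it all into memory at once, while iterating in fixed-size pieces keeps memory usage flat. This is a minimal example; the 1 MB chunk size is an arbitrary assumption you can tune:

```python
from typing import Iterator

def iter_text_chunks(file_path: str, chunk_size: int = 1 << 20) -> Iterator[str]:
    """Yield a large text file in fixed-size pieces instead of loading it whole."""
    with open(file_path, "r", encoding="utf-8", errors="replace") as f:
        while True:
            chunk = f.read(chunk_size)  # reads up to chunk_size characters
            if not chunk:
                break
            yield chunk

# Usage: count characters without holding the whole file in memory
# total = sum(len(chunk) for chunk in iter_text_chunks("huge-corpus.txt"))
```

Note that fixed-size chunks can split a word or sentence at the boundary, so for RAG ingestion you would typically feed these pieces into a proper chunking step afterward.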
Processing Multiple Files¶
Batch Processing with Native Python¶
```python
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def process_text_files(directory: str) -> dict[str, str]:
    """Process all text files in a directory."""
    path = Path(directory)
    results = {}

    def read_file(file_path):
        with open(file_path, "r", encoding="utf-8", errors="replace") as f:
            return file_path.name, f.read()

    text_files = list(path.glob("*.txt"))
    with ThreadPoolExecutor(max_workers=4) as executor:
        for name, content in executor.map(read_file, text_files):
            results[name] = content

    return results

# Usage
all_texts = process_text_files("./documents")
print(f"Processed {len(all_texts)} files")
```
Batch Processing with Unstructured¶
```python
from pathlib import Path

from unstructured.partition.auto import partition

def process_documents(directory: str) -> list[dict]:
    """Process multiple documents with automatic type detection."""
    path = Path(directory)
    all_elements = []

    for file_path in path.glob("*"):
        if file_path.suffix in [".txt", ".md", ".rst"]:
            elements = partition(filename=str(file_path))
            for element in elements:
                all_elements.append({
                    "source": file_path.name,
                    "type": element.category,
                    "text": element.text,
                })

    return all_elements

# Usage
documents = process_documents("./docs")
print(f"Extracted {len(documents)} elements from all documents")
```
Next Steps¶
Now that you can parse text documents, learn how to:
- Chunk your parsed text for optimal retrieval
- Parse PDF documents for more complex formats
- Parse web pages for HTML content