Chunking Strategies for RAG¶
Chunking is the step where you split parsed text into smaller pieces (“chunks”) before embedding and storing them.
Chunking matters because it directly affects:
- Retrieval recall: can you retrieve the right information?
- Answer quality: does the LLM get enough context to answer correctly?
- Cost and latency: how many tokens do you store and send to the LLM?
If your RAG system feels “random”, chunking is often the root cause.
Baseline recommendations (good default)¶
If you don’t know where to start, use these defaults:
- Prefer structure-aware chunking when possible (Markdown headings, HTML headings, PDF layout blocks)
- Target ~200–400 tokens per chunk
- Use ~10–20% overlap (e.g. 300 tokens with 50 overlap)
- Store chunk metadata:
    - `doc_id`
    - `source` (file path or URL)
    - `section_path` (e.g. `"Pricing > Enterprise"`)
    - `chunk_index`
    - `start`/`end` offsets if available
Then evaluate and iterate (see: Evaluating RAG).
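Putting the defaults together, a stored chunk record might look like the sketch below. The field values are illustrative, not a required schema; adapt the names to whatever your vector store expects.

```python
# One stored chunk with the recommended metadata attached.
# All values here are made-up examples for illustration.
chunk_record = {
    "doc_id": "doc-0042",
    "source": "docs/pricing.md",             # file path or URL
    "section_path": "Pricing > Enterprise",  # heading trail for context
    "chunk_index": 3,                        # position within the section
    "start": 1200,                           # character offsets, if available
    "end": 1850,
    "text": "Enterprise plans include ...",
}
```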
Chunking strategies¶
1) Fixed-size chunks (simple)¶
Split by token/character counts with overlap.
Pros:

- easy and fast
- predictable chunk sizes

Cons:

- can cut concepts mid-sentence/section
- headings can get separated from content (hurts retrieval)
2) Recursive chunking (delimiter hierarchy)¶
Split using a hierarchy: 1) headings → 2) paragraphs → 3) sentences → 4) tokens
This is a strong general-purpose strategy because it tries to keep coherent units together.
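A minimal sketch of the idea, using a character budget and a separator hierarchy (both `max_chars` and the separator list are illustrative defaults, not a standard): try the coarsest separator first, and only recurse to finer ones when a piece is still too large.

```python
# Split with coarse separators first; fall back to finer ones only
# when a piece still exceeds the budget.
SEPARATORS = ["\n\n", "\n", ". ", " "]

def recursive_split(text: str, max_chars: int = 500, seps=SEPARATORS) -> list[str]:
    if len(text) <= max_chars or not seps:
        return [text]
    sep, rest = seps[0], seps[1:]
    chunks: list[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= max_chars:
            current = candidate
        else:
            if current:
                chunks.append(current)
            if len(piece) > max_chars:
                # This piece alone is too big: recurse with finer separators.
                chunks.extend(recursive_split(piece, max_chars, rest))
                current = ""
            else:
                current = piece
    if current:
        chunks.append(current)
    return chunks
```

Production splitters (e.g. LangChain's `RecursiveCharacterTextSplitter`) follow the same pattern with token-aware budgets and overlap.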
3) Sentence / semantic chunking¶
Split by sentences or semantic boundaries.
Pros:

- chunks are usually coherent

Cons:

- implementation is more complex
- can be slower
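As a rough sketch, here is a naive sentence-based chunker that splits on sentence-ending punctuation and packs whole sentences into chunks up to a character budget. The regex is a deliberate simplification; real systems often use an NLP sentence segmenter (spaCy, NLTK) instead.

```python
import re

# Naive sentence boundary: whitespace preceded by ., !, or ?
SENTENCE_END = re.compile(r"(?<=[.!?])\s+")

def chunk_by_sentences(text: str, max_chars: int = 400) -> list[str]:
    sentences = SENTENCE_END.split(text.strip())
    chunks: list[str] = []
    current: list[str] = []
    size = 0
    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the budget.
        if current and size + len(sentence) + 1 > max_chars:
            chunks.append(" ".join(current))
            current, size = [], 0
        current.append(sentence)
        size += len(sentence) + 1
    if current:
        chunks.append(" ".join(current))
    return chunks
```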
4) Sliding-window overlap¶
Use overlap so that information near a boundary appears in both chunks.
Pros:

- improves recall for boundary cases

Cons:

- more chunks → higher storage cost
5) Parent-child chunking (advanced)¶
Store:

- a parent chunk (larger section)
- multiple child chunks (smaller pieces)
Retrieve children for precision, but keep parent available for “expand context” when answering.
This helps when:

- your documents are long
- answers need a larger surrounding context than a single chunk
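A minimal sketch of the data layout, assuming a simple character-based child splitter for illustration (`child_chars`, the ID format, and the field names are all hypothetical): index only the children for precise retrieval, and keep a `parent_id` link so the larger section can be swapped in at answer time.

```python
# Build parent records (large sections) plus child records (small pieces)
# that each point back to their parent for context expansion.
def build_parent_child(
    sections: list[tuple[str, str]], child_chars: int = 200
) -> tuple[list[dict], list[dict]]:
    parents: list[dict] = []
    children: list[dict] = []
    for p_idx, (section_path, text) in enumerate(sections):
        parent_id = f"parent-{p_idx}"
        parents.append({"id": parent_id, "section_path": section_path, "text": text})
        for start in range(0, len(text), child_chars):
            children.append({
                "parent_id": parent_id,  # link back for "expand context"
                "chunk_index": start // child_chars,
                "text": text[start:start + child_chars],
            })
    return parents, children
```

Embed and search only the children; when building the LLM prompt, look up each hit's `parent_id` and send the parent text instead.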
Token-aware chunk sizing (recommended)¶
Tokens are the unit that LLMs bill by and process. "1,000 characters" is not a stable proxy for token count.
Install¶
```bash
pip install tiktoken
```
Count tokens¶
```python
import tiktoken

def count_tokens(text: str, *, encoding_name: str = "o200k_base") -> int:
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))
```
A simple token-based chunker¶
```python
import tiktoken

def chunk_by_tokens(
    text: str,
    *,
    max_tokens: int = 300,
    overlap_tokens: int = 50,
    encoding_name: str = "o200k_base",
) -> list[str]:
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)

    chunks: list[str] = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end >= len(tokens):
            break
        # Step back by the overlap so boundary content appears in both chunks.
        start = max(0, end - overlap_tokens)
    return chunks
```
Structure-aware chunking (Markdown example)¶
Keeping headings attached to their content substantially improves retrieval. Here's a simple Markdown section splitter:
```python
import re

HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)\s*$")

def split_markdown_sections(md: str) -> list[tuple[str, str]]:
    sections: list[tuple[str, str]] = []
    heading_stack: list[str] = []
    current_lines: list[str] = []
    current_path = "Document"

    def flush():
        nonlocal current_lines
        text = "\n".join(current_lines).strip()
        if text:
            sections.append((current_path, text))
        current_lines = []

    for line in md.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            title = match.group(2).strip()
            # Truncate the stack to this heading's parent level, then descend.
            heading_stack[:] = heading_stack[: max(0, level - 1)]
            heading_stack.append(title)
            current_path = " > ".join(heading_stack)
            current_lines.append(line)  # keep the heading with its section text
        else:
            current_lines.append(line)
    flush()
    return sections
```
Then combine it with the token chunker:
```python
def chunk_markdown(md: str) -> list[dict]:
    out: list[dict] = []
    for section_path, section_text in split_markdown_sections(md):
        for idx, chunk in enumerate(chunk_by_tokens(section_text)):
            out.append(
                {
                    "section_path": section_path,
                    "chunk_index": idx,
                    "text": chunk,
                }
            )
    return out
```
Common failure modes (and fixes)¶
“It retrieves irrelevant chunks”¶
- Chunks too small → increase size; keep headings/metadata
- Too much boilerplate (web pages) → improve parsing; remove nav/footer
- Missing metadata filters → add `source`, `doc_type`, `product`, `language`
“It retrieves the right chunk but still answers wrong”¶
- Prompt isn’t grounded → see: Prompt Engineering for RAG
- Context formatting is messy → add chunk IDs + clear separators
- Too many chunks → reduce `k`, dedupe by doc, or rerank
“It can’t answer questions that require multiple chunks”¶
- Chunk size too small → increase size or add parent-child
- Retrieval returns duplicates → enforce diversity (MMR / per-doc cap)
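The per-doc cap mentioned above can be sketched as a post-retrieval filter. The `hits` shape here (tuples of `doc_id`, score, text) is an illustrative assumption; adapt it to whatever your retriever returns.

```python
# Keep at most `per_doc_cap` chunks per document, highest-scoring first,
# so one document can't crowd out the rest of the result set.
def cap_per_doc(
    hits: list[tuple[str, float, str]], per_doc_cap: int = 2
) -> list[tuple[str, float, str]]:
    counts: dict[str, int] = {}
    kept: list[tuple[str, float, str]] = []
    for doc_id, score, text in sorted(hits, key=lambda h: h[1], reverse=True):
        if counts.get(doc_id, 0) < per_doc_cap:
            kept.append((doc_id, score, text))
            counts[doc_id] = counts.get(doc_id, 0) + 1
    return kept
```

MMR goes further by trading off relevance against similarity to already-selected chunks, but a per-doc cap is a cheap first step.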
Optional: framework helpers¶
LangChain
- `RecursiveCharacterTextSplitter` for delimiter-based splitting
- `MarkdownHeaderTextSplitter` for section-aware chunking
LlamaIndex
- `SimpleNodeParser` / `MarkdownNodeParser` for structured chunking into nodes
Next steps¶
- Generate embeddings for your chunks: Understanding Embeddings
- Store and search them with pgvector: Vector Stores for RAG (Postgres + pgvector)