Chunking Strategies for RAG

Chunking is the step where you split parsed text into smaller pieces (“chunks”) before embedding and storing them.

Chunking matters because it directly affects:

  • Retrieval recall: can you retrieve the right information?
  • Answer quality: does the LLM get enough context to answer correctly?
  • Cost and latency: how many tokens do you store and send to the LLM?

If your RAG system feels “random”, chunking is often the root cause.


Baseline recommendations (good default)

If you don’t know where to start, use these defaults:

  • Prefer structure-aware chunking when possible (Markdown headings, HTML headings, PDF layout blocks)
  • Target ~200–400 tokens per chunk
  • Use ~10–20% overlap (e.g. 300 tokens with 50 overlap)
  • Store chunk metadata:
      • doc_id
      • source (file path or URL)
      • section_path (e.g. "Pricing > Enterprise")
      • chunk_index
      • start/end offsets if available
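As a concrete sketch, a stored chunk record might look like the following. The field names `start_char` and `end_char` are assumed names for the optional start/end offsets; the rest match the checklist above:

```python
# A hypothetical chunk record. doc_id, source, section_path, and chunk_index
# come from the metadata checklist above; start_char/end_char are assumed
# names for the optional start/end offsets.
chunk = {
    "doc_id": "doc-123",
    "source": "docs/pricing.md",
    "section_path": "Pricing > Enterprise",
    "chunk_index": 0,
    "start_char": 0,
    "end_char": 842,
    "text": "Enterprise plans include ...",
}
```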

Then evaluate and iterate (see: Evaluating RAG).


Chunking strategies

1) Fixed-size chunks (simple)

Split by token/character counts with overlap.

Pros:

  • easy and fast
  • predictable chunk sizes

Cons:

  • can cut concepts mid-sentence or mid-section
  • headings can get separated from content (hurts retrieval)

2) Recursive chunking (delimiter hierarchy)

Split using a hierarchy: 1) headings → 2) paragraphs → 3) sentences → 4) tokens

This is a strong general-purpose strategy because it tries to keep coherent units together.
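A minimal character-based sketch of the idea (not a production implementation; libraries like LangChain ship a more robust version): try each separator in order, pack the resulting pieces up to a size budget, and recurse on any piece that is still too large.

```python
def recursive_chunk(text, max_chars=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split text with a separator hierarchy: paragraphs, then lines,
    then sentences, then words. Falls back to a hard cut if nothing matches."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            # Re-attach the separator to every part except the last.
            pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
            merged, current = [], ""
            for piece in pieces:
                # Flush the running chunk before it would exceed the budget.
                if current and len(current) + len(piece) > max_chars:
                    merged.append(current.strip())
                    current = ""
                current += piece
            if current.strip():
                merged.append(current.strip())
            # Recurse in case a single piece is still too large.
            return [c for m in merged for c in recursive_chunk(m, max_chars, separators)]
    # No separator found: hard cut by character count.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```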

3) Sentence / semantic chunking

Split by sentences or semantic boundaries.

Pros:

  • chunks are usually coherent

Cons:

  • implementation is more complex
  • can be slower
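The sentence variant can be sketched with a naive regex split plus greedy packing (true semantic chunking additionally compares embedding similarity between neighbors, which is omitted here):

```python
import re


def chunk_by_sentences(text, max_chars=500):
    """Naive sentence chunker: split on sentence-ending punctuation,
    then greedily pack sentences into chunks up to max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        # +1 accounts for the joining space.
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```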

4) Sliding-window overlap

Use overlap so that information near a boundary appears in both chunks.

Pros:

  • improves recall for boundary cases

Cons:

  • more chunks → higher storage cost
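A character-based sketch makes the mechanics visible: each step advances by (window − overlap), so the last `overlap` characters of one chunk reappear at the start of the next.

```python
def sliding_window(text, window=100, overlap=20):
    """Character-based sliding window with overlap between adjacent chunks."""
    if len(text) <= window:
        return [text]
    step = window - overlap  # assumes overlap < window
    return [text[i:i + window] for i in range(0, len(text) - overlap, step)]
```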

5) Parent-child chunking (advanced)

Store:

  • a parent chunk (larger section)
  • multiple child chunks (smaller pieces)

Retrieve children for precision, but keep the parent available for “expand context” when answering.

This helps when:

  • your documents are long
  • answers need a larger surrounding context than a single chunk
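A minimal sketch of the pattern, assuming in-memory storage and a naive fixed-size child splitter (swap in any chunker from above): children are what you embed and retrieve; the `parent_id` link lets you expand at answer time.

```python
def build_parent_child(sections, child_size=300):
    """Store large parent sections plus small child chunks that point back
    to their parent via parent_id."""
    parents, children = {}, []
    for pid, section_text in enumerate(sections):
        parents[pid] = section_text
        # Naive fixed-size children; replace with any chunker you prefer.
        for i in range(0, len(section_text), child_size):
            children.append({"parent_id": pid, "text": section_text[i:i + child_size]})
    return parents, children


def expand(child, parents):
    """At answer time, swap a retrieved child for its full parent section."""
    return parents[child["parent_id"]]
```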


Counting tokens with tiktoken

Tokens are the unit that LLMs bill and process; “1,000 characters” is not a stable proxy for token count.

Install

pip install tiktoken

Count tokens

import tiktoken


def count_tokens(text: str, *, encoding_name: str = "o200k_base") -> int:
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

A simple token-based chunker

import tiktoken


def chunk_by_tokens(
    text: str,
    *,
    max_tokens: int = 300,
    overlap_tokens: int = 50,
    encoding_name: str = "o200k_base",
) -> list[str]:
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)

    chunks: list[str] = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end >= len(tokens):
            break
        # Assumes overlap_tokens < max_tokens; otherwise the window never advances.
        start = max(0, end - overlap_tokens)

    return chunks

Structure-aware chunking (Markdown example)

Keeping headings attached to their content significantly improves retrieval. Here’s a simple Markdown section splitter:

import re


HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)\s*$")


def split_markdown_sections(md: str) -> list[tuple[str, str]]:
    sections: list[tuple[str, str]] = []
    heading_stack: list[str] = []
    current_lines: list[str] = []
    current_path = "Document"

    def flush():
        nonlocal current_lines
        text = "\n".join(current_lines).strip()
        if text:
            sections.append((current_path, text))
        current_lines = []

    for line in md.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            title = match.group(2).strip()
            heading_stack[:] = heading_stack[: max(0, level - 1)]
            heading_stack.append(title)
            current_path = " > ".join(heading_stack)
            current_lines.append(line)
        else:
            current_lines.append(line)

    flush()
    return sections

Then combine it with the token chunker:

def chunk_markdown(md: str) -> list[dict]:
    out: list[dict] = []
    for section_path, section_text in split_markdown_sections(md):
        for idx, chunk in enumerate(chunk_by_tokens(section_text)):
            out.append(
                {
                    "section_path": section_path,
                    "chunk_index": idx,
                    "text": chunk,
                }
            )
    return out

Common failure modes (and fixes)

“It retrieves irrelevant chunks”

  • Chunks too small → increase size; keep headings/metadata
  • Too much boilerplate (web pages) → improve parsing; remove nav/footer
  • Missing metadata filters → add source, doc_type, product, language
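A metadata filter can be as simple as the sketch below (assuming chunk records shaped like the baseline metadata above; `filter_chunks` is a hypothetical helper name):

```python
def filter_chunks(chunks, **filters):
    """Keep only chunks whose metadata matches every filter,
    e.g. filter_chunks(chunks, language="en", doc_type="faq")."""
    return [c for c in chunks if all(c.get(k) == v for k, v in filters.items())]
```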

“It retrieves the right chunk but still answers wrong”

  • Prompt isn’t grounded → see: Prompt Engineering for RAG
  • Context formatting is messy → add chunk IDs + clear separators
  • Too many chunks → reduce k, dedupe by doc, or rerank
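One way to clean up context formatting, as a sketch (numbered chunk IDs plus a visible separator so the model can tell chunks apart and cite them):

```python
def format_context(chunks):
    """Number each chunk, label its source, and join with a clear separator."""
    blocks = [
        f"[chunk {i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, 1)
    ]
    return "\n\n---\n\n".join(blocks)
```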

“It can’t answer questions that require multiple chunks”

  • Chunk size too small → increase size or add parent-child
  • Retrieval returns duplicates → enforce diversity (MMR / per-doc cap)
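A per-doc cap is the simpler of the two diversity options and can be sketched in a few lines (assumes the input list is already ranked best-first and each chunk carries a doc_id):

```python
def cap_per_doc(ranked_chunks, max_per_doc=2):
    """Enforce diversity: keep at most max_per_doc chunks per document,
    preserving the incoming relevance order."""
    counts, kept = {}, []
    for chunk in ranked_chunks:  # assumed sorted by relevance, best first
        doc = chunk["doc_id"]
        if counts.get(doc, 0) < max_per_doc:
            kept.append(chunk)
            counts[doc] = counts.get(doc, 0) + 1
    return kept
```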

Optional: framework helpers

LangChain

  • RecursiveCharacterTextSplitter for delimiter-based splitting
  • MarkdownHeaderTextSplitter for section-aware chunking

LlamaIndex

  • SimpleNodeParser / MarkdownNodeParser for structured chunking into nodes

Next steps