Chunking Strategies for RAG

Chunking is the step where you split parsed text into smaller pieces (“chunks”) before embedding and storing them.

Chunking matters because it directly affects:

  • Retrieval recall: can you retrieve the right information?
  • Answer quality: does the LLM get enough context to answer correctly?
  • Cost and latency: how many tokens do you store and send to the LLM?

If your RAG system feels “random”, chunking is often the root cause.


Baseline recommendations (good default)

If you don’t know where to start, use these defaults:

  • Prefer structure-aware chunking when possible (Markdown headings, HTML headings, PDF layout blocks)
  • Target ~200–400 tokens per chunk
  • Use ~10–20% overlap (e.g. 300 tokens with 50 overlap)
  • Store chunk metadata:
      • doc_id
      • source (file path or URL)
      • section_path (e.g. "Pricing > Enterprise")
      • chunk_index
      • start/end offsets if available
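As a concrete sketch, a stored chunk record might look like the following. The field names `start_char` and `end_char` are assumed names for the optional start/end offsets; the rest match the checklist above:

```python
# A hypothetical chunk record. doc_id, source, section_path, and chunk_index
# come from the metadata checklist above; start_char/end_char are assumed
# names for the optional start/end offsets.
chunk = {
    "doc_id": "doc-123",
    "source": "docs/pricing.md",
    "section_path": "Pricing > Enterprise",
    "chunk_index": 0,
    "start_char": 0,
    "end_char": 842,
    "text": "Enterprise plans include ...",
}
```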

Then evaluate and iterate (see: Evaluating RAG).


Chunking strategies

1) Fixed-size chunks (simple)

Split by token/character counts with overlap.

Pros:

  • easy and fast
  • predictable chunk sizes

Cons:

  • can cut concepts mid-sentence or mid-section
  • headings can get separated from content (hurts retrieval)

2) Recursive chunking (delimiter hierarchy)

Split using a hierarchy: 1) headings → 2) paragraphs → 3) sentences → 4) tokens

This is a strong general-purpose strategy because it tries to keep coherent units together.
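A minimal character-based sketch of the idea (not a production implementation; libraries like LangChain ship a more robust version): try each separator in order, pack the resulting pieces up to a size budget, and recurse on any piece that is still too large.

```python
def recursive_chunk(text, max_chars=1000, separators=("\n\n", "\n", ". ", " ")):
    """Split text with a separator hierarchy: paragraphs, then lines,
    then sentences, then words. Falls back to a hard cut if nothing matches."""
    if len(text) <= max_chars:
        return [text]
    for sep in separators:
        parts = text.split(sep)
        if len(parts) > 1:
            # Re-attach the separator to every part except the last.
            pieces = [p + sep for p in parts[:-1]] + [parts[-1]]
            merged, current = [], ""
            for piece in pieces:
                # Flush the running chunk before it would exceed the budget.
                if current and len(current) + len(piece) > max_chars:
                    merged.append(current.strip())
                    current = ""
                current += piece
            if current.strip():
                merged.append(current.strip())
            # Recurse in case a single piece is still too large.
            return [c for m in merged for c in recursive_chunk(m, max_chars, separators)]
    # No separator found: hard cut by character count.
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]
```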

3) Sentence / semantic chunking

Split by sentences or semantic boundaries.

Pros:

  • chunks are usually coherent

Cons:

  • implementation is more complex
  • can be slower
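The sentence variant can be sketched with a naive regex split plus greedy packing (true semantic chunking additionally compares embedding similarity between neighbors, which is omitted here):

```python
import re


def chunk_by_sentences(text, max_chars=500):
    """Naive sentence chunker: split on sentence-ending punctuation,
    then greedily pack sentences into chunks up to max_chars."""
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current = [], ""
    for sent in sentences:
        # +1 accounts for the joining space.
        if current and len(current) + len(sent) + 1 > max_chars:
            chunks.append(current)
            current = sent
        else:
            current = f"{current} {sent}".strip()
    if current:
        chunks.append(current)
    return chunks
```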

4) Sliding-window overlap

Use overlap so that information near a boundary appears in both chunks.

Pros:

  • improves recall for boundary cases

Cons:

  • more chunks → higher storage cost
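A character-based sketch makes the mechanics visible: each step advances by (window − overlap), so the last `overlap` characters of one chunk reappear at the start of the next.

```python
def sliding_window(text, window=100, overlap=20):
    """Character-based sliding window with overlap between adjacent chunks."""
    if len(text) <= window:
        return [text]
    step = window - overlap  # assumes overlap < window
    return [text[i:i + window] for i in range(0, len(text) - overlap, step)]
```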

5) Parent-child chunking (advanced)

Store:

  • a parent chunk (larger section)
  • multiple child chunks (smaller pieces)

Retrieve children for precision, but keep the parent available for “expand context” when answering.

This helps when:

  • your documents are long
  • answers need a larger surrounding context than a single chunk
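A minimal sketch of the pattern, assuming in-memory storage and a naive fixed-size child splitter (swap in any chunker from above): children are what you embed and retrieve; the `parent_id` link lets you expand at answer time.

```python
def build_parent_child(sections, child_size=300):
    """Store large parent sections plus small child chunks that point back
    to their parent via parent_id."""
    parents, children = {}, []
    for pid, section_text in enumerate(sections):
        parents[pid] = section_text
        # Naive fixed-size children; replace with any chunker you prefer.
        for i in range(0, len(section_text), child_size):
            children.append({"parent_id": pid, "text": section_text[i:i + child_size]})
    return parents, children


def expand(child, parents):
    """At answer time, swap a retrieved child for its full parent section."""
    return parents[child["parent_id"]]
```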


Counting tokens with tiktoken

Tokens are the unit that LLMs bill and process; “1,000 characters” is not a stable proxy for token count.

Install

pip install tiktoken

Count tokens

import tiktoken


def count_tokens(text: str, *, encoding_name: str = "o200k_base") -> int:
    enc = tiktoken.get_encoding(encoding_name)
    return len(enc.encode(text))

A simple token-based chunker

import tiktoken


def chunk_by_tokens(
    text: str,
    *,
    max_tokens: int = 300,
    overlap_tokens: int = 50,
    encoding_name: str = "o200k_base",
) -> list[str]:
    enc = tiktoken.get_encoding(encoding_name)
    tokens = enc.encode(text)

    chunks: list[str] = []
    start = 0
    while start < len(tokens):
        end = min(start + max_tokens, len(tokens))
        chunks.append(enc.decode(tokens[start:end]))
        if end >= len(tokens):
            break
        # Assumes overlap_tokens < max_tokens; otherwise the window never advances.
        start = max(0, end - overlap_tokens)

    return chunks

Structure-aware chunking (Markdown example)

Keeping headings attached to their content significantly improves retrieval. Here’s a simple Markdown section splitter:

import re


HEADING_RE = re.compile(r"^(#{1,6})\s+(.*)\s*$")


def split_markdown_sections(md: str) -> list[tuple[str, str]]:
    sections: list[tuple[str, str]] = []
    heading_stack: list[str] = []
    current_lines: list[str] = []
    current_path = "Document"

    def flush():
        nonlocal current_lines
        text = "\n".join(current_lines).strip()
        if text:
            sections.append((current_path, text))
        current_lines = []

    for line in md.splitlines():
        match = HEADING_RE.match(line)
        if match:
            flush()
            level = len(match.group(1))
            title = match.group(2).strip()
            heading_stack[:] = heading_stack[: max(0, level - 1)]
            heading_stack.append(title)
            current_path = " > ".join(heading_stack)
            current_lines.append(line)
        else:
            current_lines.append(line)

    flush()
    return sections

Then combine it with the token chunker:

def chunk_markdown(md: str) -> list[dict]:
    out: list[dict] = []
    for section_path, section_text in split_markdown_sections(md):
        for idx, chunk in enumerate(chunk_by_tokens(section_text)):
            out.append(
                {
                    "section_path": section_path,
                    "chunk_index": idx,
                    "text": chunk,
                }
            )
    return out

Common failure modes (and fixes)

“It retrieves irrelevant chunks”

  • Chunks too small → increase size; keep headings/metadata
  • Too much boilerplate (web pages) → improve parsing; remove nav/footer
  • Missing metadata filters → add source, doc_type, product, language
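A metadata filter can be as simple as the sketch below (assuming chunk records shaped like the baseline metadata above; `filter_chunks` is a hypothetical helper name):

```python
def filter_chunks(chunks, **filters):
    """Keep only chunks whose metadata matches every filter,
    e.g. filter_chunks(chunks, language="en", doc_type="faq")."""
    return [c for c in chunks if all(c.get(k) == v for k, v in filters.items())]
```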

“It retrieves the right chunk but still answers wrong”

  • Prompt isn’t grounded → see: Prompt Engineering for RAG
  • Context formatting is messy → add chunk IDs + clear separators
  • Too many chunks → reduce k, dedupe by doc, or rerank
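One way to clean up context formatting, as a sketch (numbered chunk IDs plus a visible separator so the model can tell chunks apart and cite them):

```python
def format_context(chunks):
    """Number each chunk, label its source, and join with a clear separator."""
    blocks = [
        f"[chunk {i}] (source: {c['source']})\n{c['text']}"
        for i, c in enumerate(chunks, 1)
    ]
    return "\n\n---\n\n".join(blocks)
```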

“It can’t answer questions that require multiple chunks”

  • Chunk size too small → increase size or add parent-child
  • Retrieval returns duplicates → enforce diversity (MMR / per-doc cap)
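A per-doc cap is the simpler of the two diversity options and can be sketched in a few lines (assumes the input list is already ranked best-first and each chunk carries a doc_id):

```python
def cap_per_doc(ranked_chunks, max_per_doc=2):
    """Enforce diversity: keep at most max_per_doc chunks per document,
    preserving the incoming relevance order."""
    counts, kept = {}, []
    for chunk in ranked_chunks:  # assumed sorted by relevance, best first
        doc = chunk["doc_id"]
        if counts.get(doc, 0) < max_per_doc:
            kept.append(chunk)
            counts[doc] = counts.get(doc, 0) + 1
    return kept
```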

Optional: framework helpers

LangChain

  • RecursiveCharacterTextSplitter for delimiter-based splitting
  • MarkdownHeaderTextSplitter for section-aware chunking

LlamaIndex

  • SimpleNodeParser / MarkdownNodeParser for structured chunking into nodes

Next steps