Parsing CSVs for RAG

CSV data is structured (rows/columns), which changes how you should build RAG:

  • If your questions are semantic (“Which plan includes SSO?”), embeddings can work well.
  • If your questions are structured (“count by status”, “sum revenue by month”), SQL is the right tool.

This page shows a practical default: row-wise text chunks + metadata, and how to decide when not to embed.


Two approaches

1) Embed rows (small/medium CSVs)

Use embeddings when:

  • users ask “fuzzy” questions (synonyms, paraphrases)
  • each row is a meaningful unit (product, customer, ticket, FAQ entry)
  • you want retrieval over free-text cells (descriptions, notes)

2) Use SQL (large CSVs / strict correctness)

Prefer SQL when:

  • you need exact filters, joins, aggregations
  • numbers must be correct (finance, metrics)
  • the dataset is large enough that embedding every row is expensive

If you’ll end up doing SQL anyway, skip embeddings and build a SQL-first pipeline.

See: SQL for RAG (pgvector + Full‑Text + Structured Filters)
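As a minimal sketch of what SQL-first looks like, you can load the CSV into a database and run exact aggregations directly. This uses an in-memory SQLite database and an illustrative inline DataFrame standing in for your CSV; in practice you would `pd.read_csv(...)` and target Postgres.

```python
import sqlite3

import pandas as pd

# Stand-in for pd.read_csv("orders.csv"); columns are illustrative.
df = pd.DataFrame(
    {
        "status": ["open", "closed", "open"],
        "revenue": [100.0, 250.0, 50.0],
    }
)

conn = sqlite3.connect(":memory:")
df.to_sql("orders", conn, index=False)

# Exact group-by aggregation -- the kind of question embeddings answer poorly.
rows = conn.execute(
    "SELECT status, COUNT(*), SUM(revenue) FROM orders GROUP BY status ORDER BY status"
).fetchall()
print(rows)  # [('closed', 1, 250.0), ('open', 2, 150.0)]
```

The LLM's job then shifts from "recall the right chunk" to "write the right query and explain the result".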


Row-wise embedding: a good default

Install

pip install pandas

Convert a row into RAG-friendly text

A simple formatting convention that works well:

col_1: value
col_2: value
...
from __future__ import annotations

from pathlib import Path
from typing import Any

import pandas as pd


def row_to_text(row: pd.Series) -> str:
    # Render a row as "column: value" lines, skipping empty cells
    # so missing data doesn't pollute the embedding text.
    lines: list[str] = []
    for col, val in row.items():
        if pd.isna(val):
            continue
        lines.append(f"{col}: {val}")
    return "\n".join(lines)


def parse_csv_rows(csv_path: str, *, primary_key_col: str | None = None) -> list[dict[str, Any]]:
    path = Path(csv_path)
    df = pd.read_csv(path)

    rows: list[dict[str, Any]] = []
    # enumerate() gives a stable 0-based position; df.iterrows() yields the
    # index label, which may not be an integer if the CSV has a custom index.
    for row_number, (_, row) in enumerate(df.iterrows()):
        pk = None
        if primary_key_col and primary_key_col in df.columns:
            pk = row.get(primary_key_col)

        rows.append(
            {
                "text": row_to_text(row),
                "metadata": {
                    "file": path.name,
                    "row_number": int(row_number),
                    "primary_key": None if pk is None or pd.isna(pk) else str(pk),
                },
            }
        )

    return rows


rows = parse_csv_rows("products.csv", primary_key_col="id")
print(rows[0]["metadata"])
print(rows[0]["text"][:200])
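From here, each row's `text` goes to your embedding model and its `metadata` is stored next to the vector. A sketch of that wiring, where `embed` is a toy stand-in for a real embedding model and `index_rows` is a hypothetical helper (a real pipeline would upsert into a vector store keyed by `primary_key` or `row_number`):

```python
from typing import Any


def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model (e.g. sentence-transformers);
    # returns a 2-dim "vector" so the sketch is runnable.
    return [float(len(text)), float(text.count("\n") + 1)]


def index_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Pair each row's vector with its metadata, ready to upsert.
    return [
        {"vector": embed(r["text"]), "metadata": r["metadata"]}
        for r in rows
    ]


records = index_rows(
    [{"text": "name: Widget\nprice: 10", "metadata": {"row_number": 0}}]
)
print(records[0]["vector"])  # [22.0, 2.0]
```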

Metadata you should always keep

At minimum:

  • file
  • row_number
  • a stable primary_key if you have one

This enables:

  • clean citations (you can point back to an exact row)
  • updates (re-embed only changed rows)
  • debugging retrieval (“why did it return this row?”)
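The "re-embed only changed rows" part is easy to implement with a content hash of each row's text. A sketch, where storing the hash alongside the embedding is assumed (the `seen` dict here is a stand-in for that store):

```python
import hashlib


def row_hash(text: str) -> str:
    # Stable fingerprint of the row's RAG text; if it matches the stored
    # hash, the embedding is still valid and can be skipped.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


seen = {"row-1": row_hash("name: Widget\nprice: 10")}

new_text = "name: Widget\nprice: 12"
changed = row_hash(new_text) != seen["row-1"]
print(changed)  # True -> re-embed only this row
```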


When NOT to embed CSVs

Skip row embeddings if:

  • you primarily need sums/counts/group-bys
  • you need strict numeric correctness
  • the schema naturally belongs in a database

In those cases, ingest the data into Postgres and use:

  • SQL-first RAG (query → results → LLM)
  • optionally: pgvector for fuzzy matching on text columns


Next steps