Parsing CSVs for RAG

CSV data is structured (rows/columns), which changes how you should build RAG:

  • If your questions are semantic (“Which plan includes SSO?”), embeddings can work well.
  • If your questions are structured (“count by status”, “sum revenue by month”), SQL is the right tool.

This page shows a practical default: row-wise text chunks + metadata, and how to decide when not to embed.


Two approaches

1) Embed rows (small/medium CSVs)

Use embeddings when:

  • users ask “fuzzy” questions (synonyms, paraphrases)
  • each row is a meaningful unit (product, customer, ticket, FAQ entry)
  • you want retrieval over free-text cells (descriptions, notes)

2) Use SQL (large CSVs / strict correctness)

Prefer SQL when:

  • you need exact filters, joins, aggregations
  • numbers must be correct (finance, metrics)
  • the dataset is large enough that embedding every row is expensive

If you’ll end up doing SQL anyway, skip embeddings and build a SQL-first pipeline.

See: SQL for RAG (pgvector + Full‑Text + Structured Filters)
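As a minimal sketch of what SQL-first looks like, you can load the CSV into a database and run exact aggregations directly. This uses an in-memory SQLite database and an illustrative inline DataFrame standing in for your CSV; in practice you would `pd.read_csv(...)` and target Postgres.

```python
import sqlite3

import pandas as pd

# Stand-in for pd.read_csv("orders.csv"); columns are illustrative.
df = pd.DataFrame(
    {
        "status": ["open", "closed", "open"],
        "revenue": [100.0, 250.0, 50.0],
    }
)

conn = sqlite3.connect(":memory:")
df.to_sql("orders", conn, index=False)

# Exact group-by aggregation -- the kind of question embeddings answer poorly.
rows = conn.execute(
    "SELECT status, COUNT(*), SUM(revenue) FROM orders GROUP BY status ORDER BY status"
).fetchall()
print(rows)  # [('closed', 1, 250.0), ('open', 2, 150.0)]
```

The LLM's job then shifts from "recall the right chunk" to "write the right query and explain the result".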


Row-wise embedding: a good default

Install

pip install pandas

Convert a row into RAG-friendly text

A simple formatting convention that works well:

col_1: value
col_2: value
...
from __future__ import annotations

from pathlib import Path
from typing import Any

import pandas as pd


def row_to_text(row: pd.Series) -> str:
    # Render a row as "column: value" lines, skipping empty cells
    # so missing data doesn't pollute the embedding text.
    lines: list[str] = []
    for col, val in row.items():
        if pd.isna(val):
            continue
        lines.append(f"{col}: {val}")
    return "\n".join(lines)


def parse_csv_rows(csv_path: str, *, primary_key_col: str | None = None) -> list[dict[str, Any]]:
    path = Path(csv_path)
    df = pd.read_csv(path)

    rows: list[dict[str, Any]] = []
    # enumerate() gives a stable 0-based position; df.iterrows() yields the
    # index label, which may not be an integer if the CSV has a custom index.
    for row_number, (_, row) in enumerate(df.iterrows()):
        pk = None
        if primary_key_col and primary_key_col in df.columns:
            pk = row.get(primary_key_col)

        rows.append(
            {
                "text": row_to_text(row),
                "metadata": {
                    "file": path.name,
                    "row_number": int(row_number),
                    "primary_key": None if pk is None or pd.isna(pk) else str(pk),
                },
            }
        )

    return rows


rows = parse_csv_rows("products.csv", primary_key_col="id")
print(rows[0]["metadata"])
print(rows[0]["text"][:200])
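From here, each row's `text` goes to your embedding model and its `metadata` is stored next to the vector. A sketch of that wiring, where `embed` is a toy stand-in for a real embedding model and `index_rows` is a hypothetical helper (a real pipeline would upsert into a vector store keyed by `primary_key` or `row_number`):

```python
from typing import Any


def embed(text: str) -> list[float]:
    # Toy stand-in for a real embedding model (e.g. sentence-transformers);
    # returns a 2-dim "vector" so the sketch is runnable.
    return [float(len(text)), float(text.count("\n") + 1)]


def index_rows(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    # Pair each row's vector with its metadata, ready to upsert.
    return [
        {"vector": embed(r["text"]), "metadata": r["metadata"]}
        for r in rows
    ]


records = index_rows(
    [{"text": "name: Widget\nprice: 10", "metadata": {"row_number": 0}}]
)
print(records[0]["vector"])  # [22.0, 2.0]
```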

Metadata you should always keep

At minimum:

  • file
  • row_number
  • a stable primary_key if you have one

This enables:

  • clean citations (you can point back to an exact row)
  • updates (re-embed only changed rows)
  • debugging retrieval (“why did it return this row?”)
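The "re-embed only changed rows" part is easy to implement with a content hash of each row's text. A sketch, where storing the hash alongside the embedding is assumed (the `seen` dict here is a stand-in for that store):

```python
import hashlib


def row_hash(text: str) -> str:
    # Stable fingerprint of the row's RAG text; if it matches the stored
    # hash, the embedding is still valid and can be skipped.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


seen = {"row-1": row_hash("name: Widget\nprice: 10")}

new_text = "name: Widget\nprice: 12"
changed = row_hash(new_text) != seen["row-1"]
print(changed)  # True -> re-embed only this row
```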


When NOT to embed CSVs

Skip row embeddings if:

  • you primarily need sums/counts/group-bys
  • you need strict numeric correctness
  • the schema naturally belongs in a database

In those cases, ingest the data into Postgres and use:

  • SQL-first RAG (query → results → LLM)
  • optionally: pgvector for fuzzy matching on text columns


Next steps