Parsing CSVs for RAG¶
CSV data is structured (rows/columns), which changes how you should build RAG:
- If your questions are semantic (“Which plan includes SSO?”), embeddings can work well.
- If your questions are structured (“count by status”, “sum revenue by month”), SQL is the right tool.
This page shows a practical default: row-wise text chunks + metadata, and how to decide when not to embed.
Two approaches¶
1) Embed rows (small/medium CSVs)¶
Use embeddings when:

- users ask “fuzzy” questions (synonyms, paraphrases)
- each row is a meaningful unit (product, customer, ticket, FAQ entry)
- you want retrieval over free-text cells (descriptions, notes)
2) Use SQL (large CSVs / strict correctness)¶
Prefer SQL when:

- you need exact filters, joins, and aggregations
- numbers must be correct (finance, metrics)
- the dataset is large enough that embedding every row is expensive
If you’ll end up doing SQL anyway, skip embeddings and build a SQL-first pipeline.
See: SQL for RAG (pgvector + Full‑Text + Structured Filters)
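As a sketch of the SQL-first path: pandas can load a CSV straight into a database table, and exact aggregations then come from plain SQL with no embeddings involved. This uses in-memory SQLite for illustration (swap in Postgres for production); the `orders` table and its columns are made-up example data standing in for `pd.read_csv(...)`.

```python
import sqlite3

import pandas as pd

# Stand-in for pd.read_csv("orders.csv") -- illustrative data only.
df = pd.DataFrame(
    {"status": ["open", "closed", "open"], "revenue": [10.0, 25.0, 5.0]}
)

# Load the CSV rows into an in-memory SQLite table.
conn = sqlite3.connect(":memory:")
df.to_sql("orders", conn, index=False)

# Exact counts and sums -- the kind of question embeddings answer poorly.
counts = conn.execute(
    "SELECT status, COUNT(*), SUM(revenue) FROM orders "
    "GROUP BY status ORDER BY status"
).fetchall()
print(counts)  # [('closed', 1, 25.0), ('open', 2, 15.0)]
```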
Row-wise embedding: a good default¶
Install¶
```
pip install pandas
```
Convert a row into RAG-friendly text¶
A simple formatting convention that works well:
```
col_1: value
col_2: value
...
```
```python
from __future__ import annotations

from pathlib import Path
from typing import Any

import pandas as pd


def row_to_text(row: pd.Series) -> str:
    """Render one row as "column: value" lines, skipping empty cells."""
    lines: list[str] = []
    for col, val in row.items():
        if pd.isna(val):
            continue
        lines.append(f"{col}: {val}")
    return "\n".join(lines)


def parse_csv_rows(csv_path: str, *, primary_key_col: str | None = None) -> list[dict[str, Any]]:
    """Parse a CSV into one text chunk per row, with citation-friendly metadata."""
    path = Path(csv_path)
    df = pd.read_csv(path)

    rows: list[dict[str, Any]] = []
    for row_number, row in df.iterrows():
        pk = None
        if primary_key_col and primary_key_col in df.columns:
            pk = row.get(primary_key_col)
        rows.append(
            {
                "text": row_to_text(row),
                "metadata": {
                    "file": path.name,
                    "row_number": int(row_number),
                    "primary_key": None if pk is None or pd.isna(pk) else str(pk),
                },
            }
        )
    return rows


rows = parse_csv_rows("products.csv", primary_key_col="id")
print(rows[0]["metadata"])
print(rows[0]["text"][:200])
```
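From here, each `text` goes to your embedding model and its `metadata` travels alongside it into the vector store. A minimal sketch, where `embed_texts` and `build_index_records` are hypothetical names and the stub body stands in for whatever embedding client you actually use:

```python
from typing import Any


def embed_texts(texts: list[str]) -> list[list[float]]:
    # Placeholder: call your real embedding API here (OpenAI, Cohere, a local model, ...).
    # This stub just returns a one-dimensional "vector" so the sketch runs.
    return [[float(len(t))] for t in texts]


def build_index_records(rows: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Pair each row's vector with its text and metadata, ready for a vector store."""
    vectors = embed_texts([r["text"] for r in rows])
    return [
        {"vector": vec, "text": r["text"], "metadata": r["metadata"]}
        for r, vec in zip(rows, vectors)
    ]


# Illustrative row shaped like parse_csv_rows output.
rows = [
    {
        "text": "name: Widget\nprice: 9.99",
        "metadata": {"file": "products.csv", "row_number": 0, "primary_key": "1"},
    }
]
records = build_index_records(rows)
print(records[0]["metadata"]["file"])  # products.csv
```

Keeping the metadata attached at this stage is what makes the citations and updates below possible.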
Metadata you should always keep¶
At minimum:
- file
- row_number
- a stable primary_key if you have one
This enables:

- clean citations (you can point back to an exact row)
- incremental updates (re-embed only changed rows)
- debugging retrieval (“why did it return this row?”)
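The stable `primary_key` is what makes incremental re-embedding cheap: fingerprint each row's text, and on re-ingest skip any row whose fingerprint is unchanged. A minimal sketch, assuming you store a `primary_key → hash` mapping from the previous ingest (the helper names here are illustrative, not part of the parser above):

```python
import hashlib


def content_hash(text: str) -> str:
    # Stable fingerprint of the row text; changes iff the row content changes.
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def rows_to_reembed(new_rows, stored_hashes):
    """Return only the rows whose content changed since the last ingest.

    stored_hashes: mapping of primary_key -> content hash from the previous run.
    """
    changed = []
    for row in new_rows:
        pk = row["metadata"]["primary_key"]
        if stored_hashes.get(pk) != content_hash(row["text"]):
            changed.append(row)
    return changed


# Illustrative data: row "1" is unchanged, row "2" is new.
stored = {"1": content_hash("name: Widget\nprice: 9.99")}
new_rows = [
    {"text": "name: Widget\nprice: 9.99", "metadata": {"primary_key": "1"}},
    {"text": "name: Gadget\nprice: 12.99", "metadata": {"primary_key": "2"}},
]
print([r["metadata"]["primary_key"] for r in rows_to_reembed(new_rows, stored)])  # ['2']
```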
When NOT to embed CSVs¶
Skip row embeddings if:

- you primarily need sums/counts/group-bys
- you need strict numeric correctness
- the schema naturally belongs in a database
In those cases, ingest the data into Postgres and use:

- SQL-first RAG (query → results → LLM)
- optionally: pgvector for fuzzy matching on text columns
Next steps¶
- Learn how to chunk long fields before embedding: Chunking Strategies
- Learn when to use SQL and hybrid retrieval: SQL for RAG