Designing a RAG and embeddings backend

Retrieval-augmented generation has a reputation for being easy, and the first version always is. Embed some documents, store the vectors, find the nearest ones to a query, stuff them in the prompt. It works in the demo. Then you point it at real questions and it confidently retrieves the wrong passage, and you discover the uncomfortable truth: in RAG, the model is rarely the problem. Retrieval is. If the right context doesn't make it into the prompt, no amount of model quality saves you.

I learned this building the backend for a multi-agent system — call it Mission Control — where embeddings aren't a feature bolted onto a chatbot, they're the agents' memory. Five agents share a retrieval layer; what they can recall determines whether they act coherently or contradict each other. When retrieval is your memory, you can't hand-wave it. So here's the design, decision by decision.

Embeddings: dimensionality is a real trade-off

An embedding turns text into a vector whose geometry encodes meaning — similar text, nearby vectors. The two decisions that matter are the model and the dimensionality.

I run embeddings through Azure OpenAI at 768 dimensions. That number is a deliberate middle. Higher-dimensional embeddings capture more nuance but cost more to store and compare, and the returns flatten fast; lower dimensions are cheap but start losing the distinctions you need. 768 is the point where, for this corpus, retrieval quality stops improving enough to justify the storage and latency of going wider. Many modern embedding models support shortened output dimensions (Matryoshka-style) precisely so you can make this call deliberately instead of accepting a default.

Don't pick an embedding dimensionality by vibes. Pick it by measuring retrieval quality on your queries as you vary it. The right number is the smallest one where recall on a held-out set of real questions stops getting meaningfully better. For my corpus that was 768; yours may differ.
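A minimal sketch of that measurement in Python, under some loud assumptions: the embeddings come from a Matryoshka-style model, so keeping the first N components and re-normalising is a fair stand-in for requesting N dimensions; each held-out question has exactly one known-relevant document; and the function names, the default k of 8, and the 0.01 tolerance are all illustrative rather than anything from the real system.

```python
import math

def truncate_and_normalize(vec, dims):
    """Keep the first `dims` components and rescale to unit length."""
    head = vec[:dims]
    norm = math.sqrt(sum(x * x for x in head))
    return [x / norm for x in head] if norm else head

def cosine(a, b):
    # Inputs are unit-length, so the dot product is the cosine similarity.
    return sum(x * y for x, y in zip(a, b))

def recall_at_k(queries, docs, relevant, dims, k):
    """Fraction of queries whose known-relevant doc id lands in the top k.

    queries:  {query_id: full-width embedding}
    docs:     {doc_id: full-width embedding}
    relevant: {query_id: doc_id judged relevant}  (hypothetical labelled set)
    """
    short_docs = {d: truncate_and_normalize(v, dims) for d, v in docs.items()}
    hits = 0
    for qid, qvec in queries.items():
        q = truncate_and_normalize(qvec, dims)
        ranked = sorted(short_docs, key=lambda d: cosine(q, short_docs[d]), reverse=True)
        if relevant[qid] in ranked[:k]:
            hits += 1
    return hits / len(queries)

def smallest_sufficient_dims(queries, docs, relevant, candidates, k=8, tolerance=0.01):
    """Smallest candidate width whose recall is within `tolerance` of the best."""
    scores = {d: recall_at_k(queries, docs, relevant, d, k) for d in sorted(candidates)}
    best = max(scores.values())
    for d in sorted(scores):
        if scores[d] >= best - tolerance:
            return d, scores
```

Calling smallest_sufficient_dims with candidates such as [256, 512, 768, 1536] returns the chosen width plus the recall at every width, which is the curve worth looking at before committing to a column size. Truncation is only meaningful for models trained to support it; for anything else, re-embed at each width instead.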

Storage: Postgres with pgvector is enough

You do not need a dedicated vector database to start, and usually not to finish. pgvector puts vector search inside Postgres, which means the embeddings live next to the relational data they describe: same transactions, same backups, same joins. For a system already on Postgres (this one is on Supabase), that co-location removes a whole category of sync problems.

The schema is unremarkable, which is the point:

-- a dedicated `mc` schema; embeddings stored beside their metadata
create table mc.memories (
  id          uuid primary key default gen_random_uuid(),
  agent_id    text not null,
  content     text not null,
  embedding   vector(768) not null,
  created_at  timestamptz default now()
);
 
-- HNSW index for fast approximate nearest-neighbour on cosine distance
create index on mc.memories using hnsw (embedding vector_cosine_ops);

Retrieval is one query — and because it's Postgres, you filter by metadata in the same statement instead of fetching vectors and filtering in app code:

select content, 1 - (embedding <=> $1) as similarity
from mc.memories
where agent_id = $2                 -- metadata filter, for free
order by embedding <=> $1           -- <=> is cosine distance
limit 8;

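On index choice: HNSW over IVFFlat for almost everything. It builds slower and uses more memory, but its recall-versus-speed curve is better and it doesn't need a training step or retuning as data grows. The where agent_id = $2 clause is doing quiet but critical work: it scopes each agent's recall to its own memory, which is how five agents share infrastructure without bleeding context into each other. One caveat: with an approximate index, pgvector applies the filter after the index scan, so a very selective filter can return fewer than limit rows. Raising hnsw.ef_search, or enabling iterative index scans in pgvector 0.8.0 and later, addresses that.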
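To make the semantics of that query concrete, here is an exact, in-memory equivalent in Python. It is a sketch, not how the backend runs retrieval: the row dictionaries simply mirror the mc.memories columns, and recall_memories is a hypothetical name. Its practical use is as ground truth, since a brute-force scan over a sample gives the exact answer to compare against what the HNSW index returns.

```python
import math

def cosine_distance(a, b):
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

def recall_memories(rows, query_embedding, agent_id, limit=8):
    """Exact equivalent of the filtered nearest-neighbour query.

    rows: iterable of dicts with 'agent_id', 'content' and 'embedding' keys.
    Returns (content, similarity) pairs, most similar first.
    """
    # where agent_id = $2
    scoped = [r for r in rows if r["agent_id"] == agent_id]
    # order by embedding <=> $1
    scoped.sort(key=lambda r: cosine_distance(r["embedding"], query_embedding))
    # select content, 1 - (embedding <=> $1) as similarity ... limit 8
    return [
        (r["content"], 1.0 - cosine_distance(r["embedding"], query_embedding))
        for r in scoped[:limit]
    ]
```

Because the filter runs before the sort here, the result always has up to limit rows from the requested agent and never a row from another one, which is the behaviour the indexed query should match.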

Chunking is the lever nobody wants to hear about

Here's the unglamorous truth: how you split documents before embedding affects retrieval quality more than which embedding model you choose. It's also the part people most want to skip.

Embed whole documents and a query matches the document but the prompt gets a wall of mostly-irrelevant text. Chunk too small and you sever the context a passage needs to mean anything. The sweet spot is semantically coherent chunks — a section, a coherent few paragraphs — with a little overlap so an idea that straddles a boundary survives in at least one chunk. Chunk on structure (headings, paragraphs) rather than blind character counts whenever the source has structure to chunk on.
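As a rough illustration of structure-aware chunking with overlap, the Python sketch below packs whole paragraphs into chunks and repeats the last paragraph of each chunk at the start of the next. The assumptions are mine: paragraphs are separated by blank lines, size is measured in characters, and the 1200-character budget and one-paragraph overlap are placeholders to tune, not recommendations. A source with headings would want a first pass that splits on those too.

```python
import re

def chunk_by_structure(text, max_chars=1200, overlap_paragraphs=1):
    """Pack whole paragraphs into chunks, repeating the tail as overlap.

    Paragraphs are never split, so a chunk can exceed max_chars when a
    single paragraph (plus its overlap) is longer than the budget.
    """
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    chunks, current = [], []
    for para in paragraphs:
        candidate = current + [para]
        if current and len("\n\n".join(candidate)) > max_chars:
            chunks.append("\n\n".join(current))
            # Carry the tail of the finished chunk forward so an idea that
            # straddles the boundary survives in at least one chunk.
            current = current[-overlap_paragraphs:] if overlap_paragraphs else []
            candidate = current + [para]
        current = candidate
    if current:
        chunks.append("\n\n".join(current))
    return chunks
```

Splitting on paragraph boundaries rather than fixed character offsets keeps each chunk readable on its own, which matters because the chunk text is what ends up in the prompt.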

When retrieval is disappointing, chunking is the first thing I revisit, not the embedding model and not the prompt. It's almost always where the recall is being lost.

Hybrid search and reranking: the last 20%

Pure vector search has a known blind spot: exact terms. Someone searches for an error code, a product SKU, a specific function name — semantic similarity shrugs, but a keyword index nails it. The fix is hybrid search: run vector similarity and Postgres full-text (tsvector) search, then fuse the rankings. You get semantic recall and lexical precision instead of trading one for the other.
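One common way to fuse two rankings is reciprocal rank fusion (RRF). The post doesn't say which fusion method this system uses, so treat the Python sketch below as one reasonable option rather than the actual implementation. It assumes each search has already returned an ordered list of row ids, and it uses k = 60, a commonly used constant for RRF.

```python
def reciprocal_rank_fusion(rankings, k=60, limit=None):
    """Fuse ranked id lists: each list contributes 1 / (k + rank) per id."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest combined score first; ids found by both searches rise to the top.
    fused = sorted(scores, key=lambda d: scores[d], reverse=True)
    return fused[:limit] if limit is not None else fused


vector_hits = ["m1", "m2", "m3"]    # hypothetical ids from the similarity query
keyword_hits = ["m3", "m1", "m4"]   # hypothetical ids from the full-text query
fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
```

RRF only looks at rank positions, so it sidesteps the problem that cosine similarity and full-text rank scores live on different scales and can't be added directly.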

The final refinement is reranking. Over-fetch from the approximate nearest-neighbour search: pull the top ~20 candidates, then pass them through a cross-encoder reranker that scores each one against the query directly and keeps the best 5. Bi-encoder retrieval (embeddings) is fast and approximate; a cross-encoder is slow and accurate. Use the fast one to get candidates and the accurate one to order them. That two-stage shape (cheap recall, expensive precision) is the same instinct as scouting before committing expensive work, which I keep coming back to in how I build agentic workflows.
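The second stage can be sketched in a few lines of Python. The rerank function is hypothetical and takes the scorer as a parameter, because the post doesn't name a reranking model. The token_overlap scorer is a toy stand-in so the example runs on its own; it is not a cross-encoder, and a real one would take the same (query, passage) pair and return a relevance score.

```python
def rerank(query, candidates, score_pair, keep=5):
    """Re-order first-stage candidates with a slower pairwise scorer.

    candidates: passages from the cheap first stage (for example the top ~20).
    score_pair: callable (query, passage) -> float, higher means more relevant.
    """
    scored = [(score_pair(query, passage), passage) for passage in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:keep]]

def token_overlap(query, passage):
    """Toy stand-in for a cross-encoder: share of query tokens in the passage."""
    q_tokens = set(query.lower().split())
    p_tokens = set(passage.lower().split())
    return len(q_tokens & p_tokens) / len(q_tokens) if q_tokens else 0.0


candidates = ["billing overview", "how to reset your api key", "api rate limits"]
best = rerank("reset api key", candidates, token_overlap, keep=2)
```

Keeping the scorer behind a plain callable means the expensive model is only ever invoked on the short candidate list, which is the whole point of the two-stage shape.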

The design, in one breath

Embed at a dimensionality you actually measured. Store vectors in Postgres with pgvector and an HNSW index so retrieval is one filtered query. Spend your real effort on chunking, because that's where recall is won or lost. Add hybrid search for exact terms and a reranker for the final ordering. None of it is exotic; all of it is the difference between a RAG demo and a retrieval layer five agents can actually trust as memory.

If retrieval is memory, then the agents built on top of it are the interesting part — how you give each one a narrow job and let them compose. I wrote about that side in Specialized Agents and Systems, Not Scripts.