Most RAG tutorials stop at the notebook. You get a working demo, a couple of PDF chunks in a FAISS index, and a chat loop. That is fine for a proof of concept, but it falls apart the moment you need to serve more than one user, swap the embedding model, or actually inspect what is happening inside the retrieval step.
LLamaPG started as my attempt to build RAG the way I would want to use it in production: a proper service boundary, a real database, and enough observability to debug bad retrievals without guessing.
Why PostgreSQL instead of a dedicated vector store
The obvious choice for a first RAG project is Chroma or FAISS. Both are easy to set up and work fine for small collections. I went with pgvector for a few reasons.
First, I already had PostgreSQL in the stack for metadata storage. Running a separate vector database meant another service to operate, another connection pool to manage, and another failure mode. pgvector keeps everything in one place.
Second, SQL gives you joins for free. If you want to filter by document date, owner, or collection before running the similarity search, you write a WHERE clause. With most dedicated vector stores you are either doing client-side filtering (fetch everything, then filter) or learning a custom query DSL.
Third, HNSW indexes in pgvector are fast enough. For the workloads I tested, retrieval latency on 10k+ chunks was comfortably under 50ms.
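For reference, the index behind those numbers looks something like this. The table and column names are assumptions chosen to match the retrieval query shown later, and the m/ef_construction values are pgvector's defaults, not anything tuned:

```sql
-- Hypothetical DDL; vector_cosine_ops matches the <=> operator used at query time.
CREATE INDEX IF NOT EXISTS idx_chunks_embedding_hnsw
ON document_chunks
USING hnsw (embedding vector_cosine_ops)
WITH (m = 16, ef_construction = 64);
```

Raising ef_construction trades slower index builds for better recall; pgvector also exposes hnsw.ef_search as a per-session knob on the query side.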
Service structure
The platform is three containers behind a single Compose file.
api/ FastAPI service, handles ingestion and query endpoints
worker/ Celery worker for async chunking and embedding jobs
db/ PostgreSQL 15 + pgvector extension
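A sketch of what that Compose file can look like. Service names and settings here are illustrative, not the project's actual file; pgvector/pgvector:pg15 is the official image with the extension preinstalled:

```yaml
# Illustrative only -- build contexts and env vars are assumptions.
services:
  api:
    build: ./api
    ports: ["8000:8000"]
    depends_on: [db]
  worker:
    build: ./worker
    depends_on: [db]
  db:
    image: pgvector/pgvector:pg15
    environment:
      POSTGRES_PASSWORD: example
```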
The API exposes two main surfaces: an ingestion endpoint that accepts documents and kicks off a Celery task, and a query endpoint that takes a question, embeds it, runs the similarity search, stuffs context into a prompt, and returns the answer along with the source chunks.
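The query path can be sketched framework-free, with the embedding, search, and generation steps passed in as callables. Everything here, names and prompt wording included, is illustrative rather than the project's actual code:

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Answer:
    text: str
    sources: list  # the chunks that were stuffed into the prompt


def answer_query(
    question: str,
    embed: Callable,     # e.g. an OpenAI embeddings call
    search: Callable,    # e.g. the pgvector similarity query
    generate: Callable,  # e.g. a chat-completion call
    top_k: int = 4,
) -> Answer:
    """Embed the question, retrieve chunks, build a prompt, generate."""
    query_vec = embed(question)
    chunks = search(query_vec, top_k)
    context = "\n---\n".join(c["chunk_text"] for c in chunks)
    prompt = f"Answer using only this context:\n{context}\n\nQuestion: {question}"
    return Answer(text=generate(prompt), sources=chunks)
```

Returning the source chunks alongside the answer is what makes bad retrievals debuggable: you can see exactly what the model was shown.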
Keeping the embedding work in Celery was important. Chunking a 200-page PDF is not something you want on the hot path of an HTTP request.
Chunking decisions
I used LangChain’s RecursiveCharacterTextSplitter with a chunk size of 512 tokens and a 64-token overlap. The overlap is the part most tutorials skip over. Without it, a sentence that falls on a chunk boundary ends up split across two embeddings, and neither chunk carries the full meaning. Overlap is cheap storage for a real improvement in retrieval quality.
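Here is the boundary effect in miniature, as a character-based sketch of the 512/64 settings. The real splitter (LangChain's RecursiveCharacterTextSplitter) also prefers to break on paragraph and sentence boundaries, which this deliberately ignores:

```python
def chunk_with_overlap(text: str, size: int = 512, overlap: int = 64) -> list:
    """Sliding-window chunking: each chunk starts `size - overlap`
    characters after the previous one, so text near a boundary appears
    whole in at least one chunk."""
    step = size - overlap
    return [text[i:i + size] for i in range(0, max(len(text) - overlap, 1), step)]
```

With overlap set to 0, a sentence straddling position 512 would be cut in half; with 64 characters of overlap, the second chunk re-reads the tail of the first.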
For embedding I used text-embedding-3-small from OpenAI. Swapping models is a one-line config change, but you have to re-embed everything when you do, which is another argument for async workers over inline processing.
The retrieval query
SELECT chunk_text, metadata, 1 - (embedding <=> $1) AS similarity
FROM document_chunks
WHERE collection_id = $2
ORDER BY embedding <=> $1
LIMIT $3;
The <=> operator is cosine distance in pgvector, i.e. 1 minus cosine similarity. Subtracting the distance from 1 recovers the similarity, which in principle ranges from -1 to 1 but in practice lands between 0 and 1 for text embeddings. That score is easier to reason about when you are setting a relevance threshold.
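The arithmetic in miniature, as a sanity check on the score (pure Python, no pgvector involved):

```python
import math


def cosine_distance(a, b):
    """What pgvector's <=> computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm


def similarity(a, b):
    """The 1 - distance score used as the relevance threshold."""
    return 1.0 - cosine_distance(a, b)
```

Identical vectors score 1, orthogonal vectors 0, and opposed vectors -1, which is why the threshold range is worth stating precisely.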
One caveat on the collection filter: pgvector's HNSW index does not pre-filter in the strict sense. The WHERE clause is applied to candidates as the index scan returns them, which before pgvector 0.8.0 could yield fewer than LIMIT rows when most neighbors belong to other collections; since 0.8.0, iterative index scans keep scanning until enough matching rows are found. Either way the query still uses the index rather than degrading into a sequential scan, and a partial index per collection is an option if one collection dominates the traffic.
What I would do differently
The main thing I underestimated was reranking. Embedding similarity is a good first filter but it is not a great final ranker. Adding a cross-encoder reranker as a second pass on the top-k results would meaningfully improve answer quality, especially for longer documents where the relevant chunk is not always the semantically closest one.
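The second pass is simple to bolt on. Here the cross-encoder is abstracted to a score(question, chunk) callable so the sketch runs without model weights; in practice that callable would be something like a sentence-transformers CrossEncoder, which reads the question and chunk together instead of comparing two independently computed embeddings:

```python
from typing import Callable


def rerank(question: str, chunks: list, score: Callable, keep: int = 3) -> list:
    """Second pass over the top-k retrieved chunks: re-score each one
    against the question and keep the best `keep`."""
    scored = sorted(
        chunks,
        key=lambda c: score(question, c["chunk_text"]),
        reverse=True,
    )
    return scored[:keep]
```

Because the reranker only sees top-k candidates, its cost stays bounded regardless of corpus size.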
I also hardcoded the prompt template, which made it annoying to iterate. A proper prompt registry, even just a YAML file of named templates, would have saved a lot of redeployments.
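Even the minimal version is very little code. This sketch keeps the templates inline where the real registry would load them from the YAML file; the template names and wording are hypothetical:

```python
# In the real setup this dict would be yaml.safe_load()-ed from a file,
# so templates can change without a redeploy.
PROMPTS = {
    "qa_default": (
        "Answer the question using only the context below.\n"
        "Context:\n{context}\n\nQuestion: {question}"
    ),
    "qa_concise": (
        "In one sentence, answer from the context.\n"
        "Context:\n{context}\nQuestion: {question}"
    ),
}


def render(name: str, **kwargs) -> str:
    """Look up a named template and fill in its placeholders."""
    return PROMPTS[name].format(**kwargs)
```

Logging the template name alongside each answer also makes A/B comparisons between prompts trivial.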
Source: github.com/akvnn/llama-pg