This post documents a few design decisions and learnings while building a FAISS-based retrieval system for large-scale RAG experiments as part of one of our projects.

We explored several embedding models (BAAI/bge-base-en-v1.5, google/embeddinggemma-300m, Qwen3-Embedding-8B). We eventually defaulted to Qwen3-Embedding-8B for most experiments since it’s consistently at the top of public benchmarks and showed stronger semantic recall. Some notes on Qwen3-Embedding-8B:

  • dimensionality: 4096
  • pooling: last (EOS) token
  • context length: we chunk at 512 tokens (the model itself supports much longer inputs)
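
Last-token (EOS) pooling can be sketched in a few lines. This is an illustrative numpy version, not the model's actual implementation; it assumes hidden states of shape [batch, seq, dim] and a 0/1 attention mask:

```python
import numpy as np

def last_token_pool(hidden_states: np.ndarray, attention_mask: np.ndarray) -> np.ndarray:
    """Take the hidden state of the last non-padding token per sequence.

    hidden_states:  [batch, seq_len, dim]
    attention_mask: [batch, seq_len], 1 for real tokens, 0 for padding.
    """
    # Index of the last real token in each sequence.
    last_idx = attention_mask.sum(axis=1) - 1          # [batch]
    batch_idx = np.arange(hidden_states.shape[0])
    emb = hidden_states[batch_idx, last_idx]           # [batch, dim]
    # L2-normalize so dot product equals cosine similarity.
    return emb / np.linalg.norm(emb, axis=1, keepdims=True)
```

The normalization step matters for FAISS: with unit-norm vectors, L2 distance and inner-product rankings coincide.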

We also ended up supporting four FAISS index types:

  • FLAT: exact search, no approximation
    • memory scales linearly with N x d x 4 bytes
    • at ~10M chunks with 768-dim embeddings, this is already ~30 GB of RAM
  • IVF: coarse quantization, exact search within clusters
    • requires training (nlist centroids)
    • search quality depends heavily on nprobe
  • HNSW: graph-based approximate nearest neighbor search
    • query latency often sub-ms for millions of vectors
    • for 10-20M chunks, RAM usage can exceed 30-40 GB depending on M and efSearch
  • IVFPQ: IVF + product quantization
    • 16-64x compression depending on PQ settings
    • recall sensitive to nlist, nprobe, and pq_nbits

For a corpus on the order of ~100B tokens, even modest retrieval percentages become large very quickly. Assuming 512-token chunks:

  • 1% retrieval -> ~2M chunks
  • 5% retrieval -> ~10M chunks
  • 10% retrieval -> ~20M chunks
  • 20% retrieval -> ~40M chunks
  • 40% retrieval -> ~78M chunks
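
The arithmetic behind those counts is a quick sanity check: a 100B-token corpus at 512 tokens per chunk gives roughly 195M chunks total.

```python
TOKENS = 100e9        # corpus size in tokens
CHUNK_TOKENS = 512    # tokens per chunk

total_chunks = TOKENS / CHUNK_TOKENS     # ~195M chunks
for pct in (1, 5, 10, 20, 40):
    print(f"{pct:>2}% retrieval -> ~{total_chunks * pct / 100 / 1e6:.0f}M chunks")
```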

We converged on the following strategy:

  • <1-2% retrieval: FLAT
  • 5-10% retrieval: HNSW
  • >10% retrieval: IVFPQ
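
That policy fits in a tiny helper (hypothetical, with our thresholds; the 2-5% gap defaults to HNSW here, and these cutoffs are tuned to our corpus, not universal constants):

```python
def pick_index(retrieval_pct: float) -> str:
    """Map a retrieval budget (% of corpus) to an index type.

    Exact search while the candidate set is small, graph ANN in
    the middle, compressed IVF beyond that.
    """
    if retrieval_pct <= 2:
        return "FLAT"
    elif retrieval_pct <= 10:
        return "HNSW"
    else:
        return "IVFPQ"
```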

Investigations and experiments:

We built an IVFPQ index and tested retrieval quality with a small 30M-parameter model over 5B RAG tokens. Retrieval quality was poor, with highly irrelevant results.

Example:

Query: The pomegranate appears in cuisines around the world and has unique nutritional compounds.

- dist=6417.6577 idx=10395190 text=’s U.S. breakout albums on the RCA label—their songs had some pop catchiness. “Dance Of The Mad Bastards” is no exception, with its repeated refrains and silly lyrics. But its relentless drum-machine drone, coupled with extended stops and s
- dist=7175.8086 idx=3173248 text=I was never fortunate enough to meet Dr. King, but I was a member of the vast crowd that stood on the Mall when he spoke to the March on Washington on August 28, 1963. I became a good friend of his close aide Bayard Rustin, who like Dr. Kin
- dist=7949.0264 idx=8303613 text=Small Labia … Labia Stretching  In a recent comment on Enlightened Male, Cindy writes:  I would like to see you do something about women who are extremely self conscious because of the exact opposite. Very small labia which is my case.

The index we used was:

  • class: faiss.swigfaiss.IndexIVFPQ
  • ntotal: 14,152,118
  • dim: 4096
  • metric: L2
  • nlist: 3761
  • nprobe (current): 10
  • PQ: m = 512, nbits = 8

In other words: about 14.15M vectors (chunks) were indexed with embedding dim 4096 under the L2 metric; the space was partitioned into 3761 coarse clusters, each query searched the 10 closest clusters, and each vector was split into 512 sub-vectors, each quantized against 256 codewords.
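
Plugging those PQ settings into the arithmetic (float32 vectors assumed):

```python
d, m, nbits = 4096, 512, 8

subvec_dim = d // m                    # dims per sub-vector
codewords = 2 ** nbits                 # centroids per sub-quantizer
code_bytes = m * nbits // 8            # PQ code size per vector
raw_bytes = d * 4                      # float32 size per vector
compression = raw_bytes / code_bytes

print(subvec_dim, codewords, code_bytes, raw_bytes, compression)
# 8-dim sub-vectors, 256 codewords each, 512 B codes vs 16 KB raw -> 32x
```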

We realized nprobe was probably too small for our data size. A higher nprobe means better recall but slower queries. With nprobe = 10 and nlist = 3761, we were searching only 10 / 3761 ≈ 0.27% of the coarse clusters. After some investigation, we found that a common rule of thumb for IVF-style indexes is to set nprobe to roughly 1-10% of nlist.
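
Putting numbers on that rule of thumb for our index:

```python
nlist, nprobe = 3761, 10

frac = nprobe / nlist
print(f"searching {frac:.2%} of coarse clusters")      # 0.27%

# Rule-of-thumb band: nprobe ~ 1-10% of nlist.
low, high = round(0.01 * nlist), round(0.10 * nlist)
print(f"suggested nprobe range: {low}-{high}")
```

Our eventual nprobe = 64 (~1.7% of nlist) sits at the low end of that band; in faiss it is set at query time via `index.nprobe = 64`, with no re-indexing needed.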

After increasing nprobe to 64, retrieval quality finally started to make sense.

Building retrieval at this scale, I realized that most failures weren’t algorithmic (nothing was objectively “wrong” in the code) but systems-level mismatches that emerge with scale. We also learned to treat these hyperparameters as experimental variables rather than implementation details to gloss over.