Disclaimer: We will go into some technical and architectural details of how we do this at Neum AI — a data platform for embeddings management, optimization, and synchronization at large scale, essentially helping with large-scale RAG.
As we’ve shared in other blogs in the past, getting a Retrieval Augmented Generation (RAG) application started is pretty straightforward. The problem comes when trying to scale it and make it production-ready. In this blog we will focus on how we did that for a pipeline syncing 1 billion vectors.
First off, what exactly is RAG?
RAG helps you find data quickly by searching over it in a “natural” way and then uses that information to power an AI application that needs accurate, up-to-date knowledge. It is a popular, relatively recent approach for building AI applications that stay accurate and current.
This is what a typical RAG system looks like:
- Data is extracted, processed, embedded, and stored in a vector database for fast semantic search lookups.
- A user submits input, the input is embedded, the vector database is searched for the most relevant information, and that information is passed as context to an AI application so it can produce an accurate response (a minimal sketch of this query path follows).
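To make the query path concrete, here is a minimal sketch in Python. It assumes an OpenAI embedding and chat model, a Weaviate instance at a placeholder URL, and a hypothetical `Document` class with a `text` property, using the v3 Python client; any embedding model and vector database would follow the same shape.

```python
import weaviate
from openai import OpenAI

openai_client = OpenAI()  # reads OPENAI_API_KEY from the environment
weaviate_client = weaviate.Client("http://localhost:8080")  # placeholder URL

def answer(question: str, top_k: int = 3) -> str:
    # 1. Embed the user input with the same model used at ingestion time.
    query_vector = openai_client.embeddings.create(
        model="text-embedding-3-small",  # assumed model
        input=question,
    ).data[0].embedding

    # 2. Semantic search: fetch the top_k most similar chunks from the vector database.
    result = (
        weaviate_client.query.get("Document", ["text"])
        .with_near_vector({"vector": query_vector})
        .with_limit(top_k)
        .do()
    )
    context = "\n\n".join(obj["text"] for obj in result["data"]["Get"]["Document"])

    # 3. Pass the retrieved chunks to the LLM as grounding context.
    completion = openai_client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model
        messages=[
            {"role": "system", "content": "Answer using only the provided context."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return completion.choices[0].message.content
```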
Now, let’s talk about the problem at hand: how do we effectively ingest and synchronize billions of text embeddings to be used in a RAG workflow?
RAG is straightforward, but when dealing with lots of source data, a couple of complex problems arise.
- Ingestion at large scale — It is one thing to ingest a couple of PDFs, chunk them, embed them, and store them in a vector database. When you have billions of records, general data infrastructure and engineering problems arise: how to effectively parallelize requests, how to handle retries, how to spin up the right infrastructure and distributed systems, and more. It is also very important to understand the volume of data, ingestion time requirements, search latency, cost, and more in order to plan and deploy the right infrastructure and compute. All of these are core engineering problems that, albeit solved, are daunting to implement.
- Embedding (transforming text into vector format for low-latency semantic search) — Generating embeddings is not a problem in itself, except when you have large amounts of data and have to deal with rate limits, retry logic, self-hosted models, and more. Syncing data also becomes crucial here: if something changed at the source but not in the downstream vector database, where the AI application typically queries from, then the responses from the AI application will be stale and inaccurate. Embeddings can also be costly if not done efficiently. There will always be a one-time cost of embedding all the data, but for an application that relies on new or changed data, re-embedding all of the source data is very expensive, so there has to be a mechanism to detect whether or not data needs to be re-embedded (a simple version of such a check is sketched after this list).
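To make that last point concrete, here is a minimal sketch of hash-based change detection combined with exponential backoff on rate limits. This is not Neum AI’s actual implementation: the in-memory hash store, the OpenAI embedding model, and the backoff schedule are all illustrative placeholders for whatever your pipeline uses.

```python
import hashlib
import time

from openai import OpenAI, RateLimitError

client = OpenAI()

# Placeholder state store mapping document id -> content hash from the last run.
# In a real pipeline this would be a database or metadata table, not an in-memory dict.
seen_hashes: dict[str, str] = {}

def needs_embedding(doc_id: str, text: str) -> bool:
    """Return True only if the document is new or its content changed since the last run."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    if seen_hashes.get(doc_id) == digest:
        return False                      # unchanged: skip the embedding cost entirely
    seen_hashes[doc_id] = digest
    return True

def embed_with_backoff(texts: list[str], max_retries: int = 5) -> list[list[float]]:
    """Embed a batch of texts, backing off exponentially when rate limited."""
    for attempt in range(max_retries):
        try:
            response = client.embeddings.create(
                model="text-embedding-3-small",   # assumed model, swap for your own
                input=texts,
            )
            return [item.embedding for item in response.data]
        except RateLimitError:
            time.sleep(2 ** attempt)              # 1s, 2s, 4s, ...
    raise RuntimeError("Embedding failed after repeated rate limiting")
```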
In this specific case, our data pipeline is responsible for four operations.
a) Reading data
b) Processing data (a simple chunking sketch follows this list)
c) Embedding data
d) Storing data in a vector database — in this case Weaviate!
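To give a flavor of the processing step (b), here is a simple, self-contained chunking helper: it splits a document into overlapping chunks and attaches metadata for later filtering and syncing. The character-based splitting and the default sizes are illustrative; in practice chunking is usually token-aware and tuned per source type.

```python
def chunk_text(text: str, source_id: str, chunk_size: int = 1000, overlap: int = 100) -> list[dict]:
    """Split a document into overlapping character-based chunks with metadata attached."""
    chunks = []
    step = chunk_size - overlap
    for start in range(0, len(text), step):
        piece = text[start:start + chunk_size].strip()
        if not piece:
            continue
        chunks.append({
            "text": piece,
            "source_id": source_id,       # lets us trace a vector back to its source document
            "chunk_index": len(chunks),   # useful for deduplication and deterministic ids later
        })
    return chunks
```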
Each of the points above has its own challenges.
- Reading data needs to be done efficiently, maximizing parallelization to meet ingestion time requirements.
- Once data is read, it needs to be processed; we can’t just dump everything into an embedding model. It needs to be carefully chunked depending on the source type, have the relevant metadata fields extracted, and have any anomalies cleaned up.
- Embedding data, as mentioned earlier, needs to be done only when required, and parallelized in terms of requests and compute within the constraints of the system and any external API limits.
- Storing in the vector database has its own limitations (a batching sketch follows this list):
  - What are the compute resources of the vector database?
  - Is it self-hosted or managed? Is there monitoring?
  - Is the data sharded? What is the latency? What about compression?
  - Did you know that the HNSW algorithm is pretty inefficient when trying to store identical vectors? (more on this later)
  - Could the ingestion into the database be the bottleneck of our system?
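As a concrete example of the storing step, here is a rough sketch of batched ingestion into Weaviate with pre-computed vectors, reusing the chunk dictionaries from the earlier sketch and the v3 Python client. The class name, URL, and batch size are illustrative, and a production pipeline would wrap this with retries, error callbacks, and monitoring.

```python
import weaviate

client = weaviate.Client("http://localhost:8080")   # placeholder URL for a self-hosted instance

def store_chunks(chunks: list[dict], vectors: list[list[float]]) -> None:
    """Batch-import pre-computed vectors into a hypothetical 'Document' class.
    Batching amortizes network overhead; per-object inserts quickly become
    the bottleneck of the whole pipeline at this scale."""
    client.batch.configure(batch_size=100, dynamic=True)   # let the client tune batch sizes
    with client.batch as batch:
        for chunk, vector in zip(chunks, vectors):
            batch.add_data_object(
                data_object={
                    "text": chunk["text"],
                    "source_id": chunk["source_id"],
                    "chunk_index": chunk["chunk_index"],
                },
                class_name="Document",
                vector=vector,        # bring-your-own vector: Weaviate skips re-vectorizing
            )
```

In practice we would also assign each object a deterministic UUID derived from source_id and chunk_index (for example via weaviate.util.generate_uuid5) so that re-running the pipeline updates existing objects instead of creating duplicates.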
In addition to that, the system itself must have great monitoring, cancellation options, logging, and alerting in place: all of the things you would expect from a robust distributed system.
For the rest of the blog we will explore solutions and share a bit of our architecture, including how we tested, benchmarked, and ran the pipeline that moved 1 billion vectors.
Let’s talk about the high level architecture and break down each of the components.
As mentioned before, this distributed system is responsible for four main tasks, and each one is broken down below.