With the popularity of LLMs (Large Language Models), vector databases have also become a hot topic. With just a few lines of simple Python code, a vector database can act as a cheap but highly effective “external brain” for your LLM. But do we really need a specialized vector database?
Why do LLMs need vector search?
First, let me briefly explain why LLMs need vector search. Vector search is a long-standing problem: given an object, find the most similar objects in a collection. Text, images, and other data can be converted into vector representations (embeddings), which turns the similarity problem for text or images into a vector similarity problem.
Imagine, for example, converting different words into three-dimensional vectors. We can then intuitively display the similarity between words in 3D space: the similarity between “student” and “school” is higher than the similarity between “student” and “food”.
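As a toy illustration (the numbers below are invented; real embeddings have hundreds or thousands of dimensions), cosine similarity over such vectors captures exactly this relationship:

import numpy as np

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Made-up 3-dimensional "embeddings" for three words
student = np.array([0.9, 0.8, 0.1])
school = np.array([0.8, 0.9, 0.2])
food = np.array([0.1, 0.2, 0.9])

print(cosine_similarity(student, school))  # high: "student" is close to "school"
print(cosine_similarity(student, food))    # low: "student" is far from "food"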
Returning to LLMs, the limited length of the context window is a major challenge. For instance, GPT-3.5 has a context length limit of 4k tokens. This poses a significant problem for the model’s in-context learning ability and negatively impacts the user experience. However, vector search provides an elegant solution to this problem:
- Divide the text that exceeds the context length limit into shorter chunks, and convert each chunk into a vector (embedding).
- Before sending the prompt to the LLM, convert the prompt into a vector (embedding) as well.
- Search for the chunk vectors most similar to the prompt vector.
- Concatenate the text of the most similar chunks with the prompt as the input to the LLM (see the sketch after this list).
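Here is a minimal sketch of this workflow, assuming the legacy openai Python SDK; the chunk_size, the brute-force cosine-similarity search, and the embed/answer helper names are illustrative choices, not part of any particular library:

import numpy as np
import openai

EMBEDDING_MODEL = "text-embedding-ada-002"

def embed(text):
    # Convert a piece of text into an embedding vector
    return np.array(
        openai.Embedding.create(input=text, model=EMBEDDING_MODEL)
        ["data"][0]["embedding"]
    )

def answer(prompt, document, chunk_size=1000, top_k=3):
    # 1. Split the long document into shorter chunks and embed each one
    chunks = [document[i:i + chunk_size]
              for i in range(0, len(document), chunk_size)]
    chunk_vecs = np.array([embed(c) for c in chunks])
    # 2. Convert the prompt into a vector
    prompt_vec = embed(prompt)
    # 3. Find the chunks most similar to the prompt (cosine similarity)
    sims = chunk_vecs @ prompt_vec / (
        np.linalg.norm(chunk_vecs, axis=1) * np.linalg.norm(prompt_vec)
    )
    best = [chunks[i] for i in np.argsort(-sims)[:top_k]]
    # 4. Concatenate the text of the best chunks with the prompt
    context = "\n".join(best)
    return openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": f"{context}\n\n{prompt}"}],
    )["choices"][0]["message"]["content"]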
This is like giving the LLM an external memory from which it can retrieve the most relevant information, and that memory is exactly what vector search provides. If you want to learn more details, you can read these articles (Article 1 and Article 2), which explain it more clearly.
Why are vector databases so popular?
In the LLM ecosystem, the vector database has become an indispensable part, and one of the most important reasons is its ease of use. Combined with an OpenAI embedding model such as text-embedding-ada-002, it takes only about ten lines of code to convert a prompt into a vector and run the entire vector search process.
import openai
import weaviate

EMBEDDING_MODEL = "text-embedding-ada-002"
# Assumed setup: a local Weaviate instance; the calls below follow the
# Weaviate Python client (v3) API
client = weaviate.Client("http://localhost:8080")

def query(query_text, collection_name, top_k=20):
    # Create an embedding vector from the user query
    embedded_query = openai.Embedding.create(
        input=query_text,
        model=EMBEDDING_MODEL,
    )["data"][0]["embedding"]
    near_vector = {"vector": embedded_query}
    # Query the collection with the vectorized user query
    query_result = (
        client.query
        .get(collection_name)
        .with_near_vector(near_vector)
        .with_limit(top_k)
        .do()
    )
    return query_result
In LLM applications, vector search mainly plays the role of recall. Simply put, recall means finding the most similar objects in a candidate set; here, the candidate set is all of the chunks, and the goal is the chunk most similar to the prompt. During LLM inference, vector search serves as the main implementation of recall. It is easy to implement, and OpenAI embedding models solve the most troublesome part, converting text into vectors. What remains is an independent, clean vector search problem that current vector databases handle well, so the entire workflow is particularly smooth.
As the name suggests, a vector database is a database designed specifically for vectors as a data type. A brute-force similarity search must compare the query against every vector in the collection (and comparing all vectors pairwise is O(n^2)), so the industry developed Approximate Nearest Neighbor (ANN) algorithms. An ANN algorithm pre-computes a vector index, trading space for time, which greatly speeds up similarity search. This is similar to the index in traditional databases.
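To make the space-for-time trade concrete, here is a small sketch using the faiss library (my choice for illustration; the post does not name a specific ANN implementation). The flat index performs an exact brute-force scan, while the HNSW index answers the same query approximately by walking a pre-built graph:

import faiss
import numpy as np

d = 1536  # e.g. the dimensionality of text-embedding-ada-002 vectors
xb = np.random.random((100_000, d)).astype("float32")  # the vector collection
xq = np.random.random((1, d)).astype("float32")        # a query vector

# Exact search: compares the query against all 100k vectors
flat = faiss.IndexFlatL2(d)
flat.add(xb)
D_exact, I_exact = flat.search(xq, 10)

# ANN search: HNSW pre-builds a graph index (trading space and build time),
# then answers queries by traversing the graph instead of scanning everything
hnsw = faiss.IndexHNSWFlat(d, 32)  # 32 = graph neighbors per node
hnsw.add(xb)
D_ann, I_ann = hnsw.search(xq, 10)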
Therefore, vector databases offer not only strong performance but also excellent ease of use, making them a perfect match for LLMs! (Really?)
Perhaps a general-purpose database would be better?
We’ve talked about the advantages and benefits of vector databases, but what are their limitations? A blog post by SingleStore provides a good answer to this question:
Vectors and vector search are a data type and query processing approach, not a foundation for a new way of processing data. Using a specialty vector database (SVDB) will lead to the usual problems we see (and solve) again and again with our customers who use multiple specialty systems: redundant data, excessive data movement, lack of agreement on data values among distributed components, extra labor expense for specialized skills, extra licensing costs, limited query language power, programmability and extensibility, limited tool integration, and poor data integrity and availability compared with a true DBMS.
There are two issues that I think are important. The first is data consistency. During the prototyping phase, a vector database is a great fit, and ease of use matters more than anything else. However, a vector database is an independent system, completely decoupled from the other places where data is stored, such as OLTP databases and OLAP data lakes. As a result, data needs to be synchronized, streamed, and processed across multiple systems.
Imagine that your data already lives in an OLTP database such as PostgreSQL. To perform vector search with an independent vector database, you first need to extract the data from the database, convert each data point into a vector using a service such as the OpenAI Embedding API, and then synchronize it to the dedicated vector database. This adds a lot of complexity.
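As a rough sketch of what this synchronization pipeline looks like (the documents table, its schema, the connection strings, and Weaviate as the target vector database are all assumptions for illustration):

import openai
import psycopg2
import weaviate

conn = psycopg2.connect("dbname=app user=app")      # assumed OLTP connection
client = weaviate.Client("http://localhost:8080")   # assumed vector DB endpoint

with conn.cursor() as cur:
    # Extract the source-of-truth rows from PostgreSQL
    cur.execute("SELECT id, content FROM documents")  # hypothetical table
    for row_id, content in cur.fetchall():
        # Convert each row into a vector with the OpenAI Embedding API
        emb = openai.Embedding.create(
            input=content,
            model="text-embedding-ada-002",
        )["data"][0]["embedding"]
        # Copy the row plus its vector into the separate vector database;
        # every later INSERT/UPDATE/DELETE in PostgreSQL must be re-synced
        client.data_object.create(
            data_object={"pg_id": row_id, "content": content},
            class_name="Document",
            vector=emb,
        )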