Looking for the right words
Ever had that moment when you’re struggling to remember something, and then suddenly—the perfect word or phrase pops into your head, unlocking a flood of memories?
Just as using the right words can help you remember what you need, sparse search focuses on precise keywords to return the documents most relevant to your query.
Pinecone is excited to introduce pinecone-sparse-english-v0, our new proprietary sparse retrieval model that strikes the perfect balance between high-quality search results and the low latency required for production environments.
In this article, we’ll take you on a journey through the evolution of sparse models, from basic keyword matching to sophisticated retrieval models. Through this, we’ll discuss advancements and decisions that motivate the best-in-class performance of pinecone-sparse. By the end, you’ll have a clear understanding of how to implement sparse retrieval in your own applications using Pinecone.
In a hurry? Skip to the bottom for code samples and best practices to start using pinecone-sparse-english-v0 right away!
Defining Sparse Retrieval
In keyword search, you pass a string of keywords to a search bar and get back a set of documents that overlap with the keywords in your query. That overlap tends to be a pretty good estimate of how relevant those documents are to what you're looking for, and it works great in situations where you need exact matches on proper nouns, technical terms, brands, serial numbers, and the like.
For example, you might ask:
“How do I use Pinecone Assistant?”
and get back articles titled:
“Use Pinecone Assistant in your Project”.
However, how often a word appears in a document doesn’t reliably indicate its relevance to a specific query.
Consider the word “data” as an example. This term appears frequently in technical documents about data visualization, data science, and data wrangling—but its relevance varies significantly depending on the context and the specific query.
If someone searches for “data visualization techniques,” documents containing many instances of “data wrangling” might be less relevant than those with fewer mentions but in a visualization context. This frequency-based approach fails to capture the semantic relationship between the query and the document content.
So, we need a more sophisticated way of distinguishing "low information" words from "high information" ones.
Generalizing the structure of information retrieval
To better understand how to iterate on sparse retrieval, let’s describe the task at hand generally.
We need the following:
- a way to transform input queries into vectors, or numerical representations of those words
- a way to transform input documents and a place to store them (in our case, a vector database)
- a way to score and return documents based on some estimate of relevance
We can think of the word-level scores as impact scores, and the sum of those impact scores as a measure of a document's relevance to the incoming query. Our goal is to find the best way to calculate these scores and return the most relevant documents to users.
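To make this concrete, here's a minimal sketch of that scoring scheme in Python, where each per-term impact score is just a raw count and relevance is the sum over shared terms. The tiny vocabulary and the to_sparse_vector helper are invented purely for illustration; this is not how pinecone-sparse-english-v0 computes its scores.

```python
# Illustrative only: vocabulary-length vectors with raw counts as impact scores.
vocabulary = ["pinecone", "assistant", "use", "project", "data", "visualization"]

def to_sparse_vector(text: str) -> list[float]:
    """Toy transform: the impact score of each vocabulary word is its raw count."""
    tokens = text.lower().split()
    return [float(tokens.count(word)) for word in vocabulary]

def relevance(query_vec: list[float], doc_vec: list[float]) -> float:
    """Relevance = sum of impact scores for terms shared by query and document."""
    return sum(q * d for q, d in zip(query_vec, doc_vec))

query = to_sparse_vector("use pinecone assistant")
document = to_sparse_vector("Use Pinecone Assistant in your project")
print(relevance(query, document))  # 3.0, since three vocabulary terms overlap
```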
Now, if each query and document is represented by a vector whose length is the entire vocabulary, we’ll end up with a load of zeros in our representation.
This is why sparse embeddings are called sparse: they tend to contain mostly zero values. Storing all of those zeros is wasteful, so these embeddings are typically stored in an inverted index, which maps each token to the documents it appears in, along with its score in each.
An example of how words are stored in an inverted index
Now, the search mechanism is super easy! Look up each query token, gather the documents and scores it maps to, sum the scores per document, and return the highest-scoring documents. This is far cheaper than scanning every document in the collection.
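As a rough sketch of that idea (the tokens, documents, scores, and the search helper below are all made up for illustration), an inverted index can be as simple as a dictionary mapping each token to the documents it appears in and its impact score in each:

```python
from collections import defaultdict

# Toy inverted index: token -> list of (doc_id, impact_score) pairs.
inverted_index = {
    "pinecone":  [("doc1", 1.2), ("doc3", 0.8)],
    "assistant": [("doc1", 1.5)],
    "data":      [("doc2", 0.3), ("doc3", 0.4)],
}

def search(query_tokens: list[str], top_k: int = 2) -> list[tuple[str, float]]:
    """Look up only the query's tokens and sum their scores per document."""
    doc_scores: dict[str, float] = defaultdict(float)
    for token in query_tokens:
        for doc_id, score in inverted_index.get(token, []):
            doc_scores[doc_id] += score
    return sorted(doc_scores.items(), key=lambda item: item[1], reverse=True)[:top_k]

print(search(["pinecone", "assistant"]))  # [('doc1', 2.7), ('doc3', 0.8)]
```

Because only the query's tokens are looked up, the work scales with the length of the query and its posting lists rather than with the size of the whole collection.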
Hold on, what about dense search?
At this point, if you are already familiar with semantic search (and large language models), you might be scratching your head. Isn’t it better, you think, to implement dense search using an embedding model and avoid all this pesky optimization on sparse keywords and tokens and whatnot?
To review, semantic search typically involves using an embedding model that transforms input text into fixed-size embeddings. The idea here is you measure similarity by looking at how close these vectors are in vector space.
So, instead of calculating impact scores on a word level and summing them to return relevant documents, we represent the whole query and document (or document piece) with a fixed-size vector and directly calculate their distance in vector space.
An example of how sentences cluster in vector space across languages
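To make "distance in vector space" concrete, here's a small sketch using cosine similarity, one common similarity measure. The four-dimensional embeddings below are made-up numbers standing in for the output of an embedding model; real embeddings typically have hundreds or thousands of dimensions.

```python
import math

# Toy fixed-size "dense embeddings"; in practice these come from an embedding
# model, and the values here are invented for illustration.
query_embedding = [0.12, -0.45, 0.33, 0.08]
doc_embedding   = [0.10, -0.40, 0.30, 0.05]

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Closeness in vector space: near 1.0 = same direction, near 0.0 = unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity(query_embedding, doc_embedding))  # close to 1.0
```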
There are a few subtle differences here:
- First, the vectors created by these embedding models are not directly interpretable, nor are they mappable to specific words or phrases like sparse embeddings are
- Second, these tend to be better at describing the entirety of the meaning of a query or document without a strong condition on the words being used
Because of this, dense embeddings are great for when input queries may not overlap with document sets at all, such as when searching help manuals or code documentation as a new user, or for cross-lingual or multilingual applications, where traditional keyword search isn’t really possible.
However, we lose that nice high-precision quality when the exact keywords matter a lot. If you pass short queries, or queries that are simply document titles, to a semantic search application built on dense embeddings, you might not get the document you are looking for at all! Sparse embeddings, by contrast, will ensure that at least those tokens appear in the returned results.
Most production implementations of semantic search combine some form of dense and sparse search to cover the full range of possible queries.
Coming back to sparse search, there are a few methodologies developed over the years that can make searching sparse embeddings more fruitful.
A Brief History of Sparse Search
TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a simple method for weighting terms within documents.
Luckily, we can understand how TF-IDF works just by interpreting its name literally!
We measure how often a term occurs within a given document (term frequency), relative to how often it occurs across all documents (document frequency), and take the inverse of the latter. We take the inverse because terms that appear in nearly every document, like "the" or "data," carry little information; down-weighting them lets rarer, more distinctive terms drive the score.
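As a sketch, here's a from-scratch TF-IDF calculation in Python using the common log-scaled variant of IDF. The toy corpus is invented for illustration, and production libraries (scikit-learn's TfidfVectorizer, for example) add smoothing and normalization on top of this basic idea.

```python
import math
from collections import Counter

# Toy corpus; each "document" is just a short string.
documents = [
    "data visualization techniques for data dashboards",
    "data wrangling and data cleaning with pandas",
    "visualization of geographic data",
]
tokenized = [doc.lower().split() for doc in documents]

def tf(term: str, doc_tokens: list[str]) -> float:
    """Term frequency: how often the term appears in this document."""
    return Counter(doc_tokens)[term] / len(doc_tokens)

def idf(term: str, corpus: list[list[str]]) -> float:
    """Inverse document frequency: terms found in fewer documents score higher."""
    doc_count = sum(1 for doc in corpus if term in doc)
    return math.log(len(corpus) / doc_count) if doc_count else 0.0

def tf_idf(term: str, doc_tokens: list[str], corpus: list[list[str]]) -> float:
    return tf(term, doc_tokens) * idf(term, corpus)

# "data" appears in every document, so its IDF (and thus its TF-IDF) drops to
# zero here, while the rarer "visualization" keeps a positive weight.
print(tf_idf("data", tokenized[0], tokenized))
print(tf_idf("visualization", tokenized[0], tokenized))
```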