In Part 1, we delved into the capabilities of PostgreSQL’s full-text search and explored how advanced search features such as relevancy boosters, typo-tolerance, and faceted search can be implemented. In this part, we’ll compare it with Elasticsearch.
First, let’s note that Postgres and Elasticsearch are generally not in competition with each other. In fact, it’s very common to see them together in architecture diagrams, often in a configuration like this:
In this architecture, the source of truth for the data lives in Postgres, which serves the transactional CRUD operations. The data is continuously synced to Elasticsearch, either via something like Postgres logical replication events (change-data-capture) or by the application itself via custom code. During this data replication, denormalization might be required. The search functionality, including facets and aggregations, is served from Elasticsearch.
While this architecture is common for very good reasons, it does come with a few challenges:
- Dealing with two types of stores means more operational burden and higher infrastructure costs.
- Keeping the data in sync is more challenging than you might think. I’m planning a dedicated blog post about this problem because it’s quite interesting. Let’s just say it’s pretty hard to get completely right.
- The data replication is, at best, near real-time, meaning that there can be consistency issues in the search service.
Point 2 is generally solvable with engineering effort and careful, dedicated code. Among existing tools, PGSync is an open-source project that aims to solve specifically this problem. ZomboDB is an interesting Postgres extension that tackles point 2 (and, I think, partially point 3) by controlling and querying Elasticsearch through Postgres. I haven’t yet tried either of these projects, so I can’t comment on their trade-offs, but I wanted to mention them.
And yes, a data platform like Xata solves most of points 1 and 2, by taking that complexity and offering it as a service, together with other goodies.
That said, if the Postgres full-text search functionality is enough for your use case, making use of it promises to significantly simplify your architecture and application. In this version, Postgres serves both the CRUD app needs and the full-text search needs:
This means you don’t need to operate two types of stores: no more data replication, no more denormalization, no more eventual consistency. The search engine built into Postgres happens to support ACID transactions, joins between tables, constraints (e.g. not null or unique), referential integrity (foreign keys), and all the other Postgres goodies that make application development simpler.
Therefore, it’s no wonder the Hacker News thread for our part 1 blog post had a lively discussion about the pros and cons of this approach. Can we go for the Postgres-only solution, or does the best-tool-for-the-job argument win?
We’re going to compare the convenience, search relevancy, performance, and scalability of the two options.
As we showed in part 1, you can replicate a lot of the Elasticsearch functionality in Postgres, even more advanced things like relevancy boosters, typo-tolerance, suggesters/autocomplete, or semantic/vector search. However, it’s not always straightforward.
One example where it’s not so simple is typo-tolerance (called fuzziness in Elasticsearch). It’s not available out of the box in Postgres, but you can implement it with the following steps:
- index all lexemes (words) from all documents in a separate table
- for each word in the query, use similarity or Levenshtein distance to search in this table
- modify the search query to include any words that are found
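The steps above can be sketched outside the database. Here is a minimal Python illustration, assuming the lexeme table has already been exported into a set (in practice it would be a Postgres table populated from `ts_stat()`, queried with similarity or Levenshtein functions):

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two words."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def expand_query(query: str, lexemes: set[str], max_distance: int = 1) -> list[list[str]]:
    """For each word in the query, find every indexed lexeme within the
    allowed edit distance (steps 2 and 3 of the outline above)."""
    expanded = []
    for word in query.split():
        matches = [lex for lex in lexemes if levenshtein(word, lex) <= max_distance]
        # Fall back to the original word if nothing is close enough.
        expanded.append(matches or [word])
    # Each inner list would become an OR group in the final tsquery.
    return expanded

lexemes = {"biscuit", "basket", "butter", "sugar"}
print(expand_query("biscait butter", lexemes))
```

The names `expand_query` and the sample lexeme set are made up for illustration; a production version would push the distance computation into SQL (e.g. via the `fuzzystrmatch` extension) rather than scanning the whole lexeme list in the application.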
While the above is quite doable, in dedicated search engines like Elasticsearch, you can enable typo-tolerance with a simple flag:
// POST /recipes/_search
{
  "query": {
    "multi_match": {
      "query": "biscaits",
      "fuzziness": 1
    }
  }
}
The default ranking algorithm for keyword search in Elasticsearch is BM25. With the release of Elasticsearch 5.0 in 2016, it dethroned TF-IDF as the default. Postgres supports neither, mainly because its ranking functions (explained here) don’t have access to the global word frequency data that these algorithms require. To see how relevant (pun intended) or not so relevant that might be, let’s look at the ranking functions and algorithms, from simple to complex:
- `ts_rank` (Postgres function) – ranks based on the term frequency. In other words, it does the “TF” (term frequency) part of TF-IDF. The principle is that if you are searching for a word, the more often that word shows up in the matching document, the higher the score. In addition to using simple TF, Postgres provides ways to normalize the term frequency into a score. For instance, one approach is to divide it by the document length.
- `ts_rank_cd` (Postgres function) – rank + cover density. In addition to the term frequency, this function also takes into account the “cover density”, meaning the proximity of the terms in the document.
- TF-IDF – term frequency + inverse document frequency. In addition to the term frequency, this algorithm “penalizes” words that are very common in the overall data set. So if the word “egg” matches, but that word is super common because we have a recipes dataset, it is valued less compared to other words in the query.
- BM25 – this algorithm is based on a probabilistic model of relevancy. While the TF-IDF formula is mostly based on intuition and practical experiments, BM25 is the result of more formal mathematical research. If you’re curious about the said mathematical research, I recommend this talk that makes it accessible. Interestingly, the resulting BM25 formula is not all that different from TF-IDF, but it incorporates a couple more concepts: frequency saturation and document length. Ultimately, this gives better results over a wider range of document types.
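To make the differences concrete, here is a toy Python version of both formulas. This is not Elasticsearch’s actual implementation; the tiny corpus and the BM25 constants (`k1=1.2`, `b=0.75`, which happen to be common defaults) are assumptions for illustration only:

```python
import math

# Toy corpus: each document is a list of words.
docs = [
    "egg flour sugar butter egg".split(),
    "egg salad lettuce".split(),
    "chocolate chip biscuit butter sugar".split(),
]
N = len(docs)
avg_len = sum(len(d) for d in docs) / N

def idf(term: str) -> float:
    """Inverse document frequency: terms found in fewer documents score higher."""
    df = sum(term in d for d in docs)
    return math.log((N - df + 0.5) / (df + 0.5) + 1)  # smoothed, BM25-style IDF

def tf_idf(term: str, doc: list[str]) -> float:
    """Plain TF-IDF: the score grows linearly with term frequency."""
    return doc.count(term) * idf(term)

def bm25(term: str, doc: list[str], k1: float = 1.2, b: float = 0.75) -> float:
    """BM25 adds frequency saturation (k1) and document-length normalization (b)."""
    tf = doc.count(term)
    norm = 1 - b + b * len(doc) / avg_len
    return idf(term) * tf * (k1 + 1) / (tf + k1 * norm)

# "egg" appears in most documents, so IDF values it less than "biscuit",
# even though "egg" occurs twice in the first document.
for term in ("egg", "biscuit"):
    print(term, [round(bm25(term, d), 3) for d in docs])
```

Note how the saturation term `tf * (k1 + 1) / (tf + k1 * norm)` flattens out as `tf` grows, whereas plain TF-IDF keeps increasing linearly; that is the “frequency saturation” mentioned above.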
There’s no question that BM25 is a more advanced relevancy algorithm than what `ts_rank` or `ts_rank_cd` use. BM25 uses more input signals, it’s based on better heuristics, and it typically doesn’t require tuning.
One practical effect of BM25 is that it automatically penalizes the very common words (”the”, “in”, “or”, etc.), also called “stop words”, which means that th