10 Comments

  • Post Author
    nitinreddy88
    Posted March 3, 2025 at 4:46 am

    Any comparison results in terms of performance vs accuracy with: https://github.com/paradedb/paradedb/tree/dev/pg_search

  • Post Author
    emilsedgh
    Posted March 3, 2025 at 4:49 am

    This looks very neat. In our company, for the past 10 years I've bet heavily on Postgres and have not allowed any other databases to be used. Stuff like this makes me hope I can continue to do so as our database grows.

    But, it appears that we are hitting our limits at this point. We have a table with tens of millions of rows and the use case is that it's faceted search.

    Something like Zillow, with tens of millions of listings and tens of columns that can be queried. I'm having a tough time scaling it up. Does anyone have experience building something like that in pg?

  • Post Author
    thund
    Posted March 3, 2025 at 5:54 am
  • Post Author
    rpcope1
    Posted March 3, 2025 at 6:11 am

    I'd be interested to see how this stacks up against Manticore or Meilisearch.

  • Post Author
    siquick
    Posted March 3, 2025 at 6:30 am

    Can this be used on AWS RDS? I’ve seen a few things like this that would be great to use but without RDS support they’re unusable for us.

  • Post Author
    jszymborski
    Posted March 3, 2025 at 6:32 am

    > Our next step is to fully decouple the tokenization process, transforming it into an independent and extensible extension. This will enable us to support multiple languages, allow users to customize tokenization for better results, and even incorporate advanced features like synonym handling.

    This would be rad

  • Post Author
    jankovicsandras
    Posted March 3, 2025 at 10:00 am

    This looks cool!

    Shameless plug:

    https://github.com/jankovicsandras/plpgsql_bm25

    BM25 search implemented in PL/pgSQL; might be useful if one can't use Rust extensions with Postgres, e.g. hosted Postgres without admin rights.

  • Post Author
    jillesvangurp
    Posted March 3, 2025 at 10:24 am

    There's more to Elasticsearch/OpenSearch than just BM25. It's a Swiss Army knife of the stuff you need to implement search for all sorts of things, most of which is missing in action in solutions like this.

    The article mentions stop words, which is an outdated and primitive strategy for dealing with frequently used words that add a lot of overhead to your search. These days, the advice with Elasticsearch is actually not to rely on lists of stop words and to avoid using them entirely.

    Reason: the search engine is smart enough to handle very common words like "to", "not", "be", and "or" efficiently. BM25 relies on term frequencies for scoring, so these words would have very high term frequencies. The "match" query uses some optimizations that avoid most of the overhead of juggling a lot of high-frequency term matches and scoring a lot of matches that would score very low. If you are only going to look at five results, scoring 500,000 documents that potentially match a query containing a stop word is really expensive (because you need to score each document and then sort the results). But if you prioritize scoring low-frequency terms first, you can narrow down the result list considerably and avoid scoring most of the documents.
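    A toy calculation makes the term-frequency point concrete. This is a sketch of the standard Okapi BM25 per-term formula with made-up corpus numbers, not how Elasticsearch/Lucene implements scoring internally:

    ```python
    import math

    def bm25_term_score(tf, df, N, doc_len, avg_len, k1=1.2, b=0.75):
        """Okapi BM25 contribution of a single term to a document's score:
        IDF (rarer terms weigh more) times a saturated term-frequency part."""
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        return idf * tf_part

    N = 1_000_000  # documents in a hypothetical corpus

    # A stop word like "to" appears in nearly every document (df ~ N),
    # so its IDF is almost zero even at high in-document frequency.
    common = bm25_term_score(tf=10, df=990_000, N=N, doc_len=100, avg_len=100)

    # A rare term appearing in only 50 documents dominates the score.
    rare = bm25_term_score(tf=2, df=50, N=N, doc_len=100, avg_len=100)
    ```

    With these made-up numbers the stop word contributes roughly 0.02 while the rare term contributes roughly 13.6, hundreds of times more, which is why scoring can safely start from the low-frequency terms.
    
    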

    Not filtering out stop words means that if you are indexing the works of Shakespeare, you'd actually be able to find the phrase "to be or not to be", which is entirely made up of stop words. Classic edge case to test for. Just an example of one of many optimizations that you'd find in Lucene based search systems that improve both precision and recall metrics as well as performance. There are a lot of wheels that need to be reinvented on the postgresql side to get to that level. It's not quite as simple as "apply bm25 for each hit and sort".

    Solutions like the one discussed in the article provide some of the low-level primitives, but without most of the high-level abstractions and optimizations you'd need to build a proper search experience that is both accurate and fast. Fine if you know what you are doing and can compensate for that, but probably not a great starting point if that's not the case.

    And if you are thinking that it's convenient to index your database model, you'd be well advised to read up on ETL and the notion of optimizing what you index for search, rather than for simple storage. In most more sophisticated search systems what's indexed for search is not the same as what you put in your database and there are some good reasons for that.

    Even if you do want to use postgres for search, you'd be well advised to consider setting up two completely separate database clusters. One for storing your data and one for indexing and searching through your data. The ETL pipeline in between those is where most of the magic happens. And you typically don't want that on the critical path of simple CRUD operations because that magic tends to be expensive. Which is why you do it in an ETL pipeline so you can keep your writes and queries fast and allocate hardware resources as is appropriate.
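    As an illustration of the "T" step described above, here is a minimal sketch of transforming a normalized listings row into a flat, search-optimized document before loading it into the search cluster. All field names and the schema are invented for this example, not taken from any real system:

    ```python
    # Hypothetical "transform" step of an ETL pipeline feeding a search index.
    def to_search_doc(listing: dict, agent: dict, neighborhood: dict) -> dict:
        """Denormalize a listing plus its related rows into one flat document.
        The joins and enrichment happen once at index time, not at query time."""
        return {
            "id": listing["id"],
            # Concatenate the fields users actually search on into one
            # analyzable text blob.
            "full_text": " ".join([
                listing["title"],
                listing["description"],
                neighborhood["name"],
            ]),
            # Precompute facet/filter fields so queries need no joins.
            "price_bucket": listing["price"] // 50_000 * 50_000,
            "agent_name": agent["name"],
            "neighborhood": neighborhood["name"],
        }

    doc = to_search_doc(
        {"id": 1, "title": "Sunny 2BR condo",
         "description": "Close to transit", "price": 420_000},
        {"name": "Jane Doe"},
        {"name": "Riverside"},
    )
    ```

    Because this work runs asynchronously in the pipeline, the writing side stays a plain CRUD insert and the search cluster only ever sees precomputed documents.
    
    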

    Basically the T in ETL (extract, transform, load) is where you do all the expensive stuff that you don't want to do when you are querying or interacting with your system and editing things. If you are Google 25 years ago, that's where you'd be calculating page rank, for example. These days they do quite a bit more work. Let's just say that BM25 doesn't quite cut it if you are Google these days. Vector search is a classic example of something that's fairly expensive as well that you don't want to block on while you are writing to a database or file system (e.g. when crawling the web). Doing things asynchronously and decoupling them via queues, databases, filesystems, etc. is basically what data engineering is all about.

    Of course if you are setting up two systems anyway (a db, and a search index), you might want to consider using something that has all the right tools for the job. Which just isn't postgresql unless your ambition level is relatively low and search is a low priority thing for you. Whether you can afford to be naive and unambitious is ultimately a business choice. But you should be considering what your competitors do and what the consequences are of not being quite as good.

    A lot of my consulting clients end up talking to me because they figure out they are not as good as they should be. I've helped transition several clients from naive postgresql based systems to Opensearch recently. Performance issues you can usually solve by throwing hardware at a problem. But quality issues require using better tools and knowing what to use.

  • Post Author
    __jl__
    Posted March 3, 2025 at 11:14 am

    How does this compare with pg_search (formerly pg_bm25) from ParadeDB?

  • Post Author
    ngrilly
    Posted March 3, 2025 at 11:39 am

    Awesome! How does it compare to RUM, another PostgreSQL extension solving a similar problem (make ranking fast by putting enough info in the index)?


© 2025 HackTech.info. All Rights Reserved.
