10 Comments

  • Post Author
    nitinreddy88
    Posted March 3, 2025 at 4:46 am

    Any comparison results in terms of performance vs accuracy with: https://github.com/paradedb/paradedb/tree/dev/pg_search

  • Post Author
    emilsedgh
    Posted March 3, 2025 at 4:49 am

    This looks very neat. In our company, for the past 10 years I've bet heavily on Postgres and have not allowed any other databases to be used. Stuff like this makes me hope I can continue to do so as our database grows.

    But, it appears that we are hitting our limits at this point. We have a table with tens of millions of rows and the use case is that it's faceted search.

    Something like Zillow, with tens of millions of listings and tens of columns that can be queried. I'm having a tough time scaling it up. Does anyone have experience building something like that in pg?

  • Post Author
    thund
    Posted March 3, 2025 at 5:54 am
  • Post Author
    rpcope1
    Posted March 3, 2025 at 6:11 am

    I'd be interested to see how this stacks up against Manticore or Meilisearch.

  • Post Author
    siquick
    Posted March 3, 2025 at 6:30 am

    Can this be used on AWS RDS? I’ve seen a few things like this that would be great to use but without RDS support they’re unusable for us.

  • Post Author
    jszymborski
    Posted March 3, 2025 at 6:32 am

    > Our next step is to fully decouple the tokenization process, transforming it into an independent and extensible extension. This will enable us to support multiple languages, allow users to customize tokenization for better results, and even incorporate advanced features like synonym handling.

    This would be rad

  • Post Author
    jankovicsandras
    Posted March 3, 2025 at 10:00 am

    This looks cool!

    Shameless plug:

    https://github.com/jankovicsandras/plpgsql_bm25

    BM25 search implemented in PL/pgSQL; might be useful if one can't use Rust extensions with Postgres, e.g. hosted Postgres without admin rights.

  • Post Author
    jillesvangurp
    Posted March 3, 2025 at 10:24 am

    There's more to Elasticsearch/OpenSearch than just BM25. It's a Swiss Army knife of the stuff you need to implement search for all sorts of things, most of which is missing in action in solutions like this.

    The article mentions stop words, which is an outdated and primitive strategy for dealing with frequently used words that add a lot of overhead to your search. These days, the advice with Elasticsearch is actually not to rely on lists of stop words and to avoid using them entirely.

    Reason: the search engine is smart enough to handle very common words like "to", "not", "be", and "or" efficiently. BM25 relies on term frequencies for scoring, so these words would have very high term frequencies. The "match" query uses some optimizations that avoid most of the overhead of juggling a lot of high-frequency term matches and scoring a lot of matches that would score very low. If you are only going to look at five results, scoring 500,000 documents that potentially match a query containing a stop word is really expensive (because you need to score each document and then sort the results). But if you prioritize scoring low-frequency terms first, you can narrow down the result list considerably and avoid scoring most of the documents.
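    A toy calculation makes the term-frequency point concrete. This is a sketch of the standard Okapi BM25 per-term formula with made-up corpus numbers, not how Elasticsearch/Lucene implements scoring internally:

    ```python
    import math

    def bm25_term_score(tf, df, N, doc_len, avg_len, k1=1.2, b=0.75):
        """Okapi BM25 contribution of a single term to a document's score:
        IDF (rarer terms weigh more) times a saturated term-frequency part."""
        idf = math.log((N - df + 0.5) / (df + 0.5) + 1)
        tf_part = tf * (k1 + 1) / (tf + k1 * (1 - b + b * doc_len / avg_len))
        return idf * tf_part

    N = 1_000_000  # documents in a hypothetical corpus

    # A stop word like "to" appears in nearly every document (df ~ N),
    # so its IDF is almost zero even at high in-document frequency.
    common = bm25_term_score(tf=10, df=990_000, N=N, doc_len=100, avg_len=100)

    # A rare term appearing in only 50 documents dominates the score.
    rare = bm25_term_score(tf=2, df=50, N=N, doc_len=100, avg_len=100)
    ```

    With these made-up numbers the stop word contributes roughly 0.02 while the rare term contributes roughly 13.6, hundreds of times more, which is why scoring can safely start from the low-frequency terms.
    
    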

    Not filtering out stop words means that if you are indexing the works of Shakespeare, you'd actually be able to find the phrase "to be or not to be", which is entirely made up of stop words. Classic edge case to test for. Just an example of one of many optimizations that you'd find in Lucene based search systems that improve both precision and recall metrics as well as performance. There are a lot of wheels that need to be reinvented on the postgresql side to get to that level. It's not quite as simple as "apply bm25 for each hit and sort".

    Solutions like the one discussed in the article provide some of the low-level primitives, but without most of the high-level abstractions and optimizations you'd need to build a proper search experience that is both accurate and fast. Fine if you know what you are doing and can compensate for that, but probably not a great starting point if that's not the case.

    And if you are thinking that it's convenient to index your database model, you'd be well advised to read up on ETL and the notion of optimizing what you index for search, rather than for simple storage. In most more sophisticated search systems what's indexed for search is not the same as what you put in your database and there are some good reasons for that.

    Even if you do want to use postgres for search, you'd be well advised to consider setting up two completely separate database clusters. One for storing your data and one for indexing and searching through your data. The ETL pipeline in between those is where most of the magic happens. And you typically don't want that on the critical path of simple CRUD operations because that magic tends to be expensive. Which is why you do it in an ETL pipeline so you can keep your writes and queries fast and allocate hardware resources as is appropriate.
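    As an illustration of the "T" step described above, here is a minimal sketch of transforming a normalized listings row into a flat, search-optimized document before loading it into the search cluster. All field names and the schema are invented for this example, not taken from any real system:

    ```python
    # Hypothetical "transform" step of an ETL pipeline feeding a search index.
    def to_search_doc(listing: dict, agent: dict, neighborhood: dict) -> dict:
        """Denormalize a listing plus its related rows into one flat document.
        The joins and enrichment happen once at index time, not at query time."""
        return {
            "id": listing["id"],
            # Concatenate the fields users actually search on into one
            # analyzable text blob.
            "full_text": " ".join([
                listing["title"],
                listing["description"],
                neighborhood["name"],
            ]),
            # Precompute facet/filter fields so queries need no joins.
            "price_bucket": listing["price"] // 50_000 * 50_000,
            "agent_name": agent["name"],
            "neighborhood": neighborhood["name"],
        }

    doc = to_search_doc(
        {"id": 1, "title": "Sunny 2BR condo",
         "description": "Close to transit", "price": 420_000},
        {"name": "Jane Doe"},
        {"name": "Riverside"},
    )
    ```

    Because this work runs asynchronously in the pipeline, the writing side stays a plain CRUD insert and the search cluster only ever sees precomputed documents.
    
    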

    Basically the T in ETL (extract, transform, load) is where you do all the expensive stuff that you don't want to do when you are querying or interacting with your system and editing things. If you are Google 25 years ago, that's where you'd be calculating page rank, for example. These days they do quite a bit more work. Let's just say that BM25 doesn't quite cut it if you are Google these days. Vector search is a classic example of something that's fairly expensive as well that you don't want to block on while you are writing to a database or file system (e.g. when crawling the web). Doing things asynchronously and decoupling them via queues, databases, filesystems, etc. is basically what data engineering is all about.

    Of course if you are setting up two systems anyway (a db, and a search index), you might want to consider using something that has all the right tools for the job. Which just isn't postgresql unless your ambition level is relatively low and search is a low priority thing for you. Whether you can afford to be naive and unambitious is ultimately a business choice. But you should be considering what your competitors do and what the consequences are of not being quite as good.

    A lot of my consulting clients end up talking to me because they figure out they are not as good as they should be. I've helped transition several clients from naive postgresql based systems to Opensearch recently. Performance issues you can usually solve by throwing hardware at a problem. But quality issues require using better tools and knowing what to use.

  • Post Author
    __jl__
    Posted March 3, 2025 at 11:14 am

    How does this compare with pg_search (formerly pg_bm25) from ParadeDB?

  • Post Author
    ngrilly
    Posted March 3, 2025 at 11:39 am

    Awesome! How does it compare to RUM, another PostgreSQL extension solving a similar problem (make ranking fast by putting enough info in the index)?


© 2025 HackTech.info. All Rights Reserved.
