
The best way to use text embeddings portably is with Parquet and Polars by minimaxir

13 Comments

  • whinvik
    Posted February 24, 2025 at 6:46 pm

    Since we are talking about an embedded solution, shouldn't the benchmark be something like SQLite with a vector extension, or LanceDB?

  • thelastbender12
    Posted February 24, 2025 at 7:41 pm

    This is pretty neat.

    IMO a hindrance to this was the lack of built-in fixed-size list array support in the Arrow format until recently. Some implementations/clients supported it, while others didn't. Otherwise, it could have been used as the default storage format for NumPy arrays and torch tensors, too.

    (You could always store arrays as variable-length list arrays with fixed strides and handle the conversion.)
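
    For illustration, a rough pyarrow sketch of the two layouts (dimensions and data hypothetical):

      import numpy as np
      import pyarrow as pa

      dim = 384
      embeddings = np.random.rand(10, dim).astype(np.float32)

      # Fixed-size list layout: every value is exactly `dim` floats, so the
      # values buffer is one contiguous block, like an (n, dim) numpy matrix.
      fixed = pa.FixedSizeListArray.from_arrays(pa.array(embeddings.ravel()), dim)

      # The older workaround: a variable-length list array with a constant stride.
      variable = pa.array(embeddings.tolist(), type=pa.list_(pa.float32()))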

  • banku_brougham
    Posted February 24, 2025 at 8:00 pm

    Is your example of a float32 number correct, taking 24 ASCII chars to represent? I had thought single precision would be 7 digits plus the exponent, sign, and exponent sign, something like 7+2+1+1 = 11 chars of ASCII, rather than the 24 you mentioned?
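
    (For reference, a quick NumPy sketch to check that arithmetic: float32 actually needs up to 9 significant digits to round-trip exactly, so the worst case is around 15 chars, still well short of 24.)

      import numpy as np

      # Shortest round-trip decimal form of a near-worst-case float32:
      # up to 9 significant digits + sign + point + 'e' + signed exponent.
      x = np.float32(-1.1754944e-38)
      s = np.format_float_scientific(x, unique=True)
      print(s, len(s))  # -1.1754944e-38 -> 14 characters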

  • banku_brougham
    Posted February 24, 2025 at 8:20 pm

    Really cool article, I've enjoyed your work for a long time. You might add a note for those jumping into a sqlite implementation that duckdb reads parquet and recently launched a few vector similarity functions which cover this use-case perfectly:

    https://duckdb.org/2024/05/03/vector-similarity-search-vss.h…
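
    A minimal sketch of what that looks like from Python (file name, column names, and the 384-dim width are hypothetical; array_cosine_similarity is one of the functions the linked post covers):

      import duckdb
      import numpy as np

      query = np.random.rand(384).astype(np.float32)  # hypothetical query vector

      # DuckDB scans the Parquet file directly; casting to FLOAT[384] yields the
      # fixed-size ARRAY type that array_cosine_similarity operates on.
      results = duckdb.execute(
          """
          SELECT id,
                 array_cosine_similarity(embedding::FLOAT[384],
                                         $q::FLOAT[384]) AS sim
          FROM 'embeddings.parquet'
          ORDER BY sim DESC
          LIMIT 5
          """,
          {"q": query.tolist()},
      ).fetchall()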

  • kernelsanderz
    Posted February 24, 2025 at 8:20 pm

    For another library with great performance and features like full-text indexing and the ability to version changes, I'd recommend lancedb: https://lancedb.github.io/lancedb/

    Yes, it's a vector database and has more complexity. But you can use it without creating indexes, and it also has excellent zero-copy Arrow support for polars and pandas.
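
    A minimal sketch of that flow (table name, column names, and dimensions hypothetical):

      import lancedb
      import numpy as np

      db = lancedb.connect("./embeddings.lance")  # hypothetical path
      data = [
          {"id": i, "vector": np.random.rand(384).astype(np.float32)}
          for i in range(1000)
      ]
      tbl = db.create_table("posts", data=data)

      # Brute-force search works without building an index first.
      query = np.random.rand(384).astype(np.float32)
      hits = tbl.search(query).limit(5).to_pandas()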

  • stephantul
    Posted February 24, 2025 at 8:22 pm

    Check out Unum’s usearch. It beats everything, and is super easy to use. It just does exactly what you need.

    https://github.com/unum-cloud/usearch
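
    A minimal sketch (dimensions and data hypothetical):

      import numpy as np
      from usearch.index import Index

      index = Index(ndim=384, metric="cos")  # cosine distance

      keys = np.arange(1000)
      vectors = np.random.rand(1000, 384).astype(np.float32)
      index.add(keys, vectors)  # batch insert

      query = np.random.rand(384).astype(np.float32)
      matches = index.search(query, 10)  # top-10 nearest neighbors
      print(matches.keys, matches.distances)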

  • kipukun
    Posted February 24, 2025 at 8:34 pm

    To the second footnote: you could utilize Polars' LazyFrame API to do that cosine similarity in a streaming fashion for large files.
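
    Something like this sketch, assuming the embeddings live in a list-of-float32 column named "embedding" and both sides are unit-normalized (file name and dimensions hypothetical; the exact streaming flag varies by Polars version):

      import numpy as np
      import polars as pl

      query = np.random.rand(384).astype(np.float32)  # hypothetical query vector
      query /= np.linalg.norm(query)

      def batch_dot(s: pl.Series) -> pl.Series:
          # Each batch of the list column becomes an (n, dim) matrix; dot
          # products against a unit query are cosine similarities when the
          # stored embeddings are unit-normalized too.
          mat = np.asarray(s.to_list(), dtype=np.float32)
          return pl.Series(mat @ query)

      top = (
          pl.scan_parquet("embeddings.parquet")  # hypothetical file
          .with_columns(pl.col("embedding").map_batches(batch_dot).alias("sim"))
          .sort("sim", descending=True)
          .head(5)
          .collect(engine="streaming")  # collect(streaming=True) on older Polars
      )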

  • jtrueb
    Posted February 24, 2025 at 8:51 pm

    Polars + Parquet is awesome for portability and performance. This post focused on Python portability, but Polars has an easy-to-use Rust API for embedding the engine all over the place.

  • robschmidt90
    Posted February 24, 2025 at 9:11 pm

    Nice read. I agree that for a lot of hobby use cases you can just load the embeddings from parquet and compute the similarities in-memory.

    To find similarity between my blogposts [1] I wanted to experiment with a local vector database and found ChromaDB fairly easy to use (similar to SQLite, just a file on your machine).

    [1] https://staticnotes.org/posts/how-recommendations-work/
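
    For anyone curious, a minimal sketch of that setup (path, names, and vectors hypothetical):

      import chromadb

      # Persisted to disk at the given path, much like a SQLite file.
      client = chromadb.PersistentClient(path="./chroma")
      collection = client.get_or_create_collection("blogposts")

      collection.add(
          ids=["post-1", "post-2"],
          documents=["How recommendations work", "Parquet and Polars for embeddings"],
          embeddings=[[0.1] * 384, [0.2] * 384],  # precomputed embeddings
      )

      results = collection.query(query_embeddings=[[0.1] * 384], n_results=2)
      print(results["ids"])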

  • jononor
    Posted February 24, 2025 at 10:00 pm

    At 33k items, in-memory search is quite fast: 10 ms is very responsive. Scaling linearly on the same hardware, 10x the items (330k) would take around 100 ms, and a few million items would approach 1 second. That might be too slow for some applications (but not all). Especially if one just retrieves a rather small number of matches, an index will help a lot for 100k++ datasets.

  • noahbp
    Posted February 24, 2025 at 10:01 pm

    Wow! How much did this cost you in GPU credits? And did you consider using your MacBook?

  • rcarmo
    Posted February 24, 2025 at 10:08 pm

    I'm a huge fan of polars, but I hadn't considered using it to store embeddings in this way (I've been fiddling with sqlite-vec). Seems like an interesting idea indeed.

  • thomasfromcdnjs
    Posted February 24, 2025 at 10:22 pm

    Lots of great findings.

    I'm curious if anyone knows whether it is better to pass structured or unstructured data to embedding APIs? If I ask ChatGPT, it says it is better to send unstructured data. (Looking at the author's GitHub, it looks like he generated embeddings from JSON strings.)

    My use case is jsonresume: I am creating embeddings by sending full JSON versions as strings, but I've been experimenting with using models to translate resume.json files into full-text versions first before creating embeddings. The results seem to be better, but I haven't seen any concrete opinions on this.

    My understanding is that unstructured data is better because natural language carries the textual/semantic meaning, i.e.

      skills: ['Javascript', 'Python']
    

    is worse than:

      Thomas excels at Javascript and Python
    

    Another question: what if the search query were also a JSON embedding? JSON <> JSON embeddings could also be great?
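
    (A tiny sketch of that flattening experiment; the field names, helper, and embedding call are all hypothetical:)

      import json

      resume = json.loads('{"name": "Thomas", "skills": ["Javascript", "Python"]}')

      def resume_to_text(r: dict) -> str:
          # Turn structured fields into a natural-language sentence before
          # embedding, instead of embedding the raw JSON string.
          return f"{r['name']} excels at {' and '.join(r['skills'])}."

      text = resume_to_text(resume)  # "Thomas excels at Javascript and Python."
      # embedding = model.encode(text)  # hypothetical embedding call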
