Text embeddings, particularly modern embeddings generated from large language models, are one of the most useful applications to come out of the generative AI boom. An embedding is a list of numbers that represents an object: in the case of text embeddings, they can represent words, sentences, or full paragraphs and documents, and they do so with a surprising amount of distinctiveness.
Recently, I created text embeddings representing every distinct Magic: the Gathering card released as of the February 2025 Aetherdrift expansion: 32,254 in total. With these embeddings, I can find the mathematical similarity between cards through the encoded representation of their card design, including all mechanical attributes such as the card name, card cost, card text, and even card rarity.

The iconic Magic card Wrath of God, along with its top four most similar cards identified using their respective embeddings. The similar cards are valid matches, with similar card text and card types.
Additionally, I can create a fun 2D UMAP projection of all those cards, which also identifies interesting patterns:

The UMAP dimensionality reduction process also implicitly groups the Magic cards into logical clusters, such as by card color(s) and card type.
I generated these Magic card embeddings for something special besides a pretty data visualization, but if you are curious how I generated them, they were made using the new-but-underrated gte-modernbert-base embedding model and the process is detailed in this GitHub repository. The embeddings themselves (including the coordinate values to reproduce the 2D UMAP visualization) are available as a Hugging Face dataset.
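If you just want a sense of what that pipeline looks like, here is a minimal sketch (not the repository's exact code): it assumes the sentence-transformers and umap-learn packages, that the model is loaded from the Hugging Face ID Alibaba-NLP/gte-modernbert-base, and that card_texts is a hypothetical list of pre-formatted card-design strings.

from sentence_transformers import SentenceTransformer
import umap

# Hypothetical input: one formatted string of card attributes per card.
model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
embeddings = model.encode(card_texts, normalize_embeddings=True)  # unit-normalized float32 matrix

# Project the 768D embeddings down to 2D for plotting.
reducer = umap.UMAP(n_components=2, metric="cosine")
coords_2d = reducer.fit_transform(embeddings)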
Most tutorials involving embedding generation omit the obvious question: what do you do with the text embeddings after you generate them? The common solution is to use a vector database, such as faiss or qdrant, or even a cloud-hosted service such as Pinecone. But those aren’t easy to use: faiss has confusing configuration options, qdrant requires using a Docker container to host the storage server, and Pinecone can get very expensive very quickly, while its free Starter tier is limited.
What many don’t know about text embeddings is that you don’t need a vector database to calculate nearest-neighbor similarity if your data isn’t too large. Using numpy and my Magic card embeddings, a 2D matrix of 32,254 float32 embeddings at a dimensionality of 768D (common for “smaller” LLM embedding models) occupies 94.49 MB of system memory, which is relatively low for modern personal computers and can fit within free usage tiers of cloud VMs. If both the query vector and the embeddings themselves are unit normalized (many embedding generators normalize by default), then the matrix dot product between the query and the embeddings results in a cosine similarity between [-1, 1], where a higher score is better/more similar. Since dot products are such a fundamental aspect of linear algebra, numpy’s implementation is extremely fast: with the help of additional numpy sorting shenanigans, on my M3 Pro MacBook Pro it takes just 1.08 ms on average to calculate all 32,254 dot products, find the top 3 most similar embeddings, and return their corresponding idx within the matrix along with their cosine similarity scores.
import numpy as np

def fast_dot_product(query, matrix, k=3):
    # Dot products between the query vector and every row of the matrix
    dot_products = query @ matrix.T
    # argpartition grabs the k largest scores without a full sort,
    # then argsort orders just those k indices from most to least similar
    idx = np.argpartition(dot_products, -k)[-k:]
    idx = idx[np.argsort(dot_products[idx])[::-1]]
    score = dot_products[idx]
    return idx, score
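As a hedged usage sketch, assuming embeddings is the 32,254 × 768 float32 matrix and query is a single 768D vector from the same model (skip the normalization step if your embedding model already returns unit-normalized vectors):

# Unit-normalize so the dot product equals cosine similarity.
embeddings_n = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)

idx, score = fast_dot_product(query_n, embeddings_n, k=3)
print(idx, score)  # indices of the top 3 most similar cards and their cosine similarities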
In most implementations of vector databases, once you insert the embeddings, they’re stuck there in a proprietary serialization format, and you are locked into that library and service. If you’re just building a personal pet project or sanity-checking embeddings to make sure the results are good, that’s a huge amount of friction. For example, when I want to experiment with embeddings, I generate them on a cloud server with a GPU since LLM-based embedding models are often slow to run without one, and then download them locally to my personal computer. What is the best way to handle embeddings portably, such that they can easily be moved between machines and stored in a non-proprietary format?
The answer, after much personal trial-and-error, is Parquet files, which still come with a surprising amount of nuance. But before we talk about why Parquet files are good, let’s talk about how not to store embeddings.
The Worst Ways to Store Embeddings
The incorrect-but-unfortunately-common way to store embeddings is in a text format such as a CSV file. Text data is substantially larger than float32 data: for example, a decimal number at full precision (e.g. 2.145829051733016968e-02) as a float32 is 32 bits/4 bytes, while its text representation (in this case 24 ASCII chars) is 24 bytes, 6x larger. When the CSV is saved and loaded, the data also has to be serialized between a numpy array and its string representation, which adds significant overhead. Despite that, in one of OpenAI’s official tutorials for their embeddings models, they save the embeddings as a CSV using pandas, with the admitted caveat of “Because this example only uses a few thousand strings, we’ll store them in a CSV file. (For larger datasets, use a vector database, which will be more performant.)”. In the case of the Magic card embeddings, pandas-to-CSV performs the worst out of any encoding option: more on why later.
Numpy has native methods to save and load embeddings as a .txt file, and they’re straightforward:
np.savetxt("embeddings_txt.txt", embeddings)
embeddings_r = np.loadtxt("embeddings_txt.txt", dtype=np.float32, delimiter=" ")
The resulting file not only takes a few seconds to save and load, but it’s also massive: 631.5 MB!
As an aside, HTTP APIs such as OpenAI’s Embeddings API do transmit the embeddings as text, which adds needless latency and bandwidth overhead. I wish more embedding providers offered gRPC APIs that allow the transfer of binary float32 data instead for a performance increase: Pinecone’s Python SDK, for example, does just that.
The second incorrect method to save a matrix of embeddings to disk is to save it as a Python pickle object, which stores its in-memory representation on disk with a few lines of code from the native pickle library. Pickling is unfortunately common in the machine learning industry since many ML frameworks such as scikit-learn don’t have easy ways to serialize encoders and models. But it comes with two major caveats: pickled files are a massive security risk as they can execute arbitrary code, and a pickled file is not guaranteed to open on other machines or Python versions. It’s 2025: just stop pickling if you can.
In the case of the Magic card embeddings, it does indeed work with instant save/loads, and the file size on disk is 94.49 MB: the same as its memory consumption and about 1/6th of the text size as expected:
with open("embeddings_matrix.pkl", "wb") as f:
pickle.dump(embeddings, f)
with open("embeddings_matrix.pkl", "rb") as f:
embeddings_r = pickle.load(f)
But there are still better and easier approaches.
The Intended-But-Not-Great Way to Store Embeddings
Numpy itself has a canonical way to save and load matrices — which annoyingly saves as a pickle by default for compatibility reasons, but that can fortunately be disabled by setting allow_pickle=False:

np.save("embeddings_matrix.npy", embeddings, allow_pickle=False)
embeddings_r = np.load("embeddings_matrix.npy", allow_pickle=False)
13 Comments
whinvik
Since we are talking about an embedded solution shouldn't the benchmark be something like sqlite with a vector extension or lancedb?
thelastbender12
This is pretty neat.
IMO a hindrance to this was the lack of built-in fixed-size list array support in the Arrow format, until recently. Some implementations/clients supported it, while others didn't. Otherwise, it could have been used as the default storage format for numpy arrays and torch tensors, too.
(You could always store arrays as variable length list arrays with fixed strides and handle the conversion).
banku_brougham
Is your example of a float32 number correct, holding 24 ascii char representation? I had thought single-precision gonna be 7 digits and the exponent, sign and exp sign. Something like 7+2+1+1 or 10 char ascii representation? Rather than the 24 you mentioned?
banku_brougham
Really cool article, I've enjoyed your work for a long time. You might add a note for those jumping into a sqlite implementation, that duckdb reads parquet and launched a few vector similarity functions which cover this use-case perfectly:
https://duckdb.org/2024/05/03/vector-similarity-search-vss.h…
kernelsanderz
For another library that has great performance and features like full text indexing and the ability to version changes I’d recommend lancedb https://lancedb.github.io/lancedb/
Yes, it’s a vector database and has more complexity. But you can use it without creating indexes and it has excellent polars and pandas zero copy arrow support also.
stephantul
Check out Unum’s usearch. It beats anything, and is super easy to use. It just does exactly what you need.
https://github.com/unum-cloud/usearch
kipukun
To the second footnote: you could utilize Polars' LazyFrame API to do that cosine similarity in a streaming fashion for large files.
jtrueb
Polars + Parquet is awesome for portability and performance. This post focused on python portability, but Polars has an easy-to-use Rust API for embedding the engine all over the place.
robschmidt90
Nice read. I agree that for a lot of hobby use cases you can just load the embeddings from parquet and compute the similarities in-memory.
To find similarity between my blogposts [1] I wanted to experiment with a local vector database and found ChromaDB fairly easy to use (similar to SQLite just a file on your machine).
[1] https://staticnotes.org/posts/how-recommendations-work/
jononor
At 33k items, in-memory is quite fast; 10 ms is very responsive. With 10x (330k items) on the same hardware, the expected time is 1 second. That might be too slow for some applications (but not all). Especially if one just does retrieval of a rather small number of matches, an index will help a lot for 100k++ datasets.
noahbp
Wow! How much did this cost you in GPU credits? And did you consider using your MacBook?
rcarmo
I'm a huge fan of polars, but I hadn't considered using it to store embeddings in this way (I've been fiddling with sqlite-vec). Seems like an interesting idea indeed.
thomasfromcdnjs
Lots of great findings
—
I'm curious if anyone knows whether it is better to pass structured data or unstructured data to embedding APIs? If I ask ChatGPT, it says it is better to send unstructured data. (Looking at the author's GitHub, it looks like he generated embeddings from JSON strings.)
My use case is for jsonresume, I am creating embeddings by sending full json versions as strings, but I've been experimenting with using models to translate resume.json's into full text versions first before creating embeddings. The results seem to be better but I haven't seen any concrete opinions on this.
My understanding is that unstructured data is better because it contains textual/semantic meaning because of natural language, aka
is worse than;
Another question: What if the search was also a json embedding? JSON <> JSON embeddings could also be great?