Text embeddings, particularly modern embeddings generated from large language models, are one of the most useful applications to come out of the generative AI boom. An embedding is a list of numbers that represents an object: in the case of text embeddings, they can represent words, sentences, or full paragraphs and documents, and they do so with a surprising amount of distinctiveness.
Recently, I created text embeddings representing every distinct Magic: the Gathering card released as of the February 2025 Aetherdrift expansion: 32,254 in total. With these embeddings, I can find the mathematical similarity between cards through the encoded representation of their card design, including all mechanical attributes such as the card name, card cost, card text, and even card rarity.

The iconic Magic card Wrath of God, along with its top four most similar cards identified using their respective embeddings. The similar cards are valid matches, with similar card text and card types.
Additionally, I can create a fun 2D UMAP projection of all those cards, which also identifies interesting patterns:

The UMAP dimensionality reduction process also implicitly groups the Magic cards into logical clusters, such as by card color(s) and card type.
I generated these Magic card embeddings for something special besides a pretty data visualization, but if you are curious how I generated them, they were made using the new-but-underrated gte-modernbert-base embedding model and the process is detailed in this GitHub repository. The embeddings themselves (including the coordinate values to reproduce the 2D UMAP visualization) are available as a Hugging Face dataset.
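If you just want a sense of what that pipeline looks like, here is a minimal sketch (not the repository's exact code): it assumes the sentence-transformers and umap-learn packages, that the model is loaded from the Hugging Face ID Alibaba-NLP/gte-modernbert-base, and that card_texts is a hypothetical list of pre-formatted card-design strings.

from sentence_transformers import SentenceTransformer
import umap

# Hypothetical input: one formatted string of card attributes per card.
model = SentenceTransformer("Alibaba-NLP/gte-modernbert-base")
embeddings = model.encode(card_texts, normalize_embeddings=True)  # unit-normalized float32 matrix

# Project the 768D embeddings down to 2D for plotting.
reducer = umap.UMAP(n_components=2, metric="cosine")
coords_2d = reducer.fit_transform(embeddings)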
Most tutorials involving embedding generation omit the obvious question: what do you do with the text embeddings after you generate them? The common solution is to use a vector database, such as faiss or qdrant, or even a cloud-hosted service such as Pinecone. But those aren’t easy to use: faiss has confusing configuration options, qdrant requires using a Docker container to host the storage server, and Pinecone can get very expensive very quickly, while its free Starter tier is limited.
What many don’t know about text embeddings is that you don’t need a vector database to calculate nearest-neighbor similarity if your data isn’t too large. Using numpy and my Magic card embeddings, a 2D matrix of 32,254 float32 embeddings at a dimensionality of 768D (common for “smaller” LLM embedding models) occupies 94.49 MB of system memory, which is relatively low for modern personal computers and can fit within free usage tiers of cloud VMs. If both the query vector and the embeddings themselves are unit normalized (many embedding generators normalize by default), then the matrix dot product between the query and the embeddings results in a cosine similarity between [-1, 1], where a higher score is better/more similar. Since dot products are such a fundamental aspect of linear algebra, numpy’s implementation is extremely fast: with the help of additional numpy sorting shenanigans, on my M3 Pro MacBook Pro it takes just 1.08 ms on average to calculate all 32,254 dot products, find the top 3 most similar embeddings, and return their corresponding idx within the matrix along with their cosine similarity scores.
import numpy as np

def fast_dot_product(query, matrix, k=3):
    # Dot products between the query vector and every row of the matrix
    dot_products = query @ matrix.T
    # argpartition grabs the k largest scores without a full sort,
    # then argsort orders just those k indices from most to least similar
    idx = np.argpartition(dot_products, -k)[-k:]
    idx = idx[np.argsort(dot_products[idx])[::-1]]
    score = dot_products[idx]
    return idx, score
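As a hedged usage sketch, assuming embeddings is the 32,254 × 768 float32 matrix and query is a single 768D vector from the same model (skip the normalization step if your embedding model already returns unit-normalized vectors):

# Unit-normalize so the dot product equals cosine similarity.
embeddings_n = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
query_n = query / np.linalg.norm(query)

idx, score = fast_dot_product(query_n, embeddings_n, k=3)
print(idx, score)  # indices of the top 3 most similar cards and their cosine similarities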
In most implementations of vector databases, once you insert the embeddings, they’re stuck there in a proprietary serialization format, and you are locked into that library and service. If you’re just building a personal pet project or sanity-checking embeddings to make sure the results are good, that’s a huge amount of friction. For example, when I want to experiment with embeddings, I generate them on a cloud server with a GPU since LLM-based embedding models are often slow to run without one, and then download them locally to my personal computer. What is the best way to handle embeddings portably, such that they can easily be moved between machines and stored in a non-proprietary format?
The answer, after much personal trial-and-error, is Parquet files, which still come with a surprising amount of nuance. But before we talk about why Parquet files are good, let’s talk about how not to store embeddings.
The Worst Ways to Store Embeddings
The incorrect-but-unfortunately-common way to store embeddings is in a text format such as a CSV file. Text data is substantially larger than float32 data: for example, a decimal number at full precision (e.g. 2.145829051733016968e-02) as a float32 is 32 bits/4 bytes, while its text representation (in this case 24 ASCII chars) is 24 bytes, 6x larger. When the CSV is saved and loaded, the data also has to be serialized between a numpy array and its string representation, which adds significant overhead. Despite that, in one of OpenAI’s official tutorials for their embeddings models, they save the embeddings as a CSV using pandas, with the admitted caveat of “Because this example only uses a few thousand strings, we’ll store them in a CSV file. (For larger datasets, use a vector database, which will be more performant.)”. In the case of the Magic card embeddings, pandas-to-CSV performs the worst out of any encoding option: more on why later.
Numpy has native methods to save and load embeddings as a .txt file, and they’re straightforward:
np.savetxt("embeddings_txt.txt", embeddings)
embeddings_r = np.loadtxt("embeddings_txt.txt", dtype=np.float32, delimiter=" ")
The resulting file not only takes a few seconds to save and load, but it’s also massive: 631.5 MB!
As an aside, HTTP APIs such as OpenAI’s Embeddings API do transmit the embeddings as text, which adds needless latency and bandwidth overhead. I wish more embedding providers offered gRPC APIs that allow the transfer of binary float32 data instead for a performance increase: Pinecone’s Python SDK, for example, does just that.
The second incorrect method to save a matrix of embeddings to disk is to save it as a Python pickle object, which stores its in-memory representation on disk with a few lines of code from the native pickle library. Pickling is unfortunately common in the machine learning industry since many ML frameworks such as scikit-learn don’t have easy ways to serialize encoders and models. But it comes with two major caveats: pickled files are a massive security risk as they can execute arbitrary code, and a pickled file is not guaranteed to open on other machines or Python versions. It’s 2025: just stop pickling if you can.
In the case of the Magic card embeddings, it does indeed work with instant save/loads, and the file size on disk is 94.49 MB: the same as its memory consumption and about 1/6th of the text size as expected:
with open("embeddings_matrix.pkl", "wb") as f:
pickle.dump(embeddings, f)
with open("embeddings_matrix.pkl", "rb") as f:
embeddings_r = pickle.load(f)
But there are still better and easier approaches.
The Intended-But-Not-Great Way to Store Embeddings
Numpy itself has a canonical way to save and load matrices — which annoyingly saves as a pickle by default for compatibility reasons, but that can fortunately be disabled by setting allow_pickle=False:

np.save("embeddings_matrix.npy", embeddings, allow_pickle=False)
embeddings_r = np.load("embeddings_matrix.npy", allow_pickle=False)
13 Comments
whinvik
Since we are talking about an embedded solution shouldn't the benchmark be something like sqlite with a vector extension or lancedb?
thelastbender12
This is pretty neat.
IMO a hindrance to this was the lack of built-in fixed-size list array support in the Arrow format, until recently. Some implementations/clients supported it, while others didn't. Otherwise, it could have been used as the default storage format for numpy arrays and torch tensors, too.
(You could always store arrays as variable length list arrays with fixed strides and handle the conversion).
banku_brougham
Is your example of a float32 number correct, holding 24 ascii char representation? I had thought single-precision gonna be 7 digits and the exponent, sign and exp sign. Something like 7+2+1+1 or 10 char ascii representation? Rather than the 24 you mentioned?
banku_brougham
Really cool article, I've enjoyed your work for a long time. You might add a note for those jumping into a sqlite implementation, that duckdb reads parquet and launched a few vector similarity functions which cover this use-case perfectly:
https://duckdb.org/2024/05/03/vector-similarity-search-vss.h…
kernelsanderz
For another library that has great performance and features like full text indexing and the ability to version changes I’d recommend lancedb https://lancedb.github.io/lancedb/
Yes, it’s a vector database and has more complexity. But you can use it without creating indexes and it has excellent polars and pandas zero copy arrow support also.
stephantul
Check out Unum’s usearch. It beats anything, and is super easy to use. It just does exactly what you need.
https://github.com/unum-cloud/usearch
kipukun
To the second footnote: you could utilize Polars' LazyFrame API to do that cosine similarity in a streaming fashion for large files.
jtrueb
Polars + Parquet is awesome for portability and performance. This post focused on python portability, but Polars has an easy-to-use Rust API for embedding the engine all over the place.
robschmidt90
Nice read. I agree that for a lot of hobby use cases you can just load the embeddings from parquet and compute the similarities in-memory.
To find similarity between my blogposts [1] I wanted to experiment with a local vector database and found ChromaDB fairly easy to use (similar to SQLite just a file on your machine).
[1] https://staticnotes.org/posts/how-recommendations-work/
jononor
At 33k items, in-memory is quite fast; 10 ms is very responsive. With 10x (330k items) on the same hardware, the expected time is 1 second. That might be too slow for some applications (but not all). Especially if one just does retrieval of a rather small number of matches, an index will help a lot for 100k++ datasets.
noahbp
Wow! How much did this cost you in GPU credits? And did you consider using your MacBook?
rcarmo
I'm a huge fan of polars, but I hadn't considered using it to store embeddings in this way (I've been fiddling with sqlite-vec). Seems like an interesting idea indeed.
thomasfromcdnjs
Lots of great findings
—
I'm curious if anyone knows whether it is better to pass structured data or unstructured data to embedding APIs? If I ask ChatGPT, it says it is better to send unstructured data. (Looking at the author's GitHub, it looks like he generated embeddings from JSON strings.)
My use case is for jsonresume, I am creating embeddings by sending full json versions as strings, but I've been experimenting with using models to translate resume.json's into full text versions first before creating embeddings. The results seem to be better but I haven't seen any concrete opinions on this.
My understanding is that unstructured data is better because it contains textual/semantic meaning because of natural language, aka
is worse than;
Another question: What if the search was also a json embedding? JSON <> JSON embeddings could also be great?