Mon, Aug 28, 2023
12-minute read
It’s here! pgvector 0.5.0 is released and has some incredible new features. pgvector is an open-source project that brings vector database capabilities to PostgreSQL. The pgvector community is moving very rapidly on adding new features, so I thought it prudent to put together some highlights of the 0.5.0 release.
Here’s a quick list of highlights, though I encourage you to read the rest in depth and explore on your own!
This is a big release, so let’s dive right in.
New index type: Hierarchical Navigable Small Worlds (`hnsw`)
The biggest highlight of the pgvector 0.5.0 release is the introduction of the `hnsw` index type. `hnsw` is based on the Hierarchical Navigable Small Worlds paper, which describes an indexing technique that creates layers of increasingly dense “neighborhoods” of vectors. The main idea of HNSW is that you can achieve a better performance/recall ratio by connecting vectors that are close to each other, so that when you perform similarity search, you have a higher likelihood of finding the exact nearest neighbors you’re looking for.
Beyond performance, pgvector’s HNSW implementation has several notable features:
- “Build as you go”: With HNSW, you can create an index on an empty table and add vectors as you go without impacting recall! This is different from `ivfflat`, where you first need to load your vectors before building the index to find optimal centers for better recall. As you add more data to your `ivfflat` index, you may also need to re-index to find updated centers.
- Update and delete: pgvector’s HNSW implementation lets you update and delete vectors from the index as part of standard `UPDATE` and `DELETE` queries. Many HNSW implementations do not support this feature.
- Concurrent inserts: Additionally, pgvector’s HNSW implementation lets you concurrently insert values into the index, making it easier to simultaneously load data from multiple sources.
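As a sketch of what “build as you go” looks like in practice (the `docs` table is hypothetical, and 3-dimensional vectors are used for brevity), the index can exist before any data does, and rows can then be inserted, updated, and deleted normally:

```sql
-- Create the HNSW index while the table is still empty; no
-- pre-loading step is required, unlike ivfflat
CREATE TABLE docs (id int PRIMARY KEY, embedding vector(3));
CREATE INDEX ON docs USING hnsw (embedding vector_l2_ops);

-- Vectors added later are indexed incrementally, including from
-- multiple concurrent sessions
INSERT INTO docs (id, embedding) VALUES (1, '[1, 0, 0]'), (2, '[0, 1, 0]');

-- Updates and deletes flow through to the index via standard SQL
UPDATE docs SET embedding = '[0, 0, 1]' WHERE id = 1;
DELETE FROM docs WHERE id = 2;
```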
In a previous post, I went into depth on the HNSW performance of pgvector with benchmarks that compared it to `ivfflat` and `pg_embedding`’s HNSW implementation. The chart below shows the performance/recall tradeoffs on OpenAI-style embedding data using cosine distance (please read the previous post for the full testing methodology, the ANN Benchmarks framework used, and other considerations such as index build time and index size on disk):
Instead of focusing on benchmarking in this post, I want to provide guidance on how to use the key parameters in `hnsw` so you understand the ramifications on your performance/recall ratio and index build times.
In the previous post, we reviewed the 3 parameters that are part of the HNSW algorithm. We’ll break them down by where they are applicable in pgvector’s `hnsw` implementation:
Index building options
These options are available in the `WITH` clause of `CREATE INDEX`, e.g.
CREATE TABLE vecs (id int PRIMARY KEY, embedding vector(768));
CREATE INDEX ON vecs USING hnsw(embedding vector_l2_ops) WITH (m=16, ef_construction=64);
- `m`: (default: `16`; range: `2` to `100`) Indicates how many bidirectional links (or paths) exist between each indexed element. Setting this to a higher number can increase recall, but can also significantly increase index build time and may impact query performance.
- `ef_construction`: (default: `64`; range: `4` to `1000`) Indicates how many nearest neighbors to check while adding an element to the index. Increasing this value can increase recall, but will also increase index build time. This value must also be at least double `m`, e.g. if `m` is `24` then `ef_construction` must be at least `48`.
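For example, a build tuned for higher recall might look like the following (the specific values are chosen for illustration); note that `ef_construction` here satisfies the "at least double `m`" constraint described above:

```sql
-- Higher m and ef_construction trade longer build time for better
-- recall; ef_construction (128) is at least double m (24)
CREATE INDEX ON vecs USING hnsw (embedding vector_l2_ops)
  WITH (m = 24, ef_construction = 128);
```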
Note that as of this writing, you must specify the operator class to use with the `hnsw` index. For example, to use cosine distance with an HNSW index, you would use a command similar to the below:
CREATE INDEX ON vecs USING hnsw(embedding vector_cosine_ops);
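When searching, the distance operator in `ORDER BY` needs to match the index’s operator class for the planner to use the index. A sketch against the `vecs` table (`<=>` is pgvector’s cosine distance operator, and `q` stands in for your 768-dimensional query vector):

```sql
-- This ORDER BY uses cosine distance (<=>), matching
-- vector_cosine_ops, so the HNSW index above can serve the search
SELECT id
FROM vecs
ORDER BY embedding <=> q
LIMIT 10;
```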
Index search options
- `hnsw.ef_search`: (default: `40`; range: `1` to `1000`) Indicates the number of nearest neighbors to keep in a “dynamic list” while performing the search. A larger value can improve recall, usually with a tradeoff in performance. `hnsw.ef_search` must be at least as large as your `LIMIT` value.
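For example (again using the `vecs` table and a query vector `q`), `SET LOCAL` scopes the change to a single transaction, which is a convenient way to raise `hnsw.ef_search` for one large query without changing the session default:

```sql
BEGIN;
-- hnsw.ef_search must be at least as large as LIMIT (here, both 100)
SET LOCAL hnsw.ef_search = 100;
SELECT id FROM vecs ORDER BY q <-> embedding LIMIT 100;
COMMIT;
```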
How to use `hnsw` in pgvector
The default index build settings are chosen to optimize build time relative to the recall you can achieve during search. If you’re not getting the recall that you expect for your data set, first try increasing the value of `ef_construction` before adjusting `m`, as adjusting `ef_construction` is often a faster operation. There are some studies that show that increasing `m` can help with recall for higher-dimensionality data sets, though in the previous post we saw that pgvector could process OpenAI-style embeddings effectively with high recall.
You can increase the performance of your search queries by lowering the value of `hnsw.ef_search`, e.g. setting `hnsw.ef_search` to `20`, though note that this could impact your recall:
SET hnsw.ef_search TO 20;
SELECT *
FROM vecs
ORDER BY q <-> embedding
LIMIT 10;
Improved performance of distance functions
Speed is paramount when computing distance between two vectors. Any way you can shave off computation time means you can build indexes and search for vectors more quickly.
pgvector 0.5.0 does exactly this, improving distance calculations across the board with noticeable gains for the ARM64 architecture. By default, pgvector can use CPU acceleration for vector processing through compile flags, and writing the implementation code in certain ways can help unlock performance gains once it’s compiled.
The gains in pgvector 0.5.0 are noticeable. To demonstrate this, I ran an experiment on my Mac M1 Pro (2021 edition, 8 CPU cores, 16 GB RAM) to show the speedup in building an `ivfflat` index with both Euclidean distance (`vector_l2_ops`, the default operator class) and cosine distance (`vector_cosine_ops`) on a table with 1,000,000 768-dimensional vectors using `PLAIN` storage:
CREATE TABLE vecs (id int PRIMARY KEY, embedding vector(768));
ALTER TABLE vecs ALTER COLUMN embedding SET STORAGE PLAIN;
Below are some of the relevant local settings used to test the index builds:
- `shared_buffers`: `4GB`
- `max_wal_size`: `10GB`
- `work_mem`: `8MB`
- `max_parallel_maintenance_workers`: `0`
Before each run, I ensured that the `vecs` table was loaded into memory using the `pg_prewarm` extension:
SELECT pg_prewarm('vecs');
Finally, I created the `ivfflat` index with `lists` set to `100`. Note that this was to run a series of tests quickly; the recommended value of `lists` for 1,000,000 rows is `1000`. The effect of the distance calculations may be more pronounced with a larger value of `lists`.
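For reference, a build using the recommended value would look like the following (pgvector’s documentation suggests roughly `rows / 1000` lists for up to about 1,000,000 rows, and `sqrt(rows)` above that):

```sql
-- Recommended setting for a 1,000,000-row table: lists = 1000
CREATE INDEX ON vecs USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 1000);
```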
Below is a table summarizing the results. Please note that these results are directional, particularly due to the value of `lists`:
| Version | Euclidean (s) | Cosine (s) |
|---|---|---|
| 0.4.4 | 71.66 | 69.92 |
| 0.5.0 | 45.65 | 64.02 |
The above test showed a noticeable improvement with Euclidean distance and a marginal improvement with cosine distance. Andrew Kane’s tests showed a greater speedup across all distance functions on ARM64 systems. That said, you should likely see some performance gains in your pgvector workloads, with these being most pronounced on tasks with many distance computations (e.g. index builds, searches over large sets of vectors).
Parallelization of `ivfflat` index builds
`ivfflat` is comparatively fast when it comes to building indexes, though there are always