Mon, Aug 28, 2023
12-minute read
It’s here! pgvector 0.5.0 is released and has some incredible new features. pgvector is an open-source project that brings vector database capabilities to PostgreSQL. The pgvector community is moving very rapidly on adding new features, so I thought it prudent to put together some highlights of the 0.5.0 release.
Here’s a quick list of highlights, though I encourage you to read the rest in depth and explore on your own!
This is a big release, so let’s dive right in.
New index type: Hierarchical Navigable Small Worlds (`hnsw`)
The biggest highlight of the pgvector 0.5.0 release is the introduction of the `hnsw` index type. `hnsw` is based on the Hierarchical Navigable Small Worlds paper, which describes an indexing technique that creates layers of increasingly dense “neighborhoods” of vectors. The main idea of HNSW is that you can achieve a better performance/recall ratio by connecting vectors that are close to each other, so that when you perform similarity search, you have a higher likelihood of finding the exact nearest neighbors you’re looking for.
Beyond performance, pgvector’s HNSW implementation has several notable features:
- “Build as you go”: With HNSW, you can create an index on an empty table and add vectors as you go without impacting recall! This is different from `ivfflat`, where you first need to load your vectors before building the index to find optimal centers for better recall. As you add more data to your `ivfflat` index, you may also need to re-index to find updated centers.
- Update and delete: pgvector’s HNSW implementation lets you update and delete vectors from the index as part of standard `UPDATE` and `DELETE` queries. Many HNSW implementations do not support this feature.
- Concurrent inserts: Additionally, pgvector’s HNSW implementation lets you concurrently insert values into the index, making it easier to simultaneously load data from multiple sources.
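As a sketch of what “build as you go” looks like in practice (the `docs` table is hypothetical, and 3-dimensional vectors are used for brevity), the index can exist before any data does, and rows can then be inserted, updated, and deleted normally:

```sql
-- Create the HNSW index while the table is still empty; no
-- pre-loading step is required, unlike ivfflat
CREATE TABLE docs (id int PRIMARY KEY, embedding vector(3));
CREATE INDEX ON docs USING hnsw (embedding vector_l2_ops);

-- Vectors added later are indexed incrementally, including from
-- multiple concurrent sessions
INSERT INTO docs (id, embedding) VALUES (1, '[1, 0, 0]'), (2, '[0, 1, 0]');

-- Updates and deletes flow through to the index via standard SQL
UPDATE docs SET embedding = '[0, 0, 1]' WHERE id = 1;
DELETE FROM docs WHERE id = 2;
```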
In a previous post, I went into depth on the HNSW performance of pgvector with benchmarks that compared it to `ivfflat` and `pg_embedding`’s HNSW implementation. The chart below shows the performance/recall tradeoffs on OpenAI-style embedding data using cosine distance (please read the previous post for the full testing methodology, the ANN Benchmarks framework used, and other considerations such as index build time and index size on disk):
Instead of focusing on benchmarking in this post, I want to provide guidance on how to use the key parameters in `hnsw` so you understand the ramifications on your performance/recall ratio and index build times.
In the previous post, we reviewed the 3 parameters that are part of the HNSW algorithm. We’ll break them down by where they are applicable in pgvector’s `hnsw` implementation:
Index building options
These options are available in the `WITH` clause of `CREATE INDEX`, e.g.
CREATE TABLE vecs (id int PRIMARY KEY, embedding vector(768));
CREATE INDEX ON vecs USING hnsw(embedding vector_l2_ops) WITH (m=16, ef_construction=64);
- `m`: (default: `16`; range: `2` to `100`) Indicates how many bidirectional links (or paths) exist between each indexed element. Setting this to a higher number can increase recall, but can also significantly increase index build time and may impact query performance.
- `ef_construction`: (default: `64`; range: `4` to `1000`) Indicates how many nearest neighbors to check while adding an element to the index. Increasing this value can increase recall, but will also increase index build time. This value must also be at least double `m`, e.g. if `m` is `24` then `ef_construction` must be at least `48`.
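For example, a build tuned for higher recall might look like the following (the specific values are chosen for illustration); note that `ef_construction` here satisfies the "at least double `m`" constraint described above:

```sql
-- Higher m and ef_construction trade longer build time for better
-- recall; ef_construction (128) is at least double m (24)
CREATE INDEX ON vecs USING hnsw (embedding vector_l2_ops)
  WITH (m = 24, ef_construction = 128);
```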
Note that as of this writing, you must specify the operator class to use with the `hnsw` index. For example, to use cosine distance with an HNSW index, you would use a command similar to the below:
CREATE INDEX ON vecs USING hnsw(embedding vector_cosine_ops);
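When searching, the distance operator in `ORDER BY` needs to match the index’s operator class for the planner to use the index. A sketch against the `vecs` table (`<=>` is pgvector’s cosine distance operator, and `q` stands in for your 768-dimensional query vector):

```sql
-- This ORDER BY uses cosine distance (<=>), matching
-- vector_cosine_ops, so the HNSW index above can serve the search
SELECT id
FROM vecs
ORDER BY embedding <=> q
LIMIT 10;
```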
Index search options
- `hnsw.ef_search`: (default: `40`; range: `1` to `1000`) Indicates the number of nearest neighbors to keep in a “dynamic list” while performing the search. A larger value can improve recall, usually with a tradeoff in performance. `hnsw.ef_search` must be at least as large as your `LIMIT` value.
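For example (again using the `vecs` table and a query vector `q`), `SET LOCAL` scopes the change to a single transaction, which is a convenient way to raise `hnsw.ef_search` for one large query without changing the session default:

```sql
BEGIN;
-- hnsw.ef_search must be at least as large as LIMIT (here, both 100)
SET LOCAL hnsw.ef_search = 100;
SELECT id FROM vecs ORDER BY q <-> embedding LIMIT 100;
COMMIT;
```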
How to use `hnsw` in pgvector
The default index build settings are chosen to optimize build time relative to the recall you can achieve during search. If you’re not getting the recall that you expect for your data set, first try increasing the value of `ef_construction` before adjusting `m`, as adjusting `ef_construction` is often a faster operation. There are some studies that show that increasing `m` can help with recall for higher-dimensionality data sets, though in the previous post we saw that pgvector could process OpenAI-style embeddings effectively with high recall.
You can increase the performance of your search queries by lowering the value of `hnsw.ef_search`, e.g. setting `hnsw.ef_search` to `20`, though note that this could impact your recall:
SET hnsw.ef_search TO 20;
SELECT *
FROM vecs
ORDER BY q <-> embedding
LIMIT 10;
Improved performance of distance functions
Speed is paramount when computing distance between two vectors. Any way you can shave off computation time means you can build indexes and search for vectors more quickly.
pgvector 0.5.0 does exactly this, improving distance calculations across the board with noticeable gains for the ARM64 architecture. By default, pgvector can use CPU acceleration for vector processing through compile flags, and writing the implementation code in certain ways can help unlock performance gains once it’s compiled.
The gains in pgvector 0.5.0 are noticeable. To demonstrate this, I ran an experiment on my Mac M1 Pro (2021 edition, 8 CPU cores, 16 GB RAM) to show the speedup in building an `ivfflat` index with both Euclidean distance (`vector_l2_ops`, the default operator class) and cosine distance (`vector_cosine_ops`) on a table with 1,000,000 768-dimensional vectors using `PLAIN` storage:
CREATE TABLE vecs (id int PRIMARY KEY, embedding vector(768));
ALTER TABLE vecs ALTER COLUMN embedding SET STORAGE PLAIN;
Below are some of the relevant local settings used to test the index builds:
- `shared_buffers`: `4GB`
- `max_wal_size`: `10GB`
- `work_mem`: `8MB`
- `max_parallel_maintenance_workers`: `0`
Before each run, I ensured that the `vecs` table was loaded into memory using the `pg_prewarm` extension:
SELECT pg_prewarm('vecs');
Finally, I created the `ivfflat` index with `lists` set to `100`. Note that this was to run a series of tests quickly; the recommended value of `lists` for 1,000,000 rows is `1000`. The effect of the distance calculations may be more pronounced with a larger value of `lists`.
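For reference, a build using the recommended value would look like the following (pgvector’s documentation suggests roughly `rows / 1000` lists for up to about 1,000,000 rows, and `sqrt(rows)` above that):

```sql
-- Recommended setting for a 1,000,000-row table: lists = 1000
CREATE INDEX ON vecs USING ivfflat (embedding vector_cosine_ops)
  WITH (lists = 1000);
```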
Below is a table summarizing the results. Please note that these results are directional, particularly due to the value of `lists`:
| Version | Euclidean (s) | Cosine (s) |
|---|---|---|
| 0.4.4 | 71.66 | 69.92 |
| 0.5.0 | 45.65 | 64.02 |
The above test showed a noticeable improvement with Euclidean distance and a marginal improvement with cosine distance. Andrew Kane’s tests showed a greater speedup across all distance functions on ARM64 systems. That said, you should likely see some performance gains in your pgvector workloads, with these being most pronounced on tasks with many distance computations (e.g. index builds, searches over large sets of vectors).
Parallelization of `ivfflat` index builds
`ivfflat` is comparatively fast when it comes to building indexes, though there are always