Show HN: Lance – Alternative to Parquet for ML data by chop

Share This Article

Sed ut perspiciatis unde.

Modern columnar data format for ML. Convert from parquet in 2-lines of code for 100x faster random access, a vector index, data versioning, and more.

Compatible with pandas, duckdb, polars, pyarrow, with more integrations on the way.

Documentation •
Blog •
Discord •
Twitter

Lance is a modern columnar data format that is optimized for ML workflows and datasets. Lance is perfect for:

Building search engines and features stores.
Large-scale ML training requiring high performance IO and shuffles.
Storing, querying, and inspecting deeply nested data for robotics or large blobs like images, point-clouds, and more.

The key features of Lance include:

High-performance random access: 100x faster than Parquet without sacrificing scan performance.
Vector search: find nearest neighbors in milliseconds and combine OLAP-queries with vector search.
Zero-copy, automatic versioning: manage versions of your data without needing extra infrastructure.
Ecosystem integrations: Apache-Arrow, Pandas, Polars, DuckDB and more on the way.

Quick Start

Installation

Converting to Lance

import lance

import pandas as pd
import pyarrow as pa
import pyarrow.dataset

df = pd.DataFrame({"a": [5], "b": [10]})
uri = "/tmp/test.parquet"
tbl = pa.Table.from_pandas(df)
pa.dataset.write_dataset(tbl, uri, format='parquet')

parquet = pa.dataset.dataset(uri, format='parquet')
lance.write_dataset(parquet, "/tmp/test.lance")

Reading Lance data

dataset = lance.dataset("/tmp/test.lance")
assert isinstance(dataset, pa.dataset.Dataset)

Pandas

df = dataset.to_table().to_pandas()
df

DuckDB

import duckdb

# If this segfaults, make sure you have duckdb v0.7+ installed
duckdb.query("SELECT * FROM dataset LIMIT 10").to_df()

Vector search

Download the sift1m subset

wget ftp://ftp.irisa.fr/local/texmex/corpus/sift.tar.gz
tar -xzf sift.tar.gz

Convert it to Lance

import lance
from lance.vector import vec_to_table
import numpy as np
import struct

nvecs = 1000000
ndims = 128
with open("sift/sift_base.fvecs", mode="rb") as fobj:
    buf = fobj.read()
    data = np.array(struct.unpack("<128000000f", buf[4 : 4 + 4 * nvecs * ndims])).reshape((nvecs, ndims))
    dd = dict(zip(range(nvecs), data))

table = vec_to_table(dd)
uri = "vec_data.lance"
sift1m = lance.write_dataset(table, uri, max_rows_per_group=8192, max_rows_per_file=1024*1024)

Build the index

sift1m.create_index("vector",
                    index_type="IVF_PQ", 
                    num_partitions=256,  # IVF
                    num_sub_vectors=16)  # PQ

Search the dataset

# Get top 10 similar vectors
import duckdb

dataset = lance.dataset(uri)

# Sample 100 query vectors. If this segfaults, make sure you have duckdb v0.7+ installed
sample = duckdb.query("SELECT vector FROM dataset USING SAMPLE 100").to_df()
query_vectors = np.array([np.array(x) for x in sample.vector])

# Get nearest neighbors for all of them
rs = [dataset.to_table(nearest={"column": "vector", "k": 10, "q": q})      
      for q in query_vectors]

Directory structure

Directory	Description
rust	Core Rust implementation
python	Python bindings (pyo3)
docs	Documentation source

What makes Lance different

Here we will highlight a few aspects of Lance’s design. For more details, see the full Lance design document.

Vector index: Vector index for similarity search over embedding space

Encodings: to achieve both fast columnar scan and sub-linear point queries, Lance uses custom encodings and layouts.

Nested fields: Lance stores each subfield as a separate column to support efficient filters like “find images where detected objects include cats”.

Versioning: a Manifest can be used to record snapshots. Currently we support creating new versions automatically via appends, overwrites, and index creation

Fast updates (ROADMAP): Updates will be supported via write-ahead logs.

Rich secondary indices (ROADMAP):

Inverted index for fuzzy search over many label / annotation fields

Benchmarks

Vector search

We used the sift dataset to benchmark our results with 1M vectors of 128D

For 100 randomly sampled query vectors, we get <1ms average response time (on a 2023 m2 macbook air)

ANN is always a trade-off between recall and performance

Vs parquet

We create a Lance dataset using the Oxford Pet dataset to do some preliminary performance testing of Lance as compared to Parquet and raw image/xmls. For analytics queries, Lance is 50-100x better than reading the raw met

Show HN: Lance – Alternative to Parquet for ML data by chop

Show HN: Lance – Alternative to Parquet for ML data by chop

Share This Article

Newsletter

Quick Start

Directory structure

What makes Lance different

Benchmarks

Vector search

Vs parquet

HackTech

Leave a comment Cancel reply

Editor's Choice

Show HN: Lance – Alternative to Parquet for ML data by chop

Show HN: Lance – Alternative to Parquet for ML data by chop

Share This Article

Newsletter

Quick Start

Directory structure

What makes Lance different

Benchmarks

Vector search

Vs parquet

HackTech

Leave a comment Cancel reply

Editor's Choice

Sign Up to Our Newsletter