Smallpond – A lightweight data processing framework built on DuckDB and 3FS by overflowcat

A lightweight data processing framework built on DuckDB and 3FS.

🚀 High-performance data processing powered by DuckDB
🌍 Scalable to handle PB-scale datasets
🛠️ Easy operations with no long-running services

Python 3.8 to 3.12 is supported.

# Download example data
wget https://duckdb.org/data/prices.parquet

import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")

# Show results
print(df.to_pandas())
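
To make the repartition-then-aggregate pattern in the quick start more concrete, here is a minimal plain-Python sketch of the idea. It is not smallpond's implementation. The helper names (`hash_partition`, `min_max_by_ticker`), the CRC32 hash, and the sample rows are all assumptions made for illustration. The point it shows: once rows are hash-partitioned by `ticker`, every ticker lives in exactly one partition, so each partition can be aggregated independently and the results simply combined.

```python
import zlib


def hash_partition(rows, key, npartitions):
    """Assign each row to a partition based on a stable hash of row[key].

    Hypothetical helper: CRC32 is used only because it is deterministic
    across runs; a real engine may hash differently.
    """
    partitions = [[] for _ in range(npartitions)]
    for row in rows:
        idx = zlib.crc32(row[key].encode("utf-8")) % npartitions
        partitions[idx].append(row)
    return partitions


def min_max_by_ticker(partition):
    """Per-partition equivalent of:
    SELECT ticker, min(price), max(price) ... GROUP BY ticker
    """
    stats = {}
    for row in partition:
        price = row["price"]
        lo, hi = stats.get(row["ticker"], (price, price))
        stats[row["ticker"]] = (min(lo, price), max(hi, price))
    return stats


# Made-up sample data standing in for prices.parquet.
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 20.0},
    {"ticker": "AAA", "price": 12.0},
    {"ticker": "CCC", "price": 5.0},
    {"ticker": "BBB", "price": 18.5},
    {"ticker": "AAA", "price": 9.5},
]

partitions = hash_partition(rows, "ticker", 3)

# Each partition is aggregated on its own; because a ticker never spans
# two partitions, combining the results needs no further merge step.
results = {}
for part in partitions:
    results.update(min_max_by_ticker(part))

for ticker, (lo, hi) in sorted(results.items()):
    print(ticker, lo, hi)
```

Hashing on the GROUP BY key is what makes the per-partition query safe to run in parallel. Partitioning on a different column would require a second aggregation pass to merge partial results.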

For detailed guides and API reference:

We evaluated smallpond using the GraySort benchmark (script) on a cluster comprising 50 compute nodes and 25 storage nodes running 3FS. The benchmark sorted 110.5TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66TiB/min.
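
As a rough sanity check on the figures above, the average throughput can be recomputed from the reported data size and elapsed time. This is only a sketch: the variable names are illustrative, and because the published inputs are themselves rounded, the result should be expected to land close to the reported 3.66TiB/min rather than match it digit for digit.

```python
# Figures taken from the benchmark description.
data_tib = 110.5            # total data sorted, in TiB
elapsed_min = 30 + 14 / 60  # 30 minutes 14 seconds, in minutes
compute_nodes = 50

throughput_tib_per_min = data_tib / elapsed_min

# Derived figures (illustrative conversions, not reported in the source).
throughput_gib_per_sec = throughput_tib_per_min * 1024 / 60
per_node_gib_per_sec = throughput_gib_per_sec / compute_nodes

print(f"average throughput: {throughput_tib_per_min:.3f} TiB/min")
print(f"aggregate:          {throughput_gib_per_sec:.1f} GiB/s")
print(f"per compute node:   {per_node_gib_per_sec:.2f} GiB/s")
```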

Details can be found in 3FS – Gray Sort.

pip install .[dev]

# run unit tests
pytest -v tests/test*.py

# build documentation
pip install .[docs]
cd docs
make html
python -m http.server --directory build/html

This project is licensed under the MIT License.

fastasucan

Posted February 28, 2025 at 1:26 pm

What does this do – what is the benefit over DuckDB, Polars, etc.?

HackerThemAll

Posted March 2, 2025 at 6:19 pm

DuckDB itself is cool enough, especially when combined with SQLite and/or PostgreSQL, and now this. Thanks DeepSeek!

orlp

Posted March 2, 2025 at 6:31 pm

One thing I found peculiar is that for the GraySort benchmark it dispatches to Polars by default to do the actual sorting, not DuckDB: https://github.com/deepseek-ai/smallpond/blob/ed112db42af4d0….

shipp02

Posted March 2, 2025 at 6:53 pm

Is the code written by the deepseek model?

I should probably give up on being a software engineer if it is.

rubenvanwyk

Posted March 2, 2025 at 7:08 pm

May Data Engineering content keep on hitting front page HN!

RyanHamilton

Posted March 2, 2025 at 7:35 pm

If you want to check out DuckDB, try QStudio. It's a free SQL client with DuckDB integrated: https://www.timestored.com/qstudio/help/duckdb-sql-editor. Disclaimer: I'm the main author.

lvl155

Posted March 2, 2025 at 7:39 pm

Looking forward to the next few years, when we can finally abstract away all the back-end tech.

dang

Posted March 2, 2025 at 8:20 pm

Related ongoing thread:

Understanding Smallpond and 3FS – https://news.ycombinator.com/item?id=43232410

also:

DuckDB goes distributed? DeepSeek's smallpond takes on Big Data – https://news.ycombinator.com/item?id=43206964 (no comments there, but some people have been recommending that article)

jamesblonde

Posted March 2, 2025 at 8:21 pm

We are seeing more and more specialized query engines. This is a query engine specialized for training pipelines. It is not general purpose; it is for providing batches of training data to workers, and it uses Ray for parallelization. What you need for that is random reads (to implement shuffling across epochs), Arrow support (zero-copy to Pandas DataFrames), and efficient checkpointing.
