A lightweight data processing framework built on DuckDB and 3FS.
- 🚀 High-performance data processing powered by DuckDB
- 🌍 Scalable to handle PB-scale datasets
- 🛠️ Easy operations with no long-running services
Python 3.8 to 3.12 is supported.
# Download example data
wget https://duckdb.org/data/prices.parquet
import smallpond

# Initialize session
sp = smallpond.init()

# Load data
df = sp.read_parquet("prices.parquet")

# Process data
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Save results
df.write_parquet("output/")
# Show results
print(df.to_pandas())
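The quick-start example hash-partitions the data by `ticker` before running a `GROUP BY ticker` query on each partition. The sketch below is not smallpond's implementation. It is a pure-Python illustration, under stated assumptions, of why that pattern gives correct results: once every row for a given key lands in the same partition, each partition can be aggregated independently. The helper names (`hash_partition`, `min_max_by_ticker`), the use of `zlib.crc32` as the hash, and the sample rows are all hypothetical and chosen only for illustration.

```python
import zlib


def hash_partition(rows, key, npartitions):
    """Assign each row to a partition using a stable hash of row[key].

    zlib.crc32 is an illustrative choice, not the hash smallpond uses.
    """
    partitions = [[] for _ in range(npartitions)]
    for row in rows:
        idx = zlib.crc32(str(row[key]).encode("utf-8")) % npartitions
        partitions[idx].append(row)
    return partitions


def min_max_by_ticker(rows):
    """Per-partition analogue of:
    SELECT ticker, min(price), max(price) ... GROUP BY ticker
    """
    stats = {}
    for row in rows:
        ticker, price = row["ticker"], row["price"]
        lo, hi = stats.get(ticker, (price, price))
        stats[ticker] = (min(lo, price), max(hi, price))
    return stats


# Made-up sample rows (not taken from prices.parquet).
rows = [
    {"ticker": "AAA", "price": 10.0},
    {"ticker": "BBB", "price": 5.0},
    {"ticker": "AAA", "price": 12.5},
    {"ticker": "CCC", "price": 7.0},
    {"ticker": "BBB", "price": 4.0},
]

partitions = hash_partition(rows, key="ticker", npartitions=3)
partial_results = [min_max_by_ticker(p) for p in partitions]

# Each ticker lives in exactly one partition, so the per-partition
# results can be combined without re-aggregating.
merged = {}
for result in partial_results:
    merged.update(result)

for ticker in sorted(merged):
    print(ticker, merged[ticker])
```

If the data were split arbitrarily instead of by a hash of `ticker`, the same ticker could appear in several partitions and the partial results would need a second aggregation step before they could be combined.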
For detailed guides and API reference:
We evaluated smallpond using the GraySort benchmark (script) on a cluster comprising 50 compute nodes and 25 storage nodes running 3FS. The benchmark sorted 110.5TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66TiB/min.
Details can be found in 3FS – Gray Sort.
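As a quick sanity check, the average throughput can be recomputed from the two published figures (110.5TiB sorted in 30 minutes and 14 seconds). The variable names below are illustrative only. Using the rounded inputs gives roughly 3.65TiB/min, which is close to the reported 3.66TiB/min; the small gap presumably comes from rounding in the published numbers.

```python
from datetime import timedelta

# Figures as published in the benchmark summary above.
data_sorted_tib = 110.5
elapsed = timedelta(minutes=30, seconds=14)

elapsed_minutes = elapsed.total_seconds() / 60
throughput_tib_per_min = data_sorted_tib / elapsed_minutes

print(f"{throughput_tib_per_min:.2f} TiB/min")
```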
pip install .[dev]

# run unit tests
pytest -v tests/test*.py

# build documentation
pip install .[docs]
cd docs
make html
python -m http.server --directory build/html
This project is licensed unde
9 Comments
fastasucan
What does this do – what is the benefit over DuckDB, Polars etc?
HackerThemAll
DuckDB itself is cool enough, especially when combined with SQLite and/or PostgreSQL, and now this. Thanks DeepSeek!
orlp
One thing I found peculiar is that for the GraySort benchmark it dispatches to Polars by default to do the actual sorting, not DuckDB: https://github.com/deepseek-ai/smallpond/blob/ed112db42af4d0….
shipp02
Is the code written by the deepseek model?
I should probably give up on being a software engineer if it is.
rubenvanwyk
May Data Engineering content keep on hitting front page HN!
RyanHamilton
If you want to check out DuckDB, try QStudio. It's a free SQL client with DuckDB integrated: https://www.timestored.com/qstudio/help/duckdb-sql-editor. Disclaimer: I'm the main author.
lvl155
Looking forward to the next few years when we can finally abstract away all the back-end techs.
dang
Related ongoing thread:
Understanding Smallpond and 3FS – https://news.ycombinator.com/item?id=43232410
also:
DuckDB goes distributed? DeepSeek's smallpond takes on Big Data – https://news.ycombinator.com/item?id=43206964 (no comments there, but some people have been recommending that article)
jamesblonde
We are seeing more and more specialized query engines.
This is a query engine specialized for training pipelines. It is not general purpose – it is for providing batches of training data to workers. It uses Ray for parallelization. The kinds of queries you need are random reads (to implement shuffling across epochs), Arrow support (zero copy to Pandas DataFrames), and efficient checkpointing.