DeepSeek has made a lot of noise lately. Their R1 model, released in January 2025, outperformed competitors like OpenAI’s o1 at launch. But what truly set it apart was its highly efficient infrastructure, dramatically reducing costs while maintaining top-tier performance.
Now, they’re coming for data engineers. DeepSeek released a bunch of small repositories as independent code modules. Thomas Wolf, Co-founder and Chief of Product at HuggingFace, shared some of his highlights, but we’re going to focus on one particularly important project that went unmentioned: smallpond, a distributed compute framework built on DuckDB. With smallpond, DeepSeek is pushing DuckDB beyond its single-node roots with a new, simple approach to distributed computing.
First, the fact that DeepSeek, a hot AI company, is using DuckDB is a significant statement, and we’ll see why. Second, we’ll dive into the repository itself, exploring their smart approach to turning DuckDB into a distributed system, along with its limitations and open questions.
I assume you’re familiar with DuckDB. I’ve created tons of content around it. But just in case, here’s a high-level recap.
For transparency, at the time of writing this blog, I’m a data engineer and DevRel at MotherDuck. MotherDuck provides a cloud-based version of DuckDB with enhanced features. Its approach differs from what we’ll discuss here, and while I’ll do my best to remain objective, just a heads-up! 🙂
DuckDB is an in-process analytical database, meaning it runs within your application without requiring a separate server. You can install it easily in multiple programming languages by adding a library—think of it as the SQLite of analytics, but built for high-performance querying on large datasets.
It’s built in C++ and contains all the integrations you might need for your data pipelines (AWS S3/Google Cloud Storage, Parquet, Iceberg, spatial data, etc.), and it’s damn fast. Besides working with common file formats, it has its own efficient storage format—a single ACID-compliant file containing all tables and metadata, with strong compression.
In Python, getting started is as simple as:
```bash
pip install duckdb
```
Then, load and query a Parquet file in just a few lines:
```python
import duckdb

conn = duckdb.connect()
conn.sql("SELECT * FROM '/path/to/file.parquet'")
```
It also supports reading and writing to Pandas and Polars DataFrames with zero copy, thanks to Arrow.
```python
import duckdb
import pandas

# Create a Pandas DataFrame
my_df = pandas.DataFrame.from_dict({'a': [42]})

# Query the Pandas DataFrame "my_df"
# Note: duckdb.sql connects to the default in-memory database connection
results = duckdb.sql("SELECT * FROM my_df").df()
```
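The same works with Polars. Here’s a minimal sketch, assuming a recent DuckDB version where query results expose a `.pl()` conversion to a Polars DataFrame:

```python
import duckdb
import polars as pl

# Query a Polars DataFrame directly and get the result back as Polars
my_pl_df = pl.DataFrame({"a": [42]})
results = duckdb.sql("SELECT * FROM my_pl_df").pl()
```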
We talk a lot about LLM frameworks, models, and agents, but we often forget that the first step in ANY AI project comes down to data.
Whether it’s for training, RAG, or other applications, everything starts with feeding systems good, clean data. But how do we even accomplish that step? Through data engineering. It’s a crucial step in AI workflows, but less discussed because it’s less “sexy” and less “new.”
Regarding DuckDB, we’ve already seen other AI companies like HuggingFace using it behind the scenes to quickly serve and explore their datasets library through their dataset viewer.
Now, DeepSeek is introducing smallpond, a lightweight open-source framework leveraging DuckDB to process terabyte-scale datasets in a distributed manner. Their benchmark states: “Sorted 110.5TiB of data in 30 minutes and 14 seconds, achieving an average throughput of 3.66TiB/min.”
While we’ve seen DuckDB easily crush 500GB on a single node (see ClickBench), this enters another realm of data size.
But wait, isn’t DuckDB single-node focused? What’s the catch here?
Let’s dive in.
smallpond follows a lazy evaluation approach when performing operations on DataFrames (`map()`, `filter()`, `partial_sql()`, etc.), meaning it doesn’t execute them immediately. Instead, it constructs a logical plan represented as a directed acyclic graph (DAG), where each operation corresponds to a node in the graph (e.g., `SqlEngineNode`, `HashPartitionNode`, `DataSourceNode`).
Execution is only triggered when an action is called, such as:
- `write_parquet()` – Write data to disk
- `to_pandas()` – Convert to a pandas DataFrame
- `compute()` – Explicitly request computation
- `count()` – Count rows
- `take()` – Retrieve rows
This approach optimizes performance by deferring computation until necessary, reducing redundant operations and improving efficiency.
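To make this concrete, here’s a minimal sketch loosely based on the quickstart in the smallpond README; treat the exact method names and arguments as illustrative and check the repository for the current API:

```python
import smallpond

# Initialize the smallpond session (backed by Ray under the hood)
sp = smallpond.init()

# Lazy: these calls only build the logical plan (DAG), nothing executes yet
df = sp.read_parquet("prices.parquet")
df = df.repartition(3, hash_by="ticker")
df = sp.partial_sql("SELECT ticker, min(price), max(price) FROM {0} GROUP BY ticker", df)

# Actions: trigger conversion to an execution plan and run the tasks
df.write_parquet("output/")
print(df.to_pandas())
```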
When execution is triggered, the logical plan is converted into an execution plan. The execution plan consists of tasks (e.g., `SqlEngineTask`, `HashPartitionTask`) that correspond to the nodes in the logical plan. These tasks are the actual units of work that will be distributed and executed through Ray.
The important thing to understand is that the distribution mechanism in smallpond operates at the Python level with help from Ray, specifically Ray Core, through partitions.
A given operation is distributed based on manual partitioning provided by the user. Smallpond supports multiple partitioning strategies:
- Hash partitioning (by column values)
- Even partitioning (by files or rows)
- Random shuffle partitioning
For each partition, a separate DuckDB instance is created within a Ray task. Each task processes its assigned partition independently using SQL queries through DuckDB.
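Conceptually, the pattern looks something like the sketch below: one Ray task per partition, each spinning up its own in-process DuckDB to run its SQL. This is not smallpond’s actual code, just a simplified illustration assuming a running Ray cluster and one Parquet file per partition:

```python
import duckdb
import ray

ray.init()  # connect to a local or existing Ray cluster

@ray.remote
def process_partition(partition_path):
    # Each Ray task gets its own in-process DuckDB instance
    conn = duckdb.connect()
    return conn.sql(
        f"SELECT ticker, min(price) AS min_price, max(price) AS max_price "
        f"FROM '{partition_path}' GROUP BY ticker"
    ).arrow()

# One task per partition file; Ray schedules them across the cluster
partitions = ["part-0.parquet", "part-1.parquet", "part-2.parquet"]
results = ray.get([process_partition.remote(p) for p in partitions])
```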
Given this architecture, you might notice that the framework is tightly integrated with Ray, which comes with a trade-off: it prioritizes scaling out (adding more nodes with standard hardware) over scaling up (improving the performance of a single node).
Therefore, you need a Ray cluster. Multiple options exist, but most of them mean managing your own cluster on AWS/GCP compute instances or Kubernetes. Only Anyscale, the company founded and led by the creators of Ray, offers a fully managed Ray platform.