1. The DataFrame scale gap
When I started working on Polars, I was surprised by how much DataFrame implementations differed from SQL and databases. SQL could run anywhere: embedded, in a client-server model, or on full OLAP data warehouses.
DataFrames, on the other hand, had a different API per use case, and performance lagged drastically behind SQL solutions. Locally, pandas was dominant; remotely/distributed, it was PySpark. Pandas was very easy for end users to get up and running with, but it seems to have ignored what databases have learned over decades: there was no query optimization, a poor data type implementation, many needless materializations, memory handling offloaded to NumPy, and a few other design decisions that led to poor scaling and inconsistent behavior. PySpark was much closer to databases: it follows the relational model, has query optimization, a distributed query engine, and scaled properly. However, PySpark is written in Scala, requires the JVM to run locally, has a very unpythonic UX (Java backtraces, for one), and is very sensitive to OOMs. It was also designed for commodity hardware a decade ago, and row-based OLAP execution has proven to be suboptimal.
With Polars we want to unify those two worlds under one flexible DataFrame API, backed by highly performant compute. Our initial (and achieved) goal was to offer an alternative to pandas, with a flexible API that enables query optimization and parallel streaming execution. Second, we want to make running DataFrame code remotely easy. Just like SQL, a Polars LazyFrame is a description of a query, and it can be sent to a server to be executed remotely. This should be dead easy. In the cloud-dominant era, you should not be limited to the boundaries of your laptop.
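To make that concrete, here is a minimal local sketch (the file path and column names are made up): a LazyFrame only records the query, and nothing runs until you ask for a result.

import polars as pl

# A LazyFrame is only a description of the work to be done; nothing is read
# or computed at this point. Path and column names are illustrative.
lf = (
    pl.scan_parquet("data/orders.parquet")
    .filter(pl.col("amount") > 100)
    .group_by("customer_id")
    .agg(pl.col("amount").sum())
)

print(lf.explain())  # inspect the optimized query plan
df = lf.collect()    # only now is the query executed (locally)

Because the query plan is data, not eagerly executed code, the same description can just as well be shipped to a remote machine for execution.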
2. Run Polars everywhere
Our goal is to enable scalable data processing with all the flexibility and expressiveness of Polars’ API.
We are working on two things: Polars Cloud and a completely novel streaming engine design. We will explain more about the streaming engine in later posts; today we want to share what we are building with Polars Cloud.
The features we will offer:
- Distributed Polars; one API for all high-performance DataFrame needs;
- Serverless compute;
- Configurable hardware, both GPU and CPU;
- Diagonal scaling; scaling both horizontally and vertically;
- Bring your own cloud; AWS, Azure and GCP;
- On premise licensing;
- Fault tolerance;
- Data lineage;
- Observability.
It will be seamless to spin up hardware and run Polars queries remotely, either in batch mode for production ETL jobs or interactively for data exploration. In the rest of the post, we explore this through a few code examples.
3. A remote query
It’s important for us that starting a remote query feels native and seamless for the end user. Running a query remotely will be available from within Polars’ native API.
Note that we are agnostic about where you call this code. You can start a remote query from a notebook on your machine, an Airflow DAG, an AWS Lambda, your server, etc. The compute needed for data processing is often much higher than the compute needed for orchestration in Airflow or Prefect. By not constraining you to a platform where you need to run your queries, we give you the flexibility to embed Polars Cloud in any environment.
Below we start our first remote query.
import polars as pl
import polars_cloud as pc
from datetime import date
query = (
    pl.scan_parquet("s3://my-dataset/")
    # the date literal is truncated in this copy of the post; a TPC-H-style cutoff is assumed
    .filter(pl.col("l_shipdate") <= date(1998, 9, 2))
)
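The snippet is cut off at this point in this copy of the post. As a rough, hypothetical sketch of what dispatching such a query could look like: pc.ComputeContext is referenced further down in the comments, but the parameter names and the .remote(...) call shown here are assumptions, not confirmed API.

# Hypothetical sketch: dispatch the query to Polars Cloud instead of
# collecting it locally. The ComputeContext parameters and the `.remote(...)`
# call are assumptions, not confirmed API.
ctx = pc.ComputeContext(cpus=16, memory=64)

result = (
    query
    .remote(ctx)                              # run on remote hardware
    .sink_parquet("s3://my-dataset-output/")  # write the result back to S3
)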
17 Comments
LaurensBER
This is very impressive and definitely fills a huge hole in the whole data frame ecosystem.
I've been quite impressed with the Polars team and after using Pandas for years, Polars feels like a much needed fresh wind. Very excited to give this a go sometime soon!
0cf8612b2e1e
I’ll bite- what’s the pitch vs Dask/Spark/Ray/etc?
I am admittedly a tough sell when the workstation under my desk has 192GB of RAM.
__mharrison__
Really excited for the Polars team. I've always been impressed by their work and responsiveness to issues I've filed in the past. The world is lifted when there is good competition like this.
TheAlchemist
Having switched from Pandas to Polars recently, this is quite interesting and I guess performance wise it will be excellent.
whalesalad
Never understood these kinds of cloud tools that deal with big data. You are paying enormous ingress/egress fees to do this.
Starlord2048
I can appreciate the pain points you guys are addressing.
The "diagonal scaling" approach seems particularly clever – dynamically choosing between horizontal and vertical scaling based on the query characteristics rather than forcing users into a one-size-fits-all model. Most real-world data workloads have mixed requirements, so this flexibility could be a major advantage.
I'm curious how the new streaming engine with out-of-core processing will compare to Dask, which has been in this space for a while but hasn't quite achieved the adoption of pandas/PySpark despite its strengths.
The unified API approach also tackles a real issue. The cognitive overhead of switching between pandas for local work and PySpark for distributed work is higher than most people acknowledge. Having a consistent mental model regardless of scale would be a productivity boost.
Anyway, I would love to apply for the early access and try it out. I'd be particularly interested in seeing benchmark comparisons against Ray, Dask, and Spark for different workload profiles. Also curious about the pricing model and the cold start problem that plagues many distributed systems.
tfehring
This is really cool, not sure how I missed it. I assume catalog support will be added fairly quickly. But ironically I think the biggest barrier to adoption will be the lack of an off-ramp to a FOSS solution that companies can self-host. Obviously Polars itself is FOSS, but it understandably seems like there's no way to self-host a backend to point a `pc.ComputeContext` to. That will be an especially tough selling point for companies that are already on Spark. I wonder how much they'll focus on startups vs. trying to get bigger companies to switch, and whether they'll try a Spark compatibility layer like DataFusion (https://github.com/apache/datafusion-comet).
whyho
How does this integrate into existing services like aws glue? I fear that despite polars being good/better it will lack adoption since it cannot easily be integrated.
melvinroest
I just got into data analysis recently (former software engineer) and tried out pandas vs polars. I like polars way more because it feels like SQL but then sane, and it's faster. It's clear in what it tries to do. I didn't really have that with pandas.
efxhoy
Looks great! Can I run it on my own bare metal cluster? Will I need to buy a license?
marxisttemp
What does this project have to do with Serbia? They’re based in the Netherlands. They must have made a mistake when registering their domain name.
marquisdepolis
This is very interesting, clearly there's a major pain point here to be addressed, especially the delta between local pandas work and distributed [pyspark] work!
Would love to test this out and do benchmarks against us/ Dask/ Spark/ Ray etc which have been our primary testing ground. Full disclosure, work at Bodo which has similar-ish aspirations (https://github.com/bodo-ai/Bodo), but FOSS all the way.
noworriesnate
Every time I build something complex with dataframes in either R or Python (Pandas, I haven't used Polars yet), I end up really wishing I could have statically typed dataframes. I miss the security of knowing that when I change common code, the compiler will catch if I break a totally different part of the dashboard for instance.
I'm aware of Pandera[1] which has support for Polars as well but, while nice, it doesn't cause the code to fail to compile, it only fails at runtime. To me this is the achilles heel of analysis in both Python and R.
Does anybody have ideas on how this situation could be improved?
[1] https://pandera.readthedocs.io/en/stable/
c7THEC2DDFVV2V
who covers egress costs?
Larrikin
As a hobbyist, I describe polars as pandas if it was planned for humans to use. It's great to use, I just hate running into issues trying to use it. I wish them luck