A decade ago, the data industry struggled with scaling data warehouses. The solution that emerged was to take data out of the warehouse and transform it in memory with programming languages like Python and Java, using tools like Hadoop, Spark, and Dask. This was accompanied by a move from data warehouses to data lakes.
But modern data warehouses have largely removed the scalability problem, and we’re now seeing a reverse trend: a move back to transforming data within the warehouse. This change has been driven largely by dbt (data build tool), which addresses a number of important limitations of working with SQL (Structured Query Language) and is seeing strong adoption. dbt enables data analysts to write modular SQL code, encourages the use of Git for version control, and supports model testing. This takes away many of the traditional arguments against SQL in the SQL vs Python debate.
The clean division of labor between SQL (data querying and consolidation) and Python (complex data transformation) appears to be fading. For example, tools like dask-sql let you both query and transform data using a mix of SQL operations and Python code, as in the sketch below. This article compares SQL and Python in terms of performance, functionality, and developer experience.
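As a quick illustration of how that blending looks in practice, here is a minimal dask-sql sketch (the file path, table name, and column names are hypothetical): a Dask DataFrame is registered as a SQL table, queried with SQL, and then refined further with ordinary Python/pandas-style operations.

```python
import dask.dataframe as dd
from dask_sql import Context

# Hypothetical dataset: Parquet files with "country" and "amount" columns.
orders = dd.read_parquet("s3://my-bucket/orders/*.parquet")

c = Context()
c.create_table("orders", orders)  # expose the Dask DataFrame to SQL

# Querying and consolidation in SQL...
totals = c.sql("""
    SELECT country, SUM(amount) AS total_amount
    FROM orders
    GROUP BY country
""")

# ...followed by further transformation in plain Python on the (lazy) result.
top_markets = totals[totals.total_amount > 1_000_000].compute()
print(top_markets)
```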
SQL vs Python: Performance
Running SQL code on data warehouses is generally faster than Python for querying data and doing basic aggregations. This is mainly because the data has a schema applied and the computation happens close to the data. When processing data with Python or other programming languages, you first need to extract the data out of the database and load it into local memory which is likely to introduce latency. Technically, databases have to load data into memory too, but the cost of that operation is much lower.
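To make the “computation happens close to the data” point concrete, here is a small sketch (the connection string and table are hypothetical) contrasting an aggregation pushed down to the database with one done in pandas after pulling the whole table into local memory.

```python
import pandas as pd
from sqlalchemy import create_engine

# Hypothetical PostgreSQL database containing an "orders" table.
engine = create_engine("postgresql://user:password@host:5432/shop")

# Option 1: push the aggregation into the database.
# Only a handful of summary rows travel over the network.
summary_sql = pd.read_sql(
    "SELECT country, SUM(amount) AS total_amount FROM orders GROUP BY country",
    engine,
)

# Option 2: pull every row into local memory first, then aggregate with pandas.
orders = pd.read_sql("SELECT * FROM orders", engine)
summary_pd = orders.groupby("country", as_index=False)["amount"].sum()
```

For large tables, the second option pays the cost of transferring and materializing all of the rows before any computation starts.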
That said, it is difficult to say anything concretely meaningful about the performance comparison between SQL and Python without specifying things like the SQL execution engine (SQLite, PostgreSQL, Snowflake, …) and the Python libraries you are using to process the data (pandas, NumPy with Numba, …).
For example, parallel computing solutions like Dask and others scale Python code to larger-than-memory datasets and can significantly lower processing times. Managing the underlying infrastructure for running computations on remote clusters has often been a barrier to adoption for such tools. Enterprise solutions like Coiled are removing this barrier by automating the launching and management of remote clusters (disclaimer: I work there).
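As a sketch of what that looks like in code (the data path and cluster size are assumptions, and the Coiled lines are optional), Dask lets the same pandas-style operations run in parallel across partitions, locally or on a remote cluster:

```python
import dask.dataframe as dd

# Optional: run on a remote cluster managed by Coiled instead of locally.
# import coiled
# from dask.distributed import Client
# cluster = coiled.Cluster(n_workers=20)
# client = Client(cluster)

# Hypothetical larger-than-memory dataset stored as Parquet.
events = dd.read_parquet("s3://my-bucket/events/*.parquet")

# Familiar pandas-style operations, executed in parallel across partitions.
daily_completed = (
    events[events.status == "completed"]
    .groupby("event_date")
    .size()
    .compute()
)
```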
Nevertheless, using SQL to transform data inside the warehouse is generally a lot faster for many basic queries and aggregations as it moves code to data rather than data to code. And while a full-fledged technical benchmark is beyond the scope of this article, this comparison shows SQLite can be significantly faster than pandas for basic “select” and “filter” queries. For more context, however, you might want to check out the Databricks and Snowflake benchmarking discussions here and here.
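If you want to run a comparison like that yourself, a rough sketch looks like the following (the synthetic column names and sizes are made up, and this is far from a rigorous benchmark): time the same filter in SQLite and in pandas and see how they behave on your data.

```python
import sqlite3
import time

import numpy as np
import pandas as pd

# Build a synthetic table and load it into both pandas and SQLite.
df = pd.DataFrame({
    "id": np.arange(1_000_000),
    "value": np.random.rand(1_000_000),
})
conn = sqlite3.connect(":memory:")
df.to_sql("measurements", conn, index=False)

# Basic "filter" query executed by SQLite.
start = time.perf_counter()
sql_result = pd.read_sql("SELECT * FROM measurements WHERE value > 0.9", conn)
print(f"SQLite: {time.perf_counter() - start:.3f}s")

# Equivalent filter in pandas.
start = time.perf_counter()
pd_result = df[df["value"] > 0.9]
print(f"pandas: {time.perf_counter() - start:.3f}s")
```

Results from micro-benchmarks like this depend heavily on data size, indexes, and whether the result has to be converted back into a DataFrame, which is why the engine-specific benchmarks linked above are worth reading.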
Modern analytic databases are also equipped with powerful query optimizers. SQL users can write queries that describe the desired transformation, and the optimizer works out an efficient execution plan, something that would otherwise have to be hand-tuned in imperative code.
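As a small illustration of handing execution decisions to the engine, the sketch below uses SQLite’s EXPLAIN QUERY PLAN (the table and index are hypothetical); warehouse engines such as PostgreSQL and Snowflake expose similar EXPLAIN facilities.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (id INTEGER, country TEXT, amount REAL)")
conn.execute("CREATE INDEX idx_orders_country ON orders (country)")

# The query states *what* we want; the optimizer decides *how* to compute it,
# for example whether to use the index on "country".
plan = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT country, SUM(amount) FROM orders "
    "WHERE country = 'DE' GROUP BY country"
).fetchall()
for row in plan:
    print(row)
```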