Originally posted: 2020-10-22.
Learning more about a tool that can filter and aggregate two billion rows on a laptop in two seconds
In my work as a data scientist, I’ve come across Apache Arrow in a range of seemingly unrelated circumstances. However, I’ve always struggled to describe exactly what it is, and what it does.
The official description of Arrow is:
a cross-language development platform for in-memory analytics
which is quite abstract, and for good reason. The project is extremely ambitious, and aims to provide the backbone for a wide range of data processing tasks. This means it sits at a low level, providing building blocks for higher-level, user-facing analytics tools like pandas or dplyr.
As a result, the importance of the project can be hard to grasp for users who run into it only occasionally in their day-to-day work, because much of what it does happens behind the scenes.
In this post I describe some of the user-facing features of Apache Arrow which I have run into in my work, and explain why they are all facets of more fundamental problems which Apache Arrow aims to solve.
By connecting these dots it becomes clear why Arrow is not just a useful tool to solve some practical problems today, but one of the most exciting emerging tools, with the potential to be the engine behind large parts of future data science workflows.
Faster csv reading
A striking feature of Arrow is that it can read csvs into pandas more than 10x faster than pandas.read_csv.
This is actually a two-step process: Arrow reads the data into memory as an Arrow table, which is really just a collection of record batches, and then converts the Arrow table into a pandas dataframe.
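As a rough sketch, assuming a local file called data.csv, the two steps look like this using pyarrow’s csv reader and the table’s to_pandas method:

```python
import pyarrow.csv as pv

# Step 1: read the csv into an Arrow Table (a collection of record batches),
# using Arrow's multithreaded csv reader
table = pv.read_csv("data.csv")

# Step 2: convert the Arrow Table into a pandas DataFrame
df = table.to_pandas()
```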
The speedup is thus a consequence of the underlying design of Arrow:
- Arrow has its own in-memory storage format. When we use Arrow to load data into pandas, we are really loading data into the Arrow format (an in-memory format for data frames), and then translating this into the pandas in-memory format. Part of the speedup in reading csvs therefore comes from the careful design of the Arrow columnar format itself.
- Data in Arrow is stored in-memory in record batches, a 2D data structure containing contiguous columns of data of equal length. A ‘table’ can be created from these batches without requiring additional memory copying, because tables can have ‘chunked’ columns (i.e. sections of data, each representing a contiguous chunk of memory). This design means that data can be read in parallel, rather than with the single-threaded approach of pandas.
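To make the record batch and chunked column idea concrete, here is a small sketch (continuing from the table read above) that inspects those structures via pyarrow; the exact counts will depend on how the file was read:

```python
# Each Table is backed by one or more record batches...
batches = table.to_batches()
print(f"table is backed by {len(batches)} record batch(es)")

# ...and each column is a ChunkedArray: a sequence of contiguous chunks
# assembled into a single logical column without copying the underlying data
first_column = table.column(0)
print(f"first column has {first_column.num_chunks} chunk(s)")
```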
Faster user defined functions (UDFs) in PySpark
Running Python user defined functions in Apache Spark has historically been very slow — so slow that the usual recommendation was not to do it on datasets of any significant size.
More recently, Apache Arrow has made it possible to efficiently transfer data between JVM and Python processes. Combined with vectorised UDFs, this has resulted in huge speedups.
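By way of illustration, a vectorised (pandas) UDF looks something like the sketch below. The column name and the transformation are made up for the example, and it assumes a SparkSession is already available:

```python
import pandas as pd
from pyspark.sql.functions import pandas_udf

# A vectorised UDF receives a whole pandas Series at a time (transferred
# between the JVM and Python via Arrow), rather than one row per Python call
@pandas_udf("double")
def fahrenheit_to_celsius(f: pd.Series) -> pd.Series:
    return (f - 32) * 5.0 / 9.0

# Hypothetical usage, assuming a Spark DataFrame `df` with a 'temp_f' column:
# df.select(fahrenheit_to_celsius("temp_f")).show()
```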
What’s going on here?
Prior to the introduction of Arrow, the process of translating data from the Java representation to the Python representation and back was slow, involving serialisation and deserialisation of the data row by row using pickle.