We are excited to announce our new project, Xorbits, a scalable data science framework that aims to scale the entire Python data science world.
Before delving into the details, I’d like to highlight some of the key features of Xorbits:
- Xorbits is incredibly easy to get started with: if you know how to use pandas, you probably already know how to use Xorbits. In addition to the pandas API, Xorbits also supports the numpy API, and more APIs will be integrated in the future.
- Xorbits is lightning fast and able to process terabytes of data with ease. In our benchmark tests, Xorbits outperformed the most popular distributed data science frameworks available today.
- Xorbits is super easy to deploy, with official support for running on your laptop, on an existing cluster, on Kubernetes, and in the cloud.
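To illustrate the API-compatibility claim above, here is a minimal sketch written against plain pandas; per the Xorbits documentation, the same code can run distributed by changing only the import to `import xorbits.pandas as pd` (the column names and data below are made up for illustration).

```python
# Standard pandas code; with Xorbits, only the import line changes:
#   import xorbits.pandas as pd
import pandas as pd

df = pd.DataFrame({
    "city": ["NY", "SF", "NY", "SF"],
    "sales": [10, 20, 30, 40],
})

# The familiar pandas API: groupby + aggregation
totals = df.groupby("city")["sales"].sum()
print(totals.to_dict())  # {'NY': 40, 'SF': 60}
```

Because the API surface is the same, existing pandas scripts need no rewrite to scale out.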
Background: Python data science world
Python has become the most popular programming language
The increasing popularity of AI and data science has further propelled Python to the forefront as the most popular programming language.
According to the most widely cited programming language rankings, the TIOBE Index and IEEE Spectrum, Python is the number one programming language as of January 2023.
TIOBE Index
Top programming languages from IEEE Spectrum
Data analysis and machine learning are the most popular fields
According to the Python Developers Survey 2021, conducted by the PSF and JetBrains, data analysis and machine learning are the most popular fields for Python usage.
Python Developers Survey 2021
NumPy, pandas, and similar libraries are the most popular for data science
According to the Python Developers Survey 2021, conducted by the PSF and JetBrains, and the Stack Overflow 2022 Developer Survey, NumPy, pandas, and similar libraries are the most popular for data science.
Python Developers Survey 2021
What do we need to scale Python data science?
As more users flock to the Python data science world, several issues may arise, including:
- The current ecosystem may struggle to process large datasets. While libraries such as numpy and pandas are suitable and fast for working with data at the megabyte scale, when data grows to gigabytes or larger, users may encounter errors such as “out of memory.”
- Libraries like numpy and pandas lack scalability. Most operations in these libraries run on a single CPU core, making it difficult to scale to multiple cores or to a cluster. Additionally, hardware such as GPUs, which is commonly used for AI applications, cannot be utilized to accelerate data science tasks.
Users can upgrade their machines, but since most operations in these libraries cannot utilize multiple cores, this may not be an effective solution; the only meaningful upgrade would be adding more memory.
Given these limitations, users may require an alternative solution for scaling data science. In our view, the following key points should be taken into consideration.
Compatibility with the existing ecosystem
Libraries such as pandas and numpy have been widely adopted due to their user-friendliness and wealth of features. Any substantial modifications to their APIs would require users to make considerable changes to their existing code, which would be a significant burden for those who have many historical tasks that