Skip to content

Latest commit

 

History

History
100 lines (64 loc) · 14.7 KB

README.md

File metadata and controls

100 lines (64 loc) · 14.7 KB

Awesome Pandas Alternatives Awesome

A curated list of awesome Python frameworks, libraries, software and resources.

Table of Contents

Pandas Shortcomings

We all love pandas and most of us have learnt how to do data manipulation using this amazing library. However useful and great pandas is, unfortunately it has some well-known shortcomings that developers have been trying to address in the past years. Here are the most common weak points:

  • You might read that pandas generall requires as much RAM as 5-10 times the dataset you are working with, mostly because many operations generate under the hood a in-memory copy of the data;
  • pandas eagerly executes code, i.e. executes every statement sequentially. For example, if you read a .csv file and then filter it on a specific column (say, col2 > 5), the whole dataset will be first read into memory and then the subset you are interested in will be returned. Of course, you could manually write pandas command sequentially to improve on efficiency, but one can only do so much. For this reason, many of the pandas alternatives implement lazy evaluation - i.e. do not execute statements until a .collect() or .compute() method is called - and include a query execution engine to optimise the order of the operations (read more here).

This awesome-repo aims to gather the libraries meant to overcome pandas weaknesses, as well as some resources to learn them. Everyone is encouraged to add new projects, or edit the descriptions to correct potential mistakes.

Almost Under Every Hood: Apache Arrow

Most of these libraries leverage Apache Arrow, "a language-independent columnar memory format". In other words, unlike good old .csv files that are stored by rows, Apache Arrow storage is (also) column-based. This allows partitioning the data in chunks with a lot of clever tricks to enable greater compression (like storing sequences of repeated values) and faster queries (because each chunk also stores metadata like the min or max value).

Arrow offers Python bindings with its Python API, named pyarrow. This library has modules to read and write data with either Arrow formats (.parquet most notably, but also .feather) and other formats like .csv and .json, but also data from cloud storage services and in a streaming fashion, which means data is processed in batches and does not need to be read wholly into memory. The module pyarrow.compute allows to perform basic data manipulation.

On its own, pyarrow is rarely being used as a standalone library to perform data manipulation: usually more expressive and feature rich modules are built upon Arrow, especially on its fast C++ or Rust API interface. For this reason, most of the libraries listed here will display a general landing page and links to other languages APIs (mostly, Python and R). To be honest, the R {arrow} interface has a backend for {dplyr} (the equivalent of pandas in R), which makes its use more straightforward. Development for the R {arrow} package is quite active!

In-Memory DataFrame Libraries

These libraries leverage Apache Arrow memory format to implement a parallelised and lazy execution engine. These are designed to take advantage of all the cores (and threads) of a machine, but are mainly geared towards dealing with in-memory data (i.e., with read_csv()-like functions to read the data into the memory of your machine, instead of processing it on a cluster node).

  • polars: Polars claims to be "a blazingly fast DataFrames library implemented in Rust using Apache Arrow Columnar Format as memory model". It leverages (quasi-)lazy evaluation, uses all cores, has multithreaded operations and its query engine is written in Rust. polars has an expressive and high-level API that enables complex operations.
  • duckdb: is another fairly recent DataFrame library. It offers both a SQL interface and a Python API: in other words, it can be used to query .csv and Arrow's .parquet files, but also in-memory pandas.DataFrames using both SQL and a syntax closer to Python or pandas. It supports window functions.
    • duckdb has a very nice blog that explains its optimistations under the hood: for example, here you can find an overview of how Apache's .parquet format works and the performance tricks used by the query engine to run several orders of magnitude faster than pandas.

Distributed Computing Libraries

Compared to the libraries above, the following are meant to orchestrate data manipulation over clusters, i.e. distribute computing across several nodes (multiple machines with multiple processors) via a query execution engine. Since they need to plan execution preemptively, they can be slower than pandas on a single-core machine.

Row-based storage

These are the "first generation" of query planners that - as of now - are not built around columnar storage. Nonetheless, they represent the industry standard for distributed computing, and offer much more than data manipulation: they can even implement machine learning libraries to train models on the cluster. The downside is that moving from pandas to, say, Apache's spark is not straightforward, as the API syntax can differ.

  • dask is among the first distributed computing DataFrame libraries, alongside spark. dask does not only offer a distributed computing equivalent of pandas, but also of numpy and scikit-learn, so that "you don't have to completely rewrite your code or retrain to scale up", i.e. to run code on distributed clusters. A Dask DataFrame "is a large parallel DataFrame composed of many smaller Pandas DataFrames, split along the index".
  • spark: the de-facto leader of distributed computing, now it is faring more competition. It offers MLib, a machine learning library, as well as GraphX for "graphs and graph-parallel computation". Its Python API is pyspark.

Columnar Based Storage

The "next generation" of distributed query planners, that leverage columnar-based storage.

Using Rust

  • arrow-datafusion: this is Arrow's query engine, written in the Rust programming language. Much like duckdb, datafusion offers both a SQL and Python-like API interface. Much like pyspark, datafusion "allows you to build a plan through SQL or a DataFrame API against in-memory data, parquet or CSV files, run it in a multi-threaded environment, and obtain the result back in Python".
    • ballista is the distributed query engine, written in Rust, built on top of arrow-datafusion. Here is a comparison between ballista and spark.

Using Ray

The following libraries leverage Ray as their default distributing engine. Ray is a Python API for building distributed applications (or, in technical jargon, "scaling your code") - i.e., it offers a series of functions to make your code run on multiple computing nodes. Using their own words, Ray is a "cloud-provider independent compute launcher/autoscaler", i.e. can be used seamlessly on cloud platforms such as GoogleCloud, Amazon AWS and the likes. Ray can be used to parallelise code - such as your own scripts - but also for every step of machine learning: hyperparameter tuning and model serving, for instance. The community has built many libraries to build a bridge between Ray and libraries such as XGBoost and scikit-learn, lightGBM, PyTorch Lightning, Ariflow, PyCaret and many more.

Ray itself has a Datasets "loading and pre-processing library" to load and exchange data in Ray libraries and applications. As explained in the blogpost of the official launch, however, "Datasets is not intended as a replacement for generic data processing systems like Spark. Datasets is meant to be the last-mile bridge between ETL pipelines and distributed applications running on Ray". Under the hood, though, Ray Datasets still leverages Apache Arrow to store in-memory tables in the distributed clusters.

  • modin attempts to parallelize as much of the pandas API as is possible. Its developers claim to "have worked through a significant portion of the DataFrame API" such much so that modin "is intended to be used as a drop-in replacement for pandas, such that even if the API is not yet parallelized, it is still defaulting to pandas". The library is currently under active development.
    • Under the hood, modin calls either to dask, ray or OmiSciDB to automatically orchestrate the tasks across cores on your machine or cluster. A good overview of what happens is here.
    • Offers a spreadsheet-like API to modify dataframes and a whole lot of other experimental features, like using SQL queries on DataFrames and even fitting XGBoost models using distributed computing.
    • Has excellent documentation, including explanatory articles on the differences between modin and pandas or dask (also here).
  • mars is a "unified framework for large-scale data computation", like dask, but is tensor-based. It offers modules to scale numpy, pandas, scikit-learn and many other libraries.

Higher Level APIs

These are libraries that implement an abstract layer to make pandas code easier to reuse across distributed frameworks, mainly dask and spark. It is to be noted that, on October 2021, pyspark adopted a pandas-like API.

  • fugue is "a unified interface for distributed computing that lets users execute Python, pandas, and SQL code on Spark and Dask without rewrites". Simply put, fugue adds an abstract layer that makes code portable between across differing computing frameworks.
  • koalas is a wrapper around spark that offers pandas-like APIs. It is no longer developed, since pyspark adopted the pandas API and koalas was merged into pyspark.

GPU Libraries

Generally libraries work on CPUs and clusters are usually made up of CPUs. Apart from some notable exceptions, such as deep learning libraries like Tensorflow and PyTorch, usually regular libraries do not work on GPUs. This is due to major architectural differences across the two chips.

  • cuDF is a GPU dataframe library, which is part of the RapidsAI framework, that enables "end-to-end data science and analytics pipelines entirely on GPUs". There are many other libraries, like cuML for machine learning, cuspatial for spatial data manipulations, and more. cuDF is based on Apache Arrow, because the memory format is compatible with both CPU and GPU architecture.
  • blazingSQL is "is a GPU accelerated [distributed] SQL engine built on top of the RAPIDS ecosystem" and, as such, leverages Apache Arrow. Think of this as Apache spark on GPU.

Other Libraries and Ports from R

R has an amazing library called {dplyr} that enables easy data manipulation. {dplyr} is part of the so-called {tidyverse}, a unified framework for data manipulation and visualisation.

  • pyjanitor was originally conceived as a pandas extension of the well-known {janitor} R package. The latter was a package to clean strings with ease in a R'data.frame or tibble objects, but later incorporated new methods to make it more similar to {dplyr}. Adding and removing column, for example, is easier with the dedicated methods df.remove_column() and df.add_column(), but also renaming column is easier with df.rename_column(). This enables to run smoother pipelines that exploit method chaining.
  • pydatatable is a Python port of the astounding {data.table} library in R, that achieves impressive results thanks to parallelisation.
  • pandasql allows to query pandas.DataFrames using SQL syntax.