This document defines a Python dataframe API.
A dataframe is a programming interface for expressing data manipulations over a data structure consisting of rows and columns. Columns are named, and values in a column share a common data type. This definition is intentionally left broad.
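As a rough illustration of this definition, the sketch below models a dataframe as a set of named columns whose values share a data type. The class name `ToyFrame` and its members are hypothetical and are not part of the API defined in this document.

```python
from typing import Any, Dict, List


class ToyFrame:
    """Hypothetical minimal dataframe: named columns of equal length."""

    def __init__(self, columns: Dict[str, List[Any]]) -> None:
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same number of rows")
        for name, values in columns.items():
            # Values in a column share a common data type.
            if len({type(v) for v in values}) > 1:
                raise TypeError(f"column {name!r} mixes data types")
        self._columns = dict(columns)

    @property
    def column_names(self) -> List[str]:
        return list(self._columns)

    def num_rows(self) -> int:
        # An empty frame has zero rows.
        return len(next(iter(self._columns.values()), []))


frame = ToyFrame({"city": ["Oslo", "Lima"], "population_m": [0.7, 10.1]})
print(frame.column_names, frame.num_rows())
```

Real dataframe libraries differ widely in how they store and validate columns; the checks here only mirror the two properties named in the definition.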
Dataframe libraries exist in several programming languages, such as R, Scala, Julia and others.
In Python, the most popular dataframe library is pandas.
pandas was initially developed at a hedge fund, with a focus on panel data and financial time series.
It was open sourced in 2009, and since then it has been growing in popularity, spreading to
many other domains outside time series and financial data. While still rich in time series
functionality, today it is considered a general-purpose dataframe library. The original Panel
class that gave its name to the library was deprecated in 2017 and removed in 2019,
to focus on the main DataFrame class.
Internally, pandas is implemented on top of NumPy, which is used to store the data and to perform many of the operations. Some parts of pandas are written in Cython.
As of 2020, the pandas website has around one and a half million visitors per month.
Other libraries have emerged in recent years to address some of the limitations of pandas. In most cases, these libraries implement a public API very similar to that of pandas, to make the transition to them easier. A short description of the main dataframe libraries in Python follows.
Dask is a task scheduler built in Python, which implements a dataframe interface. Dask dataframe uses pandas internally in the workers, and it provides an API similar to pandas, adapted to its distributed and lazy nature.
Vaex is an out-of-core alternative to pandas. Vaex uses HDF5 to create memory maps that avoid loading data sets into memory. Some parts of Vaex are implemented in C++.
Modin is a distributed dataframe library originally built on Ray. It has a modular design that also allows it to use Dask as a scheduler, or to replace the pandas-like public API with a SQLite-like one.
cuDF is a GPU dataframe library built on top of Apache Arrow and RAPIDS. It provides an API similar to pandas.
PySpark is a dataframe library that uses Spark as a backend. The PySpark public API is based on the original Spark API, not on pandas.
Koalas is a dataframe library built on top of PySpark that provides a pandas-like API.
Ibis is a dataframe library with multiple SQL backends. It uses SQLAlchemy and a custom SQL compiler to translate its pandas-like API to SQL statements, executed by the backends. It supports conventional DBMS, as well as big data systems such as Apache Impala or BigQuery.
Polars is a dataframe library written in Rust, with Python bindings available. Its API is intentionally different from the pandas one.
Given the growing Python dataframe ecosystem, and its complexity, this document provides
a standard Python dataframe API. Until recently, pandas has been a de facto standard for
Python dataframes. But there is now a growing number not only of dataframe libraries,
but also of libraries that interact with dataframes (visualization, statistical or machine learning
libraries, for example). Interactions among libraries are becoming complex, and the pandas
public API is suboptimal as a standard because of its size, its complexity, and the implementation
details it exposes (for example, using NumPy data types or NaN).
In the first iteration of the API standard, the scope is limited to creating a data exchange protocol. In future iterations the scope will be broader, including elements to operate with the data.
The different elements of the API are in the scope of this document. This includes signatures and semantics. To be more specific:
- Data structures and Python classes
- Functions, methods, attributes and other API elements
- Expected returns of the different operations
- Data types (Python and low-level types)
The scope of this document is limited to generic dataframes, and not dataframes specific to certain domains.
The goal of the first iteration is to provide a data exchange protocol, so consumers of dataframes can interact with a standard interface to access their data.
The goal of future iterations will be to provide a standard interface that encapsulates implementation details of dataframe libraries. This will allow users and third-party libraries to write code that interacts and operates with a standard dataframe, and not with specific implementations.
The main goals for the API defined in this document are:
- Make conversion of data among different implementations easier
- Let third-party libraries consume dataframes from any implementation
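The two goals above can be pictured with a small sketch in which a consumer depends only on a shared interface rather than on a concrete library. Everything named here (`ExchangeableFrame`, `column_names`, `get_column`, and both frame classes) is hypothetical and is not the interface defined by this specification.

```python
from typing import Dict, Iterable, List, Protocol, Sequence, Tuple


class ExchangeableFrame(Protocol):
    """Hypothetical exchange interface; the names are illustrative only."""

    def column_names(self) -> Iterable[str]: ...

    def get_column(self, name: str) -> Sequence[float]: ...


class ListBackedFrame:
    """One hypothetical implementation, storing each column as a list."""

    def __init__(self, data: Dict[str, List[float]]) -> None:
        self._data = data

    def column_names(self) -> Iterable[str]:
        return list(self._data)

    def get_column(self, name: str) -> Sequence[float]:
        return self._data[name]


class RowBackedFrame:
    """A second hypothetical implementation that stores rows instead."""

    def __init__(self, names: List[str], rows: List[Tuple[float, ...]]) -> None:
        self._names = list(names)
        self._rows = list(rows)

    def column_names(self) -> Iterable[str]:
        return list(self._names)

    def get_column(self, name: str) -> Sequence[float]:
        idx = self._names.index(name)
        return [row[idx] for row in self._rows]


def column_means(frame: ExchangeableFrame) -> Dict[str, float]:
    """A consumer that relies only on the shared interface."""
    means = {}
    for name in frame.column_names():
        values = frame.get_column(name)
        means[name] = sum(values) / len(values)
    return means


print(column_means(ListBackedFrame({"x": [1.0, 3.0]})))
print(column_means(RowBackedFrame(["x", "y"], [(1.0, 2.0), (3.0, 4.0)])))
```

The consumer `column_means` never inspects how either frame stores its data, which is the property a data exchange protocol is meant to provide.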
In the future, besides a data exchange protocol, the standard aims to include common operations on dataframes, with the following goals in mind:
- Provide a common API for dataframes so software using dataframes can work with all implementations
- Provide a common API for dataframes to build user interfaces on top of it, for example libraries for interactive use or specific domains and industries
- Help users transition from one dataframe library to another
See the use cases section for details on the exact use cases considered.
Implementation details of the dataframes and of the execution of operations are out of scope. This includes:
- How data is represented and stored (whether the data is in memory, disk, distributed)
- Expectations on when the execution is happening (in an eager or lazy way) (see execution model for some caveats)
- Other execution details
Rationale: The API defined in this document needs to be usable by libraries as diverse as Ibis, Dask, Vaex or cuDF. The data can live in databases, distributed systems, disk or GPU memory. Any decision that involves assumptions about where the data is stored, or where execution happens, could prevent implementations from adopting the standard.
It is out of scope to provide an API designed for interactive use. While interactive use is a key aspect of dataframes, an API designed for interactive use can be built on top of the API defined in this document.
Domain or industry specific APIs are also out of scope, but can benefit from the standard to better interact with the different dataframe implementations.
Rationale: Interactive or domain specific users are key in the Python dataframe ecosystem.
But the number and diversity of users make it infeasible to standardize every dataframe feature
that is currently used. In particular, functionality that is built as syntactic sugar for convenience in
interactive use, or that is heavily overloaded, creates very complex APIs. Examples are the pandas dataframe
constructor, which accepts a huge number of formats, and its __getitem__ (e.g. df[something]),
which is heavily overloaded. Implementations can provide convenient functionality like this
for the users they are targeting, but it is out of scope for the standard, so that the standard stays
simple and easy to adopt.
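To make the point about overloading concrete, the sketch below contrasts a heavily overloaded __getitem__ with one narrowly defined method per operation. Both classes and all of their method names are hypothetical; neither reproduces pandas behaviour or the API defined in this document.

```python
from typing import Dict, List, Sequence, Union


class OverloadedFrame:
    """Hypothetical frame whose __getitem__ accepts several kinds of key."""

    def __init__(self, data: Dict[str, List[float]]) -> None:
        self.data = data

    def __getitem__(self, key: Union[str, List[str], slice]):
        if isinstance(key, str):
            # A single column, returned as a plain list.
            return self.data[key]
        if isinstance(key, list):
            # Several columns, returned as a new frame.
            return OverloadedFrame({k: self.data[k] for k in key})
        if isinstance(key, slice):
            # A range of rows, returned as a new frame.
            return OverloadedFrame({k: v[key] for k, v in self.data.items()})
        raise TypeError(f"unsupported key: {key!r}")


class ExplicitFrame:
    """Hypothetical frame with one method per operation."""

    def __init__(self, data: Dict[str, List[float]]) -> None:
        self.data = data

    def get_column(self, name: str) -> List[float]:
        return self.data[name]

    def select_columns(self, names: Sequence[str]) -> "ExplicitFrame":
        return ExplicitFrame({k: self.data[k] for k in names})

    def slice_rows(self, start: int, stop: int) -> "ExplicitFrame":
        return ExplicitFrame({k: v[start:stop] for k, v in self.data.items()})


raw = {"a": [1.0, 2.0], "b": [3.0, 4.0]}
overloaded = OverloadedFrame(raw)
explicit = ExplicitFrame(raw)
print(overloaded["a"], overloaded[0:1].data)
print(explicit.get_column("a"), explicit.slice_rows(0, 1).data)
```

With the overloaded form, the return type depends on the type of the key, so a reader or a type checker cannot tell what `frame[key]` yields without knowing `key`. The explicit form is more verbose but each method has one signature and one return type, which is easier to specify and to implement consistently.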
The following are explicitly not goals of this document:
- Build an API that is appropriate to all users
- Have a unique dataframe implementation for Python
- Standardize functionalities specific to a domain or industry
This section provides the list of stakeholders considered for the definition of this API.
We encourage dataframe libraries in Python to implement the API defined in this document in their libraries.
The known Python dataframe libraries at the time of writing this document are:
- cuDF
- Dask
- datatable
- dexplo
- Eland
- Grizzly
- Ibis
- Koalas
- Mars
- Modin
- pandas
- polars
- PySpark
- StaticFrame
- Turi Create
- Vaex
Authors of libraries that consume dataframes. They can use the API defined in this document to know how the data contained in a dataframe can be consumed, and which operations are implemented.
A non-exhaustive list of downstream library categories follows:
- Plotting and visualization (e.g. Matplotlib, Bokeh, Altair, Plotly)
- Statistical libraries (e.g. statsmodels)
- Machine learning libraries (e.g. scikit-learn)
Authors of libraries that provide functionality used by dataframes.
A non-exhaustive list of upstream categories follows:
- Data formats, protocols and libraries for data analytics (e.g. Apache Arrow, NumPy)
- Task schedulers (e.g. Dask, Ray, Mars)
- Big data systems (e.g. Spark, Hive, Impala, Presto)
- Libraries for database access (e.g. SQLAlchemy)
This group consists of developers of reusable code that uses dataframes: for example, developers of applications that use dataframes, or authors of libraries that provide specialized dataframe APIs built on top of the standard API.
People using dataframes in an interactive way are considered out of scope. These users include data
analysts, data scientists and other users that are key for dataframes. But this type of user may need
shortcuts, or libraries that make decisions for them to save them time. Examples are automatic type
inference, or heavy use of very compact syntax like Python square brackets / __getitem__.
Standardizing on such practices can be extremely difficult, and it is out of scope.
With the development of a standard API that targets developers writing reusable code, we expect to also serve data analysts and other interactive users, but in an indirect way: by providing a standard API on top of which other libraries can be built, including libraries with the syntactic sugar required for fast analysis of data.
The API specification itself can be found under {ref}`api-specification`.
For guidance on how to read and understand the type annotations included in this specification, consult the Python documentation.
(how-to-adopt-this-api)=
Libraries which implement the Standard are required to provide the following methods:
- __dataframe_consortium_standard__: used for converting a non-compliant dataframe to a compliant one;
- __column_consortium_standard__: used for converting a non-compliant column to a compliant one.
For example, pandas has pandas.DataFrame.__dataframe_consortium_standard__ and pandas.Series.__column_consortium_standard__ as of version 2.1.0.
The signatures should be (note: docstring is optional):
from typing import Any

# Bodies are elided here; only the signatures are shown.
def __dataframe_consortium_standard__(
    self, *, api_version: str
) -> Any:
    ...

def __column_consortium_standard__(
    self, *, api_version: str
) -> Any:
    ...
api_version is a string representing the version of the dataframe API specification
to be returned, in 'YYYY.MM' form, for example, '2023.04'.
If the given version is invalid or not implemented for the given module,
an error should be raised. It is suggested to use the earliest API
version required for maximum compatibility.
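As a minimal sketch of how a library might satisfy this requirement, the toy classes below implement __dataframe_consortium_standard__ with a version check. The class names, the `SUPPORTED_API_VERSIONS` set and the choice of `ValueError` are assumptions made for illustration; the text above only states that an error should be raised.

```python
from typing import Any, Dict, List

# Hypothetical set of specification versions this toy library implements.
SUPPORTED_API_VERSIONS = {"2023.04"}


class CompliantToyFrame:
    """Hypothetical stand-in for a library's standard-compliant dataframe."""

    def __init__(self, data: Dict[str, List[Any]], api_version: str) -> None:
        self.data = data
        self.api_version = api_version


class NativeToyFrame:
    """Hypothetical native (non-compliant) dataframe of a toy library."""

    def __init__(self, data: Dict[str, List[Any]]) -> None:
        self.data = data

    def __dataframe_consortium_standard__(self, *, api_version: str) -> Any:
        # Raise if the requested version is invalid or not implemented.
        # ValueError is an assumption; the error type is not prescribed here.
        if api_version not in SUPPORTED_API_VERSIONS:
            raise ValueError(
                f"unsupported api_version {api_version!r}; "
                f"expected one of {sorted(SUPPORTED_API_VERSIONS)}"
            )
        return CompliantToyFrame(self.data, api_version)


native = NativeToyFrame({"a": [1, 2, 3]})
compliant = native.__dataframe_consortium_standard__(api_version="2023.04")
print(type(compliant).__name__, compliant.api_version)
```

Because api_version is keyword-only, callers must name it explicitly, which keeps call sites unambiguous if further parameters are ever added.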
For some examples, please check https://github.com/data-apis/dataframe-api/tree/main/spec/examples.
Dataframe-consuming libraries are likely to want a mechanism for determining
whether a provided dataframe is specification compliant. The recommended
approach to check for compliance is by checking whether a dataframe object has
a __dataframe_namespace__ attribute, as this is the one distinguishing
feature of a dataframe-compliant object.
Checking for a __dataframe_namespace__ attribute can be implemented as a
small utility function similar to the following.
def is_dataframe_api_obj(x):
    # Compliant objects are recognized by this attribute alone.
    return hasattr(x, '__dataframe_namespace__')
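A consuming library might combine this check with the conversion method described earlier, so that it accepts both compliant and native dataframes. The following is only a sketch: `to_compliant`, the default version string and both toy classes are hypothetical, and the placeholder value given to __dataframe_namespace__ does not reflect what the specification requires of that attribute.

```python
def is_dataframe_api_obj(x):
    return hasattr(x, '__dataframe_namespace__')


def to_compliant(obj, api_version="2023.04"):
    """Return a compliant dataframe, converting the input if needed."""
    if is_dataframe_api_obj(obj):
        # Already compliant: pass it through unchanged.
        return obj
    if hasattr(obj, "__dataframe_consortium_standard__"):
        return obj.__dataframe_consortium_standard__(api_version=api_version)
    raise TypeError(
        f"{type(obj).__name__} does not support the dataframe API standard"
    )


class CompliantFrame:
    """Hypothetical compliant dataframe."""

    # Placeholder only; the real attribute is defined by the specification.
    __dataframe_namespace__ = object()


class NativeFrame:
    """Hypothetical native dataframe that can convert itself."""

    def __dataframe_consortium_standard__(self, *, api_version):
        return CompliantFrame()


already = CompliantFrame()
print(to_compliant(already) is already)
print(type(to_compliant(NativeFrame())).__name__)
```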
It may be useful to have a way to discover all packages in a Python
environment which provide a conforming dataframe API implementation, and the
namespace that that implementation resides in.
To assist dataframe-consuming libraries which need to create dataframes originating
from multiple conforming dataframe implementations, or developers who want to perform
for example cross-library testing, libraries may provide an
{pypa}`entry point <specifications/entry-points/>` in order to make a dataframe API
namespace discoverable.
:::{admonition} Optional feature
Given that entry points typically require build system & package installer specific implementation, this standard chooses to recommend rather than mandate providing an entry point.
:::
The following code is an example of how one can discover installed conforming libraries:
import sys
from importlib.metadata import entry_points

if sys.version_info >= (3, 10):
    # The select interface filters entry points by group and name.
    eps = entry_points(group='dataframe_api', name='package_name')
else:
    # Before Python 3.10, entry_points() returns a dict keyed by group.
    eps = [
        ep for ep in entry_points().get('dataframe_api', [])
        if ep.name == 'package_name'
    ]
# Exactly one matching entry point is expected.
(ep,) = eps
xp = ep.load()
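To discover every installed package that advertises a dataframe API namespace, rather than one known package, a helper along the following lines could be used. The function name `discover_dataframe_entry_points` is hypothetical; the sketch assumes only the standard library and does not load (import) any of the discovered namespaces.

```python
import sys
from importlib.metadata import entry_points


def discover_dataframe_entry_points(group="dataframe_api"):
    """Map each advertised package name to its (unloaded) entry point."""
    if sys.version_info >= (3, 10):
        # The select interface filters entry points by group.
        eps = entry_points(group=group)
    else:
        # Before Python 3.10, entry_points() returns a dict keyed by group.
        eps = entry_points().get(group, [])
    return {ep.name: ep for ep in eps}


found = discover_dataframe_entry_points()
for name, ep in found.items():
    # ep.value is the import path; call ep.load() to import the namespace.
    print(name, "->", ep.value)
```

In an environment with no conforming libraries installed, the returned mapping is simply empty.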
An entry point must have the following properties:
- group: equal to dataframe_api.
- name: equal to the package name.
- object reference: equal to the dataframe API namespace import path.
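The three properties can be pictured by constructing an equivalent entry point object directly with the standard library. How an entry point is actually declared depends on the build system, which is not shown here. The package name `my_dataframe_lib` and the import path `my_dataframe_lib.dataframe_api` are made up for illustration.

```python
from importlib.metadata import EntryPoint

# Hypothetical package and namespace path, used only for illustration.
ep = EntryPoint(
    name="my_dataframe_lib",                 # name: the package name
    value="my_dataframe_lib.dataframe_api",  # object reference: import path
    group="dataframe_api",                   # group: fixed by this standard
)

print(ep.group, ep.name, ep.value)
```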
A conforming implementation of the dataframe API standard must provide and support all the functions, arguments, data types, syntax, and semantics described in this specification.
A conforming implementation of the dataframe API standard may provide additional values, objects, properties, data types, and functions beyond those described in this specification.
Libraries which aim to provide a conforming implementation but haven't yet completed such an implementation may, and are encouraged to, provide details on the level of (non-)conformance. For details on how to do this, see Verification - measuring conformance.