Skip to content

Commit

Permalink
Backport PR #39239: DEPR: raise deprecation warning in numpy ufuncs o…
Browse files Browse the repository at this point in the history
…n DataFrames if not aligned + fallback to <1.2.0 behaviour (#39288)

Co-authored-by: Joris Van den Bossche <jorisvandenbossche@gmail.com>
  • Loading branch information
meeseeksmachine and jorisvandenbossche authored Jan 20, 2021
1 parent b4c0110 commit 69f4f96
Show file tree
Hide file tree
Showing 4 changed files with 292 additions and 15 deletions.
10 changes: 10 additions & 0 deletions doc/source/whatsnew/v1.2.0.rst
Original file line number Diff line number Diff line change
Expand Up @@ -286,6 +286,8 @@ Other enhancements
- Added methods :meth:`IntegerArray.prod`, :meth:`IntegerArray.min`, and :meth:`IntegerArray.max` (:issue:`33790`)
- Calling a NumPy ufunc on a ``DataFrame`` with extension types now preserves the extension types when possible (:issue:`23743`)
- Calling a binary-input NumPy ufunc on multiple ``DataFrame`` objects now aligns, matching the behavior of binary operations and ufuncs on ``Series`` (:issue:`23743`).
This change has been reverted in pandas 1.2.1, and the behaviour to not align DataFrames
is deprecated instead, see the :ref:`the 1.2.1 release notes <whatsnew_121.ufunc_deprecation>`.
- Where possible :meth:`RangeIndex.difference` and :meth:`RangeIndex.symmetric_difference` will return :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`36564`)
- :meth:`DataFrame.to_parquet` now supports :class:`MultiIndex` for columns in parquet format (:issue:`34777`)
- :func:`read_parquet` gained a ``use_nullable_dtypes=True`` option to use nullable dtypes that use ``pd.NA`` as missing value indicator where possible for the resulting DataFrame (default is ``False``, and only applicable for ``engine="pyarrow"``) (:issue:`31242`)
Expand Down Expand Up @@ -536,6 +538,14 @@ Deprecations
- The ``inplace`` parameter of :meth:`Categorical.remove_unused_categories` is deprecated and will be removed in a future version (:issue:`37643`)
- The ``null_counts`` parameter of :meth:`DataFrame.info` is deprecated and replaced by ``show_counts``. It will be removed in a future version (:issue:`37999`)

**Calling NumPy ufuncs on non-aligned DataFrames**

Calling NumPy ufuncs on non-aligned DataFrames changed behaviour in pandas
1.2.0 (to align the inputs before calling the ufunc), but this change is
reverted in pandas 1.2.1. The behaviour to not align is now deprecated instead,
see the :ref:`the 1.2.1 release notes <whatsnew_121.ufunc_deprecation>` for
more details.

.. ---------------------------------------------------------------------------
Expand Down
75 changes: 74 additions & 1 deletion doc/source/whatsnew/v1.2.1.rst
Original file line number Diff line number Diff line change
@@ -1,6 +1,6 @@
.. _whatsnew_121:

What's new in 1.2.1 (January 18, 2021)
What's new in 1.2.1 (January 20, 2021)
--------------------------------------

These are the changes in pandas 1.2.1. See :ref:`release` for a full changelog
Expand Down Expand Up @@ -42,6 +42,79 @@ As a result, bugs reported as fixed in pandas 1.2.0 related to inconsistent tick

.. ---------------------------------------------------------------------------
.. _whatsnew_121.ufunc_deprecation:

Calling NumPy ufuncs on non-aligned DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before pandas 1.2.0, calling a NumPy ufunc on non-aligned DataFrames (or
DataFrame / Series combination) would ignore the indices, only match
the inputs by shape, and use the index/columns of the first DataFrame for
the result:

.. code-block:: python
>>> df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
... df2 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])
>>> df1
a b
0 1 3
1 2 4
>>> df2
a b
1 1 3
2 2 4
>>> np.add(df1, df2)
a b
0 2 6
1 4 8
This contrasts with how other pandas operations work, which first align
the inputs:

.. code-block:: python
>>> df1 + df2
a b
0 NaN NaN
1 3.0 7.0
2 NaN NaN
In pandas 1.2.0, we refactored how NumPy ufuncs are called on DataFrames, and
this started to align the inputs first (:issue:`39184`), as happens in other
pandas operations and as it happens for ufuncs called on Series objects.

For pandas 1.2.1, we restored the previous behaviour to avoid a breaking
change, but the above example of ``np.add(df1, df2)`` with non-aligned inputs
will now to raise a warning, and a future pandas 2.0 release will start
aligning the inputs first (:issue:`39184`). Calling a NumPy ufunc on Series
objects (eg ``np.add(s1, s2)``) already aligns and continues to do so.

To avoid the warning and keep the current behaviour of ignoring the indices,
convert one of the arguments to a NumPy array:

.. code-block:: python
>>> np.add(df1, np.asarray(df2))
a b
0 2 6
1 4 8
To obtain the future behaviour and silence the warning, you can align manually
before passing the arguments to the ufunc:

.. code-block:: python
>>> df1, df2 = df1.align(df2)
>>> np.add(df1, df2)
a b
0 NaN NaN
1 3.0 7.0
2 NaN NaN
.. ---------------------------------------------------------------------------
.. _whatsnew_121.bug_fixes:

Bug fixes
Expand Down
84 changes: 84 additions & 0 deletions pandas/core/arraylike.py
Original file line number Diff line number Diff line change
Expand Up @@ -149,6 +149,85 @@ def __rpow__(self, other):
return self._arith_method(other, roperator.rpow)


# -----------------------------------------------------------------------------
# Helpers to implement __array_ufunc__


def _is_aligned(frame, other):
"""
Helper to check if a DataFrame is aligned with another DataFrame or Series.
"""
from pandas import DataFrame

if isinstance(other, DataFrame):
return frame._indexed_same(other)
else:
# Series -> match index
return frame.columns.equals(other.index)


def _maybe_fallback(ufunc: Callable, method: str, *inputs: Any, **kwargs: Any):
"""
In the future DataFrame, inputs to ufuncs will be aligned before applying
the ufunc, but for now we ignore the index but raise a warning if behaviour
would change in the future.
This helper detects the case where a warning is needed and then fallbacks
to applying the ufunc on arrays to avoid alignment.
See https://github.com/pandas-dev/pandas/pull/39239
"""
from pandas import DataFrame
from pandas.core.generic import NDFrame

n_alignable = sum(isinstance(x, NDFrame) for x in inputs)
n_frames = sum(isinstance(x, DataFrame) for x in inputs)

if n_alignable >= 2 and n_frames >= 1:
# if there are 2 alignable inputs (Series or DataFrame), of which at least 1
# is a DataFrame -> we would have had no alignment before -> warn that this
# will align in the future

# the first frame is what determines the output index/columns in pandas < 1.2
first_frame = next(x for x in inputs if isinstance(x, DataFrame))

# check if the objects are aligned or not
non_aligned = sum(
not _is_aligned(first_frame, x) for x in inputs if isinstance(x, NDFrame)
)

# if at least one is not aligned -> warn and fallback to array behaviour
if non_aligned:
warnings.warn(
"Calling a ufunc on non-aligned DataFrames (or DataFrame/Series "
"combination). Currently, the indices are ignored and the result "
"takes the index/columns of the first DataFrame. In the future , "
"the DataFrames/Series will be aligned before applying the ufunc.\n"
"Convert one of the arguments to a NumPy array "
"(eg 'ufunc(df1, np.asarray(df2)') to keep the current behaviour, "
"or align manually (eg 'df1, df2 = df1.align(df2)') before passing to "
"the ufunc to obtain the future behaviour and silence this warning.",
FutureWarning,
stacklevel=4,
)

# keep the first dataframe of the inputs, other DataFrame/Series is
# converted to array for fallback behaviour
new_inputs = []
for x in inputs:
if x is first_frame:
new_inputs.append(x)
elif isinstance(x, NDFrame):
new_inputs.append(np.asarray(x))
else:
new_inputs.append(x)

# call the ufunc on those transformed inputs
return getattr(ufunc, method)(*new_inputs, **kwargs)

# signal that we didn't fallback / execute the ufunc yet
return NotImplemented


def array_ufunc(self, ufunc: Callable, method: str, *inputs: Any, **kwargs: Any):
"""
Compatibility with numpy ufuncs.
Expand All @@ -162,6 +241,11 @@ def array_ufunc(self, ufunc: Callable, method: str, *inputs: Any, **kwargs: Any)

cls = type(self)

# for backwards compatibility check and potentially fallback for non-aligned frames
result = _maybe_fallback(ufunc, method, *inputs, **kwargs)
if result is not NotImplemented:
return result

# for binary ops, use our custom dunder methods
result = maybe_dispatch_ufunc_to_dunder_op(self, ufunc, method, *inputs, **kwargs)
if result is not NotImplemented:
Expand Down
138 changes: 124 additions & 14 deletions pandas/tests/frame/test_ufunc.py
Original file line number Diff line number Diff line change
@@ -1,6 +1,8 @@
import numpy as np
import pytest

import pandas.util._test_decorators as td

import pandas as pd
import pandas._testing as tm

Expand Down Expand Up @@ -70,12 +72,19 @@ def test_binary_input_aligns_columns(dtype_a, dtype_b):
dtype_b["C"] = dtype_b.pop("B")

df2 = pd.DataFrame({"A": [1, 2], "C": [3, 4]}).astype(dtype_b)
result = np.heaviside(df1, df2)
expected = np.heaviside(
np.array([[1, 3, np.nan], [2, 4, np.nan]]),
np.array([[1, np.nan, 3], [2, np.nan, 4]]),
)
expected = pd.DataFrame(expected, index=[0, 1], columns=["A", "B", "C"])
with tm.assert_produces_warning(FutureWarning):
result = np.heaviside(df1, df2)
# Expected future behaviour:
# expected = np.heaviside(
# np.array([[1, 3, np.nan], [2, 4, np.nan]]),
# np.array([[1, np.nan, 3], [2, np.nan, 4]]),
# )
# expected = pd.DataFrame(expected, index=[0, 1], columns=["A", "B", "C"])
expected = pd.DataFrame([[1.0, 1.0], [1.0, 1.0]], columns=["A", "B"])
tm.assert_frame_equal(result, expected)

# ensure the expected is the same when applying with numpy array
result = np.heaviside(df1, df2.values)
tm.assert_frame_equal(result, expected)


Expand All @@ -85,23 +94,35 @@ def test_binary_input_aligns_index(dtype):
pytest.xfail(reason="Extension / mixed with multiple inputs not implemented.")
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["a", "b"]).astype(dtype)
df2 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["a", "c"]).astype(dtype)
result = np.heaviside(df1, df2)
expected = np.heaviside(
np.array([[1, 3], [3, 4], [np.nan, np.nan]]),
np.array([[1, 3], [np.nan, np.nan], [3, 4]]),
with tm.assert_produces_warning(FutureWarning):
result = np.heaviside(df1, df2)
# Expected future behaviour:
# expected = np.heaviside(
# np.array([[1, 3], [3, 4], [np.nan, np.nan]]),
# np.array([[1, 3], [np.nan, np.nan], [3, 4]]),
# )
# # TODO(FloatArray): this will be Float64Dtype.
# expected = pd.DataFrame(expected, index=["a", "b", "c"], columns=["A", "B"])
expected = pd.DataFrame(
[[1.0, 1.0], [1.0, 1.0]], columns=["A", "B"], index=["a", "b"]
)
# TODO(FloatArray): this will be Float64Dtype.
expected = pd.DataFrame(expected, index=["a", "b", "c"], columns=["A", "B"])
tm.assert_frame_equal(result, expected)

# ensure the expected is the same when applying with numpy array
result = np.heaviside(df1, df2.values)
tm.assert_frame_equal(result, expected)


@pytest.mark.filterwarnings("ignore:Calling a ufunc on non-aligned:FutureWarning")
def test_binary_frame_series_raises():
# We don't currently implement
df = pd.DataFrame({"A": [1, 2]})
with pytest.raises(NotImplementedError, match="logaddexp"):
# with pytest.raises(NotImplementedError, match="logaddexp"):
with pytest.raises(ValueError, match=""):
np.logaddexp(df, df["A"])

with pytest.raises(NotImplementedError, match="logaddexp"):
# with pytest.raises(NotImplementedError, match="logaddexp"):
with pytest.raises(ValueError, match=""):
np.logaddexp(df["A"], df)


Expand Down Expand Up @@ -130,3 +151,92 @@ def test_frame_outer_deprecated():
df = pd.DataFrame({"A": [1, 2]})
with tm.assert_produces_warning(FutureWarning):
np.subtract.outer(df, df)


def test_alignment_deprecation():
# https://github.com/pandas-dev/pandas/issues/39184
df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"b": [1, 2, 3], "c": [4, 5, 6]})
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([1, 2], index=["b", "c"])

# binary dataframe / dataframe
expected = pd.DataFrame({"a": [2, 4, 6], "b": [8, 10, 12]})

with tm.assert_produces_warning(None):
# aligned -> no warning!
result = np.add(df1, df1)
tm.assert_frame_equal(result, expected)

with tm.assert_produces_warning(FutureWarning):
# non-aligned -> warns
result = np.add(df1, df2)
tm.assert_frame_equal(result, expected)

result = np.add(df1, df2.values)
tm.assert_frame_equal(result, expected)

result = np.add(df1.values, df2)
expected = pd.DataFrame({"b": [2, 4, 6], "c": [8, 10, 12]})
tm.assert_frame_equal(result, expected)

# binary dataframe / series
expected = pd.DataFrame({"a": [2, 3, 4], "b": [6, 7, 8]})

with tm.assert_produces_warning(None):
# aligned -> no warning!
result = np.add(df1, s1)
tm.assert_frame_equal(result, expected)

with tm.assert_produces_warning(FutureWarning):
result = np.add(df1, s2)
tm.assert_frame_equal(result, expected)

with tm.assert_produces_warning(FutureWarning):
result = np.add(s2, df1)
tm.assert_frame_equal(result, expected)

result = np.add(df1, s2.values)
tm.assert_frame_equal(result, expected)


@td.skip_if_no("numba", "0.46.0")
def test_alignment_deprecation_many_inputs():
# https://github.com/pandas-dev/pandas/issues/39184
# test that the deprecation also works with > 2 inputs -> using a numba
# written ufunc for this because numpy itself doesn't have such ufuncs
from numba import float64, vectorize

@vectorize([float64(float64, float64, float64)])
def my_ufunc(x, y, z):
return x + y + z

df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"b": [1, 2, 3], "c": [4, 5, 6]})
df3 = pd.DataFrame({"a": [1, 2, 3], "c": [4, 5, 6]})

with tm.assert_produces_warning(FutureWarning):
result = my_ufunc(df1, df2, df3)
expected = pd.DataFrame([[3.0, 12.0], [6.0, 15.0], [9.0, 18.0]], columns=["a", "b"])
tm.assert_frame_equal(result, expected)

# all aligned -> no warning
with tm.assert_produces_warning(None):
result = my_ufunc(df1, df1, df1)
tm.assert_frame_equal(result, expected)

# mixed frame / arrays
with tm.assert_produces_warning(FutureWarning):
result = my_ufunc(df1, df2, df3.values)
tm.assert_frame_equal(result, expected)

# single frame -> no warning
with tm.assert_produces_warning(None):
result = my_ufunc(df1, df2.values, df3.values)
tm.assert_frame_equal(result, expected)

# takes indices of first frame
with tm.assert_produces_warning(FutureWarning):
result = my_ufunc(df1.values, df2, df3)
expected = expected.set_axis(["b", "c"], axis=1)
tm.assert_frame_equal(result, expected)

0 comments on commit 69f4f96

Please sign in to comment.