DEPR: raise deprecation warning in numpy ufuncs on DataFrames if not aligned + fallback to <1.2.0 behaviour #39239

Merged
2 changes: 2 additions & 0 deletions doc/source/whatsnew/v1.2.0.rst
@@ -286,6 +286,8 @@ Other enhancements
- Added methods :meth:`IntegerArray.prod`, :meth:`IntegerArray.min`, and :meth:`IntegerArray.max` (:issue:`33790`)
- Calling a NumPy ufunc on a ``DataFrame`` with extension types now preserves the extension types when possible (:issue:`23743`)
- Calling a binary-input NumPy ufunc on multiple ``DataFrame`` objects now aligns, matching the behavior of binary operations and ufuncs on ``Series`` (:issue:`23743`).
  This change has been reverted in pandas 1.2.1, and the behaviour of not aligning DataFrames
  is deprecated instead, see :ref:`the 1.2.1 release notes <whatsnew_121.ufunc_deprecation>`.
- Where possible :meth:`RangeIndex.difference` and :meth:`RangeIndex.symmetric_difference` will return :class:`RangeIndex` instead of :class:`Int64Index` (:issue:`36564`)
- :meth:`DataFrame.to_parquet` now supports :class:`MultiIndex` for columns in parquet format (:issue:`34777`)
- :func:`read_parquet` gained a ``use_nullable_dtypes=True`` option to use nullable dtypes that use ``pd.NA`` as missing value indicator where possible for the resulting DataFrame (default is ``False``, and only applicable for ``engine="pyarrow"``) (:issue:`31242`)
73 changes: 73 additions & 0 deletions doc/source/whatsnew/v1.2.1.rst
@@ -41,6 +41,79 @@ As a result, bugs reported as fixed in pandas 1.2.0 related to inconsistent tick

.. ---------------------------------------------------------------------------

.. _whatsnew_121.ufunc_deprecation:

Calling NumPy ufuncs on non-aligned DataFrames
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Before pandas 1.2.0, calling a NumPy ufunc on non-aligned DataFrames (or
DataFrame / Series combination) would ignore the indices, only match
the inputs by shape, and use the index/columns of the first DataFrame for
the result:

.. code-block:: python

    >>> df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
Contributor

this is an incorrect format

    >>> df2 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])
    >>> df1
       a  b
    0  1  3
    1  2  4
    >>> df2
       a  b
    1  1  3
    2  2  4

    >>> np.add(df1, df2)
       a  b
    0  2  6
    1  4  8

This contrasts with how other pandas operations work, which first align
the inputs:

.. code-block:: python

    >>> df1 + df2
Contributor

make an actual ipython block

Member Author

I need to use some plain code-blocks since part of the example is showing old behaviour (or behaviour that will change in the future), and so prefer to use then code-blocks for all examples, for consistency within this section

Contributor

we use ipython blocks everywhere, pls do this

Contributor

would like to change these to be consistent

         a    b
    0  NaN  NaN
    1  3.0  7.0
    2  NaN  NaN

In pandas 1.2.0, we refactored how NumPy ufuncs are called on DataFrames, and
this started to align the inputs first (:issue:`39184`), as happens in other
pandas operations and as it happens for ufuncs called on Series objects.

For pandas 1.2.1, we restored the previous behaviour to avoid a breaking
change, but the above example of ``np.add(df1, df2)`` with non-aligned inputs
will now raise a warning, and a future pandas 2.0 release will start
aligning the inputs first (:issue:`39184`). Calling a NumPy ufunc on Series
objects (eg ``np.add(s1, s2)``) already aligns and continues to do so.
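
As a small illustration of the Series behaviour mentioned above, the following sketch calls ``np.add`` on two Series whose indexes only partly overlap. The names ``s1`` and ``s2`` are illustrative and the values mirror those used in the tests of this PR.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([1, 2], index=["b", "c"])

# Series inputs are aligned on their index labels before the ufunc runs,
# so only the shared label "b" has a value on both sides.
result = np.add(s1, s2)
print(result)
```

Labels present in only one input come out as ``NaN``, which is the same outcome ``s1 + s2`` gives.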

To avoid the warning and keep the current behaviour of ignoring the indices,
convert one of the arguments to a NumPy array:

.. code-block:: python

    >>> np.add(df1, np.asarray(df2))
Contributor

use an actual ipython format

       a  b
    0  2  6
    1  4  8
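
A self-contained sketch of this workaround is given here. It escalates every ``FutureWarning`` to an error inside the ``with`` block, which is one way (an assumption of this sketch, not something the PR prescribes) to confirm that the array-based call does not trigger the deprecation.

```python
import warnings

import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])

with warnings.catch_warnings():
    # Any FutureWarning raised inside this block would become an exception.
    warnings.simplefilter("error", FutureWarning)
    # df2 is passed as a plain array, so values are matched by position
    # and the result keeps the labels of df1.
    positional = np.add(df1, np.asarray(df2))

print(positional)
```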

To obtain the future behaviour and silence the warning, you can align manually
before passing the arguments to the ufunc:

.. code-block:: python
Contributor

pls do not use code-blocks except to show older code. these are so error prone


    >>> df1, df2 = df1.align(df2)
    >>> np.add(df1, df2)
         a    b
    0  NaN  NaN
    1  3.0  7.0
    2  NaN  NaN
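
The manual alignment can also be written without overwriting the original frames. The sketch below does that and, as an optional variation not discussed in the release note, uses ``join="inner"`` to keep only the labels shared by both inputs; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[0, 1])
df2 = pd.DataFrame({"a": [1, 2], "b": [3, 4]}, index=[1, 2])

# Default outer join: both frames get the union of the labels,
# with NaN where a label was missing.
left, right = df1.align(df2)
aligned_sum = np.add(left, right)

# Inner join: keep only labels present in both frames, so no NaN is introduced.
inner_left, inner_right = df1.align(df2, join="inner")
inner_sum = np.add(inner_left, inner_right)

print(aligned_sum)
print(inner_sum)
```

With the inner join the result has a single row (label ``1``) and no missing values are introduced.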

.. ---------------------------------------------------------------------------

.. _whatsnew_121.bug_fixes:

Bug fixes
84 changes: 84 additions & 0 deletions pandas/core/arraylike.py
@@ -149,6 +149,85 @@ def __rpow__(self, other):
        return self._arith_method(other, roperator.rpow)


# -----------------------------------------------------------------------------
# Helpers to implement __array_ufunc__


def _is_aligned(frame, other):
    """
    Helper to check if a DataFrame is aligned with another DataFrame or Series.
    """
    from pandas.core.frame import DataFrame
Contributor

might as well just import from pandas here, this is only the import if you can import at the top of the file (not sure if you can), also maybe can use ABCDataFrame

Member Author

pandas.core.frame.py import from this file, so I don't think I can move the import to the top of the file

Contributor

i get that you cannot put the import at the top. However when inside the function the style is to
from pandas import DataFrame

Member Author

OK, changed the imports


    if isinstance(other, DataFrame):
        return frame._indexed_same(other)
    else:
        # Series -> match index
        return frame.columns.equals(other.index)
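
For readers who want to run the same check from user code, here is a minimal sketch built only on public API. The function name ``is_aligned`` is hypothetical, and the sketch assumes that the private ``_indexed_same`` compares both the index and the columns of the two frames.

```python
import pandas as pd


def is_aligned(frame: pd.DataFrame, other) -> bool:
    """Hypothetical public-API counterpart of the private helper above."""
    if isinstance(other, pd.DataFrame):
        # Assumption: both axes must hold equal labels in the same order.
        return frame.index.equals(other.index) and frame.columns.equals(
            other.columns
        )
    # Series: its index is compared against the columns of the frame.
    return frame.columns.equals(other.index)


df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"b": [1, 2, 3], "c": [4, 5, 6]})
s1 = pd.Series([1, 2], index=["a", "b"])
s2 = pd.Series([1, 2], index=["b", "c"])

print(is_aligned(df1, df1), is_aligned(df1, df2))
print(is_aligned(df1, s1), is_aligned(df1, s2))
```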


def _maybe_fallback(ufunc: Callable, method: str, *inputs: Any, **kwargs: Any):
    """
    In the future, DataFrame inputs to ufuncs will be aligned before applying
    the ufunc, but for now we ignore the index and raise a warning if behaviour
    would change in the future.
    This helper detects the case where a warning is needed and then falls back
    to applying the ufunc on arrays to avoid alignment.

    See https://github.com/pandas-dev/pandas/pull/39239
    """
    from pandas.core.frame import DataFrame
    from pandas.core.generic import NDFrame

    is_ndframe = [isinstance(x, NDFrame) for x in inputs]
Contributor

why would you do this? simply check is_series. this is amazingly confusing.

Member Author

What is is_series ?

Contributor

we have dataframes and series

Member Author

Yes, and NDFrame is the parent class for both? Do you want me to put isinstance(x, (Series, DataFrame)) instead of isinstance(x, NDFrame) ?

Contributor

yes i think its more clear

Member Author

Note that below in this array_ufunc function, we are also using NDFrame for this purpose

Contributor

so rename this to is_series_or_frame i think is more clear

Member Author

I renamed it now to n_alignable, because alignable is the variable name that is already used below, for consistency. And it also matches the explanation in the comment (which says this is Series or DataFrame).
(but can also rename to n_series_or_frame if you prefer)

    is_frame = [isinstance(x, DataFrame) for x in inputs]

    if (sum(is_ndframe) >= 2) and (sum(is_frame) >= 1):
Contributor

this condition is impossible to reason about. pls make it simpler. you just want to know if you have 2 or more dataframes right? (or series)? if so, just say that

Member Author

No, I want to know if at least two alignable objects (DataFrame or Series) and at least one DataFrame, which is what the above line does, and which is what is explained on the line just below. I can try to clarify that comment if something is not clear about that?

Contributor

try to simplify.

Member Author

Sorry, Jeff, if you don't give me a clue about what exactly is unclear for you or about how you would do it differently, I have no idea how to improve this. The code reflects exactly what I just explained it needs checking, and it is explained in the line below as well.

Would eg change sum(is_frame) into a variable n_frames help? (and moving the sum to the list comprehension where now is_frame is defined)

Contributor

well, the problem that this is getting so complicated that you need to comment. I honestly don't think this is worth doing this much change at this late hour.

if you want to do for 1.2.2 or better yet 1.3.ok

waiting for the nth change is extremely painful and disruptive.

these are supposed to be lightweight backports. this is turning in to a nightmare.

this is likely going to be extremely fragile and break again. and will then have to be patched again.

Member Author

Waiting for 1.2.2 or 1.3 is not going to make this change any simpler, if you don't help me find out what you don't like about it

waiting for the nth change is extremely painful and disruptive.

What is this about?

these are supposed to be lightweight backports. this is turning in to a nightmare.

The changes in this PR is a rather clean additional check in the array_ufunc function, to use a different code path in certain cases. It almost doesn't touch any existing code, so I would say it is a clean patch to backport.

        # if there are 2 alignable inputs (NDFrames), of which at least 1 is a
        # DataFrame -> we would have had no alignment before -> warn that this
        # will align in the future

        # the first frame is what determines the output index/columns in pandas < 1.2
        first_frame = next(x for x in inputs if isinstance(x, DataFrame))

        # check if the objects are aligned or not
        non_aligned = sum(
            not _is_aligned(first_frame, x) for x in inputs if isinstance(x, NDFrame)
        )

        # if at least one is not aligned -> warn and fallback to array behaviour
        if non_aligned:
            warnings.warn(
                "Calling a ufunc on non-aligned DataFrames (or DataFrame/Series "
                "combination). Currently, the indices are ignored and the result "
                "takes the index/columns of the first DataFrame. In the future "
                "(pandas 2.0), the DataFrames/Series will be aligned before "
Contributor

don't need to mention the version

Contributor

would not mention here

                "applying the ufunc.\nConvert one of the arguments to a NumPy array "
                "(eg 'ufunc(df1, np.asarray(df2))') to keep the current behaviour, "
                "or align manually (eg 'df1, df2 = df1.align(df2)') before passing to "
                "the ufunc to obtain the future behaviour and silence this warning.",
                FutureWarning,
                stacklevel=4,
            )

            # keep the first dataframe of the inputs, other DataFrame/Series is
            # converted to array for fallback behaviour
            new_inputs = []
            for x in inputs:
                if x is first_frame:
                    new_inputs.append(x)
                elif isinstance(x, NDFrame):
                    new_inputs.append(np.asarray(x))
                else:
                    new_inputs.append(x)

            # call the ufunc on those transformed inputs
            return getattr(ufunc, method)(*new_inputs, **kwargs)

    # signal that we didn't fallback / execute the ufunc yet
    return NotImplemented
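
The input transformation used by the fallback can be tried in isolation. The sketch below is a simplified stand-in rather than the pandas implementation: the helper name ``keep_first_frame_labels`` is hypothetical, it checks for ``pd.DataFrame`` / ``pd.Series`` instead of ``NDFrame``, and it leaves out the alignment check and the warning.

```python
import numpy as np
import pandas as pd


def keep_first_frame_labels(inputs):
    """Hypothetical helper: keep the first DataFrame, strip labels from the rest."""
    first_frame = next(x for x in inputs if isinstance(x, pd.DataFrame))
    new_inputs = []
    for x in inputs:
        if x is first_frame or not isinstance(x, (pd.DataFrame, pd.Series)):
            # The first frame and any non-pandas input pass through unchanged.
            new_inputs.append(x)
        else:
            # Other DataFrame/Series inputs are reduced to plain arrays.
            new_inputs.append(np.asarray(x))
    return new_inputs


df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
df2 = pd.DataFrame({"b": [1, 2, 3], "c": [4, 5, 6]})

new_inputs = keep_first_frame_labels([df1, df2])
result = np.add(*new_inputs)
print(result)
```

Because only ``df1`` keeps its labels, the values are combined by position and the result carries the columns ``a`` and ``b``, matching the expectation in ``test_alignment_deprecation``.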


def array_ufunc(self, ufunc: Callable, method: str, *inputs: Any, **kwargs: Any):
    """
    Compatibility with numpy ufuncs.
@@ -162,6 +241,11 @@ def array_ufunc(self, ufunc: Callable, method: str, *inputs: Any, **kwargs: Any)

    cls = type(self)

    # for backwards compatibility check and potentially fallback for non-aligned frames
    result = _maybe_fallback(ufunc, method, *inputs, **kwargs)
    if result is not NotImplemented:
        return result

    # for binary ops, use our custom dunder methods
    result = maybe_dispatch_ufunc_to_dunder_op(self, ufunc, method, *inputs, **kwargs)
    if result is not NotImplemented:
138 changes: 124 additions & 14 deletions pandas/tests/frame/test_ufunc.py
@@ -1,6 +1,8 @@
import numpy as np
import pytest

import pandas.util._test_decorators as td

import pandas as pd
import pandas._testing as tm

@@ -78,12 +80,19 @@ def test_binary_input_aligns_columns(request, dtype_a, dtype_b):
dtype_b["C"] = dtype_b.pop("B")

df2 = pd.DataFrame({"A": [1, 2], "C": [3, 4]}).astype(dtype_b)
result = np.heaviside(df1, df2)
expected = np.heaviside(
np.array([[1, 3, np.nan], [2, 4, np.nan]]),
np.array([[1, np.nan, 3], [2, np.nan, 4]]),
)
expected = pd.DataFrame(expected, index=[0, 1], columns=["A", "B", "C"])
with tm.assert_produces_warning(FutureWarning):
result = np.heaviside(df1, df2)
# Expected future behaviour:
# expected = np.heaviside(
# np.array([[1, 3, np.nan], [2, 4, np.nan]]),
# np.array([[1, np.nan, 3], [2, np.nan, 4]]),
# )
# expected = pd.DataFrame(expected, index=[0, 1], columns=["A", "B", "C"])
expected = pd.DataFrame([[1.0, 1.0], [1.0, 1.0]], columns=["A", "B"])
tm.assert_frame_equal(result, expected)

# ensure the expected is the same when applying with numpy array
result = np.heaviside(df1, df2.values)
tm.assert_frame_equal(result, expected)


@@ -97,27 +106,128 @@ def test_binary_input_aligns_index(request, dtype):
)
df1 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["a", "b"]).astype(dtype)
df2 = pd.DataFrame({"A": [1, 2], "B": [3, 4]}, index=["a", "c"]).astype(dtype)
result = np.heaviside(df1, df2)
expected = np.heaviside(
np.array([[1, 3], [3, 4], [np.nan, np.nan]]),
np.array([[1, 3], [np.nan, np.nan], [3, 4]]),
with tm.assert_produces_warning(FutureWarning):
result = np.heaviside(df1, df2)
# Expected future behaviour:
# expected = np.heaviside(
# np.array([[1, 3], [3, 4], [np.nan, np.nan]]),
# np.array([[1, 3], [np.nan, np.nan], [3, 4]]),
# )
# # TODO(FloatArray): this will be Float64Dtype.
# expected = pd.DataFrame(expected, index=["a", "b", "c"], columns=["A", "B"])
expected = pd.DataFrame(
[[1.0, 1.0], [1.0, 1.0]], columns=["A", "B"], index=["a", "b"]
)
# TODO(FloatArray): this will be Float64Dtype.
expected = pd.DataFrame(expected, index=["a", "b", "c"], columns=["A", "B"])
tm.assert_frame_equal(result, expected)

# ensure the expected is the same when applying with numpy array
result = np.heaviside(df1, df2.values)
tm.assert_frame_equal(result, expected)


@pytest.mark.filterwarnings("ignore:Calling a ufunc on non-aligned:FutureWarning")
def test_binary_frame_series_raises():
# We don't currently implement
df = pd.DataFrame({"A": [1, 2]})
with pytest.raises(NotImplementedError, match="logaddexp"):
# with pytest.raises(NotImplementedError, match="logaddexp"):
with pytest.raises(ValueError, match=""):
np.logaddexp(df, df["A"])

with pytest.raises(NotImplementedError, match="logaddexp"):
# with pytest.raises(NotImplementedError, match="logaddexp"):
with pytest.raises(ValueError, match=""):
np.logaddexp(df["A"], df)


def test_frame_outer_deprecated():
    df = pd.DataFrame({"A": [1, 2]})
    with tm.assert_produces_warning(FutureWarning):
        np.subtract.outer(df, df)


def test_alignment_deprecation():
    # https://github.com/pandas-dev/pandas/issues/39184
    df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df2 = pd.DataFrame({"b": [1, 2, 3], "c": [4, 5, 6]})
    s1 = pd.Series([1, 2], index=["a", "b"])
    s2 = pd.Series([1, 2], index=["b", "c"])

    # binary dataframe / dataframe
    expected = pd.DataFrame({"a": [2, 4, 6], "b": [8, 10, 12]})

    with tm.assert_produces_warning(None):
        # aligned -> no warning!
        result = np.add(df1, df1)
    tm.assert_frame_equal(result, expected)

    with tm.assert_produces_warning(FutureWarning):
        # non-aligned -> warns
        result = np.add(df1, df2)
    tm.assert_frame_equal(result, expected)

    result = np.add(df1, df2.values)
    tm.assert_frame_equal(result, expected)

    result = np.add(df1.values, df2)
    expected = pd.DataFrame({"b": [2, 4, 6], "c": [8, 10, 12]})
    tm.assert_frame_equal(result, expected)

    # binary dataframe / series
    expected = pd.DataFrame({"a": [2, 3, 4], "b": [6, 7, 8]})

    with tm.assert_produces_warning(None):
        # aligned -> no warning!
        result = np.add(df1, s1)
    tm.assert_frame_equal(result, expected)

    with tm.assert_produces_warning(FutureWarning):
        result = np.add(df1, s2)
    tm.assert_frame_equal(result, expected)

    with tm.assert_produces_warning(FutureWarning):
        result = np.add(s2, df1)
    tm.assert_frame_equal(result, expected)

    result = np.add(df1, s2.values)
    tm.assert_frame_equal(result, expected)


@td.skip_if_no("numba", "0.46.0")
def test_alignment_deprecation_many_inputs():
    # https://github.com/pandas-dev/pandas/issues/39184
    # test that the deprecation also works with > 2 inputs -> using a numba
    # written ufunc for this because numpy itself doesn't have such ufuncs
    from numba import float64, vectorize

    @vectorize([float64(float64, float64, float64)])
    def my_ufunc(x, y, z):
        return x + y + z

    df1 = pd.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
    df2 = pd.DataFrame({"b": [1, 2, 3], "c": [4, 5, 6]})
    df3 = pd.DataFrame({"a": [1, 2, 3], "c": [4, 5, 6]})

    with tm.assert_produces_warning(FutureWarning):
        result = my_ufunc(df1, df2, df3)
    expected = pd.DataFrame([[3.0, 12.0], [6.0, 15.0], [9.0, 18.0]], columns=["a", "b"])
    tm.assert_frame_equal(result, expected)

    # all aligned -> no warning
    with tm.assert_produces_warning(None):
        result = my_ufunc(df1, df1, df1)
    tm.assert_frame_equal(result, expected)

    # mixed frame / arrays
    with tm.assert_produces_warning(FutureWarning):
        result = my_ufunc(df1, df2, df3.values)
    tm.assert_frame_equal(result, expected)

    # single frame -> no warning
    with tm.assert_produces_warning(None):
        result = my_ufunc(df1, df2.values, df3.values)
    tm.assert_frame_equal(result, expected)

    # takes indices of first frame
    with tm.assert_produces_warning(FutureWarning):
        result = my_ufunc(df1.values, df2, df3)
    expected = expected.set_axis(["b", "c"], axis=1)
    tm.assert_frame_equal(result, expected)
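
Outside the pandas test suite, downstream code can look for affected calls with the standard ``warnings`` module alone. The sketch below is one possible approach, not part of this PR: the context manager name ``error_on_non_aligned_ufunc`` is hypothetical, and the message prefix is the one used by the ``filterwarnings`` mark in the tests above.

```python
import warnings
from contextlib import contextmanager

# Prefix of the deprecation message, as matched by the filterwarnings mark above.
MESSAGE_PREFIX = "Calling a ufunc on non-aligned"


@contextmanager
def error_on_non_aligned_ufunc():
    """Hypothetical helper: turn the alignment FutureWarning into an exception."""
    with warnings.catch_warnings():
        # The message argument is a regex matched against the start of the text.
        warnings.filterwarnings(
            "error", message=MESSAGE_PREFIX, category=FutureWarning
        )
        yield


# Usage sketch (behaviour depends on the installed pandas version):
# with error_on_non_aligned_ufunc():
#     np.add(df1, df2)  # raises FutureWarning where the deprecation is active
```

Other ``FutureWarning`` messages are unaffected, because the filter only matches the given prefix.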