Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame with Int64 columns casts to float64 with .max()/.min() #34210

Conversation

simonjayhawkins
Copy link
Member

@simonjayhawkins simonjayhawkins added Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays labels May 16, 2020
@simonjayhawkins
Copy link
Member Author

It looks like #25777 needs to be fixed first, so closing to clear queue.

@simonjayhawkins simonjayhawkins marked this pull request as draft May 17, 2020 11:28
if numeric_only is not None and axis in [0, 1]:
if (
self._is_homogeneous_type and self._mgr.any_extension_types and axis == 0
) or (numeric_only is not None and axis in [0, 1]):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the axis in [0, 1] is superfluous

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_any_all_np_func passes axis=None for reduction along both axes.

@@ -8419,7 +8419,9 @@ def _get_data(axis_matters):
raise NotImplementedError(msg)
return data

if numeric_only is not None and axis in [0, 1]:
if (
self._is_homogeneous_type and self._mgr.any_extension_types and axis == 0
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we really need all of these conditions?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no I don't think so, as this is hiding some bugs (or inconsistencies)

>>> import pandas as pd
>>>
>>> pd.__version__
'1.1.0.dev0+1600.gdede2c7a1e'
>>>
>>> df = pd.DataFrame({"A": pd.Series([0, 1], dtype="category")})
>>> df
   A
0  0
1  1
>>>
>>> df.all()
A    False
dtype: bool
>>>
>>> df.A.all()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\simon\pandas\pandas\core\generic.py", line 11394, in logical_func
    return self._reduce(
  File "C:\Users\simon\pandas\pandas\core\series.py", line 4028, in _reduce
    return delegate._reduce(name, skipna=skipna, **kwds)
  File "C:\Users\simon\pandas\pandas\core\arrays\categorical.py", line 2076, in _reduce
    raise TypeError(f"Categorical cannot perform the operation {name}")
TypeError: Categorical cannot perform the operation all
>>>

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will need to fix #21020 first

@simonjayhawkins
Copy link
Member Author

@jorisvandenbossche @jreback This PR in the current state would close #32651

BUT the issue would persist for DataFrames with non-numeric columns.

I see the benefit of the changes here is consistency of the code path taken regardless of numeric_only parameter for numeric only DataFrames. I can't find an issue for this, but could potentially concoct one.

There is overlap here with #32867 and #32867 would likely close #32651 anyway.

I'll leave this as draft for now, or could close to clear queue if preferred.

in the meantime, I assume investigating #21020 would also benefit #32867.

There's also an issue with PeriodArray raising TypeError: mean is not implemented for PeriodArray since the meaning is ambiguous when dispatching to the EA which is currently ignored with DataFrame.mean containing Period column that I could also look into.

>>> df = pd.DataFrame(
...     {
...         "A": np.arange(3),
...         "B": pd.date_range("2016-01-01", periods=3),
...         "C": pd.timedelta_range("1D", periods=3),
...         "D": pd.period_range("2016", periods=3, freq="A"),
...     }
... )
>>> df.mean()
<stdin>:1: FutureWarning: DataFrame.mean and DataFrame.median with numeric_only=None will include datetime64 and datetime64tz columns in a futu
re version.
A                  1
C    2 days 00:00:00
dtype: object
>>>

I guess we want to maintain the same behaviour. or is inconsistent behaviour not desirable.

>>> df = pd.DataFrame(
...     {
...         "D": pd.period_range("2016", periods=3, freq="A"),
...     }
... )
>>>
>>> df.mean()
Series([], dtype: float64)
>>>
>>> df.D.mean()
Traceback (most recent call last):
...
TypeError: mean is not implemented for PeriodArray since the meaning is ambiguous.  An alternative is obj.to_timestamp(how='start').mean()
>>>

@jorisvandenbossche
Copy link
Member

I just wanted to comment that this overlaps with #32867, as you also just mentioned.

Regarding mean of periods, see also #33758

Copy link
Contributor

@jreback jreback left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

looks fine, just some questions.

@@ -8441,6 +8443,10 @@ def blk_func(values):
assert isinstance(res, dict)
if len(res):
assert len(res) == max(list(res.keys())) + 1, res.keys()
elif not out_dtype:
# The default dtype for empty Series will be 'object' instead of
# 'float64' in a future version.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have tests which currently hit this case?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this was added because of test failures in tests/frame/test_analytics.py (maybe others)

========================================================================= short test summary info ==========================================================================
FAILED pandas/tests/frame/test_analytics.py::TestDataFrameAnalytics::test_stat_op_calc - AssertionError: Caused unexpected warning(s): [('DeprecationWarning', Deprecation...
FAILED pandas/tests/frame/test_analytics.py::TestDataFrameAnalytics::test_median - AssertionError: Caused unexpected warning(s): [('DeprecationWarning', DeprecationWarnin...
FAILED pandas/tests/frame/test_analytics.py::TestDataFrameAnalytics::test_mean_excludes_datetimes[None] - AssertionError: Caused unexpected warning(s): [('DeprecationWarn...
FAILED pandas/tests/frame/test_analytics.py::TestDataFrameAnalytics::test_mean_excludes_datetimes[UTC] - AssertionError: Caused unexpected warning(s): [('DeprecationWarni...

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do we have tests which currently hit this case?

so I guess on master, no tests hit this case

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

these cases were for empty DataFrames (after removing timestamp columns). using all for is_numeric was returning True for empty iterable (_mgr.blocks). changing is_numeric and this elif not out_dtype code block doesn't need to be added.

@@ -8419,7 +8419,9 @@ def _get_data(axis_matters):
raise NotImplementedError(msg)
return data

if numeric_only is not None and axis in [0, 1]:
is_numeric = all(b.is_numeric for b in self._mgr.blocks)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do the added tests hit this sufficiently (e.g. some blocks numeric some blocks not)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, but that is another issue #34520

@jreback
Copy link
Contributor

jreback commented Jun 14, 2020

can you merge master and update to comments

@jorisvandenbossche
Copy link
Member

I think we should rather fix this in #32867 (the test could still be useful though, so that could be merged here as an xfail, or I can move it to that PR)

@simonjayhawkins
Copy link
Member Author

I think we should rather fix this in #32867 (the test could still be useful though, so that could be merged here as an xfail, or I can move it to that PR)

closing in favour of #32867

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Bug NA - MaskedArrays Related to pd.NA and nullable extension arrays
Projects
None yet
Development

Successfully merging this pull request may close these issues.

DataFrame with Int64 columns casts to float64 with .max()/.min()
3 participants