Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: Series.argmax() fails with np.inf #16449

Merged
merged 1 commit into from
Aug 15, 2017
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
3 changes: 2 additions & 1 deletion doc/source/whatsnew/v0.21.0.txt
Original file line number Diff line number Diff line change
Expand Up @@ -265,7 +265,7 @@ Other API Changes
- Removed the ``@slow`` decorator from ``pandas.util.testing``, which caused issues for some downstream packages' test suites. Use ``@pytest.mark.slow`` instead, which achieves the same thing (:issue:`16850`)
- Moved definition of ``MergeError`` to the ``pandas.errors`` module.
- The signature of :func:`Series.set_axis` and :func:`DataFrame.set_axis` has been changed from ``set_axis(axis, labels)`` to ``set_axis(labels, axis=0)``, for consistency with the rest of the API. The old signature is deprecated and will show a ``FutureWarning`` (:issue:`14636`)

- :func:`Series.argmin` and :func:`Series.argmax` will now raise a ``TypeError`` when used with ``object`` dtypes, instead of a ``ValueError`` (:issue:`13595`)

.. _whatsnew_0210.deprecations:

Expand Down Expand Up @@ -369,6 +369,7 @@ Reshaping
- Fixes regression from 0.20, :func:`Series.aggregate` and :func:`DataFrame.aggregate` allow dictionaries as return values again (:issue:`16741`)
- Fixes dtype of result with integer dtype input, from :func:`pivot_table` when called with ``margins=True`` (:issue:`17013`)
- Bug in :func:`crosstab` where passing two ``Series`` with the same name raised a ``KeyError`` (:issue:`13279`)
- :func:`Series.argmin`, :func:`Series.argmax`, and their counterparts on ``DataFrame`` and groupby objects work correctly with floating point data that contains infinite values (:issue:`13595`).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

actually we need to have a mini-section in api breaking changes that min/max now don't work on object dtypes.

and if you can add a couple of tests (does any exiting ones break because of the disallow('O') changes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay; I wrote something for the breaking API changes section. Ultimately, though, the difference is just that argmax used to throw ValueError and now throws TypeError with string data; is that an appropriate change for the breaking section?

I added a couple of other tests for Series with object dtype. I haven't seen any existing tests break as a result of this change.


Numeric
^^^^^^^
Expand Down
8 changes: 4 additions & 4 deletions pandas/core/nanops.py
Original file line number Diff line number Diff line change
Expand Up @@ -486,23 +486,23 @@ def reduction(values, axis=None, skipna=True):
nanmax = _nanminmax('max', fill_value_typ='-inf')


@disallow('O')
def nanargmax(values, axis=None, skipna=True):
"""
Returns -1 in the NA case
"""
values, mask, dtype, _ = _get_values(values, skipna, fill_value_typ='-inf',
isfinite=True)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would actually be ok with NOT allowing this for strings at all. IOW you can use the @disallow decorator. It is a slight break with numpy, but it doesn't work now anyhow. I am not sure why allowing this is a good thing when it has different py2/3 behavior (and it doesn't work on mixed things).

Copy link
Contributor Author

@DGrady DGrady Jun 22, 2017

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not sure what you mean by “it doesn't work now anyhow”? I think the build failures were just a linting error that I missed locally; trying again.

I think the argument for allowing it with strings is consistency: right now, the Pandas version of argmax etc is inconsistent with base Python, with NumPy, and with other Pandas functions. For example,

data = ['spam', 'eggs']
data_a = np.array(data, dtype='object')
data_s = pd.Series(data)

max(data) # -> 'spam'
data_a.max() # -> 'spam'
data_s.max() # -> 'spam'

# So far so good

data.index(max(data)) # -> 0
data_a.argmax() # -> 0
data_s.argmax() #-> ValueError: could not convert string to float: 'eggs'

This is confusing; we expect data_s.argmax() to return 0 just like everything else. After this PR, it does return 0, as expected.

It's true that if we change the example above to use mixed data types, like data2 = ['spam', 'eggs', None], then the behavior is different under Python 2 and Python 3, but I think that even in this case this PR makes the Pandas behavior more consistent with the prevailing behavior of whatever version of Python you're using. For example,

data2 = ['spam', 'eggs', None]
data2_a = np.array(data2, dtype='object')
data2_s = pd.Series(data2)

max(data2) # -> 'spam' under Py2, TypeError under Py3
data2_a.max() # -> 'spam' under Py2, TypeError under Py3
data2_s.max() # -> 'spam' under Py2, TypeError under Py3

data2.index(max(data2)) # -> 0 under Py2, TypeError under Py3
data2_a.argmax() # -> 0 under Py2, TypeError under Py3

Now, with the current release of Pandas, data2_s.argmax() will throw a ValueError under Py2 and Py3 with a message about converting strings to floats. After this PR, data2_s.argmax() will return 0 under Py2, as we would expect, and will throw a TypeError with a message about incomparable types under Py3.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would be ok with allowing this, but it HAS to work with missing values. Everything else does and no reason for this not too. you simply mask, apply function (e.g. argmax), then replace the missing by -1

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it better to put in the fix for inf, then in a followup fix the strings.

values, mask, dtype, _ = _get_values(values, skipna, fill_value_typ='-inf')
result = values.argmax(axis)
result = _maybe_arg_null_out(result, axis, mask, skipna)
return result


@disallow('O')
def nanargmin(values, axis=None, skipna=True):
"""
Returns -1 in the NA case
"""
values, mask, dtype, _ = _get_values(values, skipna, fill_value_typ='+inf',
isfinite=True)
values, mask, dtype, _ = _get_values(values, skipna, fill_value_typ='+inf')
result = values.argmin(axis)
result = _maybe_arg_null_out(result, axis, mask, skipna)
return result
Expand Down
2 changes: 1 addition & 1 deletion pandas/tests/groupby/test_groupby.py
Original file line number Diff line number Diff line change
Expand Up @@ -2339,7 +2339,7 @@ def test_non_cython_api(self):
assert_frame_equal(result, expected)

# idxmax
expected = DataFrame([[0], [nan]], columns=['B'], index=[1, 3])
expected = DataFrame([[0.0], [nan]], columns=['B'], index=[1, 3])
expected.index.name = 'A'
result = g.idxmax()
assert_frame_equal(result, expected)
Expand Down
47 changes: 47 additions & 0 deletions pandas/tests/series/test_operators.py
Original file line number Diff line number Diff line change
Expand Up @@ -1857,3 +1857,50 @@ def test_op_duplicate_index(self):
result = s1 + s2
expected = pd.Series([11, 12, np.nan], index=[1, 1, 2])
assert_series_equal(result, expected)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you add another tests which asserts the behavior for strings

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is covered now, starting on line 1898 below

@pytest.mark.parametrize(
"test_input,error_type",
[
(pd.Series([]), ValueError),

# For strings, or any Series with dtype 'O'
(pd.Series(['foo', 'bar', 'baz']), TypeError),
(pd.Series([(1,), (2,)]), TypeError),

# For mixed data types
(
pd.Series(['foo', 'foo', 'bar', 'bar', None, np.nan, 'baz']),
TypeError
),
]
)
def test_assert_argminmax_raises(self, test_input, error_type):
"""
Cases where ``Series.argmax`` and related should raise an exception
"""
with pytest.raises(error_type):
test_input.argmin()
with pytest.raises(error_type):
test_input.argmin(skipna=False)
with pytest.raises(error_type):
test_input.argmax()
with pytest.raises(error_type):
test_input.argmax(skipna=False)

def test_argminmax_with_inf(self):
# For numeric data with NA and Inf (GH #13595)
s = pd.Series([0, -np.inf, np.inf, np.nan])

assert s.argmin() == 1
assert np.isnan(s.argmin(skipna=False))

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you do

with pd.option_context('mode.use_inf_as_null', True):
    ...

I suspect this will fail. Can you test with this as well.
(you may have to get the value of the option and pass to isfinite=pd.get_option('mode.use_inf_as_null'))

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't even know about this! So based on the notes on the options documentation page, it looks like Pandas used to treat floating point +/- inf, as well as nan, as missing values by default, and this behavior was changed at some point. That's a big help; will check this and update.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jreback can you confirm this is the expected behavior?

s = pd.Series([0, -np.inf, np.inf, np.nan])

with pd.option_context('mode.use_inf_as_null', True):
    s.argmax() # --> 0
    s.argmax(skipna=False) # --> nan

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks like it's working; see line 1891 below

assert s.argmax() == 2
assert np.isnan(s.argmax(skipna=False))

# Using old-style behavior that treats floating point nan, -inf, and
# +inf as missing
with pd.option_context('mode.use_inf_as_na', True):
assert s.argmin() == 0
assert np.isnan(s.argmin(skipna=False))
assert s.argmax() == 0
np.isnan(s.argmax(skipna=False))