Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Categorical Arithmetic/Comparison Rules #19513

Closed
jbrockmendel opened this issue Feb 2, 2018 · 11 comments
Closed

Categorical Arithmetic/Comparison Rules #19513

jbrockmendel opened this issue Feb 2, 2018 · 11 comments
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions

Comments

@jbrockmendel
Copy link
Member

Series[not-categorical] > CategoricalIndex is inconsistent with the reversed operation. Which one is canonical?

ser = pd.Series([1, 2, 3])
idx = pd.CategoricalIndex(['A', 'B', 'A'])

>>> ser > idx
0    False
1    False
2    False

>>> idx < ser
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/indexes/category.py", line 752, in _evaluate_compare
    return getattr(self.values, opname)(other)
  File "pandas/core/arrays/categorical.py", line 56, in f
    raise TypeError("Unordered Categoricals can only compare "
TypeError: Unordered Categoricals can only compare equality or not

>>> ser > idx.values
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/ops.py", line 819, in wrapper
    .format(op=op, typ=self.dtype))
TypeError: Cannot compare a Categorical for op <built-in function gt> with Series of dtype int64.
If you want to compare values, use 'series <op> np.asarray(other)'.

I'm guessing the right thing to do is to a) have Series[categorical].__op__ wrap CategoricalIndex.__op__, and b) have Series[non-categorical] to dispatch to reversed-op for is_categorical_dtype(other), want to confirm before making a PR.

@jbrockmendel
Copy link
Member Author

A case derived from tests.categorical.test_operators.TestCategoricalOps.test_comparisons, where it has the comment

# categorical cannot be compared to Series or numpy array, and also
# not the other way around
cat = Series(Categorical([1, 2, 3], ordered=True))
s = Series([2, 2, 2])

>>> s < cat
Traceback (most recent call last):
[...]
TypeError: Cannot compare a Categorical for op __lt__ with type <type 'numpy.ndarray'>.
If you want to compare values, use 'np.asarray(cat) <op> other'.

>>> s < pd.CategoricalIndex(cat)
0    False
1    False
2     True
dtype: bool

@jreback
Copy link
Contributor

jreback commented Feb 6, 2018

there are some issues about this already. see if you can find them.

@jbrockmendel
Copy link
Member Author

Closest issues with categorical label are #18050, #8995, neither of which seem to differentiate between the behavior of Categorical.__op__ vs CategoricalIndex.__op__ vs Series[category].__op__

@TomAugspurger any thoughts on the One True Implementation for these operations?

@TomAugspurger
Copy link
Contributor

TomAugspurger commented Feb 6, 2018

No strong thoughts, but I suppose a decent heuristic is "what would the result be if NumPy had a categorical dtype?" In that case, let's just do the same as Series[int].op(Index[int]) and Series[int].op(ndarray[int]), etc. So

self other return
series ndarray series
series index series
index series ndarray / extension array
index series ndarray / extension array
ndarray series series
ndarray index ndarray / extension array

Does that make sense?

@jbrockmendel
Copy link
Member Author

I should clarify. The question isn't what the return type should be, but which operations are allowed at all.

The easy (i.e. pretty obviously wrong ATM) part is cases where x < y is allowed but y > x raises:

cat = pd.Categorical([1, 2, 3], ordered=True)
idx = pd.CategoricalIndex(cat)
ser = pd.Series(cat)

>>> cat < ser   # raises TypeError
>>> ser > cat   # returns Series
>>> cat < idx   # raises TypeError
>>> idx > cat   # returns array
>>> idx < ser   # raises TypeError
>>> ser > idx   # raises TypeError

The less obvious part is what casting rules are allowed. ATM CategoricalIndex.__op__ is much more forgiving about input dtypes and will cast other before dispatching to Categorical.__op__.

>>> ser < ser.astype('i8')   # raises TypeError
>>> idx < idx.astype('i8')   # returns array
>>> cat < cat.astype('i8')   # raises TypeError

>>> ser == ser.astype('i8')  # returns Series
>>> idx == idx.astype('i8')   # returns array
>>> cat == cat.astype('i8')  # returns array

The goal here is to have one implementation of each of these methods (ideally in Categorical) and then have the other classes wrap that instead of re-implementing and risking having the logic diverge.

@jreback
Copy link
Contributor

jreback commented Feb 10, 2018

Categorical is prob a bit strict about these checks ATM. Categorial and CI should behave exactly the same.

These should all (Series, Cat, CI) should all respect ordered (IOW they have to be the same). for comparison purposes.

@jschendel

@jreback jreback added API Design Categorical Categorical Data Type labels Feb 10, 2018
@jbrockmendel
Copy link
Member Author

It looks like Series.__cmp__(CategoricalIndex) effectively calls other = np.asarray(other) on the CategoricalIndex, which I expect isn't the desired behavior.

@jbrockmendel
Copy link
Member Author

One more:

What should (DatetimeIndex|TimedeltaIndex|PeriodIndex).__(add|sub)__(Categorical|CategoricalIndex) do? I think right now they return NotImplemented which seems reasonable, but they do it in a catch-all block instead of intentionally and I don't think the case is tested.

@jschendel are you the appropriate person to ping on this?

@jschendel
Copy link
Member

@jschendel are you the appropriate person to ping on this?

I don't think I have any type of definitive say on this; mostly worked on categorical vs. categorical comparisons.

I agree that this probably shouldn't be supported. I suppose you could perform the operation on the underlying category values, but it seems like you'd run into ambiguous corner cases pretty quickly. For example, would you actually want to allow add/sub if the categories were ordered with a non-standard ordering, e.g. Timestamp('2017-01-01') > Timestamp('2018-01-01') ? Not sure why anyone would ever do that, but could conceivably happen.

Also doesn't look like add/sub at the scalar level is implemented for Timestamp|Timedelta|Period vs. Categorical|CategoricalIndex, which I would expect to be implemented if we were to support the same operations on the index equivalent.

@mroeschke
Copy link
Member

The comparison in the first comment all raise the same error now. Could use a test if there are none.

In [3]: ser > idx
TypeError: Unordered Categoricals can only compare equality or not

In [4]: idx < ser
TypeError: Unordered Categoricals can only compare equality or not

In [5]: ser > idx.values
TypeError: Unordered Categoricals can only compare equality or not

@mroeschke mroeschke added good first issue Needs Tests Unit test(s) needed to prevent regressions and removed API Design Bug Categorical Categorical Data Type Numeric Operations Arithmetic, Comparison, and Logical operations labels Jun 18, 2021
@jbrockmendel
Copy link
Member Author

Closing as fixed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Needs Tests Unit test(s) needed to prevent regressions
Projects
None yet
Development

No branches or pull requests

5 participants