Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BUG: DataFrame+DataFrame ops #22614

Closed
jbrockmendel opened this issue Sep 5, 2018 · 5 comments
Closed

BUG: DataFrame+DataFrame ops #22614

jbrockmendel opened this issue Sep 5, 2018 · 5 comments
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug DataFrame DataFrame data structure

Comments

@jbrockmendel
Copy link
Member

There are code paths for arithmetic methods (in particular DataFrame._combine_frame, _combine_match_index) that operate with self.values, other.values, and as a result behave differently from their Series/Index analogues. Example:

df = pd.DataFrame([pd.Timedelta(seconds=1), pd.Timedelta(seconds=2)])
df2 = pd.DataFrame([1*10**9, 2*10**9])

>>> df + df2  # <-- should raise TypeError
>>> df + df2
         0
0 00:00:02
1 00:00:04

>>> df[0] + df2[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/ops.py", line 1274, in wrapper
    result = dispatch_to_index_op(op, left, right, pd.TimedeltaIndex)
  File "pandas/core/ops.py", line 1331, in dispatch_to_index_op
    'operation [{name}]'.format(name=op.__name__))
TypeError: incompatible type for a datetime/timedelta operation [add]

AFAICT there are two options for how to fix these:

  1. Have all DataFrame arith/comparison ops operate column-wise. Downside is performance hit on currently-OK operations; AFAIK this is part of the motivation behind using Blocks in the first place.

  2. Make .values point at EAs, implement the arith/comparison methods on EAs.

@gfyoung gfyoung added Bug Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff DataFrame DataFrame data structure labels Sep 7, 2018
@jorisvandenbossche
Copy link
Member

@jbrockmendel the other option is to do it block wise? (which basically means column-wise for EAs, without the performance hit for consolidated numpy dtypes)

@jbrockmendel
Copy link
Member Author

the other option is to do it block wise? (which basically means column-wise for EAs, without the performance hit for consolidated numpy dtypes)

That is similar to what I had in mind with option 2 above. Just instead of dispatching to the consolidated numpy array ops, it would be dispatching to the consolidated EA ops.

There are a bunch of differences between pandas and numpy arith/comparison ops, most of which are handled in core.ops. In the status quo we still don't have full internal-consistency for how these operations behave. I think the best way to achieve internal consistency is to have One True Implementation for each of these ops. This can either be on the Series (option 1 above) or in EA (option 2). The former is definitely easier to implement short-term, but I think less elegant long-term.

@jorisvandenbossche
Copy link
Member

That is similar to what I had in mind with option 2 above. Just instead of dispatching to the consolidated numpy array ops, it would be dispatching to the consolidated EA ops.

Yes, but implementation wise that is still a big difference no?

@jbrockmendel
Copy link
Member Author

Yes, but implementation wise that is still a big difference no?

On the EA side, yes, since it would mean implementing a bunch of new EA subclasses (and moving a bunch of logic from core.ops). On the DataFrame/BlockManager/Block side, not so much.

@jbrockmendel
Copy link
Member Author

Closed by #22696.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Algos Non-arithmetic algos: value_counts, factorize, sorting, isin, clip, shift, diff Bug DataFrame DataFrame data structure
Projects
None yet
Development

No branches or pull requests

3 participants