BUG: DataFrame+DataFrame ops #22614

jbrockmendel · 2018-09-05T19:09:48Z

There are code paths for arithmetic methods (in particular DataFrame._combine_frame and DataFrame._combine_match_index) that operate on self.values and other.values and, as a result, behave differently from their Series/Index analogues. Example:

import pandas as pd

df = pd.DataFrame([pd.Timedelta(seconds=1), pd.Timedelta(seconds=2)])
df2 = pd.DataFrame([1*10**9, 2*10**9])

>>> df + df2  # <-- should raise TypeError
         0
0 00:00:02
1 00:00:04

>>> df[0] + df2[0]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "pandas/core/ops.py", line 1274, in wrapper
    result = dispatch_to_index_op(op, left, right, pd.TimedeltaIndex)
  File "pandas/core/ops.py", line 1331, in dispatch_to_index_op
    'operation [{name}]'.format(name=op.__name__))
TypeError: incompatible type for a datetime/timedelta operation [add]

AFAICT there are two options for how to fix these:

1. Have all DataFrame arith/comparison ops operate column-wise. The downside is a performance hit on currently-OK operations; AFAIK this is part of the motivation behind using Blocks in the first place.
2. Make .values point at EAs, and implement the arith/comparison methods on EAs.
   - This would require implementing a bunch of not-currently-existing EAs for pretty much all the standard np dtypes (Make EAs for all Block subclasses #22388).
   - This would require supporting 2D EAs.
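
For illustration only, option 1 could look roughly like the sketch below. This is not the pandas implementation: the helper name add_columnwise is hypothetical, and it assumes both frames share the same index and the same unique column labels. The point is that each column goes through the Series op, so the DataFrame result inherits the Series behavior.

```python
import pandas as pd


def add_columnwise(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: add two frames one column at a time.

    Assumes identical index and identical, unique column labels.
    """
    # Each column pair is added as Series + Series, so any dtype checks
    # done by the Series op (e.g. timedelta64 + int) apply here too.
    result = {col: left[col] + right[col] for col in left.columns}
    return pd.DataFrame(result, index=left.index, columns=left.columns)
```

Called with the df and df2 from the example above, this should surface the same TypeError as df[0] + df2[0] rather than returning timedeltas, at the cost of one Python-level op per column.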

jorisvandenbossche · 2018-09-26T14:03:21Z

@jbrockmendel the other option is to do it block wise? (which basically means column-wise for EAs, without the performance hit for consolidated numpy dtypes)
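
A minimal sketch of the block-wise idea, using only public pandas APIs rather than the internal BlockManager: columns are grouped by dtype as a stand-in for blocks, plain numpy numeric groups are added as one 2D array, and everything else falls back to column-wise Series ops. The names add_by_dtype_group and _is_plain_numeric are hypothetical, and the sketch assumes both frames share the same index and the same unique column labels.

```python
import numpy as np
import pandas as pd


def _is_plain_numeric(dtype) -> bool:
    # True for numpy integer/unsigned/float dtypes, False for
    # timedelta64, datetime64, object, and extension dtypes.
    return isinstance(dtype, np.dtype) and dtype.kind in "iuf"


def add_by_dtype_group(left: pd.DataFrame, right: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical sketch: approximate block-wise addition by dtype group."""
    # Group column labels by their (left dtype, right dtype) pair.
    groups = {}
    for col in left.columns:
        key = (left[col].dtype, right[col].dtype)
        groups.setdefault(key, []).append(col)

    out = {}
    for (ldt, rdt), cols in groups.items():
        if _is_plain_numeric(ldt) and _is_plain_numeric(rdt):
            # Fast path: a single 2D numpy addition for the whole group.
            block = left[cols].to_numpy() + right[cols].to_numpy()
            for i, col in enumerate(cols):
                out[col] = block[:, i]
        else:
            # Slow path: defer to the Series op for each column.
            for col in cols:
                out[col] = left[col] + right[col]
    return pd.DataFrame(out, index=left.index, columns=left.columns)
```

Only the non-numeric columns pay the per-column cost here; consolidated int/float columns still go through one vectorized numpy call per dtype group.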

jbrockmendel · 2018-09-26T19:55:53Z

> the other option is to do it block wise? (which basically means column-wise for EAs, without the performance hit for consolidated numpy dtypes)

That is similar to what I had in mind with option 2 above. Just instead of dispatching to the consolidated numpy array ops, it would be dispatching to the consolidated EA ops.

There are a bunch of differences between pandas and numpy arith/comparison ops, most of which are handled in core.ops. In the status quo we still don't have full internal consistency in how these operations behave. I think the best way to achieve internal consistency is to have One True Implementation for each of these ops. This can live either on Series (option 1 above) or on the EAs (option 2). The former is definitely easier to implement short-term, but I think it is less elegant long-term.

jorisvandenbossche · 2018-09-26T20:12:28Z

> That is similar to what I had in mind with option 2 above. Just instead of dispatching to the consolidated numpy array ops, it would be dispatching to the consolidated EA ops.

Yes, but implementation-wise that is still a big difference, no?

jbrockmendel · 2018-09-26T20:17:56Z

> Yes, but implementation wise that is still a big difference no?

On the EA side, yes, since it would mean implementing a bunch of new EA subclasses (and moving a bunch of logic from core.ops). On the DataFrame/BlockManager/Block side, not so much.

jbrockmendel · 2018-10-12T16:46:54Z

Closed by #22696.

jbrockmendel mentioned this issue Sep 7, 2018

CLN: use dispatch_to_series where possible #22534

Closed

gfyoung added the Bug, Algos, and DataFrame labels Sep 7, 2018

jbrockmendel mentioned this issue Sep 13, 2018

BUG: fix DataFrame+DataFrame op with timedelta64 dtype #22696

Merged

jbrockmendel mentioned this issue Sep 20, 2018

Preserve Extension type on cross section #22785

Merged

jbrockmendel closed this as completed Oct 12, 2018
