POC: ArrayManager -- array-based data manager for columnar store #36010

Merged
merged 42 commits into from
Jan 13, 2021

Conversation

jorisvandenbossche
Member

@jorisvandenbossche jorisvandenbossche commented Aug 31, 2020

Related to the discussion in #10556, and following up on the mailing list discussion "A case for a simplified (non-consolidating) BlockManager with 1D blocks" (archive).

This branch experiments with an "array manager", storing a list of 1D arrays instead of blocks.
The idea is that this ArrayManager could optionally be used instead of BlockManager. If we ensure the "DataManager" has a clear interface for the rest of pandas (and thus parts outside of the internals don't rely on details like block layout, xref #34669), this should be possible without many changes outside of /core/internals.
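To make the storage idea concrete, here is a minimal sketch of a manager that keeps one 1D array per column. It is not the code in this branch: the class name `ToyArrayStore` and its methods are hypothetical, and plain Python lists stand in for numpy arrays so that it runs without dependencies.

```python
from typing import Callable, List, Sequence


class ToyArrayStore:
    """Hypothetical stand-in for an array manager: one 1D array per column."""

    def __init__(self, arrays: List[list], columns: Sequence[str]):
        if len(arrays) != len(columns):
            raise ValueError("need exactly one array per column")
        self.arrays = arrays
        self.columns = list(columns)

    def get_column(self, name: str) -> list:
        # Column lookup is a direct index into the list of arrays;
        # there is no block number / position-in-block indirection.
        return self.arrays[self.columns.index(name)]

    def apply(self, func: Callable[[list], list]) -> "ToyArrayStore":
        # Column-wise operations become a plain loop over the 1D arrays.
        return ToyArrayStore([func(arr) for arr in self.arrays], self.columns)


store = ToyArrayStore([[1, 2, 3], [0.5, 1.5, 2.5]], ["a", "b"])
doubled = store.apply(lambda arr: [2 * x for x in arr])
print(doubled.get_column("b"))  # [1.0, 3.0, 5.0]
```

The loop in `apply` makes one call per column, which is presumably why frames with many columns are the case to watch for performance.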

Some notes on this experiment:

  • This is not a complete POC, not every aspect and behaviour of the BlockManager has already been replicated, and there are still places in pandas that rely on the blocks being present, so lots of tests are still failing (although changes in behaviour are also desired). That said, a lot of the basic operations do work. Two illustrations of this:

  • For now, I focused on an ArrayManager storing a list of numpy arrays. Of course we need to expand that to support ExtensionArrays as well (or ExtensionArrays only?), but the reason I limited to numpy arrays for now: besides making it a bit simpler to experiment with, this also gives a fairer comparison with the consolidated BlockManager (because it focuses on the numpy array being 1D vs 2D, and doesn't mix in performance/implementation differences of numpy array vs ExtensionArray).

  • Personally, I think this looks promising. Many of the methods are a lot simpler than the BlockManager equivalent (although not every aspect is implemented yet, that's correct). And for the case I showed in the notebook, performance also looks good. For the benchmark suite I ran, there are obviously slowdowns for the "wide dataframe" benchmarks.
    There is still a lot of work needed to make this fully working with the rest of pandas, though ;)

  • Given the early proof of concept stage, detailed code feedback is not yet needed, but I would find it very useful to discuss the following aspects:

    • High-level feedback on the approach: does the approach of the two subclasses look interesting? The approach of the ArrayManager itself storing a list of arrays? ...

    • What to do with Series, which now is a SingleBlockManager inheriting from BlockManager (should we also have a "SingleArrayManager"?)

    • If we find this interesting, how can we go from here? How do we decide on this? (what aspects already need to work, how fast does it need to be?) I don't think getting a fully complete implementation passing all tests is possible in a single PR. Are we fine with merging something partial in master and continuing from there? Or a shared feature branch in upstream? ...

Benchmark results for asv_bench/arithmetic.py

As an example, I ran asv continuous -f 1.1 upstream/master HEAD -b arithmetic.

The benchmarks with a slowdown bigger than a factor 2 can basically be brought back to two cases:

  • Benchmarks for "wide" dataframes (eg FrameWithFrameWide using a case with n_cols > n_rows)
  • Benchmarks from the IntFrameWithScalar class: from a quick profile, it seems that the usage of numexpr is the cause, and disabling this seems to reduce the slowdown to a factor 2. The numexpr code (and checking if it should be used etc) apparently has a high overhead per call, which I assume is something that can be solved (moving those checks a level higher up, so we don't need to repeat it for each column)
       before           after         ratio
     [b45327f5]       [047f9091]
     <master>                   
!        40.6±6ms           failed      n/a  arithmetic.Ops.time_frame_multi_and(False, 'default')
!        32.7±2ms           failed      n/a  arithmetic.Ops.time_frame_multi_and(False, 1)
!        26.5±1ms           failed      n/a  arithmetic.Ops.time_frame_multi_and(True, 'default')
!        37.7±2ms           failed      n/a  arithmetic.Ops.time_frame_multi_and(True, 1)
+      1.06±0.3ms         93.5±7ms    88.57  arithmetic.FrameWithFrameWide.time_op_same_blocks(<built-in function gt>)
+      1.51±0.2ms         80.6±3ms    53.34  arithmetic.FrameWithFrameWide.time_op_same_blocks(<built-in function add>)
+     1.22±0.08ms         55.1±5ms    45.19  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('le')
+     1.30±0.07ms        55.6±20ms    42.83  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('ne')
+      2.12±0.4ms         90.1±4ms    42.47  arithmetic.FrameWithFrameWide.time_op_different_blocks(<built-in function gt>)
+     1.17±0.04ms         49.4±4ms    42.38  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('gt')
+     1.28±0.07ms         52.9±3ms    41.28  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('lt')
+      1.29±0.2ms       52.5±0.6ms    40.63  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('ge')
+     1.44±0.02ms         56.8±7ms    39.56  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('eq')
+      2.08±0.3ms        78.9±10ms    37.90  arithmetic.Ops2.time_frame_float_mod
+      2.34±0.1ms         78.3±4ms    33.51  arithmetic.FrameWithFrameWide.time_op_different_blocks(<built-in function add>)
+      1.66±0.2ms         46.6±1ms    28.00  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('mul')
+      1.78±0.2ms         48.2±5ms    27.02  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('truediv')
+     1.14±0.04ms         26.8±4ms    23.49  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function le>)
+      1.83±0.2ms         42.9±1ms    23.39  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('add')
+      1.94±0.3ms         45.1±4ms    23.29  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('sub')
+     1.23±0.07ms         23.0±3ms    18.65  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function ge>)
+     1.33±0.08ms         22.8±1ms    17.14  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function eq>)
+     1.03±0.05ms         17.6±2ms    17.13  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function ge>)
+      1.65±0.5ms         28.1±7ms    17.00  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function eq>)
+     1.21±0.05ms         20.1±3ms    16.67  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function gt>)
+     1.18±0.03ms       19.4±0.9ms    16.54  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function eq>)
+     1.08±0.07ms         17.8±1ms    16.53  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function lt>)
+     1.22±0.05ms         20.0±2ms    16.41  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function gt>)
+     1.30±0.06ms         21.2±3ms    16.28  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function ne>)
+     1.15±0.06ms         18.6±3ms    16.18  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function lt>)
+      1.42±0.1ms         22.6±1ms    15.96  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function lt>)
+     1.11±0.01ms       17.6±0.4ms    15.85  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function ne>)
+      5.30±0.8ms        81.7±20ms    15.40  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('lt')
+      1.37±0.2ms         20.7±3ms    15.09  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function gt>)
+     1.22±0.05ms         18.0±6ms    14.72  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function ge>)
+      1.28±0.1ms         18.6±3ms    14.55  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function gt>)
+     1.17±0.08ms         17.0±3ms    14.54  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function gt>)
+      1.22±0.1ms       17.6±0.8ms    14.44  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function eq>)
+      1.35±0.1ms         19.4±2ms    14.35  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function le>)
+      1.35±0.1ms         19.2±4ms    14.21  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function ge>)
+      4.36±0.3ms         61.8±8ms    14.17  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('le')
+      1.31±0.1ms         18.5±2ms    14.09  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function lt>)
+      4.48±0.5ms         62.9±5ms    14.06  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('ge')
+      1.15±0.1ms         16.1±1ms    14.01  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function ne>)
+      1.33±0.1ms         18.6±2ms    14.00  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function ge>)
+      4.37±0.4ms         58.9±2ms    13.48  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('ne')
+      1.22±0.2ms         16.2±3ms    13.25  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function le>)
+      1.25±0.1ms         16.5±1ms    13.13  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function le>)
+      1.44±0.2ms         18.6±4ms    12.90  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function ge>)
+      1.75±0.3ms         22.3±2ms    12.74  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function eq>)
+      1.42±0.3ms         18.0±7ms    12.68  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function gt>)
+      1.36±0.1ms         17.2±1ms    12.67  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function ne>)
+        440±30μs       5.57±0.1ms    12.65  arithmetic.Ops2.time_frame_series_dot
+      1.63±0.2ms         20.6±2ms    12.65  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function lt>)
+     1.35±0.07ms         17.0±3ms    12.58  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function le>)
+      1.34±0.2ms         16.7±1ms    12.46  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function eq>)
+      1.50±0.1ms         18.6±5ms    12.43  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function ge>)
+     1.35±0.07ms         16.8±1ms    12.42  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function ge>)
+      1.35±0.1ms         16.7±2ms    12.37  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function le>)
+      1.55±0.3ms         18.9±2ms    12.20  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function le>)
+      1.67±0.3ms         20.3±5ms    12.17  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function ne>)
+      1.55±0.2ms       18.5±0.7ms    11.94  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function le>)
+      5.05±0.5ms         59.1±3ms    11.70  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('gt')
+      1.51±0.2ms         17.6±2ms    11.66  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function lt>)
+     1.33±0.08ms         15.3±1ms    11.50  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function ne>)
+      4.47±0.1ms         51.2±1ms    11.45  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('eq')
+      1.35±0.1ms         15.4±2ms    11.45  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function lt>)
+      1.76±0.5ms         19.8±2ms    11.28  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function lt>)
+     1.55±0.09ms       16.8±0.3ms    10.86  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function ne>)
+      1.71±0.1ms         18.2±2ms    10.58  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function eq>)
+      1.51±0.2ms         15.9±3ms    10.54  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function eq>)
+      1.53±0.2ms       15.6±0.3ms    10.19  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function ne>)
+      1.95±0.2ms         19.7±5ms    10.08  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function gt>)
+     2.22±0.08ms         21.6±4ms     9.73  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function add>)
+     1.77±0.08ms         16.7±1ms     9.48  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function gt>)
+      2.19±0.1ms         19.9±2ms     9.08  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function mul>)
+     1.91±0.04ms         17.0±2ms     8.88  arithmetic.Ops.time_frame_comparison(True, 'default')
+      2.18±0.1ms         19.0±1ms     8.73  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function add>)
+     2.23±0.08ms         19.1±1ms     8.59  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function sub>)
+     2.24±0.07ms         19.0±3ms     8.47  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function mul>)
+     2.34±0.06ms         19.5±2ms     8.31  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function truediv>)
+      2.52±0.2ms         20.3±6ms     8.06  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function truediv>)
+      2.39±0.2ms         19.2±2ms     8.05  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function truediv>)
+      3.07±0.4ms         24.4±5ms     7.94  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function mod>)
+      2.24±0.1ms         17.5±2ms     7.85  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function add>)
+      2.24±0.2ms       17.4±0.7ms     7.79  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function sub>)
+      2.33±0.1ms         18.0±2ms     7.73  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function mul>)
+      2.15±0.1ms         16.4±4ms     7.60  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function sub>)
+     2.10±0.05ms         15.9±2ms     7.57  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function add>)
+      2.27±0.1ms         16.8±1ms     7.39  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function add>)
+      3.59±0.1ms         26.1±5ms     7.27  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function mod>)
+      2.32±0.1ms         16.8±3ms     7.25  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function sub>)
+     2.36±0.08ms       17.1±0.7ms     7.23  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function truediv>)
+      2.42±0.2ms         17.4±2ms     7.17  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function sub>)
+     2.31±0.09ms       16.4±0.9ms     7.11  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function add>)
+      7.34±0.9ms         52.2±2ms     7.10  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('add')
+      2.32±0.1ms       16.4±0.9ms     7.07  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function add>)
+      2.25±0.2ms         15.8±2ms     7.03  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function sub>)
+      2.51±0.5ms         17.3±2ms     6.91  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function add>)
+      2.43±0.1ms       16.7±0.8ms     6.84  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function mul>)
+      2.24±0.1ms         15.2±2ms     6.81  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function mul>)
+        7.81±1ms         52.9±4ms     6.78  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('sub')
+      2.48±0.2ms         16.4±2ms     6.62  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function mul>)
+        6.82±1ms       44.4±0.7ms     6.51  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('mul')
+     2.25±0.05ms       14.6±0.8ms     6.48  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function sub>)
+      3.14±0.7ms         19.8±2ms     6.30  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function mod>)
+      2.57±0.2ms         15.9±2ms     6.19  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function sub>)
+      2.57±0.1ms         15.8±2ms     6.16  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function truediv>)
+        7.70±1ms         47.2±3ms     6.13  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('truediv')
+      3.02±0.1ms         18.4±3ms     6.08  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function mod>)
+      2.79±0.2ms       16.8±0.8ms     6.04  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function truediv>)
+      3.16±0.3ms       19.1±0.7ms     6.04  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function mod>)
+      2.51±0.2ms       14.9±0.5ms     5.92  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function mul>)
+      2.71±0.1ms       15.9±0.8ms     5.86  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function mul>)
+      2.72±0.3ms         15.9±1ms     5.83  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function truediv>)
+        11.9±1ms         64.0±5ms     5.39  arithmetic.Ops2.time_frame_int_mod
+      3.59±0.4ms         19.1±5ms     5.33  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 2, <built-in function mod>)
+      6.23±0.4ms         32.7±6ms     5.25  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 2, <built-in function mod>)
+      3.28±0.2ms         17.2±2ms     5.23  arithmetic.Ops.time_frame_add(True, 'default')
+        23.7±6ms          112±7ms     4.70  arithmetic.FrameWithFrameWide.time_op_same_blocks(<built-in function floordiv>)
+      3.51±0.4ms       16.5±0.6ms     4.70  arithmetic.Ops.time_frame_mult(True, 'default')
+        3.61±2ms         16.3±1ms     4.52  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function truediv>)
+        45.8±4ms         194±20ms     4.25  arithmetic.FrameWithFrameWide.time_op_different_blocks(<built-in function floordiv>)
+      5.64±0.6ms         21.9±1ms     3.89  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 4, <built-in function mod>)
+      3.13±0.1ms       11.4±0.5ms     3.63  arithmetic.Ops.time_frame_comparison(True, 1)
+      12.2±0.8ms         42.5±4ms     3.47  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function pow>)
+      4.03±0.7ms       11.2±0.3ms     2.79  arithmetic.Ops.time_frame_add(True, 1)
+        53.0±6ms         143±10ms     2.69  arithmetic.Ops2.time_frame_float_floor_by_zero
+      4.11±0.2ms         11.1±1ms     2.69  arithmetic.Ops.time_frame_mult(True, 1)
+        54.9±4ms          125±9ms     2.28  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('floordiv')
+      25.0±0.6ms         55.9±5ms     2.24  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 4, <built-in function pow>)
+      2.42±0.2ms       5.21±0.6ms     2.16  arithmetic.Ops.time_frame_comparison(False, 'default')
+        16.2±1ms         31.9±3ms     1.97  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 3.0, <built-in function pow>)
+        30.9±3ms        58.1±10ms     1.88  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function pow>)
+      3.36±0.3ms       5.76±0.4ms     1.71  arithmetic.Ops.time_frame_add(False, 'default')
+      3.10±0.3ms       5.03±0.3ms     1.62  arithmetic.Ops.time_frame_comparison(False, 1)
+        30.5±3ms         49.2±9ms     1.61  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function pow>)
+      3.42±0.3ms       5.51±0.4ms     1.61  arithmetic.Ops.time_frame_mult(False, 1)
+      3.52±0.2ms       5.63±0.1ms     1.60  arithmetic.Ops.time_frame_add(False, 1)
+      3.60±0.2ms       5.74±0.5ms     1.59  arithmetic.Ops.time_frame_mult(False, 'default')
+        57.9±1ms         89.7±6ms     1.55  arithmetic.Ops2.time_frame_float_div
+      32.1±0.5ms         48.7±2ms     1.52  arithmetic.Ops2.time_frame_dot
+     2.96±0.06ms       4.32±0.4ms     1.46  arithmetic.DateInferOps.time_add_timedeltas
+        65.9±2ms         93.8±1ms     1.42  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis1('pow')
+         106±2ms          132±3ms     1.25  arithmetic.MixedFrameWithSeriesAxis.time_frame_op_with_series_axis0('pow')
+     1.33±0.01ms       1.64±0.2ms     1.24  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<YearEnd: month=12>)
+      7.09±0.2ms       8.49±0.5ms     1.20  arithmetic.DateInferOps.time_subtract_datetimes
+        1.13±0ms      1.33±0.09ms     1.18  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<YearBegin: month=1>)
+     1.25±0.02ms       1.47±0.1ms     1.18  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<SemiMonthEnd: day_of_month=15>)
+     2.52±0.04ms       2.97±0.2ms     1.18  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<BusinessDay>)
+     1.16±0.01ms      1.32±0.06ms     1.13  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<QuarterBegin: startingMonth=3>)
-      1.67±0.2ms      1.42±0.02ms     0.85  arithmetic.OffsetArrayArithmetic.time_add_dti_offset(<MonthEnd>)
-        282±20μs          230±5μs     0.81  arithmetic.NumericInferOps.time_subtract(<class 'numpy.int8'>)
-      4.36±0.2ms       3.54±0.3ms     0.81  arithmetic.NumericInferOps.time_modulo(<class 'numpy.uint16'>)
-      1.29±0.1ms      1.03±0.06ms     0.80  arithmetic.NumericInferOps.time_multiply(<class 'numpy.int64'>)
-     1.77±0.09ms      1.39±0.03ms     0.79  arithmetic.OffsetArrayArithmetic.time_add_dti_offset(<SemiMonthBegin: day_of_month=15>)
-      1.54±0.2ms      1.13±0.02ms     0.74  arithmetic.NumericInferOps.time_divide(<class 'numpy.int8'>)
-        301±40μs          221±4μs     0.73  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<Day>)
-      3.85±0.5ms       2.58±0.2ms     0.67  arithmetic.OffsetArrayArithmetic.time_add_dti_offset(<BusinessDay>)

SOME BENCHMARKS HAVE CHANGED SIGNIFICANTLY.
PERFORMANCE DECREASED.
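The suggestion above of moving the numexpr checks "a level higher up" can be illustrated with a small sketch. It does not use numexpr or pandas; `should_use_fast_path` is a hypothetical stand-in for whatever per-call check decides on the accelerated path, and the sketch only counts how often that check runs.

```python
from typing import Callable, Dict, List

# Mutable counter so the helper can record how often the check is executed.
counter: Dict[str, int] = {"checks": 0}


def should_use_fast_path(n_elements: int) -> bool:
    """Hypothetical stand-in for a 'can we use the accelerated engine' check."""
    counter["checks"] += 1
    return n_elements > 10_000


def op_per_column(columns: List[list], op: Callable[[int], int]) -> List[list]:
    # The check runs once per column, so its cost scales with the column count.
    out = []
    for col in columns:
        should_use_fast_path(len(col))
        out.append([op(x) for x in col])
    return out


def op_hoisted(columns: List[list], op: Callable[[int], int]) -> List[list]:
    # The check runs once for the whole frame, based on the total size.
    should_use_fast_path(sum(len(col) for col in columns))
    return [[op(x) for x in col] for col in columns]


columns = [[1, 2, 3] for _ in range(100)]

counter["checks"] = 0
per_column_result = op_per_column(columns, lambda x: x + 1)
per_column_checks = counter["checks"]

counter["checks"] = 0
hoisted_result = op_hoisted(columns, lambda x: x + 1)
hoisted_checks = counter["checks"]

print(per_column_checks, hoisted_checks)  # 100 1
```

Whether a single frame-level decision is valid for every column depends on what the real check inspects (dtype, size, operator), which this sketch ignores.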

@jorisvandenbossche added the following labels on Aug 31, 2020: Refactor (Internal refactoring of code), Internals (Related to non-user accessible pandas implementation), Needs Discussion (Requires discussion from core team before further action)
@jbrockmendel
Member

jbrockmendel commented Aug 31, 2020

What's DataManager?

Edit: never mind, clear now that I look at the code.

do_integrity_check: bool = True,
):
self._axes = axes
self.arrays = arrays
Member

I would have expected you'd need to be backed by blocks to get e.g. replace to work. Is there a nice way around that?

Member Author

It was an explicit goal of the current experiment not to use Blocks, because they add another level of indirection / overhead. But for sure, there are currently certain algos like replace that are defined on the Blocks (see also the inline comment at def replace). So short term we can wrap the arrays in Blocks just for those operations when needed (to be clear, this would only be a hack to get a POC more fully working), and eventually I would abstract some of those algos into separate array-based algos (like we already have for many others that are not defined on the Blocks themselves), and then those can be used by both Block.replace and ArrayManager.replace.

So we will need to see a bit how many of those things from Blocks need reuse, but if it turns out possible (eg is limited to a set of algos that can be factored out), I would prefer to drop the Block concept in a potential ArrayManager.
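A rough sketch of that factoring, under stated assumptions: `replace_values`, `ToyBlock` and `ToyArrayManager` are hypothetical names, plain lists stand in for numpy arrays, and the dtype handling that the real replace needs is left out entirely.

```python
from typing import Any, List


def replace_values(arr: list, to_replace: Any, value: Any) -> list:
    """Array-level algorithm: knows nothing about Blocks or managers."""
    return [value if x == to_replace else x for x in arr]


class ToyBlock:
    """Hypothetical Block-like wrapper around a single array."""

    def __init__(self, values: list):
        self.values = values

    def replace(self, to_replace: Any, value: Any) -> "ToyBlock":
        # The Block method delegates to the shared array-level function.
        return ToyBlock(replace_values(self.values, to_replace, value))


class ToyArrayManager:
    """Hypothetical manager holding a list of 1D arrays, with no Blocks."""

    def __init__(self, arrays: List[list]):
        self.arrays = arrays

    def replace(self, to_replace: Any, value: Any) -> "ToyArrayManager":
        # The manager calls the same function directly, once per column.
        return ToyArrayManager(
            [replace_values(arr, to_replace, value) for arr in self.arrays]
        )


mgr = ToyArrayManager([[1, 2, 1], [3, 1, 4]])
print(mgr.replace(1, 0).arrays)  # [[0, 2, 0], [3, 0, 4]]
```

With this shape, the Block and the manager share one implementation and neither depends on the other.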

Member

eventually I would abstract some of those algos into separate array-based algos

+1 for this idea

Member

Poking at this, I think the hard part would be implementing can_hold_element as an array function.

Member Author

Poking at this, I think the hard part would be implementing can_hold_element as an array function.

For numpy arrays, I think we could write a single helper function that works for all dtypes? And for ExtensionArrays, we could maybe add a similar method on the EA itself to the interface.
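As a loose illustration of what a single array-level helper could look like, the sketch below uses the standard library's `array.array` typecodes as a stand-in for numpy dtypes. This is an assumption made so the example runs without dependencies; it is not pandas' actual can_hold_element logic, and it ignores float precision and overflow.

```python
from array import array
from typing import Any

SIGNED = "bhilq"    # signed integer typecodes
UNSIGNED = "BHILQ"  # unsigned integer typecodes
FLOATS = "fd"       # floating-point typecodes


def can_hold_element(arr: array, element: Any) -> bool:
    """Return True if `element` fits the element type of `arr` without a cast."""
    code = arr.typecode
    if isinstance(element, bool):
        # Treat booleans as non-numeric (an arbitrary choice for this sketch).
        return False
    if code in SIGNED or code in UNSIGNED:
        if not isinstance(element, int):
            return False
        bits = 8 * arr.itemsize
        if code in SIGNED:
            lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
        else:
            lo, hi = 0, (1 << bits) - 1
        return lo <= element <= hi
    if code in FLOATS:
        # Integers are accepted for float storage; precision loss is ignored.
        return isinstance(element, (int, float))
    return False


ints = array("q", [1, 2, 3])
floats = array("d", [1.0, 2.0])
print(can_hold_element(ints, 5), can_hold_element(ints, 2.5))  # True False
print(can_hold_element(floats, 2), can_hold_element(floats, "x"))  # True False
```

A caller such as replace or putmask could use a check like this to decide whether to write in place or to cast to a wider type first.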

Member

I think you're right on both points: a helper for ndarrays and a method for EAs would both be good.

Spent some time on this yesterday and found other Block methods are necessary (putmask comes to mind), to the point where keeping them as methods feels more intuitive than standalone functions. It might be easiest to use the existing Block code for these methods, but delay wrapping the arrays in Blocks until they are needed (at least for the POC stage)

@TomAugspurger
Contributor

Overall, having a base class, BlockManager, and ArrayManager makes sense for easy prototyping / switching. That also lets us clearly define the interface between pandas' internals and the rest of pandas.

For ease of review, can you split internals/managers.py into base / array / block? Or perhaps just leave internals/managers.py as is for this PR (other than inheriting from DataManager) and just add stuff so the diff is cleanest.

@jorisvandenbossche
Member Author

For ease of review, can you split internals/managers.py into base / array / block? Or perhaps just leave internals/managers.py as is for this PR (other than inheriting from DataManager) and just add stuff so the diff is cleanest.

The diff should be clean right now. It's added in managers.py itself, but it's purely an addition of lines, so the diff should look the same as if it were a new file.
(the base class itself right now doesn't have much content, but I agree that eventually it would make sense to split in multiple files)

@jbrockmendel
Member

For the purpose of seeing how close this is to working, would it make sense to use a pd.options flag to control ArrayManager vs BlockManager instead of a DataFrame keyword? This would make it straightforward to run all the tests using ArrayManager with few edits.

@jorisvandenbossche
Member Author

For the purpose of seeing how close this is to working, would it make sense to use a pd.options flag to control ArrayManager vs BlockManager instead of a DataFrame keyword? This would make it straightforward to run all the tests using ArrayManager with few edits.

Yes, that's a good idea (I now added a keyword to the DataFrame constructor, and for testing purposes switched the default. But indeed with an option it is easier to switch)
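The difference between a constructor keyword and a global option can be sketched as follows. The option name `mode.data_manager` and its values 'block' and 'array' are taken from the session shown later in this thread; the `options` dict, the two toy classes and `make_manager` are hypothetical and only illustrate the dispatch.

```python
from typing import Dict, List

# Hypothetical global option registry; 'block' is the default.
options: Dict[str, str] = {"mode.data_manager": "block"}


class ToyBlockManager:
    def __init__(self, columns: List[list]):
        self.columns = columns


class ToyArrayManager:
    def __init__(self, columns: List[list]):
        self.columns = columns


_MANAGERS = {"block": ToyBlockManager, "array": ToyArrayManager}


def make_manager(columns: List[list]):
    """Build whichever manager the global option currently selects."""
    kind = options["mode.data_manager"]
    try:
        cls = _MANAGERS[kind]
    except KeyError:
        raise ValueError(f"unknown data manager: {kind!r}") from None
    return cls(columns)


print(type(make_manager([[1, 2]])).__name__)  # ToyBlockManager
options["mode.data_manager"] = "array"
print(type(make_manager([[1, 2]])).__name__)  # ToyArrayManager
```

Because every construction site goes through one dispatch point, a test run can switch managers by setting the option once, without editing individual tests.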

@jbrockmendel
Member

@jorisvandenbossche I'm curious how this performs for the snippet discussed in #34683

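import numpy as np
import pandas as pd

# df is only used for its length (100 rows)
df = pd.DataFrame(index=list(range(100)))

df1 = pd.DataFrame(index=list(range(100)))
df2 = pd.DataFrame(index=list(range(100)))

# add 10 float columns to each frame, one at a time
for i in range(10):
    df1[i] = np.random.randn(len(df))
    df2[i] = np.random.randn(len(df))


In [22]: %timeit pd.concat([df1, df2])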

@jorisvandenbossche
Member Author

It's a bit faster:

In [5]: pd.options.mode.data_manager = 'block'   

In [6]: ... # code to create df1 and df2   

In [7]: %timeit pd.concat([df1, df2])   
465 µs ± 29.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [8]: pd.options.mode.data_manager = 'array'  

In [9]: ... # code to create df1 and df2  

In [10]: %timeit pd.concat([df1, df2])   
298 µs ± 6.95 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

However, concat is not yet fully implemented (only for simple cases like this one), so it may be taking some shortcuts that skip checks a full implementation would need.

(It's also not directly related to "consolidation in reshape or not": after the first iteration of the %timeit, df1 and df2 will actually be consolidated already, because this happens inplace. But that's for the discussion in the other issue.)

@pep8speaks

pep8speaks commented Sep 4, 2020

Hello @jorisvandenbossche! Thanks for updating this PR. We checked the lines you've touched for PEP 8 issues, and found:

There are currently no PEP 8 issues detected in this Pull Request. Cheers! 🍻

Comment last updated at 2020-12-12 19:03:25 UTC

# mgr_shape = self.shape
# tot_items = sum(len(x.mgr_locs) for x in self.blocks)
# for block in self.blocks:
#     if block._verify_integrity and block.shape[1:] != mgr_shape[1:]:
Member

I think this can be replaced with just checking that all of the array lengths match.
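A minimal sketch of what such a length check could look like. The function name `verify_integrity` and its signature are assumptions for illustration, not the code in this PR, and plain Python sequences stand in for the 1D arrays.

```python
from typing import Sequence


def verify_integrity(arrays: Sequence[Sequence], n_rows: int) -> None:
    """Raise ValueError if any 1D array's length differs from the row count.

    Hypothetical helper: with one 1D array per column there are no block
    shapes or mgr_locs to validate, only lengths.
    """
    for i, arr in enumerate(arrays):
        if len(arr) != n_rows:
            raise ValueError(
                f"Array at position {i} has length {len(arr)}, expected {n_rows}"
            )


# Three columns of equal length pass silently.
verify_integrity([[1, 2, 3], [4.0, 5.0, 6.0], ["a", "b", "c"]], n_rows=3)
```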

@jorisvandenbossche
Member Author

@TomAugspurger can you repeat what exactly you prefer regarding the diff? (I missed that point a bit on the call.)

@TomAugspurger
Contributor

TomAugspurger commented Sep 4, 2020 via email

@jorisvandenbossche
Member Author

Some updates:

  • Added an apply_with_block that wraps the array in a Block to use some Block functionality such as replace, where, putmask, ... (this can temporarily be used as a workaround until we can factor things out of the blocks).
  • Skipped a bunch of JSON tests, because those otherwise segfault (instead of failing, which then stops the test run). The segfault is because it relies on accessing the blocks in the C code (-> INT: the json C code should not deal with blocks #27164)
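The wrap-and-delegate idea behind apply_with_block can be sketched as follows. This is a toy model under stated assumptions: `ToyBlock` and `ToyArrayManager` are hypothetical stand-ins, plain lists replace NumPy arrays, and the real method's signature may differ.

```python
class ToyBlock:
    """Hypothetical stand-in for a Block holding a single column."""

    def __init__(self, values):
        self.values = list(values)

    def replace(self, to_replace, value):
        # Existing "Block functionality" that the manager wants to reuse.
        return ToyBlock([value if v == to_replace else v for v in self.values])


class ToyArrayManager:
    """Hypothetical stand-in for a manager storing a list of 1D arrays."""

    def __init__(self, arrays):
        self.arrays = [list(a) for a in arrays]

    def apply_with_block(self, method_name, **kwargs):
        result = []
        for arr in self.arrays:
            block = ToyBlock(arr)                              # wrap the array
            new_block = getattr(block, method_name)(**kwargs)  # reuse Block logic
            result.append(new_block.values)                    # unwrap the result
        return ToyArrayManager(result)


mgr = ToyArrayManager([[1, 2, 1], [3, 1, 4]])
replaced = mgr.apply_with_block("replace", to_replace=1, value=0)
print(replaced.arrays)  # [[0, 2, 0], [3, 0, 4]]
```

The wrapping cost is paid per column, which is why it is described above as a temporary workaround rather than the final design.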

Big chunks of failing tests relate to: tests with extension dtypes (since I am only handling np.ndarrays for now), some other functionality relying on the blocks (eg HDF IO), some not-yet-implemented functionality (quantile, unstack).

@TomAugspurger
Contributor

When do you want to handle (/ only use) ExtensionArrays? Would that be done before merging?

@jorisvandenbossche
Member Author

Yeah, I was also wondering about that today. Given that we want that eventually, it might make sense to just do the switch now?
It will make comparing performance a bit harder, but getting things working correctly first is probably more important; we can optimize later.

@TomAugspurger
Contributor

TomAugspurger commented Sep 5, 2020 via email


 if ignore_index:
     axis = 1 if isinstance(self, ABCDataFrame) else 0
-    new_data.axes[axis] = ibase.default_index(len(indexer))
+    new_data.set_axis(axis, ibase.default_index(len(indexer)))
Member

can you comment on why these need to be changed?

Member Author

In hindsight, it's probably not needed anymore. Originally, I changed the axes attribute on ArrayManager to be (index, columns) order. But afterwards, I changed this to be stored in _axes, and having an axes property that switches the internal order, to match the "public" interface of BlockManager.

So with that change, I could revert those edits (I think, but can check that)

Member Author

jorisvandenbossche commented Jan 8, 2021

Just checked this again, and this change is still needed (without it I get failures for the array manager).
The reason is that mgr.axes is no longer the actual "storage" of the axes for the ArrayManager; they are stored in _axes instead. And because mgr.axes returns a new list, the assignment updates that list in place and not the actual storage, which is a bit of a gotcha now with the manager's axes attribute.

But based on a quick check, this seems to be the only place where we assign to mgr.axes[i] = ..
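The gotcha can be reproduced with a small toy class. `ToyManager` below is a hypothetical stand-in, not the ArrayManager from this PR; it only assumes what is described above: internal storage in `_axes` in (index, columns) order, and an `axes` property returning a list in the reversed "public" order.

```python
class ToyManager:
    """Hypothetical stand-in: stores axes internally as (index, columns)."""

    def __init__(self, index, columns):
        self._axes = [index, columns]  # internal storage order

    @property
    def axes(self):
        # Builds a fresh list in the "public" (columns, index) order on every access.
        return [self._axes[1], self._axes[0]]

    def set_axis(self, axis, new_labels):
        # Translate the public axis number to the internal position.
        self._axes[1 - axis] = new_labels


mgr = ToyManager(index=["r0", "r1"], columns=["a", "b"])

mgr.axes[1] = [0, 1]     # mutates a temporary list; the storage is untouched
print(mgr.axes[1])       # still ['r0', 'r1']

mgr.set_axis(1, [0, 1])  # goes through the manager, so the storage is updated
print(mgr.axes[1])       # now [0, 1]
```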

Member

thanks for explaining. we should make another attempt to get axes out of FooManager before too long

@jorisvandenbossche
Member Author

@jreback I tried to update according to your comments regarding the conversion of different types of managers: I moved the bulk of the current implementation from frame.py to internals/construction.py, and simplified the code in frame.py. If you can have another look, see the second-to-last commit.

@jreback
Contributor

jreback commented Jan 8, 2021

thanks @jorisvandenbossche looks pretty good.

  • can you do the Manager=Union[ArrayManager, BlockManager] in typing? (you may have commented on why not but didn't see it)
  • can you benchmark key things (df construction and ops) to see what slowdown this code adds? I suspect it's just a very small amount because of the additional if checks, but it would be nice to see

otherwise rebase and looks ok to merge. cc @jbrockmendel

@@ -19,7 +19,8 @@ def test_to_numpy_dtype(self):

     def test_to_numpy_copy(self):
         arr = np.random.randn(4, 3)
-        df = DataFrame(arr)
+        with option_context("mode.data_manager", "block"):
Member

would @td.skip_array_manager_invalid_test make more sense here?

Member Author

> would @td.skip_array_manager_invalid_test make more sense here?

Indeed, will update (I think this was from before I added the decorators)

@@ -50,6 +52,21 @@ def concatenate_block_managers(
     -------
     BlockManager
     """
+    if isinstance(mgrs_indexers[0][0], ArrayManager):
Member

are we assuming that they are either all-ArrayManager or all-BlockManager?

Member Author

> are we assuming that they are either all-ArrayManager or all-BlockManager?

Yes, and this concat implementation is currently very limited in general (e.g. it only handles the simple case where no reindexing is needed; several tests are skipped with a "TODO(ArrayManager) concat with reindexing" note because of this).
Concatenation is one of the big areas of work for follow-up on this initial PR.
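As a rough sketch of the dispatch and the "simple case only" limitation, the toy function below makes the all-same-type assumption explicit instead of inspecting only the first manager. All names are hypothetical placeholders, plain lists stand in for arrays, and the real function takes (manager, indexers) pairs rather than bare managers.

```python
from itertools import chain


class ToyBlockManager:
    """Placeholder, not the real pandas BlockManager."""


class ToyArrayManager:
    """Placeholder manager holding a list of equal-length 1D columns."""

    def __init__(self, arrays):
        self.arrays = [list(a) for a in arrays]


def concatenate_managers(mgrs):
    """Row-wise concat for the simple case: same columns, no reindexing."""
    first = mgrs[0]
    # Make the assumption explicit: no mixing of manager types.
    if not all(type(m) is type(first) for m in mgrs):
        raise TypeError("cannot concatenate a mix of manager types")

    if isinstance(first, ToyArrayManager):
        n_cols = len(first.arrays)
        if any(len(m.arrays) != n_cols for m in mgrs):
            raise NotImplementedError("concat with reindexing")
        # Stitch column i of every manager together, one column at a time.
        new_arrays = [
            list(chain.from_iterable(m.arrays[i] for m in mgrs))
            for i in range(n_cols)
        ]
        return ToyArrayManager(new_arrays)

    raise NotImplementedError("block-based path omitted from this sketch")


a = ToyArrayManager([[1, 2], [10, 20]])
b = ToyArrayManager([[3], [30]])
print(concatenate_managers([a, b]).arrays)  # [[1, 2, 3], [10, 20, 30]]
```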

@jbrockmendel
Member

A handful of small comments, generally looks nice. Over the weekend I'll see if I can chip away at the disabled tests

@jbrockmendel
Member

Running python3 -m pytest pandas/tests --skip-slow --skip-db --array-manager locally, I got a segfault (MacOS 11.1, py3.9.1) in tests.io.test_fsspec

@jorisvandenbossche
Member Author

> running python3 -m pytest pandas/tests --skip-slow --skip-db --array-manager locally i got a segfault (MacOS 11.1, py3.9.1) in tests.io.test_fsspec

Yes, json tests segfault (because the C code is expecting a BlockManager). I skipped them when I started this PR, but the one you reference was added more recently. Will add a skip for those as well (for CI, I am currently only running a subset of tests/frame/methods that passes)

@jorisvandenbossche
Member Author

@jreback

  • can you do the Manager=Union[ArrayManager, BlockManager] in typing? (you may have commented on why not but didn't see it)

So regarding using DataManager base class in typing, see the explanation above: #36010 (comment) (it's probably possible, but quite some work to get working, so I would rather not do it for this PR).
But I suppose here you are only meaning using an alias for the Union? Added that in the latest commits.
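For readers unfamiliar with the pattern, a Union alias of this kind looks roughly like the sketch below. The two classes are empty placeholders rather than the real pandas managers, and `describe` is a hypothetical function added only to show the alias in an annotation.

```python
from typing import Union


class BlockManager:
    """Placeholder for the block-based manager (not the real pandas class)."""


class ArrayManager:
    """Placeholder for the array-based manager (not the real pandas class)."""


# A single alias that annotations can use instead of repeating the Union.
Manager = Union[ArrayManager, BlockManager]


def describe(mgr: Manager) -> str:
    """Return the name of the concrete manager type."""
    return type(mgr).__name__


print(describe(ArrayManager()))  # ArrayManager
```

Unlike a shared DataManager base class, the alias needs no changes to the class hierarchy; it only shortens annotations.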

  • can you benchmark key things (df construction and ops) to see what slowdown this code adds? I suspect it's just a very small amount because of the additional if checks, but it would be nice to see

I think the only potentially impacted code path (for normal use, so without enabling ArrayManager) right now is the DataFrame construction. The inline comment is a bit hidden, but see #36010 (comment) for some timings related to that. Summary is that the checking of the option only has a small impact (ca 2 µs), while the constructor itself already takes relatively more time (eg a pd.DataFrame(np.random.randn(4,3)) takes 50 µs, a pd.DataFrame({'a': [1, 2, 3]}) takes 200 µs).

@jorisvandenbossche
Member Author

I think I addressed all remaining comments, so I am planning to merge this (it's getting a bit annoying to keep rebasing this, and I also don't plan to do substantial new feature work in this PR, there is plenty for follow-ups).
Getting more tests passing can also be done in targeted follow-ups (we need to skip many tests right now anyway, because of lacking features).

I will make an overview of the different areas of work for follow-ups.

jorisvandenbossche merged commit 4e93eb6 into pandas-dev:master Jan 13, 2021
jorisvandenbossche deleted the array-manager branch January 13, 2021 13:23
jorisvandenbossche added this to the 1.3 milestone Jan 13, 2021
@jorisvandenbossche
Member Author

Thanks all for the reviews!

I created an overview issue to track the different follow-ups here: #39146

Labels
  • Internals: Related to non-user accessible pandas implementation
  • Needs Discussion: Requires discussion from core team before further action
  • Refactor: Internal refactoring of code
5 participants