
CLN: dont consolidate in reshape.concat #34683

Merged: 19 commits, Dec 17, 2020


jbrockmendel
Member

Looks like this consolidation was added in 3b1c5b7 in 2012, with no clear reason why it is needed. About to start an asv run.

@jorisvandenbossche
Member

If you want to run benchmarks, I think you will need to run a dedicated benchmark where you create some non-consolidated data, as an impacted case is not necessarily covered by ASV.

@jreback
Contributor

jreback commented Jun 10, 2020

> If you want to run benchmarks, I think you will need to run a dedicated benchmark where you create some non-consolidated data, as an impacted case is not necessarily covered by ASV.

if we don’t have asvs or tests then there is nothing to do and the change is fine. If you want to create and push an asv, great, but we cannot benchmark against some hypothetical asv.

@jorisvandenbossche
Member

jorisvandenbossche commented Jun 10, 2020

> if we don’t have asvs or tests then there is nothing to do and the change is fine.

Sorry, but I completely disagree with that statement. Regularly on PRs that might impact performance, we ask PR authors to ensure there is a benchmark for the case they are changing. If there isn't one yet, we add one. We know very well that our benchmark coverage is not complete.

For a general check, we can of course only rely on existing benchmarks, but for a PR that targets a very specific use case, we can test that specifically or add a benchmark.

@jreback
Contributor

jreback commented Jun 10, 2020

my point, which you missed, is that we already have tons of asvs for functions which hit this

but you don’t want to merge because we might impact some hypothetical which we can’t even write down?

@jorisvandenbossche
Member

I am not sure we have any asvs specifically for this (we might, but can you then point to one?).

I don't follow which hypothetical case you are talking about. You mean a non-consolidated dataframe? That's not hypothetical (e.g. add a column to a DataFrame, and you have a not-yet-consolidated dataframe). Why would we otherwise have those consolidate calls, if they never had any effect?

Maybe in my original comment the "non-consolidated data" was a bit confusing. I am not talking about a hypothetical non-consolidating BlockManager, but about an actual, current DataFrame that is not consolidated at the moment when calling concat.
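
To make the "add a column" case concrete, here is a minimal sketch, assuming a reasonably recent pandas. It relies on private internals (`_mgr`, `nblocks`, `is_consolidated()`, `_consolidate()`), which are not public API and may be named or behave differently in other versions, so treat it as an illustration rather than a supported way to inspect a DataFrame.

```python
import numpy as np
import pandas as pd

# Adding columns one at a time stores each new column in its own block,
# so the frame ends up non-consolidated.
df = pd.DataFrame(index=range(100))
for i in range(10):
    df[i] = np.random.randn(len(df))

# ASSUMPTION: `_mgr`, `nblocks`, `is_consolidated()` and `_consolidate()` are
# private pandas internals and may change between versions.
blocks_before = df._mgr.nblocks
consolidated_before = df._mgr.is_consolidated()

# Consolidating merges blocks of the same dtype into a single block.
merged = df._consolidate()
blocks_after = merged._mgr.nblocks

print(blocks_before, consolidated_before, blocks_after)
```

All ten columns here are float64, so consolidation should collapse them into one block; a frame with mixed dtypes would keep one block per dtype.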

@TomAugspurger
Contributor

Something like

In [21]: df = pd.DataFrame(index=list(range(100)))

In [22]: df1 = pd.DataFrame(index=list(range(100)))

In [23]: df2 = pd.DataFrame(index=list(range(100)))

In [24]: for i in range(10):
    ...:     df1[i] = np.random.randn(len(df))
    ...:     df2[i] = np.random.randn(len(df))
    ...:

In [25]: pd.concat([df1, df2])

Do we currently always consolidate in concat? Does it matter whether the non-concatenation axis is already aligned?

@jorisvandenbossche
Member

Indeed, something like that.

BTW, with my original comment, I didn't mean to ask for much: just a small example like the one Tom showed (but maybe with somewhat larger data), and a %timeit on concat with master vs this branch (which is actually much less work than running the full ASV suite).

And quite probably, such a timing would show there isn't much impact (or maybe it is even faster without consolidating, since it's one copy less?), but if we don't check that, we don't know. Such a small check is IMO a minimum for PRs like this.

@jbrockmendel
Member Author

jbrockmendel commented Jun 10, 2020

Looks like we've got one 933x slower case.

Update: per the comment below, the 933x appears to be incorrect.

       before           after         ratio
     [c45e92c3]       [64e96929]
     <ref-consolidate_less-7>       <cln-consolidate-concat>
+     1.03±0.05μs         3.09±2μs     3.00  index_cached_properties.IndexCache.time_shape('Float64Index')
+     1.25±0.04μs         3.47±2μs     2.78  index_cached_properties.IndexCache.time_shape('UInt64Index')
+        848±40ns         2.05±1μs     2.42  index_cached_properties.IndexCache.time_values('Float64Index')
+     1.25±0.03ms       2.39±0.4ms     1.90  index_cached_properties.IndexCache.time_is_unique('DatetimeIndex')
+      2.26±0.1μs         4.29±2μs     1.90  index_cached_properties.IndexCache.time_shape('TimedeltaIndex')
+        902±70ns       1.60±0.5μs     1.78  index_cached_properties.IndexCache.time_inferred_type('TimedeltaIndex')
+      10.7±0.4ms         17.7±1ms     1.66  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
+      1.62±0.1ms       2.65±0.3ms     1.64  index_object.Indexing.time_get_loc_non_unique('Float')
+       927±100ns       1.48±0.5μs     1.60  index_cached_properties.IndexCache.time_is_all_dates('TimedeltaIndex')
+        897±40ns         1.43±1μs     1.59  index_cached_properties.IndexCache.time_values('UInt64Index')
+        972±40ns      1.55±0.06μs     1.59  index_cached_properties.IndexCache.time_values('PeriodIndex')
+        27.5±1μs         43.5±5μs     1.58  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+      11.4±0.5ms       17.8±0.9ms     1.56  indexing.NumericSeriesIndexing.time_getitem_lists(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
+      13.6±0.2ms         20.9±2ms     1.54  index_object.SetOperations.time_operation('date_string', 'union')
+      3.90±0.1ms       5.99±0.6ms     1.53  reindex.DropDuplicates.time_frame_drop_dups_int(True)
+       265±100μs        403±100μs     1.52  index_cached_properties.IndexCache.time_is_monotonic_increasing('MultiIndex')
+      13.6±0.2ms         20.4±3ms     1.51  index_object.SetOperations.time_operation('date_string', 'intersection')
+     5.60±0.03ms       8.30±0.9ms     1.48  io.hdf.HDFStoreDataFrame.time_store_info
+      21.9±0.2μs         32.1±1μs     1.47  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.Int64Index'>, 'nonunique_monotonic_inc')
+        951±30μs       1.33±0.1ms     1.40  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
+     1.12±0.02ms       1.54±0.3ms     1.38  index_object.Indexing.time_get_loc_non_unique('Int')
+     2.04±0.02ms       2.79±0.1ms     1.37  rolling.Quantile.time_quantile('Series', 1000, 'float', 0, 'higher')
+     1.69±0.04ms       2.31±0.4ms     1.36  index_cached_properties.IndexCache.time_is_unique('MultiIndex')
+     12.0±0.09ms         16.4±2ms     1.36  io.hdf.HDFStoreDataFrame.time_read_store_table_wide
+     4.02±0.01ms         5.44±1ms     1.35  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda> at 0x7f8a53b547a0>, 'cython')
+      99.7±0.8μs         135±40μs     1.35  index_cached_properties.IndexCache.time_is_monotonic_increasing('TimedeltaIndex')
+       99.0±10μs         131±20μs     1.33  index_cached_properties.IndexCache.time_is_monotonic('UInt64Index')
+     1.04±0.04ms      1.38±0.06ms     1.33  indexing.NumericSeriesIndexing.time_getitem_list_like(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+      2.53±0.1μs         3.35±1μs     1.32  index_cached_properties.IndexCache.time_shape('MultiIndex')
+      2.73±0.3μs       3.60±0.7μs     1.32  index_cached_properties.IndexCache.time_shape('PeriodIndex')
+         141±6μs         186±20μs     1.32  indexing.NumericSeriesIndexing.time_getitem_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+     3.99±0.07ms         5.20±1ms     1.30  reindex.DropDuplicates.time_frame_drop_dups_na(True)
+     1.51±0.03ms       1.96±0.1ms     1.30  io.csv.ReadCSVFloatPrecision.time_read_csv(',', '_', 'high')
+     5.74±0.07μs       7.39±0.8μs     1.29  index_object.Indexing.time_get_loc_non_unique_sorted('Int')
+     1.46±0.05ms       1.87±0.2ms     1.28  index_cached_properties.IndexCache.time_is_unique('IntervalIndex')
+         656±2μs        840±200μs     1.28  reindex.ReindexMethod.time_reindex_method('backfill', <function period_range at 0x7f8a4e893d40>)
+     1.40±0.03ms       1.79±0.2ms     1.27  index_cached_properties.IndexCache.time_is_unique('TimedeltaIndex')
+     1.39±0.05ms       1.77±0.2ms     1.27  index_cached_properties.IndexCache.time_is_unique('UInt64Index')
+     5.95±0.04ms       7.56±0.8ms     1.27  strings.Cat.time_cat(0, ',', None, 0.001)
+      30.5±0.3μs         38.6±4μs     1.26  index_cached_properties.IndexCache.time_shape('Int64Index')
+     1.33±0.05ms       1.67±0.4ms     1.25  index_cached_properties.IndexCache.time_is_unique('PeriodIndex')
+         132±6μs          165±6μs     1.25  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.Float64Index'>, 'nonunique_monotonic_inc')
+         298±5μs         373±20μs     1.25  join_merge.Append.time_append_homogenous
+      2.78±0.2μs       3.47±0.4μs     1.25  index_cached_properties.IndexCache.time_shape('CategoricalIndex')
+      3.24±0.2μs       4.05±0.9μs     1.25  index_cached_properties.IndexCache.time_shape('IntervalIndex')
+     3.51±0.04ms       4.38±0.7ms     1.25  index_object.IntervalIndexMethod.time_intersection(100000)
+     3.53±0.05ms       4.37±0.4ms     1.24  timeseries.AsOf.time_asof('DataFrame')
+     2.36±0.06μs       2.91±0.3μs     1.23  index_cached_properties.IndexCache.time_shape('DatetimeIndex')
+     1.82±0.04ms      2.24±0.06ms     1.23  indexing.NumericSeriesIndexing.time_getitem_array(<class 'pandas.core.indexes.numeric.Int64Index'>, 'unique_monotonic_inc')
+         226±1ns         277±40ns     1.23  multiindex_object.Integer.time_is_monotonic
+     13.3±0.04ms       16.2±0.2ms     1.22  timeseries.DatetimeAccessor.time_dt_accessor_day_name(None)
+      1.90±0.1μs       2.31±0.4μs     1.22  index_cached_properties.IndexCache.time_values('CategoricalIndex')
+         342±6ns         414±40ns     1.21  index_object.IntervalIndexMethod.time_is_unique(100000)
+     1.49±0.08μs       1.80±0.4μs     1.21  index_cached_properties.IndexCache.time_inferred_type('MultiIndex')
+     7.85±0.04ms       9.42±0.3ms     1.20  io.hdf.HDFStoreDataFrame.time_query_store_table
+        457±20μs        548±200μs     1.20  index_cached_properties.IndexCache.time_engine('MultiIndex')
+     1.46±0.04ms       1.75±0.2ms     1.20  index_cached_properties.IndexCache.time_is_unique('CategoricalIndex')
+     1.03±0.01ms      1.23±0.08ms     1.20  series_methods.NSort.time_nlargest('last')
+        832±40ns        993±200ns     1.19  index_cached_properties.IndexCache.time_is_all_dates('UInt64Index')
+         155±3μs         185±10μs     1.19  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Float32Engine'>, <class 'numpy.float32'>), 'monotonic_incr')
+         204±2ns         242±30ns     1.19  index_object.IntervalIndexMethod.time_is_unique(1000)
+     1.07±0.06μs       1.26±0.1μs     1.18  index_cached_properties.IndexCache.time_is_all_dates('MultiIndex')
+        118±10μs         139±80μs     1.18  index_cached_properties.IndexCache.time_is_monotonic_increasing('UInt64Index')
+       138±0.4μs         163±30μs     1.18  index_cached_properties.IndexCache.time_is_monotonic_decreasing('Float64Index')
+      1.65±0.1μs       1.95±0.2μs     1.18  index_cached_properties.IndexCache.time_is_all_dates('IntervalIndex')
+       236±0.9μs         276±40μs     1.17  reshape.Explode.time_explode(100, 10)
+      37.4±0.2ms         43.8±4ms     1.17  rolling.Quantile.time_quantile('Series', 10, 'int', 0.5, 'higher')
+        573±20ns         668±30ns     1.17  index_cached_properties.IndexCache.time_is_monotonic_increasing('RangeIndex')
+         204±1μs         237±30μs     1.16  indexing.CategoricalIndexIndexing.time_get_loc_scalar('monotonic_decr')
+      50.7±0.6ms         58.8±5ms     1.16  rolling.Quantile.time_quantile('DataFrame', 1000, 'int', 0.5, 'midpoint')
+     3.47±0.03ms       4.03±0.1ms     1.16  io.hdf.HDFStoreDataFrame.time_read_store_table
+        418±20ns         485±10ns     1.16  index_cached_properties.IndexCache.time_inferred_type('RangeIndex')
+      4.24±0.3μs         4.90±1μs     1.16  index_cached_properties.IndexCache.time_engine('TimedeltaIndex')
+      40.1±0.2ms         46.0±5ms     1.15  rolling.Quantile.time_quantile('DataFrame', 10, 'float', 0.5, 'nearest')
+      18.1±0.1ms         20.7±2ms     1.15  rolling.Apply.time_rolling('Series', 3, 'int', <built-in function sum>, False)
+        769±40ns         879±80ns     1.14  index_cached_properties.IndexCache.time_inferred_type('Float64Index')
+        921±60ns       1.05±0.3μs     1.14  index_cached_properties.IndexCache.time_inferred_type('UInt64Index')
+     5.76±0.04ms       6.56±0.3ms     1.14  series_methods.ValueCounts.time_value_counts('object')
+      35.4±0.5μs         40.2±3μs     1.14  inference.ToNumeric.time_from_float('ignore')
+     8.61±0.08ms       9.78±0.9ms     1.14  timeseries.Iteration.time_iter_preexit(<function date_range at 0x7f8a4e833f80>)
+      69.6±0.6ms         79.0±9ms     1.13  reshape.Cut.time_qcut_datetime(1000)
+        633±30ns         718±60ns     1.13  index_cached_properties.IndexCache.time_is_monotonic_increasing('Int64Index')
+       100±0.8μs         114±10μs     1.13  index_cached_properties.IndexCache.time_is_monotonic('TimedeltaIndex')
+         179±1μs         202±20μs     1.13  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int32Engine'>, <class 'numpy.int32'>), 'monotonic_decr')
+      38.6±0.2ms         43.5±4ms     1.13  rolling.Quantile.time_quantile('Series', 10, 'int', 0.5, 'nearest')
+     2.83±0.03ms       3.19±0.1ms     1.12  timeseries.ResampleSeries.time_resample('period', '1D', 'ohlc')
+        699±20ns         785±20ns     1.12  index_cached_properties.IndexCache.time_is_monotonic('RangeIndex')
+     2.64±0.03ms       2.95±0.2ms     1.12  timeseries.ResampleSeries.time_resample('period', '1D', 'mean')
+         100±1μs          112±9μs     1.12  index_cached_properties.IndexCache.time_is_monotonic_decreasing('TimedeltaIndex')
+       125±0.8ms          140±8ms     1.12  inference.ToNumericDowncast.time_downcast('string-int', 'signed')
+         273±2μs         304±20μs     1.11  period.Indexing.time_intersection
+      24.2±0.1ms         26.9±2ms     1.11  inference.ToNumericDowncast.time_downcast('int-list', 'integer')
+       114±0.2ms          127±8ms     1.11  inference.ToNumericDowncast.time_downcast('string-nint', 'float')
+      34.7±0.3μs         38.4±2μs     1.11  inference.ToNumeric.time_from_float('coerce')
+      56.2±0.6ms         62.1±4ms     1.10  indexing.CategoricalIndexIndexing.time_get_indexer_list('monotonic_decr')
+         576±7μs         636±60μs     1.10  index_object.IntervalIndexMethod.time_intersection_both_duplicate(1000)
+      89.0±0.1ms         98.1±7ms     1.10  inference.ToNumericDowncast.time_downcast('string-float', None)
+     1.07±0.05μs      1.18±0.08μs     1.10  index_cached_properties.IndexCache.time_values('DatetimeIndex')
+         157±1μs         172±20μs     1.10  index_cached_properties.IndexCache.time_is_monotonic_decreasing('CategoricalIndex')
+      42.8±0.4ms       46.9±0.7ms     1.10  timedelta.ToTimedeltaErrors.time_convert('ignore')
+     2.11±0.02ms      2.31±0.04ms     1.09  series_methods.IsIn.time_isin('object')
+       169±0.9μs         184±20μs     1.09  indexing_engines.NumericEngineIndexing.time_get_loc((<class 'pandas._libs.index.Int16Engine'>, <class 'numpy.int16'>), 'non_monotonic')
+      86.1±0.3ms         93.4±6ms     1.08  rolling.Apply.time_rolling('Series', 3, 'float', <function Apply.<lambda> at 0x7f8a539fc5f0>, False)
+     2.62±0.07μs       2.84±0.3μs     1.08  index_cached_properties.IndexCache.time_engine('UInt64Index')
+         102±1ms          110±9ms     1.08  index_cached_properties.IndexCache.time_is_monotonic_increasing('IntervalIndex')
+        446±20ns         482±20ns     1.08  index_cached_properties.IndexCache.time_is_unique('RangeIndex')
+     1.00±0.01μs      1.08±0.09μs     1.08  index_object.Range.time_min
+        78.2±1μs         84.3±3μs     1.08  indexing.NonNumericSeriesIndexing.time_getitem_pos_slice('period', 'nonunique_monotonic_inc')
+        869±10ns         934±10ns     1.07  index_object.Range.time_min_trivial
+     4.06±0.06ms      4.35±0.09ms     1.07  tslibs.offsets.OnOffset.time_on_offset(<CustomBusinessMonthBegin>)
+      1.78±0.1μs       1.90±0.2μs     1.07  index_cached_properties.IndexCache.time_inferred_type('IntervalIndex')
+      14.6±0.3μs       15.7±0.6μs     1.07  tslibs.offsets.OffestDatetimeArithmetic.time_subtract(<MonthBegin>)
+       102±0.4μs          109±4μs     1.07  index_cached_properties.IndexCache.time_is_monotonic('PeriodIndex')
+        42.6±1ms         45.5±1ms     1.07  timedelta.ToTimedeltaErrors.time_convert('coerce')
+        420±20ns         447±20ns     1.07  index_cached_properties.IndexCache.time_inferred_type('Int64Index')
+       107±0.8ms          114±6ms     1.06  index_cached_properties.IndexCache.time_is_monotonic_decreasing('IntervalIndex')
+            189M             201M     1.06  io.json.ReadJSONLines.peakmem_read_json_lines('int')
+      30.1±0.3μs       31.9±0.9μs     1.06  index_cached_properties.IndexCache.time_values('RangeIndex')
+        78.8±1μs         83.5±1μs     1.06  indexing.NonNumericSeriesIndexing.time_getitem_pos_slice('period', 'non_monotonic')
+     1.82±0.06ms       1.93±0.1ms     1.06  index_cached_properties.IndexCache.time_is_all_dates('CategoricalIndex')
+        432±10ns         457±10ns     1.06  index_cached_properties.IndexCache.time_is_unique('Int64Index')
+            133M             141M     1.06  io.json.ToJSON.peakmem_to_json_wide('columns', 'df_int_float_str')
+           55.2M            58.4M     1.06  rolling.VariableWindowMethods.peakmem_rolling('DataFrame', '1d', 'float', 'median')
+         115±1ms          121±5ms     1.06  join_merge.MergeAsof.time_by_int('forward', 5)
+         102±1ms          107±4ms     1.06  index_cached_properties.IndexCache.time_is_monotonic('IntervalIndex')
+         158±1μs         167±10μs     1.06  index_cached_properties.IndexCache.time_is_monotonic_increasing('CategoricalIndex')
+     2.26±0.09μs       2.38±0.1μs     1.05  index_cached_properties.IndexCache.time_engine('Float64Index')
+        81.5±1μs         85.7±3μs     1.05  series_methods.SearchSorted.time_searchsorted('float32')
+      31.6±0.2μs       33.2±0.7μs     1.05  index_cached_properties.IndexCache.time_engine('Int64Index')
+      56.5±0.4ms       58.9±0.3ms     1.04  groupby.GroupByMethods.time_dtype_as_field('datetime', 'unique', 'direct')
+      30.6±0.3μs         31.9±1μs     1.04  index_cached_properties.IndexCache.time_values('Int64Index')
+      62.9±0.2μs         65.3±2μs     1.04  indexing.NumericSeriesIndexing.time_getitem_scalar(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'unique_monotonic_inc')
+         102±1μs          106±2μs     1.04  index_cached_properties.IndexCache.time_is_monotonic_decreasing('PeriodIndex')
+      68.8±0.6μs         71.4±2μs     1.04  indexing.NumericSeriesIndexing.time_iloc_array(<class 'pandas.core.indexes.numeric.Float64Index'>, 'unique_monotonic_inc')
+       101±0.2μs          105±2μs     1.04  index_cached_properties.IndexCache.time_is_monotonic('DatetimeIndex')
+       139±0.4μs          144±5μs     1.04  index_cached_properties.IndexCache.time_is_monotonic('Float64Index')
+        443±20ns         458±20ns     1.03  index_cached_properties.IndexCache.time_is_all_dates('RangeIndex')
+     10.7±0.05ms       11.0±0.2ms     1.03  inference.ToNumericDowncast.time_downcast('datetime64', 'integer')
+       158±0.9μs          163±5μs     1.03  index_cached_properties.IndexCache.time_is_monotonic('CategoricalIndex')
+       102±0.3μs          105±2μs     1.03  index_cached_properties.IndexCache.time_is_monotonic_increasing('DatetimeIndex')
+        979±30ns      1.01±0.03μs     1.03  index_cached_properties.IndexCache.time_is_all_dates('PeriodIndex')
+            108M             111M     1.02  io.json.ToJSON.peakmem_to_json('split', 'df_int_floats')
+            114M             116M     1.02  io.json.ToJSON.peakmem_to_json_wide('values', 'df_date_idx')
+         203±2ms         207±20ms     1.02  index_cached_properties.IndexCache.time_values('MultiIndex')
+     5.13±0.01ms      5.22±0.07ms     1.02  sparse.Arithmetic.time_intersect(0.1, nan)
+            114M             116M     1.02  io.json.ToJSON.peakmem_to_json_wide('split', 'df')
-      53.7±0.7μs         53.0±1μs     0.99  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7f8a4fab45f0>, False, 'int')
-     1.02±0.03μs      1.00±0.03μs     0.98  index_cached_properties.IndexCache.time_inferred_type('PeriodIndex')
-     1.75±0.07ms      1.71±0.03ms     0.98  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'int')
-           59.1M            57.7M     0.98  rolling.VariableWindowMethods.peakmem_rolling('Series', '1d', 'int', 'median')
-           57.7M            56.1M     0.97  rolling.VariableWindowMethods.peakmem_rolling('DataFrame', '1d', 'int', 'median')
-      2.15±0.2ms      2.09±0.05ms     0.97  ctors.SeriesConstructors.time_series_constructor(<function arr_dict at 0x7f8a4fab47a0>, False, 'float')
-      3.16±0.1ms      3.07±0.05ms     0.97  ctors.SeriesConstructors.time_series_constructor(<function arr_dict at 0x7f8a4fab47a0>, False, 'int')
-      28.2±0.6μs       27.3±0.1μs     0.97  boolean.TimeLogicalOps.time_and_scalar
-      2.41±0.2ms      2.34±0.05ms     0.97  ctors.SeriesConstructors.time_series_constructor(<function gen_of_str at 0x7f8a4fab4710>, False, 'int')
-     1.74±0.07ms      1.68±0.04ms     0.97  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'int')
-        771±40μs         745±20μs     0.97  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists at 0x7f8a4fab4950>, False, 'int')
-         171±2ms        165±0.5ms     0.97  groupby.AggEngine.time_series_numba
-        778±30μs         752±10μs     0.97  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists at 0x7f8a4fab4950>, True, 'int')
-        409±20μs          395±7μs     0.97  ctors.SeriesConstructors.time_series_constructor(<function list_of_str at 0x7f8a4fab4680>, True, 'int')
-        747±20μs          721±9μs     0.97  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, False, 'float')
-        792±40μs         765±10μs     0.97  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none at 0x7f8a4fab4a70>, True, 'int')
-      2.43±0.1ms      2.34±0.04ms     0.96  ctors.SeriesConstructors.time_series_constructor(<function gen_of_tuples at 0x7f8a4fab48c0>, False, 'float')
-        763±20μs         736±10μs     0.96  ctors.SeriesConstructors.time_series_constructor(<class 'list'>, True, 'float')
-       769±100μs         742±10μs     0.96  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples at 0x7f8a4fab4830>, False, 'int')
-      2.59±0.2ms      2.49±0.05ms     0.96  ctors.SeriesConstructors.time_series_constructor(<function gen_of_str at 0x7f8a4fab4710>, False, 'float')
-        778±40μs         750±10μs     0.96  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples at 0x7f8a4fab4830>, True, 'int')
-        763±30μs         734±10μs     0.96  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples_with_none at 0x7f8a4fab49e0>, False, 'float')
-        233±30μs          224±2μs     0.96  groupby.GroupByMethods.time_dtype_as_field('float', 'var', 'transformation')
-      64.9±0.3μs         62.4±2μs     0.96  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7f8a4fab45f0>, True, 'int')
-       811±0.7μs         780±60μs     0.96  sparse.Arithmetic.time_make_union(0.01, 0)
-       799±100μs         768±30μs     0.96  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples_with_none at 0x7f8a4fab49e0>, True, 'float')
-        45.1±1μs       43.3±0.7μs     0.96  ctors.SeriesConstructors.time_series_constructor(<function no_change at 0x7f8a4fab45f0>, False, 'float')
-        394±10μs          378±6μs     0.96  ctors.SeriesConstructors.time_series_constructor(<function list_of_str at 0x7f8a4fab4680>, False, 'int')
-        781±30μs         749±20μs     0.96  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples_with_none at 0x7f8a4fab49e0>, True, 'int')
-        783±30μs         751±10μs     0.96  frame_ctor.FromRecords.time_frame_from_records_generator(1000)
-      3.27±0.2ms      3.13±0.06ms     0.96  ctors.SeriesConstructors.time_series_constructor(<function arr_dict at 0x7f8a4fab47a0>, True, 'int')
-        771±30μs         738±20μs     0.96  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none at 0x7f8a4fab4a70>, False, 'int')
-      2.44±0.3ms      2.33±0.04ms     0.96  ctors.SeriesConstructors.time_series_constructor(<function gen_of_tuples at 0x7f8a4fab48c0>, False, 'int')
-        779±40μs         743±20μs     0.95  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples at 0x7f8a4fab4830>, False, 'float')
-      11.3±0.2ms      10.7±0.05ms     0.95  frame_methods.Reindex.time_reindex_both_axes
-      2.27±0.1ms      2.16±0.05ms     0.95  ctors.SeriesConstructors.time_series_constructor(<function arr_dict at 0x7f8a4fab47a0>, True, 'float')
-        782±30μs         744±10μs     0.95  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none at 0x7f8a4fab4a70>, False, 'float')
-      11.2±0.2ms      10.7±0.04ms     0.95  arithmetic.Timeseries.time_timestamp_ops_diff('US/Eastern')
-      4.13±0.1μs      3.93±0.05μs     0.95  categoricals.SearchSorted.time_categorical_contains
-        804±50μs         764±20μs     0.95  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists_with_none at 0x7f8a4fab4a70>, True, 'float')
-        203±10ns          193±2ns     0.95  categoricals.IsMonotonic.time_categorical_index_is_monotonic_decreasing
-        781±40μs         740±20μs     0.95  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists at 0x7f8a4fab4950>, False, 'float')
-        732±50μs          694±5μs     0.95  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function ge>)
-         194±3μs          184±2μs     0.95  groupby.GroupByMethods.time_dtype_as_field('float', 'cumcount', 'direct')
-     2.13±0.02ms      2.02±0.05ms     0.95  arithmetic.DateInferOps.time_timedelta_plus_datetime
-        421±30μs          399±6μs     0.95  ctors.SeriesConstructors.time_series_constructor(<function list_of_str at 0x7f8a4fab4680>, True, 'float')
-      21.4±0.4ms       20.2±0.1ms     0.95  groupby.AggEngine.time_dataframe_cython
-     1.52±0.01μs      1.43±0.01μs     0.94  attrs_caching.SeriesArrayAttribute.time_extract_array_numpy('datetime64tz')
-     4.96±0.06μs      4.69±0.08μs     0.94  index_cached_properties.IndexCache.time_engine('DatetimeIndex')
-        949±60ns         895±40ns     0.94  index_cached_properties.IndexCache.time_is_monotonic_decreasing('Int64Index')
-      1.13±0.02s       1.06±0.01s     0.94  groupby.GroupByMethods.time_dtype_as_field('int', 'describe', 'direct')
-     1.23±0.04ms      1.16±0.01ms     0.94  arithmetic.NumericInferOps.time_modulo(<class 'numpy.int32'>)
-      11.6±0.2ms      10.9±0.04ms     0.94  arithmetic.Timeseries.time_timestamp_ops_diff_with_shift(None)
-        409±40μs          384±4μs     0.94  ctors.SeriesConstructors.time_series_constructor(<function list_of_str at 0x7f8a4fab4680>, False, 'float')
-     1.19±0.07ms      1.12±0.01ms     0.94  groupby.Datelike.time_sum('date_range_tz')
-       815±100μs         759±10μs     0.93  ctors.SeriesConstructors.time_series_constructor(<function list_of_lists at 0x7f8a4fab4950>, True, 'float')
-       819±200μs         754±10μs     0.92  ctors.SeriesConstructors.time_series_constructor(<function list_of_tuples at 0x7f8a4fab4830>, True, 'float')
-      2.47±0.1ms      2.25±0.07ms     0.91  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 5.0, <built-in function add>)
-      3.98±0.2μs      3.58±0.08μs     0.90  dtypes.InferDtypes.time_infer('np-int')
-        752±30μs         675±10μs     0.90  frame_methods.Iteration.time_items_cached
-            213M             190M     0.89  io.json.ReadJSONLines.peakmem_read_json_lines('datetime')
-        58.8±5ms         52.2±2ms     0.89  frame_ctor.FromRecords.time_frame_from_records_generator(None)
-         137±7ms          120±1ms     0.87  groupby.DateAttributes.time_len_groupby_object
-     1.01±0.08μs         871±40ns     0.86  index_cached_properties.IndexCache.time_is_monotonic_decreasing('RangeIndex')
-        856±80μs         714±10μs     0.83  algorithms.Quantile.time_quantile(0, 'linear', 'int')
-      9.96±0.3ms       8.04±0.3ms     0.81  algorithms.Duplicated.time_duplicated(False, 'first', 'string')
-       768±100μs          618±7μs     0.80  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function ne>)
-        12.7±3ms      8.65±0.05ms     0.68  frame_methods.Repr.time_repr_tall
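
Many rows above carry wide error bars and come from benchmarks that have nothing to do with concat. As a rough aid for reading such a table, the sketch below parses rows in this format and flags those whose before/after ranges overlap. The helper names (`parse_timing`, `parse_row`, `within_noise`) are hypothetical, and the overlap check is a crude heuristic, not asv's own significance logic.

```python
import re

# Multipliers to seconds; both the Greek mu and the micro sign are accepted.
_UNITS = {"ns": 1e-9, "\u03bcs": 1e-6, "\u00b5s": 1e-6, "ms": 1e-3, "s": 1.0}
_TIMING = re.compile(r"([\d.]+)±([\d.]+)(ns|[\u03bc\u00b5]s|ms|s)")


def parse_timing(cell):
    """Return (mean, error) in seconds for a cell like '3.09±2μs', else None."""
    match = _TIMING.fullmatch(cell)
    if match is None:
        return None
    value, err, unit = match.groups()
    scale = _UNITS[unit]
    return float(value) * scale, float(err) * scale


def parse_row(line):
    """Return (before, after, ratio, name), or None for non-timing rows."""
    parts = line.split(None, 4)
    if len(parts) != 5 or parts[0] not in ("+", "-"):
        return None
    before, after = parse_timing(parts[1]), parse_timing(parts[2])
    if before is None or after is None:
        return None  # e.g. peak-memory rows such as '189M'
    return before, after, float(parts[3]), parts[4]


def within_noise(before, after):
    """True if the two mean±error ranges overlap (crude heuristic)."""
    return abs(after[0] - before[0]) <= before[1] + after[1]


rows = [
    "+       265±100μs        403±100μs     1.52  index_cached_properties.IndexCache.time_is_monotonic_increasing('MultiIndex')",
    "+         298±5μs         373±20μs     1.25  join_merge.Append.time_append_homogenous",
    "+            189M             201M     1.06  io.json.ReadJSONLines.peakmem_read_json_lines('int')",
]

for line in rows:
    parsed = parse_row(line)
    if parsed is None:
        continue
    before, after, ratio, name = parsed
    label = "within noise" if within_noise(before, after) else "outside noise"
    print(f"{ratio:.2f}  {label}  {name}")
```

By this measure the `join_merge.Append.time_append_homogenous` row stands out from the noise, while the `MultiIndex` cached-property row does not.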

@jbrockmendel
Member Author

hmm the 933x one makes zero sense, since it doesn't go through the affected code path. Checking with timeit shows a much smaller change, so I'm writing that asv result off as incorrect.

@jbrockmendel
Member Author

jbrockmendel commented Jun 11, 2020

Using the example Tom suggested, this is appreciably slower:

import numpy as np
import pandas as pd

ncols = 10
nrows = 100

df = pd.DataFrame(index=list(range(nrows)))

df1 = pd.DataFrame(index=list(range(nrows)))
df2 = pd.DataFrame(index=list(range(nrows)))

for i in range(ncols):
    df1[i] = np.random.randn(len(df))
    df2[i] = np.random.randn(len(df))


In [22]: %timeit pd.concat([df1, df2])
346 µs ± 3.55 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- master
1.34 ms ± 68.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- PR

[...] re-construct the non-consolidated DataFrames

In [23]: %timeit pd.concat([df1.copy(), df2.copy()])
830 µs ± 35.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- master
1.31 ms ± 17.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)  # <-- PR

I expected the result in [22], am surprised by [23], will have to take a look at what is driving the results.

We need to decide how much (if any) perf penalty we're willing to accept to avoid side-effects.
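
Because the frames had to be rebuilt between [22] and [23], a script-level timing needs fresh inputs for every measured call. The following is a minimal sketch of one way to do that with the standard `timeit` module, assuming the side effect in question is concat consolidating its inputs in place. `make_frames` and `time_concat_fresh` are hypothetical helper names, not part of pandas.

```python
import timeit

import numpy as np
import pandas as pd


def make_frames(nrows=100, ncols=10):
    """Build two frames column by column, so they start non-consolidated."""
    df1 = pd.DataFrame(index=range(nrows))
    df2 = pd.DataFrame(index=range(nrows))
    for i in range(ncols):
        df1[i] = np.random.randn(nrows)
        df2[i] = np.random.randn(nrows)
    return df1, df2


def time_concat_fresh(repeats=200):
    """Best time in seconds for one pd.concat call on freshly built frames."""
    timer = timeit.Timer(
        stmt="pd.concat([df1, df2])",
        setup="df1, df2 = make_frames()",
        globals={"pd": pd, "make_frames": make_frames},
    )
    # number=1 means the setup runs again before every measured call, so no
    # call sees inputs that an earlier concat has already touched.
    return min(timer.repeat(repeat=repeats, number=1))


best = time_concat_fresh()
print(f"best single call: {best * 1e6:.0f} µs")
```

Single-call timings are noisier than a `%timeit` loop, so this trades precision for inputs that are guaranteed to be untouched; run it on both master and the branch and compare the results.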

@jbrockmendel
Member Author

Looks like the call to concatenate_block_managers is 8x slower on non-consolidated inputs; see below.

%prun -s cumulative for n in range(100): pd.concat([df1.copy(), df2.copy()])

PR:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.273    0.273 {built-in method builtins.exec}
        1    0.002    0.002    0.273    0.273 <string>:1(<module>)
      100    0.000    0.000    0.246    0.002 concat.py:67(concat)
      100    0.001    0.000    0.215    0.002 concat.py:451(get_result)
      100    0.004    0.000    0.200    0.002 concat.py:31(concatenate_block_managers)
     1000    0.002    0.000    0.072    0.000 concat.py:435(_is_uniform_join_units)
     3000    0.002    0.000    0.070    0.000 {built-in method builtins.all}
     3000    0.003    0.000    0.066    0.000 concat.py:450(<genexpr>)
     1100    0.005    0.000    0.064    0.000 concat.py:110(concat_compat)
     2000    0.008    0.000    0.063    0.000 concat.py:204(is_na)
     1100    0.006    0.000    0.043    0.000 concat.py:29(get_dtype_kinds)
     2000    0.001    0.000    0.041    0.000 missing.py:47(isna)
   126200    0.024    0.000    0.041    0.000 {built-in method builtins.isinstance}
     2000    0.003    0.000    0.041    0.000 missing.py:130(_isna)
     2000    0.006    0.000    0.034    0.000 missing.py:193(_isna_ndarraylike)
    15100    0.007    0.000    0.032    0.000 base.py:256(is_dtype)
      100    0.001    0.000    0.031    0.000 concat.py:292(__init__)
     3200    0.002    0.000    0.030    0.000 common.py:1180(needs_i8_conversion)
     1000    0.002    0.000    0.028    0.000 blocks.py:2705(make_block)
      100    0.000    0.000    0.028    0.000 concat.py:512(_get_new_axes)

master:

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    0.161    0.161 {built-in method builtins.exec}
        1    0.001    0.001    0.161    0.161 <string>:1(<module>)
      100    0.000    0.000    0.139    0.001 concat.py:67(concat)
      100    0.002    0.000    0.098    0.001 concat.py:292(__init__)
      200    0.000    0.000    0.066    0.000 generic.py:5309(_consolidate)
      200    0.000    0.000    0.065    0.000 generic.py:5301(_consolidate_inplace)
      200    0.000    0.000    0.065    0.000 generic.py:5290(_protect_consolidate)
      200    0.000    0.000    0.065    0.000 generic.py:5304(f)
      200    0.000    0.000    0.063    0.000 managers.py:934(consolidate)
      200    0.000    0.000    0.056    0.000 managers.py:950(_consolidate_inplace)
      200    0.001    0.000    0.049    0.000 managers.py:1847(_consolidate)
      100    0.001    0.000    0.040    0.000 concat.py:453(get_result)
      100    0.000    0.000    0.029    0.000 concat.py:514(_get_new_axes)
      100    0.000    0.000    0.029    0.000 concat.py:517(<listcomp>)
     4000    0.001    0.000    0.027    0.000 managers.py:1852(<lambda>)
     4000    0.006    0.000    0.025    0.000 blocks.py:167(_consolidate_key)
      100    0.001    0.000    0.024    0.000 concat.py:31(concatenate_block_managers)
      300    0.001    0.000    0.023    0.000 base.py:4175(equals)
      200    0.001    0.000    0.021    0.000 generic.py:5652(copy)
      300    0.002    0.000    0.020    0.000 missing.py:358(array_equivalent)
      200    0.002    0.000    0.020    0.000 managers.py:1864(_merge_blocks)
      200    0.000    0.000    0.019    0.000 managers.py:752(copy)
     4000    0.006    0.000    0.018    0.000 _dtype.py:333(_name_get)
      200    0.002    0.000    0.016    0.000 managers.py:362(apply)
      100    0.000    0.000    0.016    0.000 concat.py:521(_get_comb_axis)
      100    0.000    0.000    0.016    0.000 api.py:65(get_objs_combined_axis)
      200    0.001    0.000    0.016    0.000 {built-in method builtins.sorted}
    44400    0.009    0.000    0.015    0.000 {built-in method builtins.isinstance}
      100    0.000    0.000    0.015    0.000 api.py:109(_get_combined_index)
     1400    0.001    0.000    0.014    0.000 common.py:1180(needs_i8_conversion)
      200    0.001    0.000    0.013    0.000 concat.py:110(concat_compat)
      100    0.000    0.000    0.013    0.000 concat.py:531(_get_concat_axis)
      100    0.000    0.000    0.012    0.000 concat.py:588(_concat_indexes)
      100    0.000    0.000    0.012    0.000 base.py:4112(append)
      300    0.001    0.000    0.012    0.000 blocks.py:2705(make_block)
     5100    0.003    0.000    0.012    0.000 base.py:256(is_dtype)
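For anyone wanting to collect a comparable profile outside IPython, a minimal sketch using the standard library's cProfile and pstats might look like the following. The helper name profile_cumulative and the stand-in workload are hypothetical; the workload is a placeholder so the sketch runs without pandas installed.

```python
import cProfile
import io
import pstats


def profile_cumulative(func, n_calls=100, top=20):
    """Run func n_calls times under cProfile and return a text report
    sorted by cumulative time (similar in spirit to %prun -s cumulative)."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(n_calls):
        func()
    profiler.disable()

    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    stats.sort_stats("cumulative").print_stats(top)
    return buf.getvalue()


def stand_in_workload():
    # Placeholder workload (an assumption for this sketch); in the thread
    # the profiled statement was pd.concat([df1.copy(), df2.copy()]).
    return sorted(range(1000), reverse=True)


report = profile_cumulative(stand_in_workload)
print(report)
```

To mirror the runs above, a lambda wrapping the pd.concat call would be passed in place of stand_in_workload.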

@jreback jreback added the Reshaping Concat, Merge/Join, Stack/Unstack, Explode label Jun 14, 2020
@jbrockmendel
Member Author

@jorisvandenbossche do you have a workaround in mind for how the all-1D case would avoid this perf hit? if so, could that workaround be applicable here?

@TomAugspurger
Contributor

TomAugspurger commented Jun 22, 2020

@jbrockmendel just to get a sense for the orders of magnitude here: is the change in workloads roughly equivalent to the following?

# consolidated: one (100, 10) array -> (200, 10) array
In [15]: a = np.ones((100, 10))

In [16]: %timeit np.concatenate([a, a])
1.86 µs ± 23.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

# non-consolidated: ten (100,) arrays -> ten (200,) arrays
In [17]: bs = [np.ones(100,) for _ in range(10)]

In [18]: %timeit [np.concatenate([b, b]) for b in bs]
15.9 µs ± 183 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)

@jbrockmendel
Member Author

@TomAugspurger not sure which measurement your example is supposed to be analogous to. The slowdown (8.5x) is pretty similar to the slowdown mentioned above inside concatenate_block_managers, but the actual user-facing slowdown is one of the smaller numbers (1.6x or 3.9x)

@jorisvandenbossche
Member

@TomAugspurger if you use slightly larger data, the python overhead (of the for loop + multiple function calls) decreases quickly:

In [10]: a = np.ones((10, 10000))  

In [11]: %timeit np.concatenate([a, a], axis=1)  
49.6 µs ± 1.68 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

In [12]: bs = [np.ones(10000) for _ in range(10)]  

In [13]: %timeit [np.concatenate([b, b]) for b in bs]  
55 µs ± 382 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)
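To put the two NumPy measurements side by side, here is a small sketch that only does arithmetic on the timings quoted in the two comments above. The dictionary keys and the summarize helper are made up for illustration, and the two sets of numbers may well come from different machines, so only the ratios within each set are meaningful.

```python
# Timings quoted in the two comments above, in microseconds.
small = {"one_2d_call": 1.86, "ten_1d_calls": 15.9}   # 100 rows x 10 columns
large = {"one_2d_call": 49.6, "ten_1d_calls": 55.0}   # 10 arrays of 10000 elements


def summarize(timings):
    # Relative cost of ten 1D concatenations versus one 2D concatenation.
    ratio = timings["ten_1d_calls"] / timings["one_2d_call"]
    # Extra time attributable to each of the 9 additional calls.
    extra_per_call = (timings["ten_1d_calls"] - timings["one_2d_call"]) / 9
    return ratio, extra_per_call


small_ratio, small_extra = summarize(small)
large_ratio, large_extra = summarize(large)

print(f"small data: {small_ratio:.1f}x, about {small_extra:.2f} us extra per call")
print(f"large data: {large_ratio:.2f}x, about {large_extra:.2f} us extra per call")
```

The extra cost per call stays in the low-microsecond range in both cases, while the relative gap shrinks from about 8.5x to about 1.1x as the arrays grow.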

@TomAugspurger
Contributor

not sure which measurement your example is supposed to be analogous to.

Just comparing the general idea of "concat this 2D array" vs. "concat these many 1D arrays" to get a sense for how things perform.

The raw number of rows and the relative number of rows to columns does matter to an extent.

concat

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from timeit import default_timer as tic

%matplotlib inline

n_rows = [10, 100, 1000, 10_000, 100_000]
n_cols = [10, 100, 1000]

timings = []

for n in n_rows:
    for c in n_cols:
        for i in range(3):
            a = np.ones((n, c))
            bs = [np.ones(n,) for _ in range(c)]
            t0 = tic()
            np.concatenate([a, a])
            t1 = tic()
            [np.concatenate([b, b]) for b in bs]
            t2 = tic()
            timings.append((n, c, i, 'consolidated', t1 - t0))
            timings.append((n, c, i, 'split', t2 - t1))


df = pd.DataFrame(timings, columns=['n_rows', 'n_cols', 'trial', 'policy', 'time'])
df

g = sns.FacetGrid(df, col="n_cols", hue="policy")
g.map(sns.lineplot, "n_rows", "time")
g.set(xscale="log", yscale="log")
g.add_legend()

@jbrockmendel
Member Author

So I definitely need to add an asv for this (based on #34683 (comment)). Aside from that, we need to do one of three things:

  1. Find a way to fix the performance hit in pd.concat (which I'm hoping @jorisvandenbossche has an idea for as I expect it will be the same for all-1D)
  2. Decide we're OK with this performance hit.
  3. Decide not to move forward with this.
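As a rough idea of what an asv benchmark for the non-consolidated case could look like, here is a sketch that follows asv's usual conventions (a class with params, a setup method, and a time_-prefixed method). The class name, method name, and parameter values are hypothetical and are not the benchmark that was added in this PR.

```python
import numpy as np
import pandas as pd


class ConcatNonConsolidated:
    """Hypothetical asv-style benchmark; all names here are made up."""

    params = [10, 100]
    param_names = ["ncols"]

    def setup(self, ncols):
        nrows = 100
        self.df1 = pd.DataFrame(index=range(nrows))
        self.df2 = pd.DataFrame(index=range(nrows))
        # Column-by-column insertion, as in the snippet earlier in the thread.
        for i in range(ncols):
            self.df1[i] = np.random.randn(nrows)
            self.df2[i] = np.random.randn(nrows)

    def time_concat(self, ncols):
        # asv times the body of this method.
        pd.concat([self.df1, self.df2])
```

One caveat: if concat consolidates its inputs in place (the side effect discussed above), later timing iterations would see already-consolidated frames, so the result should be read with that in mind.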

@jbrockmendel
Member Author

@jorisvandenbossche in the ArrayManager PR you said that the code snippet here performed better than in master. Any insight into improving the performance here?

@jbrockmendel
Member Author

jbrockmendel commented Sep 7, 2020

Looks like in concatenate_block_managers it is going through concat_compat and making more copies

Edit: nope, that doesn't explain it...

OK: within concatenate_block_managers we do two things per-block: JoinUnit.is_na, and concat_compat. It isn't clear to me why the ArrayManager version would be able to avoid these.

This was referenced Sep 12, 2020
@jreback
Contributor

jreback commented Sep 13, 2020

how are the asvs on this now?

@jbrockmendel
Member Author

Just pushed, this is now slightly faster than master on the benchmark above #34683 (comment)

Comment on lines +534 to 535
@cache_readonly
def _get_concat_axis(self) -> Index:
Member


If it is a property, then maybe _concat_axis, without the get, would be better?
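To illustrate the naming point, here is a minimal sketch using the standard library's functools.cached_property as a stand-in for pandas' cache_readonly decorator (an assumption made so the example runs without pandas). The Concatenator class and its attributes are hypothetical.

```python
from functools import cached_property


class Concatenator:
    """Hypothetical stand-in for the class under review."""

    def __init__(self, lengths):
        self.lengths = lengths
        self.n_computations = 0

    @cached_property
    def _concat_axis(self):
        # Computed on first access, then cached on the instance.
        self.n_computations += 1
        return list(range(sum(self.lengths)))


op = Concatenator([2, 3])
first = op._concat_axis    # attribute access, no call parentheses
second = op._concat_axis   # served from the cache
```

Because the value is read as an attribute rather than called, a noun-like name without the get prefix matches how it is used.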

@github-actions
Contributor

This pull request is stale because it has been open for thirty days with no activity. Please update or respond to this comment if you're still interested in working on this.

@github-actions github-actions bot added the Stale label Nov 18, 2020
@jbrockmendel
Member Author

rebased+green


     cls: Type[Block]

     if is_sparse(dtype):
         # Need this first(ish) so that Sparse[datetime] is sparse
         cls = ExtensionBlock
-    elif is_categorical_dtype(values.dtype):
+    elif dtype.name == "category":
Contributor


why not is Categorical ? e.g. since we are removing comparison vs 'category' generally (in your other PR)

Member Author


could do isinstance(dtype, CategoricalDtype). either way is fine by me
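A quick sketch comparing the two checks being discussed, using only public pandas API. The variable names are made up for illustration; for an ordinary categorical dtype both checks agree, and the isinstance form simply avoids the string comparison.

```python
import pandas as pd

# A categorical dtype and, for contrast, a plain integer dtype.
cat_dtype = pd.Series(["a", "b", "a"], dtype="category").dtype
int_dtype = pd.Series([1, 2, 3]).dtype

# Check by name, as in the diff under review.
cat_matches_by_name = cat_dtype.name == "category"
# Check by type, as suggested in the reply.
cat_matches_by_type = isinstance(cat_dtype, pd.CategoricalDtype)

int_matches_by_name = int_dtype.name == "category"
int_matches_by_type = isinstance(int_dtype, pd.CategoricalDtype)

print(cat_matches_by_name, cat_matches_by_type)
print(int_matches_by_name, int_matches_by_type)
```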

Contributor


I think I'd prefer that

Member Author


updated+green

@jreback jreback added this to the 1.3 milestone Dec 17, 2020
@jreback jreback merged commit 76a5a4f into pandas-dev:master Dec 17, 2020
@jreback
Contributor

jreback commented Dec 17, 2020

thanks

@jbrockmendel jbrockmendel deleted the cln-consolidate-concat branch December 17, 2020 21:48
luckyvs1 pushed a commit to luckyvs1/pandas that referenced this pull request Jan 20, 2021