PERF: DataFrame.transpose with dt64tz #40149

Merged May 17, 2021 (166 commits)


jbrockmendel
Member

Does what it says on the tin: DatetimeBlock.values is always a DatetimeArray, and dt64tzblock.shape == dt64tzblock.values.shape in all cases. Similarly, TimedeltaBlock.values is always a TimedeltaArray.
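For readers who want to see the user-facing behavior this touches, here is a minimal sketch using only public pandas API. It assumes a pandas version in which transposing a frame of homogeneous tz-aware columns keeps the tz-aware dtype; the frame, column names, and dates are made up for illustration, and the internal Block classes are deliberately not inspected because they are private and version-dependent.

```python
import pandas as pd

# Two tz-aware columns sharing one dtype, mirroring the 'US/Pacific' benchmark case.
# The data here is illustrative only.
stamps = pd.date_range("2021-01-01", periods=3, tz="US/Pacific")
df = pd.DataFrame({"a": stamps, "b": stamps + pd.Timedelta(days=1)})

# Transpose: 3 rows x 2 columns becomes 2 rows x 3 columns.
transposed = df.T

# Public view of the array backing one column (not the internal Block).
column_array = df["a"].array

print(type(column_array).__name__)
print(transposed.shape)
print(transposed.dtypes.unique())
```

Transposing back with `transposed.T` should give the original values again, which is a cheap sanity check when experimenting with tz-aware frames.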

Notes:

  • It is straightforward to extend this to work for PeriodDtype (I have a branch). I haven't tried it, but I expect it would be similarly easy to do the same for CategoricalDtype.

Things that I'm not yet fully happy with:

  • fillna method on 2D (I think @simonjayhawkins commented on this in another branch recently)
  • nargminmax with 2D and mask.any()
  • pytables kludge

ASVs: run repeatedly (vs master from yesterday) with --record-samples --append-samples, so I'm pretty confident these are stable (but they still include some nonsense; xref #40066)

       before           after         ratio
     [f4b67b5e]       [65792836]
     <master>         <ref-hybrid-3>
+        10.1±3ms         13.9±3ms     1.38  eval.Eval.time_add('python', 'all')
+     2.06±0.02ms      2.40±0.06ms     1.16  hash_functions.NumericSeriesIndexingShuffled.time_loc_slice(<class 'pandas.core.indexes.numeric.Int64Index'>, 1000000)
+         227±2μs          263±2μs     1.15  groupby.GroupByMethods.time_dtype_as_field('datetime', 'head', 'transformation')
+         228±2μs          261±2μs     1.15  groupby.GroupByMethods.time_dtype_as_field('datetime', 'head', 'direct')
+         238±2μs          272±2μs     1.14  groupby.GroupByMethods.time_dtype_as_field('datetime', 'tail', 'transformation')
+         248±6μs          282±5μs     1.14  groupby.GroupByMethods.time_dtype_as_field('datetime', 'tail', 'direct')
+     3.92±0.03ms      4.37±0.01ms     1.11  rolling.Engine.time_rolling_apply('DataFrame', 'float', <function Engine.<lambda> at 0x7fb1c0b40670>, 'cython', 'median')
+     2.83±0.02ms      3.14±0.06ms     1.11  io.hdf.HDFStoreDataFrame.time_store_info
-         275±4μs          248±4μs     0.90  groupby.GroupByMethods.time_dtype_as_field('datetime', 'shift', 'direct')
-     1.41±0.05ms      1.27±0.01ms     0.90  stat_ops.FrameOps.time_op('sum', 'int', 1)
-     1.13±0.06ms      1.02±0.07ms     0.90  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.int64'>, 5.0, <built-in function ne>)
-         271±2μs          242±2μs     0.89  groupby.GroupByMethods.time_dtype_as_field('datetime', 'shift', 'transformation')
-         188±3μs          167±1μs     0.89  algos.isin.IsIn.time_isin_empty('datetime64[ns]')
-         192±2μs          170±2μs     0.89  algos.isin.IsIn.time_isin_mismatched_dtype('datetime64[ns]')
-         227±2μs          200±2μs     0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'direct')
-         226±2μs          199±1μs     0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'transformation')
-         227±2μs          199±1μs     0.88  groupby.GroupByMethods.time_dtype_as_field('datetime', 'any', 'transformation')
-        895±60μs         785±80μs     0.88  arithmetic.IntFrameWithScalar.time_frame_op_with_scalar(<class 'numpy.float64'>, 3.0, <built-in function ge>)
-      10.2±0.3ms       8.93±0.7ms     0.88  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.int64'>, 19, 'inside')
-         235±4μs          204±4μs     0.87  groupby.GroupByMethods.time_dtype_as_field('datetime', 'all', 'direct')
-     3.26±0.03μs      2.83±0.03μs     0.87  frame_methods.ToNumpy.time_to_numpy_tall
-     3.28±0.03μs      2.82±0.02μs     0.86  frame_methods.ToNumpy.time_to_numpy_wide
-      9.77±0.2ms       8.40±0.2ms     0.86  indexing.NumericSeriesIndexing.time_loc_slice(<class 'pandas.core.indexes.numeric.UInt64Index'>, 'nonunique_monotonic_inc')
-     2.94±0.05μs      2.52±0.02μs     0.86  frame_methods.ToNumpy.time_values_tall
-     2.95±0.03μs      2.52±0.02μs     0.85  frame_methods.ToNumpy.time_values_wide
-     2.09±0.02ms      1.77±0.01ms     0.85  groupby.FillNA.time_df_ffill
-     2.09±0.02ms      1.77±0.01ms     0.85  groupby.FillNA.time_df_bfill
-         204±3μs          168±2μs     0.82  arithmetic.OffsetArrayArithmetic.time_add_series_offset(<Day>)
-         170±2μs          137±3μs     0.81  groupby.GroupByMethods.time_dtype_as_field('datetime', 'count', 'direct')
-         170±2μs          137±4μs     0.80  groupby.GroupByMethods.time_dtype_as_field('datetime', 'count', 'transformation')
-        29.1±3ms       22.9±0.4ms     0.79  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.uint64'>, 20, 'outside')
-        26.0±1ms         19.2±2ms     0.74  algos.isin.IsinAlmostFullWithRandomInt.time_isin(<class 'numpy.int64'>, 20, 'inside')
-      26.2±0.2ms      18.0±0.07ms     0.69  index_object.SetOperations.time_operation('date_string', 'symmetric_difference')
-      11.6±0.1ms      7.32±0.08ms     0.63  reshape.ReshapeExtensionDtype.time_stack('datetime64[ns, US/Pacific]')
-      40.2±0.5μs       25.0±0.3μs     0.62  ctors.SeriesDtypesConstructors.time_dtindex_from_index_with_series
-     3.77±0.03ms      2.08±0.03ms     0.55  reshape.ReshapeExtensionDtype.time_unstack_slow('datetime64[ns, US/Pacific]')
-      32.1±0.5μs       17.0±0.2μs     0.53  ctors.SeriesDtypesConstructors.time_dtindex_from_series
-     1.11±0.03ms          408±7μs     0.37  categoricals.Constructor.time_datetimes
-      14.1±0.1μs      1.26±0.02μs     0.09  attrs_caching.SeriesArrayAttribute.time_extract_array_numpy('datetime64')
-      13.7±0.1μs      1.04±0.03μs     0.08  attrs_caching.SeriesArrayAttribute.time_extract_array('datetime64')
-      13.0±0.2μs         455±10ns     0.04  attrs_caching.SeriesArrayAttribute.time_array('datetime64')
-        73.8±1ms      1.66±0.03ms     0.02  reshape.ReshapeExtensionDtype.time_unstack_fast('datetime64[ns, US/Pacific]')
-      64.3±0.9ms          258±2μs     0.00  reshape.ReshapeExtensionDtype.time_transpose('datetime64[ns, US/Pacific]')
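The ratio column in the table above is the "after" time divided by the "before" time, so values below 1 are speedups and the last row's 0.00 is a rounding artifact rather than a true zero. The sketch below reproduces that arithmetic from the timing strings. The helper names `parse_asv_time` and `asv_ratio` are hypothetical and are not part of asv; the sketch only handles the `value±error unit` form seen in this table.

```python
import re

# Seconds per unit. Both the Greek mu and the micro sign are accepted,
# since either may appear in copied benchmark output.
UNIT_SECONDS = {"ns": 1e-9, "μs": 1e-6, "µs": 1e-6, "ms": 1e-3, "s": 1.0}

TIME_PATTERN = re.compile(r"([\d.]+)±[\d.]+(ns|μs|µs|ms|s)")


def parse_asv_time(text):
    """Return the central value of a timing such as '64.3±0.9ms', in seconds."""
    match = TIME_PATTERN.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"unrecognized timing: {text!r}")
    value, unit = match.groups()
    return float(value) * UNIT_SECONDS[unit]


def asv_ratio(before, after):
    """Hypothetical helper: after/before, matching the table's ratio column."""
    return parse_asv_time(after) / parse_asv_time(before)


# Last row of the table: reshape.ReshapeExtensionDtype.time_transpose
transpose_ratio = asv_ratio("64.3±0.9ms", "258±2μs")
print(f"ratio   = {transpose_ratio:.4f}")       # ratio   = 0.0040
print(f"speedup = {1 / transpose_ratio:.0f}x")  # speedup = 249x
```

Read this way, the transpose benchmark is roughly 250 times faster after the change, which the two-decimal ratio column cannot show.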

IIRC the groupby.GroupByMethods.time_dtype_as_field benchmarks were heavily influenced by constructor overhead, which motivated #40054. I still need to try out @jorisvandenbossche's suggestion of a non-cython optimization there.

@jbrockmendel
Member Author

@jreback would it help to split off the perf-improving part of this for a follow-up to further trim the diff?

@jreback
Contributor

jreback commented Apr 20, 2021

> @jreback would it help to split off the perf-improving part of this for a follow-up to further trim the diff?

sure

@jbrockmendel
Member Author

jbrockmendel commented Apr 21, 2021

> > @jreback would it help to split off the perf-improving part of this for a follow-up to further trim the diff?
>
> sure

Hmm, this is looking more involved than I expected. I can try it if it makes a difference; basically it would split off everything in frame.py.

@jreback
Contributor

jreback commented Apr 21, 2021

let me look again

I think it would be good if you can reduce what is added to frame.

jbrockmendel added a commit to jbrockmendel/pandas that referenced this pull request Apr 21, 2021
@jbrockmendel
Member Author

broken in as close to half as possible (not that close) in #41082

@jbrockmendel
Member Author

fairly small diff now

@jbrockmendel jbrockmendel changed the title POC/REF: Back DatetimeTZBlock directly by (sometimes 2D) DTA PERF: DataFrame.transpose with dt64tz May 10, 2021
@jreback jreback added this to the 1.3 milestone May 17, 2021
@jreback
Contributor

jreback commented May 17, 2021

Looks fine. Can you rebase? Please add a whatsnew note (as this is a non-trivial perf increase). Ping on green.

@jbrockmendel
Member Author

ping

@jreback jreback merged commit 93fb9d9 into pandas-dev:master May 17, 2021
@jreback
Contributor

jreback commented May 17, 2021

thanks!

@jbrockmendel jbrockmendel deleted the ref-hybrid-3 branch May 17, 2021 19:22
TLouf pushed a commit to TLouf/pandas that referenced this pull request Jun 1, 2021
JulianWgs pushed a commit to JulianWgs/pandas that referenced this pull request Jul 3, 2021
Labels
Refactor Internal refactoring of code

6 participants