PERF: regression in DataFrame reduction ops performance #37081

jorisvandenbossche · 2020-10-12T14:49:31Z

From https://pandas.pydata.org/speed/pandas/#stat_ops.FrameOps.time_op?Cython=0.29.21&Cython=0.29.16&p-op='sum'&p-dtype='int'&p-axis=0&commits=3a043f2d-4c03d07b&x-axis-scale=date

Reproducer:

import numpy as np
import pandas as pd

values = np.random.randn(100000, 4)
df = pd.DataFrame(values).astype("int")
%timeit df.sum()

The timing increased by a factor of 2 to 3 at some point in the last few days.
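For anyone without IPython at hand, the reproducer can be sketched as a plain script. This is only an illustration: the helper names `make_frame` and `time_sum`, the fixed seed, and the repeat counts are assumptions, not part of the asv benchmark linked above.

```python
import timeit

import numpy as np
import pandas as pd


def make_frame(n_rows=100_000, n_cols=4, seed=0):
    """Build an integer DataFrame shaped like the one in the reproducer."""
    rng = np.random.default_rng(seed)  # seeded so runs are comparable
    values = rng.standard_normal((n_rows, n_cols))
    return pd.DataFrame(values).astype("int")


def time_sum(df, number=200, repeat=5):
    """Return the best observed time in seconds for one df.sum() call."""
    timings = timeit.repeat(df.sum, number=number, repeat=repeat)
    return min(timings) / number


if __name__ == "__main__":
    df = make_frame()
    print(f"df.sum(): {time_sum(df) * 1e6:.0f} µs per call")
```

Running the script on two checkouts of pandas and comparing the printed numbers gives a rough before/after figure; absolute values depend on the machine.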

ukarroum · 2020-10-12T17:47:17Z

Would like to try working on that if possible.

ukarroum · 2020-10-12T17:47:24Z

take

jbrockmendel · 2020-10-12T23:20:05Z

#35881 is a candidate, though that's a much bigger impact than I'd expect.

jorisvandenbossche · 2020-10-13T15:04:32Z

Checking out that commit versus the one before it seems to confirm that #35881 is indeed the cause of the slowdown.

jorisvandenbossche · 2020-10-13T15:23:29Z

I did a quick profile and commented on #35881 with some observations.

jorisvandenbossche · 2020-10-13T15:24:30Z

@ukarroum, this comment on #35881 in particular is something you could test.

ukarroum · 2020-10-14T17:28:07Z

I can confirm that using self._iter_column_arrays() instead of self.dtypes gives a significant performance improvement:

With self.dtypes:

In [8]: values = np.random.randn(100000, 4)   
   ...: df = pd.DataFrame(values).astype("int") 
   ...: %timeit df.sum() 
714 µs ± 21 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

With self._iter_column_arrays():

In [4]: values = np.random.randn(100000, 4)   
   ...: df = pd.DataFrame(values).astype("int") 
   ...: %timeit df.sum() 
477 µs ± 8.7 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

Gonna submit a PR soon.
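The idea of computing the dtypes only once and deriving several flags from them can be sketched with public API only. This is a hypothetical helper, not the code in DataFrame._reduce: the name `dtype_flags` and the exact flags are assumptions for illustration, and it uses `df.dtypes` rather than the private `_iter_column_arrays()`.

```python
import numpy as np
import pandas as pd
from pandas.api.types import is_datetime64_any_dtype, is_object_dtype


def dtype_flags(df):
    """Hypothetical helper: derive two flags from a single dtypes lookup.

    Returns (any_object, all_datetime).
    """
    dtypes = list(df.dtypes)  # build the dtypes Series once, then reuse it
    any_object = any(is_object_dtype(dt) for dt in dtypes)
    all_datetime = bool(dtypes) and all(is_datetime64_any_dtype(dt) for dt in dtypes)
    return any_object, all_datetime


if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(100000, 4)).astype("int")
    print(dtype_flags(df))
```

The point is only that each access to `df.dtypes` constructs a new Series, so looking it up once per reduction avoids repeated overhead on a call that takes well under a millisecond.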

jorisvandenbossche · 2020-10-20T16:34:57Z

Reopening since this dtype change only fixed part of the regression. See the updated plot at https://pandas.pydata.org/speed/pandas/#stat_ops.FrameOps.time_op?Cython=0.29.21&p-op='sum'&p-dtype='int'&p-axis=0&commits=3a043f2d-4c03d07b&x-axis-scale=date

ukarroum · 2020-10-20T18:48:42Z

Gonna do some profiling and open a PR asap.

ukarroum · 2020-10-24T10:37:05Z

The remaining regression comes from changing this condition:

if numeric_only is not None:

312 µs ± 2.74 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

to:

if numeric_only is not None or (
    numeric_only is None
    and axis == 0
    and not any_object
    and not self._mgr.any_extension_types
):

474 µs ± 2.29 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

in commit 7d257c6.
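To see why the reproducer takes a different branch after that commit, the two conditions can be written as standalone predicates. This is a sketch for illustration only: the function names are made up, and it assumes df.sum() is called with numeric_only=None and axis=0, as in the reproducer.

```python
def old_condition(numeric_only):
    """Condition before commit 7d257c6."""
    return numeric_only is not None


def new_condition(numeric_only, axis, any_object, any_extension_types):
    """Condition after commit 7d257c6."""
    return numeric_only is not None or (
        numeric_only is None
        and axis == 0
        and not any_object
        and not any_extension_types
    )


# Assumed state for the reproducer: default numeric_only, axis=0, int columns only.
reproducer = dict(
    numeric_only=None, axis=0, any_object=False, any_extension_types=False
)

print(old_condition(reproducer["numeric_only"]))  # False: block skipped
print(new_condition(**reproducer))  # True: block entered
```

With the old condition the reproducer skips the block; with the new one it enters it, which is where the extra time is spent.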

EDIT: after removing the code that the reproducer (see the first comment) never reaches, the block reduces to this minimal code:

df = self

ignore_failures = numeric_only is None

res, indexer = df._mgr.reduce(blk_func, ignore_failures=ignore_failures)
out = df._constructor(res).iloc[0]

return out

The changes in the above commit don't seem to add a significant performance regression to _mgr.reduce itself.

ukarroum · 2020-10-24T13:26:25Z

Replacing out = df._constructor(res).iloc[0] with an O(1) instruction (a hardcoded Series) has almost no impact on execution time.

df._mgr.reduce(blk_func, ignore_failures=ignore_failures) is the line adding the 150 µs.

EDIT: using try: func(values) before the above condition seems to solve the problem and passes the tests.

But I'm not confident enough to open a PR (yet), since I don't fully understand this code and I'm afraid it may break some use cases that need the condition block even when the try doesn't fail.
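One way to check which call accounts for the time is to profile repeated calls to df.sum() with the standard library profiler. This is a generic sketch, not the procedure used above: `profile_sum` and its parameters are assumed names, and profiler overhead distorts absolute timings.

```python
import cProfile
import io
import pstats

import numpy as np
import pandas as pd


def profile_sum(df, repeats=200, top=15):
    """Profile repeated df.sum() calls and return the report as text."""
    profiler = cProfile.Profile()
    profiler.enable()
    for _ in range(repeats):
        df.sum()
    profiler.disable()

    buffer = io.StringIO()
    stats = pstats.Stats(profiler, stream=buffer)
    # Sort by cumulative time so the costliest call chains are listed first.
    stats.sort_stats("cumulative").print_stats(top)
    return buffer.getvalue()


if __name__ == "__main__":
    df = pd.DataFrame(np.random.randn(100000, 4)).astype("int")
    print(profile_sum(df))
```

Comparing the reports from the two commits shows which functions gained cumulative time.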

jbrockmendel · 2020-10-24T21:27:27Z

We don't particularly need ignore_failures in BlockManager.reduce right now, but it will be needed when we fix the many reduction bugs caused by calling .values further down in DataFrame._reduce. If the perf impact is that big, it can be reverted in the interim, I guess.

ukarroum · 2020-10-25T06:52:31Z

@jbrockmendel: Actually, we don't need to revert the whole change. We only need to revert the following condition:

if numeric_only is not None or (
    numeric_only is None
    and axis == 0
    and not any_object
    and not self._mgr.any_extension_types
):

to:

if numeric_only is not None:

This will solve the regression.

EDIT: Do you think I can safely do so?

jbrockmendel · 2020-10-26T01:29:08Z

"Do you think I can safely do so?"

In the short term, yes, since we are not really relying on this. The upcoming bugfixes will rely on it, so it's really a question of whether we want to revert now and then unrevert later.

jorisvandenbossche added this to the 1.2 milestone Oct 12, 2020

github-actions bot assigned ukarroum Oct 12, 2020

ukarroum added a commit to ukarroum/pandas that referenced this issue Oct 14, 2020

[PERF] Fixed issue pandas-dev#37081

6357ac2

ukarroum mentioned this issue Oct 14, 2020

PERF: regression in DataFrame reduction ops performance #37081 #37118

Merged

4 tasks

ukarroum added a commit to ukarroum/pandas that referenced this issue Oct 14, 2020

PERF : pandas-dev#37081 Compute dtypes once for 'any_object' and 'dty…

41827fb

…pe_is_dt'

jreback closed this as completed in #37118 Oct 17, 2020

jreback pushed a commit that referenced this issue Oct 17, 2020

PERF: regression in DataFrame reduction ops performance #37081 (#37118)

9fed16c

jorisvandenbossche reopened this Oct 20, 2020

JulianWgs pushed a commit to JulianWgs/pandas that referenced this issue Oct 26, 2020

PERF: regression in DataFrame reduction ops performance pandas-dev#37081

a4e08f6

(pandas-dev#37118)

ukarroum added a commit to ukarroum/pandas that referenced this issue Oct 26, 2020

PERF: reverted change from commit 7d257c6 to solve issue pandas-dev#3…

6c7d391

…7081

ukarroum mentioned this issue Oct 26, 2020

PERF: reverted change from commit 7d257c69 to solve issue #37081 #37426

Merged

5 tasks

kesmit13 pushed a commit to kesmit13/pandas that referenced this issue Nov 2, 2020

PERF: regression in DataFrame reduction ops performance pandas-dev#37081

ee353ce

(pandas-dev#37118)

ukarroum added a commit to ukarroum/pandas that referenced this issue Nov 2, 2020

PERF: reverted change from commit 7d257c6 to solve issue pandas-dev#3…

481c8a7

…7081

ukarroum added a commit to ukarroum/pandas that referenced this issue Nov 7, 2020

PERF: reverted change from commit 7d257c6 to solve issue pandas-dev#3…

1228167

…7081

ukarroum added a commit to ukarroum/pandas that referenced this issue Nov 7, 2020

PERF: reverted change from commit 7d257c6 to solve issue pandas-dev#3…

288bd70

…7081

jreback closed this as completed in #37426 Nov 8, 2020

jreback pushed a commit that referenced this issue Nov 8, 2020

PERF: reverted change from commit 7d257c6 to solve issue #37081 (#37426)

0573c3a
