Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PERF: Limit memmove to >= 256 bytes, relax contiguity requirements #3130

Merged
0 commits merged into from
Mar 22, 2013

Conversation

stephenwlin
Copy link
Contributor

As per #3089, for now, I'm putting in a "magic" lower limit for memmove of 256 bytes (32 bytes was the case in the affected tests), which seems to be reasonable from local testing (::dubious::), unless someone on StackOverflow gives me a better idea of what to do.

Also, realized that only the stride in the dimension of the copy matters (i.e. the entire array doesn't have to be contiguous, only the copied subarrays do), so I relaxed that requirement (non-contiguous cases don't seem to be tested in our performance regressions, since they're pretty shallow unfortunately, but they do happen often in practice...this should be addressed by #3114).

Here are the vbench results on the low (improved) end (<90%):

series_drop_duplicates_string               0.7474      1.0182     0.7340
frame_multi_and_st                        763.4459   1008.7359     0.7568
reindex_frame_level_align                   1.5111      1.9937     0.7579
reindex_frame_level_reindex                 1.5304      1.9711     0.7764
panel_from_dict_equiv_indexes              34.4702     42.6086     0.8090
groupby_frame_cython_many_columns           6.4044      7.4999     0.8539
read_csv_comment2                          40.2233     45.7705     0.8788
groupby_series_simple_cython                7.5779      8.4962     0.8919

and the high (regressed) end (>105%, as there were no cases of >110%):

frame_ctor_nested_dict                    108.7902    103.3220     1.0529
stats_rolling_mean                          2.8128      2.6283     1.0702

I suspect the last results are just noise.

This is 32-bit Linux GCC 4.6.3, mileage may vary (still haven't set up at 64-bit environment), if anyone else could test this commit too that would be great.


EDIT

repeat run results:

reindex_frame_level_reindex                 1.4861     1.9335     0.7686
reindex_frame_level_align                   1.5578     1.9846     0.7849
panel_from_dict_equiv_indexes              35.0858    42.0521     0.8343
groupby_frame_cython_many_columns           6.5504     7.8101     0.8387
read_csv_comment2                          40.2616    46.1578     0.8723
groupby_last_float32                        7.7265     8.6995     0.8882

and

frame_reindex_columns                       0.4452     0.4237     1.0507
append_frame_single_homogenous              0.4923     0.4669     1.0543
reshape_pivot_time_series                 292.3670   277.2169     1.0547
join_dataframe_integer_key                  2.8652     2.6741     1.0714

@ghost
Copy link

ghost commented Mar 21, 2013

64 bit linux, gcc 4.7.2, sse2

./test_perf.sh -t e852b36 -b 4d3e34e -r reindex

-----------------------------------------------------------
Test name                 | target[ms] | base[ms] |   ratio
-----------------------------------------------------------

frame_reindex_upcast               21.6771    32.5228     0.6665
frame_reindex_both_axes_ix          0.4005     0.6000     0.6674
frame_reindex_axis0                 0.8208     0.9360     0.8769
reindex_multiindex                  1.0762     1.1021     0.9765
frame_reindex_both_axes             0.3261     0.3307     0.9862
reindex_daterange_pad               0.1768     0.1786     0.9901
reindex_frame_level_align           0.9876     0.9950     0.9925
reindex_fillna_backfill_float32     0.0964     0.0971     0.9927
reindex_frame_level_reindex         0.9515     0.9580     0.9931
reindex_fillna_backfill             0.1036     0.1042     0.9940
reindex_fillna_pad_float32          0.0964     0.0969     0.9951
reindex_fillna_pad                  0.1033     0.1037     0.9965
frame_reindex_columns               0.3024     0.3012     1.0041
dataframe_reindex                   0.3868     0.3837     1.0080
reindex_daterange_backfill          0.1661     0.1639     1.0135
frame_reindex_axis1                 1.9141     1.6649     1.1497

-----------------------------------------------------------
Test name                 | target[ms] | base[ms] |   ratio
-----------------------------------------------------------

Ratio < 1.0 means the target commit is faster then the baseline.

Target [e852b36] : PERF: Limit memmove to >= 256 bytes, relax contiguity requirements (only the stride in the dimension of the copy matters)
Base   [4d3e34e] : Merge pull request #3127 from jreback/reshape

repeated run

-----------------------------------------------------------
Test name                 | target[ms] | base[ms] |   ratio
-----------------------------------------------------------

reindex_frame_level_reindex         0.9444     1.0020     0.9426
frame_reindex_upcast               23.8891    24.6829     0.9678
frame_reindex_columns               0.3012     0.3107     0.9695
reindex_multiindex                  1.1992     1.2264     0.9778
reindex_daterange_backfill          0.1637     0.1666     0.9827
reindex_fillna_pad_float32          0.0959     0.0974     0.9842
reindex_fillna_backfill_float32     0.0957     0.0966     0.9906
reindex_daterange_pad               0.1759     0.1774     0.9916
reindex_fillna_pad                  0.1029     0.1034     0.9944
reindex_fillna_backfill             0.1035     0.1035     0.9993
frame_reindex_both_axes_ix          0.3980     0.3977     1.0008
frame_reindex_both_axes             0.3220     0.3217     1.0011
frame_reindex_axis1                 1.6690     1.6468     1.0135
frame_reindex_axis0                 0.8296     0.8147     1.0183
dataframe_reindex                   0.3955     0.3852     1.0269
reindex_frame_level_align           1.2472     1.0184     1.2247

-----------------------------------------------------------
Test name                 | target[ms] | base[ms] |   ratio
-----------------------------------------------------------

third run

-----------------------------------------------------------
Test name                 | target[ms] | base[ms] |   ratio
-----------------------------------------------------------

reindex_daterange_backfill          0.1659     0.2478     0.6693
frame_reindex_both_axes_ix          0.4036     0.5960     0.6772
dataframe_reindex                   0.4049     0.5648     0.7169
frame_reindex_upcast               21.8160    30.0523     0.7259
reindex_frame_level_align           1.0126     1.3830     0.7322
frame_reindex_axis1                 1.8061     2.0984     0.8607
frame_reindex_axis0                 0.8023     0.9220     0.8702
reindex_frame_level_reindex         0.9319     0.9538     0.9770
frame_reindex_both_axes             0.3199     0.3229     0.9908
frame_reindex_columns               0.3020     0.3034     0.9953
reindex_daterange_pad               0.1769     0.1769     0.9997
reindex_fillna_pad                  0.1032     0.1029     1.0028
reindex_fillna_backfill_float32     0.0962     0.0957     1.0055
reindex_fillna_backfill             0.1045     0.1034     1.0101
reindex_fillna_pad_float32          0.0969     0.0956     1.0136
reindex_multiindex                  1.2075     1.0829     1.1151

-----------------------------------------------------------
Test name                 | target[ms] | base[ms] |   ratio
-----------------------------------------------------------

still wondering about repeatability of vbenches...

@stephenwlin
Copy link
Contributor Author

@y-p, weird, it doesn't revert the 50% regression (on 64-bit) in reindex_frame_level_align and reindex_frame_level_reindex? Or maybe your machine doesn't have the regression to begin with? were you able to reproduce the regression found by @jreback with my original memmove commit? also, I guess 256 bytes might be too restrictive, since that means we're undoing an optimization we want; maybe I can bump it down to something lower (or make it machine word-size dependent...)...actually, even weirder is that reindex_frame_level_align goes from 95% to 122% on repeat run?

@jreback
Copy link
Contributor

jreback commented Mar 21, 2013

Results:
                                            t_head  t_baseline      ratio
name                                                                     
reindex_frame_level_reindex                 0.5425      1.0084     0.5379
reindex_frame_level_align                   0.5625      1.0328     0.5447
groupby_simple_compress_timing             25.1635     31.8418     0.7903

Target [4bc26b4] : PERF: Limit memmove to >= 64 bytes, relax contiguity requirements 
(only the stride in the dimension of the copy matters)
Baseline [a6db105] : Merge pull request #3103 from jreback/ensure_clean
Linux goat 2.6.32-5-amd64 #1 SMP Sun Sep 23 10:07:46 UTC 2012 x86_64 GNU/Linux
gcc version 4.4.5 (Debian 4.4.5-8) 

@stephenwlin
Copy link
Contributor Author

@y-p @jreback, I suspect you have processor/compiler differences that are relevant for reindex_frame_level_reindex and reindex_frame_level_align then

@jreback
Copy link
Contributor

jreback commented Mar 21, 2013

@y-p isn't the point of your build CACHER to not rebuild the cython code? so can't use it here, because we WANT to rebuild

@jreback
Copy link
Contributor

jreback commented Mar 21, 2013

@stephenwlin time to upgrade my gcc I guess!

@stephenwlin
Copy link
Contributor Author

@jreback, nothing regressed on the high end, right?

@stephenwlin
Copy link
Contributor Author

@jreback anyway, I guess we reverted the regression found originally, so that's good, without causing another one or undoing the optimizations in cases where they're useful (as far as we know...); we definitely need to find ways to be more systematic about low-level testing though

@jreback
Copy link
Contributor

jreback commented Mar 21, 2013

i'll post my 2nd run in a min

@jreback
Copy link
Contributor

jreback commented Mar 21, 2013

no neg results

Results:
                                            t_head  t_baseline      ratio
name                                                                     
reindex_frame_level_reindex                 0.5394      1.0058     0.5363
reindex_frame_level_align                   0.5718      1.0237     0.5585

even 20% can be random (because of the random num generation)

groupby_simple_compress_timing             32.0354     32.5414     0.9845

@jreback
Copy link
Contributor

jreback commented Mar 21, 2013

running same commits as @y-p (as I was running your 1st memmove commit), though I think you said just the comment changed...., will post in a few

@stephenwlin
Copy link
Contributor Author

@jreback just the comment changed, I diffed to double check

@jreback
Copy link
Contributor

jreback commented Mar 21, 2013

virtually identical to my above post

4bc26b48d9eec28efe0b43e1e314940052de8c1d to a6db105

@ghost
Copy link

ghost commented Mar 21, 2013

modulo bugs, the cython cache hack hashes all the input files that go into building the .so,
so changes should trigger a rebuild, and anecdotally, I've confirmed that indeed it does.

@ghost
Copy link

ghost commented Mar 21, 2013

@stephenwlin , give me a commit hash, and I'll rerun an average of 3.

@ghost
Copy link

ghost commented Mar 21, 2013

re diff between jeff's machine and me,
numpy/numpy#324
#2490

demonstrated that there are explicit differences in the memmove family between ubuntu
and debian libc. I'm on debian testing, I suspect jeff is on ubuntu, that might be a factor.

@stephenwlin
Copy link
Contributor Author

the original regression was from 759b0db (good) to 10f65e0 (bad); fix should be from 4d3e34e (bad) to e852b36 (good)

@wesm
Copy link
Member

wesm commented Mar 21, 2013

Nice work. Maybe in like 2017 we'll have an ATLAS-like tuning system for pandas

@stephenwlin
Copy link
Contributor Author

@wesm low-level tuning is fun :)

my senior thesis was on compiler optimization, specifically specialization based on propagating non-constant by range-limited value information...virtual table pointers in my case but this is the same general concept...I'm very very disappointed with gcc here that it can't figure this out...i might just go patch gcc (or clang, if it doesn't do this either...the codebase is much cleaner) for this...that doesn't help us anytime soon but it could by 2017

@jreback
Copy link
Contributor

jreback commented Mar 22, 2013

for consistency 4bc26b4 to 05a737d (v0.10.1)

Results:
                                            t_head  t_baseline      ratio
name                                                                     
frame_get_dtype_counts                      0.0971    207.3898     0.0005
frame_wide_repr                             0.5320    207.6731     0.0026
groupby_first_float32                       3.0259    335.9749     0.0090
groupby_last_float32                        3.2037    334.8918     0.0096
frame_to_csv2                             188.0970   2191.5438     0.0858
indexing_dataframe_boolean                 12.8686    126.6370     0.1016
write_csv_standard                         36.8581    232.5151     0.1585
frame_reindex_axis0                         0.3152      1.1079     0.2845
frame_to_csv_mixed                        349.8561   1112.4861     0.3145
frame_to_csv                              116.9171    218.2992     0.5356
frame_add                                  22.0810     34.8116     0.6343
frame_mult                                 21.9506     34.0294     0.6450
frame_reindex_upcast                       11.7105     17.3680     0.6743
frame_fancy_lookup_all                     14.2124     19.7532     0.7195
reindex_frame_level_reindex                 0.5592      0.7404     0.7553
reindex_frame_level_align                   0.5713      0.7551     0.7567
series_string_vector_slice                146.6720    178.7109     0.8207
join_dataframe_index_single_key_bigger     12.6925     14.7876     0.8583
join_dataframe_index_single_key_bigger      5.7026      6.6035     0.8636
dataframe_reindex                           0.3706      0.4290     0.8638
frame_fancy_lookup                          1.6022      1.8267     0.8771
groupby_multi_different_numpy_functions     9.9613     11.2368     0.8865
frame_boolean_row_select                    0.2046      0.2289     0.8937
groupby_multi_different_functions           9.9884     11.1221     0.8981

@ghost
Copy link

ghost commented Mar 22, 2013

I'm getting too much variability in results on my machine and the iteration
turn around is killing me. There must be a rogue process hiding on my machine.

Just go with jeff's results. sorry.

@stephenwlin
Copy link
Contributor Author

btw, these results:

reindex_frame_level_reindex                 0.5592      0.7404     0.7553
reindex_frame_level_align                   0.5713      0.7551     0.7567

are most likely due to keeping the output f-contiguous when the input is f-contiguous: I found that it helps even when not doing the memmove optimization; it's likely due to cache issues. this is why putting that change in was a win independently of doing memmove (both had noticable positive effects independently, they only had a negative effect when combined)

@jreback
Copy link
Contributor

jreback commented Mar 22, 2013

ok....unless any objections @y-p, @wesm going to merge this....

@ghost
Copy link

ghost commented Mar 22, 2013

Nope, great job stephen.

I split up the cleanups from the real changes in a branch on my fork: stephenwlin-memmove-limit
and added pointers to this issue, since this is somewhat magical. Suggest you merge from
there once travis passes. I Preserved authorship ofcourse.

λ gd  e852b36 bef52df
diff --git a/pandas/src/generate_code.py b/pandas/src/generate_code.py
index b80555d..2d58733 100644
--- a/pandas/src/generate_code.py
+++ b/pandas/src/generate_code.py
@@ -93,6 +93,7 @@ def take_2d_axis0_%(name)s_%(dest)s(ndarray[%(c_type_in)s, ndim=2] values,
         cdef:
             %(c_type_out)s *v, *o

+        #GH3130
         if (values.strides[1] == out.strides[1] and
             values.strides[1] == sizeof(%(c_type_out)s) and
             sizeof(%(c_type_out)s) * n >= 256):
@@ -141,6 +142,7 @@ def take_2d_axis1_%(name)s_%(dest)s(ndarray[%(c_type_in)s, ndim=2] values,
         cdef:
             %(c_type_out)s *v, *o

+        #GH3130
         if (values.strides[0] == out.strides[0] and
             values.strides[0] == sizeof(%(c_type_out)s) and
             sizeof(%(c_type_out)s) * n >= 256):

@wesm
Copy link
Member

wesm commented Mar 22, 2013

I say bombs away whenever you guys are satisfied with this.

@jreback
Copy link
Contributor

jreback commented Mar 22, 2013

bow to git fu master @y-p
go ahead and merge when u r ready

@ghost
Copy link

ghost commented Mar 22, 2013

why, thank you grasshopper.

I use some tools which might make you more productive, but the health and
safety persons are afraid your keyboard will catch on fire and associated
liability issues etc', so I'm holding off.

will merge.

@jreback
Copy link
Contributor

jreback commented Mar 22, 2013

@y-p slighlty OT, but I think to avoid merge conflicts on the relase notes...I should always add the issue reference at the end of the list (even though it makes it out of order)?

@ghost ghost merged commit bef52df into pandas-dev:master Mar 22, 2013
@ghost
Copy link

ghost commented Mar 22, 2013

exactly the opposite, the diff context is shared by multiple commits that way,
which creates the conflict. randomize to avoid hitting the worst case on average.

@jreback
Copy link
Contributor

jreback commented Mar 22, 2013

@y-p great thxs

@stephenwlin stephenwlin deleted the memmove-limit branch March 22, 2013 17:41
@stephenwlin
Copy link
Contributor Author

@y-p I don't know what's up, but I seem to be getting very different and inconsistent results when using the cache directory by the way...could be a fluke and ymmv, of course

@ghost
Copy link

ghost commented Mar 22, 2013

oh no. how do I repro?

@ghost
Copy link

ghost commented Mar 22, 2013

can you try a sanity check:
λ md5sum /tmp/.pandas_build_cache/* | sort
λ md5sum pandas/*.so | sort

the latter is a subset of the first.

@stephenwlin
Copy link
Contributor Author

hmm, seems ok, could be a fluke

@ghost
Copy link

ghost commented Mar 23, 2013

I've added and pushed a random seed option to test_perf. grepping for "seed"
in the vbench repo came up empty so that ought to have taken care of it.
(It seeds python random as well numpy.random).
all runs now use a default seed , if not specified explicitly.

I'm still seeing the same variability i've always seen with vb.

This pull request was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Performance Memory or execution speed performance
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants