PERF: Limit memmove to >= 256 bytes, relax contiguity requirements #3130
Conversation
64 bit linux, gcc 4.7.2, sse2
repeated run
third run
still wondering about repeatability of vbenches...
@y-p, weird, it doesn't revert the 50% regression (on 64-bit) in
@y-p isn't the point of your build CACHER to not rebuild the cython code? So we can't use it here, because we WANT to rebuild.
@stephenwlin time to upgrade my gcc I guess!
@jreback, nothing regressed on the high end, right?
@jreback anyway, I guess we reverted the regression found originally, so that's good, without causing another one or undoing the optimizations in cases where they're useful (as far as we know...); we definitely need to find ways to be more systematic about low-level testing, though.
I'll post my 2nd run in a min
no negative results
even 20% can be random (because of the random number generation)
running the same commits as @y-p (as I was running your 1st memmove commit), though I think you said just the comment changed... will post in a few
@jreback just the comment changed, I diffed to double check |
virtually identical to my post above: 4bc26b48d9eec28efe0b43e1e314940052de8c1d to a6db105
modulo bugs, the cython cache hack hashes all the input files that go into building the .so.
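The cache scheme as described (hash every input that feeds into the .so, rebuild on any change) can be sketched roughly as follows; `inputs_digest` is an illustrative name, not pandas' actual implementation:

```python
import hashlib

def inputs_digest(paths):
    """Illustrative: combine the contents of every build input into one
    digest, so a change to any .pyx/.pxd input invalidates the cache."""
    h = hashlib.sha1()
    for p in sorted(paths):  # sort so file ordering can't change the hash
        with open(p, 'rb') as f:
            h.update(f.read())
    return h.hexdigest()
```

A cached .so would then be reused only when the digest of its inputs matches the digest recorded when it was built.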
@stephenwlin , give me a commit hash, and I'll rerun an average of 3. |
re: the diff between jeff's machine and mine, it demonstrated that there are explicit differences in the memmove family between ubuntu
Nice work. Maybe in like 2017 we'll have an ATLAS-like tuning system for pandas |
@wesm low-level tuning is fun :) my senior thesis was on compiler optimization, specifically specialization based on propagating non-constant but range-limited value information... virtual table pointers in my case, but this is the same general concept. I'm very, very disappointed with gcc here that it can't figure this out... I might just go patch gcc (or clang, if it doesn't do this either... the codebase is much cleaner) for this. That doesn't help us anytime soon, but it could by 2017.
for consistency 4bc26b4 to 05a737d (v0.10.1)
I'm getting too much variability in results on my machine between iterations. Just go with jeff's results. Sorry.
btw, these results:
are most likely due to keeping the output f-contiguous when the input is f-contiguous: I found that it helps even when not doing the memmove optimization; it's likely due to cache issues. This is why putting that change in was a win independently of doing memmove (both had noticeable positive effects independently; they only had a negative effect when combined).
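The idea of matching the output's memory order to the input's can be shown with a small NumPy sketch (the function name here is ours, not a pandas internal):

```python
import numpy as np

def empty_like_ordered(values, shape):
    # Illustrative: preserve F-contiguity in the output when the input
    # is F-contiguous, so the copy walks memory in a cache-friendly order.
    order = 'F' if values.flags.f_contiguous else 'C'
    return np.empty(shape, dtype=values.dtype, order=order)
```

With an F-contiguous input, column-by-column copies into an F-ordered output touch contiguous memory, which is the cache effect described above.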
Nope, great job stephen. I split up the cleanups from the real changes in a branch on my fork: stephenwlin-memmove-limit
I say bombs away whenever you guys are satisfied with this.
bow to git fu master @y-p |
why, thank you, grasshopper. I use some tools which might make you more productive, and I will merge.
@y-p slightly OT, but I think to avoid merge conflicts on the release notes... I should always add the issue reference at the end of the list (even though it makes it out of order)?
exactly the opposite: the diff context is shared by multiple commits that way.
@y-p great thxs |
@y-p I don't know what's up, but I seem to be getting very different and inconsistent results when using the cache directory by the way...could be a fluke and ymmv, of course |
oh no. how do I repro? |
can you try a sanity check: the latter is a subset of the first. |
hmm, seems ok, could be a fluke |
I've added and pushed a random seed option to test_perf (grep for "seed"). I'm still seeing the same variability I've always seen with vb.
As per #3089, for now I'm putting in a "magic" lower limit for memmove of 256 bytes (32 bytes was the case in the affected tests), which seems reasonable from local testing (::dubious::), unless someone on StackOverflow gives me a better idea of what to do.
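The dispatch described above can be sketched in Python (the real code is Cython calling C `memmove`; `MEMMOVE_LIMIT` and `take_rows` are illustrative names, not the actual pandas symbols):

```python
import numpy as np

MEMMOVE_LIMIT = 256  # illustrative "magic" lower bound, in bytes

def take_rows(values, indexer, out):
    """Copy selected rows: bulk-copy (the memmove analogue) only when a
    row is at least MEMMOVE_LIMIT bytes; otherwise fall back to the
    element-wise loop, which is faster for tiny copies."""
    row_nbytes = values.shape[1] * values.itemsize
    for i, idx in enumerate(indexer):
        if row_nbytes >= MEMMOVE_LIMIT:
            out[i, :] = values[idx, :]   # stands in for one memmove call
        else:
            for j in range(values.shape[1]):
                out[i, j] = values[idx, j]
    return out
```

The point of the threshold is exactly the regression seen here: for small rows, memmove's per-call overhead outweighs the benefit of a bulk copy.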
Also, I realized that only the stride in the dimension of the copy matters (i.e. the entire array doesn't have to be contiguous, only the copied subarrays do), so I relaxed that requirement (non-contiguous cases don't seem to be covered by our performance regression tests, which are unfortunately pretty shallow, but they do happen often in practice... this should be addressed by #3114).
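The relaxed requirement (only the copied subarrays need to be contiguous, not the whole array) can be seen directly in NumPy's strides:

```python
import numpy as np

arr = np.arange(24, dtype=np.int64).reshape(4, 6)
sub = arr[:, :4]   # column slice: the array as a whole is not contiguous

# Whole-array contiguity is lost, because rows are no longer adjacent...
assert not sub.flags['C_CONTIGUOUS']
# ...but each row that would be copied is still contiguous (unit stride
# along axis 1), so a per-row memmove remains valid.
assert sub.strides[1] == sub.itemsize
```

Checking only the stride along the copy axis, rather than full contiguity, is what lets the fast path apply to slices like this.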
Here are the vbench results on the low (improved) end (<90%):
and the high (regressed) end (>105%, as there were no cases of >110%):
I suspect the last results are just noise.
This is 32-bit Linux, GCC 4.6.3; mileage may vary (still haven't set up a 64-bit environment). If anyone else could test this commit too, that would be great.
EDIT
repeat run results:
and