x86(64) runtime performance irregularities #31503

Closed
MagaTailor opened this issue Feb 9, 2016 · 6 comments
Labels
I-slow Issue: Problems and improvements with respect to performance of generated code.

Comments

@MagaTailor

Mandel-rust benchmark produces the following results:

https://gist.github.com/petevine/b70b6e5a434f23b40ab5

TL;DR
32-bit code performance, fastest first:
P2(3) > Core2 > P4 (x86_64 too)

P2(3) are the only ones that scale on 2 cores in all benchmarks.

Either LLVM is buggy, or I was more right than I'd ever suspected about P4 codegen producing suboptimal code. (x86_64 is affected too, so it could be something else.)

Naturally, the common theme could be the use of SSE2, which is absent from the fastest code:

Configuration: re1: -2.00, re2: 1.00, img1: -1.50, img2: 1.50, max_iter: 2048, img_size: 1024, num_threads: 2
Time taken for this run (serial): 2469.21302 ms
Time taken for this run (scoped_thread_pool): 1248.45883 ms
Time taken for this run (simple_parallel): 1284.73761 ms
Time taken for this run (rayon_join): 1246.36625 ms
Time taken for this run (rayon_par_iter): 1337.93075 ms
Time taken for this run (rust_scoped_pool): 1240.33273 ms
Time taken for this run (job_steal): 1241.20777 ms
Time taken for this run (job_steal_join): 1246.34885 ms
Time taken for this run (kirk_crossbeam): 1244.10723 ms
@MagaTailor MagaTailor changed the title x86 runtime performance irregularities x86(64) runtime performance irregularities Feb 9, 2016
@MagaTailor
Author

Adding -C target-feature=+sse2 to -C target-cpu=pentium2 immediately destroys performance (compared to using just the first flag):

Configuration: re1: -2.00, re2: 1.00, img1: -1.50, img2: 1.50, max_iter: 2048, img_size: 1024, num_threads: 2
Time taken for this run (serial): 2762.07442 ms
Time taken for this run (scoped_thread_pool): 2256.02250 ms
Time taken for this run (simple_parallel): 2244.52544 ms
Time taken for this run (rayon_join): 1429.93142 ms
Time taken for this run (rayon_par_iter): 1392.21168 ms
Time taken for this run (rust_scoped_pool): 2252.52324 ms
Time taken for this run (job_steal): 2268.11608 ms
Time taken for this run (job_steal_join): 1417.85656 ms
Time taken for this run (kirk_crossbeam): 2259.48977 ms

Looks like the autovectorizer might be too eager, and the availability of SSE2 is actually detrimental.
Using target-feature=+sse4.1 improves performance but doesn't recover all of it on x86_64.
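
To put the two runs quoted above side by side, here is a minimal sketch that computes the 2-thread speedup (serial time divided by parallel time) for two of the benchmarks. The timings are copied from the runs in this thread; the `speedup` helper and the run labels are just illustrative names, not part of mandel-rust.

```rust
/// Parallel speedup: serial time divided by parallel time.
fn speedup(serial_ms: f64, parallel_ms: f64) -> f64 {
    serial_ms / parallel_ms
}

fn main() {
    // (label, serial, scoped_thread_pool, rayon_join), all in ms,
    // copied from the two runs quoted in this thread (num_threads: 2).
    let runs = [
        ("without SSE2", 2469.21302, 1248.45883, 1246.36625),
        ("with +sse2", 2762.07442, 2256.02250, 1429.93142),
    ];

    for &(label, serial, scoped, rayon_join) in runs.iter() {
        println!(
            "{}: scoped_thread_pool {:.2}x, rayon_join {:.2}x",
            label,
            speedup(serial, scoped),
            speedup(serial, rayon_join)
        );
    }
}
```

A value close to 2 means the benchmark scales on both cores; a value close to 1 means the second thread buys almost nothing.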

@steveklabnik steveklabnik added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Feb 15, 2016
@MagaTailor
Author

#35662 (comment)

@MagaTailor
Author

MagaTailor commented Aug 19, 2016

@eddyb I've profiled just the affected benchmarks together, and turning SSE2 on causes almost a 2x slowdown in multithreaded code.
x87-profile.txt
sse2-profile.txt

@MagaTailor
Author

@eddyb As the single-threaded version of this benchmark is also affected (20% slower), I profiled just that and produced the main assembly files annotated with operf. I hope they give you a clue as to what's wrong here.

x87_asm.txt
sse2_asm.txt
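
For readers without the attachments: the hot loop in a Mandelbrot benchmark is typically an escape-time iteration over f64 values, roughly like the sketch below. This is not the mandel-rust source; the function name and loop layout are assumptions for illustration, and only the parameter names (re1, re2, img1, img2, max_iter, img_size) follow the configuration line printed by the benchmark.

```rust
/// Escape-time iteration count for one point c = (cr, ci).
/// Hypothetical kernel, not taken from mandel-rust.
fn mandel_iter(max_iter: u32, cr: f64, ci: f64) -> u32 {
    let (mut re, mut im) = (0.0f64, 0.0f64);
    let mut iter = 0;
    // Iterate z = z^2 + c until |z|^2 exceeds 4 or max_iter is reached.
    while iter < max_iter && re * re + im * im <= 4.0 {
        let re_new = re * re - im * im + cr;
        im = 2.0 * re * im + ci;
        re = re_new;
        iter += 1;
    }
    iter
}

fn main() {
    // Same region and max_iter as the configuration line in this thread;
    // img_size is reduced from 1024 so the sketch runs quickly.
    let (re1, re2, img1, img2) = (-2.0, 1.0, -1.5, 1.5);
    let (max_iter, img_size) = (2048u32, 64usize);

    let mut total: u64 = 0;
    for y in 0..img_size {
        for x in 0..img_size {
            let cr = re1 + (re2 - re1) * x as f64 / img_size as f64;
            let ci = img1 + (img2 - img1) * y as f64 / img_size as f64;
            total += mandel_iter(max_iter, cr, ci) as u64;
        }
    }
    println!("total iterations: {}", total);
}
```

Whether the multiplies and adds in that loop are emitted as x87 or SSE2 instructions depends on the target flags, which is the difference the two assembly listings are meant to show.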

@eddyb
Member

eddyb commented Aug 21, 2016

Honestly, I don't know what to say other than maybe LLVM's cost model is inaccurate for your CPU?

cc @rust-lang/compiler @pcwalton

@MagaTailor
Author

More like something inhibiting optimisations in the default Pentium 4 or generic x86_64 codegen (or simply with +sse2). I was inquisitive enough to discover that basic i686 produces the fastest code for this codebase (and that includes many multithreaded libs like rayon and crossbeam).
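
When comparing builds like these, it helps to confirm which features a given binary was actually compiled with. A minimal sketch using the compile-time `cfg!(target_feature = "...")` check; the `enabled_features` helper is a hypothetical name, and only the two features discussed in this thread are probed.

```rust
/// Returns the subset of the probed target features that were
/// enabled at compile time (e.g. via -C target-cpu / -C target-feature).
fn enabled_features() -> Vec<&'static str> {
    let mut features = Vec::new();
    if cfg!(target_feature = "sse2") {
        features.push("sse2");
    }
    if cfg!(target_feature = "sse4.1") {
        features.push("sse4.1");
    }
    features
}

fn main() {
    println!("arch: {}", std::env::consts::ARCH);
    println!("compile-time features: {:?}", enabled_features());
}
```

Printing this at benchmark start-up makes it harder to mix up results from differently flagged builds.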

3 participants