Remove branch in optimized is_ascii #74562
Conversation
I think it's the other way around -- the tail being aligned has a probability of 1/8. I only added a bench for one possible misalignment, but there are several.

Edit: To be clear, I'm not opposed to this change in principle, but I think it's worth keeping in mind that the unaligned tail case is the common case (an argument could be made either way about the unaligned head).

Edit 2: It's also worth running these benchmarks on a platform with slower unaligned loads -- this trades an aligned load plus a branch for an unaligned load. I'm not sure that is a good trade.
Ah, I meant that the tail being aligned has a probability of 1/8. Sorry if I am not being clear.
I don't understand how an unaligned tail with a probability of 1/8 is the common case (> 1/2)? Isn't it the other way around? Most of these checks, I believe, run on either user input or data read from somewhere; I don't think users will deliberately choose strings whose lengths are multiples of 8 just to keep the reads aligned.
That's exactly why unaligned tail is the common case. Strings whose length modulo 8 is 1, 2, 3, 4, 5, 6, or 7 all have unaligned tails.
Oh, I thought it was the other way around, sorry, my bad. But still, won't we have a more consistent worst case?
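To make the modulo argument above concrete, here is a small sketch (my own illustration, not code from the PR; `tail_read_is_aligned` is a hypothetical helper). It assumes the buffer itself starts word-aligned, as heap allocations typically do, in which case the final word-sized read at offset `len - 8` is aligned only when `len % 8 == 0`, i.e. 1 length in 8:

```rust
// Hedged illustration: for a word-aligned buffer, is the final word-sized
// read (starting at `len - WORD`) itself aligned?
fn tail_read_is_aligned(len: usize) -> bool {
    const WORD: usize = core::mem::size_of::<usize>(); // 8 on 64-bit targets
    len >= WORD && (len - WORD) % WORD == 0
}

fn main() {
    // Only one length out of every eight has an aligned tail read.
    for len in 8..=16 {
        println!("len = {:2}: tail read aligned = {}", len, tail_read_is_aligned(len));
    }
}
```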
This amount of improvement sounds like it's most likely within the error margin. At this point we would probably need to start measuring cycles (with e.g. llvm-mca) rather than nanoseconds.

Much like @thomcc, I'm worried that the impact of having an unconditional unaligned load will move the performance hit outside of the error margin on architectures where unaligned loads are expensive (and possibly emulated in generated code). One thing we could try in order to mitigate the cost somewhat is to replace the unaligned load with a plain byte-by-byte loop if we are making that part of the code unconditional... though that would probably make the improvement vanish as well.

All that said, I guess I'm not super against the idea of landing this, but it does not look like a slam dunk either.
It's not uncommon for these byte-by-byte loops to be slower than the unaligned load on ARM, so I wouldn't assume right out of the gate that that's the better option.
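For illustration, here is a hedged sketch of the two tail strategies being weighed: an unconditional, possibly-unaligned word load versus a plain byte-by-byte loop over the leftover bytes. The helper names and the `NONASCII_MASK` constant are mine, not libcore identifiers:

```rust
const WORD: usize = core::mem::size_of::<usize>();
// 0x80 in every byte; a set high bit marks a non-ASCII byte.
const NONASCII_MASK: usize = usize::MAX / 255 * 0x80;

// Strategy A: one unconditional, possibly-unaligned word load of the last WORD bytes.
fn tail_is_ascii_unaligned_load(bytes: &[u8]) -> bool {
    assert!(bytes.len() >= WORD);
    let word = unsafe {
        (bytes.as_ptr().add(bytes.len() - WORD) as *const usize).read_unaligned()
    };
    word & NONASCII_MASK == 0
}

// Strategy B: a plain byte-by-byte loop over only the leftover tail bytes.
fn tail_is_ascii_byte_loop(tail: &[u8]) -> bool {
    tail.iter().all(|b| b.is_ascii())
}

fn main() {
    let s = b"exactly seventeen"; // 17 bytes (on 64-bit: two full words plus one leftover byte)
    assert!(tail_is_ascii_unaligned_load(s));
    assert!(tail_is_ascii_byte_loop(&s[16..]));
}
```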
Maybe we should do a benchmark on ARM? Should I try it on my Raspberry Pi and see which is fastest?
The GCC compile farm has ARM machines, but I guess you already know that.
Oh, I didn't know about that. |
I ran the same benchmark on an ARM APM X-Gene Mustang board (gcc115 from https://cfarm.tetaneutral.net/machines/list/#).
☔ The latest upstream changes (presumably #73265) made this pull request unmergeable. Please resolve the merge conflicts.
@pickfire can you please fix the merge conflicts?
Performs slightly better on short or medium byte slices by eliminating the last branch check on `byte_pos == len` and always checking the tail, as it is always at most one `usize`.

Benchmark: before is `libcore`, after is `libcore_new`. It improves medium and short by 1ns but regresses unaligned_tail by 2ns; either way, unaligned_tail only has a tiny 1/8 chance on a 64-bit machine. I don't think we should bet on that; the probability is worse than a dice roll.

test long::case00_libcore ... bench: 38 ns/iter (+/- 1) = 183947 MB/s
test long::case00_libcore_new ... bench: 38 ns/iter (+/- 1) = 183947 MB/s
test long::case01_iter_all ... bench: 227 ns/iter (+/- 6) = 30792 MB/s
test long::case02_align_to ... bench: 40 ns/iter (+/- 1) = 174750 MB/s
test long::case03_align_to_unrolled ... bench: 19 ns/iter (+/- 1) = 367894 MB/s
test medium::case00_libcore ... bench: 5 ns/iter (+/- 0) = 6400 MB/s
test medium::case00_libcore_new ... bench: 4 ns/iter (+/- 0) = 8000 MB/s
test medium::case01_iter_all ... bench: 20 ns/iter (+/- 1) = 1600 MB/s
test medium::case02_align_to ... bench: 6 ns/iter (+/- 0) = 5333 MB/s
test medium::case03_align_to_unrolled ... bench: 5 ns/iter (+/- 0) = 6400 MB/s
test short::case00_libcore ... bench: 7 ns/iter (+/- 0) = 1000 MB/s
test short::case00_libcore_new ... bench: 6 ns/iter (+/- 0) = 1166 MB/s
test short::case01_iter_all ... bench: 5 ns/iter (+/- 0) = 1400 MB/s
test short::case02_align_to ... bench: 5 ns/iter (+/- 0) = 1400 MB/s
test short::case03_align_to_unrolled ... bench: 5 ns/iter (+/- 1) = 1400 MB/s
test unaligned_both::case00_libcore ... bench: 4 ns/iter (+/- 0) = 7500 MB/s
test unaligned_both::case00_libcore_new ... bench: 4 ns/iter (+/- 0) = 7500 MB/s
test unaligned_both::case01_iter_all ... bench: 26 ns/iter (+/- 0) = 1153 MB/s
test unaligned_both::case02_align_to ... bench: 13 ns/iter (+/- 2) = 2307 MB/s
test unaligned_both::case03_align_to_unrolled ... bench: 11 ns/iter (+/- 0) = 2727 MB/s
test unaligned_head::case00_libcore ... bench: 5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_head::case00_libcore_new ... bench: 5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_head::case01_iter_all ... bench: 19 ns/iter (+/- 1) = 1631 MB/s
test unaligned_head::case02_align_to ... bench: 10 ns/iter (+/- 0) = 3100 MB/s
test unaligned_head::case03_align_to_unrolled ... bench: 14 ns/iter (+/- 0) = 2214 MB/s
test unaligned_tail::case00_libcore ... bench: 3 ns/iter (+/- 0) = 10333 MB/s
test unaligned_tail::case00_libcore_new ... bench: 5 ns/iter (+/- 0) = 6200 MB/s
test unaligned_tail::case01_iter_all ... bench: 19 ns/iter (+/- 0) = 1631 MB/s
test unaligned_tail::case02_align_to ... bench: 10 ns/iter (+/- 0) = 3100 MB/s
test unaligned_tail::case03_align_to_unrolled ... bench: 13 ns/iter (+/- 0) = 2384 MB/s

Rough (unfair) maths on the improvement, for fun: 1ns * 7/8 - 2ns * 1/8 = 0.625ns

Inspired by fish and zsh's clever trick of highlighting missing trailing newlines (⏎) and by the branchless implementation of binary_search in Rust.
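In rough terms, the change amounts to something like the following hedged, simplified sketch (not the actual libcore code; the function and constant names are illustrative): process the bulk of the slice as words, then always check the last word-sized window with an unaligned read instead of branching on `byte_pos == len`. The overlap with already-checked bytes is harmless for an all-bytes-ASCII predicate.

```rust
// 0x80 in every byte of a word; a set high bit marks a non-ASCII byte.
const NONASCII_MASK: usize = usize::MAX / 255 * 0x80;

fn is_ascii_sketch(s: &[u8]) -> bool {
    const WORD: usize = core::mem::size_of::<usize>();
    if s.len() < WORD {
        // Too short for a word read; fall back to a byte loop.
        return s.iter().all(|b| b.is_ascii());
    }
    let ptr = s.as_ptr();
    // Head: one unaligned word read covering the first WORD bytes.
    let head = unsafe { (ptr as *const usize).read_unaligned() };
    if head & NONASCII_MASK != 0 {
        return false;
    }
    // Middle: aligned word reads from the first alignment boundary onward.
    let mut i = WORD - (ptr as usize) % WORD;
    while i + WORD <= s.len() {
        let word = unsafe { (ptr.add(i) as *const usize).read() };
        if word & NONASCII_MASK != 0 {
            return false;
        }
        i += WORD;
    }
    // Tail: always check the last WORD bytes with an unaligned read,
    // with no branch on whether the loop already reached the end.
    let tail = unsafe { (ptr.add(s.len() - WORD) as *const usize).read_unaligned() };
    tail & NONASCII_MASK == 0
}

fn main() {
    assert!(is_ascii_sketch(b"hello, ascii world"));
    assert!(!is_ascii_sketch("héllo".as_bytes()));
}
```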
Force-pushed from dc91eb3 to 8ec348a
@bors r+
📌 Commit 8ec348a has been approved by |
☀️ Test successful - checks-actions, checks-azure
cc @thomcc #74066
r? @nagisa