-
Notifications
You must be signed in to change notification settings - Fork 12.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Built-in u128 performance compared to extprim crate #39078
Comments
On which target do you run this, and which one was it compiled for? Also, how does the |
Right. I am using a 64 bit Windows GNU target. |
I think this is the formatter for that crate:
Where the div_rem uses a function udivmodti4 from here: https://github.com/kennytm/extprim/blob/master/src/compiler_rt.rs |
There’s at most two places where the difference in runtime could come:
Or some combination of two. While the linked
This issue is applicable to intrinsics in general; all of them could benefit from having some of their low hanging fruit plucked. |
There is at least one place to do a possible speed optimisation: https://github.com/rust-lang/rust/blob/master/src/libcompiler_builtins/lib.rs#L233 Now that we have a benchmark we can actually compare it. However it should also be applied to the libcompiler_builtins crate, as otherwise the time was wasted once that crate gets merged. |
Triage: when trying to make a benchmark test out of this, it seemed like the test runner hung... when creating a simple |
I have updated the first test code (note that N_ITER should be 60, unlike in the old version): // Compiled with:
// -O -Z mir-opt-level=3 -C target-cpu=native -C lto -C panic=abort -C codegen-units=1
fn euler55b() -> usize {
const LIMIT: u128 = 100_000;
const N_ITER: usize = 60;
fn reverse(n: u128) -> String { n.to_string().chars().rev().collect() }
fn is_lychrel(mut n: u128) -> bool {
for _ in 0 .. N_ITER {
n += reverse(n).parse::<u128>().unwrap();
if n.to_string() == reverse(n) {
return false;
}
}
true
}
(0 .. LIMIT).filter(|&i| is_lychrel(i)).count()
}
fn main() {
println!("{}", euler55b() == 6_091);
} Now it runs in about 0.86 seconds, on the same machine (rustc 1.48.0-nightly d006f57 2020-08-28). |
Did my code outperform the extprim crate? If so, then the only major remaining problem with divisions is #44545. |
Actually, it does not seem that the performance changes have made it to nightly yet |
I see no performance change yet (rustc 1.48.0-nightly dbb73f8 2020-09-12). |
I published a new version of compiler-builtins with the new division code. |
@Amanieu I looks like Rust isn't getting the asm-related improvements because the I think we should setup compiler-builtins to enable that feature by default. |
It doesn't look like Rust is getting any improvements at all. I don't see any benchmark improvements with the latest nightly. I could know for sure if I had a way to look at the assembly of |
You can try dumping the disassembly of the final executable with |
When dividing random 128 bit integers in a test program, it doesn't generate more than 1 |
I should clarify that I should see |
You need to update the |
I think we should wait until the "asm" feature flag issue is resolved, |
It looks like some of the perf changes (but not all the things since |
With the latest Nightly the benchmark above (of Aug 29) can't even be compiled with the specified compiler switches (-Z mir-opt-level=3 fails). While the original benchmark at the top of this page takes now 0.42 seconds on the same PC and OS. |
The perf changes have made it into stable Rust by now, what does it look like? |
Now the code at Aug 29 runs in 0.50-0.52 seconds, and the original (buggy) code runs in 0.43-0.47 seconds (this variability is caused by different compilation flags). I've tried the code that uses extprim, and the performance is the same. So we can close this issue down. The next LLVM improvement step could be to optimize remainder/divisions by small constants: pub fn foo(x: u128) -> u8 {
(x % 10) as _
} rustc 1.53.0-nightly 4a20eb6 2021-03-28 gives: foo:
sub rsp, 8
mov edx, 10
xor ecx, ecx
call qword ptr [rip + __umodti3@GOTPCREL]
pop rcx
ret g++ (trunk) 11.0.1 20210329 gives: #include <stdint.h>
uint8_t foo(const __uint128_t x) {
return (uint8_t)(x % 10);
} foo(unsigned __int128):
mov r9, rdi
mov r8, rsi
mov rsi, rdi
mov rcx, r9
mov rdi, r8
add rcx, r8
movabs r8, -3689348814741910323
adc rcx, 0
xor r11d, r11d
mov rax, rcx
mul r8
mov rax, rdx
and rdx, -4
shr rax, 2
add rdx, rax
mov rax, r9
sub rcx, rdx
mov rdx, rdi
sub rax, rcx
sbb rdx, r11
mov r11d, 10
mov rcx, rdx
movabs rdx, -3689348814741910324
imul rcx, r8
imul rdx, rax
add rcx, rdx
mul r8
add rdx, rcx
shrd rax, rdx, 1
mul r11
sub rsi, rax
mov eax, esi
ret |
This is a performance bug report.
This little program solves a problem derived from this (a ten times larger problem):
https://projecteuler.net/problem=55
The solution (this is not the fastest solution for this problem, but it's a easy to understand one):
On my PC it runs in about 1.12 seconds.
Using an external crate for the u128 bit values, with the same code (extprim = "1.1.0"):
This runs in about 0.72 seconds.
The text was updated successfully, but these errors were encountered: