Built-in u128 performance compared to extprim crate #39078

leonardo-m · 2017-01-15T11:28:18Z

This is a performance bug report.

This little program solves a problem derived from this (a ten times larger problem):
https://projecteuler.net/problem=55

The solution (this is not the fastest solution for this problem, but it's a easy to understand one):

#![feature(i128_type)]

fn is_lychrel(mut n: u128) -> bool {
    for _ in 0 .. 50 {
        n += n.to_string().chars().rev().collect::<String>().parse().unwrap();
        if n.to_string() == n.to_string().chars().rev().collect::<String>() {
            return false;
        }
    }
    true
}

fn e55() -> usize {
    (0u32 .. 100_000).filter(|&i| is_lychrel(i.into())).count()
}

fn main() {
    println!("{}", e55() == 6208);
}

On my PC it runs in about 1.12 seconds.

Using an external crate for the u128 bit values, with the same code (extprim = "1.1.0"):

extern crate extprim;
use extprim::u128::u128;

fn is_lychrel(mut n: u128) -> bool {
    for _ in 0 .. 50 {
        n += n.to_string().chars().rev().collect::<String>().parse().unwrap();
        if n.to_string() == n.to_string().chars().rev().collect::<String>() {
            return false;
        }
    }
    true
}

fn e55() -> usize {
    (0u32 .. 100_000).filter(|&i| is_lychrel(i.into())).count()
}

fn main() {
    println!("{}", e55() == 6208);
}

This runs in about 0.72 seconds.

The text was updated successfully, but these errors were encountered:

est31 · 2017-01-15T12:05:36Z

On which target do you run this, and which one was it compiled for? Also, how does the to_string implementation look like? Is it different in the two examples?

est31 · 2017-01-15T12:06:34Z

cc #35118
@nagisa

leonardo-m · 2017-01-15T12:07:57Z

Right. I am using a 64 bit Windows GNU target.

leonardo-m · 2017-01-15T12:14:27Z

I think this is the formatter for that crate:

impl fmt::Display for u128 {
    fn fmt(&self, formatter: &mut fmt::Formatter) -> fmt::Result {
        if self.hi == 0 {
            self.lo.fmt(formatter)
        } else {
            const TEN19: u128 = u128 { lo: 10000000000000000000, hi: 0 };

            let mut buffer = [0u8; 39];
            let mut buf = FormatBuffer::new(&mut buffer);

            let (mid, lo) = div_rem(*self, TEN19);
            if mid.hi == 0 {
                try!(write!(&mut buf, "{}{:019}", mid.lo, lo.lo));
            } else {
                let (hi, mid) = div_rem(mid, TEN19);
                try!(write!(&mut buf, "{}{:019}{:019}", hi.lo, mid.lo, lo.lo));
            }

            formatter.pad_integral(true, "", unsafe { buf.into_str() })
        }
    }
}

Where the div_rem uses a function udivmodti4 from here:

https://github.com/kennytm/extprim/blob/master/src/compiler_rt.rs

nagisa · 2017-01-15T13:34:44Z

There’s at most two places where the difference in runtime could come:

Implementation of the intrinsic function(s);
Overhead due to ABI differences (unlikely to be 2x penalty).

Or some combination of two. While the linked udivmodti4 intrinsic seems fairly similar, one can spot a number of potentially relevant differences (also signature violations) as well.

perf also shows about 55% of runtime is spent on a few particularly hot instructions on my machine. Notably there’s a shr that’s as hot as the hottest divisions, which seems to suggest some pathological case. Some gentle massaging of the intrinsic could help to improve performance of such code immensely.

This issue is applicable to intrinsics in general; all of them could benefit from having some of their low hanging fruit plucked.

est31 · 2017-01-15T14:29:02Z

There is at least one place to do a possible speed optimisation: https://github.com/rust-lang/rust/blob/master/src/libcompiler_builtins/lib.rs#L233

Now that we have a benchmark we can actually compare it. However it should also be applied to the libcompiler_builtins crate, as otherwise the time was wasted once that crate gets merged.

steveklabnik · 2019-10-09T12:48:17Z

Triage: when trying to make a benchmark test out of this, it seemed like the test runner hung... when creating a simple main function that did the calculation, I'm still seeing similar results to @leonardo-m

leonardo-m · 2020-08-29T09:25:14Z

I have updated the first test code (note that N_ITER should be 60, unlike in the old version):

// Compiled with:
// -O -Z mir-opt-level=3 -C target-cpu=native -C lto -C panic=abort -C codegen-units=1

fn euler55b() -> usize {
    const LIMIT: u128 = 100_000;
    const N_ITER: usize = 60;

    fn reverse(n: u128) -> String { n.to_string().chars().rev().collect() }

    fn is_lychrel(mut n: u128) -> bool {
        for _ in 0 .. N_ITER {
            n += reverse(n).parse::<u128>().unwrap();
            if n.to_string() == reverse(n) {
                return false;
            }
        }
        true
    }

    (0 .. LIMIT).filter(|&i| is_lychrel(i)).count()
}

fn main() {
    println!("{}", euler55b() == 6_091);
}

Now it runs in about 0.86 seconds, on the same machine (rustc 1.48.0-nightly d006f57 2020-08-28).

AaronKutch · 2020-09-12T02:45:28Z

Did my code outperform the extprim crate? If so, then the only major remaining problem with divisions is #44545.
Edit: I realized that I need the tweak the inlining more to almost double the performance of __divti3

AaronKutch · 2020-09-12T03:09:34Z

Actually, it does not seem that the performance changes have made it to nightly yet

leonardo-m · 2020-09-13T09:02:58Z

I see no performance change yet (rustc 1.48.0-nightly dbb73f8 2020-09-12).

Amanieu · 2020-09-13T19:50:26Z

I published a new version of compiler-builtins with the new division code.

josephlr · 2020-10-15T10:22:51Z

@Amanieu I looks like Rust isn't getting the asm-related improvements because the "asm" feature isn't being enabled by alloc/std/etc...

I think we should setup compiler-builtins to enable that feature by default.

AaronKutch · 2020-10-15T13:52:23Z

It doesn't look like Rust is getting any improvements at all. I don't see any benchmark improvements with the latest nightly. I could know for sure if I had a way to look at the assembly of __divti3, but the assembly for that isn't shown in a normal cargo rustc --release -- --emit='asm'

Amanieu · 2020-10-15T15:43:45Z

You can try dumping the disassembly of the final executable with objdump -d <file> or llvm-objdump -d <file>. Search for __divti3 in the output.

AaronKutch · 2020-10-15T16:15:30Z

When dividing random 128 bit integers in a test program, it doesn't generate more than 1 divq instruction (that is probably for the output formatting), but I would expect several. Is the latest nightly still not using the most recent compiler-builtins version? Is there a Cargo.toml with a specified compiler-builtins version somewhere or is it supposed to pull the latest one for every nightly build?

AaronKutch · 2020-10-15T16:18:46Z

I should clarify that I should see divq instructions being used regardless of whether the "asm" feature is enabled or not, because I am just using assembly to get more use out of divq than LLVM normally would.

Amanieu · 2020-10-16T15:16:53Z

You need to update the compiler-builtins version in Cargo.lock of rust-lang/rust.

AaronKutch · 2020-10-16T16:20:37Z

I think we should wait until the "asm" feature flag issue is resolved, compiler-builtins PR #384 is merged (which is important for fixing an issue with cranelift and other performance issues), and after that a new version of compiler-builtins is released (which will also include the division tweaks I made).

AaronKutch · 2020-11-20T20:26:29Z

It looks like some of the perf changes (but not all the things since compiler-builtins 1.36) have made it into nightly, what do the benchmarks look like?

leonardo-m · 2020-11-21T20:45:52Z

With the latest Nightly the benchmark above (of Aug 29) can't even be compiled with the specified compiler switches (-Z mir-opt-level=3 fails).

While the original benchmark at the top of this page takes now 0.42 seconds on the same PC and OS.

AaronKutch · 2021-03-30T04:48:09Z

The perf changes have made it into stable Rust by now, what does it look like?

leonardo-m · 2021-03-30T11:05:50Z

Now the code at Aug 29 runs in 0.50-0.52 seconds, and the original (buggy) code runs in 0.43-0.47 seconds (this variability is caused by different compilation flags). I've tried the code that uses extprim, and the performance is the same. So we can close this issue down.

The next LLVM improvement step could be to optimize remainder/divisions by small constants:

pub fn foo(x: u128) -> u8 {
    (x % 10) as _
}

rustc 1.53.0-nightly 4a20eb6 2021-03-28 gives:

foo:
    sub     rsp, 8
    mov     edx, 10
    xor     ecx, ecx
    call    qword ptr [rip + __umodti3@GOTPCREL]
    pop     rcx
    ret

g++ (trunk) 11.0.1 20210329 gives:

#include <stdint.h>
uint8_t foo(const __uint128_t x) {
    return (uint8_t)(x % 10);
}

foo(unsigned __int128):
    mov     r9, rdi
    mov     r8, rsi
    mov     rsi, rdi
    mov     rcx, r9
    mov     rdi, r8
    add     rcx, r8
    movabs  r8, -3689348814741910323
    adc     rcx, 0
    xor     r11d, r11d
    mov     rax, rcx
    mul     r8
    mov     rax, rdx
    and     rdx, -4
    shr     rax, 2
    add     rdx, rax
    mov     rax, r9
    sub     rcx, rdx
    mov     rdx, rdi
    sub     rax, rcx
    sbb     rdx, r11
    mov     r11d, 10
    mov     rcx, rdx
    movabs  rdx, -3689348814741910324
    imul    rcx, r8
    imul    rdx, rax
    add     rcx, rdx
    mul     r8
    add     rdx, rcx
    shrd    rax, rdx, 1
    mul     r11
    sub     rsi, rax
    mov     eax, esi
    ret

Mark-Simulacrum added the I-slow Issue: Problems and improvements with respect to performance of generated code. label May 19, 2017

Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 26, 2017

henninglive mentioned this issue Sep 13, 2017

128-bit integer division with remainder is not combined to a single operation #44545

Open

leonardo-m mentioned this issue Aug 29, 2020

Use less divisions in display u128/i128 #76017

Merged

leonardo-m mentioned this issue Sep 11, 2020

Greatly improve division performance for u128 and other cases rust-lang/compiler-builtins#332

Merged

josephlr mentioned this issue Oct 15, 2020

Move from an "asm" flag to a "no-asm" feature flag rust-lang/compiler-builtins#386

Merged

leonardo-m closed this as completed Mar 30, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Built-in u128 performance compared to extprim crate #39078

Built-in u128 performance compared to extprim crate #39078

leonardo-m commented Jan 15, 2017

est31 commented Jan 15, 2017

est31 commented Jan 15, 2017

leonardo-m commented Jan 15, 2017

leonardo-m commented Jan 15, 2017

nagisa commented Jan 15, 2017

est31 commented Jan 15, 2017

steveklabnik commented Oct 9, 2019

leonardo-m commented Aug 29, 2020

AaronKutch commented Sep 12, 2020 •

edited

Loading

AaronKutch commented Sep 12, 2020

leonardo-m commented Sep 13, 2020

Amanieu commented Sep 13, 2020

josephlr commented Oct 15, 2020

AaronKutch commented Oct 15, 2020

Amanieu commented Oct 15, 2020

AaronKutch commented Oct 15, 2020

AaronKutch commented Oct 15, 2020

Amanieu commented Oct 16, 2020

AaronKutch commented Oct 16, 2020

AaronKutch commented Nov 20, 2020

leonardo-m commented Nov 21, 2020 •

edited

Loading

AaronKutch commented Mar 30, 2021

leonardo-m commented Mar 30, 2021 •

edited

Loading

Built-in u128 performance compared to extprim crate #39078

Built-in u128 performance compared to extprim crate #39078

Comments

leonardo-m commented Jan 15, 2017

est31 commented Jan 15, 2017

est31 commented Jan 15, 2017

leonardo-m commented Jan 15, 2017

leonardo-m commented Jan 15, 2017

nagisa commented Jan 15, 2017

est31 commented Jan 15, 2017

steveklabnik commented Oct 9, 2019

leonardo-m commented Aug 29, 2020

AaronKutch commented Sep 12, 2020 • edited Loading

AaronKutch commented Sep 12, 2020

leonardo-m commented Sep 13, 2020

Amanieu commented Sep 13, 2020

josephlr commented Oct 15, 2020

AaronKutch commented Oct 15, 2020

Amanieu commented Oct 15, 2020

AaronKutch commented Oct 15, 2020

AaronKutch commented Oct 15, 2020

Amanieu commented Oct 16, 2020

AaronKutch commented Oct 16, 2020

AaronKutch commented Nov 20, 2020

leonardo-m commented Nov 21, 2020 • edited Loading

AaronKutch commented Mar 30, 2021

leonardo-m commented Mar 30, 2021 • edited Loading

AaronKutch commented Sep 12, 2020 •

edited

Loading

leonardo-m commented Nov 21, 2020 •

edited

Loading

leonardo-m commented Mar 30, 2021 •

edited

Loading