slice::rotate_left/right is not simplified for compile-time-constant small rotates #89714

Open · saethlin opened this issue Oct 9, 2021 · 8 comments
Labels: I-slow (Problems and improvements with respect to performance of generated code)

@saethlin (Member) commented Oct 9, 2021

The current (2021-10-08 nightly) and recent implementations of slice::rotate_right produce 15x more instructions than a manual special-case implementation for a small rotate by a compile-time constant offset. This results in a 5-50% performance difference, depending on the length of the slice in question.

godbolt demonstration: https://godbolt.org/z/a4YdqP1rn

pub fn shift_after_fast(vec: &mut [usize], i: usize) {
    if i > vec.len() || (vec.len() - i) < 2 {
        return;
    }
    unsafe {
        // Save the last element, shift vec[i..len-1] up by one,
        // then drop the saved element into the hole at index i.
        let tmp = core::ptr::read(vec.as_ptr().add(vec.len() - 1));
        core::ptr::copy(
            vec.as_ptr().add(i),
            vec.as_mut_ptr().add(i + 1),
            vec.len() - i - 1,
        );
        core::ptr::write(vec.as_mut_ptr().add(i), tmp);
    }
}

pub fn shift_after(vec: &mut [usize], i: usize) {
    vec[i..].rotate_right(1);
}
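
For a quick sanity check that the two functions agree (a hypothetical test, not part of the issue's benchmark):

fn main() {
    let mut a: Vec<usize> = (0..10).collect();
    let mut b = a.clone();
    shift_after_fast(&mut a, 4);
    shift_after(&mut b, 4);
    // Both move the last element in front of index 4:
    assert_eq!(a, [0, 1, 2, 3, 9, 4, 5, 6, 7, 8]);
    assert_eq!(a, b);
}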

I think the current implementation doesn't optimize well because it is a loop { surrounding three rotate algorithms, each with its own predicate, and the first predicate is almost never known at compile time:

if (left + right < 24) || (mem::size_of::<T>() > mem::size_of::<[usize; 4]>()) {

and it is the second algorithm that we want to run in this case:

} else if cmp::min(left, right) <= mem::size_of::<BufType>() / mem::size_of::<T>() {

Simply pasting a copy of the second algorithm above the loop removes the performance difference and most but not all of the code size difference.
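
For illustration, a minimal sketch (assuming a std-like BufType scratch buffer; this exact helper is hypothetical) of hoisting that second algorithm above the loop:

use core::mem::{self, MaybeUninit};
use core::ptr;

type BufType = [usize; 32];

// Rotate [left block][right block] into [right block][left block], where
// `mid` points at the first element of the right block. The caller must
// ensure the right block fits in the scratch buffer.
unsafe fn rotate_right_buffered<T>(mid: *mut T, left: usize, right: usize) {
    debug_assert!(right * mem::size_of::<T>() <= mem::size_of::<BufType>());
    let mut buf = MaybeUninit::<BufType>::uninit();
    let buf = buf.as_mut_ptr() as *mut T;
    let start = mid.sub(left);
    ptr::copy_nonoverlapping(mid, buf, right); // save the right block
    ptr::copy(start, start.add(right), left); // shift the left block over
    ptr::copy_nonoverlapping(buf, start, right); // restore it at the front
}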

cc @scottmcm I've finally gotten around to reporting this after months

rustc 1.57.0-nightly (54bb4fec6 2021-10-08)
binary: rustc
commit-hash: 54bb4fec68cb592e23077896baea072919721573
commit-date: 2021-10-08
host: x86_64-unknown-linux-gnu
release: 1.57.0-nightly
LLVM version: 13.0.0
@paolobarbolini (Contributor) commented Jan 16, 2022

#92967 should make it better in some cases (if T isn't too big and the rotate amount is known to be within a certain low range at compile time): https://rust.godbolt.org/z/Ph1TWE1jn

@Urgau (Member) commented Jan 16, 2022

@paolobarbolini you might be interested to know that this naive implementation outperforms (at least for u32) both the current std and the fixed std version:

pub fn rorate_left_with_reverse(s: &mut [u32], mid: usize) {
    assert!(mid <= s.len());

    s[..mid].reverse();
    s[mid..].reverse();
    s.reverse();
}
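
A tiny worked example of why three reversals make a rotation (hypothetical values):

fn main() {
    let mut s = [1u32, 2, 3, 4, 5];
    // reverse([1,2])   -> [2, 1, 3, 4, 5]
    // reverse([3,4,5]) -> [2, 1, 5, 4, 3]
    // reverse(whole)   -> [3, 4, 5, 1, 2]
    rorate_left_with_reverse(&mut s, 2);
    assert_eq!(s, [3, 4, 5, 1, 2]); // same as s.rotate_left(2)
}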

Benchmark with s = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] and mid = 4:

armv7:

test rotate_left_fixed   ... bench:          91 ns/iter (+/- 1)
test rotate_left_reverse ... bench:          37 ns/iter (+/- 0)
test rotate_left_std     ... bench:          91 ns/iter (+/- 0)

x86_64:

test rotate_left_fixed   ... bench:          11 ns/iter (+/- 5)
test rotate_left_reverse ... bench:           3 ns/iter (+/- 0)
test rotate_left_std     ... bench:          11 ns/iter (+/- 0)

@Urgau (Member) commented Jan 16, 2022

I know what's going on. My naive implementation doesn't use memcpy or memmove at all, saving the overhead of calling those functions. A quick test suggests that on my machine, with u32, the tipping point (where calling memcpy or memmove starts to pay off) is around 60 ± 5 elements in the slice. This suggests to me that the std implementation could be further improved by avoiding the cost of memcpy and memmove for small slices.
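
One way to avoid that call overhead for small counts is an element-by-element copy loop that the optimizer may keep inline; a rough sketch (hypothetical helper, not std code):

use core::ptr;

// Copy `n` elements without calling memcpy/memmove. Copying back to front
// keeps it correct for the overlapping shift-right case (dst above src).
unsafe fn copy_small<T>(src: *const T, dst: *mut T, n: usize) {
    for i in (0..n).rev() {
        ptr::write(dst.add(i), ptr::read(src.add(i)));
    }
}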

ASM of rorate_left_with_reverse (naive implementation)
example::rorate_left_with_reverse:
        push    rbx
        mov     r8, rsi
        sub     r8, rdx
        jb      .LBB1_26
        cmp     rdx, 2
        jb      .LBB1_9
        mov     r9, rdx
        shr     r9
        cmp     r9, 1
        jne     .LBB1_4
        xor     ecx, ecx
        test    dl, 2
        jne     .LBB1_8
        jmp     .LBB1_9
.LBB1_4:
        lea     rax, [rdi + 4]
        lea     r10, [rdi + 4*rdx]
        add     r10, -4
        and     r9, -2
        neg     r9
        xor     ecx, ecx
.LBB1_5:
        mov     r11d, dword ptr [rax - 4]
        mov     ebx, dword ptr [r10 + 4*rcx]
        mov     dword ptr [rax - 4], ebx
        mov     dword ptr [r10 + 4*rcx], r11d
        mov     r11d, dword ptr [rax]
        mov     ebx, dword ptr [r10 + 4*rcx - 4]
        mov     dword ptr [rax], ebx
        mov     dword ptr [r10 + 4*rcx - 4], r11d
        add     rax, 8
        add     rcx, -2
        cmp     r9, rcx
        jne     .LBB1_5
        neg     rcx
        test    dl, 2
        je      .LBB1_9
.LBB1_8:
        mov     rax, rcx
        not     rax
        add     rax, rdx
        mov     r9d, dword ptr [rdi + 4*rcx]
        mov     ebx, dword ptr [rdi + 4*rax]
        mov     dword ptr [rdi + 4*rcx], ebx
        mov     dword ptr [rdi + 4*rax], r9d
.LBB1_9:
        cmp     r8, 2
        jb      .LBB1_17
        mov     r9, r8
        shr     r9
        cmp     r9, 1
        jne     .LBB1_12
        xor     ecx, ecx
        test    r8b, 2
        jne     .LBB1_16
        jmp     .LBB1_17
.LBB1_12:
        lea     rax, [rdi + 4*rdx]
        add     rax, 4
        lea     r10, [rdi + 4*rsi]
        add     r10, -4
        and     r9, -2
        neg     r9
        xor     ecx, ecx
.LBB1_13:
        mov     r11d, dword ptr [rax - 4]
        mov     ebx, dword ptr [r10 + 4*rcx]
        mov     dword ptr [rax - 4], ebx
        mov     dword ptr [r10 + 4*rcx], r11d
        mov     r11d, dword ptr [rax]
        mov     ebx, dword ptr [r10 + 4*rcx - 4]
        mov     dword ptr [rax], ebx
        mov     dword ptr [r10 + 4*rcx - 4], r11d
        add     rax, 8
        add     rcx, -2
        cmp     r9, rcx
        jne     .LBB1_13
        neg     rcx
        test    r8b, 2
        je      .LBB1_17
.LBB1_16:
        lea     rax, [rdi + 4*rdx]
        mov     rdx, rcx
        not     rdx
        add     r8, rdx
        mov     edx, dword ptr [rax + 4*rcx]
        mov     ebx, dword ptr [rax + 4*r8]
        mov     dword ptr [rax + 4*rcx], ebx
        mov     dword ptr [rax + 4*r8], edx
.LBB1_17:
        cmp     rsi, 2
        jb      .LBB1_25
        mov     r8, rsi
        shr     r8
        cmp     r8, 1
        jne     .LBB1_20
        xor     eax, eax
        test    sil, 2
        jne     .LBB1_24
        jmp     .LBB1_25
.LBB1_20:
        lea     rdx, [rdi + 4]
        lea     rcx, [rdi + 4*rsi]
        add     rcx, -4
        and     r8, -2
        neg     r8
        xor     eax, eax
.LBB1_21:
        mov     r9d, dword ptr [rdx - 4]
        mov     ebx, dword ptr [rcx + 4*rax]
        mov     dword ptr [rdx - 4], ebx
        mov     dword ptr [rcx + 4*rax], r9d
        mov     r9d, dword ptr [rdx]
        mov     ebx, dword ptr [rcx + 4*rax - 4]
        mov     dword ptr [rdx], ebx
        mov     dword ptr [rcx + 4*rax - 4], r9d
        add     rdx, 8
        add     rax, -2
        cmp     r8, rax
        jne     .LBB1_21
        neg     rax
        test    sil, 2
        je      .LBB1_25
.LBB1_24:
        mov     rcx, rax
        not     rcx
        add     rcx, rsi
        mov     edx, dword ptr [rdi + 4*rax]
        mov     esi, dword ptr [rdi + 4*rcx]
        mov     dword ptr [rdi + 4*rax], esi
        mov     dword ptr [rdi + 4*rcx], edx
.LBB1_25:
        pop     rbx
        ret
.LBB1_26:
        lea     rdi, [rip + .L__unnamed_1]
        lea     rdx, [rip + .L__unnamed_2]
        mov     esi, 32
        call    qword ptr [rip + core::panicking::panic@GOTPCREL]
        ud2
ASM of rotate_u8_right1 (fixed std version)
example::rotate_u8_right1:
        push    rbp
        push    r14
        push    rbx
        test    rsi, rsi
        je      .LBB0_11
        mov     rdx, rsi
        add     rdx, -1
        je      .LBB0_10
        mov     r14, rdi
        cmp     rsi, 23
        ja      .LBB0_8
        mov     bpl, byte ptr [r14]
        mov     eax, 1
        mov     ecx, 1
        sub     rcx, rsi
        xor     esi, esi
        mov     edi, 1
        jmp     .LBB0_4
.LBB0_5:
        add     rdi, 1
.LBB0_4:
        mov     ebx, ebp
        movzx   ebp, byte ptr [r14 + rdi]
        mov     byte ptr [r14 + rdi], bl
        cmp     rdi, rdx
        jb      .LBB0_5
        add     rdi, rcx
        je      .LBB0_9
        cmp     rdi, rax
        cmovb   rax, rsi
        cmovb   rdi, rsi
        jmp     .LBB0_4
.LBB0_8:
        lea     rdi, [r14 + 1]
        mov     bpl, byte ptr [r14 + rdx]
        mov     rsi, r14
        call    qword ptr [rip + memmove@GOTPCREL]
.LBB0_9:
        mov     byte ptr [r14], bpl
.LBB0_10:
        pop     rbx
        pop     r14
        pop     rbp
        ret
.LBB0_11:
        lea     rdi, [rip + .L__unnamed_1]
        lea     rdx, [rip + .L__unnamed_2]
        mov     esi, 34
        call    qword ptr [rip + core::panicking::panic@GOTPCREL]
        ud2

@paolobarbolini (Contributor) commented Jan 16, 2022

I have a crate where the number of items to .rotate_right(1) is always < 6 and the input is an array, so I wrote:

/// Same as `array[..=usize::from(n)].rotate_right(1);`
pub fn rotate_right1_fastest(array: &mut [u8; 256], n: u8) {
    debug_assert!(n < 6);

    let n = usize::from(n);

    let b = array[n];
    array[5 - usize::from(n < 5)] = array[4];
    array[4 - usize::from(n < 4)] = array[3];
    array[3 - usize::from(n < 3)] = array[2];
    array[2 - usize::from(n < 2)] = array[1];
    array[1 - usize::from(n < 1)] = array[0];
    array[0] = b;
}
ASM
example::rotate_right1_fastest:
        movzx   eax, sil
        mov     al, byte ptr [rdi + rax]
        mov     cl, byte ptr [rdi + 4]
        cmp     sil, 5
        mov     rdx, rdi
        sbb     rdx, 0
        mov     byte ptr [rdx + 5], cl
        mov     cl, byte ptr [rdi + 3]
        cmp     sil, 4
        mov     rdx, rdi
        sbb     rdx, 0
        mov     byte ptr [rdx + 4], cl
        mov     cl, byte ptr [rdi + 2]
        cmp     sil, 3
        mov     rdx, rdi
        sbb     rdx, 0
        mov     byte ptr [rdx + 3], cl
        cmp     sil, 2
        mov     rcx, rdi
        sbb     rcx, 0
        mov     dl, byte ptr [rdi + 1]
        mov     byte ptr [rcx + 2], dl
        mov     cl, byte ptr [rdi]
        cmp     sil, 1
        mov     rdx, rdi
        sbb     rdx, -1
        mov     byte ptr [rdx], cl
        mov     byte ptr [rdi], al
        ret
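
A quick check of the documented equivalence (hypothetical values):

fn main() {
    let mut table = [0u8; 256];
    for (i, b) in table.iter_mut().enumerate() {
        *b = i as u8;
    }
    rotate_right1_fastest(&mut table, 3);
    // table[..=3] rotated right by one; the rest is untouched.
    assert_eq!(&table[..6], &[3, 0, 1, 2, 4, 5]);
}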

@scottmcm (Member) commented
@Urgau Note that u32 is one of the better cases for reverse. I suspect if you try something like [u8; 3] you'll see that one start to perform noticeably worse.

That said, I did recently rewrite reverse to appear simpler but actually be faster because it could be vectorized (#90821). It might be that the newer LLVM can similarly do better with a different rotate implementation than the current one.

Actually, maybe the swap could be rewritten so it's not manually vectorized, using the same techniques the reverse rewrite used. That might let the original vectorizable implementation shine again, given that the reverse-based one is currently doing well.
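
In that spirit, a hedged sketch of what a non-manually-vectorized swap could look like (hypothetical; not the actual ptr::swap_nonoverlapping code):

// A plain element-wise loop in the style of the #90821 reverse rewrite,
// leaving the vectorization to LLVM instead of doing it by hand.
fn swap_slices<T>(a: &mut [T], b: &mut [T]) {
    assert_eq!(a.len(), b.len());
    for (x, y) in a.iter_mut().zip(b.iter_mut()) {
        core::mem::swap(x, y);
    }
}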

@Urgau (Member) commented Jan 17, 2022

@scottmcm Indeed, I hadn't thought of that.
I benchmarked several implementations on my x86_64 machine; the numbers are very noisy, but I hope they help:

Current std:

running 7 tests
test slice::rotate_16_usize_4                                   ... bench:         309 ns/iter (+/- 48)
test slice::rotate_16_usize_5                                   ... bench:         300 ns/iter (+/- 41)
test slice::rotate_64_usize_4                                   ... bench:       4,016 ns/iter (+/- 471)
test slice::rotate_64_usize_5                                   ... bench:       4,759 ns/iter (+/- 561)
test slice::rotate_rgb                                          ... bench:         342 ns/iter (+/- 36)
test slice::rotate_u8                                           ... bench:         349 ns/iter (+/- 48)
test slice::rotate_usize                                        ... bench:         387 ns/iter (+/- 53)

Nearly original (ManuallyDrop + nits):

running 7 tests
test slice::rotate_16_usize_4                                   ... bench:         248 ns/iter (+/- 98)
test slice::rotate_16_usize_5                                   ... bench:         317 ns/iter (+/- 41)
test slice::rotate_64_usize_4                                   ... bench:       4,117 ns/iter (+/- 1,414)
test slice::rotate_64_usize_5                                   ... bench:       7,041 ns/iter (+/- 5,687)
test slice::rotate_rgb                                          ... bench:         318 ns/iter (+/- 34)
test slice::rotate_u8                                           ... bench:         325 ns/iter (+/- 174)
test slice::rotate_usize                                        ... bench:         406 ns/iter (+/- 102)

Nearly original + with original ptr_swap_n:

running 7 tests
test slice::rotate_16_usize_4                                   ... bench:         228 ns/iter (+/- 37)
test slice::rotate_16_usize_5                                   ... bench:         306 ns/iter (+/- 113)
test slice::rotate_64_usize_4                                   ... bench:       3,834 ns/iter (+/- 379)
test slice::rotate_64_usize_5                                   ... bench:       6,107 ns/iter (+/- 688)
test slice::rotate_rgb                                          ... bench:         338 ns/iter (+/- 138)
test slice::rotate_u8                                           ... bench:         325 ns/iter (+/- 121)
test slice::rotate_usize                                        ... bench:         358 ns/iter (+/- 39)

Naive implementation:

running 7 tests
test slice::rotate_16_usize_4                                   ... bench:         324 ns/iter (+/- 41)
test slice::rotate_16_usize_5                                   ... bench:         493 ns/iter (+/- 38)
test slice::rotate_64_usize_4                                   ... bench:       5,052 ns/iter (+/- 407)
test slice::rotate_64_usize_5                                   ... bench:       8,078 ns/iter (+/- 558)
test slice::rotate_rgb                                          ... bench:       1,285 ns/iter (+/- 142)
test slice::rotate_u8                                           ... bench:         569 ns/iter (+/- 152)
test slice::rotate_usize                                        ... bench:         522 ns/iter (+/- 50)

Conclusion: unsurprisingly, the current implementation seems to be better overall.

@clubby789 (Contributor) commented
@rustbot label +I-slow

@rustbot added the I-slow label Mar 30, 2023
@ghost commented May 13, 2023

There is another problem when the type is big, for example:

struct BallastType {
    ballast: [u64; 32], // the compiler keeps this, even if unused
    value: usize,
}

The algorithm switches immediately to the gcd algorithm because of (mem::size_of::<T>() > mem::size_of::<[usize; 4]>()).
But for this type that is much slower (a factor of 2 or more) than triple reversal or the default method, which block-swaps step by step.
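
For concreteness, on a 64-bit target this type trips that gate by a wide margin (a small check, assuming the BallastType above):

use core::mem;

fn main() {
    // 32 * 8 + 8 = 264 bytes, versus the 32-byte [usize; 4] threshold.
    assert!(mem::size_of::<BallastType>() > mem::size_of::<[usize; 4]>());
}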

I think the idea behind using the gcd algorithm here was that big types are more costly to copy around, and the gcd algorithm requires the optimal number of moves for all inputs. But it should be kept in mind that it is highly cache-inefficient, and that far fewer big elements fit in cache, so cache misses have an even higher impact for a slice of the same length than they would with a smaller type.

I also noticed that when the gcd algorithm is used with smaller types, whether it is faster than the other methods depends on the ratio of the two lengths.

I don't think the gcd method should be gated the way it currently is: if the type is too big, it slows things down considerably.
