
speed up mem::swap #40454

Merged
merged 16 commits into from
Jun 11, 2017

Conversation

djzin
Contributor

@djzin djzin commented Mar 12, 2017

I would have thought that the mem::swap code didn't need an intermediate variable, precisely because the pointers are guaranteed never to alias. And... it doesn't! It seems that llvm will also auto-vectorize this case for large structs, but alas it doesn't seem to have all the aliasing info it needs, so it will add redundant checks (and not even bother with auto-vectorizing for small types). Looks like a lot of performance could still be gained here, so this might be a good test case for future optimizer improvements.
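
For reference, the existing implementation goes through a full temporary of type T; a minimal sketch of that shape (reconstructed for illustration, not the verbatim std source):

use std::{mem, ptr};

// Sketch of the temporary-based swap being replaced (reconstructed; details may differ).
pub fn swap_via_temp<T>(x: &mut T, y: &mut T) {
    unsafe {
        // Scratch space the size of a whole T.
        let mut t: T = mem::uninitialized();
        // Move *x into the scratch, *y into *x, and the scratch into *y.
        ptr::copy_nonoverlapping(&*x, &mut t, 1);
        ptr::copy_nonoverlapping(&*y, x, 1);
        ptr::copy_nonoverlapping(&t, y, 1);
        // The scratch was only a bitwise copy; don't run its destructor.
        mem::forget(t);
    }
}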

Here are the current benchmarks for the SIMD version of mem::swap; the timings are in cycles (code below), measured with 10 iterations. For sizes > 32 that are not a multiple of 8, the old code tends to be ever so slightly faster, but not always. For large struct sizes (> 1024) the new code shows a marked improvement.

* = latest commit
† = subtracted from other measurements

| arr_length | noop† | rust_stdlib | simd_u64x4* | simd_u64x8 |
|------------|-------|-------------|-------------|------------|
| 8 | 80 | 90 | 90 | 90 |
| 16 | 72 | 177 | 177 | 177 |
| 24 | 32 | 76 | 76 | 76 |
| 32 | 68 | 188 | 112 | 188 |
| 40 | 32 | 80 | 60 | 80 |
| 48 | 32 | 84 | 56 | 84 |
| 56 | 32 | 108 | 72 | 108 |
| 64 | 32 | 108 | 72 | 76 |
| 72 | 80 | 350 | 220 | 230 |
| 80 | 80 | 350 | 220 | 230 |
| 88 | 80 | 420 | 270 | 270 |
| 96 | 80 | 420 | 270 | 270 |
| 104 | 80 | 500 | 320 | 320 |
| 112 | 80 | 490 | 320 | 320 |
| 120 | 72 | 528 | 342 | 342 |
| 128 | 48 | 360 | 234 | 234 |
| 136 | 72 | 987 | 387 | 387 |
| 144 | 80 | 1070 | 420 | 420 |
| 152 | 64 | 856 | 376 | 376 |
| 160 | 68 | 804 | 400 | 400 |
| 168 | 80 | 1060 | 520 | 520 |
| 176 | 80 | 1070 | 520 | 520 |
| 184 | 32 | 464 | 228 | 228 |
| 192 | 32 | 504 | 228 | 228 |
| 200 | 32 | 440 | 248 | 248 |
| 208 | 72 | 987 | 573 | 573 |
| 216 | 80 | 1464 | 220 | 220 |
| 224 | 48 | 852 | 450 | 450 |
| 232 | 72 | 1182 | 666 | 666 |
| 240 | 32 | 428 | 288 | 288 |
| 248 | 32 | 428 | 308 | 308 |
| 256 | 80 | 860 | 770 | 770 |
| 264 | 80 | 1130 | 820 | 820 |
| 272 | 80 | 1340 | 820 | 820 |
| 280 | 80 | 1220 | 870 | 870 |
| 288 | 72 | 1227 | 804 | 804 |
| 296 | 72 | 1356 | 849 | 849 |

@rust-highfive
Collaborator

Thanks for the pull request, and welcome! The Rust team is excited to review your changes, and you should hear from @sfackler (or someone else) soon.

If any changes to this PR are deemed necessary, please add them as extra commits. This ensures that the reviewer can see what has changed since they last reviewed the code. Due to the way GitHub handles out-of-date commits, this should also make it reasonably obvious what issues have or haven't been addressed. Large or tricky changes may require several passes of review and changes.

Please see the contribution instructions for more information.

@djzin
Contributor Author

djzin commented Mar 12, 2017

Hm, got to be careful with the benchmarks. If llvm sucks (and it might), then what was previously optimized away in for loops, because they used memmove intrinsics that llvm special-cases, might now not be optimized. The relevant code is here:

#[stable(feature = "rust1", since = "1.0.0")]
impl<A: Step> Iterator for ops::Range<A> where
    for<'a> &'a A: Add<&'a A, Output = A>
{
    type Item = A;

    #[inline]
    fn next(&mut self) -> Option<A> {
        if self.start < self.end {
            let mut n = self.start.add_one();
            // this line may not be optimized as well now for simple integer types
            // note this was the cause of unintentional recursion before ;)
            mem::swap(&mut n, &mut self.start); 
            Some(n)
        } else {
            None
        }
    }

    #[inline]
    fn size_hint(&self) -> (usize, Option<usize>) {
        match Step::steps_between_by_one(&self.start, &self.end) {
            Some(hint) => (hint, Some(hint)),
            None => (0, None)
        }
    }
}

@djzin
Contributor Author

djzin commented Mar 12, 2017

Testing confirms that llvm is unable to optimize loops as well when this implementation of mem::swap is used. For some reason even a manual implementation of swap -

pub fn swap<T: Copy>(x: &mut T, y: &mut T) {
    let t = *x;
    *x = *y;
    *y = t;
}

performs slower than the llvm intrinsic version. Definitely something to investigate...

@djzin
Contributor Author

djzin commented Mar 12, 2017

OK, so the reason this simple swap function was slower seems to be that copy_nonoverlapping tells llvm that the pointers do not alias in cases where it can't work that out on its own. The reason the xor-swap was faster was simply down to cache locality, so I wrote a new function.
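
For reference, a byte-wise xor-swap of the kind referred to above looks roughly like this (illustrative helper, not code from this PR):

use std::mem;

// Illustrative xor-swap: no scratch storage, but every step depends on the previous one.
pub fn xor_swap<T>(x: &mut T, y: &mut T) {
    unsafe {
        let x = x as *mut T as *mut u8;
        let y = y as *mut T as *mut u8;
        for i in 0..mem::size_of::<T>() as isize {
            // Classic trick: a ^= b; b ^= a; a ^= b; swaps the two bytes in place.
            *x.offset(i) ^= *y.offset(i);
            *y.offset(i) ^= *x.offset(i);
            *x.offset(i) ^= *y.offset(i);
        }
    }
}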

@djzin
Contributor Author

djzin commented Mar 12, 2017

I am trying to get something close to this: for an array of 2048 chars clang spits out the following:

#include <array>

void swap(std::array<char, 2048>&__restrict__ x, std::array<char, 2048>&__restrict__ y) {
    std::swap(x, y);
}
clang -std=c++11 swap.cpp -S -O3
# BB#0:
    xorl    %eax, %eax
    .p2align    4, 0x90
.LBB0_1:                                # =>This Inner Loop Header: Depth=1
    movups  (%rdi,%rax), %xmm0
    movups  16(%rdi,%rax), %xmm1
    movups  (%rsi,%rax), %xmm2
    movups  16(%rsi,%rax), %xmm3
    movups  %xmm2, (%rdi,%rax)
    movups  %xmm3, 16(%rdi,%rax)
    movups  %xmm0, (%rsi,%rax)
    movups  %xmm1, 16(%rsi,%rax)
    movups  32(%rdi,%rax), %xmm0
    movups  48(%rdi,%rax), %xmm1
    movups  32(%rsi,%rax), %xmm2
    movups  48(%rsi,%rax), %xmm3
    movups  %xmm2, 32(%rdi,%rax)
    movups  %xmm3, 48(%rdi,%rax)
    movups  %xmm0, 32(%rsi,%rax)
    movups  %xmm1, 48(%rsi,%rax)
    addq    $64, %rax
    cmpq    $2048, %rax             # imm = 0x800
    jne .LBB0_1
# BB#2:
    retq

This seems like a nice, efficient swap utilizing 4 registers. As I understand it, movups will be just as efficient as movaps if the pointers are in fact aligned.

In Rust, on the other hand, I get code like the following:

use std::{mem, ptr};

fn swap<T>(x: &mut T, y: &mut T) {
    unsafe {
        // Give ourselves some scratch space to work with
        let mut t: [u8; 16] = mem::uninitialized();

        let x = x as *mut T as *mut u8;
        let y = y as *mut T as *mut u8;
        let t = &mut t as *mut _ as *mut u8;

        // can't use a for loop as the `range` impl calls `mem::swap` recursively
        let len = mem::size_of::<T>() as isize;
        let mut i = 0;
        while i + 16 <= len {
            // Perform the swap 16 bytes at a time, `&mut` pointers never alias
            ptr::copy_nonoverlapping(x.offset(i), t, 16);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), 16);
            ptr::copy_nonoverlapping(t, y.offset(i), 16);
            i += 16;
        }
        if i < len {
            // Swap any remaining bytes
            let rem = (len - i) as usize;
            ptr::copy_nonoverlapping(x.offset(i), t, rem);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
            ptr::copy_nonoverlapping(t, y.offset(i), rem);
        }
    }
}

#[no_mangle]
pub fn swap_array(x: &mut [u8; 2048], y: &mut [u8; 2048]) {
    swap(x, y)
}
rustc swap.rs --crate-type=lib --emit=asm -C opt-level=3
swap_array:
    .cfi_startproc
    xorl    %eax, %eax
    .p2align    4, 0x90
.LBB0_1:
    movups  (%rdi,%rax), %xmm0
    movaps  %xmm0, -24(%rsp)
    movups  (%rsi,%rax), %xmm0
    movups  %xmm0, (%rdi,%rax)
    movaps  -24(%rsp), %xmm0
    movups  %xmm0, (%rsi,%rax)
    movups  16(%rdi,%rax), %xmm0
    movaps  %xmm0, -24(%rsp)
    movups  16(%rsi,%rax), %xmm0
    movups  %xmm0, 16(%rdi,%rax)
    movaps  -24(%rsp), %xmm0
    movups  %xmm0, 16(%rsi,%rax)
    movups  32(%rdi,%rax), %xmm0
    movaps  %xmm0, -24(%rsp)
    movups  32(%rsi,%rax), %xmm0
    movups  %xmm0, 32(%rdi,%rax)
    movaps  -24(%rsp), %xmm0
    movups  %xmm0, 32(%rsi,%rax)
    movups  48(%rdi,%rax), %xmm0
    movaps  %xmm0, -24(%rsp)
    movups  48(%rsi,%rax), %xmm0
    movups  %xmm0, 48(%rdi,%rax)
    movaps  -24(%rsp), %xmm0
    movups  %xmm0, 48(%rsi,%rax)
    leaq    64(%rax), %rcx
    addq    $80, %rax
    cmpq    $2049, %rax
    movq    %rcx, %rax
    jl  .LBB0_1
    retq

It is for some reason not eliminating the stores to and loads from the stack (the variable t stays on the stack and is not eliminated). For reference, I got the same result with clang on an equivalent C program, so it's not Rust's fault. I also tried adding an extra s variable to mirror what the C++ code was doing, but to no avail.

@djzin
Contributor Author

djzin commented Mar 12, 2017

I can get it to work by forcing the alignment of t using #[repr(simd)], but there wasn't any meaningful performance difference in practice, no matter how I unrolled the loops or what size SIMD struct I used. So although this code looks much worse, in practice it performs pretty much optimally.

        }
        if i < len {
            // Swap any remaining bytes
            let rem = (len - i) as usize;
Contributor

I don't know if it ever matters, but it occurs to me that calculating `rem` as `len % 16` instead of `len - i` might be more obvious to an optimizer, as knowing what `len - i` is requires reasoning about a loop. Similarly, it would be better to test whether `rem == 0` than to test `i < len`, as it is more trivially determined at compile time.

I doubt this matters in most cases, but I could see the optimizer failing when the size is large, since the loop might not be fully unrolled.
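
Concretely, the suggestion amounts to something like this (illustrative sketch, not the PR's code):

// Hypothetical helper showing the suggested arithmetic: the tail length is
// `len % 16`, known without reasoning about the loop, and the tail copy is
// skipped when it is exactly zero.
fn split_into_blocks(len: usize) -> (usize, usize) {
    let rem = len % 16;   // tail bytes left over after whole 16-byte blocks
    let full = len - rem; // bytes handled by the main block loop
    (full, rem)
}

fn main() {
    assert_eq!(split_into_blocks(40), (32, 8));
    assert_eq!(split_into_blocks(32), (32, 0)); // rem == 0: no tail copy needed
}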

Contributor Author

I tested it; looks like the generated assembly is the same either way. `len` here is a constant known at compile time, so I guess it just gets completely eliminated.

@@ -448,17 +448,29 @@ pub unsafe fn uninitialized<T>() -> T {
 pub fn swap<T>(x: &mut T, y: &mut T) {
     unsafe {
         // Give ourselves some scratch space to work with
-        let mut t: T = uninitialized();
+        let mut t: [u8; 16] = uninitialized();
Contributor

Minor point: Instead of using a literal 16 here, it might be slightly better to stick a

const SWAP_BLOCK_SIZE: usize = 16;

at the top of the function. This would make it slightly less error-prone to tweak the value later, and it would make it easy to, e.g., pick different sizes for different architectures.
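
For illustration, the constant could even be picked per target with cfg attributes; the values below are made up, not measured:

// Hypothetical per-architecture block sizes (values invented for illustration).
#[cfg(target_arch = "x86_64")]
const SWAP_BLOCK_SIZE: usize = 32;
#[cfg(not(target_arch = "x86_64"))]
const SWAP_BLOCK_SIZE: usize = 16;

fn main() {
    println!("swapping {} bytes at a time", SWAP_BLOCK_SIZE);
}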

Contributor Author

this is a good suggestion, thanks

Contributor Author

For reference, the actual block size didn't matter too much, but that was just on my machine; 16 may very well not be optimal.

@eddyb
Member

eddyb commented Mar 13, 2017

cc @rust-lang/compiler This makes it much harder to ever perform MIR optimizations here.
I'm torn, especially since we could've had those optimizations already, but we don't.

@arthurprs
Contributor

arthurprs commented Mar 14, 2017

This is extremely performance-sensitive stuff (used in HashMap, for example), thanks for working on it.

@alexcrichton alexcrichton added the T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. label Mar 14, 2017
@brson
Contributor

brson commented Mar 14, 2017

Please add comments to the source code explaining why this code is so complex.

@arthurprs
Contributor

arthurprs commented Mar 14, 2017

Did you try a 32-byte buffer? That way the compiler could emit code that uses 2 xmm registers in parallel.

Edit: After giving this a deeper look, I think for it to be really optimal we'll probably need multiple versions depending on the size of T.

@djzin
Contributor Author

djzin commented Mar 14, 2017

OK so I did some more testing and it seems the SIMD version is indeed faster - this makes sense to me. On a struct of size 2048 I get 630 cycles for the SIMD version and 990 cycles for the 16-byte version. It would be nice to find a way to get llvm to use the SIMD instructions without resorting to unstable features though.

On my machine the u64x4 and u64x8 versions had exactly the same performance profile; the C++ version earlier used only 4 registers, which apparently is optimal. I had a u64x16 version as well, but it was slower.

here is the SIMD version for reference:

fn swap_u64x4<T>(x: &mut T, y: &mut T) {
    unsafe {
        #[allow(non_camel_case_types)]
        #[repr(simd)]
        struct u64x4(u64, u64, u64, u64);

        let mut t: u64x4 = mem::uninitialized();

        let x = x as *mut T as *mut u8; 
        let y = y as *mut T as *mut u8; 
        let t = &mut t as *mut _ as *mut u8; 

        let len = mem::size_of::<T>() as isize;
        let block_sz = mem::size_of::<u64x4>();

        let mut i = 0;
        while i + block_sz as isize <= len {
            ptr::copy_nonoverlapping(x.offset(i), t, block_sz);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), block_sz);
            ptr::copy_nonoverlapping(t, y.offset(i), block_sz);
            i += block_sz as isize;
        }
        if i < len {
            let rem = (len - i) as usize;
            ptr::copy_nonoverlapping(x.offset(i), t, rem);
            ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
            ptr::copy_nonoverlapping(t, y.offset(i), rem);
        }   
    }   
}

#[no_mangle]
pub fn swap_array_u64x4(x: &mut [u8; 2048], y: &mut [u8; 2048]) {
    swap_u64x4(x, y)
}

with the generated assembly:

swap_array_u64x4:
    .cfi_startproc
    xorl    %eax, %eax
    .p2align    4, 0x90
.LBB0_1:
    movups  (%rdi,%rax), %xmm0
    movups  16(%rdi,%rax), %xmm1
    movups  (%rsi,%rax), %xmm2
    movups  16(%rsi,%rax), %xmm3
    movups  %xmm3, 16(%rdi,%rax)
    movups  %xmm2, (%rdi,%rax)
    movups  %xmm1, 16(%rsi,%rax)
    movups  %xmm0, (%rsi,%rax)
    movups  32(%rdi,%rax), %xmm0
    movups  48(%rdi,%rax), %xmm1
    movups  32(%rsi,%rax), %xmm2
    movups  48(%rsi,%rax), %xmm3
    movups  %xmm3, 48(%rdi,%rax)
    movups  %xmm2, 32(%rdi,%rax)
    movups  %xmm1, 48(%rsi,%rax)
    movups  %xmm0, 32(%rsi,%rax)
    movups  64(%rdi,%rax), %xmm0
    movups  80(%rdi,%rax), %xmm1
    movups  64(%rsi,%rax), %xmm2
    movups  80(%rsi,%rax), %xmm3
    movups  %xmm3, 80(%rdi,%rax)
    movups  %xmm2, 64(%rdi,%rax)
    movups  %xmm1, 80(%rsi,%rax)
    movups  %xmm0, 64(%rsi,%rax)
    movups  96(%rdi,%rax), %xmm0
    movups  112(%rdi,%rax), %xmm1
    movups  96(%rsi,%rax), %xmm2
    movups  112(%rsi,%rax), %xmm3
    movups  %xmm3, 112(%rdi,%rax)
    movups  %xmm2, 96(%rdi,%rax)
    movups  %xmm1, 112(%rsi,%rax)
    movups  %xmm0, 96(%rsi,%rax)
    movq    %rax, %rcx
    subq    $-128, %rcx
    addq    $160, %rax
    cmpq    $2049, %rax
    movq    %rcx, %rax
    jl  .LBB0_1
    retq

The code I am using to time it...

unsafe fn time(swap: fn(&mut [u8; ARR_LENGTH], &mut [u8; ARR_LENGTH])) -> u64 {
    let mut x: [u8; ARR_LENGTH] = [42; ARR_LENGTH];
    let mut y: [u8; ARR_LENGTH] = [69; ARR_LENGTH];
    asm!("cpuid
          cpuid
          cpuid
          cpuid
          cpuid" 
          :
          :
          : "rax", "rbx", "rcx", "rdx" 
          : "volatile");
    let mut min_diff = !0;
    let mut max_diff = 0;
    let mut i = 0;
    while i < 80 {
        let before_high: u32;
        let before_low: u32;
        let after_high: u32;
        let after_low: u32;
        asm!("cpuid
              rdtsc
              movl %edx, $0
              movl %eax, $1"
              : "=r" (before_high), "=r" (before_low)
              :
              : "rax", "rbx", "rcx", "rdx"
              : "volatile" );
        swap(&mut x, &mut y);
        asm!("rdtscp
              movl %edx, $0
              movl %eax, $1
              cpuid"
              : "=r" (after_high), "=r" (after_low)
              :
              : "rax", "rbx", "rcx", "rdx"
              : "volatile" );
        asm!("" :: "r"(&x), "r"(&y));
        // rdtsc/rdtscp return the counter split across edx:eax; combine the halves with a left shift
        let diff = (((after_high as u64) << 32) | after_low as u64)
                 - (((before_high as u64) << 32) | before_low as u64);
        if diff < min_diff { min_diff = diff; }
        if diff > max_diff { max_diff = diff; }
        i += 1;
    }

    return min_diff;
}

fn noop(_: &mut [u8; ARR_LENGTH], _: &mut [u8; ARR_LENGTH]) {}

fn main() {
    unsafe {
        let min_timing = time(noop);
        let min_execution_time_16 = time(swap_array_16);
        let min_execution_time_32 = time(swap_array_32);
        let min_execution_time_64 = time(swap_array_64);
        let min_execution_time_u64x4 = time(swap_array_u64x4);
        let min_execution_time_u64x8 = time(swap_array_u64x8);
        let min_execution_time_u64x16 = time(swap_array_u64x16);
        println!("arr length {}", ARR_LENGTH);
        println!("min timing {}", min_timing);
        println!("16_bytes {}", min_execution_time_16 - min_timing);
        println!("32_bytes {}", min_execution_time_32 - min_timing);
        println!("64_bytes {}", min_execution_time_64 - min_timing);
        println!("simd_u64x4 {}", min_execution_time_u64x4 - min_timing);
        println!("simd_u64x8 {}", min_execution_time_u64x8 - min_timing);
        println!("simd_u64x16 {}", min_execution_time_u64x16 - min_timing);
    }
}

gives

arr length 64
min timing 80
16_bytes 10
32_bytes 10
64_bytes 10
simd_u64x4 10
simd_u64x8 10
simd_u64x16 10

arr length 256
min timing 80
16_bytes 110
32_bytes 90
64_bytes 90
simd_u64x4 50
simd_u64x8 50
simd_u64x16 60

arr length 517
min timing 80
16_bytes 240
32_bytes 230
64_bytes 220
simd_u64x4 150
simd_u64x8 160
simd_u64x16 170

arr length 834
min timing 80
16_bytes 380
32_bytes 370
64_bytes 360
simd_u64x4 240
simd_u64x8 240
simd_u64x16 280

arr length 2048
min timing 80
16_bytes 990
32_bytes 1000
64_bytes 980
simd_u64x4 630
simd_u64x8 630
simd_u64x16 660

arr length 65536
min timing 80
16_bytes 41730
32_bytes 41880
64_bytes 41600
simd_u64x4 33570
simd_u64x8 33640
simd_u64x16 33970

arr length 99999
min timing 80
16_bytes 58380
32_bytes 58310
64_bytes 57990
simd_u64x4 52260
simd_u64x8 51960
simd_u64x16 52460

@arthurprs
Contributor

arthurprs commented Mar 15, 2017

I've spent some time on this, and there are some cases where the old code is better (T sizes 57, 71, 73, 77, for example).
The difference seems to lie in how the "tail copy" is done: the new code uses general-purpose registers, while the old code still manages to do it with xmm registers. I can't prove it, but it could affect real code negatively.

Thoughts?

@djzin
Contributor Author

djzin commented Mar 19, 2017

@arthurprs I see a similar effect. I don't see that the xmm registers are not being used, but for certain sizes the old code is better. I'm going to see if there's a way to make it faster in every case, but for now here's the full data (note: I timed these using 10 iterations each; counts are in cycles):

arr_length	min_timing	rust_stdlib	simd_u64x4	simd_u64x8
1	80	90	90	90	
2	72	84	84	84	
3	68	144	144	144	
4	48	54	54	54	
5	72	156	156	156	
6	80	160	160	160	
7	68	180	180	180	
8	80	90	90	90	
9	32	72	72	72	
10	68	144	144	144	
11	32	80	80	80	
12	80	160	160	160	
13	72	516	516	519	
14	32	224	224	224	
15	72	516	516	516	
16	72	177	177	177	
17	72	195	195	195	
18	80	210	210	210	
19	32	80	80	80	
20	32	80	80	80	
21	80	560	560	560	
22	80	560	560	560	
23	80	560	560	560	
24	32	76	76	76	
25	80	600	600	600	
26	80	600	600	600	
27	72	555	555	555	
28	72	555	555	555	
29	80	600	600	600	
30	64	480	480	480	
31	80	600	600	600	
32	68	188	112	188	
33	32	84	52	84	
34	72	195	120	195	
35	68	232	232	232	
36	80	240	130	230	
37	72	564	249	564	
38	68	520	232	520	
39	32	244	136	244	
40	32	80	60	80	
41	80	640	270	640	
42	72	591	258	591	
43	48	432	228	432	
44	80	640	270	640	
45	80	640	590	640	
46	80	640	600	640	
47	40	940	325	940	
48	32	84	56	84	
49	32	108	112	108	
50	72	249	258	249	
51	48	204	210	204	
52	72	249	258	249	
53	72	528	555	528	
54	32	228	240	228	
55	72	528	555	528	
56	32	108	72	108	
57	68	520	532	520	
58	80	610	620	610	
59	32	244	248	244	
60	48	384	390	384	
61	80	610	620	610	
62	72	564	573	564	
63	68	520	532	520	
64	32	108	72	76	
65	32	136	88	92	
66	72	324	204	213	
67	32	168	148	172	
68	80	340	220	230	
69	80	620	370	420	
70	72	573	342	396	
71	72	573	408	462	
72	80	350	220	230	
73	80	640	370	430	
74	80	640	380	430	
75	32	256	176	200	
76	80	640	380	420	
77	32	256	280	300	
78	68	548	600	640	
79	72	591	648	693	
80	80	350	220	230	
81	80	420	370	430	
82	80	420	380	430	
83	68	420	376	428	
84	32	168	148	172	
85	72	528	657	693	
86	80	570	710	750	
87	80	650	700	770	
88	80	420	270	270	
89	32	256	288	308	
90	64	520	576	616	
91	32	256	288	308	
92	80	640	720	770	
93	32	256	288	308	
94	32	256	288	308	
95	72	591	666	711	
96	80	420	270	270	
97	72	453	297	462	
98	32	200	128	200	
99	60	427	351	427	
100	68	420	272	428	
101	32	244	188	320	
102	80	610	470	800	
103	80	610	540	800	
104	80	500	320	320	
105	32	256	192	328	
106	68	548	400	700	
107	72	591	498	756	
108	64	512	376	656	
109	68	548	676	700	
110	80	2070	790	2020	
111	32	256	316	328	
112	80	490	320	320	
113	80	570	480	570	
114	32	228	192	228	
115	64	672	432	520	
116	32	228	192	228	
117	72	537	729	738	
118	72	537	738	738	
119	32	232	316	320	
120	72	528	342	342	
121	32	236	328	328	
122	32	236	328	328	
123	68	504	700	700	
124	32	236	328	328	
125	32	276	376	376	
126	32	236	328	328	
127	32	236	328	328	
128	48	360	234	234	
129	72	987	387	387	
130	80	1170	420	420	
131	64	856	456	496	
132	80	1070	420	420	
133	32	428	228	252	
134	80	1070	570	620	
135	80	1070	640	700	
136	72	987	387	387	
137	64	856	456	496	
138	72	1116	528	582	
139	32	428	256	280	
140	60	801	427	464	
141	32	428	356	388	
142	56	755	628	685	
143	80	1070	890	970	
144	80	1070	420	420	
145	80	1010	580	630	
146	72	906	537	582	
147	80	990	650	700	
148	68	892	488	532	
149	68	856	760	812	
150	64	792	712	760	
151	24	363	312	339	
152	64	856	376	376	
153	60	734	690	727	
154	32	396	368	388	
155	72	906	849	897	
156	80	980	920	970	
157	80	990	920	970	
158	80	1000	920	970	
159	80	1080	920	970	
160	68	804	400	400	
161	32	428	208	280	
162	72	1008	480	648	
163	72	987	618	711	
164	80	1070	520	700	
165	80	1070	680	1010	
166	80	1070	670	1010	
167	80	1070	740	1010	
168	80	1060	520	520	
169	72	987	618	942	
170	32	492	308	472	
171	32	428	296	408	
172	80	1350	680	1020	
173	32	428	400	408	
174	32	968	400	916	
175	80	1070	1000	1020	
176	80	1070	520	520	
177	80	1240	670	770	
178	72	897	627	711	
179	80	970	740	850	
180	80	970	670	770	
181	80	970	1000	1000	
182	80	1210	1000	1000	
183	80	970	1000	1000	
184	32	464	228	228	
185	72	1119	942	942	
186	40	1395	560	560	
187	68	1044	872	872	
188	72	897	942	942	
189	32	388	408	408	
190	72	897	942	942	
191	72	897	942	942	
192	32	504	228	228	
193	60	877	464	464	
194	72	2187	573	573	
195	32	428	308	328	
196	72	1080	573	573	
197	80	2460	770	820	
198	60	801	577	614	
199	80	1130	840	900	
200	32	440	248	248	
201	32	428	308	328
202	68	916	668	704
203	72	990	777	831
204	32	428	308	328
205	80	1070	1100	1150
206	80	2400	1100	1150
207	68	916	940	984
208	72	987	573	573
209	80	970	770	820
210	68	832	660	700
211	72	897	777	831
212	32	388	312	332
213	32	388	440	460
214	40	530	600	630
215	40	560	625	650
216	80	1464	220	220
217	72	906	1035	1080
218	60	801	840	877
219	80	889	1027	1072
220	72	906	1035	1080
221	80	1260	1120	1170
222	72	897	1035	1080
223	32	388	448	468
224	48	852	450	450
225	48	714	480	600
226	72	996	666	831
227	68	916	744	832
228	68	916	616	772
229	32	428	348	484
230	68	916	744	1036
231	80	1070	940	1210
232	72	1182	666	666
233	68	916	744	1044
234	80	1070	880	1220
235	72	987	867	1128
236	32	428	348	488
237	32	428	476	488
238	80	1070	1190	1220
239	80	1070	1190	1220
240	32	428	288	288
241	64	776	696	776
242	72	999	813	897
243	60	1770	711	787
244	72	897	804	897
245	56	685	840	847
246	80	970	1190	1210
247	80	1010	1190	1200
248	32	428	308	308
249	80	970	1220	1220
250	80	970	1220	1220
251	72	897	1128	1128
252	80	970	1220	1220
253	32	388	488	488
254	80	970	1220	1220
255	80	889	1120	1120
256	80	860	770	770
257	40	645	470	470
258	72	1053	756	756
259	32	476	388	412
260	72	1053	756	756
261	80	1190	970	1030
262	32	456	388	412
263	72	1053	960	1017
264	80	1130	820	820
265	72	1140	906	951
266	32	456	388	408
267	80	1140	1050	1100
268	32	476	388	412
269	80	1450	1290	1350
270	72	1053	1191	1248
271	72	1053	1200	1248
272	80	1340	820	820
273	80	1140	970	1030
274	40	620	530	565
275	32	456	420	440
276	32	456	392	412
277	32	456	520	540
278	80	1140	1300	1350
279	80	1130	1300	1350
280	80	1220	870	870
281	80	1380	1320	1370
282	68	1180	1132	1172
283	48	864	834	864
284	72	1266	1218	1266
285	72	1275	1218	1266
286	80	3100	1320	1370
287	80	1380	1320	1370
288	72	1227	804	804
289	80	1500	920	1100
290	68	1292	788	940
291	80	1460	1070	1170
292	72	1386	849	1017
293	72	2004	996	1293
294	80	1460	1070	1410
295	32	584	456	560
296	72	1356	849	849
297	72	2031	987	1311
298	32	584	428	568
299	56	1031	805	1002
300	80	1460	1070	1420

@djzin
Contributor Author

djzin commented Mar 19, 2017

One thing to note is that sizes like 127 are probably not that common; for instance, if any of the elements of the struct is a u64, the size will be a multiple of 8 (see the small example after the table below). Looking at just the multiples of 8, the new code is faster every time:

arr_length	min_timing	rust_stdlib	simd_u64x4	simd_u64x8
8	80	90	90	90	
16	72	177	177	177	
24	32	76	76	76	
32	68	188	112	188	
40	32	80	60	80	
48	32	84	56	84	
56	32	108	72	108	
64	32	108	72	76	
72	80	350	220	230	
80	80	350	220	230	
88	80	420	270	270	
96	80	420	270	270	
104	80	500	320	320	
112	80	490	320	320	
120	72	528	342	342	
128	48	360	234	234	
136	72	987	387	387	
144	80	1070	420	420	
152	64	856	376	376	
160	68	804	400	400	
168	80	1060	520	520	
176	80	1070	520	520	
184	32	464	228	228	
192	32	504	228	228	
200	32	440	248	248	
208	72	987	573	573
216	80	1464	220	220
224	48	852	450	450
232	72	1182	666	666
240	32	428	288	288
248	32	428	308	308
256	80	860	770	770
264	80	1130	820	820
272	80	1340	820	820
280	80	1220	870	870
288	72	1227	804	804
296	72	1356	849	849
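
A small example of the size/alignment point above (illustrative):

use std::mem;

// Any struct containing a u64 has at least 8-byte alignment, so its size is
// rounded up to a multiple of 8.
struct Example {
    a: u64,
    b: u8,
}

fn main() {
    assert_eq!(mem::align_of::<Example>(), 8);
    assert_eq!(mem::size_of::<Example>(), 16); // 9 bytes of fields, padded to 16
}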

@arthurprs
Contributor

arthurprs commented Mar 19, 2017

My concern is that using general-purpose registers inside very hot loops will throw off the optimizer. I'll try to come up with test cases or use the hashmap code.

@arthurprs
Contributor

Ok, I'm probably over-thinking this.

Either way, it's important to check results on x86 and some ARM variant.

@@ -109,7 +109,7 @@ pub use intrinsics::transmute;
 /// [`Clone`][clone]. You need the value's destructor to run only once,
 /// because a double `free` is undefined behavior.
 ///
-/// An example is the definition of [`mem::swap`][swap] in this module:
+/// An example is the (old) definition of [`mem::swap`][swap] in this module:
Member

I would phrase this differently - maybe "An example is a possible implementation of mem::swap" or something like that.

Contributor

I agree we should not intimate the existence of a former impl.

@sfackler
Member

sfackler commented Apr 9, 2017

Is there more investigation that needs to be done or is this good to go?

@eddyb
Member

eddyb commented Apr 9, 2017

I believe my MIR optimization concerns can be alleviated by moving this optimization into rustc or LLVM later on, so this PR wouldn't necessarily have any lasting negative impact on that (to counter its wins).

@alexcrichton
Member

Nah, that looks legitimate; it seems vectorized ops leaked into the backend, which emscripten couldn't handle. Dunno if that's a bug with us or emscripten, but a legitimate bug regardless!

@nagisa
Member

nagisa commented Jun 5, 2017

This is a bug in emscripten, the way I see it.

It is not likely to get fixed in emscripten any time soon, so the next best fix is a cfg'd version for emscripten with the old code and a FIXME. We should still report this against emscripten.
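
A rough sketch of what such a cfg'd fallback could look like (illustrative; the names and bodies here are hypothetical, not necessarily the code that landed):

use std::ptr;

// FIXME(hypothetical): the block-wise path miscompiles under emscripten;
// fall back to a simple byte loop there until the upstream bug is fixed.
#[cfg(not(target_os = "emscripten"))]
unsafe fn swap_bytes(x: *mut u8, y: *mut u8, len: usize) {
    // Simplified block-wise path: 16 bytes at a time through a scratch buffer.
    let mut t = [0u8; 16];
    let mut i = 0;
    while i + 16 <= len {
        ptr::copy_nonoverlapping(x.offset(i as isize), t.as_mut_ptr(), 16);
        ptr::copy_nonoverlapping(y.offset(i as isize), x.offset(i as isize), 16);
        ptr::copy_nonoverlapping(t.as_mut_ptr(), y.offset(i as isize), 16);
        i += 16;
    }
    while i < len {
        ptr::swap(x.offset(i as isize), y.offset(i as isize));
        i += 1;
    }
}

#[cfg(target_os = "emscripten")]
unsafe fn swap_bytes(x: *mut u8, y: *mut u8, len: usize) {
    // Plain byte-at-a-time swap; nothing here for emscripten's backend to trip over.
    for i in 0..len as isize {
        ptr::swap(x.offset(i), y.offset(i));
    }
}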

@brson
Contributor

brson commented Jun 5, 2017

Thanks for persevering @djzin. Cool patch.

@brson brson added the relnotes Marks issues that should be documented in the release notes of the next release. label Jun 5, 2017
@arielb1 arielb1 added S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jun 6, 2017
@djzin
Contributor Author

djzin commented Jun 9, 2017

@sfackler should be fixed now...

@Mark-Simulacrum
Member

@bors r=sfackler

@bors
Contributor

bors commented Jun 11, 2017

📌 Commit 83f1f11 has been approved by sfackler

@bors
Contributor

bors commented Jun 11, 2017

⌛ Testing commit 83f1f11 with merge 27650ee...

bors added a commit that referenced this pull request Jun 11, 2017
speed up mem::swap

I would have thought that the mem::swap code didn't need an intermediate variable precisely because the pointers are guaranteed never to alias. And.. it doesn't! It seems that llvm will also auto-vectorize this case for large structs, but alas it doesn't seem to have all the aliasing info it needs and so will add redundant checks (and even not bother with autovectorizing for small types). Looks like a lot of performance could still be gained here, so this might be a good test case for future optimizer improvements.

Here are the current benchmarks for the simd version of mem::swap; the timings are in cycles (code below) measured with 10 iterations. The timings for sizes > 32 which are not a multiple of 8 tend to be ever so slightly faster in the old code, but not always. For large struct sizes (> 1024) the new code shows a marked improvement.

\* = latest commit
† = subtracted from other measurements

| arr_length	| noop<sup>†</sup>	| rust_stdlib	| simd_u64x4\*	| simd_u64x8
|------------------|------------|-------------------|-------------------|-------------------
8|80|90|90|90
16|72|177|177|177
24|32|76|76|76
32|68|188|112|188
40|32|80|60|80
48|32|84|56|84
56|32|108|72|108
64|32|108|72|76
72|80|350|220|230
80|80|350|220|230
88|80|420|270|270
96|80|420|270|270
104|80|500|320|320
112|80|490|320|320
120|72|528|342|342
128|48|360|234|234
136|72|987|387|387
144|80|1070|420|420
152|64|856|376|376
160|68|804|400|400
168|80|1060|520|520
176|80|1070|520|520
184|32|464|228|228
192|32|504|228|228
200|32|440|248|248
208|72|987|573|573
216|80|1464|220|220
224|48|852|450|450
232|72|1182|666|666
240|32|428|288|288
248|32|428|308|308
256|80|860|770|770
264|80|1130|820|820
272|80|1340|820|820
280|80|1220|870|870
288|72|1227|804|804
296|72|1356|849|849
@Mark-Simulacrum Mark-Simulacrum added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-author Status: This is awaiting some action (such as code changes or more information) from the author. labels Jun 11, 2017
@bors
Contributor

bors commented Jun 11, 2017

☀️ Test successful - status-appveyor, status-travis
Approved by: sfackler
Pushing 27650ee to master...

@bors bors merged commit 83f1f11 into rust-lang:master Jun 11, 2017
bors added a commit that referenced this pull request Jun 28, 2017
Reuse the mem::swap optimizations to speed up slice::rotate

This is most helpful for compound types where LLVM didn't vectorize the loop.  Highlight: bench slice::rotate_medium_by727_strings gets 38% faster.

Exposes the swapping logic from PR #40454 as `pub unsafe fn ptr::swap_nonoverlapping` under library feature `swap_nonoverlapping` #42818.

(The new method seemed plausible, and was the simplest way to share the logic.  I'm not attached to it, though, so let me know if a different way would be better.)
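
For reference, a small usage sketch of `ptr::swap_nonoverlapping` (its signature is `unsafe fn swap_nonoverlapping<T>(x: *mut T, y: *mut T, count: usize)`):

use std::ptr;

fn main() {
    let mut a = [1u32, 2, 3, 4];
    let mut b = [9u32, 8, 7, 6];
    // Safety: `a` and `b` are distinct locals, so the two regions cannot overlap.
    unsafe {
        ptr::swap_nonoverlapping(a.as_mut_ptr(), b.as_mut_ptr(), a.len());
    }
    assert_eq!(a, [9, 8, 7, 6]);
    assert_eq!(b, [1, 2, 3, 4]);
}
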
cuviper added a commit to cuviper/rust that referenced this pull request Jul 11, 2017
This is a workaround for rust-lang#42778, which was git-bisected to rust-lang#40454's
optimizations to `mem::swap`, later moved to `ptr` in rust-lang#42819.  Natively
compiled rustc couldn't even compile stage1 libcore on powerpc64 and
s390x, but they work fine without this `repr(simd)`.  Since powerpc64le
works OK, it seems probably related to being big-endian.

The underlying problem is not yet known, but this at least makes those
architectures functional again in the meantime.

cc @arielb1
frewsxcv added a commit to frewsxcv/rust that referenced this pull request Jul 15, 2017
Disable big-endian simd in swap_nonoverlapping_bytes

This is a workaround for rust-lang#42778, which was git-bisected to rust-lang#40454's
optimizations to `mem::swap`, later moved to `ptr` in rust-lang#42819.  Natively
compiled rustc couldn't even compile stage1 libcore on powerpc64 and
s390x, but they work fine without this `repr(simd)`.  Since powerpc64le
works OK, it seems probably related to being big-endian.

The underlying problem is not yet known, but this at least makes those
architectures functional again in the meantime.

cc @arielb1