speed up mem::swap #40454

Merged
merged 16 commits into from
Jun 11, 2017
56 changes: 43 additions & 13 deletions src/libcore/mem.rs
@@ -109,7 +109,7 @@ pub use intrinsics::transmute;
/// [`Clone`][clone]. You need the value's destructor to run only once,
/// because a double `free` is undefined behavior.
///
/// An example is the definition of [`mem::swap`][swap] in this module:
/// An example is a possible implementation of [`mem::swap`][swap]:
///
/// ```
/// use std::mem;
@@ -447,18 +447,48 @@ pub unsafe fn uninitialized<T>() -> T {
#[stable(feature = "rust1", since = "1.0.0")]
pub fn swap<T>(x: &mut T, y: &mut T) {
unsafe {
// Give ourselves some scratch space to work with
let mut t: T = uninitialized();

// Perform the swap, `&mut` pointers never alias
ptr::copy_nonoverlapping(&*x, &mut t, 1);
ptr::copy_nonoverlapping(&*y, x, 1);
ptr::copy_nonoverlapping(&t, y, 1);

// y and t now point to the same thing, but we need to completely
// forget `t` because we do not want to run the destructor for `T`
// on its value, which is still owned somewhere outside this function.
forget(t);
// The approach here is to use SIMD to swap x & y efficiently. Testing reveals
// that swapping either 32 bytes or 64 bytes at a time is most efficient on
// Intel Haswell-E processors. LLVM is better able to optimize if we give a
// struct a #[repr(simd)], even if we don't actually use that struct directly.
#[repr(simd)]
struct Block(u64, u64, u64, u64);
struct UnalignedBlock(u64, u64, u64, u64);

let block_size = size_of::<Block>();

// Get raw pointers to the bytes of x & y for easier manipulation
let x = x as *mut T as *mut u8;
let y = y as *mut T as *mut u8;

// Loop through x & y, copying them `Block` at a time
// The optimizer should unroll the loop fully for most types
// N.B. We can't use a for loop as the `range` impl calls `mem::swap` recursively
let len = size_of::<T>() as isize;
let mut i = 0;
while i + block_size as isize <= len {
// Create some uninitialized memory as scratch space
// Decaring `t` here avoids aligning the stack when this loop is unused
> **Member:** nit: Declaring

> **Contributor (author):** gah!

let mut t: Block = uninitialized();
let t = &mut t as *mut _ as *mut u8;

// Swap a block of bytes of x & y, using t as a temporary buffer
// This should be optimized into efficient SIMD operations where available
ptr::copy_nonoverlapping(x.offset(i), t, block_size);
ptr::copy_nonoverlapping(y.offset(i), x.offset(i), block_size);
ptr::copy_nonoverlapping(t, y.offset(i), block_size);
i += block_size as isize;
}
if i < len {
// Swap any remaining bytes
let mut t: UnalignedBlock = uninitialized();
let t = &mut t as *mut _ as *mut u8;

let rem = (len - i) as usize;
> **Contributor:** I don't know if it ever matters, but it occurs to me that calculating `rem` as `len % 16` instead of `len - i` might be more obvious to an optimizer, as knowing what `len - i` is requires reasoning about a loop. Similarly, it would be better to test whether `rem == 0` than `i < len`, as that is more trivially determined at compile time. I doubt this matters in most cases, but I could see the optimizer failing when the size is large, since the loop might not be fully unrolled.

> **Contributor (author):** I tested it; the generated assembly is the same either way. `len` here is a constant known at compile time, so I guess it just gets completely eliminated.

ptr::copy_nonoverlapping(x.offset(i), t, rem);
ptr::copy_nonoverlapping(y.offset(i), x.offset(i), rem);
ptr::copy_nonoverlapping(t, y.offset(i), rem);
}
}
}
