speed up mem::swap #40454

Merged · 16 commits · Jun 11, 2017
89 changes: 76 additions & 13 deletions src/libcore/mem.rs
@@ -109,7 +109,7 @@ pub use intrinsics::transmute;
/// [`Clone`][clone]. You need the value's destructor to run only once,
/// because a double `free` is undefined behavior.
///
/// An example is the definition of [`mem::swap`][swap] in this module:
/// An example is a possible implementation of [`mem::swap`][swap]:
///
/// ```
/// use std::mem;
@@ -499,18 +499,81 @@ pub unsafe fn uninitialized<T>() -> T {
#[stable(feature = "rust1", since = "1.0.0")]
pub fn swap<T>(x: &mut T, y: &mut T) {
unsafe {
// Give ourselves some scratch space to work with
let mut t: T = uninitialized();

// Perform the swap, `&mut` pointers never alias
ptr::copy_nonoverlapping(&*x, &mut t, 1);
ptr::copy_nonoverlapping(&*y, x, 1);
ptr::copy_nonoverlapping(&t, y, 1);

// y and t now point to the same thing, but we need to completely
// forget `t` because we do not want to run the destructor for `T`
// on its value, which is still owned somewhere outside this function.
forget(t);
// The approach here is to use SIMD to swap x & y efficiently. Testing reveals
// that swapping either 32 bytes or 64 bytes at a time is most efficient for
// Intel Haswell-E processors. LLVM is better able to optimize if we give the
// struct a #[repr(simd)], even if we don't actually use this struct directly.
#[repr(simd)]
struct Block(u64, u64, u64, u64);
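// Same four u64s, but without the SIMD alignment requirement: scratch space
// for the tail that is smaller than a full `Block`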
struct UnalignedBlock(u64, u64, u64, u64);

let block_size = size_of::<Block>();

// Get raw pointers to the bytes of x & y for easier manipulation
let x = x as *mut T as *mut u8;
let y = y as *mut T as *mut u8;

// Loop through x & y, copying them `Block` at a time
// The optimizer should unroll the loop fully for most types
// N.B. We can't use a for loop as the `range` impl calls `mem::swap` recursively
let len = size_of::<T>() as isize;
let mut i = 0;
while i + block_size as isize <= len {
// Create some uninitialized memory as scratch space
// Declaring `t` here avoids aligning the stack when this loop is unused
let mut t: Block = uninitialized();
let t = &mut t as *mut _ as *mut u8;

// Swap a block of bytes of x & y, using t as a temporary buffer
// This should be optimized into efficient SIMD operations where available
ptr::copy_nonoverlapping(x.offset(i), t, block_size);
ptr::copy_nonoverlapping(y.offset(i), x.offset(i), block_size);
ptr::copy_nonoverlapping(t, y.offset(i), block_size);
i += block_size as isize;
}


if i < len {
// Swap any remaining bytes, using aligned types to copy
// where appropriate (this information is lost by conversion
// to *mut u8, so restore it manually here)
let mut t: UnalignedBlock = uninitialized();
let rem = (len - i) as usize;
Contributor:

I don't know if it ever matters, but it occurs to me that calculating `rem` as `len % 16` instead of `len - i` might be more obvious to an optimizer, as knowing what `len - i` is requires reasoning about a loop. Similarly, it would be better to test whether `rem == 0` than to test `i < len`, as it is more trivially determined at compile time.

I doubt this matters in most cases, but I could see the optimizer failing when the size is large, since the loop might not be fully unrolled.

Contributor Author:

I tested it; looks like the generated assembly is the same either way. `len` here is a constant known at compile time, so I guess it just gets completely eliminated.
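For illustration, a self-contained sketch of the point being made (hypothetical helper, not code from this PR; `[u64; 4]` stands in for the 32-byte `Block`): the tail length computed from the sizes alone is a constant the optimizer can see without reasoning about the loop's induction variable.

```rust
use std::mem::size_of;

// Hypothetical stand-in for the suggestion: the tail length is a pure
// function of the two sizes, so it constant-folds with no loop reasoning.
fn tail_len<T, B>() -> usize {
    size_of::<T>() % size_of::<B>()
}

fn main() {
    // A 36-byte payload leaves a 4-byte tail after one 32-byte block.
    assert_eq!(tail_len::<[u8; 36], [u64; 4]>(), 4);
    // An exact multiple leaves no tail, so a `rem == 0` test skips the tail path.
    assert_eq!(tail_len::<[u8; 64], [u64; 4]>(), 0);
}
```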


if align_of::<T>() % 8 == 0 && len % 8 == 0 {
let t = &mut t as *mut _ as *mut u64;
Member:

The reason I suggested making a union earlier was that I did the same thing as this in an earlier PR, and was told that it has undef issues on something like `(u64, u8)`.

(At least I think this is the same as that -- I never really understood how `ptr::copy_nonoverlapping` could be a problem when it's a memcpy.)

Contributor Author:

argh I think you are right. stupid undef

Contributor Author:

would it be undef to change (e.g.) `u64` to `[u8; 8]`? because this is a simple fix... but...

Member:

I don't think `[u8; N]` would be UB (less than a byte can't be undef, afaik), but I assume it would lose the alignment again.

Maybe something like `as *mut Foo<[u8; 8], u64>`, with this?

#[allow(unions_with_drop_fields)]
union Foo<T,A> {
    value: T,
    _align: A,
}
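For illustration, a self-contained sketch of that union idea (hypothetical `Aligned8` specializes the generic `Foo<T, A>` above; not code from this PR): every byte of the scratch buffer belongs to the `[u8; 8]` field, so a byte-wise copy involves no padding/undef bytes, while the `u64` field preserves 8-byte alignment.

```rust
use std::ptr;

// Hypothetical concrete version of the suggested union: `_align` forces
// 8-byte alignment, `value` guarantees all 8 bytes are real data.
#[repr(C)]
union Aligned8 {
    value: [u8; 8],
    _align: u64,
}

fn main() {
    let mut a: u64 = 1;
    let mut b: u64 = 2;
    unsafe {
        // Aligned scratch space with no undef bytes.
        let mut t = Aligned8 { value: [0; 8] };
        let t = &mut t as *mut Aligned8 as *mut u8;
        let x = &mut a as *mut u64 as *mut u8;
        let y = &mut b as *mut u64 as *mut u8;
        ptr::copy_nonoverlapping(x, t, 8);
        ptr::copy_nonoverlapping(y, x, 8);
        ptr::copy_nonoverlapping(t, y, 8);
    }
    assert_eq!((a, b), (2, 1));
}
```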

let x = x.offset(i) as *mut u64;
let y = y.offset(i) as *mut u64;

ptr::copy_nonoverlapping(x, t, rem / 8);
ptr::copy_nonoverlapping(y, x, rem / 8);
ptr::copy_nonoverlapping(t, y, rem / 8);
} else if align_of::<T>() % 4 == 0 && len % 4 == 0 {
let t = &mut t as *mut _ as *mut u32;
let x = x.offset(i) as *mut u32;
let y = y.offset(i) as *mut u32;

ptr::copy_nonoverlapping(x, t, rem / 4);
ptr::copy_nonoverlapping(y, x, rem / 4);
ptr::copy_nonoverlapping(t, y, rem / 4);
} else if align_of::<T>() % 2 == 0 && len % 2 == 0 {
let t = &mut t as *mut _ as *mut u16;
let x = x.offset(i) as *mut u16;
let y = y.offset(i) as *mut u16;

ptr::copy_nonoverlapping(x, t, rem / 2);
ptr::copy_nonoverlapping(y, x, rem / 2);
ptr::copy_nonoverlapping(t, y, rem / 2);
} else {
let t = &mut t as *mut _ as *mut u8;
let x = x.offset(i);
let y = y.offset(i);

ptr::copy_nonoverlapping(x, t, rem);
ptr::copy_nonoverlapping(y, x, rem);
ptr::copy_nonoverlapping(t, y, rem);
}
}
}
}
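For reference, the observable behavior of `mem::swap` is unchanged by this PR; only the copy strategy differs. A quick check, with a size chosen so both the block loop and the tail path run:

```rust
use std::mem;

fn main() {
    // 40 bytes: one full 32-byte block plus an 8-byte tail.
    let mut a = [1u8; 40];
    let mut b = [2u8; 40];
    mem::swap(&mut a, &mut b);
    assert_eq!(a, [2u8; 40]);
    assert_eq!(b, [1u8; 40]);
}
```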
