Auto merge of #37573 - ruuda:faster-cursor, r=alexcrichton
Add small-copy optimization for copy_from_slice

## Summary

During benchmarking, I found that one of my programs spent between 5 and 10 percent of its time doing memmoves. Ultimately I tracked these down to single-byte slices being copied with a memcpy. Doing a manual copy when the slice contains only one element speeds things up significantly; for my program, this change reduced the running time by 20%.
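In isolation, the idea looks roughly like the sketch below (a simplified illustration with a hypothetical helper name; the actual change lives in `<&[u8] as Read>::read()` and `read_exact()`, shown in the diff further down):

```
/// Hypothetical helper illustrating the fast path: copy `src` into `dst`,
/// special-casing a single element to avoid dispatching to libc's `memcpy`.
fn copy_small<T: Copy>(dst: &mut [T], src: &[T]) {
    assert_eq!(dst.len(), src.len());
    if src.len() == 1 {
        // A plain assignment compiles to a simple load and store,
        // with no call into `memcpy`.
        dst[0] = src[0];
    } else {
        dst.copy_from_slice(src);
    }
}
```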

## Background

I am optimizing a program that relies heavily on reading a single byte at a time. To avoid IO overhead, I read all data into a vector once, and then read from a `Cursor` around that vector. During profiling, I noticed that `__memmove_avx_unaligned_erms` was hot, taking up 7.3% of the running time. It turns out that these calls came from `Cursor::read()`, which calls `<&[u8] as Read>::read()`, which calls `&[T]::copy_from_slice()`, which calls `ptr::copy_nonoverlapping()`, which is implemented as a memcpy. Copying a single byte with a memcpy is very wasteful, because (at least on my platform) it involves calling `memcpy` in libc: an indirect call when libc is linked dynamically, and furthermore `memcpy` is optimized for copying large amounts of data at the cost of a bit of overhead for small copies.
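The access pattern in question looks roughly like this (a minimal sketch of the setup described above, not the actual program):

```
use std::io::{Cursor, Read};

fn main() -> std::io::Result<()> {
    // In the real program the data is read from a file once; here it is
    // just a small literal vector.
    let data: Vec<u8> = vec![1, 2, 3, 4, 5];
    let mut cursor = Cursor::new(data);

    // Reading one byte at a time: every call goes through `Cursor::read()`
    // and ends up in `<&[u8] as Read>::read()`, which used to issue a
    // `memcpy` even for this single byte.
    let mut byte = [0u8; 1];
    while cursor.read(&mut byte)? == 1 {
        // ... decode `byte[0]` here ...
    }
    Ok(())
}
```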

## Benchmarks

Before I made this change, `perf` reported the following for my program. I only included the relevant functions, and how they rank. (This is on a different machine than where I ran the original benchmarks. It has an older CPU, so `__memmove_sse2_unaligned_erms` is called instead of `__memmove_avx_unaligned_erms`.)

```
#3   5.47%  bench_decode  libc-2.24.so      [.] __memmove_sse2_unaligned_erms
#5   1.67%  bench_decode  libc-2.24.so      [.] memcpy@GLIBC_2.2.5
#6   1.51%  bench_decode  bench_decode      [.] memcpy@plt
```

Together, these memory-copy routines eat up 8.65% of the total running time. The overhead of dispatching to a specialized fast copy function (`memcpy@GLIBC_2.2.5` showing up) is clearly visible, and so is the price of dynamic linking (`memcpy@plt` showing up).

After this change, this is what `perf` reports:

```
#5   0.33%  bench_decode  libc-2.24.so      [.] __memmove_sse2_unaligned_erms
#14  0.01%  bench_decode  libc-2.24.so      [.] memcpy@GLIBC_2.2.5
```

Now only 0.34% of the running time is spent on memcopies, and the dynamic linking overhead is no longer significant at all.

To add some more data, my program generates timing results for the operation in its main loop. These are the timings before and after the change:

| Time before   | Time after    | After/Before |
|---------------|---------------|--------------|
| 29.8 ± 0.8 ns | 23.6 ± 0.5 ns |  0.79 ± 0.03 |

The time is basically the total running time divided by a constant; the actual numbers are not important. This change reduced the total running time by 21%, much more than the roughly 9% originally spent on memmoves, likely because the CPU stalls a lot less now that data dependencies are more transparent. Of course your mileage may vary, and for most programs this will not matter at all. But when it does, the gains can be significant!

## Alternatives

* At first I implemented this in `io::Cursor`. I moved it to `&[T]::copy_from_slice()` instead, but this might be too intrusive, especially because it applies to all `T`, not just `u8`. To restrict this to `io::Read`, `<&[u8] as Read>::read()` is probably the best place.
* I tried copying bytes in a loop, for up to 64 or up to 8 bytes, before calling `Read::read`, but both variants resulted in about a 20% slowdown instead of a speedup.
bors committed Dec 1, 2016
2 parents dc81742 + 3be2c3b commit 070fad1
Showing 1 changed file with 20 additions and 2 deletions: src/libstd/io/impls.rs

```
@@ -157,7 +157,16 @@ impl<'a> Read for &'a [u8] {
     fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
         let amt = cmp::min(buf.len(), self.len());
         let (a, b) = self.split_at(amt);
-        buf[..amt].copy_from_slice(a);
+
+        // First check if the amount of bytes we want to read is small:
+        // `copy_from_slice` will generally expand to a call to `memcpy`, and
+        // for a single byte the overhead is significant.
+        if amt == 1 {
+            buf[0] = a[0];
+        } else {
+            buf[..amt].copy_from_slice(a);
+        }
+
         *self = b;
         Ok(amt)
     }
@@ -169,7 +178,16 @@ impl<'a> Read for &'a [u8] {
                                   "failed to fill whole buffer"));
         }
         let (a, b) = self.split_at(buf.len());
-        buf.copy_from_slice(a);
+
+        // First check if the amount of bytes we want to read is small:
+        // `copy_from_slice` will generally expand to a call to `memcpy`, and
+        // for a single byte the overhead is significant.
+        if buf.len() == 1 {
+            buf[0] = a[0];
+        } else {
+            buf.copy_from_slice(a);
+        }
+
         *self = b;
         Ok(())
     }
```
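
As a quick usage sketch (not part of the commit), the second changed path can be exercised directly on a `&[u8]`, which implements `Read`; a single-byte `read_exact` now takes the new fast path:

```
use std::io::Read;

fn main() -> std::io::Result<()> {
    let data: &[u8] = &[0x10, 0x20, 0x30];
    let mut reader = data; // `&[u8]` implements `Read`; reading advances the slice.

    // `buf.len() == 1`, so this takes the new single-byte branch in `read_exact`.
    let mut byte = [0u8; 1];
    reader.read_exact(&mut byte)?;
    assert_eq!(byte[0], 0x10);

    // Larger reads still go through `copy_from_slice`, i.e. `memcpy`.
    let mut rest = [0u8; 2];
    reader.read_exact(&mut rest)?;
    assert_eq!(rest, [0x20, 0x30]);
    Ok(())
}
```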
