Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Speed up compilation of large constant arrays #51833

Merged
merged 9 commits into from
Jul 1, 2018

Conversation

wesleywiser
Copy link
Member

@wesleywiser wesleywiser commented Jun 27, 2018

This is a different approach to #51672 as suggested by @oli-obk. Rather
than write each repeated value one-by-one, we write the first one and
then copy its value directly into the remaining memory.

With this change, the toy program goes from 63 seconds to 19 seconds on my machine.

Edit: Inlining Size::bytes() saves an additional 6 seconds dropping the total time to 13 seconds on my machine.

Edit2: Now down to 2.8 seconds.

r? @oli-obk

cc @nnethercote @eddyb

This is a different approach to rust-lang#51672 as suggested by @oli-obk. Rather
than write each repeated value one-by-one, we write the first one and
then copy its value directly into the remaining memory.
@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 27, 2018
This save an additional 6 seconds on the test program.
@nnethercote
Copy link
Contributor

Nice work!

Edit: Inlining Size::bytes() saves an additional 6 seconds dropping the total time to 13 seconds on my machine.

For the first patch in #51672 I inlined several methods in addition to bytes(). This was based on profiling data -- while bytes() is clearly the hottest, the others were still hot, and I think it would be worth trying to inline all of them to see if you get a bigger speed-up.

Copy link
Contributor

@oli-obk oli-obk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are some awesome improvements! After the review is addressed we'll throw it into the perf tests.

}
}

self.copy_undef_mask(src, dest, size)?;
self.copy_undef_mask(src, dest, size * length)?;
// copy back the relocations
self.get_mut(dest.alloc_id)?.relocations.insert_presorted(relocations);
Copy link
Contributor

@oli-obk oli-obk Jun 27, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you need to reapeat this, too (and offset the indices).

Try a [&FOO; 500] (for non-ZST FOO) and then access any field but the first (at compile-time! at runtime you'll get a segfault). If I'm reading the code correctly this will tell you about a dangling pointer.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Got it, thanks! Can you double check my math?

}
}

self.copy_undef_mask(src, dest, size)?;
self.copy_undef_mask(src, dest, size * length)?;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this results in the correct result, it does n^2/2 copies instead of n copies. Inside the function itself we should probably move the self.get(src.alloc_id)? out of the loops, too. We can probably improve the nonoverlapping case enormously, too by not requiring an intermediate allocation.

@michaelwoerister
Copy link
Member

michaelwoerister commented Jun 27, 2018

It would be interesting to see if this could be further sped up by copying larger chunks of memory at a time. Right now this makes n calls to memcpy but memcpy can be more efficient when it works with more memory at once (via SIMD). We could try increase the chunk size and see what that does to performance. In particular copy [0 .. size] to [size .. size *2], then [0 .. size*2] to [size*2 .. size*4], then [0..size*4] to [size*4 .. size*8], etc, up to a maximum chunk size. This makes copy_repeatedly quite a bit more complicated but it might be worth it?

This save 3 seconds on the test program.
@scottmcm
Copy link
Member

increase the chunk size

Looks like the logic for that already exists in core:

// `2^expn` repetition is done by doubling `buf` `expn`-times.

Maybe there's a way to extract the copying part into an unsafe ptr method and re-use it here?

@michaelwoerister
Copy link
Member

@scottmcm Great find!

@wesleywiser
Copy link
Member Author

@nnethercote That's a good idea. I tried that and it saved an additional 3 seconds on the compile time (now 10 seconds).

This saves 4.5 seconds and takes the compile time down to 5.5 seconds.
This saves 0.5 seconds on the test compilation.
This saves 2.5 seconds on the test program.
@wesleywiser
Copy link
Member Author

wesleywiser commented Jun 30, 2018

@michaelwoerister @scottmcm I tried that but it only saved about 100ms. perf indicates that the memory copy operations aren't hot. Good idea though!

Edit: I can push that commit too if you think it's worth pursuing.

@wesleywiser
Copy link
Member Author

With the latest changes, the compile time is down to 2.8 seconds on my machine.

@nnethercote
Copy link
Contributor

With the latest changes, the compile time is down to 2.8 seconds on my machine.

A 22.5x speedup!

Copy link
Contributor

@oli-obk oli-obk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome improvements. Just a nit so miri the tool keeps working.

@@ -882,25 +882,16 @@ impl<'a, 'mir, 'tcx, M: Machine<'mir, 'tcx>> Memory<'a, 'mir, 'tcx, M> {
) -> EvalResult<'tcx> {
// The bits have to be saved locally before writing to dest in case src and dest overlap.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment makes me think that we should not do this commit, otherwise we'll run into trouble in the future (and in miri right now). Can you do an if for whether there is overlap and if there is, just run the old code?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm. I thought I preserved the existing behavior by cloning the source allocation's undef_mask before writing to the destination's. Is that sufficient?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh right. sorry. I misread the code.

I still think the code isn't doing the right thing. It's only copying once, when it should be copying N-1 times.

You can try this out by creating an array of types with padding, everything starting at the third element will probably not have undef masks for the padding. (you'll need unions to get the bits and then attempt to use them for an array length to actually get a compiler error from that)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm afraid I'm not quite following. We do call this function with size * length so shouldn't it cover all of the repeated copies? Can you provide a sample program that will fail?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes, you are using the length, but that just means that the entire array is copied from 0..N to 1..=N, not that the 1st element is copied N times.

I'll make a regression test

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm fairly certain that the following test will succeed to compile on your PR: http://play.rust-lang.org/?gist=1d0183fcfb65164d1ca58ccd9614c33c

);
}
for i in 0..size.bytes() {
let defined = undef_mask.get(src.offset + Size::from_bytes(i));
Copy link
Contributor

@oli-obk oli-obk Jun 30, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if you pass a repeat counter to the function, you should be able to just modulo the i here over the size and have the for loop go from 0 ot size.bytes() * repeat

@oli-obk
Copy link
Contributor

oli-obk commented Jun 30, 2018

Please also add a test for http://play.rust-lang.org/?gist=1c0e90ac9064edfa12fbd286902e20ef to make sure we always properly copy the relocations.

@wesleywiser
Copy link
Member Author

Added tests and fixed that issue

@rust-highfive
Copy link
Collaborator

The job x86_64-gnu-llvm-3.9 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.

Click to expand the log.

[00:05:01] travis_fold:start:tidy
travis_time:start:tidy
tidy check
[00:05:01] tidy error: /checkout/src/test/compile-fail/const-err4.rs: missing trailing newline
[00:05:02] some tidy checks failed
[00:05:02] 
[00:05:02] 
[00:05:02] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
[00:05:02] 
[00:05:02] 
[00:05:02] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
[00:05:02] Build completed unsuccessfully in 0:01:52
[00:05:02] Build completed unsuccessfully in 0:01:52
[00:05:02] Makefile:79: recipe for target 'tidy' failed
[00:05:02] make: *** [tidy] Error 1

The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:13ca9c5e
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
---
travis_time:end:007a23e0:start=1530387249551729531,finish=1530387249560171319,duration=8441788
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:012b7739
$ head -30 ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
head: cannot open ‘./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers’ for reading: No such file or directory
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:009c4c44
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

@wesleywiser wesleywiser force-pushed the faster_large_constant_arrays branch from 7c64a63 to 46512e0 Compare July 1, 2018 14:32
@wesleywiser
Copy link
Member Author

Fixed tidy

@oli-obk
Copy link
Contributor

oli-obk commented Jul 1, 2018

@bors r+

We should probably add a bunch of tests to the perf tests to ensure this doesn't regress.

@bors
Copy link
Contributor

bors commented Jul 1, 2018

📌 Commit 46512e0 has been approved by oli-obk

@bors bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 1, 2018
@bors
Copy link
Contributor

bors commented Jul 1, 2018

⌛ Testing commit 46512e0 with merge a2be769...

bors added a commit that referenced this pull request Jul 1, 2018
…i-obk

Speed up compilation of large constant arrays

This is a different approach to #51672 as suggested by @oli-obk. Rather
than write each repeated value one-by-one, we write the first one and
then copy its value directly into the remaining memory.

With this change, the [toy program](https://github.com/rust-lang/rust/blob/c2f4744d2db4e162df824d0bd0b093ba4b351545/src/test/run-pass/mir_heavy_promoted.rs) goes from 63 seconds to 19 seconds on my machine.

Edit: Inlining `Size::bytes()` saves an additional 6 seconds dropping the total time to 13 seconds on my machine.

Edit2: Now down to 2.8 seconds.

r? @oli-obk

cc @nnethercote @eddyb
@bors
Copy link
Contributor

bors commented Jul 1, 2018

☀️ Test successful - status-appveyor, status-travis
Approved by: oli-obk
Pushing a2be769 to master...

@bors bors merged commit 46512e0 into rust-lang:master Jul 1, 2018
@michaelwoerister
Copy link
Member

@michaelwoerister @scottmcm I tried that but it only saved about 100ms. perf indicates that the memory copy operations aren't hot. Good idea though!

Thanks for giving it a try, @wesleywiser. Great work!

wesleywiser added a commit that referenced this pull request Oct 3, 2018
In  #51833, I improved the performance of `copy_undef_mask()`. As such, the old FIXME wasn't appropriate anymore. The main remaining thing left to do is to implement a fast path for non-overlapping copies (per @oli-obk).
kennytm added a commit to kennytm/rust that referenced this pull request Oct 3, 2018
…i-obk

Update a FIXME in memory.rs

In  rust-lang#51833, I improved the performance of `copy_undef_mask()`. As such, the old FIXME wasn't appropriate anymore. The main remaining thing left to do is to implement a fast path for non-overlapping copies (per @oli-obk).

r? @oli-obk
pietroalbini added a commit to pietroalbini/rust that referenced this pull request Oct 3, 2018
…i-obk

Update a FIXME in memory.rs

In  rust-lang#51833, I improved the performance of `copy_undef_mask()`. As such, the old FIXME wasn't appropriate anymore. The main remaining thing left to do is to implement a fast path for non-overlapping copies (per @oli-obk).

r? @oli-obk
pietroalbini added a commit to pietroalbini/rust that referenced this pull request Oct 4, 2018
…i-obk

Update a FIXME in memory.rs

In  rust-lang#51833, I improved the performance of `copy_undef_mask()`. As such, the old FIXME wasn't appropriate anymore. The main remaining thing left to do is to implement a fast path for non-overlapping copies (per @oli-obk).

r? @oli-obk
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants