Speed up compilation of large constant arrays #51833

wesleywiser · 2018-06-27T04:06:44Z

This is a different approach to #51672 as suggested by @oli-obk. Rather
than write each repeated value one-by-one, we write the first one and
then copy its value directly into the remaining memory.
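
A minimal sketch of the idea, assuming a plain byte buffer stands in for the interpreter's allocation. The function name `write_repeated` and its parameters are hypothetical and are not the code in `memory.rs`:

```rust
/// Fill `dest` with `length` copies of `value` by writing the first element
/// once and then copying it into each remaining slot.
/// Hypothetical helper for illustration only.
pub fn write_repeated(dest: &mut [u8], value: &[u8], length: usize) {
    let size = value.len();
    assert_eq!(dest.len(), size * length, "dest must hold exactly `length` elements");
    if size == 0 || length == 0 {
        return; // nothing to write
    }

    // Write the first element the "slow" way, once.
    dest[..size].copy_from_slice(value);

    // Copy the already-written element into each remaining slot.
    for i in 1..length {
        dest.copy_within(0..size, i * size);
    }
}
```

Calling `write_repeated(&mut buf, &[0xAB, 0xCD], 3)` on a six-byte buffer would leave the two-byte pattern in it three times.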

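With this change, compiling the [toy program](https://github.com/rust-lang/rust/blob/c2f4744d2db4e162df824d0bd0b093ba4b351545/src/test/run-pass/mir_heavy_promoted.rs) goes from 63 seconds to 19 seconds on my machine.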

Edit: Inlining Size::bytes() saves an additional 6 seconds, dropping the total time to 13 seconds on my machine.

Edit2: Now down to 2.8 seconds.

r? @oli-obk

cc @nnethercote @eddyb


nnethercote · 2018-06-27T04:49:12Z

Nice work!

> Edit: Inlining Size::bytes() saves an additional 6 seconds, dropping the total time to 13 seconds on my machine.

For the first patch in #51672 I inlined several methods in addition to bytes(). This was based on profiling data -- while bytes() is clearly the hottest, the others were still hot, and I think it would be worth trying to inline all of them to see if you get a bigger speed-up.
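
As a rough illustration of what inlining these accessors looks like, here is a simplified stand-in for a `Size` newtype. This is an assumption for illustration, not the actual `abi::Size` definition in rustc. The `#[inline]` attribute is a hint that also makes the function body available for inlining from other crates:

```rust
/// Simplified stand-in for a byte-size newtype (not rustc's real `abi::Size`).
#[derive(Copy, Clone, Debug, PartialEq, Eq)]
pub struct Size {
    raw: u64,
}

impl Size {
    /// Construct a size from a byte count.
    #[inline]
    pub fn from_bytes(bytes: u64) -> Size {
        Size { raw: bytes }
    }

    /// The hot accessor: returns the size in bytes.
    #[inline]
    pub fn bytes(self) -> u64 {
        self.raw
    }

    /// The size in bits.
    #[inline]
    pub fn bits(self) -> u64 {
        self.raw * 8
    }
}
```

Whether each attribute pays off is a question for profiling, as noted above.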

oli-obk

Those are some awesome improvements! After the review is addressed we'll throw it into the perf tests.

oli-obk · 2018-06-27T09:08:20Z

src/librustc_mir/interpret/memory.rs

            }
        }

-        self.copy_undef_mask(src, dest, size)?;
+        self.copy_undef_mask(src, dest, size * length)?;
        // copy back the relocations
        self.get_mut(dest.alloc_id)?.relocations.insert_presorted(relocations);


I think you need to repeat this, too (and offset the indices).

Try a [&FOO; 500] (for non-ZST FOO) and then access any field but the first (at compile-time! at runtime you'll get a segfault). If I'm reading the code correctly this will tell you about a dangling pointer.
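
A minimal sketch of the kind of program described above. The names `Foo`, `FOO`, `ARR`, and `LAST_B` are hypothetical and are not taken from the PR's regression tests:

```rust
// Hypothetical non-zero-sized type used only for illustration.
pub struct Foo {
    pub a: u32,
    pub b: u32,
}

pub const FOO: Foo = Foo { a: 1, b: 2 };

// Every element is a pointer to the same value, so each of the 500
// slots needs its own relocation at the right offset.
pub const ARR: [&Foo; 500] = [&FOO; 500];

// Reading through an element other than the first at compile time
// only works if the relocation for that element was copied as well.
pub const LAST_B: u32 = ARR[499].b;
```

If only the first element's relocation were copied, evaluating `LAST_B` would be expected to hit the dangling-pointer error mentioned above.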

Got it, thanks! Can you double check my math?

oli-obk · 2018-06-27T09:14:23Z

src/librustc_mir/interpret/memory.rs

            }
        }

-        self.copy_undef_mask(src, dest, size)?;
+        self.copy_undef_mask(src, dest, size * length)?;


While this produces the correct result, it does n^2/2 copies instead of n copies. Inside the function itself we should probably move the self.get(src.alloc_id)? out of the loops, too. We can probably also improve the nonoverlapping case enormously by not requiring an intermediate allocation.

michaelwoerister · 2018-06-27T09:44:58Z

It would be interesting to see if this could be further sped up by copying larger chunks of memory at a time. Right now this makes n calls to memcpy, but memcpy can be more efficient when it works with more memory at once (via SIMD). We could try increasing the chunk size and see what that does to performance. In particular, copy [0 .. size] to [size .. size*2], then [0 .. size*2] to [size*2 .. size*4], then [0 .. size*4] to [size*4 .. size*8], etc., up to a maximum chunk size. This makes copy_repeatedly quite a bit more complicated, but it might be worth it?
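
The doubling scheme could be sketched roughly as follows on a plain byte slice. The function name `fill_by_doubling` and the `max_chunk` parameter are assumptions for illustration, not part of `copy_repeatedly`:

```rust
/// Assuming `buf[..size]` already holds one element, fill the rest of `buf`
/// with repeats of it, doubling the copied chunk each round up to `max_chunk`.
/// Hypothetical helper for illustration only.
pub fn fill_by_doubling(buf: &mut [u8], size: usize, max_chunk: usize) {
    assert!(size > 0 && size <= buf.len());

    // Round the cap down to a whole number of elements (at least one), so
    // every chunk boundary stays aligned with the repeating pattern.
    let max_chunk = (max_chunk / size).max(1) * size;

    let mut filled = size; // bytes at the front that already hold the pattern
    while filled < buf.len() {
        let remaining = buf.len() - filled;
        let chunk = filled.min(max_chunk).min(remaining);
        // Source and destination never overlap because chunk <= filled.
        buf.copy_within(0..chunk, filled);
        filled += chunk;
    }
}
```

This needs roughly log2(n) copies until the chunk cap is reached, instead of n small ones.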


scottmcm · 2018-06-28T04:50:25Z

> increase the chunk size

Looks like the logic for that already exists in core:

rust/src/liballoc/slice.rs

Line 422 in 3515dab

// `2^expn` repetition is done by doubling `buf` `expn`-times.

Maybe there's a way to extract the copying part into an unsafe ptr method and re-use it here?

michaelwoerister · 2018-06-28T08:49:43Z

@scottmcm Great find!

wesleywiser · 2018-06-28T14:15:49Z

@nnethercote That's a good idea. I tried that and it saved an additional 3 seconds on the compile time (now 10 seconds).


wesleywiser · 2018-06-30T04:49:49Z

@michaelwoerister @scottmcm I tried that but it only saved about 100ms. perf indicates that the memory copy operations aren't hot. Good idea though!

Edit: I can push that commit too if you think it's worth pursuing.

wesleywiser · 2018-06-30T04:50:16Z

With the latest changes, the compile time is down to 2.8 seconds on my machine.

nnethercote · 2018-06-30T05:18:33Z

> With the latest changes, the compile time is down to 2.8 seconds on my machine.

A 22.5x speedup!

oli-obk

Awesome improvements. Just a nit so miri the tool keeps working.

oli-obk · 2018-06-30T07:39:52Z

src/librustc_mir/interpret/memory.rs

@@ -882,25 +882,16 @@ impl<'a, 'mir, 'tcx, M: Machine<'mir, 'tcx>> Memory<'a, 'mir, 'tcx, M> {
    ) -> EvalResult<'tcx> {
        // The bits have to be saved locally before writing to dest in case src and dest overlap.


This comment makes me think that we should not do this commit, otherwise we'll run into trouble in the future (and in miri right now). Can you do an if for whether there is overlap and if there is, just run the old code?

Hmm. I thought I preserved the existing behavior by cloning the source allocation's undef_mask before writing to the destination's. Is that sufficient?

oh right. sorry. I misread the code.

I still think the code isn't doing the right thing. It's only copying once, when it should be copying N-1 times.

You can try this out by creating an array of types with padding, everything starting at the third element will probably not have undef masks for the padding. (you'll need unions to get the bits and then attempt to use them for an array length to actually get a compiler error from that)

I'm afraid I'm not quite following. We do call this function with size * length so shouldn't it cover all of the repeated copies? Can you provide a sample program that will fail?

yes, you are using the length, but that just means that the entire array is copied from 0..N to 1..=N, not that the 1st element is copied N times.

I'll make a regression test

I'm fairly certain that the following test will compile successfully on your PR: http://play.rust-lang.org/?gist=1d0183fcfb65164d1ca58ccd9614c33c

oli-obk · 2018-06-30T14:43:34Z

src/librustc_mir/interpret/memory.rs

-                );
-            }
+        for i in 0..size.bytes() {
+            let defined = undef_mask.get(src.offset + Size::from_bytes(i));


If you pass a repeat counter to the function, you should be able to just modulo the i here over the size and have the for loop go from 0 to size.bytes() * repeat.
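
A minimal sketch of that suggestion, assuming a `bool` slice stands in for the undef mask. The function name `copy_mask_repeatedly` and its parameters are hypothetical and do not match the real `copy_undef_mask` signature:

```rust
/// Copy the definedness bits of `mask[src..src + size]` into `repeat`
/// consecutive slots starting at `dest`.
/// Hypothetical helper for illustration only.
pub fn copy_mask_repeatedly(
    mask: &mut [bool],
    src: usize,
    dest: usize,
    size: usize,
    repeat: usize,
) {
    // Save the source bits first, in case the source and destination overlap.
    let saved: Vec<bool> = mask[src..src + size].to_vec();

    // One pass over the whole destination; `i % size` selects the matching
    // bit of the single source element for every repeated copy.
    for i in 0..size * repeat {
        mask[dest + i] = saved[i % size];
    }
}
```

With this shape, padding bytes of the first element stay undefined in every repeated element, not only in the first copy.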

oli-obk · 2018-06-30T15:41:54Z

Please also add a test for http://play.rust-lang.org/?gist=1c0e90ac9064edfa12fbd286902e20ef to make sure we always properly copy the relocations.

wesleywiser · 2018-06-30T19:27:41Z

Added tests and fixed that issue

rust-highfive · 2018-06-30T19:42:10Z

The job x86_64-gnu-llvm-3.9 of your PR failed on Travis (raw log). Through arcane magic we have determined that the following fragments from the build log may contain information about the problem.



[00:05:01] travis_fold:start:tidy
travis_time:start:tidy
tidy check
[00:05:01] tidy error: /checkout/src/test/compile-fail/const-err4.rs: missing trailing newline
[00:05:02] some tidy checks failed
[00:05:02] 
[00:05:02] 
[00:05:02] command did not execute successfully: "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0-tools-bin/tidy" "/checkout/src" "/checkout/obj/build/x86_64-unknown-linux-gnu/stage0/bin/cargo" "--no-vendor" "--quiet"
[00:05:02] 
[00:05:02] 
[00:05:02] failed to run: /checkout/obj/build/bootstrap/debug/bootstrap test src/tools/tidy
[00:05:02] Build completed unsuccessfully in 0:01:52
[00:05:02] Build completed unsuccessfully in 0:01:52
[00:05:02] Makefile:79: recipe for target 'tidy' failed
[00:05:02] make: *** [tidy] Error 1

The command "stamp sh -x -c "$RUN_SCRIPT"" exited with 2.
travis_time:start:13ca9c5e
$ date && (curl -fs --head https://google.com | grep ^Date: | sed 's/Date: //g' || true)
---
travis_time:end:007a23e0:start=1530387249551729531,finish=1530387249560171319,duration=8441788
travis_fold:end:after_failure.3
travis_fold:start:after_failure.4
travis_time:start:012b7739
$ head -30 ./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers || true
head: cannot open ‘./obj/build/x86_64-unknown-linux-gnu/native/asan/build/lib/asan/clang_rt.asan-dynamic-i386.vers’ for reading: No such file or directory
travis_fold:end:after_failure.4
travis_fold:start:after_failure.5
travis_time:start:009c4c44
$ dmesg | grep -i kill

I'm a bot! I can only do what humans tell me to, so if this was not helpful or you have suggestions for improvements, please ping or otherwise contact @TimNN. (Feature Requests)

wesleywiser · 2018-07-01T15:11:46Z

Fixed tidy

oli-obk · 2018-07-01T16:36:24Z

@bors r+

We should probably add a bunch of tests to the perf tests to ensure this doesn't regress.

bors · 2018-07-01T16:36:24Z

📌 Commit 46512e0 has been approved by oli-obk

bors · 2018-07-01T18:43:48Z

⌛ Testing commit 46512e0 with merge a2be769...

@oli-obk

…i-obk Speed up compilation of large constant arrays This is a different approach to #51672 as suggested by @oli-obk. Rather than write each repeated value one-by-one, we write the first one and then copy its value directly into the remaining memory. With this change, the [toy program](https://github.com/rust-lang/rust/blob/c2f4744d2db4e162df824d0bd0b093ba4b351545/src/test/run-pass/mir_heavy_promoted.rs) goes from 63 seconds to 19 seconds on my machine. Edit: Inlining `Size::bytes()` saves an additional 6 seconds dropping the total time to 13 seconds on my machine. Edit2: Now down to 2.8 seconds. r? @oli-obk cc @nnethercote @eddyb

bors · 2018-07-01T20:48:52Z

☀️ Test successful - status-appveyor, status-travis
Approved by: oli-obk
Pushing a2be769 to master...

michaelwoerister · 2018-07-02T08:16:43Z

> @michaelwoerister @scottmcm I tried that but it only saved about 100ms. perf indicates that the memory copy operations aren't hot. Good idea though!

Thanks for giving it a try, @wesleywiser. Great work!

@oli-obk

In #51833, I improved the performance of `copy_undef_mask()`. As such, the old FIXME wasn't appropriate anymore. The main remaining thing left to do is to implement a fast path for non-overlapping copies (per @oli-obk).

@oli-obk

…i-obk Update a FIXME in memory.rs In rust-lang#51833, I improved the performance of `copy_undef_mask()`. As such, the old FIXME wasn't appropriate anymore. The main remaining thing left to do is to implement a fast path for non-overlapping copies (per @oli-obk). r? @oli-obk


Speed up compilation of large constant arrays

202aea5

This is a different approach to rust-lang#51672 as suggested by @oli-obk. Rather than write each repeated value one-by-one, we write the first one and then copy its value directly into the remaining memory.

rust-highfive assigned oli-obk Jun 27, 2018

rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Jun 27, 2018

Inline abi::Size::bytes()

63ab0cb

This saves an additional 6 seconds on the test program.

oli-obk requested changes Jun 27, 2018


Inline all methods on abi::Size

429bc8d

This saves 3 seconds on the test program.

stokhos mentioned this pull request Jun 29, 2018

Speed up compilation of huge constant arrays #51672

Closed

wesleywiser added 4 commits June 29, 2018 20:22

Optimize copy_undef_mask() by lifting some loop invariant operations

1ffa99d

This saves 4.5 seconds and takes the compile time down to 5.5 seconds.

Optimize copy_undef_mask() to use one pass

8f969ed

This saves 0.5 seconds on the test compilation.

Inline a few UndefMask methods.

c431f3f

This saves 2.5 seconds on the test program.

Fix relocations to include repeated values

84fe0c4

oli-obk requested changes Jun 30, 2018


oli-obk reviewed Jun 30, 2018


Copy undef_masks correctly for repeated bytes

faef6a3

Add two regression tests for const eval

46512e0

wesleywiser force-pushed the faster_large_constant_arrays branch from 7c64a63 to 46512e0 Compare July 1, 2018 14:32

oli-obk approved these changes Jul 1, 2018


bors added S-waiting-on-bors Status: Waiting on bors to run and complete tests. Bors will change the label on completion. and removed S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. labels Jul 1, 2018

bors merged commit 46512e0 into rust-lang:master Jul 1, 2018

wesleywiser mentioned this pull request Oct 3, 2018

Update a FIXME in memory.rs #54773

Merged
