Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[HACK] rustc_codegen_ssa: remove micro-opt around cleanup_ret trampoline. #84403

Closed
wants to merge 2 commits into from

Conversation

eddyb
Copy link
Member

@eddyb eddyb commented Apr 21, 2021

I'm hoping this only has a minimal impact, and that we can clean up this codepath.

My motivation for messing with this at all, is pre-generating all such auxiliary blocks, without having to know ahead of time whether they will get used via (the suboptimally named) llblock (which can be e.g. one of the targets of a conditional branch or switch) or funclet_br (used for direct branches).

(By pre-generating all such blocks, the lifecycles of basic block builders - and so the borrowing relationship between Bx and CodegenCx - could be much simpler. I already started on a refactor, and landing pad / funclet cleanup_ret trampolines are the main offenders, because they're on-demand)

Sadly, I don't have a good way to test a MSVC target, and much less so compare compiletimes (especially as AFAIK we don't have anything like perf.rust-lang.org for Windows - cc @Mark-Simulacrum)

I've switched out the PR CI builder to x86_64-msvc-2 for this test PR but I don't know how much good it will do - I'm hoping it will give me comparable build times to auto bors CI, at least.

@rust-highfive
Copy link
Collaborator

r? @pietroalbini

(rust-highfive has picked a reviewer for you, use r? to override)

@rust-highfive rust-highfive added the S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. label Apr 21, 2021
@rust-log-analyzer

This comment has been minimized.

@eddyb eddyb force-pushed the msvc-less-microopt branch from c299d36 to b43c8bb Compare April 21, 2021 17:35
@rust-log-analyzer

This comment has been minimized.

@eddyb eddyb force-pushed the msvc-less-microopt branch from b43c8bb to 1a996dc Compare April 21, 2021 18:01
@eddyb

This comment has been minimized.

@Mark-Simulacrum
Copy link
Member

At least for the regular PR builder, and the dist x86_64-unknown-linux-gnu builder (standard for try builds) we regularly see variance in the tens of minutes, so I don't think such a minor difference is going to be reflective (at least without dozens of runs and subsequent statistical analysis, if it's even possible).

@nagisa
Copy link
Member

nagisa commented Apr 21, 2021

I mildly (in my defense, it was ages ago) remember doing funclet-based cleanups the most obvious and straightforward way (including around this area) would cause LLVM to assert left and right, so I think this would want a crater run or similar on MSVC. But my experience is also from back when funclets were just introduced, so there's that.

I would be happy, perhaps, with just a compile of a few most prominent crates in the ecosystem and if that builds, then I'm happy with this PR, personally.

@workingjubilee
Copy link
Member

workingjubilee commented Apr 21, 2021

Is an MSVC-specific perf check possible? Not for an actual perf test, per se, mostly that such would be a fairly representative sample.

@eddyb
Copy link
Member Author

eddyb commented Apr 21, 2021

remember doing funclet-based cleanups the most obvious and straightforward way (including around this area) would cause LLVM to assert left and right, so I think this would want a crater run or similar on MSVC

Heh, not really surprised there. However, I expect the "less efficient" way to do this to always work, whereas the micro-opt is the one that could cause issues.

We always use the extra trampoline block (which itself does cleanup_ret) for non-trivial branches, and the micro-opt is to skip the trampoline and directly cleanup_ret from the source block. As long as the funclet shapes are the same, I expect the extra jump through the trampoline to work at least as good as a direct cleanup_ret (which I'm even a bit surprised LLVM allows more than one).

I also suspect that the cost of always going through the trampoline block is that LLVM's CFG simplification pass will be required to bring it back to the previous form, so it should impact compile times but not output quality (but I haven't confirmed this yet).

I'll look into setting up a Windows VM and finding some crates to test compile times on (presumably some of the larger ones on perf.r-l.o will do). Thanks for the feedback!

@workingjubilee
Copy link
Member

workingjubilee commented Apr 21, 2021

I should rephrase: I have a Windows machine at the moment.

I guess I can't use the Linux-specific perf infra for rustc-perf, but I can definitely iterate over the directory still and collect what time remarks I can? And artifact sizes, since I have been informed that it is desired to make sure those are constant.

EDIT: Oof, I didn't manage to pull something together satisfactorily because I ran into a wall of even most of the example crates I tried not being set up to compile on Windows. This commit does compile some of the big-ish ones though: diesel and ripgrep, for instance, and also passes the windows-rs and winapi test suites.

@eddyb eddyb force-pushed the msvc-less-microopt branch from 1a996dc to 6d892ed Compare April 22, 2021 07:38
@eddyb
Copy link
Member Author

eddyb commented Apr 22, 2021

FWIW I rebased on b849326 to match rustc 1.53.0-nightly (b84932674 2021-04-21).
I have a Windows 10 VM set up and will attempt to do some measurements with it today.

@eddyb
Copy link
Member Author

eddyb commented Apr 23, 2021

So far I've been able to get 10 rm build/*/stage1-*; ./x.py build --stage 2 runs, of both a base commit, and the change in this PR, and produced these measurements (the scale is in minutes, and I didn't include the smaller steps included in the total, like building stage1-std, because they would compress the scale further):

  • Preview
    image

  • Full chart (which you can also find on plotly, including the data)

    (click to open tall image)

    image

There is some variation but they mostly overlap well enough that I can say that this shouldn't impact CI times.

Also, stage2/bin/librustc_driver-*.dll seem to be identical size (but not contents) between the base and this PR, I wonder if that's only a coincidence?


Next up, I'll try to time UI tests (even harder, since parallelism is fully used, but at least I want to make sure there's no obvious large-scale difference. I'll probably not do 10 runs again).

(FWIW, the UI tests do pass, I've already checked that, and so do codegen tests)

After that, I can look at some ecosystem crates - build times, especially, given that @workingjubilee confirmed that a bunch of crates seem to continue compiling / run tests.

@pietroalbini pietroalbini removed their assignment Apr 23, 2021
@eddyb
Copy link
Member Author

eddyb commented Apr 24, 2021

I've now repeated the same for UI tests (rm -r build/*/test/ui; ./x.py test --stage N src/test/ui with N={1,2}), but ran into these complications which wasted a lot of time:

  • I hadn't realized what Windows Defender can do - just having "Real-time protection" on is enough to double UI test times, but repeating the tests gave me exponential increases (roughly doubling or tripling each time, until 15 minutes became 3 hours)
    • my guess for the reason that it affects UI tests, but not x.py build, is that UI tests create thousands of executables, most of which also get executed, and an antivirus (presumably) wants to scan them all
  • it also turns out that the build directory matters for tests, and I shouldn't have been using two workdirs (since the difference between them turned out to persist even between testing the same commit in both)

Anyway, here's the UI test data (again measured in minutes):

  • Preview
    image

  • Full chart (which you can also find on plotly, including the data)

    (click to open tall image)

    image

I wish it was easier to see, Plotly's default colors can combine rather annoyingly.

Anyway, it overlaps pretty well, with some excursions (upwards for PR (stage1) and downwards for PR (stage2)).

But the variation here is nothing compared to the difference between build dirs mentioned earlier as something I forgot to account for initially (which, for the same exact tested commit, had a separation of 16 seconds between the max of the build dir with lower times and the min of the build dir with higher times, and 23 seconds between the averages).


Tomorrow, I'll start looking at ecosystem crates, but given everything so far, I expect almost no effects from this PR.

@eddyb
Copy link
Member Author

eddyb commented Apr 26, 2021

So I was gathering data on cargo, ripgrep and syn, all in release mode, using -Z self-profile, x10 runs for each of nightly-2021-04-22 (same commit as "base" but not locally built), base and PR (same builds as from the previous comment).

But the data looks pretty bad, e.g. this is cargo (values are the sum of all "self time"s for *LLVM* entries, in seconds):

image

I'm not sure why this didn't show up before, but it's either more "deterministic PRNG" nonsense (doubtful, this is in the same build directory, and the change is isolated to the backend), or an actual regression in build times due to this PR.

I believe I can (at least partially) resolve the LLVM side of the problem (and it does look like it's mostly in LLVM) by keeping the micro-opt (of a cleanup_ret that bypasses the trampoline block) but still emit the block regardless (which is the reason I wanted to remove the micro-opt - the br vs cleanup_ret difference itself doesn't matter much).

If I can remove the regression, I'll probably open a different PR, and close this as-is.
I'll also likely stick to mostly benching cargo+ripgrep+syn, and not waste whole days on timing builds/UI tests.

@Mark-Simulacrum
Copy link
Member

Hm, when you compare against nightly, are you building that locally? there are a number of things CI does which aren't default locally (e.g., CGU=1 for std)

@eddyb
Copy link
Member Author

eddyb commented Apr 26, 2021

Hm, when you compare against nightly, are you building that locally? there are a number of things CI does which aren't default locally (e.g., CGU=1 for std)

Ah I forgot to mention this but nightly is only there as a rough sanity check. The fact that it's close but a bit faster is pretty much what I would expect. The important difference is "Base" vs "PR" (which is pretty clearly a regression).

@eddyb
Copy link
Member Author

eddyb commented Apr 26, 2021

This is the same data as in #84403 (comment) but I've added a measurement of this idea, as "Compromise":

by keeping the micro-opt (of a cleanup_ret that bypasses the trampoline block) but still emit the block regardless

image

I'd say that removes most of the regression, and the rest may be partly avoided by adding any caching (instead of always generating a fresh new block on the fly), deleting the block if it's never actually used, etc.

However, that approach has no need to land separately (since it's just wastefully creating extra LLVM blocks that are never branched into, there's no correctness concerns), so I'll incorporate it into the larger refactor and close this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
S-waiting-on-review Status: Awaiting review from the assignee but also interested parties.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants