[HACK] rustc_codegen_ssa: remove micro-opt around cleanup_ret trampoline. #84403
(rust-highfive has picked a reviewer for you, use r? to override) |
Force-pushed: c299d36 to b43c8bb
Force-pushed: b43c8bb to 1a996dc
At least for the regular PR builder, and the dist x86_64-unknown-linux-gnu builder (standard for try builds) we regularly see variance in the tens of minutes, so I don't think such a minor difference is going to be reflective (at least without dozens of runs and subsequent statistical analysis, if it's even possible). |
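As a rough illustration of the kind of statistical check mentioned above, one could compare the means of two sets of wall-clock timings against the spread within each set. This is only a minimal sketch: the numbers are made-up placeholders (not measurements from this PR), the helper names are hypothetical, and "gap smaller than the larger standard deviation" is a crude heuristic rather than a proper significance test.

```rust
/// Arithmetic mean of a non-empty slice.
fn mean(xs: &[f64]) -> f64 {
    xs.iter().sum::<f64>() / xs.len() as f64
}

/// Sample standard deviation (n - 1 denominator); needs at least two values.
fn sample_std_dev(xs: &[f64]) -> f64 {
    let m = mean(xs);
    let var = xs.iter().map(|x| (x - m).powi(2)).sum::<f64>() / (xs.len() as f64 - 1.0);
    var.sqrt()
}

/// Crude heuristic: is the gap between the means no larger than the larger spread?
fn within_noise(a: &[f64], b: &[f64]) -> bool {
    let gap = (mean(a) - mean(b)).abs();
    gap <= sample_std_dev(a).max(sample_std_dev(b))
}

fn main() {
    // Placeholder timings in minutes, purely illustrative.
    let base = [30.0, 32.0, 31.0, 35.0, 29.0];
    let pr = [31.0, 33.0, 30.0, 34.0, 32.0];
    println!("base: mean {:.2}, sd {:.2}", mean(&base), sample_std_dev(&base));
    println!("pr:   mean {:.2}, sd {:.2}", mean(&pr), sample_std_dev(&pr));
    println!("within noise: {}", within_noise(&base, &pr));
}
```

With variance in the tens of minutes, a check like this would need many runs per side before a small difference could be told apart from noise.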
I vaguely remember (in my defense, it was ages ago) that doing funclet-based cleanups the most obvious and straightforward way (including around this area) would cause LLVM to assert left and right, so I think this would want a crater run or similar on MSVC. But my experience is also from back when funclets were just introduced, so there's that. Personally, I would be happy with just compiling a few of the most prominent crates in the ecosystem; if those build, then I'm happy with this PR. |
Is an MSVC-specific perf check possible? Not as an actual perf test per se, but mostly because that would be a fairly representative sample. |
Heh, not really surprised there. However, I expect the "less efficient" way to do this to always work, whereas the micro-opt is the one that could cause issues. We always use the extra trampoline block (which itself does I also suspect that the cost of always going through the trampoline block is that LLVM's CFG simplification pass will be required to bring it back to the previous form, so it should impact compile times but not output quality (but I haven't confirmed this yet). I'll look into setting up a Windows VM and finding some crates to test compile times on (presumably some of the larger ones on perf.r-l.o will do). Thanks for the feedback! |
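To make the "CFG simplification brings it back to the previous form" idea concrete, here is a toy model of folding away a forwarding block. It is only a sketch under loud assumptions: `Block`, `Terminator` and `fold_trampolines` are hypothetical names, plain jumps stand in for the real terminators (cleanup_ret semantics are not modelled), and this is not how rustc or LLVM represent or simplify a CFG.

```rust
use std::collections::HashMap;

/// Toy terminator: every block either jumps to another block or returns.
#[derive(Clone, Debug, PartialEq)]
enum Terminator {
    Jump(usize),
    Return,
}

/// Toy basic block: a count of "real" instructions plus a terminator.
#[derive(Clone, Debug, PartialEq)]
struct Block {
    instructions: usize,
    terminator: Terminator,
}

/// Retarget jumps that go through empty forwarding blocks ("trampolines").
/// Single pass only (chains of trampolines would need repeated calls).
/// Returns how many jumps were rewritten.
fn fold_trampolines(blocks: &mut [Block]) -> usize {
    // Map each empty forwarding block to the block it forwards to.
    let forwards: HashMap<usize, usize> = blocks
        .iter()
        .enumerate()
        .filter_map(|(i, b)| match b.terminator {
            Terminator::Jump(t) if b.instructions == 0 && t != i => Some((i, t)),
            _ => None,
        })
        .collect();

    let mut rewritten = 0;
    for block in blocks.iter_mut() {
        if let Terminator::Jump(target) = block.terminator {
            if let Some(&dest) = forwards.get(&target) {
                block.terminator = Terminator::Jump(dest);
                rewritten += 1;
            }
        }
    }
    rewritten
}

fn main() {
    // Block 0 jumps to trampoline 1, which only forwards to block 2.
    let mut blocks = vec![
        Block { instructions: 3, terminator: Terminator::Jump(1) },
        Block { instructions: 0, terminator: Terminator::Jump(2) },
        Block { instructions: 1, terminator: Terminator::Return },
    ];
    let rewritten = fold_trampolines(&mut blocks);
    println!("rewrote {rewritten} jump(s): {blocks:?}");
}
```

The point of the toy is only that the extra block costs a pass over the CFG to remove, which is compile time rather than output quality.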
I should rephrase: I have a Windows machine at the moment. I guess I can't use the Linux-specific perf infra for rustc-perf, but I can definitely iterate over the directory still and collect what time remarks I can? And artifact sizes, since I have been informed that it is desired to make sure those are constant. EDIT: Oof, I didn't manage to pull together anything satisfactory, because I ran into a wall: most of the example crates I tried aren't even set up to compile on Windows. This commit does compile some of the big-ish ones though (diesel and ripgrep, for instance), and also passes the windows-rs and winapi test suites. |
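For the artifact-size part, a portable approach that needs no Linux-specific tooling is to sum file sizes under a build output directory using only the standard library. A minimal sketch, assuming the directory to measure is passed as the first argument (the `dir_size` helper is a hypothetical name, not part of rustc-perf):

```rust
use std::fs;
use std::io;
use std::path::Path;

/// Recursively sum the sizes (in bytes) of all regular files under `dir`.
/// Symlinks are not followed, since `DirEntry::metadata` does not traverse them.
fn dir_size(dir: &Path) -> io::Result<u64> {
    let mut total = 0;
    for entry in fs::read_dir(dir)? {
        let entry = entry?;
        let meta = entry.metadata()?;
        if meta.is_dir() {
            total += dir_size(&entry.path())?;
        } else if meta.is_file() {
            total += meta.len();
        }
    }
    Ok(total)
}

fn main() {
    // Assumption: the directory to measure is the first argument, else the current one.
    let dir = std::env::args().nth(1).unwrap_or_else(|| ".".to_string());
    match dir_size(Path::new(&dir)) {
        Ok(bytes) => println!("{bytes} bytes under {dir}"),
        Err(err) => eprintln!("could not measure {dir}: {err}"),
    }
}
```

Running it on the same crate's output directory for the base and PR compilers would give two totals to compare.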
Force-pushed: 1a996dc to 6d892ed
FWIW I rebased on b849326 to match |
So far I've been able to get 10
There is some variation but they mostly overlap well enough that I can say that this shouldn't impact CI times. Also, Next up, I'll try to time UI tests (even harder, since parallelism is fully used, but at least I want to make sure there's no obvious large-scale difference; I'll probably not do 10 runs again). (FWIW, the UI tests do pass, I've already checked that, and so do codegen tests.) After that, I can look at some ecosystem crates - build times, especially, given that @workingjubilee confirmed that a bunch of crates seem to continue compiling / run tests. |
I've now repeated the same for UI tests (
Anyway, here's the UI test data (again measured in minutes):
I wish it was easier to see, Plotly's default colors can combine rather annoyingly. Anyway, it overlaps pretty well, with some excursions (upwards for But the variation here is nothing compared to the difference between build dirs mentioned earlier as something I forgot to account for initially (which, for the same exact tested commit, had a separation of Tomorrow, I'll start looking at ecosystem crates, but given everything so far, I expect almost no effects from this PR. |
So I was gathering data on But the data looks pretty bad, e.g. this is I'm not sure why this didn't show up before, but it's either more "deterministic PRNG" nonsense (doubtful, this is in the same build directory, and the change is isolated to the backend), or an actual regression in build times due to this PR. I believe I can (at least partially) resolve the LLVM side of the problem (and it does look like it's mostly in LLVM) by keeping the micro-opt (of a If I can remove the regression, I'll probably open a different PR, and close this as-is. |
Hm, when you compare against nightly, are you building that locally? there are a number of things CI does which aren't default locally (e.g., CGU=1 for std) |
Ah I forgot to mention this but nightly is only there as a rough sanity check. The fact that it's close but a bit faster is pretty much what I would expect. The important difference is "Base" vs "PR" (which is pretty clearly a regression). |
This is the same data as in #84403 (comment) but I've added a measurement of this idea, as "Compromise":
I'd say that removes most of the regression, and the rest may be partly avoided by adding any caching (instead of always generating a fresh new block on the fly), deleting the block if it's never actually used, etc. However, that approach has no need to land separately (since it's just wastefully creating extra LLVM blocks that are never branched into, there's no correctness concerns), so I'll incorporate it into the larger refactor and close this PR. |
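The caching idea mentioned above (reuse a block instead of generating a fresh one on every request) could look roughly like the following. This is a sketch with hypothetical stand-in types, not rustc's builder API, and keying the cache by target block alone is an assumption made for illustration.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for a backend's basic-block handle; not a rustc type.
type BlockId = usize;

#[derive(Default)]
struct TrampolineCache {
    /// Number of trampolines created so far; doubles as the next fresh id.
    created: usize,
    /// Trampoline already created for a given target block, if any.
    by_target: HashMap<BlockId, BlockId>,
}

impl TrampolineCache {
    /// Return the trampoline for `target`, creating it only on first request.
    fn trampoline_for(&mut self, target: BlockId) -> BlockId {
        if let Some(&existing) = self.by_target.get(&target) {
            return existing;
        }
        let fresh = self.created;
        self.created += 1;
        self.by_target.insert(target, fresh);
        fresh
    }
}

fn main() {
    let mut cache = TrampolineCache::default();
    let first = cache.trampoline_for(7);
    let again = cache.trampoline_for(7);
    println!("same trampoline reused: {}", first == again);
    println!("trampolines created: {}", cache.created);
}
```

Because creation happens only on the first request per target, blocks that are never asked for are never created, which also covers the "delete the block if it's never actually used" case in this toy setting.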
I'm hoping this only has a minimal impact, and that we can clean up this codepath.
My motivation for messing with this at all, is pre-generating all such auxiliary blocks, without having to know ahead of time whether they will get used via (the suboptimally named) llblock (which can be e.g. one of the targets of a conditional branch or switch) or funclet_br (used for direct branches).
(By pre-generating all such blocks, the lifecycles of basic block builders - and so the borrowing relationship between Bx and CodegenCx - could be much simpler. I already started on a refactor, and landing pad / funclet cleanup_ret trampolines are the main offenders, because they're on-demand)
Sadly, I don't have a good way to test a MSVC target, and much less so compare compile times (especially as AFAIK we don't have anything like perf.rust-lang.org for Windows - cc @Mark-Simulacrum)
I've switched out the PR CI builder to x86_64-msvc-2 for this test PR but I don't know how much good it will do - I'm hoping it will give me comparable build times to auto bors CI, at least.