trans/LLVM: Don't keep all LLVM modules in memory at the same time #39280
Will this actually cause problems? Unless there are too many threads (in which case that should be addressed), it feels like the OS will handle this better by paging to disk than we will by (effectively) doing so manually. Also, for many devs, 10 GB is not too much memory usage, and anything we do here (unless we're really smart) will slow down compilation for those users.
My first thought would be the latter: consider translation part of a "work unit" and don't actually start doing it until it's ready to go all the way into LLVM. That way we cap the number of active LLVM modules at any one point in time. I think @nrc's got a good point, though, in that using the disk as a cache may be better left to the OS rather than the compiler itself (until proven otherwise, of course).
Some data points:
It led to my not being able to compile the
If everything fits into memory at once, having hundreds of parallel LLVM threads does not seem to have a detrimental effect (running 140 threads was a few percent faster than having 4 threads on a 4 core processor). OTOH, I would suspect that paging would completely destroy performance if many threads had to fight for RAM.
That's certainly a valid concern!
If you really want to allow the OS to do a good job of paging out memory, you need to ensure that modules don't share pages with each other. If a given module is spread out across your memory with lots of fragmentation, then you'll have to have a lot more memory paged in than necessary, reducing the effectiveness of the OS paging stuff out. Also, there are ways to hint to the OS that you don't need some memory at the moment, so it can be more willing to page it out sooner in favor of keeping other memory resident.
It seems to me like letting the OS take care of this via paging is really the most efficient and most easily implemented solution. But I'm also a bit concerned about
Would option 2 actually be any slower or otherwise less efficient? It doesn't seem to do any additional work. As @retep998 outlines, it is not guaranteed that the OS will actually do a perfect job of paging modules in and out. So ignoring option 1, which does a lot of unnecessary disk I/O, the choice seems to be between doing the same work and either:
I'm a bit confused. It seems to me that there are three things that need to be done:
Currently, we do the first one for everything. Then we start some threads to do the second one -- at what point do we do the third one? (I can't remember.) In particular, do we wait for all the threads to finish?
That's done by the last LLVM pass, IIRC.
@michaelwoerister so, in that case, I disagree with the assessment of "Let the OS do it". In particular, it seems likely that "optimization and transformation of the LLVM IR" takes a lot of (temporary) memory. If we start N threads (one per CGU) and those are all going simultaneously, that's going to lead to a larger peak than is necessary. But if we throttle starting at phase 2, this will reduce the peak, right? In particular, the OS is good at paging out memory that is no longer used, but if we start N threads, all that memory is in active use (the "working set" is large). If we throttle, the working set is smaller, and we can free as we finish each item. It might reduce the peak further to do the full pipeline one CGU at a time, but maybe not, because as it is we can free the HIR and tcx entirely once trans is done (and I believe we do so).
I would limit the number of concurrent threads to something around the number of CPU cores, since there is no real reason to have one thread per CGU. That would limit the working set. If we keep doing as many threads as CGUs then yes, paging won't be efficient.
Yes, that is the main disadvantage of that approach. However, as @rkruppe noted, we can free the tcx once the last LLVM module has been generated. If you have a low number of codegen units (as is common in the non-incremental case), then we can free the tcx pretty early. If you have a high number of CGUs, then chances are good that the amount of memory needed per LLVM module is low, so keeping N small LLVM modules in addition to the tcx might not be a big deal. One advantage of approach (2) is that translation and LLVM can run in parallel for much of the time.
Trans would have to be throttled, though, so that the queue doesn't fill up and consume all memory again.
@michaelwoerister that seems reasonable. Though I think that restructuring in this way is a mild amount of work, whereas limiting the number of worker threads is very easy, right? In any case, I imagine we would basically start N threads all pulling from a single fixed-length channel. Each item they pull out is an LLVM module, so they would process it. Meanwhile, trans would be doing a loop like this:

```rust
let mut fixed_len_channel = channel(1); // fixed-length channel of length 1
start_threads();
for item in work_items {
    let llvm_ir = translate(item);
    fixed_len_channel.send(llvm_ir); // will block if the channel is full
}
free_tcx_etc();
join_threads();
```

Right? Seems nice.
(Of course we could make the fixed-length channel longer, too; any value will serve to throttle.)
Yes, that's pretty much how I imagined it.
Yes.
…, r=nikomatsakis back: Limit the number of LLVM worker threads. This should fix issue rust-lang#39568. Also see rust-lang#39280. r? @nikomatsakis
I hit this independently when experimenting with incremental compilation on Servo’s CI: servo/servo#15565 (comment) On a builder with 8 GB of RAM and no swap, rustc crashed with:
This feels like it would be a pretty easy thing to fix. I'll try to either take a crack at it or write up some mentoring instructions tomorrow. I am behind on incremental PRs -- I had hoped to get to it today, but spent the whole day reading notifications instead. Yay for 3-day weekends! :)
@nikomatsakis Done. servo/servo#15565 (comment)
Just to follow up: The simple "fix" I implemented will only make swapping less of a problem, since fewer threads will compete for the physical memory available. If
It's been a while since this issue has been updated. What's the current status?
Some refactoring has been done, but the main change remains undone. I've been meaning for a while now to write up a few notes and try to mentor a PR for this.
OK, so, I think this would be an awesome bug for someone who is interested in helping out with incremental. It's going to be a certain amount of refactoring, but a lot of it is done, I think. To be honest, I'm not 100% sure what's the best path here, so let me lay out a few thoughts and see if anyone is interested in picking it up.

### The big goal

Right now, we do the following to generate code:
The problem is that storing the LLVM bitcode for ALL the modules leads to a huge memory spike. This in turn leads to slow compilation. How slow? Hard to say. I see that using incremental the base cost of building the script crate is 2x slower than not (at least if I remember correctly), but I'm not sure how much of that is attributable to this spike in memory usage. Regardless, it's a problem.

### One (relatively minor) complication

One complication is step 2 in the list above. If we are not in incremental mode, then we have this bit of code that figures out which functions wind up getting used by other codegen units, marks them as "external", and marks the rest as internal. Marking things as internal permits more optimization. We don't do this in incremental mode, though, because we want to allow other modules to be updated independently. The problem is that this pass inherently requires us to have built all codegen units so we can walk them, and we have to do that before we do any optimizations. It's sort of the opposite of what we want.

There are two ways to fix this: the right way and the wrong way. The right way is probably to modify the "trans collector" (which is the thing that creates the codegen units in the first place) and have IT figure out what must be marked as private/public. This would allow us to do the optimization even in incremental mode (though we might choose not to), since we could easily detect when things have changed (i.e., when something must now be public that was private before) and recompile as needed. The wrong way is to keep the current setup, which is also very clean in its own way, but have it only enabled if (a) incremental is off and (b) the number of codegen units is small. Then we can basically disable the "pipelining" if it's not going to buy us much, effectively trading it for better optimization.
### One bigger complication

We have to restructure the driver and the way it separates "phase 4" (LLVM generation) from "phase 5" (LLVM optimization). This distinction isn't really that meaningful, and eliminating it is, after all, the whole point of this bug.

### The overall plan

My general plan is basically to start up a thread (or many threads) that will do LLVM optimization early on, basically before phase 4. These threads will share a fixed-size queue (with a size ~ the number of cores or something) with the LLVM generation threads. The generation threads will generate IR and then push it into the queue; if the queue is full, they will block. The other threads will take things out of the queue and process them. What is now called "phase 5" would thus not be starting the threads but rather joining the threads. To handle the "internalize symbols" stuff, we might tweak this process a bit -- e.g., if there is a very small number of CGUs, then we could delay starting the worker threads, accumulate the CGUs into a vector, process the vector in a big chunk, and then fire off all the completed results to the worker threads.

### Pieces

If you're interested, there are probably a couple of independent PRs:
Or maybe some other path. This is not an easy refactor, but it seems like a fun one to me!
Hi, I'm interested in doing this. :) @nikomatsakis |
@krstoff great! Let me know if I can be of help. I think probably the best place to start is just by reshuffling the "phase 4/phase 5" setup so that LLVM optimizations aren't something the user can separately invoke.
@krstoff any updates? I'm thinking of taking a crack at some preliminary refactoring here, but I don't want to step on your toes.
@nikomatsakis - sounds like this one might be open for people jumping in. Would still be great to see this one go in.
@michaelwoerister has done some initial refactoring, removing at least one obstacle here.
I think he plans to keep working on it.
…lvm, r=eddyb trans: Internalize symbols without relying on LLVM. This PR makes the compiler use the information gathered by the trans collector in order to determine which symbols/trans-items can be made internal. This has the advantages:
+ of being LLVM independent,
+ of also working in incremental mode, and
+ of allowing us to not keep all LLVM modules in memory at the same time.

This is in preparation for fixing issue #39280. cc @rust-lang/compiler
@michaelwoerister this is closed now, right, with #43506 landed?
Yes 🎉 |
UPDATE: There are some mentoring instructions below if you're interested in looking into this issue.
NB: Part of the roadmap issue on incremental compilation.
Currently the compiler translates all codegen units into LLVM modules and then runs LLVM on all of them in parallel. In the context of incremental compilation, where there can be hundreds of codegen units for a single crate, but also for non-incremental builds with a high number of codegen units, this can mean a lot of memory pressure -- e.g. for Servo's `script` crate that's more than 10 GB peak memory usage (while otherwise it's around 4.5 GB).

There's no real need to keep more LLVM modules in memory than are currently being worked on. Two possible solutions:

1. With `N` LLVM threads, start persisting LLVM modules to disk at the `N+1`st and reload them later for optimization and codegen. That's probably relatively easy to implement but involves more disk I/O than necessary.
2. Translate `N` codegen units to LLVM modules, translate them all the way to object files, then go back and translate the next codegen unit. This has the advantage that we would not need to temporarily store anything on disk, but we would need to keep the `tcx` in memory as long as there are still untranslated codegen units. It would also require a much bigger refactoring than the first approach.

Any other ideas?
cc @rust-lang/compiler @rust-lang/tools @nagisa