[WIP] Parallelize the compiler via two-pass compilation #4767
Conversation
Also simplify the logic to write the .tasty files.
Useful when generating tasty outline files; see the next commit.
This avoids a cycle when unpickling scala.Predef. This change uncovered a bug when using -Ythrough-tasty: some trees were unpickled at the wrong phase because we use `withPhaseNoLater(ctx.picklerPhase)` in TreeUnpickler, but TASTYCompiler previously dropped the Pickler phase, so the phase change was a silent no-op. To avoid this issue, we change TASTYCompiler to not drop the Pickler phase; instead, we change Pickler#run to do nothing when running with -from-tasty. We should also change how the ctx.xxxPhase methods work to avoid this kind of silent issue.
This led to cycles when unpickling the standard library from Tasty.
Previously, the parameter of a dummy constructor was emitted with the flag "ParamAccessor" instead of the flag "Param"; this isn't meaningful and led to compilation errors when unpickled from tasty outline files.
This should be replaced by flags or tags in Tasty that actually represent the semantics of each Java construct we need to encode.
When this flag is enabled:
- The body of a definition (def, val or var) is not typechecked and is replaced by `???` (unless its result type needs to be inferred, or it's a special case, see Typer#canDropBody).
- Statements in a class body are dropped.
- .tasty files are emitted for Java source files too.
- Compilation is stopped after the Pickler phase (which will emit both .tasty files as well as empty .class files).
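To make the effect concrete, here is a small, self-contained illustration (not the compiler's actual code; the names are invented for the example) of what this mode effectively does to definitions before pickling:

```scala
object OutlineExample {
  // The signature of this definition is fully determined by its explicit
  // result type, so outline mode can record it without typechecking the body...
  def factorial(n: Int): Int =
    if (n <= 1) 1 else n * factorial(n - 1)

  // ...and pickle it as if it had been written like this:
  def factorialOutlined(n: Int): Int = ???

  // This definition has an inferred result type, so its body cannot be
  // dropped: the type List[Int] is only known after typechecking the body.
  def inferred = List(1, 2, 3).map(_ * 2)
}
```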
Ideally, all -*path options would work with lists of virtual or non-virtual directories, but that's not needed to get the proof-of-concept working. So instead we just reuse the same logic that is used to make "-d" work.
When enabled, compilation will proceed in two passes:
- The first pass is sequential and generates tasty outline files; these files are not written to disk but stored in memory.
- The second pass splits the list of input files into N groups and compiles each group in parallel. The tasty outline files from the first pass are available on the classpath of each of these compilers; they contain the type signatures needed for the separate compilation of each group to succeed.

TODO: Instead of splitting the input into N groups, implement work-stealing to avoid leaving some threads idle.
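A minimal sketch of the second pass as described above, under stated assumptions: `compileGroup` is a stand-in for launching a real compiler instance whose classpath includes the in-memory outlines from the first pass, and the names here are invented for illustration:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration.Duration

object SecondPassSketch {
  // Stand-in for running one compiler instance over a group of files.
  def compileGroup(files: List[String]): Unit =
    println(s"compiling ${files.size} files on ${Thread.currentThread.getName}")

  def secondPass(files: List[String], parallelism: Int): Unit = {
    val n =
      if (parallelism == 0) Runtime.getRuntime.availableProcessors()
      else parallelism
    val pool = Executors.newFixedThreadPool(n)
    implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
    // Naive static split into N groups; as the TODO notes, work-stealing
    // would balance the load better when groups compile at different speeds.
    val groupSize = math.max(1, (files.size + n - 1) / n)
    val groups = files.grouped(groupSize).toList
    val done = Future.traverse(groups)(g => Future(compileGroup(g)))
    Await.result(done, Duration.Inf)
    pool.shutdown()
  }
}
```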
I'm working on fixing this in another branch.
It's also worth noting that in a project with two subprojects A and B, where B depends on A, the output of the first-pass compilation for A could be used as input for B to start its compilation earlier instead of waiting for A to finish compiling; @jvican has been experimenting with something like this in the context of scalac.
Yes, it will be available in Bloop 1.1.0, together with implementation notes and a detailed explanation of how everything works.
Super exciting results! I like how you've been able to use compilation from Tasty to leverage the signatures produced in the first phase. Here's our take on this, based on a similar architecture but compatible with Scalac: twitter/rsc#85. We have implemented support for a subset of Scala to produce signatures for an automatically rewritten core of Twitter Util, and we will be working on adding support for more Scala features according to the roadmap.
Interesting! I think there's one significant difference from the approach I'm taking here, though: in https://github.com/twitter/rsc/blob/master/docs/language.md you state:
By contrast, the parallelism that this PR enables works fine with code bases that do not have explicit result types everywhere; the first pass will just be slower the more type inference it needs to do. As an example, I've tweaked the core of Twitter Util to make it compile with Dotty (I should add it to the dotty-community-build and report the bugs I worked around too). On my laptop, hot compilation with …
@@ -46,6 +46,7 @@ class ScalaSettings extends Settings.SettingGroup {
  val rewrite = OptionSetting[Rewrites]("-rewrite", "When used in conjunction with -language:Scala2 rewrites sources to migrate to new syntax")
  val silentWarnings = BooleanSetting("-nowarn", "Silence all warnings.")
  val fromTasty = BooleanSetting("-from-tasty", "Compile classes from tasty in classpath. The arguments are used as class names.")
+ val parallelism = IntSetting("-parallelism", "Number of parallel threads, 0 to use all cores.", 0)
Should it be all cores or all threads?
When parallelism is set to 0, we create one thread per "core" (for some definition of core) on your computer (to be more precise, we create `Runtime.getRuntime().availableProcessors()` threads).
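A one-liner capturing the default just described (the helper name is assumed for illustration, not the PR's actual code):

```scala
// Resolve the effective thread count: 0 means "one thread per logical core",
// as reported by the JVM.
def effectiveParallelism(setting: Int): Int =
  if (setting == 0) Runtime.getRuntime.availableProcessors()
  else setting
```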
This is really cool! I agree that dividing the typechecking into symbol table typechecking (signatures) and method body typechecking is the way to go. The key variable in this approach is how quickly one can compute the whole symbol table. This is the question I was researching in Kentucky Mule, and I became convinced the symbol table can be computed really quickly with careful effort. The symbol table calculation itself can be parallelized, which came as a surprise to me. I gave a talk on this subject at the SF Scala meetup two months ago. @smarter out of curiosity, do you plan to work more on parallelization of dotty? PS. I'm in Basel for the summer and happy to chat about this subject.
Yes, though I have other, higher priorities, so I'm not sure how much time I'll spend on it.
You should come to EPFL to give a talk :)
@smarter This has been inactive for 6 months. Should we keep it open?
Yes, I'm still on it; I think keeping PRs marked WIP open is OK.
There was no activity on this one for a long while, so let's close it. |
How do you parallelize a compiler? A first approach might be: given a list of source files, split it into N groups, then compile the groups using N compiler instances. If each file could be compiled independently this would work, but usually files refer to symbols defined in other files, so we need to know in advance about these symbols and their types. If we had previously compiled these files, this is easy: we can just read the compiler output (.class and .tasty files) to find every symbol and its type; this is how incremental compilation already works. Of course, we can't assume that the files we're trying to compile have already been compiled, but we don't need the full compiler output: we're only interested in the names and types of symbols. From this, we can sketch a simple way to parallelize the compiler:

1. Run a sequential first pass that computes only the information needed for separate compilation: the name and type of every symbol.
2. Split the list of input files into N groups and compile each group in parallel, with the output of the first pass available to every compiler instance.
This is what this PR implements using the flag `-parallelism N`. The first pass works by running the compiler in a special mode where:
- The body of a definition (def, val or var) is not typechecked and is replaced by `???` (unless its result type needs to be inferred, or it's a special case, see Typer#canDropBody).
- Statements in a class body are dropped.
- .tasty files are emitted for Java source files too.
- Compilation is stopped after the Pickler phase (which emits both .tasty files and empty .class files).

When compiling Dotty itself, the first pass compile time is about 20% of a regular non-parallel compilation (it'd probably be quite a bit less if we added explicit result types to every definition, I haven't tried that yet).
The second pass just runs N instances of the regular compiler, then combines the results.
The implementation complexity of this approach is very low (we only need to add a few lines of code to the typechecker), and it already gives interesting results. I haven't done any rigorous benchmarking yet, but here's what I get on my laptop (quad core, with hyper-threading enabled) when compiling Dotty itself:
I suspect we could improve this significantly by implementing a work-stealing algorithm to decide which file will be compiled by which thread, instead of simply dividing the list of files into N groups, since threads may sit idle if some groups end up being faster to compile than others. An intermediate solution would be to have each thread compile approximately the same number of lines of code instead of the same number of files.
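For illustration, here is a hedged sketch of that intermediate solution: a greedy, longest-processing-time-style heuristic that assigns each file to the group with the fewest total lines so far (the object and method names are invented for the example):

```scala
object BalanceByLines {
  // files: (path, line count) pairs; n: number of compiler threads.
  // Returns n groups with roughly equal total line counts.
  def balance(files: List[(String, Int)], n: Int): Vector[List[String]] = {
    val groups = Array.fill(n)(List.empty[String])
    val totals = Array.fill(n)(0)
    // Place the largest files first, always into the least-loaded group.
    for ((file, lines) <- files.sortBy(f => -f._2)) {
      val i = totals.indices.minBy(j => totals(j))
      groups(i) = file :: groups(i)
      totals(i) += lines
    }
    groups.toVector
  }
}
```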
An interesting property of the two-pass approach is that it can be combined with other parallelization techniques if they bear fruit:
Note that this PR is based on #4467, so the first 5 commits can be skipped. Note also that this is still a work in progress, but it should already work well enough for people to experiment with it. Enjoy!