TLS lookups in libsyntax_pos are expensive #59718
Comments
Could these be sped up by using
If you're asking why it's not a plain |
#59655 allows you to compare symbols against a predefined list of symbols without doing a TLS lookup and a string comparison. That will hopefully help some. I'm also working on a PR which removes I'd also like to replace the |
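To make the "compare symbols against a predefined list" idea concrete, here is a minimal sketch. It assumes that well-known strings are given fixed indices before anything else is interned, so a comparison against one of them is a plain integer comparison that never touches the interner. The names `PREINTERNED`, `sym::CRATE`, `Interner`, and `is_path_root` are hypothetical and are not taken from #59655.

```rust
use std::collections::HashMap;

/// A symbol is just an index into the interner's string table.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Symbol(u32);

/// Hypothetical list of strings that are interned up front, in this order.
pub const PREINTERNED: &[&str] = &["self", "crate", "super"];

/// Constants whose indices match the positions in `PREINTERNED`.
pub mod sym {
    use super::Symbol;
    pub const SELF: Symbol = Symbol(0);
    pub const CRATE: Symbol = Symbol(1);
    pub const SUPER: Symbol = Symbol(2);
}

pub struct Interner {
    names: HashMap<String, Symbol>,
    strings: Vec<String>,
}

impl Interner {
    /// Creates an interner with the predefined strings already present,
    /// so that the constants in `sym` are valid for it.
    pub fn new() -> Self {
        let mut this = Interner { names: HashMap::new(), strings: Vec::new() };
        for &s in PREINTERNED {
            this.intern(s);
        }
        this
    }

    pub fn intern(&mut self, s: &str) -> Symbol {
        if let Some(&existing) = self.names.get(s) {
            return existing;
        }
        let fresh = Symbol(self.strings.len() as u32);
        self.names.insert(s.to_owned(), fresh);
        self.strings.push(s.to_owned());
        fresh
    }

    pub fn get(&self, symbol: Symbol) -> &str {
        &self.strings[symbol.0 as usize]
    }
}

/// Needs no interner access and no string comparison: two integer compares.
pub fn is_path_root(symbol: Symbol) -> bool {
    symbol == sym::CRATE || symbol == sym::SUPER
}
```

The design choice in this sketch is that only conversions between strings and symbols need the interner; checks against known keywords do not.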
I think arena-allocating the AST is the way forward anyway, so I wouldn't mind the lifetime tbh. |
Is it related to #25088? |
@mati865 We can figure out by trying to use a |
While |
I'm asking why a global data structure requires TLS to access it... global data structures and TLS seem entirely orthogonal and incompatible to me. Clearly I'm missing something. What does "multiple instances per process" mean -- instances of what? |
Rustdoc will use rustc_driver and a set of other APIs to essentially attempt to call rustc as if it was a function. That spawns a thread (or more, with the parallel compiler enabled); each of those threads receives its own copy of these proto-globals; that means that they aren't necessarily global in the standard sense -- more so rustc-local. |
@nnethercote All "globals" in rustc are "thread-local globals" - as in, they're "global" in the sense of "accessible from a function with no arguments" but scoped to a thread. And by "rustc supports multiple instances" I meant "multiple instances of itself", i.e. multiple |
cc #59749 (Measure upper limit for performance of 32 bit The same thing can be measured for the symbol interner as well, I guess, to estimate the impact. |
So "thread" doesn't actually mean OS thread, but a rustc invocation that contains one or more OS threads, depending on whether rustc is serial or parallel. And These names are... well... I now feel more justified about my prior confusion. I've seen the word "session" used in the code, does that match "rustc invocation" as I've used it above? I still don't understand how, in a parallel rustc, multiple OS threads can access the same TLS. Does each OS thread end up with a reference to the single mutex-protected quasi-global? How important is the ability to run multiple rustc invocations? @eddyb said it's used for "rustdoc uses that to compile doc tests". Is it used for anything else? |
The threads do correspond to OS threads. However, my understanding is that Yes, sessions are rustc "invocation" specific.
Yes, the TLS just contains a pointer to the actual "global."
My understanding is that doc tests would be considerably slower if we didn't have this in-process multi-invocationy style of building tests. I don't think it's used for anything else, necessarily, beyond perhaps unit tests in a few compiler tests. I think historically the scoped TLS in the compiler has been used as an implicit context for things like Span, TyCtxt, etc. where there's some associated state that we don't currently thread through manually. I think it's possible that over time we could migrate away from TLS and towards other methods of threading the state through (and/or true globals via e.g. lazy_static) but I am unsure if that's feasible. I think historically it's not really been viable to completely remove (we use it too much, and it may be better than the alternative). |
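The "TLS just contains a pointer to the actual global" arrangement can be illustrated with a small stand-in that uses only the standard library. This is a sketch under assumptions, not rustc's implementation: `Globals`, `with_globals`, and `intern` are hypothetical names, and a `thread_local!` slot holding an `Rc` replaces whatever scoped-TLS machinery the compiler actually uses.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical per-invocation state; a stand-in for the real globals.
pub struct Globals {
    pub symbol_interner: RefCell<Vec<String>>,
}

thread_local! {
    // Each OS thread has its own slot. The slot holds only a pointer
    // to the state of the invocation currently running on that thread.
    static GLOBALS: RefCell<Option<Rc<Globals>>> = RefCell::new(None);
}

/// Runs `f` with a fresh set of globals installed for the current thread.
/// The previous pointer is restored afterwards (not on panic, for brevity).
pub fn with_globals<R>(f: impl FnOnce() -> R) -> R {
    let fresh = Rc::new(Globals { symbol_interner: RefCell::new(Vec::new()) });
    let prev = GLOBALS.with(|slot| slot.borrow_mut().replace(fresh));
    let result = f();
    GLOBALS.with(|slot| *slot.borrow_mut() = prev);
    result
}

/// "Global" in the sense of taking no context argument, but every call
/// pays for a TLS lookup before it can reach the interner.
pub fn intern(s: &str) -> u32 {
    let globals = GLOBALS
        .with(|slot| slot.borrow().clone())
        .expect("intern called outside of with_globals");
    let mut strings = globals.symbol_interner.borrow_mut();
    if let Some(i) = strings.iter().position(|x| x == s) {
        return i as u32;
    }
    strings.push(s.to_owned());
    (strings.len() - 1) as u32
}
```

Two invocations on separate threads each start from an empty interner here, which is the "rustc-local rather than process-global" behavior described above, and no locks are needed because `RefCell` suffices within one thread.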
We certainly do not consider "true globals" a reasonable limitation for "rustc as a library" (not to mention they'd need locks in cases where today we can use Cell/RefCell), and likely RLS would be impacted too (at least before we add multi-crate sessions to rustc). Ideally we'd move to some language-integrated "implicit contexts" but that is nowhere near on the horizon. |
A problem with the current I tried changing I got it working, but unfortunately it was a clear slowdown of a few percent. |
I don't get which part of your change made this slow. Since it no longer uses TLS, isn't it supposed to be faster? Or did the indirect string length make things slower? Random note: if we're still struggling with 64-bit pointers, maybe it's time for a Java-like 32-bit mode. |
I improved things and now the performance is roughly the same -- some workloads are slightly better, some are slightly worse. A Cachegrind diff suggests that the ones that are worse are mostly because |
I don't think this is going down the right route, IMO, we should be making TLS cheaper (or relying on explicitly passed contexts where possible), rather than adding pointers everywhere. @nnethercote Doesn't making it global cross-thread mean you now pay for locking where there was none before? I don't think we should be using Tearing down a compiler instance in a process should not leave around leaked garbage, IMO (and keep in mind that, at least for a while longer, gensyms create fresh symbols, so something like RLS would just keep leaking memory). |
As I understand, the locking will be there anyway once
Hmm, looks like it's not possible to link to a Discord message, so I'll copypaste:
So, if we provide some minimal garbage collection interface, RLS will be able to avoid leaks.
The const eval cannot create an empty |
That works with indices, and |
I had an idea, I tried it, it didn't improve things, and I'm just reporting my experience in case it's interesting or helpful to others. Don't worry, I haven't filed a PR. But I don't like the current interner design. I think the |
I wouldn't personally mind an IMO Rust could have properly statically checked implicit contexts with efficient access (i.e. not relying on TLS at all), but we're nowhere near a design for that, so we make do with what we've got. |
I should mention, as a data point, that the It was a compromise and maybe we need to keep it at that, relying on whatever contexts we have on hand instead, for most operations. |
To expand on what @eddyb said, here's an alternative design.
This is already how This would give a combination of speed and convenience. |
The following PRs have reduced |
Here are some updated numbers, compared to the old numbers from the first comment above.
The I got some nice wins from the abovementioned PRs, but nothing as big as #59693. I estimated earlier that "approximately half the speedup is from avoiding TLS lookups", but in hindsight I think that's an overestimate. The |
Is this still an issue now that we use |
I don't see TLS much in profiles any more. Whether or not that is because of #78201 I don't know. But I think this issue can be closed. |
#59693 is a nice speed-up for rustc, reducing instruction counts by as much as 12%. #59693 (comment) shows that approximately half the speedup is from avoiding TLS lookups.
So I thought: what else is using TLS lookups? I did some profiling and found that syntax_pos::GLOBALS accounts for most of it. It has three pieces, symbol_interner, hygiene_data, span_interner. I did some profiling of the places where they are accessed via GLOBALS::with:
These measurements are from a rustc that didn't have #59693's change applied, which avoids almost all of the span_interner accesses. And those accesses were only 11.0-24.8% of the syntax_pos::GLOBALS accesses. In other words, if we could eliminate most or all of the hygiene_data and symbol_interner accesses, we'd get even bigger wins than what we saw in #59693.
I admit that I don't understand how syntax_pos::GLOBALS works, why the TLS reference is needed for a global value.
One possible idea is to increase the size of Symbol from 4 bytes to 8 bytes, and then store short symbols (7 bytes or less) inline. Some preliminary profiling suggests this could capture roughly half of the symbols.
hygiene_data is a harder nut to crack, being a more complicated structure.
cc @rust-lang/wg-compiler-performance
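The idea of widening Symbol from 4 to 8 bytes and storing short strings inline could look roughly like the following. This is only a sketch of one possible encoding, and the layout is an assumption: the top byte carries a tag bit plus the length, the low seven bytes carry the string, and anything longer falls back to an interner index supplied by a hypothetical `intern_slow` callback.

```rust
/// 8-byte symbol: either up to 7 bytes stored inline, or an interner index.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub struct Symbol(u64);

/// Tag bit in the top byte marking an inline symbol (assumed layout).
const INLINE_TAG: u8 = 0x80;

impl Symbol {
    /// Builds a symbol. `intern_slow` stands in for the interner lookup
    /// (the path that needs TLS) and is only called for long strings.
    pub fn new(s: &str, intern_slow: impl FnOnce(&str) -> u32) -> Symbol {
        let bytes = s.as_bytes();
        if bytes.len() <= 7 {
            let mut buf = [0u8; 8];
            buf[..bytes.len()].copy_from_slice(bytes);
            // Top byte: tag bit plus the length (0..=7).
            buf[7] = INLINE_TAG | (bytes.len() as u8);
            Symbol(u64::from_le_bytes(buf))
        } else {
            // A u32 index leaves the top byte zero, so the tag is unset.
            Symbol(u64::from(intern_slow(s)))
        }
    }

    pub fn is_inline(self) -> bool {
        (self.0.to_le_bytes()[7] & INLINE_TAG) != 0
    }

    /// Recovers the string without consulting any interner, if inline.
    pub fn inline_string(self) -> Option<String> {
        let buf = self.0.to_le_bytes();
        if (buf[7] & INLINE_TAG) == 0 {
            return None;
        }
        let len = (buf[7] & 0x7f) as usize;
        std::str::from_utf8(&buf[..len]).ok().map(str::to_owned)
    }
}
```

Equal short strings always produce equal bit patterns in this encoding, so symbol comparison stays a single integer comparison for both the inline and the interned case.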