Try to speed up #[derive(ValidGrouping)]
#3163
Conversation
this results in a 30x improvement for a 128-column struct
this results in a few percent improvement
Thanks for opening this PR and doing all this analysis ❤️ While trying to put together a simple example showing where those quadratic expansions come from, it seemed like we can replace the corresponding derive + some other recursive impls as well. See weiznich@61351f7 for the corresponding change. This brings down the time of a … build. Given that success, I've tried to apply the same approach to the recursive tuple impls in diesel/diesel/src/type_impls/tuples.rs (lines 511 to 560 in 0c6219f).

Unfortunately it is not possible to apply the same pattern there. This results in quite large build times (> 13 minutes for me with `cargo check --features "sqlite" --no-default-features`). It's not clear to me why this happens and what's even different there compared to the other traits. All of those tuple impls basically work by forwarding stuff to the next smaller tuple. In theory rustc should be able to resolve that without quadratic behaviour, assuming the corresponding impls are cached.
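The forwarding pattern described above can be sketched in a self-contained way. Note that the trait and impl names here are hypothetical, simplified stand-ins for diesel's tuple impls, just to illustrate the shape: each tuple impl delegates to the impl for the next smaller tuple, so rustc should only need a linear chain of resolutions if caching works as expected.

```rust
// Simplified model of "each tuple impl forwards to the next smaller tuple".
// `FieldCount` is a hypothetical trait; diesel's real traits carry much more
// machinery, but the recursive structure is the same.
trait FieldCount {
    const COUNT: usize;
}

impl FieldCount for () {
    const COUNT: usize = 0;
}

// Each impl "peels off" the head type and forwards to the tail tuple.
impl<A> FieldCount for (A,) {
    const COUNT: usize = 1 + <() as FieldCount>::COUNT;
}

impl<A, B> FieldCount for (A, B) {
    const COUNT: usize = 1 + <(B,) as FieldCount>::COUNT;
}

impl<A, B, C> FieldCount for (A, B, C) {
    const COUNT: usize = 1 + <(B, C) as FieldCount>::COUNT;
}

fn main() {
    // Resolving the 3-tuple impl recursively resolves the 2-, 1- and 0-tuple
    // impls: a linear chain, assuming each resolution is cached.
    assert_eq!(<(u8, u16, u32) as FieldCount>::COUNT, 3);
    assert_eq!(<() as FieldCount>::COUNT, 0);
}
```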
Great news! I was checking this as part of the initiative to improve compile times, in case it would be harder to remove these tuple wrappers and the associated derives than to optimize the macro. Your change will help. I was able to reproduce your results on the first commit: it indeed removes most of the proc-macro expansion step, as well as a chunk of regular macro expansion, and hits your other issue #81263 much less than before. For the second commit, like you mention, it's very slow, but 99% of the compile time is due to another rustc issue, about the specialization graph. I've seen it in a few other crates: it's something quadratic in coherence checking IIRC, that starts becoming noticeable as bigger Ns weigh heavy in an O(N²) algorithm, e.g. in bigger crates. I wonder if your other MCVEs hit the same issue: it would be great to have a smaller-scale reproduction for this issue, which is hard to reduce among all the unrelated code that big crates have (along with the actual code triggering the slowness). If your other issues' reproduction examples do hit it, it would be super helpful for people to even begin analyzing the nonlinearity in coherence.
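A back-of-the-envelope model of the quadratic coherence cost mentioned above (purely illustrative arithmetic, not rustc's actual algorithm): if overlap checking has to compare impls pairwise, the number of candidate pairs grows as O(N²) in the impl count, which matches how the slowness only becomes noticeable at larger tuple sizes.

```rust
// Rough model: pairwise overlap checking over N impls considers
// N * (N - 1) / 2 candidate pairs, i.e. quadratic growth.
fn candidate_pairs(n: usize) -> usize {
    n * (n - 1) / 2
}

fn main() {
    // 64 impls -> 2016 pairs; 128 impls -> 8128 pairs.
    assert_eq!(candidate_pairs(64), 2016);
    assert_eq!(candidate_pairs(128), 8128);
    // Doubling N roughly quadruples the work.
    assert!(candidate_pairs(128) / candidate_pairs(64) >= 4);
}
```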
In case it helps, here's a breakdown of the traits whose tuple impls trigger the specialization graph slowness (until I have more info on what exactly is slow there, as it's doing a bunch of things at the same time, like setting up specialization/coherence/overlap):

(trait breakdown omitted)

For reproduction and minimization purposes, it looks like …
Yes, I expect that no one outside of diesel is using this derive with more than a handful of type parameters. After all, that derive is designed to be used with hand-written query DSL extensions, and you don't want to maintain a list of dozens of type parameters there manually.
How do I verify that the example given as part of rust-lang/rust#75542 hits the same issues? I would assume that it does, as it is a quite literal extraction of the relevant code from diesel. The example is fairly self-contained, as it only contains 5 traits + 3 structs. Additionally I can add that removing the generic impl

```rust
impl<T> IntoNullable for T
where
    T: SqlType<IsNull = is_nullable::NotNull> + SingleValue,
{
    type Nullable = Nullable<T>;
}
```

I've tried to put … That all written: I'm happy to help with whatever code/knowledge from diesel's side is required to fix this issue.
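For context, the shape of that blanket impl can be reproduced in a minimal self-contained sketch (trait and type names mirror the snippet above, but are heavily simplified stand-ins, not diesel's real definitions): a single generic impl covers every non-null `SqlType`, which is exactly the kind of impl the specialization graph must check for overlap against all the concrete impls.

```rust
use std::marker::PhantomData;

// Simplified stand-ins for diesel's marker types and traits.
struct NotNull;
struct Nullable<T>(PhantomData<T>);

trait SqlType {
    type IsNull;
}
trait SingleValue {}

trait IntoNullable {
    type Nullable;
}

// The blanket impl: one generic impl covering every matching type.
impl<T> IntoNullable for T
where
    T: SqlType<IsNull = NotNull> + SingleValue,
{
    type Nullable = Nullable<T>;
}

// One example concrete SQL type.
struct Integer;
impl SqlType for Integer {
    type IsNull = NotNull;
}
impl SingleValue for Integer {}

fn main() {
    // `Integer` gets `IntoNullable` for free via the blanket impl,
    // and its associated type is `Nullable<Integer>`.
    let name = std::any::type_name::<<Integer as IntoNullable>::Nullable>();
    assert!(name.contains("Nullable"));
    println!("{name}");
}
```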
Great, so we can probably just close this PR then; it's not going to land. #3167 has landed instead, and it's better progress for what I was initially looking at. A check build for 128-column tables is now reduced to 7 minutes, down from 10-11 minutes.
Unfortunately, I don't think one easily can. From the high-level self-profiling output, I was expecting this to be a coherence issue: they're both showing up as heavy `specialization_graph_of` activity. I have since been looking into this part of rustc more, timing the different parts of specialization/coherence/trait selection, and I'm not exactly sure it's the same issue (the one I was thinking of looks older than this one). It's pretty close: it's still called in the same rustc query and in coherence, but then it is trying to prove predicates for negative impls or something, and then it becomes slow in the trait system for some as-yet-unknown reason. It seems to me like it could be related to the exponential blowup that was seen when some of the trait system cache was removed to fix incremental compilation issues. But that happened in 1.57 IIRC, and it looks like the regression we're seeing in rust-lang/rust#75542 was actually introduced in 1.56:

I have not yet managed to find which nightly or PR introduced this issue, but will try to do so. It should help show the exact part of rustc that is affected by this example's structure. In the meantime, I have also been trying to talk to the experts in that area of the compiler, to see if we can have a discussion, try to understand what's going on in this example, and see if/how it can be fixed. I've also played with your example, and I believe it's good enough for that purpose: you have done the hard minimization work already. Thanks for that. As you mentioned, I had also seen that its weird behavior is kind of fragile, and even small changes to the repro make it compile fast. I will continue this investigation, and discuss with the trait-system, caching, and incremental-query-system experts.
Thanks for looking into this ❤️
I would like to add here that the original rustc issue predates 1.55; it is from nearly 2 years back. So be prepared for another underlying issue there. This likely does not show up immediately, as the minimal example only sets 16 columns as the default (in line 861). Increasing that to 32 columns did lead to a compile time "explosion" even before the recent incremental compilation issue fixes.
Yes indeed, I've noticed this as well. There are multiple steps of regressions over time, unfortunately, as well as multiple issues. And they affect … For the 16-column case above, it was easy-ish: we can see the 30x increase, so we can look for the version that makes this query go over 1s. I've bisected the 1.56 regression this way to rust-lang/rust#89125, for example.
While looking at rust-lang/rust#81262, I also tried looking at the master branch to see how things had evolved since this issue was opened, in particular since the use of proc_macros.

While testing 32/64/128 columns, I noticed that in `diesel` itself, most of the proc_macro slowness seemed to come from the `TupleWrapper`s and in particular from `#[derive(ValidGrouping)]`. This proc_macro roughly takes one second to execute each time from 100 to 128 columns. I believe it could be interesting for compilation times to have a structure that is less expansion-heavy, especially the `MixedAggregates` chain causing a quadratic amount of expansion.

Since I'm not familiar enough with `diesel` and its aggregates to successfully come up with such a design, I wanted to notify you about this finding, while trying to help in the meantime. This PR reduces the constant factors of this proc_macro (while still retaining its quadratic nature), so maybe it can help a bit, in that it could be "fast enough", and that new slowness and opportunities for improvement are to be found elsewhere.

It seems the mixed aggregates chain, and the use of `quote` for short quoting/unquoting concatenation, build up big token streams that are slower and slower to parse and reparse at each step. In this PR, I've tried to build up the code in a string buffer, to limit the number of times it is parsed with `quote`. It's not ideal, and probably a bit more error-prone, but I don't think `quote` was made to handle this use-case quickly (even after the recent performance improvements).

In a benchmark with a single use of the derive per column-count, I see these results for a non-incremental `cargo check` of the benchmark, from 1 to 128 columns:

(benchmark chart omitted)

Proc_macro expansion itself is 30x faster at 128 columns (from 1s to 35ms or so) for this one macro invocation, and as seen in the above chart, total compilation of this simple stress test is also divided by a bit over 2 (it then also shows other bottlenecks in rustc).

That also translates to `diesel` itself; proc_macro expansion is reduced, for a `cargo check`:

- `32-column-tables`: from 1s to 280ms
- `64-column-tables`: from 6s to 600ms
- `128-column-tables`: from 45s to 2s

End-to-end compile times are reduced a tiny bit for the heavier cases: these -40s are somewhat noticeable, but the `cargo check` itself is around 10 minutes long, so that's only a 4-5% improvement there.

I'm opening this PR as a draft for feedback, to see if the approach is acceptable, and to see if CI likes it (as it seems likely that I forgot some use-cases with different forms of input).
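The string-buffer idea described above can be sketched roughly as follows. This is a hypothetical, simplified illustration: the names (`ValidGrouping`, `Col{i}`, `IsAggregate`) stand in for the real generated items, and the real derive emits token streams rather than plain strings; the point is only that appending to one buffer grows linearly, instead of re-parsing ever-larger token streams at each concatenation step.

```rust
use std::fmt::Write;

// Sketch: emit one impl per column into a single String buffer.
// Each `writeln!` appends in O(len of one impl), so building N impls is
// linear work, and the result would be parsed by the compiler only once.
fn generate_impls(columns: &[&str]) -> String {
    let mut out = String::new();
    for (i, col) in columns.iter().enumerate() {
        writeln!(
            out,
            "impl ValidGrouping<Col{i}> for {col} {{ type IsAggregate = No; }}"
        )
        .expect("writing to a String cannot fail");
    }
    out
}

fn main() {
    let code = generate_impls(&["id", "name"]);
    assert!(code.contains("impl ValidGrouping<Col0> for id"));
    assert!(code.contains("impl ValidGrouping<Col1> for name"));
    print!("{code}");
}
```

By contrast, nesting `quote!` invocations so that each step interpolates the previous step's output makes every token travel through each later concatenation, which is where the quadratic constant factors come from.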