perf regression between nightly 2023-06-28 and 2023-07-01 #113372

Closed
Dirbaio opened this issue Jul 5, 2023 · 25 comments · Fixed by #114948
Labels
I-compiletime (Issue: Problems and improvements with respect to compile times.)
requires-nightly (This issue requires a nightly compiler in some way.)

@Dirbaio
Contributor

Dirbaio commented Jul 5, 2023

There's some performance regression between nightly-2023-06-28 and nightly-2023-07-01.

Compilation times for my company's firmware project went from 13s on nightly-2023-06-28 to 1m 43s on nightly-2023-07-01. This is for "change a single line, then build", not for a cold build.

The project is a single rather big crate making extensive use of async and AFIT, and TAIT through the #[embassy_executor::task] macro to place futures in statics. Unfortunately it's not open source. I also haven't been able to isolate the regression to a single bad pattern of code causing it. It seems all async code contributes to the slowness: as I remove code compilation gets faster, and it seems linearly proportional.
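For context, the usage pattern in question looks roughly like this (a hedged sketch assuming an embassy-executor dependency with the appropriate executor features; the exact macro expansion is an embassy implementation detail, but the effect is that the task's concrete, unboxed future type ends up in a static):

```rust
use embassy_executor::Spawner;

// Ordinary embassy task: the attribute macro generates static storage for
// this function's future, so the full nested future type (rather than a
// boxed/erased one) is what the compiler works with.
#[embassy_executor::task]
async fn net_task() {
    loop {
        handle_request().await; // every nested .await adds a layer
    }
}

async fn handle_request() { /* more nested async calls below here */ }

#[embassy_executor::main]
async fn main(spawner: Spawner) {
    spawner.spawn(net_task()).unwrap();
}
```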

I've taken flamegraphs with -Zself-profile. The extra time is spent in codegen_module -> fn_abi_of_instance -> deduced_param_attrs -> is_freeze_raw -> evaluate_obligation

nightly-2023-06-28, 13s: flamegraph
nightly-2023-07-01, 1m 43s: flamegraph

I do have one open-source project which also seems affected by the regression, but not as egregiously. To reproduce, cd examples/nrf; cargo build --release.

3 attempts of "change 1 line then build":
nightly-2023-06-28: 2.51s 2.85s 3.51s. flamegraph
nightly-2023-07-01: 4.22s 4.10s 4.09s. flamegraph

fn_abi_of_instance takes 14% of the time in the new nightly, but almost none in the old one. So it seems it's the same issue.

Also, unfortunately, nightly-2023-06-29 and nightly-2023-06-30 have an unrelated (?) bug which makes compilation fail with bogus "impl ?Sized is not a Future" errors: #113155 (introduced in #98867, fixed in #113165). I don't think that's the cause of the perf regression, but it prevents bisecting it (the bisect just points me to #98867; scripting it to fail only on timeout would probably point to #113165, though I haven't tried).

@compiler-errors
Member

It's probably #113108.

Do you have really deeply nested futures? It's probably that.

@Dirbaio
Contributor Author

Dirbaio commented Jul 5, 2023

Do you have really deeply nested futures?

Yes, I do.

@compiler-errors
Member

compiler-errors commented Jul 5, 2023

Even if it is #113108, I don't really know what I can do about this situation without at least being able to see some shared code.

For the record -- that PR doesn't just introduce a regression for no good reason, it fixes real codegen errors that are due to the fact that we weren't actually revealing opaque types during codegen even though we should¹. The problem is that you've probably got extremely deep futures that were hiding behind the (accidental) type erasure of opaque types, and now that we're revealing them, we're doing more work when checking auto traits like Freeze.

But that's just my conjecture for the cause of the regression.

Footnotes

  1. In specific situations that are unfortunately easier to encounter with futures, especially ones that borrow lifetimes
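Reading that explanation, the relevant code shape is roughly the following (my illustration of the description above, not the compiler's actual internals): callers only name an opaque `impl Future`, but its hidden type is the whole nested state machine, and per the explanation above, that hidden type is what auto-trait checks like Freeze now see during codegen.

```rust
use core::future::Future;

async fn inner() {
    // imagine many more nested .awaits below this point
}

// To callers this is just an opaque `impl Future`, but the hidden type
// behind it contains inner()'s future, which contains its callees' futures,
// and so on; checking an auto trait on it means walking that whole tree.
fn make() -> impl Future<Output = ()> {
    async {
        inner().await;
    }
}
```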

@compiler-errors
Member

compiler-errors commented Jul 5, 2023

You may be able to mitigate this regression by introducing boxing in cases where you know that the futures are gonna be super deep, though it's sometimes kinda random where you'd have to introduce it.

You're probably working on embedded stuff anyways, so nevermind on the boxing lol
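For completeness, the boxing mitigation mentioned above would look roughly like this in code that can allocate (a sketch assuming std/alloc is available, which it isn't in this embedded project):

```rust
use std::future::Future;
use std::pin::Pin;

async fn deep_chain() {
    // many levels of nested .awaits below this point
}

async fn caller() {
    // Boxing erases the concrete future type here, so the nesting stops
    // accumulating upward past this point.
    let fut: Pin<Box<dyn Future<Output = ()>>> = Box::pin(deep_chain());
    fut.await;
}
```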

@Dirbaio
Contributor Author

Dirbaio commented Jul 5, 2023

Is the ekv repro I've linked above not big enough? I've tried to cook up some artificial example that bloats compile times even more, unsuccessfully so far. I'll keep trying.

And yeah I can't Box :(

@compiler-errors
Member

Oh, sorry, didn't see the link. Just saw "company code" and stopped reading 😓

@compiler-errors
Member

Hm.... I don't know if I can reasonably debug a 2s -> 4s compile time regression. I really need code that's doing significantly more work. If you can find a more dramatic regression than that, I would really appreciate it.

Dirbaio added a commit to embassy-rs/embassy that referenced this issue Jul 5, 2023
Newer nightlies have a bad perf regression rust-lang/rust#113372
Dirbaio added a commit to embassy-rs/ekv that referenced this issue Jul 5, 2023
@Dirbaio
Contributor Author

Dirbaio commented Jul 5, 2023

Okay, I found a pattern of death: embassy-rs/ekv@53918ac

with nightly-2023-06-28 it builds fine in 2.5s.
with nightly-2023-07-01 it hangs and eats all my RAM.

@compiler-errors
Member

Interesting. I'll look at that PR and see if there's anything that can be done to optimize the post-mono trait solving to be less unfriendly to large futures, but I will note that creating arbitrarily large futures eventually will cause the compiler to slow down...

What I'd really like to understand is how a real-world program ends up with such large futures that this regression becomes meaningful -- but I understand how that may be hard to summarize without sharing too much info about private code 😶

@Dirbaio
Contributor Author

Dirbaio commented Jul 5, 2023

I don't think I'm doing anything particularly cursed, just writing async code 😅

The deepest future I have in my firmware is this I think (removing the code for it made the biggest difference in the compile time):

  • top level task with a loop handling incoming network requests
  • A request header containing a "request ID" is parsed from the socket, and a match on it dispatches to the right handler.
  • One of the request IDs is a "config update", which requires...
  • Parsing a stream of key+value pairs from the socket and writing them to ekv.
  • ekv itself has quite a few layers with complex logic (transaction, record, file, page, then a Flash async trait; you can see that in the repo)
  • flash access goes through a Mutex because other parts of the system can use other flash partitions concurrently
  • flash is implemented using more async in the HAL

so yeah that can easily be 30 nested futures 😓.

It's not particularly cursed, though! It's straightforward code that you could also write as blocking code, just with async and .await in the right places. I don't go out of my way to write things as state machines manually and then wrap them with async to reduce future depth. I don't want to; I'm using async precisely so I don't have to do that.
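A self-contained caricature of that stack (all names made up; the real layers are the socket parsing, ekv's transaction/record/file/page layers, the flash mutex, and the HAL) shows how unremarkable code reaches that depth:

```rust
// Each layer is ordinary async code, but every .await in the chain adds one
// level of nesting to the top-level task's future type.
async fn flash_write(_page: u32) { /* HAL-level async I/O */ }
async fn page_write(page: u32) { flash_write(page).await }
async fn file_write(page: u32) { page_write(page).await }
async fn record_write(page: u32) { file_write(page).await }
async fn transaction_write(page: u32) { record_write(page).await }
async fn config_update() { transaction_write(0).await }
async fn handle_request(id: u32) {
    match id {
        1 => config_update().await, // the "config update" request ID
        _ => {}
    }
}
async fn network_task() {
    loop {
        handle_request(1).await;
    }
}
```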

@compiler-errors
Member

Makes sense. I'll see what I can do about it.

@KittyBorgX KittyBorgX added the perf-regression Performance regression. label Jul 6, 2023
@compiler-errors compiler-errors added I-slow Issue: Problems and improvements with respect to performance of generated code. and removed perf-regression Performance regression. labels Jul 6, 2023
@compiler-errors
Member

@KittyBorgX: that tag is usually used to track PRs that are perf regressions with the internal perf testing suite (rust-timer). We have another tag for issues like this: I-slow.

@lqd
Member

lqd commented Jul 6, 2023

@compiler-errors I-compiletime (I-slow is for performance of generated code)

@compiler-errors compiler-errors added I-compiletime Issue: Problems and improvements with respect to compile times. and removed I-slow Issue: Problems and improvements with respect to performance of generated code. labels Jul 6, 2023
@compiler-errors
Member

compiler-errors commented Jul 6, 2023

Corrected! Should've read the rest of the tag description 😸

@KittyBorgX
Member

@KittyBorgX: that tag is usually used to track PRs that are perf regressions with the internal perf testing suite (rust-timer). We have another tag for issues like this: I-slow.

Ah my bad 😅 I'm still new to this and I'm learning haha :)

@compiler-errors compiler-errors added the requires-nightly This issue requires a nightly compiler in some way. label Jul 10, 2023
@compiler-errors
Member

@Dirbaio: I haven't been able to get around to this yet, but since y'all are using AFIT, I'm not gonna consider this a regression that needs to be reverted on beta or anything like that since you're presumably just pinned to the nightly right before this regression.

If I haven't gotten to it in another week or so, please ping me so I can reconsider reverting the PR that caused this regression; I'm just trying to make sure I understand the issue and can set a reasonable timeline for a fix.

@Dirbaio
Contributor Author

Dirbaio commented Jul 10, 2023

Thanks for the update. If there's anything I can do to help please let me know. For example, would it help if I minimized it to a self-contained .rs file using no libs?

@compiler-errors
Member

That would certainly be cool if you could.


@compiler-errors
Member

@Dirbaio I'm free enough to look at this again -- how do I reproduce that ekv example? Just a cargo build? I don't see a hang, but maybe I'm missing a feature flag or something?

@Dirbaio
Contributor Author

Dirbaio commented Aug 1, 2023

git clone https://github.com/embassy-rs/ekv --branch perf-regression-repro
cd ekv/examples/nrf
cargo +nightly build --release

@compiler-errors
Member

ah, nrf is what i needed to be building 🫠

@Dirbaio
Contributor Author

Dirbaio commented Aug 17, 2023

Minimized: https://github.com/Dirbaio/perf-regression-repro/blob/main/src/main.rs

A lot of stuff needs to happen for it to reproduce 😅

  • lto=fat and opt-level=s
  • Future must be stored in a static (I guess this is what triggers the check that it is Freeze)
  • Future must be initialized via a function taking a closure
  • Needs lots of nested futures (this is what causes exponential blowup)
  • Needs a generic async fn
  • The async fn needs to call an async method on the trait.
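
A hedged approximation of those ingredients (illustrative names only; the real code is the linked main.rs, and the lto/opt-level settings above are also needed to trigger the blowup):

```rust
// Crate root of a nightly-2023 build; AFIT was not stable yet.
#![feature(async_fn_in_trait)]

trait Flash {
    async fn read(&mut self);
}

struct Dummy;
impl Flash for Dummy {
    async fn read(&mut self) {}
}

// A generic async fn calling an async method on the trait.
async fn io<F: Flash>(f: &mut F) {
    f.read().await;
}

// Deeply nested "mainNN" layers; each extra level slows compilation further.
async fn main0() { io(&mut Dummy).await }
async fn main1() { main0().await }
async fn main2() { main1().await }
// ... up to mainNN in the real repro

// The future is produced through a function taking a closure, which is the
// part that turned out to matter (see the EDIT below).
fn init<F: core::future::Future>(f: impl FnOnce() -> F) -> F {
    f()
}

fn build() -> impl core::future::Future<Output = ()> {
    init(|| main2())
}
```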

Each additional mainNN() level makes it exponentially slower, until it runs out of memory (after eating 20 GB+ of RAM):

30: 0m6.349s
31: 0m10.144s
32: 0m16.839s
33: 0m23.656s
34: 0m43.653s
35: 1m15.742s
36: 1m50.963s
37: killed by OOM

EDIT: it turns out putting the future in a static is not needed; it's the closure that causes the hang.

Also, I've found that this repro does hang with nightly-2023-06-28 and older, so it's not a repro of the regression, it's just a repro of a pre-existing hang.

The original non-minimized repro has fewer nesting levels (20) but more complex code inside; it does hang on 2023-07-01 but works on 2023-06-28 and earlier. There's something about ekv that makes newer nightlies choke harder, but I haven't been able to minimize it yet...

@Dirbaio
Contributor Author

Dirbaio commented Aug 17, 2023

I can confirm #114948 fixes the issue! It fixes all the hang repros found, and also the regression in the project I originally found it in (company code).

"change one line" build times on company code:

  • nightly-2023-06-28: 17s-20s
  • nightly-2023-08-16: 1m 43s
  • PR #114948: 22s-24s (EDIT: not entirely apples-to-apples because, unlike the published nightlies, I haven't built with PGO+LTO+BOLT)

@Dirbaio
Contributor Author

Dirbaio commented Aug 18, 2023

🥳

thank you @compiler-errors !
