Make inference profiling thread safe for julia 1.9+ #48051

Closed
wants to merge 2 commits

Conversation

@NHDaly (Member) commented Dec 30, 2022

Starting in 1.9, Julia does not hold the codegen lock for all of type inference! This means that we cannot use a single shared scratch space for profiling type inference.

This PR changes the scratch space to use Task Local Storage, so that we build a separate inference profile tree per Task doing inference, and then report them to the global results buffer at the end.

This makes things thread safe as long as type inference itself does not spawn Tasks, or perform compilation across concurrent Tasks. If that changes in the future, we will need to change the entire inference profiling model, since it will no longer be a single depth-first inference tree per invocation of type inference. (CC: @pchintalapudi / @vchuravy)
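Roughly, the new pattern looks like the following self-contained sketch. This is not the actual compiler code: the `InferenceTimingNode` type and the `enter_inference_frame!` / `exit_inference_frame!` / `report_finished_tree!` helpers are illustrative names I'm using here, while `task_local_storage`, `time_ns`, `ReentrantLock`, and `lock` are real Base APIs.

```julia
# Sketch of the task-local scratch-space pattern (illustrative names only).

# One node per inference frame; nodes form a depth-first tree per Task.
mutable struct InferenceTimingNode
    name::Symbol
    start_ns::UInt64
    stop_ns::UInt64
    children::Vector{InferenceTimingNode}
end

InferenceTimingNode(name::Symbol) =
    InferenceTimingNode(name, time_ns(), UInt64(0), InferenceTimingNode[])

# Global results buffer, shared across Tasks, protected by a lock that is
# only taken when a finished tree is pushed at the very end.
const _timing_results = InferenceTimingNode[]
const _timing_results_lock = ReentrantLock()

function report_finished_tree!(root::InferenceTimingNode)
    lock(_timing_results_lock) do
        push!(_timing_results, root)
    end
    return nothing
end

# Per-Task scratch space: a stack of in-progress nodes kept in task-local
# storage, so concurrent Tasks never share mutable scratch state.
function _profile_stack()
    return get!(task_local_storage(), :inference_profile_stack) do
        InferenceTimingNode[]
    end::Vector{InferenceTimingNode}
end

function enter_inference_frame!(name::Symbol)
    stack = _profile_stack()
    node = InferenceTimingNode(name)
    isempty(stack) || push!(stack[end].children, node)  # attach to the parent frame
    push!(stack, node)
    return node
end

function exit_inference_frame!()
    stack = _profile_stack()
    node = pop!(stack)
    node.stop_ns = time_ns()
    # The outermost frame just finished: report this Task's whole tree.
    isempty(stack) && report_finished_tree!(node)
    return node
end
```

The important property is that only `report_finished_tree!` touches shared state, and it runs once per completed tree.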

This PR is based on top of #47615.

Until that PR merges, you can view a diff of the changes here:
vilterp/julia@pv-compiler-timings-lock...JuliaLang:nhd-inference-profiling-thread-safe-1.9

@vchuravy (Member) commented:

> This makes things thread safe as long as type inference itself does not spawn Tasks, or perform compilation across concurrent Tasks. If that changes in the future, we will need to change the entire inference profiling model, since it will no longer be a single depth-first inference tree per invocation of type inference.

Inference itself does not spawn tasks, but the goal is indeed to allow for concurrent inference, so yes, it is much harder to create a single inference profile.

@NHDaly (Member, Author) commented Dec 31, 2022

Mmm good to know! That's exciting! :)

But yeah, it will mean we'll need to look at redesigning the inference profiling. 👍
In #47615 we're already changing the shape of the inference profiling results into a vector of inference trees, rather than a single inference tree rooted at a fake ROOT node, to better reflect what's actually happening: several invocations of type inference. So maybe it won't be too hard to redesign this so that even a single invocation of type inference is itself a collection of several type inference trees spread over concurrent tasks? Or possibly some other design. I'm looking forward to thinking that through with you all! 😊

But in the meantime, I guess this PR should be enough to make the profiling in 1.9 thread-safe, so we should merge this and backport it. Then we can start talking about what should be done for the new concurrent type inference.

Ideally, the profiling changes would be done together with the compilation changes themselves that you're talking about.

@NHDaly added the backport 1.9 label (Change should be backported to release-1.9) Dec 31, 2022
@KristofferC mentioned this pull request Jan 2, 2023
@pchintalapudi (Member) commented:

Concurrent inference is probably already occurring; the type inference lock has been removed for some time, so inference results from one thread might get used by another thread depending on when the cache is updated.

@NHDaly (Member, Author) commented Jan 4, 2023

Yeah, but we're still on 1.8 at RAI, so we probably haven't noticed this issue. When you say "for some time," you mean on 1.9+, yeah? That's why I filed this issue. I agree that without the inference lock, 1.9+ could be racing. 👍

Or are you saying that even before 1.9, there can be concurrent inference?

@pchintalapudi (Member) commented:

Only in 1.9 will there be concurrent inference, but that means the inference profiler probably has to be redesigned for reporting concurrent inference in 1.9. I doubt we'd want to re-serialize inference or block the 1.9 release on the profiler being redesigned, so it probably makes sense to note this limitation somewhere in the release documentation.

@NHDaly (Member, Author) commented Jan 4, 2023

Okay yeah, cool! :)

Well, I think this PR is already enough to correctly support concurrent inference in 1.9! That's what it's meant to do, anyway. The only change is switching away from a global working stack and onto a working stack kept in task-local storage. I think that's enough to support concurrent inference. The final inference results vector is already thread safe as of #47615, so it can be pushed to from multiple concurrent tasks. (And they only push right at the end, so I don't think this would have a serious impact on the concurrency. It shouldn't serialize the inference; it would only grab a lock for the very tail end of an entire inference tree, pushing the single pointer onto the vector.)

The diff for this PR is just this one change:
vilterp/julia@pv-compiler-timings-lock...JuliaLang:nhd-inference-profiling-thread-safe-1.9

So I think that with this PR, concurrent inference in 1.9 should be safe! Does that hold up in your understanding as well? Anything I'm missing? Thanks for engaging with me on these PRs! 😊
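As a toy illustration of that claim (reusing the hypothetical sketch from the PR description above, so again not real compiler internals), several Tasks can build their trees fully independently and only contend on the final push:

```julia
# Each spawned Task gets its own task-local working stack; the shared
# results vector is only touched once per Task, when its root frame closes.
@sync for i in 1:4
    Threads.@spawn begin
        enter_inference_frame!(Symbol(:root_, i))
        enter_inference_frame!(:callee)
        exit_inference_frame!()  # closes :callee; no lock taken
        exit_inference_frame!()  # closes the root; pushes the tree under the lock
    end
end
@assert length(_timing_results) == 4  # one completed inference tree per Task
```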

Co-authored-by: Nathan Daly <NHDaly@gmail.com>
base/compiler/typeinfer.jl (outdated review comment, resolved)
@KristofferC mentioned this pull request Jan 17, 2023
@NHDaly force-pushed the nhd-inference-profiling-thread-safe-1.9 branch from e0b6225 to e161d91 on January 18, 2023
@NHDaly (Member, Author) commented Jan 18, 2023

I'm going to close this PR, then, since it is merged into #47615 now.

@NHDaly closed this Jan 18, 2023
@giordano deleted the nhd-inference-profiling-thread-safe-1.9 branch January 19, 2023
@KristofferC removed the backport 1.9 label (Change should be backported to release-1.9) Jan 23, 2023