Close TLS-based soundness holes by running closure on a newly created thread. #3646

Draft · wants to merge 9 commits into main from allow-threads-new-thread
Conversation

@adamreichold (Member) commented Dec 12, 2023

To be done before this is a candidate for merging:

  • add unsafe escape hatch to avoid threading overhead
  • update inline documentation
  • determine if Python thread state should be moved to runtime thread
  • switch unsafe_allow_threads to unsafe token approach
  • decide which method takes the existing allow_threads name
  • switch to builder-like API for choosing strategy
  • add changelog and migration guide entries
  • actually benchmark the performance impact?
  • enable allowing threads even if the GIL is not actually held as discussed here

Closes #2141
Closes #3640
Closes #3658

@adamreichold adamreichold added the CI-skip-changelog Skip checking changelog entry label Dec 12, 2023
@adamreichold adamreichold force-pushed the allow-threads-new-thread branch 7 times, most recently from 3b55a4b to 8df3961 Compare December 12, 2023 19:53
@adamreichold adamreichold force-pushed the allow-threads-new-thread branch 5 times, most recently from 7a1c30f to 924ab8c Compare December 13, 2023 09:37
@davidhewitt (Member) left a comment

Nice! Overall, this looks like a promising approach to avoid the TLS-related issues and the auto trait.

The only downside I can see about this is that it brings a whole new set of thread locals into play, for the worker thread. I think there are interactions here which we'd need to at least document:

  • Callers of allow_threads will have to treat all thread_local values as in an undefined state at the beginning of the execution of the closure. I say "undefined" and not "uninitialized" because with the thread pool, reuse of the threads means that it's possible to observe effects from one closure in another. I wonder if tokio has had prior discussion of this with their spawn_blocking API, which I think works similarly.
  • Inside the closure, if a user then called Python::with_gil, this will create a new Python thread state for the worker thread. Is this expected?
    • Do we make with_gil run back on the original thread which is blocked waiting for the result? (e.g. by using a similar mechanism to Task in reverse?).
    • Or do we just document that this will happen, i.e. put it in the same bucket as all the rest of the "thread locals are undefined" statement of above?

I understand the desire to keep the safe allow_threads name, but I still have a feeling that the semantic change here might mean that it's better to rename and break existing code.

@adamreichold (Member, Author) commented Dec 14, 2023

The only downside I can see about this is that it brings a whole new set of thread locals into play, for the worker thread. I think there are interactions here which we'd need to at least document:

* Callers of `allow_threads` will have to treat all `thread_local` values as in an undefined state at the beginning of the execution of the closure. I say "undefined" and not "uninitialized" because with the thread pool, reuse of the threads means that it's possible to observe effects from one closure in another. I wonder if `tokio` has had prior discussion of this with their `spawn_blocking` API, which I think works similarly.

Indeed. I think we should certainly document this and of course, what I would really want to have from the Rust standard library here would be a function like

fn with_isolated_tls<F, T>(f: F) -> T where F: Send + FnOnce() -> T, T: Send;

which would just wipe the slate clean w.r.t. the system-level TLS implementation, if that is actually possible. Do you have an educated guess on that @steffahn?

I did not find anything directly addressing this in the spawn_blocking documentation. I guess this is because it is more obvious that it uses a thread pool, leaving the reader to figure out the implications. I will try to come up with some useful wording myself and then we can knead it into shape together.

* Inside the closure, if a user then called `Python::with_gil`, this will create a new Python thread state for the worker thread. Is this expected?
  
  * Do we make `with_gil` run back on the original thread which is blocked waiting for the result? (e.g. by using a similar mechanism to `Task` in reverse?).
  * Or do we just document that this will happen, i.e. put it in the same bucket as all the rest of the "thread locals are undefined" statement of above?

I would prefer not to try calling back and forth, mainly to keep with_gil simple, performant and more explainable. Hence, I would indeed work this into the general caveat on thread identity as discussed above.

I understand the desire to keep the safe allow_threads name, but I still have a feeling that the semantic change here might mean that it's better to rename and break existing code.

Still torn, because I think we should keep the safe way the obvious and effortless way. Otherwise, people like me will start looking into the alternatives, see that the unsafe one is more efficient, wonder "Maybe that little bit of unsafe is worth avoiding the HUGE COST of spawning a thread?!" and be tempted by the dark side of Rust.

If the more efficient but unsafe way must be sought out and opted into explicitly, this will more likely happen only when there is an actually diagnosed/measured performance issue. At least for scientific code, for example, I find that unlikely because the closure in allow_threads will usually start/call upon a whole bunch of threads doing something using e.g. Rayon.

It is not something I would like to fight about, but I would be glad if we could get some more input from other maintainers and/or users. (And this stays a draft until we decide one way or the other. I am not trying to decide this by having this PR hijack the name while it is still missing the escape hatch version in any case.)

@adamreichold (Member, Author)

(Will take care of the MSRV build separately... Uff.)

@davidhewitt (Member)

(Will take care of the MSRV build separately... Uff.)

Time to bump to 1.62? 😈

@davidhewitt (Member)

Still torn, because I think we should keep the safe way the obvious and effortless way. Otherwise, people like me will start looking into the alternatives, see that the unsafe one is more efficient, wonder "Maybe that little bit of unsafe is worth avoiding the HUGE COST of spawning a thread?!" and be tempted by the dark side of Rust.

I take your point here; my concern can probably be mitigated by a clear entry in the migration guide.

@davidhewitt (Member)

I would prefer not to try calling back and forth, mainly to keep with_gil simple, performant and more explainable. Hence, I would indeed work this into the general caveat on thread identity as discussed above.

One option which might work is to move the Python thread state into the worker thread using some kind of PyEval_RestoreThread technique. Actually, I'm not sure whether a new thread state is created when with_gil is called inside allow_threads today; maybe my concern here is moot. We should check.

@adamreichold (Member, Author) commented Dec 14, 2023

Indeed. I think we should certainly document this and of course, what I would really want to have from the Rust standard library here would be a function like

fn with_isolated_tls<F, T>(f: F) -> T where F: Send + FnOnce() -> T, T: Send;

which would just wipe the slate clean w.r.t. the system-level TLS implementation, if that is actually possible. Do you have an educated guess on that @steffahn?

Just noticed that this would only close the scoped-tls loophole, not the send-wrapper one, so never mind...

@adamreichold (Member, Author) commented Dec 24, 2023

Just curious if switching from the vec to TLS made any observable change in the benchmark?

Note that I think the main difference here (no chance for contention, better locality, but higher chance of cold starts and more threads overall) will not be reflected by the micro benchmark which measures only the uncontended and warmed-up case and is unable to quantify system-level effects like the cost of more threads overall.

I did rerun the micro benchmark and the word-count benchmarks and in the first case, any changes were lost within the noise of switching to another thread (i.e. basically still 5µs remote versus 30ns local) and in the second case, the difference between local and remote was dwarfed by the cost of Python's ThreadPoolExecutor.

So while I have few numbers to back this up, I do think the improved locality and code simplicity of the TLS variant make it the better choice.

@mejrs (Member) commented Dec 25, 2023

First of all I very much dislike having a custom threadpool in pyo3, let alone one written with unsafe. Personally I don't feel qualified to review it and convince myself there are no soundness holes or gotchas in it - let alone make any changes or adjustments to it myself.

Which is of course the main thing still missing: So do we actually want to do this or not? It does close the known soundness loopholes w.r.t. allow_threads, which I think is a nice place to be and is also what I would want personally. But the cost is high: either the problematic performance profile of safe_allow_threads (the micro benchmark is bad, but the example usage is completely unaffected; we are still missing measurements from more realistic scenarios)

I think we shouldn't do this, see #3640 (comment) for my reasoning. tl;dr: these issues are so niche, hard to accidentally encounter and unlikely to actually "work" that it's not worth the overhead and complexity of this solution. Also, what is more likely: someone hitting the current soundness issue, or an issue with that custom written threadpool?

or the unsafe and weird API of unsafe_allow_threads with safety invariants which are basically uncheckable if unlikely to be broken unintentionally.

unsafe_allow_threads is just bad API design. Its unsafety pollutes the entire program and all of its dependencies. It's hard to check that it is actually sound, and that it remains sound as the user's program grows. To me, it looks like it's just there so that when it does go wrong we can say "well, it's unsafe and you used the unsafe keyword so it's not our fault", which is technically true but doesn't help anyone.

@mejrs (Member) commented Dec 25, 2023

Finally I do want to thank you for all the effort you and the reviewers have put into this. I really appreciate the commitment whether or not we end up merging this 👍

@adamreichold (Member, Author)

First of all I very much dislike having a custom threadpool in pyo3, let alone one written with unsafe.

Note that we are not using a global thread pool any more but rather a sort of "companion thread", i.e. each user thread will have at most one worker thread associated and there is no cross communication between these pairs of user and worker thread which simplifies the concurrency structure significantly.

The unsafety is also limited to implementing "scoped execution" for the closures, i.e. the actual communication with the worker thread is entirely safe including passing the closure and its result between user and worker thread. Hence, I would argue that this is actually somewhat simpler than say the GIL pool which we are currently maintaining.

That said, ...

Also, what is more likely: someone hitting the current soundness issue, or an issue with that custom written threadpool?

... I don't think that us making a mistake in the implementation of dispatching those closures is the main problem here (We will probably make mistakes as elsewhere but we can fix those centrally here.), but rather the change in observable behaviour, e.g. user code which actually relies on thread-local storage to do something useful.

This is why we need to have the three options on the builder for remote/safe, local/unsafe or even unconstrained/unsafe operation, so that code which does rely on TLS can continue to work (of course with the punted responsibility of only using the right kind of TLS).

Furthermore, I'd say that the status quo involving the stable-this-nightly-that Ungil trait with its similarity to core's rightfully unloved Unpin is by no means simple. Finally, we can have a very simple implementation with the same soundness properties, i.e. the first commit which basically changes a single line from f() to thread::scope(|s| s.spawn(f).join().unwrap()) which is completely safe if less efficient.

But in the end, the crux of the issue IMHO is

tl;dr: these issues are so niche, hard to accidentally encounter and unlikely to actually "work" that it's not worth the overhead and complexity of this solution.

i.e. how much do we value soundness in the sense that safe API cannot be abused. Personally, I'd say the whole story of Rust is investing additional engineering effort to allow safe code to be written without having to worry what some other code in the code base or its dependency closure does. So from my perspective, us taking on additional complexity and maintenance effort is yet another step in the basic direction that Rust embodies in the programming language domain.

@davidhewitt (Member)

As per #3640 (comment) I feel quite strongly that we cannot do nothing here, and in particular I don't think we can leave the existing allow_threads as a safe API despite its flaws.

tl;dr: these issues are so niche, hard to accidentally encounter and unlikely to actually "work" that it's not worth the overhead and complexity of this solution.

i.e. how much do we value soundness in the sense that safe API cannot be abused. Personally, I'd say the whole story of Rust is investing additional engineering effort to allow safe code to be written without having to worry what some other code in the code base or its dependency closure does. So from my perspective, us taking on additional complexity and maintenance effort is yet another step in the basic direction that Rust embodies in the programming language domain.

I agree with this statement, I think a key part of the point of our work here is to build a watertight solution for Rust/Python interop, so that downstream code can worry about their application choices without worrying about whether our implementation is sound.

unsafe_allow_threads is just bad api design. Its unsafety pollutes the entire program and all of its dependencies. It's hard to check that it is actually sound, and that it remains sound as the user's program grows.

While I agree with this too, I also think that the existing allow_threads is even worse: it carries the same unsoundness properties. If we make the argument that unsafe_allow_threads is difficult to verify, then I think we would be straight-up misleading users if we left that functionality present without making it unsafe, as then they are not even aware that they should be thinking about the soundness of their program. Just changing the exact same API to be unsafe greatly clarifies the functionality available to users, even if the fundamental design is no different.

To me, it looks like it's just there in the case that when it does go wrong we can just say "well, it's unsafe and you used the unsafe keyword so it's not our fault" which is technically true but doesn't help anyone.

I think this argument can be made about any unsafe API. I think a more helpful framing of the issue is that here in this PR we have tried our best to build a safe construction, and we give the user the choice whether the concerns we highlight are relevant to them. Again, I think the key merit of unsafe here is about communication to users, not about who we can blame when there's a problem.


@mejrs - I think we all hope to have maintainer consensus before moving forward, so it's doubly painful given that I feel doing nothing isn't an option. Is there a reduced form of this PR or an alternative proposal that you would find more palatable?

@davidhewitt (Member)

So, following up on this a month later, my feeling is that we proceed to merge this for 0.21. It doesn't seem to me that we have any real option other than to try to make this API sound. While I don't love having three forms of what's more or less the same API, this seems like the lesser of the evils we need to pick from.

@mejrs (Member) commented Jan 30, 2024

My preferred alternative proposal is the debug mode check that the GIL is held.

But, given that there are no real life cases of people running into these issues (as far as I am aware) I think just doing nothing is also acceptable.

@davidhewitt (Member)

cc @alex; as you are a security expert, any opinion you have on what we should be trying to do here would be valuable.

I definitely want to be cautious but if the consensus is a less drastic option is a better choice then I can be overridden.

@alex (Contributor) commented Jan 30, 2024

(Adding to my TODO list to review tonight)

@adamreichold (Member, Author)

I think just doing nothing is also acceptable.

I don't think doing nothing is a good approach. We could decide not to close the soundness loophole, but I think in that case we should remove Ungil, as it just raises expectations that it will never be able to fulfil. And we should update our documentation to explicitly state that we do not consider this loophole to be an actual problem and will hence not fix it.

I also think that we should involve the community proactively if we do not close it, as doing so runs against the general consensus established for the Rust ecosystem. Personally, I would add that this should be done by those advocating for not closing it, as I see it as part of the necessary work to implement that decision.

@alex (Contributor) commented Jan 30, 2024

On performance (and, honestly, aesthetic) grounds I hate having to run closures on a different thread. I think it takes what should be a simple operation and makes it very complex and gives unexpected performance implications.

But soundness is kind of important, it's why we like Rust :-)

So my big question is: Are there any alternative designs that might provide soundness without the performance cost? @mejrs mentioned a gil held check, but I don't think I saw the description of that.

Assuming there's not a better solution to providing soundness (and I guess we're talking here because we don't have one), I think we should do this. Yes, it's ugly. But it provides safety. Note: I haven't reviewed the PR itself, I'm not vouching for the implementation, just the concept :-)

A major motivator in my thinking is that once there's a nogil python, it obviates this question entirely: if there's no GIL, there's no need to call allow_threads :-) So if you think this is ugly, the best thing to do is put our energy into getting pyo3 compatible with py3.13's nogil.

@adamreichold (Member, Author)

mentioned a gil held check, but I don't think I saw the description of that.

This would have to happen on call into the interpreter and hence performance would suffer everywhere which is not palatable. The proposal was to use a debug_assert!-like check which would surface the problem in dev builds but not provide soundness for release builds.

Are there any alternative designs that might provide soundness without the performance cost?

If we had the option to control thread identity, i.e. just stash thread-local storage instead of actually switching threads, this could be faster. But this is purely hypothetical as this would need to be implemented in the compiler and standard library.

@davidhewitt (Member) commented Jan 31, 2024

The proposal was to use a debug_assert!-like check which would surface the problem in dev builds but not provide soundness for release builds.

It's worth noting here that on the surface this seemed to me a little like how integer overflow panics only in dev builds; but rather than being UB in release, integer overflow is well-defined as two's complement wrapping. So I think the choice to have soundness only in dev builds would be somewhat more extreme.

(We also haven't measured how much this performance overhead would impact dev builds, nor how many places in the code we'd have to introduce this check.)

@adamreichold (Member, Author) commented Jan 31, 2024

An unexpected but IMHO salient data point on how theoretical this problem is: #3788 (reply in thread)

@davidhewitt (Member)

So, I think we're quickly approaching the 0.21 release, and we have yet to make a decision here.

I still strongly believe that despite no option being appealing, moving forward with this is the most practical way to close the soundness hole and that we should do this.

I am tempted to just rebase this and see it over the line, but I'm also mindful about the amount of churn already going into 0.21 with the Bound API.

So, how about as a compromise, we let this slip one release until 0.22? That gives folks with other opinions time to gather support, and if there hasn't been a compelling alternative presented by the time we're ready to push 0.22 then we default to this option.

@alex (Contributor) commented Feb 24, 2024

I'm personally ok with that.

@davidhewitt davidhewitt added this to the 0.22 milestone Feb 24, 2024
@davidhewitt davidhewitt mentioned this pull request Apr 2, 2024
@davidhewitt (Member)

@adamreichold are you ok if I take this PR over and push the finishing work tomorrow while I'm at the PyCon sprints?

@adamreichold (Member, Author)

@adamreichold are you ok if I take this PR over and push the finishing work tomorrow while I'm at the PyCon sprints?

I would be rather grateful if you find the time to do this! I have been dreading rebasing this for some time now...

@davidhewitt (Member)

Well in the end time shot past me but I'll aim to rebase this in the very near future so we can ship 0.22

@davidhewitt (Member)

Ok, first conflict resolution done, I think there will be some finishing work which I'll try later.

@davidhewitt (Member)

Ok, this now has all conflicts / tests fixed. Just needs some docs, which I'll aim to do in the very near future.

@Icxolu Icxolu mentioned this pull request Sep 13, 2024
@davidhewitt davidhewitt modified the milestones: 0.22, 0.24 Oct 4, 2024