Scoped thread implicit join doesn't wait for thread locals to be dropped #116237
Comments
You know, technically, it only says it has the "happens before" on the API docs of |
This error is spotty: sometimes it happens, sometimes it doesn't. It has been happening since at least 1.63, when this function was stabilized. |
Nope, seems this has been happening since day -128; checked with 7737e0b (2022-04-04), using slightly modified code for version changes:
#![feature(bench_black_box)]
#![feature(scoped_threads)]
use std::{hint::black_box, mem::replace, thread};

static mut CONTROL: Option<String> = None;

fn main() {
    // Explicit join of scoped thread.
    {
        for _ in 0..5000 {
            thread::scope(|s| {
                let h = s.spawn(|_| {
                    FOO.with(|v| {
                        black_box(v);
                    });
                });
                h.join().unwrap();
            });
            // SAFETY: this happens after the thread join, which provides a `happens before` guarantee
            let v = unsafe { &CONTROL };
            black_box(format!("{:?}", v));
            print!(".");
        }
        println!("Completed EXPLICIT join loop.");
    }
    println!();
    // Implicit join of scoped thread.
    {
        for _ in 0..5000 {
            thread::scope(|s| {
                let _h = s.spawn(|_| {
                    FOO.with(|v| {
                        black_box(v);
                    });
                });
            });
            // SAFETY: this happens after the implicit thread join, which should provide a `happens before` guarantee
            let v = unsafe { &CONTROL };
            black_box(format!("{:?}", v));
            print!(".");
        }
        println!("Completed IMPLICIT join loop.");
    }
}

struct Foo(());

impl Drop for Foo {
    fn drop(&mut self) {
        // SAFETY: this happens before the thread join, which provides a `happens before` guarantee
        let _: Option<String> = unsafe { replace(&mut CONTROL, Some("abcd".to_owned())) };
    }
}

thread_local! {
    static FOO: Foo = Foo(());
} |
Explicit joins go through the platform's Implicit joins, on the other hand, are implemented via an atomic counter that tracks the number of extant threads. The spawning thread parks at the end of the scope until the counter has reached zero. Scoped threads decrement the counter when exiting their main function and unpark the spawning thread if necessary. The bug described here occurs because the TLS destructors run after the main function completes, so no happens-before relationship is formed between their operations and the counter load at the end of the scope. As I see it, there are two ways to resolve this:
|
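For readers following along, here is a minimal sketch of the counter mechanism described above. It is a simplified model, not the standard library's actual code: `ScopeState`, `run_scope`, and `done` are hypothetical names, and plain `thread::spawn` stands in for scoped spawning. The point it tries to show is that the `Release` decrement only orders what the thread did before the end of its main function; TLS destructors run later and are not covered.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread::{self, Thread};

/// Hypothetical stand-in for the bookkeeping a scope keeps (assumed names).
struct ScopeState {
    /// Number of spawned threads whose main function has not finished yet.
    running: AtomicUsize,
    /// Handle of the thread that owns the scope, so workers can unpark it.
    owner: Thread,
}

/// Spawns `n` workers, then waits on the counter instead of joining them.
/// Returns how many workers were observed to have finished their work.
fn run_scope(n: usize) -> usize {
    let state = Arc::new(ScopeState {
        running: AtomicUsize::new(0),
        owner: thread::current(),
    });
    let done = Arc::new(AtomicUsize::new(0));

    for _ in 0..n {
        state.running.fetch_add(1, Ordering::Relaxed);
        let state = Arc::clone(&state);
        let done = Arc::clone(&done);
        thread::spawn(move || {
            // The "main function" of the worker.
            done.fetch_add(1, Ordering::Relaxed);

            // End of the main function: publish everything done so far.
            // The last worker to finish wakes the scope owner.
            if state.running.fetch_sub(1, Ordering::Release) == 1 {
                state.owner.unpark();
            }
            // Any TLS destructors of this thread run after this closure
            // returns, so nothing here orders them before the owner's load.
        });
    }

    // "Implicit join": park until the counter reads zero.
    // The loop also handles spurious wakeups from `park`.
    while state.running.load(Ordering::Acquire) != 0 {
        thread::park();
    }
    done.load(Ordering::Relaxed)
}
```

Calling `run_scope(8)` waits for all eight workers' main functions, which is why safe code never observes the problem; only writes made from TLS destructors fall outside the synchronization.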
Thanks for the bug report! It's a good question whether this behaviour is acceptable or not. In many ways, TLS destructors behave a bit weirdly. As mentioned above, the implicit joining behaviour does properly wait for the thread's main function to finish, so this is never a problem in safe code. So, I don't think I'd categorize this as a high priority bug, or perhaps not even as a bug at all. But I do agree it would be good to fix this, if we can find a good way to do that. Keeping track of all the join/thread handles in some sort of (linked) list would be problematic, as that'd result in an ever growing list (~a memory leak) when using a long-lived scope. (E.g. a webserver that all runs in a single thread::scope, which spawns a thread for every request. It'd never exit the scope, so the list would just grow forever.) I'll investigate to see if we can find a better solution. Perhaps the counter needs to be decremented by a TLS destructor itself, or we can trigger TLS destruction ourselves, or perhaps we can think of another solution. |
This might work. TLS destructors are executed in reverse order of registration, so if we register one right at the start of the scoped thread, it should execute last after any other TLS destructors in the thread. |
That's not guaranteed and very unreliable. If a destructor registers another destructor, that one will run after all others. On platforms that use keyed TLS, there is no predefined order at all. |
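To make the idea under discussion concrete, here is a rough sketch of a counter that is decremented from a TLS destructor rather than at the end of the thread's main function. All names (`PENDING`, `ExitGuard`, `spawn_tracked`, `pending_after_join`) are made up for illustration. As noted above, nothing guarantees this guard is destroyed last, so this only shows the shape of the approach and is not a fix.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::thread;

// Hypothetical counter of threads whose TLS guard has not been dropped yet.
static PENDING: AtomicUsize = AtomicUsize::new(0);

/// Guard whose destructor runs during thread-local teardown.
struct ExitGuard;

impl Drop for ExitGuard {
    fn drop(&mut self) {
        // Publish whatever this thread did before the guard was dropped.
        // Destructors that run after this one are still not covered.
        PENDING.fetch_sub(1, Ordering::Release);
    }
}

thread_local! {
    static EXIT_GUARD: ExitGuard = ExitGuard;
}

/// Spawns a thread that registers the guard before running `f`.
fn spawn_tracked<F>(f: F) -> thread::JoinHandle<()>
where
    F: FnOnce() + Send + 'static,
{
    PENDING.fetch_add(1, Ordering::Relaxed);
    thread::spawn(move || {
        // Touch the guard first so its destructor is registered early.
        // Whether that makes it run last is platform-dependent (assumption).
        EXIT_GUARD.with(|_| {});
        f();
    })
}

/// Spawns one tracked thread, joins it explicitly, and reads the counter.
fn pending_after_join() -> usize {
    spawn_tracked(|| {}).join().unwrap();
    PENDING.load(Ordering::Acquire)
}
```

On the common platforms an explicit `join()` returns only after the thread's TLS destructors have run, so `pending_after_join()` is expected to read zero there. That is an observation about current behaviour, not a documented guarantee.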
It could be acceptable to define the implicit join as inherently slightly racy if we're confident that it has no unsoundness per se in doing so (i.e. it is very obvious this is bad if combined with the aforementioned However, that leaves the question:
            // SAFETY: this happens after the implicit thread join
            // which should provide a `happens before` guarantee
            let v = unsafe { &CONTROL };
            black_box(format!("{:?}", v));
            print!(".");
        }
        println!("Completed IMPLICIT join loop.");
    }
}
impl Drop for Foo {
    fn drop(&mut self) {
        // SAFETY: this happens before the thread join
        // which should provide a `happens before` guarantee
        let _: Option<String> = unsafe { replace(&mut CONTROL, Some("abcd".to_owned())) };
    }
}
Would either of these unsafe blocks, themselves, be okay, if the other one didn't exist? It seems the answer is that the unsafe read of the Also, in the previous example, it should be noted that executing with optimizations appears to make the first |
We discussed this in the library team meeting and the decision was to keep the existing behavior. However, we are open to any suggestions on how the documentation can be improved to be clearer. Feel free to open a doc PR. |
That may be more easily done if we know what the motivation was for deciding to preserve the existing behavior, contra the alternatives? And which |
That meeting happened on Zulip, here. What we discussed is that the way to solve this involves keeping track of all the join handles in the main thread, which can pile up (behaving like a memory leak) for a long-lived scope, so we see no other resolution than just accepting the current behaviour of |
Note that the
Although it's easy to argue that this "best effort" is only about whether they are run or not, not whether they are run at the expected time. |
I believe we don't guarantee anywhere that TLS destructors will run before .join(). (I wonder how much would break if that behaviour changed.) But if we do want to guarantee that, we could document that |
Note that this is already a memory leak for exactly the same sort of reason inside of the OS's thread library. Any joinable thread that has not been joined will require some memory to handle that state and pass through any return value supported by the API. It would increase the size of the memory leak of course, likely by around 2x at most if the OS has optimized the joinable-but-exited thread state down to minimal amounts. |
Ouch, yeah that's quite the surprising footgun. Personally I'd rather have a bit of memory accumulation in |
To clarify, is there a consensus on whether this is a soundness bug, or just a documentation bug? |
We explicitly detach those threads when the join handle is dropped, so those threads are no longer joinable, so are not leaking memory. |
Yeah, that's true, I missed that. |
That is already the behavior today. Explicitly calling This issue is about what happens to threads that you have not explicitly |
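For anyone who needs the stronger ordering in the meantime, a small sketch of the explicit-join pattern referred to in this thread: keep every handle and call `join()` on it before the scope closure returns. The function name `sum_with_explicit_joins` and the doubling workload are placeholders. Whether TLS destructors are ordered before `join()` returns is current behaviour rather than a documented guarantee, as discussed above.

```rust
use std::thread;

/// Doubles each input on its own scoped thread and sums the results,
/// joining every handle explicitly instead of relying on the scope exit.
fn sum_with_explicit_joins(inputs: &[u32]) -> u32 {
    thread::scope(|s| {
        // Spawn everything first so the threads can run concurrently.
        let handles: Vec<_> = inputs
            .iter()
            .map(|x| s.spawn(move || x * 2))
            .collect();

        // Explicit join: goes through the handle, not the scope's counter.
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .sum::<u32>()
    })
}
```

By the time the scope ends there are no unjoined threads left, so the implicit join has nothing to wait for.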
Proposing to close this issue with resolution "not a bug, current behavior at scope exit needs to be properly documented". @rfcbot close |
Team member @m-ou-se has proposed to close this. The next step is review by the rest of the tagged team members: No concerns currently listed. Once a majority of reviewers approve (and at most 2 approvals are outstanding), this will enter its final comment period. If you spot a major issue that hasn't been raised at any point in this process, please speak up! See this document for info about what commands tagged team members can give me. |
@m-ou-se reply on #116237 (comment):
I have created a separate issue #127571 about this. |
🔔 This is now entering its final comment period, as per the review above. 🔔 |
The use case of a webserver running in The idea would be to have the threads that are stopping participate in the joining effort. In short:
And that's it. All threads are joined. I am not sure if the idea is completely feasible, though. Specifically, there's the issue of auto-join vs user-join. Today one cannot have two Advantages:
Disadvantages:
For bonus points, ditch |
The final comment period, with a disposition to close, as per the review above, is now complete. As the automated representative of the governance process, I would like to thank the author for their work and everyone else who contributed. |
The documentation for |
See also #116179:
Scoped thread implicit join fails 'happens before' guarantee
As documented in the Rust Atomics and Locks book by Mara Bos, "joining a thread creates a happens-before relationship between the joined thread and what happens after the join() call". This is not true for implicit joins on scoped threads.
I tried this code:
I expected to see this happen: Both the explicit join and implicit join portions of the code should terminate normally.
Instead, this happened: The explicit join portion terminates normally as expected, but the implicit join portion panics (the number of loop iterations before the panic varies).
This issue is related to issue #116179.
Meta
The same behaviour is observed on the nightly version nightly-x86_64-unknown-linux-gnu unchanged - rustc 1.74.0-nightly (0288f2e 2023-09-25).