
Yield from the main thread a few times at shutdown #2660

Closed
saethlin wants to merge 5 commits

Conversation

saethlin
Member

@saethlin saethlin commented Nov 9, 2022

Closes #2629

... how should we test this? The naive approach is to run ./miri many-seeds ./miri run example.rs: #2629 (comment)

use std::thread;

fn main() {
  thread::scope(|s| {
    s.spawn(|| {});
  });
}

But running many seeds takes an annoyingly long time. How many seeds is enough? And how does such a test fit into our existing testing system?

(I'm also not enthused about the way I keep piling complexity onto the scheduler here; should this PR include a refactor?)

@oli-obk
Contributor

oli-obk commented Nov 9, 2022

how should we test this?

we could pick a seed that causes the problems and hardcode it into the test's arguments. When it fails due to rustc changes, we need to find a new seed.
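
A minimal sketch of what that could look like, assuming Miri's //@compile-flags test directive and the -Zmiri-seed flag; the seed value below is a made-up placeholder that would have to be found by actually running many seeds:

//@compile-flags: -Zmiri-seed=7
// NOTE: the seed above is a hypothetical placeholder. The seed that reproduces
// the bad scheduling has to be found empirically, and it may stop reproducing
// the bug once rustc or the standard library change.
use std::thread;

fn main() {
    // `thread::scope` makes the main thread wait for the spawned thread,
    // which is exactly the shutdown pattern from #2629.
    thread::scope(|s| {
        s.spawn(|| {});
    });
}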

@saethlin
Member Author

saethlin commented Nov 9, 2022

that causes the problems

The seed that causes the problem depends on the details of stdlib internals and MIR generation. Nearly all seeds are fine. So the problem we have is that any selection of a seed may not reproduce the problematic scheduling in the future.

When it fails due to rustc changes, we need to find a new seed.

The problem would be the test passing with whatever seed we pick, even though the bug is back.

@oli-obk
Contributor

oli-obk commented Nov 9, 2022

The problem would be the test passing with whatever seed we pick, even though the bug is back.

oh 🤦 yeah, we'd need a way to disable this yielding scheme and then run the test once without it, expecting it to fail, and once with it, expecting it to pass... that's annoying and adds even more noise here.

@bors
Contributor

bors commented Nov 17, 2022

☔ The latest upstream changes (presumably 3162b7a) made this pull request unmergeable. Please resolve the merge conflicts.

@RalfJung
Member

Yeah sadly I don't think we can test for this.

(I'm also not enthused about the way I keep piling complexity onto the scheduler here; should this PR include a refactor?)

Not this PR, but if you have a good idea for a refactor that'd be great!
I keep wondering if we can use generators for this somehow, it's kind of a state machine thing... though this applies more to the TLS logic, which is quite terrible too, than to the core scheduler.

src/concurrency/thread.rs
    && self.main_thread_yields_at_shutdown_remaining > 0
{
    self.yield_active_thread = true;
    self.main_thread_yields_at_shutdown_remaining -= 1;
Member

Ideally we'd do this yielding after executing main thread TLS dtors (because it'd be UB to access them). Not sure how terrible that would be to implement though, so a FIXME is also fine.

Member Author

Yeah... I would like to understand how to do this correctly. The naive implementation just shifts the big check this PR adds around so that the original `if should_terminate() { return Ok(ExecuteDtors) }` comes first, but that seems to produce data races whenever this yielding behavior kicks in: we run the main thread's dtors, then schedule another thread on the next step, and that thread reports a data race on one of a few allocations in the Rust runtime.

It wouldn't surprise me if these data races are legitimate. After all, we do have this comment:

                // The main thread terminated; stop the program.
                // We do *not* run TLS dtors of remaining threads, which seems to match rustc behavior.

Member

Hard to say more without knowing the details of the race and how you implemented this. But we currently already yield automatically to other threads while the main thread's dtors are running, so we should already be seeing such races?

The best place to put this yielding behavior seems to be in the `if self.threads[MAIN_THREAD].state == ThreadState::Terminated` check, no? Instead of stopping, we keep going, but with some kind of bound.

Member

Actually a bound on the steps is probably better than yielding... if automatic yielding is disabled, this might otherwise keep going forever even after the main thread stopped.
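
For illustration, a rough sketch of how these two suggestions could fit together (main-thread TLS dtors first, then only a bounded number of extra steps for the remaining threads). All names here are hypothetical stand-ins, not Miri's actual scheduler API:

// Hypothetical scheduler sketch (not Miri's real types): once the main thread
// has terminated, run its TLS dtors, then grant the other threads a bounded
// number of extra steps before stopping the program.
enum SchedulingAction {
    ExecuteStep,  // run one step of the active thread
    ExecuteDtors, // run (a step of) the main thread's TLS destructors
    Stop,         // terminate the program
}

struct Scheduler {
    main_terminated: bool,
    main_dtors_done: bool,
    extra_steps_remaining: u32,
}

impl Scheduler {
    fn schedule(&mut self) -> SchedulingAction {
        if !self.main_terminated {
            return SchedulingAction::ExecuteStep;
        }
        if !self.main_dtors_done {
            // Main-thread TLS dtors come before any extra yielding, as
            // suggested above.
            return SchedulingAction::ExecuteDtors;
        }
        if self.extra_steps_remaining > 0 {
            // Bounded, so this cannot keep the interpreter alive forever.
            self.extra_steps_remaining -= 1;
            return SchedulingAction::ExecuteStep;
        }
        SchedulingAction::Stop
    }
}

fn main() {
    let mut sched = Scheduler {
        main_terminated: true,
        main_dtors_done: true,
        extra_steps_remaining: 3,
    };
    // Prints ExecuteStep three times, then Stop.
    for _ in 0..4 {
        match sched.schedule() {
            SchedulingAction::ExecuteStep => println!("ExecuteStep"),
            SchedulingAction::ExecuteDtors => println!("ExecuteDtors"),
            SchedulingAction::Stop => println!("Stop"),
        }
    }
}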

Member Author

I pushed a commit which does the change I was trying to describe above. I've seen two data races reported, one in the test suite and the other when I run this file:

use std::thread;

fn main() {
    thread::scope(|s| {
        s.spawn(|| {});
    });
}

With ./miri many-seeds ./miri demo.rs

I agree that a bound on the extra steps sounds reasonable. Is the concern that this yielding behavior may take programs which currently exit under the non-yielding option and turn them into programs that never exit?

Member

Is the concern that this yielding behavior may take programs which currently exit under the non-yielding option and turn them into programs that never exit?

Yes.
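
For reference, the kind of program in question looks roughly like the existing threadleak_ignored test: a detached thread that never terminates (the -Zmiri-ignore-leaks directive here is an assumption about that test's setup). Without a bound, a "keep scheduling other threads after main exits" scheme would never finish interpreting this:

//@compile-flags: -Zmiri-ignore-leaks
use std::thread;

fn main() {
    // This detached thread never finishes. Once `main` returns, a scheduler
    // that kept yielding to it without any bound would spin forever instead
    // of exiting the program.
    thread::spawn(|| loop {});
}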

Member Author

The TLS shims re-enable terminated threads. Is that a good thing? Do they have to be Enabled in order to execute user code?

Member
@RalfJung RalfJung Nov 25, 2022

Yes, threads must definitely be enabled when running TLS dtors.

I wish they wouldn't be disabled and then re-enabled... this entire scheduler and TLS stuff needs a rework at some point, but I haven't yet found a nice way to factor this. I still think a generator would be good... basically yielding a function to run, which is then executed until the stack is empty again, at which point the generator is consulted to figure out whether there is another function to run or whether this thread is done.

Member

Ah sadly using generators for this will not work.

src/concurrency/thread.rs
Comment on lines +639 to +640
if self.active_thread == MAIN_THREAD
    && active_thread.state == ThreadState::Terminated
Member

We already know `self.threads[MAIN_THREAD].state == ThreadState::Terminated` here, so this seems redundant?

@RalfJung
Member

Uh, wth... the thread isn't even reading anything?^^

note: tracking was triggered
  |
  = note: created machine-managed memory allocation of 8 bytes (alignment 8 bytes) with id 28
  = note: (no span available)

Dropping 0
error: Undefined Behavior: Data race detected between Read on thread `<unnamed>` and Write on thread `main` at alloc28
  --> tests/pass/threadleak_ignored.rs:29:9
   |
29 |         loop {}
   |         ^^^^^^^ Data race detected between Read on thread `<unnamed>` and Write on thread `main` at alloc28
   |
   = help: this indicates a bug in the program: it performed an invalid operation, and caused Undefined Behavior
   = help: see https://doc.rust-lang.org/nightly/reference/behavior-considered-undefined.html for further information
   = note: BACKTRACE:
   = note: inside closure at tests/pass/threadleak_ignored.rs:29:9

@saethlin
Member Author

I believe alloc28 is main's return place

@RalfJung
Member

RalfJung commented Nov 21, 2022

Is the active thread getting inconsistent or something? It is pointing at the loop; the last statements executed before that are:

11031ms  INFO rustc_const_eval::interpret::step return
11031ms  INFO rustc_const_eval::interpret::eval_context popping stack frame (returning from function)
         rustc_const_eval::interpret::eval_context::frame std::sys_common::thread_local_dtor::register_dtor_fallback::run_dtors
         rustc_const_eval::interpret::eval_context::frame main::{closure#1}
32962ms  INFO rustc_const_eval::interpret::step _12 = const ()

Unfortunately the trace isn't very helpful when there are multiple threads...

I think this race just happens when the main thread TLS dtors are done (i.e., immediately when the new logic kicks in).

@RalfJung
Member

Oh I have a theory... when we are reading the return place here, is the active thread still the other thread? That would look like a race.

@RalfJung
Member

@rustbot author

@rustbot rustbot added the S-waiting-on-author Status: Waiting for the PR author to address review comments label Nov 23, 2022
@RalfJung RalfJung mentioned this pull request Nov 27, 2022
@RalfJung
Member

I implemented a scheduler refactoring with an alternative way to do this yielding in #2699.

bors added a commit that referenced this pull request Dec 1, 2022
refactor scheduler

Refactors the scheduler to use something akin to a generator -- a callback that will be invoked when the stack of a thread is empty, which has the chance to push a new stack frame or do other things and then indicates whether this thread is done, or should be scheduled again. (Unfortunately I think we [cannot use actual generators](https://rust-lang.zulipchat.com/#narrow/stream/213817-t-lang/topic/Generators.20that.20borrow.20on.20each.20resume.3F) here.) The interpreter loop is now a proper infinite loop, the only way to leave it is for some kind of interrupt to be triggered (represented as `InterpError`) -- unifying how we handle 'exit when calling `process::exit`' and 'exit when main thread quits'.

The last commit implements an alternative approach to #2660 using this new structure. Fixes #2629.
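
A rough, self-contained sketch of that callback shape, using made-up names rather than the actual types from #2699: when a thread's stack becomes empty, a per-thread callback is consulted; it either sets up more work (for example a TLS destructor call) or reports that the thread is finished.

// Hypothetical sketch of the "callback when the stack is empty" design
// described in the commit message above; names are illustrative only.
enum Continuation {
    KeepRunning, // the callback queued more work (e.g. pushed a new stack frame)
    Done,        // nothing left to do on this thread
}

trait OnStackEmpty {
    fn on_stack_empty(&mut self) -> Continuation;
}

// Example callback for a main-thread-like sequence: first "run" the TLS
// destructors, then report that the thread is done.
struct MainThreadCallback {
    tls_dtors_done: bool,
}

impl OnStackEmpty for MainThreadCallback {
    fn on_stack_empty(&mut self) -> Continuation {
        if !self.tls_dtors_done {
            self.tls_dtors_done = true;
            println!("pushing a frame for a TLS destructor");
            Continuation::KeepRunning
        } else {
            Continuation::Done
        }
    }
}

fn main() {
    let mut cb = MainThreadCallback { tls_dtors_done: false };
    // The interpreter loop: run the thread until its stack is empty, then ask
    // the callback whether there is more to do or the thread is finished.
    while let Continuation::KeepRunning = cb.on_stack_empty() {
        println!("...executing until the stack is empty again...");
    }
    println!("thread finished");
}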
@RalfJung
Member

RalfJung commented Dec 1, 2022

Closing in favor of #2699. Thanks for the PR!

@RalfJung RalfJung closed this Dec 1, 2022
RalfJung pushed a commit to RalfJung/rust that referenced this pull request Dec 3, 2022
refactor scheduler

@saethlin saethlin deleted the yield-before-termination branch January 15, 2023 22:04
Labels
S-waiting-on-author Status: Waiting for the PR author to address review comments

Successfully merging this pull request may close these issues.

scoped threads vs thread leaks
5 participants