Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Don't auto-advance time when a spawn_blocking task is running. #5115

Merged
merged 18 commits into from
Dec 17, 2022

Conversation

jorendorff
Copy link
Contributor

Time shouldn't auto-advance unless there's really nothing else going on. In general that is hard to fix, but this special case is relatively easy.

Fixes: #4614

Motivation

I'm writing tests for a complex distributed system. In a single test process, a "server" task runs CPU-bound units of work using spawn_blocking, while a "client" task sends requests and waits for responses using .await, with a generous timeout. Issue #4614 causes the timeout to elapse instantly; no responses ever succeed.

There seems to be consensus in #4614 that this behavior is a bug, and I agree.

Solution

I added a field in the clock that tells how many spawn_blocking tasks are running/scheduled.

The code increments this counter when spawning such a task and decrements it when the task is done. This means the increment and the decrement are not conveniently side by side in the code, but I think it might be important to bump the counter synchronously, to avoid a race condition if the new task doesn't actually start running right away.

I named the counter auto_advance_inhibit_count, instead of something obvious like num_blocking_tasks, because it might make sense in the future to provide an API whereby users can manually inhibit auto-advance (... if I'm right in thinking that blocking tasks are not the only kind of work Tokio "can't see").

@github-actions github-actions bot added the R-loom Run loom tests on this PR label Oct 21, 2022
@Darksonn Darksonn added A-tokio Area: The main tokio crate M-task Module: tokio/task M-time Module: tokio/time labels Oct 22, 2022
@jorendorff
Copy link
Contributor Author

jorendorff commented Oct 24, 2022

This could conceivably break someone's existing code.

If a spawn_blocking task behaves like a server, and runs forever (but using sync code) handling requests from async clients, this change would inhibit auto-advance as long as that task is running. That could make a test stall.

If that's too big a breaking change, my problem could alternatively be addressed with

  • a new API that's like spawn_blocking but inhibits auto-advance; or
  • a new API that allows spawn_blocking tasks to turn auto-advance on or off manually as needed.

Of course, there's the question of whether such APIs would actually work well. Intuitively, I think the second option is pretty good. I think async code generally "knows" locally there's a future that represents work that's happening outside of tokio's managed tasks, that will complete and unblock a tokio task. When such a future is polled, I think that's when auto-advance should be inhibited.

@Darksonn
Copy link
Contributor

Hmm, yes, the server example is annoying.

@jorendorff
Copy link
Contributor Author

Hmm, yes, the server example is annoying.

OK, how about a function that wraps a future, say

// (in tokio::time)

/// Annotate a future as doing work, from when it is first polled to when it becomes `Ready`, even if no
/// tokio thread is currently polling it.
///
/// This suppresses the auto-advance feature when time is paused. When time is not paused, or
/// the **test-util** feature is not enabled, this is a no-op.
async fn inhibit_auto_advance<F: Future>(fut: F) -> F::Output;

?

@carllerche
Copy link
Member

I haven't reviewed enough, but I am OK considering this a "bug fix" and not a "breaking change". The arguments are valid. Also, risk is lower due to it being a test-time feature.

@jorendorff
Copy link
Contributor Author

I haven't reviewed enough, but I am OK considering this a "bug fix" and not a "breaking change". The arguments are valid. Also, risk is lower due to it being a test-time feature.

In that case, this is ready for review.

@Darksonn
Copy link
Contributor

I had a quick look and it looks reasonable enough. There are some CI failures though.

@jorendorff
Copy link
Contributor Author

OK, I fixed it—CI is green now.

@jorendorff jorendorff force-pushed the jorendorff/pause-fix branch from 976004f to 1013452 Compare November 7, 2022 22:53
tokio/src/time/clock.rs Outdated Show resolved Hide resolved
@jorendorff
Copy link
Contributor Author

There's still a race condition, apparently. Loom runs the test about 600 times before detecting it.

@jorendorff
Copy link
Contributor Author

I wonder if this is the second problem I mentioned a couple days ago: if the main thread parks itself, allow_auto_advance in the blocking-task-runner thread won't wake it up.

The interleaving that fails is:

About to spawn task a
spawn_blocking_inner(task a)
BlockingSchedule::new()
inhibit_auto_advance: about to ask for clock lock
inhibit_auto_advance: acquired clock lock
inhibit_auto_advance: count is: 1
About to spawn task b
spawn_blocking_inner(task b)
BlockingSchedule::new()
inhibit_auto_advance: about to ask for clock lock
inhibit_auto_advance: acquired clock lock
inhibit_auto_advance: count is: 2
spawned both tasks
About to await task a
Running task a
allow_auto_advance: about to ask for clock lock
allow_auto_advance: acquired clock lock
allow_auto_advance: count is: 1
allow_auto_advance: about to ask for clock lock
allow_auto_advance: acquired clock lock
allow_auto_advance: count is: 0
thread 'main' panicked at 'deadlock; threads = [(Id(0), Blocked(Location(None))), (Id(1), Blocked(Location(None))), (Id(2), Blocked(Location(None)))]', /Users/jorendorff/.cargo/registry/src/git.luolix.top-1ecc6299db9ec823/loom-0.5.6/src/rt/execution.rs:215:13

I don't know why we don't see Running task b.

@Darksonn
Copy link
Contributor

Considering that loom reports a deadlock, it seems reasonable to conclude that this is due to the issue you mention.

As for why you don't see task B, I assume its because it was scheduled to run later?

Time shouldn't auto-advance unless there's really nothing else going on. In
general that is hard to fix, but this special case is relatively easy.

Fixes: tokio-rs#4614
This uses a destructor, so it will also work if tokio machinery panics while
trying to e.g. spawn a thread.
The auto-advance inhibit count must be decremented after the blocking task is
finished, and crucially after waking the JoinHandle, to prevent the tokio
thread from auto-advancing before tasks awaiting that JoinHandle are
rescheduled.
Specifically, when a blocking task is spawned, but no thread immediately awaits
the JoinHandle, if the main thread parks itself, we previously failed to unpark
it when the blocking task completed. Auto-advance would be re-enabled but could
not happen, a potential deadlock.
It does not always work; and is it happens, the caller has a handle already and
can just pass it in.
@jorendorff
Copy link
Contributor Author

The loom issue is fixed and has a test now. This is ready for review.

(I don't know why there are still jobs that are "Expected" after all this time; I think I've seen that before with gh actions when new jobs are added...?)

@Darksonn
Copy link
Contributor

You can fix CI by merging in master.

@jorendorff
Copy link
Contributor Author

OK, this is ready for a CI run.

@jorendorff
Copy link
Contributor Author

Thanks! Tests are green. I think this is ready for review.

@jorendorff
Copy link
Contributor Author

👋 hi, friendly review ping (or, what should I do next here)

Copy link
Contributor

@Darksonn Darksonn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about the silence. We discussed this on the discord back when you said CI was green, and I think we came to the conclusion that this looks fine. I took another look just now, and I have one comment regarding your tests.

Comment on lines 235 to 239
time::timeout(
Duration::from_secs(3),
task::spawn_blocking(|| thread::sleep(Duration::from_millis(250))),
)
.await
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please reduce the sleeps. Our test suite already takes long enough.

@Darksonn Darksonn merged commit e14ca72 into tokio-rs:master Dec 17, 2022
crapStone pushed a commit to Calciumdibromid/CaBr2 that referenced this pull request Jan 8, 2023
This PR contains the following updates:

| Package | Type | Update | Change |
|---|---|---|---|
| [tokio](https://tokio.rs) ([source](https://github.com/tokio-rs/tokio)) | dependencies | minor | `1.23.0` -> `1.24.1` |
| [tokio](https://tokio.rs) ([source](https://github.com/tokio-rs/tokio)) | dev-dependencies | minor | `1.23.0` -> `1.24.1` |

---

### Release Notes

<details>
<summary>tokio-rs/tokio</summary>

### [`v1.24.1`](https://github.com/tokio-rs/tokio/releases/tag/tokio-1.24.1): Tokio v1.24.1

[Compare Source](tokio-rs/tokio@tokio-1.24.0...tokio-1.24.1)

This release fixes a compilation failure on targets without `AtomicU64` when using rustc older than 1.63. ([#&#8203;5356])

[#&#8203;5356]: tokio-rs/tokio#5356

### [`v1.24.0`](https://github.com/tokio-rs/tokio/releases/tag/tokio-1.24.0): Tokio v1.24.0

[Compare Source](tokio-rs/tokio@tokio-1.23.1...tokio-1.24.0)

The highlight of this release is the reduction of lock contention for all I/O operations ([#&#8203;5300](tokio-rs/tokio#5300)). We have received reports of up to a 20% improvement in CPU utilization and increased throughput for real-world I/O heavy applications.

##### Fixed

-   rt: improve native `AtomicU64` support detection ([#&#8203;5284])

##### Added

-   rt: add configuration option for max number of I/O events polled from the OS
    per tick ([#&#8203;5186])
-   rt: add an environment variable for configuring the default number of worker
    threads per runtime instance ([#&#8203;4250])

##### Changed

-   sync: reduce MPSC channel stack usage ([#&#8203;5294])
-   io: reduce lock contention in I/O operations  ([#&#8203;5300])
-   fs: speed up `read_dir()` by chunking operations ([#&#8203;5309])
-   rt: use internal `ThreadId` implementation ([#&#8203;5329])
-   test: don't auto-advance time when a `spawn_blocking` task is running ([#&#8203;5115])

[#&#8203;5186]: tokio-rs/tokio#5186

[#&#8203;5294]: tokio-rs/tokio#5294

[#&#8203;5284]: tokio-rs/tokio#5284

[#&#8203;4250]: tokio-rs/tokio#4250

[#&#8203;5300]: tokio-rs/tokio#5300

[#&#8203;5329]: tokio-rs/tokio#5329

[#&#8203;5115]: tokio-rs/tokio#5115

[#&#8203;5309]: tokio-rs/tokio#5309

### [`v1.23.1`](https://github.com/tokio-rs/tokio/releases/tag/tokio-1.23.1): Tokio v1.23.1

[Compare Source](tokio-rs/tokio@tokio-1.23.0...tokio-1.23.1)

This release forward ports changes from 1.18.4.

##### Fixed

-   net: fix Windows named pipe server builder to maintain option when toggling
    pipe mode ([#&#8203;5336]).

[#&#8203;5336]: tokio-rs/tokio#5336

</details>

---

### Configuration

📅 **Schedule**: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).

🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied.

♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox.

🔕 **Ignore**: Close this PR and you won't be reminded about these updates again.

---

 - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box

---

This PR has been generated by [Renovate Bot](https://github.com/renovatebot/renovate).
<!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNC44NC4xIiwidXBkYXRlZEluVmVyIjoiMzQuODQuMSJ9-->

Co-authored-by: cabr2-bot <cabr2.help@gmail.com>
Reviewed-on: https://codeberg.org/Calciumdibromid/CaBr2/pulls/1703
Reviewed-by: crapStone <crapstone@noreply.codeberg.org>
Co-authored-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Co-committed-by: Calciumdibromid Bot <cabr2_bot@noreply.codeberg.org>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-tokio Area: The main tokio crate M-task Module: tokio/task M-time Module: tokio/time R-loom Run loom tests on this PR
Projects
None yet
Development

Successfully merging this pull request may close these issues.

tokio pause mode auto-advancing ignores spawn_blocking
3 participants