-
-
Notifications
You must be signed in to change notification settings - Fork 2.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
sync: fix racy UnsafeCell
access on a closed oneshot
#4226
Conversation
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
This fixes the race. Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Signed-off-by: Eliza Weisman <eliza@buoyant.io>
loop { | ||
if State(state).is_closed() { | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I also considered writing this as
loop { | |
if State(state).is_closed() { | |
break; | |
} | |
while !State(state).is_closed() { |
but i had a vague memory of @carllerche saying that he doesn't like the use of while
loops for compare-and-swap loops... :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way it's written here with a loop
is perfectly readable, so let's just keep it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The oneshot code is honestly pretty bad now that I look at it - there are very few safety justifications in the code. What you are doing here sounds reasonable, but I'd need to read the entire implementation to be confident that it solves the problem.
loop { | ||
if State(state).is_closed() { | ||
break; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the way it's written here with a loop
is perfectly readable, so let's just keep it.
Yeah, I was also thinking about trying to clean up more of the existing implementation (maybe in a follow-up...). But, I'm pretty confident that this change fixes the specific crash in question. |
// `poll_recv` or `try_recv` call is occurring concurrently, both | ||
// threads may try to access the `UnsafeCell` if we were to set the | ||
// `VALUE_SENT` bit on a closed channel. | ||
let mut state = cell.load(Ordering::Relaxed); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems safer.
let mut state = cell.load(Ordering::Relaxed); | |
let mut state = cell.load(Ordering::Acquire); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it's necessary; the compare_exchange
has an Acquire
failure ordering, so when we actually go to set the value, we will always perform an Acquire
operation, and if the initial Relaxed
load is stale, the CAS will fail, and we'll loop again with a value that was accessed with an Acquire
ordering. So, making the initial load Relaxed
really just serves to avoid an unnecessary Acquire
in the case that we're the only thread writing to the state; if it's contended and the Relaxed
load is stale, we will always perform an Acquire
on subsequent iterations.
Other CAS loops elsewhere in Tokio follow a similar pattern:
tokio/tokio/src/time/driver/entry.rs
Lines 167 to 192 in 03969cd
// Quick initial debug check to see if the timer is already fired. Since | |
// firing the timer can only happen with the driver lock held, we know | |
// we shouldn't be able to "miss" a transition to a fired state, even | |
// with relaxed ordering. | |
let mut cur_state = self.state.load(Ordering::Relaxed); | |
loop { | |
debug_assert!(cur_state < STATE_MIN_VALUE); | |
if cur_state > not_after { | |
break Err(cur_state); | |
} | |
match self.state.compare_exchange( | |
cur_state, | |
STATE_PENDING_FIRE, | |
Ordering::AcqRel, | |
Ordering::Acquire, | |
) { | |
Ok(_) => { | |
break Ok(()); | |
} | |
Err(actual_state) => { | |
cur_state = actual_state; | |
} | |
} |
However, I'm happy to change this if you prefer.
As a note, I can run my repro from #4225 (comment) with this branch for over 40 minutes without seeing any segfaults or mangled strings. |
#4229 should help make some of the oneshot implementation's invariants a bit clearer, I hope! |
Depends on #4226 ## Motivation Currently, the safety invariants and synchronization strategy used in `tokio::sync::oneshot` are not particularly obvious, especially to a new reader. It would be nice to better document this code to make these invariants clearer. ## Solution This branch adds `SAFETY:` comments to the `oneshot` channel implementation. In particular, I've focused on documenting the invariants around when the inner `UnsafeCell` that stores the value can be accessed by the sender and receiver sides of the channel. I still want to take a closer look at when the waker cells can be set, and I'd like to add more documentation there in a follow-up branch. Signed-off-by: Eliza Weisman <eliza@buoyant.io>
# 1.14.0 (November 15, 2021) ### Fixed - macros: fix compiler errors when using `mut` patterns in `select!` ([#4211]) - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed ([#4226]) - sync: make `AtomicWaker` panic safe ([#3689]) - runtime: fix basic scheduler dropping tasks outside a runtime context ([#4213]) ### Added - stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223]) ### Changed - io: updated `copy` buffer size to match `std::io::copy` ([#4209]) ### Documented - io: rename buffer to file in doc-test ([#4230]) - sync: fix Notify example ([#4212]) [#4211]: #4211 [#4226]: #4226 [#3689]: #3689 [#4213]: #4213 [#4179]: #4179 [#4223]: #4223 [#4209]: #4209 [#4230]: #4230 [#4212]: #4212
Depends on #4226 ## Motivation Currently, the safety invariants and synchronization strategy used in `tokio::sync::oneshot` are not particularly obvious, especially to a new reader. It would be nice to better document this code to make these invariants clearer. ## Solution This branch adds `SAFETY:` comments to the `oneshot` channel implementation. In particular, I've focused on documenting the invariants around when the inner `UnsafeCell` that stores the value can be accessed by the sender and receiver sides of the channel. I still want to take a closer look at when the waker cells can be set, and I'd like to add more documentation there in a follow-up branch. Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Depends on #4226 ## Motivation Currently, the safety invariants and synchronization strategy used in `tokio::sync::oneshot` are not particularly obvious, especially to a new reader. It would be nice to better document this code to make these invariants clearer. ## Solution This branch adds `SAFETY:` comments to the `oneshot` channel implementation. In particular, I've focused on documenting the invariants around when the inner `UnsafeCell` that stores the value can be accessed by the sender and receiver sides of the channel. I still want to take a closer look at when the waker cells can be set, and I'd like to add more documentation there in a follow-up branch. Signed-off-by: Eliza Weisman <eliza@buoyant.io>
Depends on #4226 ## Motivation Currently, the safety invariants and synchronization strategy used in `tokio::sync::oneshot` are not particularly obvious, especially to a new reader. It would be nice to better document this code to make these invariants clearer. ## Solution This branch adds `SAFETY:` comments to the `oneshot` channel implementation. In particular, I've focused on documenting the invariants around when the inner `UnsafeCell` that stores the value can be accessed by the sender and receiver sides of the channel. I still want to take a closer look at when the waker cells can be set, and I'd like to add more documentation there in a follow-up branch. Signed-off-by: Eliza Weisman <eliza@buoyant.io>
# 1.14.0 (November 15, 2021) ### Fixed - macros: fix compiler errors when using `mut` patterns in `select!` ([#4211]) - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed ([#4226]) - sync: make `AtomicWaker` panic safe ([#3689]) - runtime: fix basic scheduler dropping tasks outside a runtime context ([#4213]) ### Added - stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223]) ### Changed - io: updated `copy` buffer size to match `std::io::copy` ([#4209]) ### Documented - io: rename buffer to file in doc-test ([#4230]) - sync: fix Notify example ([#4212]) [#4211]: #4211 [#4226]: #4226 [#3689]: #3689 [#4213]: #4213 [#4179]: #4179 [#4223]: #4223 [#4209]: #4209 [#4230]: #4230 [#4212]: #4212
# 1.14.0 (November 15, 2021) ### Fixed - macros: fix compiler errors when using `mut` patterns in `select!` ([#4211]) - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed ([#4226]) - sync: make `AtomicWaker` panic safe ([#3689]) - runtime: fix basic scheduler dropping tasks outside a runtime context ([#4213]) ### Added - stats: add `RuntimeStats::busy_duration_total` ([#4179], [#4223]) ### Changed - io: updated `copy` buffer size to match `std::io::copy` ([#4209]) ### Documented - io: rename buffer to file in doc-test ([#4230]) - sync: fix Notify example ([#4212]) [#4211]: #4211 [#4226]: #4226 [#3689]: #3689 [#4213]: #4213 [#4179]: #4179 [#4223]: #4223 [#4209]: #4209 [#4230]: #4230 [#4212]: #4212
Depends on #4226 ## Motivation Currently, the safety invariants and synchronization strategy used in `tokio::sync::oneshot` are not particularly obvious, especially to a new reader. It would be nice to better document this code to make these invariants clearer. ## Solution This branch adds `SAFETY:` comments to the `oneshot` channel implementation. In particular, I've focused on documenting the invariants around when the inner `UnsafeCell` that stores the value can be accessed by the sender and receiver sides of the channel. I still want to take a closer look at when the waker cells can be set, and I'd like to add more documentation there in a follow-up branch. Signed-off-by: Eliza Weisman <eliza@buoyant.io>
A data race condition was discovered in tokio, which can lead to memory corruption. This vulnerability affects our fork. See: - https://rustsec.org/advisories/RUSTSEC-2021-0124 - tokio-rs/tokio#4225 - tokio-rs/tokio#4226 - fengalin/tokio#1 Fixes: https://gitlab.freedesktop.org/gstreamer/gst-plugins-rs/-/issues/174
} | ||
// TODO: This could be `Release`, followed by an `Acquire` fence *if* | ||
// the `RX_TASK_SET` flag is set. However, `loom` does not support | ||
// fences yet. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think loom has partial support for fences iiuc (tokio-rs/loom#220), so technically this comment isn't correct anymore :D.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, I think there might be a couple other places in Tokio where we also can implement some things nicer now that loom
supports fences. I didn't want to actually change that aspect of the implementation in the bugfix branch, though. Might be worth doing in a subsequent one!
# 1.8.4 (November 15, 2021) This release backports a bug fix from 1.13.1. ### Fixed - sync: fix a data race between `oneshot::Sender::send` and awaiting a `oneshot::Receiver` when the oneshot has been closed (tokio-rs#4226)
Is there any plan to backport this to 0.1.x ? |
We don't have any plans, no. |
Okay, thanks. |
Motivation
Issue #4225 describes a potential race condition in the
oneshot
channel in
tokio-sync
. The race occurs when the oneshot is closed, andthen a thread calls
oneshot::Sender::send
while another thread iscalling
oneshot::Receiver::try_recv
or awaiting the reciever. The callto
send
puts the sent value in the cell. Then, it sets theVALUE_SENT
(completed) bit, returning a snapshot of the state. If thestate is closed, the sender takes the sent value back out of the cell.
Receiving from the channel also loads the state. The reciever first
checks if the channel is completed, and takes the value out of the cell
if the completed bit is set. This is racy, because if a call to
recv
loads the state after the completed bit has been set, it will no longer
treat the channel as closed, and will try to take the value out of the
cell. This results in both threads accessing the cell concurrently.
Solution
This branch adds two
loom
tests reproducing the race, fortry_recv
and for awaiting a
Receiver
. Both of these tests fail with a causalityviolation when accessing the
UnsafeCell
onmaster
.I fixed the race condition by changing
State::set_completed
to not setthe
VALUE_SENT
bit if theCLOSED
bit has already been set. This way,if the channel is closed before
send
is called, we won't set the bitthat tells the receiver that it's okay to access the
UnsafeCell
, andwe the sender can take the value back out without the risk of the
receiver also concurrently accessing the
UnsafeCell
.This required changing
State::set_completed
from a simple fetch-or toa compare-and-swap loop, which is slightly a bummer, as it's less
efficient than a single fetch-or. However, as far as I can tell,
changing this to a CAS loop is the only solution to the race that
doesn't also change the oneshot's behavior. We currently have tests
asserting that if the channel is closed after a value is sent, the
value is still received. A simpler solution to this problem would be
changing
poll_recv
andtry_recv
to check if the channel is closedfirst, and only check if the channel is completed and take the value
out if it is not closed. This fixes the race, since the receiver will no
longer try to access the
UnsafeCell
if the channel has been closed.But, this would break the existing tests asserting that calling
close
after a value has been sent still results in the value being received.
Changing that behavior seems undesirable, so I thought this was worth
the CAS loop.
Fixes: #4225