Reduce contention in broadcast channel #6284
Conversation
Implement an atomic linked list that allows pushing waiters concurrently, which reduces contention. Fixes: tokio-rs#5465
What implementation are you using for making your linked list atomic? Is it a well-known implementation strategy, or something you came up with on your own?
It seems like the loom tests are not running on this PR. I guess GitHub must have changed something that broke our CI config: tokio/.github/workflows/loom.yml, lines 26 to 40 in 12ce924.
Successfully passing the
Hmm, looks like it started running after I merged master into your branch.
I came up with it on my own, although the idea behind it is quite trivial, so I'm sure it has been implemented by someone already. On the other hand, it is quite a special case of an atomic list: it allows adding elements atomically, but removing them still has to be done with some outside synchronization (in the case of this PR, a write lock). I can research and see if there are some popular implementations of this kind of atomic list, if that would help.
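For reference, here is a minimal non-intrusive sketch of that idea (not the PR's actual code, which is intrusive and generic over the handle type): the head is an atomic pointer, and `push_front` retries a CAS until the new node is installed, while removal is left to whoever has exclusive access.

```rust
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Node<T> {
    value: T,
    next: *mut Node<T>,
}

/// Sketch only: a singly linked list whose head is an AtomicPtr, so
/// push_front can run concurrently from many threads. Removal and Drop
/// cleanup are omitted; they would require exclusive access.
struct ConcurrentPushList<T> {
    head: AtomicPtr<Node<T>>,
}

impl<T> ConcurrentPushList<T> {
    fn new() -> Self {
        Self { head: AtomicPtr::new(ptr::null_mut()) }
    }

    /// Safe to call concurrently: each thread retries its CAS until its
    /// node becomes the new head.
    fn push_front(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value, next: ptr::null_mut() }));
        let mut head = self.head.load(Ordering::Relaxed);
        loop {
            // Point the new node at the current head, then try to install it.
            unsafe { (*node).next = head };
            match self.head.compare_exchange_weak(
                head,
                node,
                Ordering::Release,
                Ordering::Relaxed,
            ) {
                Ok(_) => return,
                Err(actual) => head = actual,
            }
        }
    }
}
```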
tokio/src/util/linked_list.rs (outdated)
/// Atomically adds an element first in the list.
/// This method can be called concurrently from multiple threads.
///
/// # Safety
///
/// The caller must ensure that:
/// - `val` is not pushed concurrently by multiple threads,
/// - `val` is not already part of some list.
pub(crate) unsafe fn push_front(&self, val: L::Handle) {
This is an interesting implementation. It looks like it could be correct. I'll have to think more about it.
Here are two basic MPSC atomic linked list implementations:
They are different because intrusive lists have additional logic to handle ABA issues, as nodes can be removed and re-inserted concurrently with a separate insert. The provided atomic LL implementation is generic over the pointer type, which allows it to be used in an intrusive way. I haven't gotten to the broadcast channel changes yet, so it is possible that the broadcast channel uses it correctly, but I see the atomic LL implementation with generic handles as a hazard.
@@ -310,7 +311,7 @@ struct Shared<T> {
     mask: usize,

     /// Tail of the queue. Includes the rx wait list.
-    tail: Mutex<Tail>,
+    tail: RwLock<Tail>,
`RwLock` implementations tend to be heavier than `Mutex`. In this case, it looks like all reads are of numbers. Another option is to make these cells `AtomicUsize` (or `AtomicU64`) and require writers to these cells to hold the `tail` mutex. Reads can do the atomic read directly.
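A rough sketch of that pattern, with made-up field names (the real `Shared`/`Tail` carry more state): readers load the counter atomically without locking, while writers still take the `tail` mutex before updating it, so updates stay serialized.

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicU64, Ordering};

// Sketch only: illustrative names, not the actual broadcast internals.
struct Tail {
    // wait list and other state that still needs the lock
}

struct Shared {
    tail: Mutex<Tail>,
    pos: AtomicU64, // hypothetical counter that readers only need to read
}

impl Shared {
    fn advance(&self) {
        // Writers serialize on the mutex, so the read-modify-write below
        // never races another writer even though the field is atomic.
        let _guard = self.tail.lock().unwrap();
        self.pos.fetch_add(1, Ordering::Release);
    }

    fn snapshot(&self) -> u64 {
        // Readers skip the mutex entirely.
        self.pos.load(Ordering::Acquire)
    }
}
```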
Are you saying that we could make `Tail` fields atomic? That would spare us the need to take a lock in some situations, but the main contention source is adding waiters to the list, which would still have to be done with a lock.
So, it seems like the main thing here is to take a "regular" DLL and make the push operation concurrent with other push operations but not with a pop operation. This is done by guarding the entire list with a single
tokio/src/util/linked_list/atomic.rs (outdated)
/// An atomic intrusive linked list. It allows pushing new nodes concurrently.
/// Removing nodes still requires an exclusive reference.
pub(crate) struct AtomicLinkedList<L, T> {
If you stick with this strategy, I would rename this `ConcurrentPushLinkedList` or something explicit. Also, update the docs to be very clear that synchronization is required for all operations except concurrent push.
Yeah, I'm inclined to agree with Carl --- the name "AtomicLinkedList" suggests that all links are atomic...
also, it might be worth investigating whether there are performance benefits to using this list type in other places in `tokio::sync`. i think there are other synchronization primitives that currently force all tasks pushing to their wait lists to serialize themselves with a `Mutex`, which could potentially benefit from this type. but, i would save that for a separate branch.
> also, it might be worth investigating whether there are performance benefits to using this list type in other places in `tokio::sync`.

I was eyeing `sync::Notify`: it has a mutex that guards a list of waiters. #5464 reduced contention on the watch channel by utilizing 8 instances of `Notify` instead of one. Looks like the perfect candidate.
Yes, there are definitely other places that could benefit from something similar. I'm also wondering if the timer could have a similar optimization. Some people experience significant contention in the timer.
However, timers have more contention on cancellation due to timeouts usually being cancelled...
I think I am following. I think the LL implementation works, but the docs need to be updated to very clearly state the requirements for using it.

Also, if you want to explore more tweaks to the code, another option might be a two-lock strategy (similar to the Michael & Scott queue). In this strategy, the head and tail of the linked list are guarded by two separate locks. For this to work, you need a sentinel node. This might be interesting because `RwLock`s need to worry about fairness. If you decouple the head & tail locks, then push & pop no longer contend. To remove a node, you would then have to acquire a write lock on both locks. If we assume cancellation is a lower-priority operation, this might work out to be better.
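For concreteness, here is a rough sketch of that two-lock (Michael & Scott style) queue with owned nodes rather than an intrusive, doubly linked design, so it is not what the broadcast channel would literally use: a sentinel node keeps the head and tail locks from ever touching the same node, so push and pop never contend, and only arbitrary removal (e.g. cancellation) would need both locks.

```rust
use std::ptr;
use std::sync::Mutex;

struct Node<T> {
    value: Option<T>, // None for the sentinel
    next: *mut Node<T>,
}

/// Sketch only: two-lock queue with a sentinel. Drop cleanup is omitted.
struct TwoLockQueue<T> {
    head: Mutex<*mut Node<T>>, // pop side
    tail: Mutex<*mut Node<T>>, // push side
}

unsafe impl<T: Send> Send for TwoLockQueue<T> {}
unsafe impl<T: Send> Sync for TwoLockQueue<T> {}

impl<T> TwoLockQueue<T> {
    fn new() -> Self {
        let sentinel = Box::into_raw(Box::new(Node { value: None, next: ptr::null_mut() }));
        Self { head: Mutex::new(sentinel), tail: Mutex::new(sentinel) }
    }

    fn push(&self, value: T) {
        let node = Box::into_raw(Box::new(Node { value: Some(value), next: ptr::null_mut() }));
        // Only the tail lock is needed to append.
        let mut tail = self.tail.lock().unwrap();
        unsafe { (**tail).next = node };
        *tail = node;
    }

    fn pop(&self) -> Option<T> {
        // Only the head lock is needed to remove from the front.
        let mut head = self.head.lock().unwrap();
        let next = unsafe { (**head).next };
        if next.is_null() {
            return None;
        }
        let old = *head;
        *head = next;
        unsafe {
            // The old sentinel is freed; the popped node becomes the new sentinel.
            drop(Box::from_raw(old));
            (*next).value.take()
        }
    }
}
```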
tokio/src/util/linked_list/atomic.rs (outdated)
}

#[cfg(test)]
#[cfg(not(loom))]
it would be nice to have loom tests for this, potentially...
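Something like the following could serve as a starting point, assuming the list's atomics are swapped for `loom` types under `cfg(loom)` and that it exposes a `push_front`-style API; `ConcurrentPushList` is a stand-in name, not the PR's actual type.

```rust
#[cfg(all(test, loom))]
mod loom_tests {
    use loom::sync::Arc;
    use loom::thread;

    // Assumed to be defined in the parent module with loom-aware atomics.
    use super::ConcurrentPushList;

    #[test]
    fn concurrent_pushes_are_not_lost() {
        loom::model(|| {
            let list = Arc::new(ConcurrentPushList::new());

            // Two threads push concurrently; loom explores the interleavings.
            let handles: Vec<_> = (0..2)
                .map(|i| {
                    let list = Arc::clone(&list);
                    thread::spawn(move || list.push_front(i))
                })
                .collect();
            for h in handles {
                h.join().unwrap();
            }

            // With exclusive access again, both pushed values must be present.
            // (Iteration/drain API elided in this sketch.)
            // assert_eq!(list.len(), 2);
        });
    }
}
```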
That's an interesting strategy, I will read up on that. However, according to my experiments, the contention mainly comes from two places: pushing into the waiters list in

The tradeoff between the current approach and the two-locks approach, as I see it, is that in the current implementation writers contend with each other and with readers, but readers don't contend with each other. In the two-locks approach, writers contend between themselves and readers between themselves. It looks like if we get into a situation where a lot of readers have to insert themselves into the waiters list, then writers are apparently slower than readers, so having readers not contend with each other is more beneficial than having them not contend with writers. What do you think?
If we're comparing to alternate approaches, then the existing sharding solution used by the watch channel shouldn't be forgotten. Just because the CAS operation is atomic doesn't mean that it doesn't cause contention.
This is true, sharding is a very valid option. Initially I wanted to implement it here, but I gave it up because of the complexity of the broadcast channel's interactions with the tail lock. On the other hand, even with atomic pushes we can always fall back to sharding. If one list with atomic pushes performs better than one mutex-protected list, then a sharded comparison will probably also favor lists with atomic pushes (under high enough contention, that is; a mutex is probably faster when contention is low enough).
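As a point of comparison, here is a hypothetical sketch of the sharding shape being discussed (made-up names, not the watch channel's actual code from #5464): subscribers spread themselves over several locks, so each lock sees only a fraction of the subscription traffic, while waking everyone still has to visit every shard.

```rust
use std::sync::Mutex;
use std::sync::atomic::{AtomicUsize, Ordering};

const SHARDS: usize = 8;

// Sketch only: a Vec-backed wait list stands in for the intrusive list.
struct ShardedWaiters<W> {
    shards: [Mutex<Vec<W>>; SHARDS],
    next: AtomicUsize,
}

impl<W> ShardedWaiters<W> {
    fn new() -> Self {
        Self {
            shards: std::array::from_fn(|_| Mutex::new(Vec::new())),
            next: AtomicUsize::new(0),
        }
    }

    // Subscribers round-robin over the shards, so each lock sees roughly
    // 1/SHARDS of the subscription traffic.
    fn add_waiter(&self, waiter: W) {
        let idx = self.next.fetch_add(1, Ordering::Relaxed) % SHARDS;
        self.shards[idx].lock().unwrap().push(waiter);
    }

    // The sender pays for the fan-out: waking everyone touches every shard.
    fn take_all(&self) -> Vec<W> {
        let mut all = Vec::new();
        for shard in &self.shards {
            all.append(&mut shard.lock().unwrap());
        }
        all
    }
}
```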
Thanks for clarifying. I'm going to dig into the details a bit, because the pattern of using a linked list for waiters is used everywhere in Tokio, so a change here could be generally applicable. For example, I went back to the original implementation to refresh my memory and

How much of the gain you observe is due to making

Also, thinking more about the two-lock queue, I don't think it would support a doubly linked list.
That's a good question, I made a separate branch to test it. The benchmark shows the following on my machine:
The smallest benchmark was probably subject to noise, but all in all, it looks like this change alone explains somewhere between a third and a half of the performance improvement.
At the very least, I think
I discussed the linked list you proposed with Paul E. McKenney. He agrees that the concurrent insertion is correct, but he recommended that we go for sharding instead: "In my experience, sharding is almost always way better than tweaking locking and atomic operations. Shard first, tweak later, and even then only if necessary." Using multiple
Ok, I will look into sharding, probably better as a separate PR. Do you think it is worth creating a PR to make
Yes, that sounds simple. I think that would be a good PR.
Motivation
Broadcast channel performance is suboptimal when there are many threads that frequently subscribe for new values, as they all have to contend for the same mutex. See #5465
Solution
The proposed change is to use an atomic linked list to store waiters. This way the tail mutex can be replaced with an RwLock, and adding new subscribers only requires a read lock. Justifying safety becomes trickier, though; I did my best in the comments.
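To make the intended locking discipline concrete, here is a hypothetical, self-contained sketch (field and method names are made up; the real code lives in `sync::broadcast` and uses the intrusive list): the RwLock does not protect the pushes themselves, which are atomic, but it guarantees that removal never runs concurrently with a push.

```rust
use std::ptr;
use std::sync::RwLock;
use std::sync::atomic::{AtomicPtr, Ordering};

struct Waiter {
    id: u64,
    next: *mut Waiter,
}

struct Tail {
    // Simplified: the real Tail also holds positions, receiver counts, etc.
    waiters_head: AtomicPtr<Waiter>,
}

struct Shared {
    tail: RwLock<Tail>,
}

impl Shared {
    // Subscribers: shared (read) lock plus a CAS push, so they don't serialize.
    fn add_waiter(&self, id: u64) {
        let tail = self.tail.read().unwrap();
        let node = Box::into_raw(Box::new(Waiter { id, next: ptr::null_mut() }));
        let mut head = tail.waiters_head.load(Ordering::Relaxed);
        loop {
            unsafe { (*node).next = head };
            match tail.waiters_head.compare_exchange_weak(
                head, node, Ordering::Release, Ordering::Relaxed,
            ) {
                Ok(_) => return,
                Err(actual) => head = actual,
            }
        }
    }

    // Sender: exclusive (write) lock, so it can drain the whole list without
    // racing any in-flight push.
    fn notify_all(&self) -> Vec<u64> {
        let tail = self.tail.write().unwrap();
        let mut ids = Vec::new();
        let mut cur = tail.waiters_head.swap(ptr::null_mut(), Ordering::Acquire);
        while !cur.is_null() {
            let node = unsafe { Box::from_raw(cur) };
            ids.push(node.id);
            cur = node.next;
        }
        ids
    }
}
```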
The PR also includes a benchmark that on my machine shows a dramatic improvement in high contention scenarios: