Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Runtime diagnostics for leaked messages in unbounded channels #12971

Merged
merged 15 commits into from
Dec 23, 2022

Conversation

dmitry-markin
Copy link
Contributor

@dmitry-markin dmitry-markin commented Dec 19, 2022

Resolves #12853.

Report error if the unbounded channels out_events::channel() and mpsc::tracing_unbounded() grow above the predifined configurable threshold. The channel name and the backtrace of where it was created (if RUST_BACKTRACE=1 is set) is reported.

@dmitry-markin dmitry-markin added A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit labels Dec 19, 2022
@dmitry-markin dmitry-markin marked this pull request as draft December 19, 2022 11:19
@dmitry-markin
Copy link
Contributor Author

Converted the PR into draft to address all unbounded channels of substrate in one PR.

@dmitry-markin dmitry-markin marked this pull request as ready for review December 20, 2022 14:24
Copy link
Contributor

@altonen altonen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Did you test if the error is reported when the threshold is crossed?

@dmitry-markin
Copy link
Contributor Author

dmitry-markin commented Dec 20, 2022

LGTM. Did you test if the error is reported when the threshold is crossed?

Yes, I hardcoded the threshold to 1 and it worked)

@dmitry-markin dmitry-markin requested a review from a team December 21, 2022 15:47
{
// `warning_fired` and `queue_size` are not synchronized, so it's possible
// that the warning is fired few times before the `warning_fired` is seen
// by all threads. This seems better than introducing a mutex guarding them.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why is it better? Would proper synchronization (e.g. mutex) be a serious performance penalty here? Or is there any other reason?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd rather prefer an extra warning then a lock contention, though I don't know how many threads there are usually.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error should never be emitted under normal circumstances, and the probability of emitting more than one error is also quite small imo, so I would avoid locking/holding a mutex every time message/event is sent over the channel. First of all, channels are meant to avoid synchronization, not to introduce it.

Copy link
Contributor

@melekes melekes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

pub fn buffered_link<B: BlockT>() -> (BufferedLinkSender<B>, BufferedLinkReceiver<B>) {
let (tx, rx) = tracing_unbounded("mpsc_buffered_link");
pub fn buffered_link<B: BlockT>(
queue_size_warning: i64,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do we need signed integer here?

Copy link
Contributor Author

@dmitry-markin dmitry-markin Dec 22, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's explained in the comment for a struct field: to avoid underflow if, due to the lack of ordering, the counter happens to go < 0.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Internally: yes. But public API doesn't need to be signed integer. This should have been u32 instead that should still be plenty big for all intents and purposes. Same for tracing_unbounded function.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm also wondering how much of a performance difference it actually makes using Relaxed ordering here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not that relaxed ordering makes sense in terms of performance, it's about not having to bother about synchronization of increments/decrements, why signed integer is used. Relaxed ordering is just a consequence of this decision, because more strong guarantees are not needed if we use the unsigned integer.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've looked into the issue, and as far as I understand it's impossible to guarantee that the counter is never decremented before it's incremented not relying on internals of mpsc::unbounded(). Basically, we have the following events:

Thread A Thread B
increment pull
push decrement

In order for decrement to never happen before increment, push in thread A must synchronize with pull in thread B. Note that this is not a synchronization between operations with our atomic counter, but a synchronization of mpsc::unbounded() operations we are not in control of. We can try setting the strongest sequentially consistent ordering guarantee for increment and decrement, but for this to work push and pull must also be sequentially consistent operations, what is unlikely and cannot be relied on.

Please correct me if I'm missing something.

CC @bkchr

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you use Acquire/Release it should work: https://en.cppreference.com/w/cpp/atomic/memory_order

The compiler should add some barrier that ensures that reads/writes are not reordered.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW @nazar-pc why aren't you just use a channel with a size of 0 and using try_send?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because I didn't see try_send in there, also it is usually for different access patterns. I'd expect it to still produce a warning regardless.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here is a PR implementing exact queue size warning (#13117), but I'd like it to be reviewed by somebody with good understanding of concurrency, atomics, and memory order of operations. If you know who to invite for review, please invite them.

client/consensus/common/src/import_queue/buffered_link.rs Outdated Show resolved Hide resolved
pub fn tracing_unbounded<T>(
key: &'static str,
name: &'static str,
queue_size_warning: i64,
) -> (TracingUnboundedSender<T>, TracingUnboundedReceiver<T>) {
let (s, r) = mpsc::unbounded();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

do you (or anyone else) know the reason for using unbounded channels here? afaik they are never a good idea in production.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a long going story about implementing back-pressure mechanisms in substrate (using bounded channels), but nobody knows how to implement it correctly so far. So at least we added the warning to detect if some of the channels are not being polled and leak messages (and memory).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but nobody knows how to implement it correctly so far

This is clearly not the problem :P The code base has grown and organically. Not all was build from the beginning with async being considered. Unbounded channels give you the opportunity to combine async/sync code in some easy way, but yeah there is no back pressure.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Makes sense. Thanks 🙏

client/utils/src/mpsc.rs Show resolved Hide resolved
@dmitry-markin dmitry-markin merged commit 40cb431 into master Dec 23, 2022
@dmitry-markin dmitry-markin deleted the dm-unbounded-channel-clogging-diagnostics branch December 23, 2022 13:03
pub fn tracing_unbounded<T>(
key: &'static str,
name: &'static str,
queue_size_warning: i64,
) -> (TracingUnboundedSender<T>, TracingUnboundedReceiver<T>) {
let (s, r) = mpsc::unbounded();
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but nobody knows how to implement it correctly so far

This is clearly not the problem :P The code base has grown and organically. Not all was build from the beginning with async being considered. Unbounded channels give you the opportunity to combine async/sync code in some easy way, but yeah there is no back pressure.

queue_size: queue_size.clone(),
queue_size_warning,
warning_fired: Arc::new(AtomicBool::new(false)),
creation_backtrace: Arc::new(Backtrace::capture()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need to resolve and we should not require RUST_BACKTRACE to be set. If we then want to print the backtrace, we should call resolve().

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Addressed in #13020.


let queue_size = self.queue_size.fetch_add(1, Ordering::Relaxed);
if queue_size == self.queue_size_warning &&
!self.warning_fired.load(Ordering::Relaxed)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also addressed in #13020.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
A0-please_review Pull request needs code review. B0-silent Changes should not be mentioned in any release notes C1-low PR touches the given topic and has a low impact on builders. D3-trivial 🧸 PR contains trivial changes in a runtime directory that do not require an audit
Projects
Status: done
Development

Successfully merging this pull request may close these issues.

Runtime diagnostics for leaked messages in unbounded channels
6 participants