Runtime diagnostics for leaked messages in unbounded channels #12971
Conversation
Converted the PR into a draft to address all unbounded channels of Substrate in one PR.
LGTM. Did you test if the error is reported when the threshold is crossed?
Yes, I hardcoded the threshold to
```rust
{
	// `warning_fired` and `queue_size` are not synchronized, so it's possible
	// that the warning is fired a few times before `warning_fired` is seen
	// by all threads. This seems better than introducing a mutex guarding them.
```
Why is it better? Would proper synchronization (e.g. a mutex) be a serious performance penalty here? Or is there another reason?
I'd rather have an extra warning than lock contention, though I don't know how many threads there usually are.
This error should never be emitted under normal circumstances, and the probability of emitting more than one error is also quite small IMO, so I would avoid locking/holding a mutex every time a message/event is sent over the channel. First of all, channels are meant to avoid synchronization, not to introduce it.
```diff
-pub fn buffered_link<B: BlockT>() -> (BufferedLinkSender<B>, BufferedLinkReceiver<B>) {
-	let (tx, rx) = tracing_unbounded("mpsc_buffered_link");
+pub fn buffered_link<B: BlockT>(
+	queue_size_warning: i64,
```
Why do we need a signed integer here?
It's explained in the comment on the struct field: to avoid underflow if, due to the lack of ordering, the counter happens to go below 0.
Internally: yes. But the public API doesn't need a signed integer. This should have been `u32` instead, which is still plenty big for all intents and purposes. Same for the `tracing_unbounded` function.
I'm also wondering how much of a performance difference using Relaxed ordering here actually makes.
It's not that relaxed ordering makes sense in terms of performance; it's about not having to bother with synchronization of increments/decrements, which is why a signed integer is used. Relaxed ordering is just a consequence of that decision: stronger guarantees are not needed if we use a signed integer.
I've looked into the issue, and as far as I understand it's impossible to guarantee that the counter is never decremented before it's incremented without relying on the internals of `mpsc::unbounded()`. Basically, we have the following events:

| Thread A | Thread B |
|---|---|
| increment | |
| push | |
| | pull |
| | decrement |

In order for `decrement` to never happen before `increment`, `push` in thread A must synchronize with `pull` in thread B. Note that this is not synchronization between operations on our atomic counter, but synchronization of the `mpsc::unbounded()` operations we are not in control of. We can try setting the strongest, sequentially consistent ordering guarantee for `increment` and `decrement`, but for this to work `push` and `pull` must also be sequentially consistent operations, which is unlikely and cannot be relied on.

Please correct me if I'm missing something.
CC @bkchr
If you use `Acquire`/`Release` it should work: https://en.cppreference.com/w/cpp/atomic/memory_order
The compiler should add a barrier that ensures that reads/writes are not reordered.
BTW @nazar-pc, why don't you just use a channel with a size of 0 and `try_send`?
Because I didn't see `try_send` in there; also, it is usually for different access patterns. I'd expect it to still produce a warning regardless.
Here is a PR implementing exact queue size warning (#13117), but I'd like it to be reviewed by somebody with good understanding of concurrency, atomics, and memory order of operations. If you know who to invite for review, please invite them.
```rust
pub fn tracing_unbounded<T>(
	key: &'static str,
	name: &'static str,
	queue_size_warning: i64,
) -> (TracingUnboundedSender<T>, TracingUnboundedReceiver<T>) {
	let (s, r) = mpsc::unbounded();
```
Do you (or anyone else) know the reason for using unbounded channels here? AFAIK they are never a good idea in production.
This is a long-running story about implementing back-pressure mechanisms in Substrate (using bounded channels), but nobody knows how to implement it correctly so far. So at least we added the warning to detect whether some of the channels are not being polled and leak messages (and memory).
> but nobody knows how to implement it correctly so far

This is clearly not the problem :P The code base has grown organically. Not all of it was built from the beginning with async in mind. Unbounded channels give you an easy way to combine async and sync code, but yeah, there is no back pressure.
Makes sense. Thanks 🙏
```rust
queue_size: queue_size.clone(),
queue_size_warning,
warning_fired: Arc::new(AtomicBool::new(false)),
creation_backtrace: Arc::new(Backtrace::capture()),
```
We don't need to resolve it, and we should not require `RUST_BACKTRACE` to be set. If we then want to print the backtrace, we should call `resolve()`.
Addressed in #13020.
```rust
let queue_size = self.queue_size.fetch_add(1, Ordering::Relaxed);
if queue_size == self.queue_size_warning &&
	!self.warning_fired.load(Ordering::Relaxed)
```
Why are you not using https://doc.rust-lang.org/std/sync/atomic/struct.AtomicBool.html#method.compare_exchange?
Also addressed in #13020.
Resolves #12853.

Report an error if the unbounded channels `out_events::channel()` and `mpsc::tracing_unbounded()` grow above the predefined configurable threshold. The channel name and the backtrace of where it was created (if `RUST_BACKTRACE=1` is set) are reported.