fix(console): fix the issue of continuous reconnection caused by decode failure #500

Closed
wants to merge 1 commit

Conversation


@ethercflow ethercflow commented Dec 12, 2023

2023-12-12T08:46:56.332653Z WARN tokio_console::conn: error from stream status=status: OutOfRange, message: "Error, message length too large: found 5913832 bytes, the limit is: 4194304 bytes", details: [], metadata: MetadataMap { headers: {} }

By default, tonic's max_decoding_message_size is 4 MB for both clients and servers. When there are too many async tasks, the encoded message exceeds 4 MB and tokio-console fails to decode it.

When decoding fails, the tokio-console interface continuously flips between Connected and Reconnecting. Since making the maximum decoding size a command-line parameter might be a bit strange, this change sets it to 16 MB as a const value.
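For reference, here's a minimal sketch of the change (the connection helper and constant name are illustrative; it assumes the generated `InstrumentClient` from `console-api`):

```rust
use console_api::instrument::instrument_client::InstrumentClient;
use tonic::transport::{Channel, Endpoint, Error};

// Illustrative constant: raise tonic's 4 MiB default decoding limit to 16 MiB.
const MAX_MESSAGE_SIZE: usize = 16 * 1024 * 1024;

async fn connect(target: String) -> Result<InstrumentClient<Channel>, Error> {
    let channel = Endpoint::from_shared(target)?.connect().await?;
    // Without this call, any update larger than 4 MiB fails to decode and
    // the console drops the connection and immediately retries, in a loop.
    Ok(InstrumentClient::new(channel).max_decoding_message_size(MAX_MESSAGE_SIZE))
}
```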

fix(console): fix the issue of continuous reconnection caused by decode failure

Signed-off-by: Wenbo Zhang <wenbo.zhang@iomesh.com>
@ethercflow ethercflow requested a review from a team as a code owner December 12, 2023 08:46
@grahamking
Contributor

grahamking commented Dec 12, 2023

I hit the same error and your patch fixes it. I can connect to my server shortly after it starts, but if I wait a few minutes before connecting I start seeing that log message and tokio-console won't work.

Thanks!

UPDATE: The fix only buys some time but doesn't solve the problem. I eventually got to:

2023-12-12T22:32:31.452815Z WARN tokio_console::conn: error from stream status=status: OutOfRange, message: "Error, message length too large: found 18007258 bytes, the limit is: 16777216 bytes", details: [], metadata: MetadataMap { headers: {} }

Not sure if it's my app or tokio-console. My guess is it's console-subscriber.

@ethercflow
Author

Thank you for the information. Maybe we also need to limit the encoding size. I will do further research to confirm.

@ethercflow
Author

@grahamking The docs at https://github.com/hyperium/tonic/blob/v0.10.2/tonic/src/lib.rs#L62 say:

//! # Code generated client/server configuration
//!
//! ## Max Message Size
//!
//! Currently, both servers and clients can be configured to set the max message encoding and
//! decoding size. This will ensure that an incoming gRPC message will not exahust the systems
//! memory. By default, the decoding message limit is 4MB and the encoding limit is usize::MAX.

So I think we need to call max_encoding_message_size to limit memory usage on the server side. I'm curious: how many async tasks do you have that generate so much data?
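A minimal sketch of what that server-side cap could look like (the helper function is hypothetical; it assumes the generated `InstrumentServer` from `console-api`):

```rust
use console_api::instrument::instrument_server::{Instrument, InstrumentServer};

// Hypothetical helper: mirror the client-side decoding cap with an encoding
// cap on the server, since tonic's default encoding limit is usize::MAX.
fn capped_server<S: Instrument>(service: S) -> InstrumentServer<S> {
    InstrumentServer::new(service).max_encoding_message_size(16 * 1024 * 1024)
}
```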

@grahamking
Contributor

grahamking commented Dec 13, 2023

I don't have very many tasks, maybe 20 per second at most, and they are short-lived.

I think the issue is that when a client first connects, it gets an Update object which contains all data (tasks, resources, async_ops) for the retention period, which defaults to 1 hour. The encoded length of that object can quite easily exceed 4MB.
This object: https://github.com/tokio-rs/console/blob/main/console-subscriber/src/aggregator/mod.rs#L283-L291

Reducing retention to 1 minute prevents the problem, but then I get a different issue where nothing updates in the tokio-console display. Still trying to figure that one out.
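For anyone else hitting this, a minimal sketch of shortening retention via `console-subscriber`'s builder:

```rust
use std::time::Duration;

fn main() {
    // Keep completed tasks/resources/async_ops for 60 seconds instead of
    // the default 1 hour, so the initial Update snapshot stays small.
    console_subscriber::ConsoleLayer::builder()
        .retention(Duration::from_secs(60))
        .init();

    // ... application code ...
}
```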

@ethercflow
Author

Oh I see. Indeed, keeping 1 hour of data by default consumes a lot of memory.

grahamking pushed a commit to grahamking/console that referenced this pull request Dec 14, 2023
Do not record poll_ops if there are no current connected clients
(watchers). Without this `Aggregator::poll_ops` would grow forever.

Follow up to tokio-rs#311 and
fix for these two:
- tokio-rs#184
- tokio-rs#500
@grahamking
Contributor

I think I hit two overlapping problems:

  • If I don't set retention to a short enough duration, the tasks/resources/async_ops data will grow beyond the message size.
  • Even with retention, if I don't connect often enough, the poll_ops (which aren't affected by retention) will grow too large for a message. This fixes it for me (sketched below): fix(subscriber): Don't save poll_ops if no-one is receiving them #501
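The change in #501 boils down to a guard like this (an illustrative sketch; the struct and field names are hypothetical, not the actual aggregator code):

```rust
// Stand-in types for the real aggregator internals.
struct Watcher;
struct PollOp;

struct Aggregator {
    watchers: Vec<Watcher>,
    poll_ops: Vec<PollOp>,
}

impl Aggregator {
    fn record_poll_op(&mut self, op: PollOp) {
        // Before the fix this pushed unconditionally, so `poll_ops` grew
        // forever while no client was connected (retention never trims it).
        if !self.watchers.is_empty() {
            self.poll_ops.push(op);
        }
    }
}
```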

hawkw pushed a commit that referenced this pull request Jan 22, 2024
Do not record poll_ops if there are no current connected clients
(watchers). Without this `Aggregator::poll_ops` would grow forever.

Follow up to #311 and
fix for these two:
- #184
- #500

Fixes #184 

Co-authored-by: Graham King <grahamk@nvidia.com>
Co-authored-by: Hayden Stainsby <hds@caffeineconcepts.com>
@ethercflow ethercflow closed this Feb 23, 2024