Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Warnings in GRPC ClientSubscriber Pubsub #12964

Closed
kamisoft-fr opened this issue May 15, 2024 · 34 comments
Closed

Warnings in GRPC ClientSubscriber Pubsub #12964

kamisoft-fr opened this issue May 15, 2024 · 34 comments
Assignees
Labels
api: pubsub Issues related to the Pub/Sub API. type: question Request for information or clarification. Not an issue.

Comments

@kamisoft-fr
Copy link

kamisoft-fr commented May 15, 2024

Environment details

OS: doesn't seem to matter (tested it in Windows 11 as well as in a Ubuntu Docker container)
.NET version: net8.0
Package name and version: Google.Cloud.PubSub.V1, Version 3.12.0

Steps to reproduce

Download sample from https://github.com/kamisoft-fr/GooglePubsubSampleApp
shorter version: #12965

Replace with your GCP settings : projectid & topicName
subcription name can be left as is

Run it

Open the "Output" window
Now wait for about 70 seconds. After that, Visual Studio begins to log occurring exceptions:

15:39:19 dbug: Microsoft.Extensions.Hosting.Internal.Host[1] Hosting starting
15:39:19 trce: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Starting duplex streaming API call to StreamingPull.
15:39:19 trce: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Starting duplex streaming API call to StreamingPull.
15:39:19 trce: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Starting duplex streaming API call to StreamingPull.
15:39:19 trce: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Starting duplex streaming API call to StreamingPull.
15:39:19 trce: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Starting duplex streaming API call to StreamingPull.
15:39:19 info: Microsoft.Hosting.Lifetime[0] Application started. Press Ctrl+C to shut down.
15:39:19 info: Microsoft.Hosting.Lifetime[0] Hosting environment: development
15:39:19 dbug: Microsoft.Extensions.Hosting.Internal.Host[2] Hosting started
15:40:22 warn: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Recoverable error in streaming pull; will retry. System.AggregateException: One or more errors occurred. (Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]")) ---> Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]") at Grpc.Net.Client.Internal.HttpContentClientStreamReader2.MoveNextCore(CancellationToken cancellationToken) --- End of inner exception stack trace --- 15:40:22 warn: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Recoverable error in streaming pull; will retry. System.AggregateException: One or more errors occurred. (Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]")) ---> Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]") at Grpc.Net.Client.Internal.HttpContentClientStreamReader2.MoveNextCore(CancellationToken cancellationToken) --- End of inner exception stack trace ---
15:40:22 dbug: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Delaying for 0s before streaming pull call.
15:40:22 dbug: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Delaying for 0s before streaming pull call.
15:40:22 trce: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Starting duplex streaming API call to StreamingPull.
15:40:22 trce: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Starting duplex streaming API call to StreamingPull.
15:40:27 warn: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Recoverable error in streaming pull; will retry. System.AggregateException: One or more errors occurred. (Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]")) ---> Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]") at Grpc.Net.Client.Internal.HttpContentClientStreamReader`2.MoveNextCore(CancellationToken cancellationToken) --- End of inner exception stack trace ---
15:40:27 dbug: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Delaying for 0s before streaming pull call.
15:40:27 trce: Common.Infrastructure.Pubsub.Google.GoogleChannelSubscriber[0] Starting duplex streaming API call to StreamingPull.

Thanks!

@jskeet
Copy link
Collaborator

jskeet commented May 15, 2024

There's a lot of code in that repro repository. I'm going to try to create a much shorter version.

I strongly suspect this is just a matter of warnings being logged by https://github.com/googleapis/google-cloud-dotnet/blob/main/apis/Google.Cloud.PubSub.V1/Google.Cloud.PubSub.V1/SubscriberClientImpl.SingleChannel.cs#L356, and that should quite possibly be a debug level log instead (given that it's being retried automatically).

@jskeet jskeet added type: question Request for information or clarification. Not an issue. api: pubsub Issues related to the Pub/Sub API. labels May 15, 2024
@jskeet
Copy link
Collaborator

jskeet commented May 15, 2024

Right, here's the rather shorter version: #12965

There are two potential aspects here:

  • Sometimes I see more of an exception/retry loop than I'd expect (although it does always recover)
  • The fact that it's being logged as a warning when I suspect that restarting the pull once every minute or so is "working as intended" is non-ideal

The latter is really easy to fix by just changing it to "debug" warning level. The first will need more investigation.

@kamisoft-fr
Copy link
Author

Hi Jon,
indeed that's a much shorter version than mine ! Thank you for taking the time to do it.
and now we can both reproduce the issue.
This confirms that it is not a misuse of the library on my part, which could have been the case :)

BTW after reproducing with your sample, the root cause seems to be the "moveNextTask" faulted with "System.Net.Quic.QuicException: The connection timed out from inactivity"

So I searched & found this recently merged fix on .NET QUIC : dotnet/runtime#102147

Or may be this opened on HTTP3 issue : dotnet/runtime#87478

All QUIC "timeout" issues are reported here https://github.com/dotnet/runtime/issues?q=is%3Aopen+label%3Aarea-System.Net.Quic+timeout

My local logs

2024-05-15T19:37:42.566Z info: Timing[0] Starting subscriber
2024-05-15T19:38:52.193Z warn: Subscriber[0] Recoverable error in streaming pull; will retry. System.AggregateException: One or more errors occurred. (Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]")) ---> Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]") at Grpc.Net.Client.Internal.HttpContentClientStreamReader2.MoveNextCore(CancellationToken cancellationToken) --- End of inner exception stack trace --- 2024-05-15T19:38:52.193Z warn: Subscriber[0] Recoverable error in streaming pull; will retry. System.AggregateException: One or more errors occurred. (Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]")) ---> Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]") at Grpc.Net.Client.Internal.HttpContentClientStreamReader2.MoveNextCore(CancellationToken cancellationToken) --- End of inner exception stack trace ---
2024-05-15T19:38:53.539Z warn: Subscriber[0] Recoverable error in streaming pull; will retry. System.AggregateException: One or more errors occurred. (Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]")) ---> Grpc.Core.RpcException: Status(StatusCode="Unavailable", Detail="The service was unable to fulfill your request. Please try again. [code=8a75]") at Grpc.Net.Client.Internal.HttpContentClientStreamReader`2.MoveNextCore(CancellationToken cancellationToken) --- End of inner exception stack trace ---

@jskeet
Copy link
Collaborator

jskeet commented May 16, 2024

I don't think this is a QUIC issue, as otherwise pinning to HTTP/2.0 would have fixed it. I'll validate that though, and also try on .NET 6 (as we've seen some differences in behavior there)

@kamisoft-fr
Copy link
Author

I don't think this is a QUIC issue, as otherwise pinning to HTTP/2.0 would have fixed it. I'll validate that though, and also try on .NET 6 (as we've seen some differences in behavior there)

right, finally failure happens here
https://github.com/grpc/grpc-dotnet/blob/ef1f0d209c3e40c923a6f5830871402eecded030/src/Grpc.Net.Client/Internal/HttpContentClientStreamReader.cs#L177

as per screenshot

image

@jskeet
Copy link
Collaborator

jskeet commented May 16, 2024

Well, that's where the exception is being thrown. But the underlying question is why the status code is Unavailable at that point. That could have a number of actual causes.

I'm hoping to find time to investigate more today.

@jskeet
Copy link
Collaborator

jskeet commented May 16, 2024

Okay, a few more pieces of information:

  • As I mentioned before, the fact that these are shown as warnings is probably a mistake, given that it doesn't actually indicate that anything's wrong. I'll reduce the level to debug.
  • The reason we're seeing lots of failures is because the library starts multiple clients. This can be configured via SubscriberClientBuilder.ClientCount, but normally it's fine to leave it alone. I'm going to suggest that we add a "client index" to the log entries from SubscriberClient so we can easily tell the clients apart (while still using a single ILogger instance).
  • I've tried with .NET 6, .NET 8 and Grpc.Core - in all cases, I see each StreamingPull call fail after 60-120 seconds. I'm going to ask the Pub/Sub team if that's expected. The QUIC stream timing out may mean there are two different causes for the failure, which isn't idea.
  • There's a further complication that we can't currently tell the difference between "this is a real failure, please back off exponentially" and "yup, the stream just timed out, it's fine", so between calls to StreamingPull we're backing off exponentially if the pull fails without seeing any messages. Again, I'm going to ask the Pub/Sub team whether there's some better way of determining the cause of the stream being broken.

@kamisoft-fr: Other than the backoff aspect (which maxes out at 30s, so you could get a delay in seeing messages due to that if there's been a previous long spell of inactivity) I suspect the library is actually behaving fine in terms of processing subscriptions, and you shouldn't see any application-level issues. Have you noticed any semantic problems, or was it just the warning logs that were causing concern?

@jskeet
Copy link
Collaborator

jskeet commented May 16, 2024

A note on QUIC: it looks like the QUIC stream usually times out after 60 seconds; disabling QUIC increases the pull time to about 90 seconds usually. If we get more control over QUIC timeouts in the future, we'll probably increase that automatically (and use keep-alive) - but I don't think it makes a particularly material difference here.

@kamisoft-fr
Copy link
Author

@kamisoft-fr: Other than the backoff aspect (which maxes out at 30s, so you could get a delay in seeing messages due to that if there's been a previous long spell of inactivity) I suspect the library is actually behaving fine in terms of processing subscriptions, and you shouldn't see any application-level issues. Have you noticed any semantic problems, or was it just the warning logs that were causing concern?

I would say the main issue is that it gets worse over time. After a few tens of minutes, we have exceptions logged with stack traces every 2-3 seconds, which slows down the system.
Regarding the application side, we are indeed receiving the Pub/Sub messages as expected, but this is still under verification

@jskeet
Copy link
Collaborator

jskeet commented May 16, 2024

I would say the main issue is that it gets worse over time. After a few tens of minutes, we have exceptions logged with stack traces every 2-3 seconds, which slows down the system.

Ooh, that's interesting an unexpected. I wouldn't expect it to get worse over time. I'll try running my code for longer to see if I can reproduce this.

@kamisoft-fr
Copy link
Author

Here is an example on my machine

image

@jskeet
Copy link
Collaborator

jskeet commented May 16, 2024

Interesting- I definitely haven't seen that. Is that a single SubscriberClient? I do wonder whether there's something else going on. (Often we end up finding multiple causes when looking into a single thing. So what I've reproduced may not be what you're actually seeing.) It might also be worth you specifying the ClientCount (e.g. 8 so it's not going to throttle you, but it won't be too huge).

If you have just got a single SubscriberClient, then when I've improved the logging that should help diagnose this further (although you'll probably end up needing to turn the log level to Debug to see it, as I would like to make that the level for the current warning).

@kamisoft-fr
Copy link
Author

Yes we have 1 subcriber client per subcription, and in that example I have only one with default (unset) ClientCount on SubscriberClientBuilder
However, in production, we have 17 subscriptions and 2 to 3 times more SubscriberClients (1 per subcription x containers)

@jskeet
Copy link
Collaborator

jskeet commented May 16, 2024

Yes we have 1 subcriber client per subcription, and in that example I have only one with default (unset) ClientCount on SubscriberClientBuilder

Right. If you could test with a set ClientCount, that would be helpful. Don't feel any pressure though - I hope I'll be able to provide a build with the index either tomorrow early next week.

However, in production, we have 17 subscriptions and 2 to 3 times more SubscriberClients (1 per subcription x containers)

Right, and then a lot of underlying clients per subscriber - that would certainly lead to a lot of warnings ("1 warning per underlying client per minute" would translate to 17 x 3 x (say) 16 = 816 warnings per minute in that case, if there are 16 underlying clients per SubscriberClient)

@kamisoft-fr
Copy link
Author

Here is a timeline, when calls work, it last more than 60 seconds, but when it fails, all in few ms or 60ish seconds

OK
image

KO
image

@jskeet
Copy link
Collaborator

jskeet commented May 16, 2024

Sorry, I don't really follow what those graphs are showing. The logs were clearer to me :)

@kamisoft-fr
Copy link
Author

haha :)
each row reprensent an HTTP call to POST /google.pubsub.v1.Subscriber/StreamingPull
so I was only saying that when a request fails, it's always in 60 seconds, for all the others (OK) it's generally more than 60s

image

@jskeet
Copy link
Collaborator

jskeet commented May 16, 2024

I'm not sure what you'd consider a "success" though - because my understanding is that StreamingPull basically stays alive until the server chops it. It's a streaming call I wouldn't expect to see it "complete" unless the server disconnects.

@kamisoft-fr
Copy link
Author

Yes that's a behavior that seems to come from the server, otherwise the client would cut the connection systematically at 60s if it was a default configured client timeout

@jskeet
Copy link
Collaborator

jskeet commented May 16, 2024

Right. My point is that I think it's basically okay for the server to chop the stream - I have no idea why you're getting some requests "completing" after only a few milliseconds; my guess is that whatever client you're using is stopping when it gets the first response. Trying to use a "normal" HTTP client to diagnose a bidirectional streaming call is error-prone, and I think that's what you're seeing.

@jskeet
Copy link
Collaborator

jskeet commented May 17, 2024

I've released Google.Cloud.PubSub.V1 version 3.13.0 which has the changes mentioned earlier:

  • It includes a client index in logs for SubscriberClient
  • The "recoverable error" log is now at debug level

If you're easily able to upgrade your test code to use this (and lower the logging level to debug) it might help us to identify why you're getting so many log entries - I haven't managed to reproduce that yet.

@kamisoft-fr
Copy link
Author

Okay lets try it, I come back to you ASAP on this

PS : take a look at this issue -> grpc/grpc-dotnet#2361 (comment)

@jskeet
Copy link
Collaborator

jskeet commented May 17, 2024

PS : take a look at this issue -> grpc/grpc-dotnet#2361 (comment)

I'm aware of that, or related ones, but are you suggesting that's playing a role here? If so, have you tried setting the AppContext switch in your PubSub code?

@kamisoft-fr
Copy link
Author

Hi Jon,
I'm circling back on the topic, I activated the DEBUG mode and ran my application with the logs showing.

logs :
12964_logs.txt

@jskeet
Copy link
Collaborator

jskeet commented May 21, 2024

@kamisoft-fr: Thanks for that. Can I check whether this is a single SubscriberClient, or multiple ones?

@kamisoft-fr
Copy link
Author

That's for 3 subcriptions clients

@jskeet
Copy link
Collaborator

jskeet commented May 21, 2024

That's for 3 subcriptions clients

Aha. And it looks like there are 20 clients within each SubscriberClient, meaning 60 clients in all.

The log accounts for just under 6 minutes, with ~567 entries. Each restart accounts for 2 log entries (one for "there's been an error, will retry" and one for "delaying for N seconds before retrying), which means we've got ~285 restarts - so just under 5 per client.

That fits in with the general theory of "each client's streaming pull request lasts about a minute, then it fails, backs off (which is a bug being fixed separately) and retries"... so I don't think we've actually got anything left to explain.

Disabling QUIC might increase the time of each streaming pull from 60 seconds to 90-120s, but it's not going to be an order-of-magnitude thing.

Unless you know you'll have a lot of messages to process, you may well want to reduce the client count per SubscriberClient (using the ClientCount property in the builder); that's something for you to work out.

With all of this, is there anything else you're concerned about? If not, I think it's best to close this issue, but with follow-ups of:

  • We have work to do on QUIC support (or disablement) in general
  • We shouldn't back off after the stream has dropped, if it's dropped in a "normal" way
  • We need to see whether each stream only lasting 1-2 minutes is expected

Does that sound okay?

@kamisoft-fr
Copy link
Author

Yes that sounds good to me, looking forward the next release :)
Thanks for the investigation, and anyway, I must say that the devops team is seeing fewer warnings, so they are already a bit reassured !

@jskeet
Copy link
Collaborator

jskeet commented May 21, 2024

I must say that the devops team is seeing fewer warnings, so they are already a bit reassured !

Well that's mostly just having changed warnings to debug, but I can see how it would be calming :) (It really is expected, after all.) I'll close this issue now, but post updates about the bullets listed above if we know more.

@jskeet jskeet closed this as completed May 21, 2024
@jskeet
Copy link
Collaborator

jskeet commented May 22, 2024

We've heard back from the Pub/Sub team about the bottom two bullets. They're going to think about how to indicate "yes, the stream has been set up okay" to avoid unnecessary backoff (and we'll probably put a time-based check in the client for now), and there's already work in progress to try to increase the length that each stream survives for - although at the moment QUIC may be the immediate cause of stream shutdown. No action required on your part at the moment :)

@kamisoft-fr
Copy link
Author

ok that's cool !

if it can help, here are the errors reported in PROD by datadog, it seems that HTTP 2 is selected as default in those linux envs:

image

AND

image

@jskeet
Copy link
Collaborator

jskeet commented Jun 4, 2024

As of version 3.14.0, the "back off at stream cutoff after a minute" is fixed, so you may want to update that to improve responsiveness.

@kamisoft-fr
Copy link
Author

Hi jon, thanks for the update, here is the new behavior with debug logs activated

image

@jskeet
Copy link
Collaborator

jskeet commented Jun 6, 2024

Right - those all seem expected. (I might update the logging to include the client index in the "Retrying with no backoff" part at some point, but that's slightly tricky.) Sounds like all is okay :) (I'm hoping at some point that pull streams will last longer, but that's a long term issue with various moving parts, and is more of a nice-to-have than a real problem.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api: pubsub Issues related to the Pub/Sub API. type: question Request for information or clarification. Not an issue.
Projects
None yet
Development

No branches or pull requests

2 participants