Warnings in GRPC ClientSubscriber Pubsub #12964
Comments
There's a lot of code in that repro repository. I'm going to try to create a much shorter version. I strongly suspect this is just a matter of warnings being logged by https://github.com/googleapis/google-cloud-dotnet/blob/main/apis/Google.Cloud.PubSub.V1/Google.Cloud.PubSub.V1/SubscriberClientImpl.SingleChannel.cs#L356, and that should quite possibly be a debug level log instead (given that it's being retried automatically). |
Right, here's the rather shorter version: #12965 There are two potential aspects here:
The latter is really easy to fix by just changing it to the "debug" log level. The first will need more investigation. |
Hi Jon, BTW after reproducing with your sample, the root cause seems to be that "moveNextTask" faulted with "System.Net.Quic.QuicException: The connection timed out from inactivity". So I searched and found this recently merged fix in .NET QUIC: dotnet/runtime#102147. Or maybe this open HTTP/3 issue: dotnet/runtime#87478. All QUIC "timeout" issues are reported here: https://github.com/dotnet/runtime/issues?q=is%3Aopen+label%3Aarea-System.Net.Quic+timeout My local logs
|
I don't think this is a QUIC issue, as otherwise pinning to HTTP/2.0 would have fixed it. I'll validate that though, and also try on .NET 6 (as we've seen some differences in behavior there) |
Right, the failure finally happens here, as per the screenshot |
Well, that's where the exception is being thrown. But the underlying question is why the status code is Unavailable at that point. That could have a number of actual causes. I'm hoping to find time to investigate more today. |
Okay, a few more pieces of information:
@kamisoft-fr: Other than the backoff aspect (which maxes out at 30s, so you could get a delay in seeing messages due to that if there's been a previous long spell of inactivity) I suspect the library is actually behaving fine in terms of processing subscriptions, and you shouldn't see any application-level issues. Have you noticed any semantic problems, or was it just the warning logs that were causing concern? |
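To illustrate why a capped backoff can delay messages after a long quiet spell, here is a minimal sketch (in Python, for brevity) of exponential backoff with a ceiling. Only the 30-second cap comes from this thread; the 1-second initial delay, the doubling multiplier and the function name `backoff_delays` are assumptions for illustration, not the library's actual settings.

```python
def backoff_delays(attempts, initial=1.0, multiplier=2.0, cap=30.0):
    """Return the delay (in seconds) before each of `attempts` retries.

    `initial` and `multiplier` are hypothetical values; `cap` is the 30s
    maximum mentioned in the thread.
    """
    delays = []
    delay = initial
    for _ in range(attempts):
        delays.append(min(delay, cap))  # never wait longer than the cap
        delay *= multiplier
    return delays


delays = backoff_delays(8)
print(delays)
```

Under these assumed parameters the delay reaches the cap after a handful of consecutive failures, so a stream that fails repeatedly during inactivity would wait up to 30 seconds before its next pull.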
A note on QUIC: it looks like the QUIC stream usually times out after 60 seconds; disabling QUIC increases the pull time to about 90 seconds usually. If we get more control over QUIC timeouts in the future, we'll probably increase that automatically (and use keep-alive) - but I don't think it makes a particularly material difference here. |
I would say the main issue is that it gets worse over time. After a few tens of minutes, we have exceptions logged with stack traces every 2-3 seconds, which slows down the system. |
Ooh, that's interesting and unexpected. I wouldn't expect it to get worse over time. I'll try running my code for longer to see if I can reproduce this. |
Interesting- I definitely haven't seen that. Is that a single SubscriberClient? I do wonder whether there's something else going on. (Often we end up finding multiple causes when looking into a single thing. So what I've reproduced may not be what you're actually seeing.) It might also be worth you specifying the If you have just got a single |
Yes, we have 1 subscriber client per subscription, and in that example I have only one, with the default (unset) ClientCount on SubscriberClientBuilder |
Right. If you could test with a set ClientCount, that would be helpful. Don't feel any pressure though - I hope I'll be able to provide a build with the index either tomorrow or early next week.
Right, and then a lot of underlying clients per subscriber - that would certainly lead to a lot of warnings ("1 warning per underlying client per minute" would translate to 17 x 3 x (say) 16 = 816 warnings per minute in that case, if there are 16 underlying clients per SubscriberClient) |
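The estimate in the comment above can be checked with a few lines of Python. This is only a sketch of the arithmetic: the factors 17 and 3 are taken from the comment as given (their meaning is not spelled out in the surviving text, so the variable names are placeholders), and 16 underlying clients per SubscriberClient is the comment's own "say" assumption.

```python
# Factors exactly as quoted in the comment; names are placeholders.
factor_a = 17
factor_b = 3
clients_per_subscriber = 16          # "(say) 16" in the comment
warnings_per_client_per_minute = 1   # "1 warning per underlying client per minute"

warnings_per_minute = (
    factor_a * factor_b * clients_per_subscriber * warnings_per_client_per_minute
)
print(warnings_per_minute)
```

Lowering the number of underlying clients per SubscriberClient scales this figure down proportionally.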
Sorry, I don't really follow what those graphs are showing. The logs were clearer to me :) |
I'm not sure what you'd consider a "success" though - because my understanding is that StreamingPull basically stays alive until the server chops it. It's a streaming call; I wouldn't expect to see it "complete" unless the server disconnects. |
Yes that's a behavior that seems to come from the server, otherwise the client would cut the connection systematically at 60s if it was a default configured client timeout |
Right. My point is that I think it's basically okay for the server to chop the stream - I have no idea why you're getting some requests "completing" after only a few milliseconds; my guess is that whatever client you're using is stopping when it gets the first response. Trying to use a "normal" HTTP client to diagnose a bidirectional streaming call is error-prone, and I think that's what you're seeing. |
I've released Google.Cloud.PubSub.V1 version 3.13.0 which has the changes mentioned earlier:
If you're easily able to upgrade your test code to use this (and lower the logging level to debug) it might help us to identify why you're getting so many log entries - I haven't managed to reproduce that yet. |
Okay, let's try it; I'll come back to you ASAP on this. PS: take a look at this issue -> grpc/grpc-dotnet#2361 (comment) |
I'm aware of that, or related ones, but are you suggesting that's playing a role here? If so, have you tried setting the AppContext switch in your PubSub code? |
Hi Jon, logs : |
@kamisoft-fr: Thanks for that. Can I check whether this is a single SubscriberClient, or multiple ones? |
That's for 3 subscription clients |
Aha. And it looks like there are 20 clients within each SubscriberClient, meaning 60 clients in all. The log accounts for just under 6 minutes, with ~567 entries. Each restart accounts for 2 log entries (one for "there's been an error, will retry" and one for "delaying for N seconds before retrying"), which means we've got ~285 restarts - so just under 5 per client. That fits in with the general theory of "each client's streaming pull request lasts about a minute, then it fails, backs off (which is a bug being fixed separately) and retries"... so I don't think we've actually got anything left to explain. Disabling QUIC might increase the time of each streaming pull from 60 seconds to 90-120s, but it's not going to be an order-of-magnitude thing. Unless you know you'll have a lot of messages to process, you may well want to reduce the client count per SubscriberClient (using the ClientCount property in the builder); that's something for you to work out. With all of this, is there anything else you're concerned about? If not, I think it's best to close this issue, but with follow-ups of:
Does that sound okay? |
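The log arithmetic in the comment above can be reproduced with a short Python sketch. All the inputs are the figures quoted in the comment (3 SubscriberClients, 20 underlying clients each, ~567 log entries, 2 entries per restart); the variable names are just for illustration.

```python
subscriber_clients = 3
clients_per_subscriber = 20
log_entries = 567          # approximate count quoted in the comment
entries_per_restart = 2    # "error, will retry" + "delaying for N seconds"

total_clients = subscriber_clients * clients_per_subscriber
restarts = log_entries / entries_per_restart
restarts_per_client = restarts / total_clients

print(total_clients, restarts, round(restarts_per_client, 2))
```

With roughly one restart per client per minute over just under 6 minutes, a figure slightly below 5 restarts per client is what the "stream lasts about a minute, then retries" theory predicts.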
Yes, that sounds good to me, looking forward to the next release :) |
Well that's mostly just having changed warnings to debug, but I can see how it would be calming :) (It really is expected, after all.) I'll close this issue now, but post updates about the bullets listed above if we know more. |
We've heard back from the Pub/Sub team about the bottom two bullets. They're going to think about how to indicate "yes, the stream has been set up okay" to avoid unnecessary backoff (and we'll probably put a time-based check in the client for now), and there's already work in progress to try to increase the length that each stream survives for - although at the moment QUIC may be the immediate cause of stream shutdown. No action required on your part at the moment :) |
As of version 3.14.0, the "back off at stream cutoff after a minute" is fixed, so you may want to update that to improve responsiveness. |
Right - those all seem expected. (I might update the logging to include the client index in the "Retrying with no backoff" part at some point, but that's slightly tricky.) Sounds like all is okay :) (I'm hoping at some point that pull streams will last longer, but that's a long term issue with various moving parts, and is more of a nice-to-have than a real problem.) |
Environment details
OS: doesn't seem to matter (tested it in Windows 11 as well as in an Ubuntu Docker container)
.NET version: net8.0
Package name and version: Google.Cloud.PubSub.V1, Version 3.12.0
Steps to reproduce
Download sample from
https://github.com/kamisoft-fr/GooglePubsubSampleApp
(shorter version: #12965)
Replace with your GCP settings: projectid & topicName
The subscription name can be left as is
Run it
Open the "Output" window
Now wait for about 70 seconds. After that, Visual Studio begins to log occurring exceptions:
Thanks!