Subscription errors and does not reconnect when connection is cancelled by server #979
Comments
@hongalex Hey Alex, these default / retry updates were something you worked on. Did you have any thoughts on the change that led to this?
Thanks for filing this issue! Indeed, the timeout for a streaming pull stream is 15 minutes, but it shouldn't be returning a CANCELLED status code. We removed CANCELLED as a retryable error per https://aip.dev/194, so we're currently investigating where the status is being surfaced from.
I haven't been able to get the repro code to work on a local emulator (it doesn't time out) but I'll try again tomorrow with the real server. There's some thought that there may be a race condition between the client library timing out and the server timing out. There's also this possibility:
Hello, we experienced the same problem in our application. Any news on that?
It seems that instead of the
@feywind Can we get this issue bumped up to a p1 and prioritized? It seems a subscription that is idle for 15 minutes just completely loses connectivity and the client doesn't properly reconnect. Otherwise, if we are supposed to handle reconnecting ourselves, then can we get some guidance on how to do that? Thank you!
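For anyone needing a stopgap while the retry behavior is sorted out, here is a minimal sketch of one way to reconnect manually, assuming the Subscription object's close() and open() methods and a placeholder subscription name; this is an illustration, not official guidance from the maintainers.

```js
// Illustrative sketch only: re-open the subscription when the underlying
// stream surfaces an error. The subscription name is a placeholder.
const {PubSub} = require('@google-cloud/pubsub');

const pubsub = new PubSub();
const subscription = pubsub.subscription('example-subscription');

subscription.on('message', message => {
  console.log('received', message.id);
  message.ack();
});

subscription.on('error', err => {
  console.error('subscription stream error', err);
  // Tear down the broken stream, then open a fresh one after a short delay.
  subscription
    .close()
    .catch(() => {})
    .then(() => setTimeout(() => subscription.open(), 5000));
});
```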
Having the same error in the same situation after 15 mins of retries. Do you guys have any updates on this issue?
Hi everyone, We've been looking at this today, and it seems like it's likely to be something specific to the Node library. Unfortunately I've also been unable to reproduce it locally, across 3 different Node versions, which leads me to wonder if there's something going on between Node and GKE. Re: prioritization, that is indeed happening right now. If this is actively blocking you, you might try the workaround suggested in the original post (adding CANCELLED back to the retry list), but I can't really recommend that as a long-term fix. I think we were just papering over an issue before.
I am on Node 14.8, macOS Catalina, and using the PubSub emulator in Docker.
Those are good data points, thank you!
I've duplicated that setup here, and it seems I still can't reproduce it. I'm using an almost unmodified version of the original repro above. So it seems like there has got to be some kind of difference in environment, if it's consistent for you all. Here's what I have:
This log pops up in the emulator window, but I'm unsure if it's related. Does it look familiar?
Hopefully there's something useful in there and we can compare notes to see what's potentially different.
We have this issue in GKE. Kubernetes and the nodes are all on 1.16.13-gke.401. Our container is FROM node:10-alpine. We are using v2.1.0 of @google-cloud/pubsub. This is our production workload. We have an active support contract and I am happy to file a ticket if it would be helpful.
I was able to reproduce this with the Node.js and .NET pubsub clients, actually with the same setup as you @feywind. It doesn't happen when I use the pubsub emulator directly in the CLI. The only difference is I am on 10.15.3.
Same, it appears it works when using the emulator on the host machine directly but not through Docker.
@mrothroc It might, actually. Let me see if I can get you in touch with the TSE that's looking at this.
Also @giuliano-barberi-tf do you have a .NET snippet that you're using for reproducing it? That might be a helpful data point too.
I do not have something easy to share. I'm just seeing this in my application running locally but it's fairly vanilla just using SubscriberClient. Something like the sample code at https://googleapis.github.io/google-cloud-dotnet/docs/Google.Cloud.PubSub.V1/ will work to repro it and you can see it stops listening after the StartAsync call:
After 15 minutes I see the output. Thanks!
@feywind here are the full logs as discussed. Looking through them myself around the end, I noticed
Which AFAIK is
@jeffijoe Do you mind sending me the package.json and the log-patched test code you're using? I think I see something interesting here, but I still want to try reproducing it here before guessing.
Thanks! I'll take a look Monday.
@jeffijoe Hey Jeff, Somehow, even using your repro files with the lockfile and all, I haven't been able to get it to reproduce here. It ran for almost an hour with no errors, and at the end of that, I could still publish to it using the pubber.js. Considering how similar our test environments are now, I'm starting to wonder if there's some sort of firewall issue that's causing grpc issues with an idle connection. Either way, I'll keep digging on it tomorrow. Thanks for the patience.
We are running in GKE and as far as I know there is no firewall. Not sure what sits inside Google Cloud between GKE and Pub/Sub, but we don't have anything.
Ah yeah, that actually is pretty interesting. Thanks! Also, within a Mac running Docker with the emulator, there wouldn't be any external firewalls. I know sometimes rules are set on individual machines by corporate policies or whatnot. Those would naturally be different between us, so I don't know if that would cause the variance that makes it hard for me to reproduce. GKE might do something similar? Also, I think that whether we can reproduce it or not, ultimately the error handling in the client library could probably help with this. So I might start thinking in that direction.
Hey, good(?) news!
I only saw this with grpc-js, which is also an interesting data point. I'll be bugging a grpc person about this probably.
Quick update, that conversation is happening now.
Thank you for working so hard on this issue!
Hey, sorry it's taken so long! I've decided for now that I'm going to re-add CANCELLED to the retry list to get everyone back and going. I'll make a separate issue for fixing whatever is causing the CANCELLED.
Waiting for release and new issue creation here.
@jeffijoe 2.6.0 should now be available, adding CANCELLED back to the retry codes. This is just an interim thing to get you going again. If the cancels are happening, it might be causing a secondary issue where the emulator becomes unhappy about too many connections (after a long time). I'm still investigating why it's happening in the first place in #1135.
Will give it a try! 🙏
I'm going to go ahead and close this for now, please feel free to re-open if 2.6.0 didn't get you going again. I'm going to continue my investigation of why the disconnects are happening over in #1135.
It's definitely more stable now, but I do have cases where I still need to restart. As an FYI, this gets logged occasionally, usually after waking up my computer:
We are currently experiencing this same issue using the gcloud-sdk Docker image. What's the possible fix? @feywind
Sorry to be that guy, but we're still having this issue. Disregard my earlier comment on being more stable, it's been the same, with the exception of now seeing a
Hello, this is the code where we get the error |
After upgrading @google-cloud/pubsub to escape a CPU issue in @grpc/grpc-js, we found that services using a long-running pubsub subscription began to crash periodically during local development and in GKE.

Issue

When the pubsub service closes the underlying http2 connection (this happens after 15 minutes), the connected subscription instance emits an error event and does not reconnect.

In a previous version of the library, the subscription instance would reconnect (and wouldn't emit an error event). However, it looks like the CANCELLED status was removed from the RETRY_CODES list found in src/pull-retry.ts in this commit, which means we skip the retry block and move on to destroy the stream with a StatusError in src/message-stream.ts here.

Since the pubsub service reserves the right to cancel a connection at any time, I would expect this library to handle reconnecting when the pubsub service cancels said connection and not emit an error message on the subscription instance.

The simplest workaround I've found so far is to manually add the retry code for CANCELLED back into the RETRY_CODES list before requiring @google-cloud/pubsub proper.
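A sketch of what that patch might look like, assuming the compiled module exposes the RETRY_CODES array under build/src/pull-retry and that CANCELLED is grpc status code 1; this is an internal path rather than a public API, so it may differ between versions.

```js
// Hedged sketch of the monkey-patch workaround; the module path and the
// exported RETRY_CODES array are library internals and may change.
const {RETRY_CODES} = require('@google-cloud/pubsub/build/src/pull-retry');

const CANCELLED = 1; // grpc status code for CANCELLED
if (!RETRY_CODES.includes(CANCELLED)) {
  RETRY_CODES.push(CANCELLED);
}

// Require the rest of the library only after patching the retry list.
const {PubSub} = require('@google-cloud/pubsub');
```

Because Node caches required modules, the patched array is the same object the subscriber later consults when deciding whether to retry.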
Environment details

@google-cloud/pubsub: 1.7.2
google-gax: 1.15.2
@grpc/grpc-js: 0.7.9
Steps to reproduce
The simplest way to reproduce this issue is to create a subscription to a topic using the pubsub-emulator and wait 15 minutes.
Example reproduction code:
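The original script isn't included in this extract; a minimal sketch along the lines described above, assuming the emulator is reachable via PUBSUB_EMULATOR_HOST and using placeholder project, topic, and subscription names, might look like this:

```js
// Repro sketch: create a topic and subscription against the emulator, start
// listening, and leave it idle. After ~15 minutes the stream errors instead
// of reconnecting. Run with: PUBSUB_EMULATOR_HOST=localhost:8085 node repro.js
const {PubSub} = require('@google-cloud/pubsub');

async function main() {
  const pubsub = new PubSub({projectId: 'repro-project'});
  const [topic] = await pubsub.createTopic('repro-topic');
  const [subscription] = await topic.createSubscription('repro-subscription');

  subscription.on('message', message => {
    console.log('message', message.id);
    message.ack();
  });

  subscription.on('error', err => {
    console.error('subscription error', err);
  });

  console.log('listening; leave idle for 15+ minutes');
}

main().catch(console.error);
```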
Running the above script without sending any messages to the topic yielded this error after 15 minutes: