-
Notifications
You must be signed in to change notification settings - Fork 654
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Intermittently the client enters a state where it doesnt receive response sent by server #2502
Comments
Did you have all tracers enabled here, or only some of them? You said that this persists on retries; does the server log that it receives multiple requests when those retries occur? Do you have keepalives enabled on the client? Can you trace this interaction with Wireshark or tcpdump and share the dump log? |
All tracers should be enabled, I have GRPC_TRACE="all" and GRPC_VERBOSITY="DEBUG". Yes, retrying results in another set of approximately the same logs on both the client and server I am not passing keepAlive. I will add it. Will look into if it is possible for me to get a Wireshark trace. |
OK, that probably means that the connection is still valid, which means that keepalives probably won't change anything. |
Environment was accidentally removed from this bad state before I could capture tcpdump. Will get tcpdump asap when I repro again, might be a couple days. |
I may be seeing this same issue, but it might be different. Let me know and I can open a separate issue if needed. All these reproductions were using We run a workload in GKE that serves REST requests and also publishes to PubSub. Very occasionally, about once a month, a single pod will go off the rails and continually log THIS DEADLINE_EXCEEDED error from gax-nodejs. PubSub messages are never published, the pod never recovers, and we need to kill the pod. I wrote a small script that, in a loop, awaits a publish to PubSub and waits 500ms. I added error logging to gax-nodejs around HERE. I was able to reproduce. Every error was this stack trace from grpc-node v1.8.17 Stack trace
I found this example of adding channelz support (thank you!). Added that, reproduced again. channelzShortly after startup grpcdebug works fine, but when I come back hours later even healthy pods still publishing to PubSub show this "failed to fetch subchannel" error. 🤷
But json output
I can't get grpcdebug to show any info on the subchannels listed. Converting the trace event timestamps to same timezone as this metric screenshot... Trace events
Everything goes wrong around that 2023-07-21T13:43:34 timestamp. The DEADLINE_EXCEEDED errors start to be logged at 2023-07-21T13:43:38.906 and I am using a 5 second I tried capturing a tcpdump, but there's nothing hitting the wire. I have nothing to show there. I'm not sure what would trigger what I think is the I'm running this on a zonal GKE cluster in us-central1-a and it will reproduce on at least one of ten pods within 12 hours. I've not been able to reproduce on any of: Local PC running on my consumer Internet line, GKE Autopilot in us-central1, regional GKE cluster in us-west1. Please let me know what I can collect that would be most useful for you to debug. I don't have tracing turned on because I didn't want to give myself a big GCP logging bill, but I might be able to write the tracing to a file and get it that way. |
The The channelz error you are seeing may indicate a channelz bug in the client. That component is not well-tested, so I wouldn't be surprised. Can you double check that you cannot get info on any of the listed subchannels? I'll look into this more on Monday. |
I'm seeing both. Sorry about being unclear. With no changes to gax-nodejs, I would only ever see the "Total timeout of API ${apiName} exceeded" error from HERE in gax. Gax does not log or provide the caller for a way to log an error from grpc-js that gax is going to retry. That would be done in gax around this line HERE. There are seven retry codes configured for google.pubsub.v1.Publisher Publish set HERE. That means I don't know which of seven error codes are actually happening, no real idea of what's going on. I added logging into gax (at that line) to log the error message gax is receiving from grpc-js and to also log some of the variables in scope at that point. Here's a screenshot of the logging with my patch I had saved from earlier. These are the retry settings I'm using. retry settings
initialRpcTimeoutMillis=5000 and totalTimeoutMillis=60000 means gax is retrying every roughly 5000ms. The log lines with the green details are the DEADLINE_EXEEDED errors from grpc-js HERE. After 12 tries at 5000ms, over 60,000 ms has passed, which exceeds totalTimeoutMillis so gax throws this error HERE. Once a pod gets into this state it never recovers. Very similar to P0rth0s reporting "The error persists on retries until the server or client is restarted" which is why I added to this issue instead of creating a new one. Thanks for the eyes on it! |
Please try updating your dependencies so that you pick up const pubsubClient = new PubSub({ 'grpc.keepalive_timeout_ms': 10000, 'grpc.keepalive_time_ms': 30000 } as any); The If that doesn't help, we can look into investigating further with trace logs. |
@murgatroid99 I sent you an email containing packet captures. Note I was originally getting DEADLINE_EXCEEDED errors but then started getting CANCELLED errors instead. However all other behavior appears consistent. |
FWIW I was encountering this using the This was on version |
@oldmantaiter From your description of the error, I think you would benefit from enabling keepalives, but unfortunately it looks like there isn't a way to inject options to the relevant gRPC client constructor call here. You may want to open an issue with that repository to get them to either set those options or allow you to set options. |
@murgatroid99 - Yep, that's where I ended up and switched libraries to resolve our issue, I'll open an issue to ask them to allow setting GRPC options so others running into it might be able to enable keepalives in the future. |
After a few goof-ups, I was able to reproduce and grab trace logs. This was with:
I have a tarball of the trace logs available here: https://github.com/wwilfinger/grpc-node-deadline-exceeded/tree/main/trace-logs (Screenshot is in UTC-5). The first error message is at 2023-08-01T02:42:31.319Z. I'm exec'ing into the pod and running some commands at around 2023-08-01T14:45Z channelz
That very last subchannel 887 output has a target of tcp info
Nothing is hitting eth0 so no packet captures. Action starts here which is line 4,431,621 in the trace log file
There's other examples earlier in this same trace log of a closed connection GOAWAY where gprc-js recovers just fine. Some lines I found interesting
subchannel (887) is what was reported in channelz. Those (918) lines have the same remote ip and are repeated in the rest of the trace log. I see a |
Thank you for all of this detailed information. It looks like subchannel 887 is definitely the culprit. Every single request initiated with it ends with Fortunately, keepalives should fix this for you. If there really is no connection, the client should see that on the first ping it sends and go back to connecting again. You should use the options I suggested in #2502 (comment).
That is correct. 918 is the ID for a transport, not a subchannel. That corresponds to a "socket" in channelz, and you can see that ID in the A few notes about the logging output, mostly for things I need to change:
|
Thanks so much for the detailed response. I'll turn on keepalives and see what happens.
Ah okay, that makes sense! |
Here's something interesting: the command |
(I think the keepalives will help our production deployment. I haven't seen it again in production but we were only seeing it about once a month or so anyway. I'm still interested in what's going on. I also know there's a lot of existing code not configuring keepalives.) My test setup (running in GKE standard [not autopilot] cluster, us-central1, e2-medium node) reproduced twice overnight. I was able to get heapdumps, grpc traces, and packet captures for both reproductions. I can email you the packet captures with the tls keylog and the heapdumps if requested. I believe the bearer token for the service account would be expired by now, but it's all over both of those I don't really want to post that all on the internet forever. Trace logs, grpcdebug output, and some shell cmd output are available here. "26q77" and "rcmzn" were suffixes on the k8s pods that reproduced. This all looks similar to my previous reproduction, but I have packet captures. 26q77trace.log line 2,361,090
This is the packet capture for the connection to 142.251.120.95:443 Similar story for the other pod rcmzntrace.log line 4,378,056
This is the packet capture for the connection to 173.194.195.95:443 I'm going to type a few words of analysis but only worry about the "26q77" logs. The "rcmzn" case is extremely similar.
I don't have proof but I have a feeling that node isn't being given enough cpu cycles to complete the handshake in time. There's noticeable gaps in the trace logs around T08:29:19.808 - T08:29:21.705. The resets from the server-sides happen to a lot (all?) of the inflight TCP connections. I'm running my testing code with guaranteed QoS 20 mcpu (request=limit) but on a e2-medium which is a shared cpu instance. We run 2000 mcpu with guaranteed QoS in production on much larger vms, but the app is doing much more than pubsub publishes. I found rumors of node's tls handshake happening in the main thread. Where I've not reproduced this behavior with my test script: running on a local pc, running in GKE autopilot, running in GKE standard with cpu request=20m and no limit. All of those would have fed more cpu cycles to node. Node continuing to think I hardly know what I'm doing with heapdumps. It seems like most of the memory is in buffers for outgoing requests that are never going to happen. |
https://github.com/wwilfinger/grpc-pubsub-fake-public/ I can reproduce locally consistently with this code. I had most of this already written so it's not a very minimal reproduction. The Golang gRPC server in this has a 1ms connection timeout configured. The NodeJS client can never (that I've seen) complete TLS within that time. After a handful of tries the client gets into the same state that I've seen while running in GKE and communicating with actual PubSub. In the client code, uncomment the gRPC keepalive config and it will get itself out of the "stuck" state (good!). Increase the server-side connection timeout to 1 second, everything works okay. |
This is great data. I think these packet captures are the smoking gun we were looking for: the client misses the FIN,ACK packet and then misses or ignores the RST, and from then on treats the connection as still open and usable, even though it is not. The heap dump seems to show where the memory leak is coming from: the HTTP/2 implementation will buffer arbitrarily many outgoing packets if it thinks it has a writable socket. I think this clearly shows that this is a Node bug, so I think you should file an issue in the Node repository. Do your packet captures show any further socket activity after the RST packet? The answer to that would indicate how the client is interacting with the socket in the long term. Keepalives should always get a client stuck like this unstuck. The client will send the ping, and it won't get a response, and then gRPC will discard the connection and create a new one. |
I managed to strace the client side of the reproduction with all gRPC tracers active, and the output is interesting, and might help with a Node issue. Here are the most relevant lines (not all consecutive):
So, FD 23 corresponds to the socket that appears to connect. Then writing to it results in |
Based on the information in nodejs/node#49147 (comment), I published a change in grpc-js version 1.9.1 that defers all actions in the write callback using |
v1.9.1 has made this worse for me. With this version I am seeing this bug at much greater frequency than previously (I believe oldest iv seen it on is 1.8.2, but could be wrong). I may be able to use this increased frequency to get around some of the packet capture issues I was running into. |
@P0rth0s have you tried enabling keepalives, as suggested in #2502 (comment)? The argument to the |
@murgatroid99 Yes passing I am seeing this under very similar conditions to the ones linked in that other thread. Since I can get this within a day consistently on 1.9.1 it shouldn't be a problem for me to get a tcpdump without ssl enabled |
The comment I linked is from earlier in this thread. Keepalives really should be fixing this. Remember that with those settings, a connection can stay in the bad state for up to 30 seconds before it is detected by keepalives. Can you confirm that you are seeing the problem persist for longer than that? Can you also confirm that keepalives are actually enabled by running your code with the environment variables |
Hi @murgatroid99, I've picked up playing with this from @P0rth0s. We've upgraded to 1.9.4 and have been doing some debugging and testing. I'm not 100% sure we are seeing the same issue or a newer behavior, possibly due to changes we've tried to make. Specifically, it appears now that we spin many connections very rapidly which in turn spins out CPU load even when we aren't making this amount of calls. I've uploaded |
It looks like you sorted that log. If so, that's more confusing than helpful, because it puts all of the events with the same timestamp in alphabetical order. Can you please share the unsorted output? |
Sorry about that standard practice for most of our internal logs. https://gist.github.com/krrose27/b9e31023bccfbcda02fb828c5f6317d7 |
This is definitely a different problem than the one we were seeing earlier. It looks like for some reason the server is repeatedly accepting the connection and then closing it. The client doesn't back off when connection attempts succeed, so this is a bit of a pathological case where the client will just keep trying to connect. If the server instead refused the connections entirely, the client would start connecting with a default 1 second delay and back off exponentially to a default maximum of 2 minutes. |
If you are running a Node server, and you call Update: As of grpc-js 1.10.x, |
Problem description
Intermittently our grpc client is entering a state where the server is sending a response to the client, but the client doesnt receive it and throws a DEADLINE_EXCEEDED error. The error persists on retries until the server or client is restarted.
Reproduction steps
Unknown - appears to eventually enter this state in longer lived environments.
Environment
Additional context
Client Logs
Server Logs
As you can see the server responds well within the deadline but the client never gets the response. I know transient failures can happen but since it persists on retries it appears there is something deeper going on here.
The text was updated successfully, but these errors were encountered: