[HTTP/2] RTT pings shutting down connection in gRPC CI - regression in .NET 6 #62216
Description
gRPC has a set of interop tests that test various scenarios. These are run as part of a CI process. Upgrading the .NET client from .NET 5 to .NET 6 resulted in some of the tests failing 50% of the time. Most tests continue to pass. The four that fail are large_unary, server_streaming, jwt_token_creds, and compute_engine_creds.

Reproduction Steps
I attempted to reproduce this from my local machine by calling the remote server using the failing tests, but I couldn't get them to fail. These tests are failing in most CI builds here - https://source.cloud.google.com/results/invocations/f72f83db-c93b-467e-94cb-6b4ff7aeb286/targets;collapsed=/github%2Fgrpc%2Finterop_test/tests;group=tests;test=cloud_to_prod:aspnetcore:default:server_streaming:tls;row=2

Expected behavior
Expect tests to continue passing after upgrading to .NET 6.

Actual behavior
Tests often fail.
Error detail (consistent across the 4 failing tests):
Complete logging from test failure (including from HTTP event source):

Regression?
Yes, this worked in .NET 5. The only change has been updating the client app to .NET 6.

Known Workarounds
No response

Configuration
Runtime: 6.0.0-rtm.21522.10+4822e3c3aa77eb82b2fb33c9321f923cf11ddde6

Other information
No response

Comments

Tagging subscribers to this area: @dotnet/ncl
I wonder why the server went with ENHANCE_YOUR_CALM.
@antonfirsov Any insight into the ping behavior here? 3 pings in 1.5 seconds does seem like a lot.
Good spotting on the pings. Some servers are very strict about the number of pings allowed before they kill the connection. The endpoint is a sandbox Google production environment. A hardened edge server is validating traffic. (Kestrel processes pings so quickly we don't bother being strict about them)
I think there are two issues here:
1. SocketsHttpHandler's RTT pings are tripping the server's ping limits, causing it to close the connection.
2. When the server closes the connection after sending a GOAWAY with an error code, the failure surfaced to the caller doesn't indicate that error code.
Agreed. When the server gracefully (i.e. not in the middle of a frame) closes the connection after a GOAWAY with an error code has been received, we should fail any outstanding requests with an exception that indicates the server error from the GOAWAY.
Separate issue for improving error message: #62228
Workaround: I think this is a candidate for servicing.
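For reference, the client-side workaround discussed in this thread is to turn off SocketsHttpHandler's dynamic HTTP/2 window sizing, the feature the RTT pings support. A minimal sketch of what that might look like, assuming the .NET 6 AppContext switch name for the feature (the exact snippet isn't quoted in this thread):

```csharp
using System;
using System.Net.Http;

// Sketch of the workaround: disable dynamic HTTP/2 window sizing (and with it
// the RTT ping burst) before any HttpClient or gRPC channel is created.
// The switch name is an assumption, not confirmed by this thread.
AppContext.SetSwitch(
    "System.Net.SocketsHttpHandler.Http2FlowControl.DisableDynamicWindowSizing",
    true);

// Clients created after the switch is set pick up the behavior.
using var client = new HttpClient();
```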
We send a "burst" of ~4 pings after opening the connection to get an accurate RTT estimation fast, but the client should only send these pings after receiving DATA or HEADERS from the server, in order to conform to the gRPC recommendation on server-side ping flood protection. This allows a maximum of 2 "ping strikes" on the server between sending HEADERS or DATA to the client. Looks like there is a sneaky bug in this logic, or some corner scenario I missed (or, less likely, a bug in the server). I'm afraid the only way to troubleshoot this is to grab a repro environment and produce logs, packet captures etc. @JamesNK can you help with this? If you can't get a local repro, can we do something to grab a machine identical to the CI ones or use the CI machines for sandboxing?
It's called
Triage: We should look into it, but given that it seems to be a CI-only problem for now, it is OK to live with the workaround for a while.
@halter73 in my previous comment I missed the fact that it's a statement from you (I thought it was an observation from the logs). Where is this 1.5 sec restriction defined? I didn't find such a strict rule when investigating existing ping flood protection mechanisms.
The initial comment has complete logs, including logs from HTTP event source. Capturing packets isn't possible because it is a CI environment that we don't own. Fortunately, the server is publicly accessible so we can call it from our local machines.

I have updated a PowerShell script to pass the server address and arguments to the interop test client: https://github.com/JamesNK/grpc-dotnet/tree/jamesnk/interoptests-failing

Call the interop tests with

There are many interop tests but only four are failing. large_unary and server_streaming return relatively large payloads, and jwt_token_creds and compute_engine_creds use authentication. I think what is consistent between these calls is they don't complete immediately. Either there is a delay to download the payload or a delay in validating creds.
I think it is a matter of time. My hunch is this is caused by a combination of factors: the failing calls don't complete immediately (large payloads or credential validation), the client sends its burst of RTT pings while it is waiting, and the server is strict about how many pings it will accept.
If the remote gRPC server that the CI tests are hitting is based on gRPC Core (the C/C++ implementation), the keepalive documentation for that HTTP2 server may be useful. It talks about when a ping is considered solicited vs. unsolicited and its "strikes" rules. By default, the server side of gRPC Core does not allow pings unless there is an active stream, and pings received more frequently than once every 5 minutes are considered unsolicited. After two unsolicited pings, the server says GOAWAY with ENHANCE_YOUR_CALM.
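To make those defaults concrete, here is a rough sketch of a gRPC Core based server configured (via the Grpc.Core C# binding) with the channel arguments corresponding to the behavior described above; the values shown are the documented defaults, and how the actual test backend is configured is not known from this thread:

```csharp
using Grpc.Core;

// Sketch only: the ping-policing knobs of a gRPC Core server, set to the
// documented defaults described in the comment above. The real Google test
// backend's configuration is not visible from this thread.
var server = new Server(new[]
{
    // Number of "bad" (unsolicited) pings tolerated before the server sends
    // GOAWAY with ENHANCE_YOUR_CALM.
    new ChannelOption("grpc.http2.max_ping_strikes", 2),
    // Pings arriving more often than this without intervening data count as
    // strikes (300000 ms = 5 minutes).
    new ChannelOption("grpc.http2.min_ping_interval_without_data_ms", 300000),
    // 0 = do not accept keepalive pings when there is no active call.
    new ChannelOption("grpc.keepalive_permit_without_calls", 0)
});

server.Ports.Add(new ServerPort("localhost", 50051, ServerCredentials.Insecure));
server.Start();
```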
@chwarr if my understanding of those guidelines is correct, this is not fully accurate. Note that the Keepalive User Guide for gRPC Core repeatedly talks about stuff like:
or
The details are not clarified in that document, but it references the Client-side Keepalive guide, where the server enforcement section clarifies that the unsolicited ping counter is being reset to 0 whenever the server sends HEADERS or DATA. Let me know if you think I'm misunderstanding those docs.
You've read it much more closely than I have. :-) I just wanted to point out 1) that the doc exists and 2) what the default values for the gRPC Core HTTP2 server are, in case these helped explain why the gRPC CI tests were hanging up on the gRPC.NET client.
No need to decipher docs. The server is publicly available to test against.
I am getting the same exception after upgrading to .NET 6, not quite sure if it's the same problem.
This exception is showing up in the Ocelot API Gateway logs randomly, but only while calling an HTTP/2-enabled container's REST API endpoint.
@zhuoyang If your problem is the same, disabling dynamic window sizing might help. This should be done on the client side; I guess that would be Ocelot in your case. If it helps, it's the same issue.
I disabled it just a while ago; so far so good, it seems. I will monitor whether the issue shows up again.
@zhuoyang we do not have plans yet. You are the first customer hitting it outside of CI in the real world. When we see more people hitting it, we will raise the priority of the issue.
@karelz We are currently also experiencing the issue in a real-world implementation. Disabling dynamic window sizing also helped in our case. As dynamic window sizing is activated by default, more people will run into this issue.
@karelz @JamesNK note that the "gRPC CI" backends mentioned in this issue are actually backends that are configured to behave in the same way as Google Cloud API backends (the interop tests are basically hitting a FooBar API backend). So if requests are failing for the "gRPC CI" backends, it also means there will most likely be failures against Google Cloud API backends. So this issue is quite serious. Can we raise the priority? It really looks like the DynamicWindowSizing .NET feature sends unsolicited pings (which leads to the server terminating the connection).
@karelz @fhaubner @JamesNK @antonfirsov I am now able to reproduce consistently (see details here: grpc/grpc-dotnet#1511 (comment)), and the problem seems to happen when the latency of the connection is sufficiently low (it happens when the client and server are geographically close to each other). With the description I provided in grpc/grpc-dotnet#1511 (comment), the issue should be easily reproducible by anyone who has access to a Google Cloud project (you need to create a VM in the right compute zone so it's "close enough" to our test backends). @antonfirsov any chance you could take a look?

FTR, in one rare case, I managed to catch both a success and a failure on the same machine, in subsequent runs of the test (which I think is valuable since it makes all the parameters equal, and the logs might reveal what the difference was that actually triggered the disconnection). The tests were run on a machine that otherwise has a failure rate >95%.

Log of successful run:
Log of failing run:
@jtattermusch thanks a lot for taking the time to create and document your repro, especially for tracking down that it's related to the location! I can also repro this in an Azure VM located in East US. I will look into this, though it will take some non-trivial amount of time to figure out the exact conditions leading to ENHANCE_YOUR_CALM.
https://grpc.github.io/grpc/core/md_doc_keepalive.html The docs here say the ping limit is 2 without the server receiving headers or data frames. I can see 3 pings in Jan's logs. Is SocketsHttpHandler exceeding that limit?
@JamesNK @jtattermusch correct me if I'm misreading the docs, but in my understanding
My understanding is every ping is a bad ping, i.e. a strike. The server then resets the number of strikes when either enough time has passed (2 hours) or it sends DATA or HEADERS.

Because GRPC_ARG_KEEPALIVE_PERMIT_WITHOUT_CALLS defaults to false, SocketsHttpHandler must wait until after it sends a request before any pings. Then, after that first request, it can send up to 2 pings. Then it must wait for HEADERS/DATA from the server.

When I talked to Jan about this he said it only reproduces when the client and server are located in the same data center, which means low latency. I wonder if there is a race between what the server sends and when the client sends its pings.

The safest way to fix this could be for SocketsHttpHandler to send 2 pings instead of 3 to calculate latency. 2 pings at any time is always safe with the default server configuration. Alternatively, wait until the next request is started and we have received HEADERS from it before resetting the allowed ping count to zero. That should remove any chance of a race.
This is what we do. We never send a PING frame without receiving DATA/HEADERS on an active stream first, so the PING should count as solicited. Receiving DATA or HEADERS indicates that on the other side the server's ping strike counter has just been reset.

However, if SocketsHttpHandler receives 3 HEADERS/DATA in a row, replying with a PING for each of them, and then the server has a short hiatus sending DATA/HEADERS (and thus does not reset its strike counter between those PINGs), the PINGs can add up to more than the 2 allowed strikes.
We did some research with @jtattermusch, and it turned out that there are proxies in the wild which do different ping accounting than what is specified in the "server enforcement" document, resetting their ping counter upon receiving frames. We should adjust our implementation to make sure we don't send more than 2 RTT PINGs.
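As an illustration of the adjustment being described (not the actual SocketsHttpHandler code), the client-side accounting can be sketched as: only send RTT pings once the server has sent HEADERS or DATA, and never send more than two of them in a burst, so a default-configured server or proxy never accumulates more than two ping strikes:

```csharp
// Purely illustrative sketch; the type and member names are invented and this
// is not how SocketsHttpHandler is actually structured.
sealed class RttPingBurst
{
    // A default-configured gRPC server tolerates 2 "ping strikes", so cap the
    // RTT measurement burst there instead of the ~4 pings mentioned earlier.
    private const int MaxBurstPings = 2;

    private int _pingsSent;
    private bool _serverFrameSeen;

    // Call when a HEADERS or DATA frame arrives from the server.
    public void OnServerFrameReceived() => _serverFrameSeen = true;

    // Call before sending an RTT-estimation PING; false means "don't send".
    public bool TrySendRttPing()
    {
        if (!_serverFrameSeen || _pingsSent >= MaxBurstPings)
        {
            return false;
        }

        _pingsSent++;
        return true;
    }
}
```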
Triage:
As a result we recommend closing the issue, as there is no current customer hitting it. @JamesNK just to double-check: you re-enabled the tests and you don't see the problem anymore either, right?
Yes. Interop tests that call the Google backend are using .NET 6 again - grpc/grpc-dotnet#1765