h2 connection pool is limited by SETTINGS_MAX_CONCURRENT_STREAMS #2941
Unoptimized EDS is the root cause. With ADS, we have a single EDS stream globally, which is nice for the obvious reasons. With non-ADS, currently Envoy opens a new stream for each cluster, regardless of whether they point at the same management server and could be converged. I think it would be best to go in and fix EDS to reuse streams. It's not that hard, I'd SWAG this as 2-3 days effort including tests. |
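To make the stream-reuse idea above concrete, here is a minimal, hypothetical Go sketch (not Envoy's actual C++ implementation; all names are made up) of keying EDS subscriptions by management server so that clusters pointing at the same server share one stream instead of opening one stream per cluster:

```go
package main

import "fmt"

// edsStream is a stand-in for a single gRPC stream to one management server.
// In a real client this would wrap the bidirectional DiscoveryRequest/
// DiscoveryResponse stream; here it only records which clusters it serves.
type edsStream struct {
	server   string
	clusters []string
}

// edsMux reuses one stream per management server instead of opening a new
// stream per cluster, which is the optimization described above.
type edsMux struct {
	streams map[string]*edsStream // keyed by management server address
}

func newEDSMux() *edsMux {
	return &edsMux{streams: map[string]*edsStream{}}
}

// subscribe adds a cluster to the shared stream for its management server,
// creating the stream only if this is the first cluster for that server.
func (m *edsMux) subscribe(server, cluster string) *edsStream {
	s, ok := m.streams[server]
	if !ok {
		s = &edsStream{server: server}
		m.streams[server] = s
	}
	s.clusters = append(s.clusters, cluster)
	return s
}

func main() {
	mux := newEDSMux()
	mux.subscribe("xds.example.com:15010", "cluster-a")
	mux.subscribe("xds.example.com:15010", "cluster-b") // reuses the same stream
	fmt.Printf("streams open: %d\n", len(mux.streams))  // 1, not 2
}
```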
That would only address envoy->pilot, but not envoy->unmodified gRPC app. We can't assume all applications will change their configuration to deal with Envoy limitations, can we? |
I believe the same problem exists for the HTTP Async Client as well, if the cluster is configured with H2. The gRPC Async Client simply depends on it. |
@htuch Optimizing EDS is good and solves the exact issue for Istio, but I feel fixing the gRPC async client (and the underlying HTTP async client) is more important. We cannot rely on gRPC stream optimization for all cases. |
I don't really understand what EDS stream optimization means. Do you mean changing the API to put requests/responses over a single stream? The proper fix here is actually at the HTTP/2 connection pool level. The pool should be able to create multiple connections per backend host if needed. There are other reasons we actually want this also (allowing for more connection/stream fan out in the case of middle proxies). |
@mattklein123 today we have 1 EDS stream per cluster. EDS can have multiple cluster subscriptions per stream, but this isn't done today outside of ADS. |
There are no API changes, only Envoy implementation tuning. |
@htuch OK I see. Can we open a different issue on that? I would like to keep this issue open to track h2 connection pool which we should also fix at some point. |
@mattklein123 Sure, #2943. |
One possible option (or hack?) would be to override the server-sent 'max streams' until multiple connections are implemented, if the app happens to use an h2 stack that doesn't enforce max streams. Or document how to set max streams for common languages and stacks (along with docs on how to …). The biggest problem was the lack of clear information about this limitation. |
That's going to break more stuff than it fixes. |
Extra info:
So either the real fix or lots of docs and changes in upstream servers are required; no quick hack. |
For folks looking for management server workarounds in the interim, here is an example of a Go management server change which is pretty trivial: projectcontour/contour#308 |
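For context, the general shape of such a workaround in a Go management server (and of the per-stack documentation suggested earlier in the thread) is to raise the stream limit advertised via SETTINGS_MAX_CONCURRENT_STREAMS on the gRPC listener. A hedged sketch in grpc-go, not the exact contour diff, with a hypothetical listen address:

```go
package main

import (
	"log"
	"math"
	"net"

	"google.golang.org/grpc"
)

func main() {
	lis, err := net.Listen("tcp", ":15010") // hypothetical xDS listen address
	if err != nil {
		log.Fatal(err)
	}

	// Raise the per-connection stream limit advertised via
	// SETTINGS_MAX_CONCURRENT_STREAMS so that many EDS subscriptions
	// can share a single HTTP/2 connection from Envoy.
	srv := grpc.NewServer(grpc.MaxConcurrentStreams(math.MaxUint32))

	// ... register the xDS/EDS services on srv here ...

	if err := srv.Serve(lis); err != nil {
		log.Fatal(err)
	}
}
```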
I am not able to reproduce this issue. I wonder if the problem with Istio Pilot was that the proxy is/was set up as TCP? @costinm the only settings I see when using Envoy as a gRPC (h2) proxy is: … while if I go directly against my same Go server with a max streams of 16:
22:21:38 http2: Framer 0xc4204e8000: read SETTINGS len=6, settings: MAX_CONCURRENT_STREAMS=16
and either way I can do > 100 simultaneous streams on 1 connection |
If I remove maxstreams on the backend side and really push it (doing 1000 simultaneous streams * 2 connections), I get about 15% errors:
This is what I get in the Envoy ingress logs (no errors on the Envoy sidecar):
What is UO? |
@ldemailly I'm pretty sure that the gRPC xDS client talks directly to the selected cluster over HTTP/2, so listener settings (HTTP or TCP proxy) for Pilot wouldn't matter. As for the difference in … From the documentation: |
I've been dealing with this kind of issue using Envoy (Istio) since last week, while trying to DDoS my application: istio/istio#4658
Go client (Go 1.10 / gRPC) -TLS-> Envoy (Istio 0.7.0) --> server (Go 1.10 / gRPC)
Trying to DDoS my app, I start some … I see Envoy opening 4 TCP connections to my server application, proxying every client connection into them. Which means that, if my gRPC server is set up to allow 250 streams per client, I will be able to handle only 1000 clients (250 streams * 4 TCP connections). While I'm not sure about the numbers, I can confirm the behaviour. My conclusion is that:
Which means that if Envoy only opens 1 TCP connection to the server, and MaxConcurrentStreams is 250, Envoy will not be able to handle more than 250 clients.
I hope I'm wrong, but this is the behaviour I see. I can't find anything about the real behaviour … Also, if my server is set to … |
I also noted something special with the pool of TCP connections used by Envoy when using gRPC streams.
Setup: …
Note that the load test … I'm using our gRPC client to load test. Each … In the above setup, everything is fine until I reach the 4000 active requests:
Discussion: this can be reproduced anytime. At the same time I can see hundreds of active … |
I can now add some more info.
One Envoy: …
Everything works as expected; I can go up to 1024 streams, as per Envoy's default. If I try more, I do get some 503s, which is the expected behaviour.
Two Envoys: …
In fact, what is happening is:
Which means that if N > 100, I lose connections. I can't find any explanation for why the limit is 100, but I can reproduce this situation every time. I'm still digging... |
I really think this issue is not linked to istio/istio#4940. Maybe I should open a new issue here, as it really sounds like it lives inside Envoy? |
@prune998 your issue is a different issue. I know what the issue is. Please delete all your comments from this issue as they are unrelated and open them in a new issue. Thanks. |
Doesn't seem like anyone is actively working on this. Mind if I look into implementing this? |
@mjpitz I have invited you to the envoyproxy org; if you accept, I can assign you the issue. |
@htuch : Joined |
Alright. I spent a fair bit of time diving into the code base. General Notes:
When primary is swapped to draining:
Implementation proposal:
Thoughts on this approach? Anything major I'm missing or things I should consider? |
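For readers following along, here is a minimal Go sketch of the core technique this issue asks for (grow the pool with another connection once every existing connection is at its advertised stream limit). It is hypothetical illustration only, not Envoy's actual C++ connection pool or the specific proposal above, and the type and field names are made up:

```go
package main

import "fmt"

// conn models one upstream HTTP/2 connection with the peer-advertised
// SETTINGS_MAX_CONCURRENT_STREAMS and the number of streams in flight.
type conn struct {
	maxStreams    uint32
	activeStreams uint32
}

// pool is a toy multi-connection pool for a single upstream host.
type pool struct {
	conns []*conn
	// dial stands in for establishing a connection and reading the peer's
	// SETTINGS frame; here it just returns a fixed limit.
	dial func() *conn
}

// newStream finds a connection with spare stream capacity, creating a new
// connection when all existing ones are saturated, instead of capping the
// pool at a single connection.
func (p *pool) newStream() *conn {
	for _, c := range p.conns {
		if c.activeStreams < c.maxStreams {
			c.activeStreams++
			return c
		}
	}
	c := p.dial()
	c.activeStreams = 1
	p.conns = append(p.conns, c)
	return c
}

func main() {
	p := &pool{dial: func() *conn { return &conn{maxStreams: 100} }}
	for i := 0; i < 250; i++ {
		p.newStream()
	}
	fmt.Printf("connections: %d\n", len(p.conns)) // 3: 100 + 100 + 50 streams
}
```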
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
Sorry I haven't gotten to this yet. I got sidetracked with a presentation. Hopefully I should be able to get to it in the next couple of weeks. |
This issue has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in the next 7 days unless it is tagged "help wanted" or other activity occurs. Thank you for your contributions. |
keepalive |
I've been talking myself in circles so it would be nice to talk some of this through some more before implementing. Since being assigned this ticket, I've been reading up and understanding more about how the Envoy threading model works. This was well captured by the following blog post and tech talk:
A quick TL;DR: Envoy maintains a pool of workers (configured by --concurrency), each with its own connection pools. One concern that came to mind was how this impacts the purpose / intent of MAX_CONCURRENT_STREAMS. From what I can gather, as far as resources go, its purpose isn't well documented. As an engineer, I think about this setting as a way for service owners to throttle workloads being performed by their clients. By implementing an "auto-scaling" feature like this, we effectively bypass this setting. I've read a few articles around this, and many people are advised to work with the service owner to better understand the workload they are trying to perform. Here's one issue under the http2-spec detailing a similar response. Open Questions:
Ultimately, I've found myself in a personal philosophical debate of "Sure, I could add this feature, but should I?" Getting some clarity around my questions above will help resolve my internal debate. |
Afaik, it is the maximum multiplexing of logical gRPC streams over a single TCP socket/connection; it's not meant to say that no additional streams should exist between the client and server, but is just a measure to limit throughput and congestion over a single socket (imo) |
I know gRPC defaults to one connection per backend returned via name resolver (at least in Java). The google-apis/gax project has a ChannelPool that allows you to increase the number of connections per backend. (just to add another reference to where we're already bypassing this constraint) |
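gax's ChannelPool is Java, but the same idea expressed in grpc-go terms is to dial several ClientConns to one target and hand them out round-robin, so no single connection runs into SETTINGS_MAX_CONCURRENT_STREAMS. A hedged sketch under those assumptions (not a library API; names and target are hypothetical):

```go
package main

import (
	"sync/atomic"

	"google.golang.org/grpc"
)

// channelPool round-robins over several ClientConns to the same target so
// that streams spread across multiple TCP connections, sidestepping the
// per-connection SETTINGS_MAX_CONCURRENT_STREAMS limit.
type channelPool struct {
	conns []*grpc.ClientConn
	next  uint64
}

func newChannelPool(target string, size int, opts ...grpc.DialOption) (*channelPool, error) {
	p := &channelPool{}
	for i := 0; i < size; i++ {
		cc, err := grpc.Dial(target, opts...) // lazy dial; one TCP connection per ClientConn
		if err != nil {
			return nil, err
		}
		p.conns = append(p.conns, cc)
	}
	return p, nil
}

// pick returns the next connection in round-robin order.
func (p *channelPool) pick() *grpc.ClientConn {
	n := atomic.AddUint64(&p.next, 1)
	return p.conns[n%uint64(len(p.conns))]
}

func main() {
	pool, err := newChannelPool("backend.example.com:8080", 4, grpc.WithInsecure())
	if err != nil {
		panic(err)
	}
	_ = pool.pick() // use this ClientConn to create stubs / streams
}
```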
@mjpitz it's probably a good idea to distinguish between data plane and control plane here. For the data plane and backends, we will effectively have an upper bound of … I agree with the idea of making the connection pool support multiple connections. @oschaaf has also been thinking about this in the context of https://github.com/envoyproxy/envoy-perf/tree/master/nighthawk |
Yes, we should definitely do this. There are additional reasons, including mitigating head-of-line blocking in certain cases. I think we might have an issue open specifically on this, but I can't remember, and a quick search doesn't return anything. |
Got an initial patch based on my proposal from back in January. Working on running tests. Here's the initial diff. |
@mattklein123: if we're also looking to mitigate HOL blocking, we probably want to provision the connections ahead of time, yeah? Edit: actually, preemptively establishing connections doesn't seem like it would help much... looks like there are a couple of other good solutions though |
@mjpitz yeah, pre-creation is another optimization, though IMO I would recommend tracking that under a new/different issue. I've discussed wanting to do this many times with @alyssawilk (who also has thoughts on allowing multiple h2 connections). |
Prefetch is also something that @oschaaf and I have discussed in the context of Nighthawk. |
So I'm not sure how I thought I was originally going to get the settings info from the h2 conn_pool. Getting deeper into the code, it seems like that's pretty well encapsulated by the client_codec (probably as it should be). I'm curious what other ideas are floating around so I'll check in on this again tomorrow. |
Fixed with recent connection pool changes. |
Original issue description:
It turns out that the gRPC async client won't open more than SETTINGS_MAX_CONCURRENT_STREAMS concurrent streams to the xDS gRPC server (good!), but EDS monitors are long-lived streams, waiting forever, and it doesn't look like the gRPC async client opens more than a single HTTP/2 connection to the backend (bad!), which means that the total number of working EDS endpoints is limited by the xDS server's settings, and only the first SETTINGS_MAX_CONCURRENT_STREAMS EDS subscriptions will be able to establish an HTTP/2 stream and receive responses.
The solution is for the gRPC async client to open another HTTP/2 connection once it reaches the xDS server's SETTINGS_MAX_CONCURRENT_STREAMS.
A temporary workaround is to increase SETTINGS_MAX_CONCURRENT_STREAMS on the xDS server, but that breaks once there are middle proxies involved.
See istio/istio#4593 for background.
cc @htuch @mattklein123 @costinm @ldemailly @lizan