Glob collections behavior on partial responses #99

Open
PapaCharlie opened this issue Aug 5, 2024 · 7 comments

@PapaCharlie

I have a question on how clients are supposed to interpret glob collection responses from an xDS control plane. gRPC has a default message limit of 4MB, which can cause clients to reject a response from the control plane if it is too large. In practice, most glob collections will be small enough to fit in a single response; at LinkedIn, however, some clusters teeter over the edge of this limit during high load, causing some clients to simply reject the response. This is especially likely during startup, since clients may request multiple collections at once, which can easily cross the size threshold. Because the limit is not trivial to raise (and there is no guarantee a single value will fit all use cases), our control plane implementation instead splits the response into multiple "chunks", each representing a subset of the collection, such that each response is smaller than 4MB. However, this raises the question of how the client should behave under such circumstances.
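
For concreteness, here is a minimal sketch of the kind of chunking our control plane does, written against go-control-plane's generated delta types. The 4MB budget and the greedy packing are illustrative choices on our side, not something the xDS spec prescribes.

```go
package example

import (
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
	"google.golang.org/protobuf/proto"
)

const maxResponseBytes = 4 * 1024 * 1024 // gRPC's default max message size

// chunkResponses greedily packs resources into DeltaDiscoveryResponses so that
// each serialized response stays under the budget. Every chunk is a valid
// response on its own; the client just doesn't know how many chunks to expect.
func chunkResponses(typeURL string, resources []*discoveryv3.Resource) []*discoveryv3.DeltaDiscoveryResponse {
	var chunks []*discoveryv3.DeltaDiscoveryResponse
	current := &discoveryv3.DeltaDiscoveryResponse{TypeUrl: typeURL}
	for _, r := range resources {
		// Start a new chunk if adding this resource would cross the budget.
		// proto.Size is only an approximation of the final framed size.
		if len(current.Resources) > 0 && proto.Size(current)+proto.Size(r) > maxResponseBytes {
			chunks = append(chunks, current)
			current = &discoveryv3.DeltaDiscoveryResponse{TypeUrl: typeURL}
		}
		current.Resources = append(current.Resources, r)
	}
	if len(current.Resources) > 0 {
		chunks = append(chunks, current)
	}
	return chunks
}
```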

The spec does not dictate that the collection be sent as a whole every time (nor should it, for the reason listed above), but it also provides no way to mark the "end" of a collection or a means to provide the collection's size. This means in some extreme cases the client may receive only a very small subset of the collection on the initial response from the control plane. In this scenario, should the client:

  1. Wait an arbitrary amount of time for the control plane to send the rest of the collection? In the case where the client already received everything, it could introduce unwanted latency.
  2. Simply treat the contents of the response as the full collection, even if it is partial? This is equally bad since it could cause the client to send too much traffic to a subset of hosts if the collection is being used for LEDS.

There is no room in the protocol today to communicate the size of a collection, and arguably such a field would serve little purpose outside this specific edge case. My suggestion would be to mimic the glob collection deletion notification, but in reverse. Here is what it would look like (following the example in TP1):

  1. Client requests xdstp://some-authority/envoy.config.listener.v3.Listener/foo/*.
  2. Server responds with resources [xdstp://some-authority/envoy.config.listener.v3.Listener/foo/bar, xdstp://some-authority/envoy.config.listener.v3.Listener/foo/baz, xdstp://some-authority/envoy.config.listener.v3.Listener/foo/*].

By adding the glob collection's name to the response, the control plane can signal to the client that it has sent everything, effectively bookending the response. The client can then wait for this "end-of-glob-collection" notification to unambiguously determine whether it has received every resource in the collection. The resource named after the collection would have to be null or some special value to prevent it from being interpreted as an actual member of the collection. This proposal could require some changes to clients, but this problem seems important to address as more systems leverage the xDS protocol.
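
To illustrate the client side, here is a rough sketch (again against go-control-plane's types) of how a client could treat a resource named after the collection as the end marker. None of this exists in the protocol today; it is only meant to make the proposal concrete.

```go
package example

import (
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// applyChunk merges one chunk into the client's view of the collection and
// reports whether the server signaled that the collection is now complete.
func applyChunk(
	resp *discoveryv3.DeltaDiscoveryResponse,
	collectionURL string, // e.g. "xdstp://some-authority/envoy.config.listener.v3.Listener/foo/*"
	view map[string]*discoveryv3.Resource,
) (complete bool) {
	for _, r := range resp.Resources {
		if r.Name == collectionURL {
			// The marker resource: ignore its payload and treat it purely as
			// an "end of collection" signal rather than as a member.
			complete = true
			continue
		}
		view[r.Name] = r
	}
	return complete
}
```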

@PapaCharlie PapaCharlie changed the title Clarification on glob collections behavior Glob collections behavior on partial responses Aug 5, 2024
@adisuissa

cc @htuch @markdroth

Generally speaking, the xDS protocol is eventually consistent, which implies that even when responses are sent in chunks, the clients will eventually have the same view as the server.
Even if an "EOF"-type indication were sent by the server, the xDS protocol would likely still use a "warming" timeout to ensure liveness of the system.
That said, I do think scalability is an important goal, and it should be addressed by the protocol.

Can we take a step back and try to understand what causes the bottleneck?
Specifically, is the underlying issue the number of entries in a response, or the size of each individual resource (such that having N resources goes over the threshold)?

@PapaCharlie
Author

Generally speaking, the xDS protocol is eventually consistent, which implies that even when responses are sent in chunks, the clients will eventually have the same view as the server.
Even if an "EOF"-type indication were sent by the server, the xDS protocol would likely still use a "warming" timeout to ensure liveness of the system.

Yeah, and that will eventually happen even in this scenario. But if the client has no such "warming"-timeout, it could cause some trouble.

Regarding the timeout itself, is it expected that the client always uses a "warming" timeout? That does seem reasonable, but it's technically something that could be short-circuited, right? Especially now that the Delta/Incremental protocol can explicitly tell the client that a resource doesn't exist, there's no need for the explicit timeout that SotW clients rely on.
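
As a small illustration of that delta-protocol behavior (go-control-plane types again, purely for reference): a subscribed resource that does not exist comes back in removed_resources, so the client can react immediately instead of waiting out a timeout.

```go
package example

import (
	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// reportMissing returns the subscribed resource names that the server has
// just declared non-existent via removed_resources.
func reportMissing(resp *discoveryv3.DeltaDiscoveryResponse, subscribed map[string]bool) []string {
	var missing []string
	for _, name := range resp.RemovedResources {
		if subscribed[name] {
			missing = append(missing, name)
		}
	}
	return missing
}
```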

Can we take a step back and try to understand what causes the bottleneck?
Specifically, is the underlying issue the number of entries in a response, or the size of each individual resource (such that having N resources goes over the threshold)?

It's a little bit of both :) We use the xDS protocol to ship custom types outside of the standard LDS/RDS/CDS/EDS flow. Basically, rest.li (LinkedIn's RPC stack) uses ZooKeeper as the data store for service discovery. We're trying to get off of ZK aggressively, and since we were already migrating everything to gRPC/Envoy and had developed an xDS control plane to do that, we effectively replaced the direct ZK connection with a connection to the xDS control plane. The control plane returns the same data that the ZK connection was returning, allowing us to plug in all of the existing code directly, without refactoring everything. In this case, the actual resources returned by the control plane for the equivalent of LEDS are larger than an LbEndpoint because they capture a bunch of additional metadata. Some of the clusters returned by the backend are also very large (4-5k hosts), so the overall cluster size combined with the larger individual resource size makes us cross the threshold frequently, even for an individual cluster. This is exacerbated when the client requests multiple clusters in one request.

Similarly, even though it's unlikely that a normal LEDS response for a single locality will cross that threshold, it's very possible to cross it if a client asks for many localities at once (e.g. during startup where it's asking for all the localities it was subscribed to previously). So while it's definitely not something that's usually encountered in average Envoy workflows, it's 100% conceivable that this could happen in very large meshes.

@PapaCharlie
Author

Regarding the proposed solution, this is something that we could introduce as an extension, where clients specify that they support this pseudo-EOF notification and the control plane conditionally sends it back.
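
Sketching how that negotiation might look, assuming go-control-plane's core.Node type: Node.client_features is an existing field for advertising well-known client capabilities, but the feature string below is made up for illustration.

```go
package example

import (
	corev3 "github.com/envoyproxy/go-control-plane/envoy/config/core/v3"
)

// Hypothetical capability string the client would advertise.
const globCollectionEndMarker = "xds.glob_collections.end_marker"

// supportsEndMarker tells the control plane whether it may append the
// pseudo-EOF resource to responses for this client.
func supportsEndMarker(node *corev3.Node) bool {
	for _, feature := range node.GetClientFeatures() {
		if feature == globCollectionEndMarker {
			return true
		}
	}
	return false
}
```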

@htuch
Contributor

htuch commented Aug 7, 2024

Would a list collection work better in this case? I.e. if you want to know the exact expected resources.

@PapaCharlie
Author

if you want to know the exact expected resources

We don't really care about the specific resources. It's functionally equivalent to how LEDS works today: the number of hosts in a specific locality doesn't matter, Envoy just needs to know what they are.

Would a list collection work better in this case?

I think it wouldn't really work here, for the same reasons it doesn't work for LEDS. The resources are way too dynamic, and materializing the full collection would require a lot of round trips. We can't inline the entries in the collection response either, since that would trigger the same problem of crossing the threshold.

Ultimately, there are two separate issues I want to discuss:

  1. An LEDS response can be larger than 4MB, causing it to be dropped by the client. The way around this is to split the response into multiple chunks, which works out of the box since every element of the collection is represented as an individual resource. Is it expected that glob collection responses should always fit in one response, or is this a valid strategy?
  2. If a control plane can decide to arbitrarily chunk glob collection responses, the client now does not know whether the response was chunked, and therefore does not know whether it has the full collection. It seems to me that there are a few ways to address this:
    1. Communicate to the client how many resources it should expect for the glob collection it wants. The client could then wait until it has that many resources. This one seems a little tricky to me since the resource count could change in between responses, making it a little racy, but doable.
    2. Communicate to the client that a glob collection was chunked, signaling it to wait for a subsequent response. Pretty reasonable, but I don't know if there's room for it in the protocol?
    3. Have a marker for the end of the collection. The client would then wait for responses until it receives said marker. This could be done by adding a special resource named after the collection at the end of the response (similar to deletion), or some other signal? Either way, this seems to be more flexible and probably the easiest to implement on the control plane.
    4. Have a "warming" timeout, as @adisuissa mentioned (a rough sketch follows this list). This is obviously the simplest thing to implement since it requires no changes to the protocol, but it has the obvious pitfall that the timeout could be too short, causing the client not to wait for the full collection, or too long, wasting time when the client already has the full collection.
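
Here is that rough sketch of option 4, assuming the client receives decoded chunks on a channel; the warming window value is a placeholder, and picking it is exactly the pitfall described above.

```go
package example

import (
	"context"
	"time"

	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

const warmingWindow = 2 * time.Second // placeholder value

// waitForCollection applies chunks as they arrive and returns once no new
// chunk has shown up for a full warming window, i.e. the collection is
// assumed (not known) to be complete.
func waitForCollection(
	ctx context.Context,
	chunks <-chan *discoveryv3.DeltaDiscoveryResponse,
	apply func(*discoveryv3.DeltaDiscoveryResponse),
) {
	timer := time.NewTimer(warmingWindow)
	defer timer.Stop()
	for {
		select {
		case resp := <-chunks:
			apply(resp)
			// Another chunk arrived: restart the quiet period.
			if !timer.Stop() {
				<-timer.C
			}
			timer.Reset(warmingWindow)
		case <-timer.C:
			// Quiet for a full window; assume we have the whole collection.
			return
		case <-ctx.Done():
			return
		}
	}
}
```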

Maybe we can set up a meeting to discuss this?

@htuch
Contributor

htuch commented Aug 8, 2024

I think some kind of delimiter could make sense. @markdroth thoughts? I think we're generally open to meeting and have a broader interest in collaborating with folks doing work in this space with xdstp and pushing the limits of scalability.

@markdroth
Contributor

(Sorry for the delayed response; I was out of town for a few weeks.)

In the general case, I don't think it's reasonable to expect the protocol to have a notion of "I have sent you the complete set", because the set of resources in a given collection can change very dynamically (e.g., auto-scaling adding or removing endpoints). So what happens if some resources are added or removed before the control plane finishes sending the initial set? This seems like a somewhat arbitrary decision for the control plane to make in the general case -- and if there's enough churn in the set, then the control plane might never tell the client it has the whole set.

Furthermore, it's not clear to me that the client should really care whether it has the whole set. The client already needs to be able to handle resources being added and removed at any time. As @adisuissa mentioned above, xDS is an eventually consistent protocol, and it should not matter to the client whether it gets all of the endpoints at once or whether they are split into two responses.

Note that at least in gRPC, there is always a bit of endpoint imbalance when the client first starts up, because even if it gets all of the endpoint addresses at once, the connection attempts for those addresses will finish at different times, and the client will start sending traffic to the ones it is connected to while waiting for the others. This imbalance smooths itself out fairly quickly, assuming all endpoints are reachable. But a small delay in getting all of the endpoint addresses should not in principle make much difference.

(I realize that Envoy works differently than gRPC in that regard: in Envoy, the LB policy picks a host without regard to whether that host currently has a working connection, so this initial imbalance might not happen if all of the endpoints are known. But I will point out that the flip side of that is that Envoy may choose a host that turns out to be unreachable, thus causing the request to fail, whereas gRPC will not do that.)

So I think the main question I have is: why isn't eventual consistency good enough here? Shouldn't the short-term imbalance be resolved very quickly? Are there other things you can do to alleviate even that short-term problem, such as having the control plane randomize the order of the endpoints it hands out?
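
For what it's worth, that randomization could be as simple as shuffling the collection members before chunking, so that each chunk is roughly an unbiased sample of the collection. A minimal sketch, reusing the chunkResponses helper from the earlier sketch (purely illustrative):

```go
package example

import (
	"math/rand"

	discoveryv3 "github.com/envoyproxy/go-control-plane/envoy/service/discovery/v3"
)

// shuffleThenChunk randomizes the order of the collection members before
// packing them into size-bounded responses.
func shuffleThenChunk(typeURL string, resources []*discoveryv3.Resource) []*discoveryv3.DeltaDiscoveryResponse {
	rand.Shuffle(len(resources), func(i, j int) {
		resources[i], resources[j] = resources[j], resources[i]
	})
	return chunkResponses(typeURL, resources) // from the chunking sketch above
}
```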

PapaCharlie added a commit to linkedin/diderot that referenced this issue Sep 27, 2024
If a given delta response contains too many resources, the server will break it up into multiple responses. However, this means the client does not know whether it received all the resources for its subscription. This is especially relevant for wildcard subscriptions, for which the client does not know the resources ahead of time and therefore cannot wait for them explicitly. By returning additional metadata in the nonce (there is no field for this in the delta discovery response, though I'm hoping that will change: cncf/xds#99), the client can know if the server chunked the response, and react accordingly.
PapaCharlie added a commit to linkedin/diderot that referenced this issue Oct 2, 2024
If a given delta response contains too many resources, the server will break it up into multiple responses. However, this means the client does not know whether it received all the resources for its subscription. This is especially relevant for wildcard subscriptions, for which the client does not know the resources ahead of time and therefore cannot wait for them explicitly. By returning additional metadata in the nonce (there is no field for this in the delta discovery response, though I'm hoping that will change: cncf/xds#99), the client can know if the server chunked the response, and react accordingly.
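
The nonce workaround those commits describe could look roughly like the sketch below. The JSON shape is invented for illustration and is not necessarily what diderot actually encodes; the only real constraint is that the nonce is an opaque string the client must echo back unchanged.

```go
package example

import "encoding/json"

// chunkedNonce piggybacks chunking metadata on the opaque nonce string.
type chunkedNonce struct {
	Nonce     string `json:"nonce"`     // the "real" nonce to echo back
	Remaining int    `json:"remaining"` // chunks still to come for this update
}

func encodeNonce(nonce string, remaining int) (string, error) {
	b, err := json.Marshal(chunkedNonce{Nonce: nonce, Remaining: remaining})
	return string(b), err
}

func decodeNonce(raw string) (chunkedNonce, error) {
	var cn chunkedNonce
	err := json.Unmarshal([]byte(raw), &cn)
	return cn, err
}
```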