FilterChain Discovery Service (FDS) #4540
Comments
Related add-on feature: lazy-loading of FilterChains, triggered by a listener-filter that is watching for unknown SNI values.
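A rough sketch of how that add-on could behave, using invented placeholder types (`OnDemandFilterChainLoader`, `SniLazyLoadFilter`) rather than Envoy's actual listener-filter API:

```cpp
#include <functional>
#include <string>

// Hypothetical hook into an FDS-style subscription (not a real Envoy API):
// ask the management server for the filter chain matching this SNI, then
// invoke the callback once it has been added (or the request fails).
struct OnDemandFilterChainLoader {
  void requestFilterChain(const std::string& /*sni*/, std::function<void(bool added)> done) {
    done(false);  // placeholder: a real implementation would be asynchronous
  }
};

// Listener-filter-like component: if the SNI does not match any known filter
// chain, pause the connection and trigger on-demand discovery before resuming.
class SniLazyLoadFilter {
public:
  explicit SniLazyLoadFilter(OnDemandFilterChainLoader& loader) : loader_(loader) {}

  void onSniResolved(const std::string& sni, bool filter_chain_known,
                     std::function<void()> resume_connection) {
    if (filter_chain_known) {
      resume_connection();
      return;
    }
    loader_.requestFilterChain(sni, [resume_connection](bool /*added*/) {
      // Filter chain matching runs again after discovery completes.
      resume_connection();
    });
  }

private:
  OnDemandFilterChainLoader& loader_;
};
```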
@htuch does the drain happen when a particular FilterChain is updated, and also when any FilterChain is added or deleted in the listener?
@andraxylia the intention here is to only affect new connections, right? Obviously previous connections will use the old filter chain? I just want to make sure we are on the same page. If so this sounds like a useful and straightforward feature to me.
@mattklein123 All is good if a change in the FilterChain does not affect existing TCP and gRPC connections - for instance the addition of a host in server_names in the FilterMatch. @rshriram @costinm told me that because of issues with full drain we cannot move Istio from multiple listeners to 1 listener with multiple FilterChains.
I brainstormed with @htuch and here's what we propose for now:
Any comments or concerns with this plan?
I don't understand how filter chains interact with draining? Can't old connections use the old filter chain and new ones use the new filter chain? If we want to add draining on filter chain change I think that's totally orthogonal?
@mattklein123 Yes, that's basically the conclusion @htuch and I came to. There are two orthogonal feature requests/requirements here:
The implementation of these two features is mostly/entirely non-intersecting.
OK, thanks for the clarification. IMO we don't ever need to implement draining on a per-filter-chain basis. I would track that as a totally separate issue? Fixing (1) should be simple compared to having to do draining also (not simple).
Actually, thinking through this, I think (2) would depend on (1) to work as expected. In order to modify one of the filter chains via FCDS, it would need to modify the listener and cause draining of connections on that filter chain, but not drain connections on unchanged filter chains.
But even with lazy load, why is draining required? It seems like we can write the code so that a filter chain is latched onto a connection at initial connect time? Why is it required to drain? (I can see why someone might want that, but it seems not required).
That's an interesting point. I suppose it's not required. But it would be inconsistent with current behavior for listeners: there's not currently a way to modify/remove a listener without draining. But the answer to (1) may be as simple as adding a new DrainType of "hot-restart only". That's not quite as fine-grained, but would be much less effort for most of the potential gains. It would entirely solve the case of filter chain add/remove, but would not allow forcing draining for a filter chain modify.
FWIW, my view of filter chain is that it's something that is only consulted when a connection is created. As such, draining is not required. The reason I wrote draining into listeners is that when a listener goes away, by definition it destroys all owned connections (this wouldn't be the case for filter chains). I agree that we could add a configuration option to listeners to not drain and just hard swap, though it seems like few would want this behavior?
I don't understand why filter chains would be any different in this way. We still need to ensure that all of the filter configs, etc. are not destroyed until all connections using them have closed. This could be done with a shared_ptr.
The reason that I added draining for the listener implementation was honestly mainly for simplicity due to connection ownership. I agree that connections could live beyond the life of a listener but it would be a very large change. Obviously not doing anything at all (draining vs. shared_ptr ownership) would not be good for many users. For filter chains, I think keeping the filter chain alive via shared pointer if necessary is substantially simpler and if it were me is how I would start. I don't object to adding draining in any way, I just don't think it adds much and would be pretty complicated.
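A minimal sketch of that shared_ptr ownership model, with placeholder types (`FilterChainConfig`, `Connection`) rather than Envoy's real classes:

```cpp
#include <memory>

// Placeholder for a filter chain's immutable config: TLS context, network
// filter factories, etc.
struct FilterChainConfig {};

// Each connection latches the filter chain it was accepted with; even if the
// listener's filter chain list is later replaced, the old config is destroyed
// only when the last connection referencing it closes.
class Connection {
public:
  explicit Connection(std::shared_ptr<const FilterChainConfig> chain)
      : filter_chain_(std::move(chain)) {}

private:
  std::shared_ptr<const FilterChainConfig> filter_chain_;
};
```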
Ok, that makes sense. I just wanted to make sure I understood your reasoning. What do you think about changing listeners so that they do not drain when their config is modified (but not deleted) if the only part of the config that changed is the filter chains?
Yup, makes sense, and I don't think it would be that hard.
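As a sketch of that agreement, assuming hypothetical config types (`ListenerConfig`, `FilterChainConfig`) and not Envoy's actual ListenerManager, an LDS update could swap filter chains in place when nothing else changed:

```cpp
#include <memory>
#include <vector>

// Placeholder config types; not Envoy's protos or ListenerManager.
struct FilterChainConfig {};
struct ListenerConfig {
  int everything_except_filter_chains = 0;  // stand-in for address, socket options, ...
  std::vector<std::shared_ptr<const FilterChainConfig>> filter_chains;
};

bool onlyFilterChainsChanged(const ListenerConfig& old_cfg, const ListenerConfig& new_cfg) {
  return old_cfg.everything_except_filter_chains == new_cfg.everything_except_filter_chains;
}

// On an LDS update, swap the filter chains in place instead of draining the
// whole listener when nothing else changed; existing connections keep their
// shared_ptr to the old chains and are left alone.
void applyListenerUpdate(ListenerConfig& active, const ListenerConfig& incoming) {
  if (onlyFilterChainsChanged(active, incoming)) {
    active.filter_chains = incoming.filter_chains;  // in-place swap, no drain
  } else {
    // Fall back to the usual add/drain/remove listener workflow.
  }
}
```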
SGTM to me as well, FWIW I hadn't grokked the full tradeoff between shared_ptr and drain here in the Listener implementation, so thanks @mattklein123 for explaining this for posterity.
@mattklein123 I've been thinking this through, and simply using a shared_ptr for the filter-chain will run into other issues, because right now ThreadLocal slots must be destructed on the main thread. But I can probably make it post deletions to the main thread when the refcount goes to zero somehow.
Hmm yeah that's definitely a complexity with the stored lambdas. Yeah I think you could have some type of shared_ptr wrapper that on destruction posts back to the main thread for deletion. It's definitely not trivial but doable. I suppose if TLS is the only issue you could fix that to allow deletion from any thread, but not sure if that is worth it or not.
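A rough sketch of such a wrapper, assuming a simplified `MainThreadDispatcher` stand-in for the main thread's dispatcher (Envoy's `Event::Dispatcher::post()` has similar intent, but the wiring here is invented):

```cpp
#include <functional>
#include <memory>

// Simplified stand-in for the main thread's dispatcher.
struct MainThreadDispatcher {
  void post(std::function<void()> cb) { cb(); }  // placeholder: run cb on the main thread
};

// Owns ThreadLocal slots and filter factories; must be destroyed on the main thread.
struct FilterChainImpl {};

// Wrap the filter chain in a shared_ptr whose deleter defers destruction to the
// main thread, since the last reference may drop on a worker thread.
std::shared_ptr<FilterChainImpl>
makeMainThreadDeletedChain(std::unique_ptr<FilterChainImpl> chain,
                           MainThreadDispatcher* main_dispatcher) {
  FilterChainImpl* raw = chain.release();
  return std::shared_ptr<FilterChainImpl>(raw, [main_dispatcher](FilterChainImpl* p) {
    main_dispatcher->post([p]() { delete p; });
  });
}
```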
My two cents: the draining is totally optional. If we can have the old filter chain with the old connection live forever in the same listener, then as Matt says, I see no reason to drain. Unless it's a security issue and I want to terminate old connections in a timely fashion, and not have any connections with expired TLS certs lingering around for ages.
Wouldn't not draining lead to a situation where we have long-running connections (i.e. HTTP/2, gRPC) that are latched to a filter chain that no longer exists, along with its outdated configuration, with no way to force those clients to re-connect and re-select a new filter chain?
It does, but the point is that unless these configurations pertain to some security stuff, this "forcing" thingy should be optional. In a way, we put the onus on the end user - if they want to take advantage of the newer config, they should reconnect. This gives them the option of staying with the old config or reconnecting.
@mattklein123 @lambdai do you think this is needed for 1.12.0 or can we do this as a v3 add-on once shipped?
Add-on is fine with me.
Description: Introduce filter chain context. Goal: Support the future work of adding a filter chain without draining the whole listener, and deleting one filter chain by draining only the connections associated with the deleted filter chain. The ListenerFactoryContext should cover FilterChainFactoryContext, and the filter chain context should cover the life of all the associated connections referring to the filter chain. In this PR the filter chain contexts are not yet destructed independently. I have follow-up PRs to release the power of filter chain contexts. Risk Level: LOW Testing: unit test Addressing #4540 3/N Signed-off-by: Yuchen Dai <silentdai@gmail.com>
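A simplified sketch of the per-filter-chain context idea from that commit message, using hand-rolled placeholder names (`PerFilterChainContext`, `ConnectionHandle` are not Envoy's actual classes): each filter chain's context tracks the connections created through it, so deleting that one filter chain can drain only its own connections.

```cpp
#include <list>
#include <memory>

// Placeholder connection handle; draining here just means a graceful close.
struct ConnectionHandle {
  void startDrain() {}
};

// Stand-in for a per-filter-chain factory context that outlives its filter
// chain only as long as connections created through it are still open.
class PerFilterChainContext {
public:
  void trackConnection(std::shared_ptr<ConnectionHandle> conn) {
    connections_.push_back(std::move(conn));
  }

  // Invoked when this filter chain is deleted from the listener config;
  // the listener and the other filter chains' connections stay untouched.
  void drainOwnedConnections() {
    for (auto& c : connections_) {
      c->startDrain();
    }
  }

private:
  std::list<std::shared_ptr<ConnectionHandle>> connections_;
};
```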
I was curious about the timeline for this feature (FDS). Looks like a lot of work is already committed.
@htuch will this support on-demand filter-chain loading, based on SNI or a similar param?
We've pivoted towards not doing filter-chain discovery, instead relying on improvements to LDS update (i.e. drainless updates) and ECDS. I'm going to close this out, as I think the original goals we were shooting towards are satisfied by the aforementioned. Feel free to reopen if this is not the case.
Perhaps a goal that I should have mentioned sooner, and the reason I've been tracking this issue, is to reduce the blast radius of an LDS NACK. My current NACK handling breaks up Resources into individual DeltaDiscoveryResponses to see which of the Resources produced the NACK and not resend it. This works well for granular resources like SDS, EDS, and CDS, which may only affect a single service. An LDS NACK affects every service with traffic flowing through its port. For me that's thousands of services on a single compute cluster.
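A sketch of that NACK-isolation strategy with invented types (not the xDS protos or any control-plane library): each resource is pushed in its own delta response so a NACK can be attributed to, and withheld for, exactly one resource.

```cpp
#include <string>
#include <vector>

// Invented resource representation for illustration only.
struct Resource {
  std::string name;
  std::string serialized_config;
};

// Placeholder transport hook: sends a DeltaDiscoveryResponse containing a
// single resource and blocks until the client ACKs (true) or NACKs (false).
bool sendSingleResourceAndAwaitAck(const Resource& /*r*/) { return true; }

// Push each resource in its own response; NACKed resources are withheld from
// later pushes, limiting the blast radius to the services that own them.
std::vector<Resource> pushIsolatingNacks(const std::vector<Resource>& all) {
  std::vector<Resource> accepted;
  for (const auto& r : all) {
    if (sendSingleResourceAndAwaitAck(r)) {
      accepted.push_back(r);
    }
  }
  return accepted;
}
```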
@phylake what typically causes these NACKs? I'm wondering if you move to ECDS whether enough of the config is out of the
We will never have on-demand update of filter chain.
@htuch nothing typically, but I think I've been lucky in that regard. Diligent input validation and integration tests have kept them at bay so far, but it's something of a ticking time bomb from my perspective. I have Kubernetes CRDs (created by service developers) that I translate into Envoy protobufs. This is a lossy transformation, so it's difficult to know what input created an output that was NACKed, which is why I do NACK handling as described above. The more of Envoy's config surface area I expose via these CRDs, the more opportunity there is for me to miss something that Envoy might NACK. It's not a theoretical problem, it's a very practical problem I've just been able to avoid so far in production. ECDS looks to skip past quite a bit of configuration on a Listener down to a FilterChain's Filters, where you'd find the TypedConfig. Certainly it could help, but it's not what I'm looking for.
Have you considered doing more extensive validation offline or via canary? I think what you describe is a valid reason to have finer grained resources in the model, but other mechanisms to ensure that NACK is unlikely to happen will pay off more broadly.
Not sure what you mean by offline. Yes, we canary and have non-prod environments developers progress through. That, in addition to extensive unit, integration, and functional testing, all contributes to finding issues before they hit prod. That's not really the point though: FDS plus the NACK handling I described was going to be that last measure in reducing the blast radius if all else fails. I have the granularity of resources I need everywhere else.
I'm going to reopen as this is a valid consideration, plus if we did want to lazy load the filter chains, we would need this.
Again, can the on-demand loading of filter chains be a part of this? If everything is on-demand, can you please make this on-demand as well.
I'd like to revisit this after the unified filter chain matcher is adopted in Istio. That API enforces the filter chain name. This change greatly reduces the complexity of FDS.
I see a few benefits of developing FCDS.
Is this feature actively being worked on, @htuch?
Folks, this is the implementation PR for FCDS (SotW). Could you please review?
In order to support filter chain updates without forcing a full listener drain, it's been suggested we add a FilterChain Discovery Service. This would allow individual FilterChains belonging to a Listener to be independently discovered and updated.
This would also lend support to finer grained Listener resource naming for incremental and on-demand xDS (generalizing #2500).
Opening this issue to track any design work and ownership.
@ggreenway @andraxylia @lizan @PiotrSikora @mattklein123