FilterChain Discovery Service (FDS) #4540

Open
Tracked by #1039
htuch opened this issue Sep 26, 2018 · 47 comments
Labels: api/v3 · design proposal · help wanted

@htuch
Member

htuch commented Sep 26, 2018

In order to support filter chain updates without forcing a full listener drain, it's been suggested we add a FilterChain Discovery Service. This would allow individual FilterChains belonging to a Listener to be independently discovered and updated.

This would also lend support to finer-grained Listener resource naming for incremental and on-demand xDS (generalizing #2500).

Opening this issue to track any design work and ownership.

@ggreenway @andraxylia @lizan @PiotrSikora @mattklein123

@htuch htuch added the design proposal, help wanted, and api/v2 labels Sep 26, 2018
@andraxylia

We need this fix for Istio for long-lived connections with SNI. @rshriram @costinm

@ggreenway
Contributor

Related add-on feature: lazy-loading of FilterChains, triggered by a listener-filter that is watching for unknown SNI values.
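
For illustration, here is a minimal, self-contained C++ sketch of that lazy-load idea. None of these types are Envoy's actual APIs (the `ListenerFilterCallbacks`, `FilterChainFetcher`, and `LazySniListenerFilter` names are hypothetical); it only shows the shape of a listener filter that pauses a connection with an unknown SNI, requests the matching filter chain on demand, and resumes selection once the fetch completes:

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-ins for Envoy types; the real listener-filter API differs.
enum class FilterStatus { Continue, StopIteration };

struct ListenerFilterCallbacks {
  std::string sni;  // server name parsed from the TLS ClientHello
  std::function<void()> continue_filter_chain_selection;  // resume filter chain matching
};

// Hypothetical async fetcher that would ask an FDS server for a chain by name.
using FilterChainFetcher =
    std::function<void(const std::string& name, std::function<void(bool)> done)>;

// Listener filter that pauses connections with an unknown SNI and requests the
// matching filter chain on demand before letting filter chain selection proceed.
class LazySniListenerFilter {
public:
  LazySniListenerFilter(std::set<std::string> loaded_chains, FilterChainFetcher fetch)
      : loaded_chains_(std::move(loaded_chains)), fetch_(std::move(fetch)) {}

  FilterStatus onAccept(ListenerFilterCallbacks& cb) {
    if (loaded_chains_.count(cb.sni) > 0) {
      return FilterStatus::Continue;  // chain already resident; match as usual
    }
    // Unknown SNI: pause this connection and ask for the chain on demand.
    const std::string name = cb.sni;
    auto resume = cb.continue_filter_chain_selection;
    fetch_(name, [this, name, resume](bool loaded) {
      if (loaded) {
        loaded_chains_.insert(name);
      }
      resume();  // resume selection; falls back to the default chain on failure
    });
    return FilterStatus::StopIteration;
  }

private:
  std::set<std::string> loaded_chains_;
  FilterChainFetcher fetch_;
};
```

A real implementation would also have to handle concurrent fetches for the same SNI, timeouts, and thread safety, all of which are omitted here.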

@andraxylia

@htuch does the drain happen when a particular FilterChain is updated, and also when any FilterChain is added or deleted in the listener?

@mattklein123
Member

We need this fix for Istio for long-lived connections with SNI

@andraxylia the intention here is to only affect new connections, right? Obviously previous connections will use the old filter chain? I just want to make sure we are on the same page. If so, this sounds like a useful and straightforward feature to me.

@andraxylia

@mattklein123 All is good if a change in the FilterChain does not affect existing TCP and gRPC connections - for instance, the addition of a host to server_names in the FilterChainMatch.

@rshriram and @costinm told me that, because of issues with full drains, we cannot move Istio from multiple listeners to one listener with multiple FilterChains.

@ggreenway
Contributor

I brainstormed with @htuch and here's what we propose for now:

  • Listeners will specify a list of FilterChains by name to discover. That means that adding a new static FilterChain will involve modifying the listener. This model is the best way to support lazy-loading of FilterChains.
  • We will modify the current listener implementation so that if a listener has filter chains added, modified, or removed, draining only happens to connections on the affected filter chain. Other connections do not need to drain. This will be done as a separate PR, as it does not require filterchain discovery.
  • FilterChain Discovery Service is geared towards the lazy-load use case. Statically defined/loaded filter chains can be added/modified/removed with the existing LDS.

Any comments or concerns with this plan?

@mattklein123
Member

I don't understand how filter chains interact with draining? Can't old connections use the old filter chain and new ones use the new filter chain? If we want to add draining on filter chain change I think that's totally orthogonal?

@ggreenway
Contributor

@mattklein123 Yes, that's basically the conclusion @htuch and I came to. There are two orthogonal feature requests/requirements here:

  1. Be able to add/remove filter chains without draining. Right now this is not possible, purely due to the implementation of listener config changes. The API does not need to change to fix this.
  2. Lazy-load filter chains.

The implementation of these two features is mostly/entirely non-intersecting.

@mattklein123
Member

OK, thanks for the clarification. IMO we don't ever need to implement draining on a per-filter-chain basis. I would track that as a totally separate issue? Fixing (1) should be simple compared to having to do draining as well (not simple).

@ggreenway
Contributor

Actually, thinking through this, I think (2) would depend on (1) to work as expected. Modifying one of the filter chains via FCDS would need to modify the listener and cause draining of connections on that filter chain, but not drain connections on unchanged filter chains.

@mattklein123
Member

But even with lazy load, why is draining required? It seems like we can write the code so that a filter chain is latched onto a connection at initial connect time? Why is it required to drain? (I can see why someone might want that, but it seems not required).

@ggreenway
Contributor

That's an interesting point. I suppose it's not required. But it would be inconsistent with the current behavior for listeners: there's currently no way to modify/remove a listener without draining. But the answer to (1) may be as simple as adding a new DrainType of "hot-restart only". That's not quite as fine-grained, but it would be much less effort for most of the potential gains. It would entirely solve the case of filter chain add/remove, but would not allow forcing draining for a filter chain modify.

@mattklein123
Member

FWIW, my view of a filter chain is that it's something that is only consulted when a connection is created. As such, draining is not required. The reason I wrote draining into listeners is that when a listener goes away, by definition it destroys all owned connections (this wouldn't be the case for filter chains). I agree that we could add a configuration option to listeners to not drain and just hard swap, though it seems like few would want this behavior?
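
To make the "consulted only at connection creation" point concrete, here is a small, self-contained sketch (hypothetical types, not Envoy's real classes): each connection latches a `shared_ptr` to the immutable filter chain config it was accepted with, so swapping a chain in the listener only affects connections accepted afterwards:

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical immutable filter-chain config; stands in for Envoy's real FilterChain.
struct FilterChainConfig {
  std::string name;
  std::string tls_context;  // illustrative field only
};

struct Connection {
  // Latched at accept time; keeps the old config alive even after a swap.
  std::shared_ptr<const FilterChainConfig> filter_chain;
};

class Listener {
public:
  // The filter chain is consulted only when a connection is created.
  Connection accept(const std::string& sni) {
    Connection c;
    auto it = chains_.find(sni);
    c.filter_chain = (it != chains_.end()) ? it->second : default_chain_;
    return c;
  }

  // Swapping a chain does not touch existing connections: each one still holds
  // a shared_ptr to the config it was accepted with, which is destroyed only
  // when the last such connection closes. No drain is required.
  void addOrUpdateChain(std::shared_ptr<const FilterChainConfig> chain) {
    const std::string name = chain->name;
    chains_[name] = std::move(chain);
  }

private:
  std::unordered_map<std::string, std::shared_ptr<const FilterChainConfig>> chains_;
  std::shared_ptr<const FilterChainConfig> default_chain_ =
      std::make_shared<const FilterChainConfig>(FilterChainConfig{"default", ""});
};
```

The old config simply lives until the last connection that latched it goes away, which is the shared_ptr ownership question discussed below.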

@ggreenway
Contributor

The reason I wrote draining into listeners is that when a listener goes away, by definition it destroys all owned connections (this wouldn't be the case for filter chains).

I don't understand why filter chains would be any different in this way. We still need to ensure that all of the filter configs, etc. are not destroyed until all connections using them have closed. This could be done with shared_ptr or similar, but the same could be said for listeners (I'm not arguing that we want to do that for listeners, just that we could). I don't think I understand what you have in mind for the different lifetimes of listeners vs filter chains and why they're fundamentally different.

@mattklein123
Member

I don't think I understand what you have in mind for the different lifetimes of listeners vs filter chains and why they're fundamentally different.

The reason that I added draining to the listener implementation was honestly mainly for simplicity due to connection ownership. I agree that connections could live beyond the life of a listener, but it would be a very large change. Obviously, doing nothing at all (neither draining nor shared_ptr ownership) would not be good for many users.

For filter chains, I think keeping the filter chain alive via a shared pointer, if necessary, is substantially simpler, and if it were me, that's how I would start. I don't object to adding draining in any way; I just don't think it adds much and it would be pretty complicated.

@ggreenway
Contributor

Ok, that makes sense. I just wanted to make sure I understood your reasoning.

What do you think about changing listeners so that they do not drain when their config is modified (but not deleted) if the only part of the config that changed is the filter chains?

@mattklein123
Member

What do you think about changing listeners so that they do not drain when their config is modified (but not deleted) if the only part of the config that changed is the filter chains?

Yup makes sense and I don't think would be that hard.

@htuch
Member Author

htuch commented Jan 9, 2019

SGTM as well. FWIW, I hadn't grokked the full tradeoff between shared_ptr and drain here in the Listener implementation, so thanks @mattklein123 for explaining this for posterity.

@ggreenway
Contributor

@mattklein123 I've been thinking this through, and simply using a shared_ptr for the filter-chain will run into other issues, because right now ThreadLocal slots must be destructed on the main thread. But I can probably make it post deletions to the main thread when the refcount goes to zero somehow.

@mattklein123
Member

Hmm, yeah, that's definitely a complexity with the stored lambdas. I think we could have some type of shared_ptr wrapper that on destruction posts back to the main thread for deletion. It's definitely not trivial but doable. I suppose if TLS is the only issue, you could fix that to allow deletion from any thread, but I'm not sure whether that is worth it or not.
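
A rough sketch of that wrapper idea, assuming only a `post()`-style main-thread dispatcher (the `MainThreadDispatcher` type and `makeMainThreadDeleted()` helper here are hypothetical, not Envoy APIs): the custom deleter hands the object back to the main thread instead of destroying it on whichever worker drops the last reference, which sidesteps the "ThreadLocal slots must be destructed on the main thread" constraint:

```cpp
#include <functional>
#include <memory>

// Hypothetical stand-in for the main-thread dispatcher; a real one would queue
// the callback onto the main thread's event loop instead of running it inline.
struct MainThreadDispatcher {
  void post(std::function<void()> cb) { cb(); }
};

// Wrap an object in a shared_ptr whose deleter defers destruction to the main
// thread. Worker threads may drop their references from anywhere; the actual
// destructor (and any ThreadLocal slot teardown it triggers) runs on main.
template <typename T>
std::shared_ptr<T> makeMainThreadDeleted(std::unique_ptr<T> obj, MainThreadDispatcher& main) {
  T* raw = obj.release();
  return std::shared_ptr<T>(raw, [&main](T* p) { main.post([p]() { delete p; }); });
}
```

Usage would look something like `auto chain = makeMainThreadDeleted(std::move(filter_chain_impl), main_dispatcher);` (both names hypothetical); workers can then hold and drop `chain` from any thread while teardown is always serialized onto the main thread.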

@mattklein123 mattklein123 added this to the 1.10.0 milestone Jan 11, 2019
@rshriram
Member

My two cents: the draining is totally optional. If we can have the old filter chain and the old connection live forever in the same listener, then, as Matt says, I see no reason to drain. Unless it's a security issue and I want to terminate old connections in a timely fashion, and not have any connections with expired TLS certs lingering around for ages.

@PiotrSikora
Contributor

Wouldn't not draining lead to a situation where we have long-running connections (e.g. HTTP/2, gRPC) that are latched to a filter chain that no longer exists, along with its outdated configuration, with no way to force those clients to reconnect and re-select a new filter chain?

@rshriram
Member

It does, but the point is that unless these configurations pertain to security, this "forcing" should be optional. In a way, we put the onus on the end user: if they want to take advantage of the newer config, they should reconnect. That gives them the option of staying with the old config or reconnecting.

@htuch
Member Author

htuch commented Sep 23, 2019

@mattklein123 @lambdai do you think this is needed for 1.12.0 or can we do this as a v3 add-on once shipped?

@mattklein123
Member

Add on is fine with me.

@htuch htuch removed this from the 1.12.0 milestone Sep 24, 2019
junr03 pushed a commit that referenced this issue Jan 8, 2020
Description: Introduce filter chain context.
Goal: Support the future work of adding a filter chain without draining the whole listener, and of deleting one filter chain by draining only the connections associated with the deleted filter chain.

The ListenerFactoryContext should cover FilterChainFactoryContext, and the filter chain context should cover the lifetime of all the associated connections referring to the filter chain.

In this PR the filter chain contexts are not yet destructed independently. I have follow-up PRs to realize the full power of filter chain contexts.

Risk Level: LOW
Testing: unit test

Addressing #4540 3/N

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
prakhag-1 pushed a commit to prakhag-1/envoy that referenced this issue Jan 8, 2020
@hirenshah05

I was curious about the timeline for this feature (FDS). It looks like a lot of the work has already been committed.

@vorishirne

@htuch will this support on-demand filter chains, based on SNI or a similar parameter?

@htuch
Member Author

htuch commented Apr 5, 2022

We've pivoted towards not doing filter-chain discovery, instead relying on improvements to LDS updates (i.e. drainless updates) and ECDS. I'm going to close this out, as I think the original goals we were shooting for are satisfied by the aforementioned. Feel free to reopen if this is not the case.

@phylake

phylake commented Apr 5, 2022

Perhaps a goal that I should have mentioned sooner, and the reason I've been tracking this issue, is to reduce the blast radius of an LDS NACK. My current NACK handling breaks up Resources into individual DeltaDiscoveryResponses to see which of the Resources produced the NACK and then not resend it. This works well for granular resources like SDS, EDS, and CDS, which may only affect a single service.

An LDS NACK affects every service with traffic flowing through its port. For me, that's thousands of services on a single compute cluster.
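
For context, a rough, self-contained sketch of the NACK-isolation approach described above (hypothetical helpers, not any real xDS control-plane library): each resource is pushed in its own delta response, and a NACKed resource is quarantined rather than resent, so a bad Listener update only affects that one resource instead of every service behind the port:

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical resource and push hook; a real control plane would send an
// actual DeltaDiscoveryResponse per resource and wait for the ACK/NACK on the
// xDS stream before moving on.
struct Resource {
  std::string name;
  std::string serialized_config;
};
using PushOne = std::function<bool(const Resource&)>;  // true = ACK, false = NACK

// Push each resource in its own delta response so a NACK identifies exactly one
// offending resource; that resource is quarantined (not resent) while the rest
// of the configuration keeps flowing.
std::set<std::string> pushIsolatingNacks(const std::vector<Resource>& resources,
                                         const PushOne& push) {
  std::set<std::string> quarantined;
  for (const Resource& r : resources) {
    if (!push(r)) {
      quarantined.insert(r.name);  // the client keeps its last ACKed version of this resource
    }
  }
  return quarantined;
}
```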

@htuch
Member Author

htuch commented Apr 6, 2022

@phylake what typically causes these NACKs? I'm wondering whether, if you move to ECDS, enough of the config moves out of the Listener resource that this would make a material difference. If not, maybe we should reopen.

@vorishirne

I think the original goals we were shooting towards are satisfied by the aforementioned. Feel free to reopen if this is not the case.

We will never have on-demand updates of filter chains.

@phylake

phylake commented Apr 6, 2022

@htuch nothing typically, but I think I've been lucky in that regard. Diligent input validation and integration tests have kept them at bay so far, but it's something of a ticking time bomb from my perspective. I have Kubernetes CRDs (created by service developers) that I translate into Envoy protobufs. This is a lossy transformation, so it's difficult to know which input created an output that was NACKed, which is why I do the NACK handling described above. The more of Envoy's config surface area I expose via these CRDs, the more opportunity there is for me to miss something that Envoy might NACK. It's not a theoretical problem; it's a very practical problem that I've just been able to avoid so far in production.

ECDS looks to skip past quite a bit of configuration on a Listener, down to a FilterChain's Filters, where you'd find the TypedConfig. It could certainly help, but it's not what I'm looking for.

@htuch
Member Author

htuch commented Apr 7, 2022

Have you considered doing more extensive validation offline or via canary? I think what you describe is a valid reason to have finer-grained resources in the model, but other mechanisms to ensure that a NACK is unlikely to happen will pay off more broadly.

@phylake

phylake commented Apr 7, 2022

Not sure what you mean by offline. Yes, we canary and have non-prod environments that developers progress through. That, in addition to extensive unit, integration, and functional testing, all contributes to finding issues before they hit prod. That's not really the point, though: FDS plus the NACK handling I described was going to be the last measure for reducing the blast radius if all else fails. I have the granularity of resources I need everywhere else.

@htuch
Member Author

htuch commented Apr 7, 2022

I'm going to reopen as this is a valid consideration, plus if we did want to lazy load the filter chains, we would need this.

@htuch htuch reopened this Apr 7, 2022
@vorishirne

Again, can on-demand loading of filter chains be a part of this? If everything else is on-demand, can you please make this on-demand as well?

@lambdai
Contributor

lambdai commented Apr 22, 2022

I'd like to revisit this after the unified filter chain matcher is adopted in Istio.

That API enforces filter chain names. This change would greatly reduce the complexity of FDS.

@ssripadham

ssripadham commented Aug 24, 2022

I see a few benefits of developing FCDS.

  1. It allows us to model customizations within a single listener.
    Example:
    Listener
    FilterChain1: Some unique rules in fcds1.conf
    FilterChain2: Some unique rules in fcds2.conf
    ...
  2. Support breaking a single LDS file into multiple, more manageable FCDS files. Imagine these FCDS files holding many (1K+) lines of Envoy YAML.
  3. Filter chain discovery & filter-chain-only updates (as I understand it, this already exists in LDS itself).
  4. Filter chains can be loaded/unloaded on demand. This way, not all of the filter chains need to be held in process memory.
  5. No need to send the full LDS config to do a protobuf diff between the xDS server and the xDS client; smaller FCDS configs can be sent instead.
  6. Any other benefit?

Is this feature being actively worked on, @htuch?

@rakeshdatta

rakeshdatta commented Sep 13, 2022

Folks, this is the implementation PR for FCDS (SoTW). Could you please review?
Next steps:

  1. Test, and possibly hook in some more changes (the logic from the LDS filter-chain-only update), for filter-chain-level connection draining
  2. Delta updates for FCDS

cc @htuch @mattklein123 @ssripadham
