FilterChain Discovery Service (FDS) #4540

Open
Tracked by #1039
htuch opened this issue Sep 26, 2018 · 47 comments
Labels: api/v3 · design proposal · help wanted

@htuch
Member

htuch commented Sep 26, 2018

In order to support filter chain updates without forcing a full listener drain, it's been suggested we add a FilterChain Discovery Service. This would allow individual FilterChains belonging to a Listener to be independently discovered and updated.

This would also lend support to finer-grained Listener resource naming for incremental and on-demand xDS (generalizing #2500).

Opening this issue to track any design work and ownership.

@ggreenway @andraxylia @lizan @PiotrSikora @mattklein123

@htuch htuch added the design proposal, help wanted, and api/v2 labels Sep 26, 2018
@andraxylia

We need this fix for Istio for long-lived connections with SNI. @rshriram @costinm

@ggreenway
Contributor

Related add-on feature: lazy-loading of FilterChains, triggered by a listener-filter that is watching for unknown SNI values.
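
For illustration, here is a minimal, self-contained C++ sketch of that lazy-load idea. None of these types are Envoy's actual APIs (the `ListenerFilterCallbacks`, `FilterChainFetcher`, and `LazySniListenerFilter` names are hypothetical); it only shows the shape of a listener filter that pauses a connection with an unknown SNI, requests the matching filter chain on demand, and resumes selection once the fetch completes:

```cpp
#include <functional>
#include <set>
#include <string>
#include <utility>

// Hypothetical stand-ins for Envoy types; the real listener-filter API differs.
enum class FilterStatus { Continue, StopIteration };

struct ListenerFilterCallbacks {
  std::string sni;  // server name parsed from the TLS ClientHello
  std::function<void()> continue_filter_chain_selection;  // resume filter chain matching
};

// Hypothetical async fetcher that would ask an FDS server for a chain by name.
using FilterChainFetcher =
    std::function<void(const std::string& name, std::function<void(bool)> done)>;

// Listener filter that pauses connections with an unknown SNI and requests the
// matching filter chain on demand before letting filter chain selection proceed.
class LazySniListenerFilter {
public:
  LazySniListenerFilter(std::set<std::string> loaded_chains, FilterChainFetcher fetch)
      : loaded_chains_(std::move(loaded_chains)), fetch_(std::move(fetch)) {}

  FilterStatus onAccept(ListenerFilterCallbacks& cb) {
    if (loaded_chains_.count(cb.sni) > 0) {
      return FilterStatus::Continue;  // chain already resident; match as usual
    }
    // Unknown SNI: pause this connection and ask for the chain on demand.
    const std::string name = cb.sni;
    auto resume = cb.continue_filter_chain_selection;
    fetch_(name, [this, name, resume](bool loaded) {
      if (loaded) {
        loaded_chains_.insert(name);
      }
      resume();  // resume selection; falls back to the default chain on failure
    });
    return FilterStatus::StopIteration;
  }

private:
  std::set<std::string> loaded_chains_;
  FilterChainFetcher fetch_;
};
```

A real implementation would also have to handle concurrent fetches for the same SNI, timeouts, and thread safety, all of which are omitted here.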

@andraxylia

@htuch does the drain happen when a particular FilterChain is updated, and also when any FilterChain is added or deleted in the listener?

@mattklein123
Member

We need this fix for Istio for long-lived connections with SNI

@andraxylia the intention here is to only affect new connections, right? Obviously previous connections will use the old filter chain? I just want to make sure we are on the same page. If so, this sounds like a useful and straightforward feature to me.

@andraxylia

@mattklein123 All is good if a change in the FilterChain does not affect existing TCP and gRPC connections - for instance, the addition of a host to server_names in the FilterChainMatch.

@rshriram and @costinm told me that, because of issues with full drains, we cannot move Istio from multiple listeners to one listener with multiple FilterChains.

@ggreenway
Contributor

I brainstormed with @htuch and here's what we propose for now:

  • Listeners will specify a list of FilterChains by name to discover. That means that adding a new static FilterChain will involve modifying the listener. This model is the best way to support lazy-loading of FilterChains.
  • We will modify the current listener implementation so that if a listener has filter chains added, modified, or removed, draining only happens to connections on the affected filter chain. Other connections do not need to drain. This will be done as a separate PR, as it does not require filterchain discovery.
  • FilterChain Discovery Service is geared towards the lazy-load use case. Statically defined/loaded filter chains can be added/modified/removed with the existing LDS.

Any comments or concerns with this plan?

@mattklein123
Member

I don't understand how filter chains interact with draining? Can't old connections use the old filter chain and new ones use the new filter chain? If we want to add draining on filter chain change I think that's totally orthogonal?

@ggreenway
Contributor

@mattklein123 Yes, that's basically the conclusion @htuch and I came to. There are two orthogonal feature requests/requirements here:

  1. Be able to add/remove filter chains without draining. Right now this is not possible, purely due to the implementation of listener config changes. The API does not need to change to fix this.
  2. Lazy-load filter chains.

The implementation of these two features is mostly/entirely non-intersecting.

@mattklein123
Member

OK, thanks for the clarification. IMO we don't ever need to implement draining on a per-filter-chain basis. I would track that as a totally separate issue? Fixing (1) should be simple compared to having to do draining as well (not simple).

@ggreenway
Contributor

Actually, thinking through this, I think (2) would depend on (1) to work as expected. Modifying one of the filter chains via FCDS would need to modify the listener and cause draining of connections on that filter chain, but not drain connections on unchanged filter chains.

@mattklein123
Member

But even with lazy load, why is draining required? It seems like we can write the code so that a filter chain is latched onto a connection at initial connect time? Why is it required to drain? (I can see why someone might want that, but it seems not required).

@ggreenway
Contributor

That's an interesting point. I suppose it's not required. But it would be inconsistent with the current behavior for listeners: there's currently no way to modify/remove a listener without draining. But the answer to (1) may be as simple as adding a new DrainType of "hot-restart only". That's not quite as fine-grained, but it would be much less effort for most of the potential gains. It would entirely solve the case of filter chain add/remove, but would not allow forcing draining for a filter chain modify.

@mattklein123
Member

FWIW, my view of a filter chain is that it's something that is only consulted when a connection is created. As such, draining is not required. The reason I wrote draining into listeners is that when a listener goes away, by definition it destroys all owned connections (this wouldn't be the case for filter chains). I agree that we could add a configuration option to listeners to not drain and just hard swap, though it seems like few would want this behavior?
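
To make the "consulted only at connection creation" point concrete, here is a small, self-contained sketch (hypothetical types, not Envoy's real classes): each connection latches a `shared_ptr` to the immutable filter chain config it was accepted with, so swapping a chain in the listener only affects connections accepted afterwards:

```cpp
#include <memory>
#include <string>
#include <unordered_map>
#include <utility>

// Hypothetical immutable filter-chain config; stands in for Envoy's real FilterChain.
struct FilterChainConfig {
  std::string name;
  std::string tls_context;  // illustrative field only
};

struct Connection {
  // Latched at accept time; keeps the old config alive even after a swap.
  std::shared_ptr<const FilterChainConfig> filter_chain;
};

class Listener {
public:
  // The filter chain is consulted only when a connection is created.
  Connection accept(const std::string& sni) {
    Connection c;
    auto it = chains_.find(sni);
    c.filter_chain = (it != chains_.end()) ? it->second : default_chain_;
    return c;
  }

  // Swapping a chain does not touch existing connections: each one still holds
  // a shared_ptr to the config it was accepted with, which is destroyed only
  // when the last such connection closes. No drain is required.
  void addOrUpdateChain(std::shared_ptr<const FilterChainConfig> chain) {
    const std::string name = chain->name;
    chains_[name] = std::move(chain);
  }

private:
  std::unordered_map<std::string, std::shared_ptr<const FilterChainConfig>> chains_;
  std::shared_ptr<const FilterChainConfig> default_chain_ =
      std::make_shared<const FilterChainConfig>(FilterChainConfig{"default", ""});
};
```

The old config simply lives until the last connection that latched it goes away, which is the shared_ptr ownership question discussed below.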

@ggreenway
Contributor

The reason I wrote draining into listeners is that when a listener goes away, by definition it destroys all owned connections (this wouldn't be the case for filter chains).

I don't understand why filter chains would be any different in this way. We still need to ensure that all of the filter configs, etc. are not destroyed until all connections using them have closed. This could be done with shared_ptr or similar, but the same could be said for listeners (I'm not arguing that we want to do that for listeners, just that we could). I don't think I understand what you have in mind for the different lifetimes of listeners vs filter chains and why they're fundamentally different.

@mattklein123
Member

I don't think I understand what you have in mind for the different lifetimes of listeners vs filter chains and why they're fundamentally different.

The reason that I added draining to the listener implementation was honestly mainly for simplicity due to connection ownership. I agree that connections could live beyond the life of a listener, but it would be a very large change. Obviously, doing nothing at all (neither draining nor shared_ptr ownership) would not be good for many users.

For filter chains, I think keeping the filter chain alive via a shared pointer, if necessary, is substantially simpler, and if it were me, that's how I would start. I don't object to adding draining in any way; I just don't think it adds much and it would be pretty complicated.

@ggreenway
Contributor

Ok, that makes sense. I just wanted to make sure I understood your reasoning.

What do you think about changing listeners so that they do not drain when their config is modified (but not deleted) if the only part of the config that changed is the filter chains?

@mattklein123
Member

What do you think about changing listeners so that they do not drain when their config is modified (but not deleted) if the only part of the config that changed is the filter chains?

Yup makes sense and I don't think would be that hard.

@htuch
Member Author

htuch commented Jan 9, 2019

SGTM as well. FWIW, I hadn't grokked the full tradeoff between shared_ptr and drain here in the Listener implementation, so thanks @mattklein123 for explaining this for posterity.

@ggreenway
Contributor

@mattklein123 I've been thinking this through, and simply using a shared_ptr for the filter-chain will run into other issues, because right now ThreadLocal slots must be destructed on the main thread. But I can probably make it post deletions to the main thread when the refcount goes to zero somehow.

@mattklein123
Member

Hmm, yeah, that's definitely a complexity with the stored lambdas. I think we could have some type of shared_ptr wrapper that on destruction posts back to the main thread for deletion. It's definitely not trivial but doable. I suppose if TLS is the only issue, you could fix that to allow deletion from any thread, but I'm not sure whether that is worth it or not.
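
A rough sketch of that wrapper idea, assuming only a `post()`-style main-thread dispatcher (the `MainThreadDispatcher` type and `makeMainThreadDeleted()` helper here are hypothetical, not Envoy APIs): the custom deleter hands the object back to the main thread instead of destroying it on whichever worker drops the last reference, which sidesteps the "ThreadLocal slots must be destructed on the main thread" constraint:

```cpp
#include <functional>
#include <memory>

// Hypothetical stand-in for the main-thread dispatcher; a real one would queue
// the callback onto the main thread's event loop instead of running it inline.
struct MainThreadDispatcher {
  void post(std::function<void()> cb) { cb(); }
};

// Wrap an object in a shared_ptr whose deleter defers destruction to the main
// thread. Worker threads may drop their references from anywhere; the actual
// destructor (and any ThreadLocal slot teardown it triggers) runs on main.
template <typename T>
std::shared_ptr<T> makeMainThreadDeleted(std::unique_ptr<T> obj, MainThreadDispatcher& main) {
  T* raw = obj.release();
  return std::shared_ptr<T>(raw, [&main](T* p) { main.post([p]() { delete p; }); });
}
```

Usage would look something like `auto chain = makeMainThreadDeleted(std::move(filter_chain_impl), main_dispatcher);` (both names hypothetical); workers can then hold and drop `chain` from any thread while teardown is always serialized onto the main thread.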

@mattklein123 mattklein123 added this to the 1.10.0 milestone Jan 11, 2019
@rshriram
Member

My two cents: the draining is totally optional. If we can have the old filter chain and the old connection live forever in the same listener, then, as Matt says, I see no reason to drain. Unless it's a security issue and I want to terminate old connections in a timely fashion, and not have any connections with expired TLS certs lingering around for ages.

@PiotrSikora
Contributor

Wouldn't not draining lead to a situation where we have long-running connections (e.g. HTTP/2, gRPC) that are latched to a filter chain that no longer exists, along with its outdated configuration, with no way to force those clients to reconnect and re-select a new filter chain?

@rshriram
Member

It does, but the point is that unless these configurations pertain to security, this "forcing" should be optional. In a way, we put the onus on the end user: if they want to take advantage of the newer config, they should reconnect. That gives them the option of staying with the old config or reconnecting.

@htuch
Member Author

htuch commented Sep 23, 2019

@mattklein123 @lambdai do you think this is needed for 1.12.0 or can we do this as a v3 add-on once shipped?

@mattklein123
Member

Add on is fine with me.

@htuch htuch removed this from the 1.12.0 milestone Sep 24, 2019
junr03 pushed a commit that referenced this issue Jan 8, 2020
Description: Introduce filter chain context.
Goal: Support the future work of adding a filter chain without draining the whole listener, and of deleting one filter chain by draining only the connections associated with the deleted filter chain.

The ListenerFactoryContext should cover FilterChainFactoryContext, and the filter chain context should cover the lifetime of all the associated connections referring to the filter chain.

In this PR the filter chain contexts are not yet destructed independently. I have follow-up PRs to realize the full power of filter chain contexts.

Risk Level: LOW
Testing: unit test

Addressing #4540 3/N

Signed-off-by: Yuchen Dai <silentdai@gmail.com>
prakhag-1 pushed a commit to prakhag-1/envoy that referenced this issue Jan 8, 2020
@hirenshah05

I was curious about the timeline for this feature (FDS). It looks like a lot of the work has already been committed.

@vorishirne

@htuch will this support on-demand filter chains, based on SNI or a similar parameter?

@htuch
Member Author

htuch commented Apr 5, 2022

We've pivoted towards not doing filter-chain discovery, instead relying on improvements to LDS updates (i.e. drainless updates) and ECDS. I'm going to close this out, as I think the original goals we were shooting for are satisfied by the aforementioned. Feel free to reopen if this is not the case.

@phylake

phylake commented Apr 5, 2022

Perhaps a goal that I should have mentioned sooner, and the reason I've been tracking this issue, is to reduce the blast radius of an LDS NACK. My current NACK handling breaks up Resources into individual DeltaDiscoveryResponses to see which of the Resources produced the NACK and then not resend it. This works well for granular resources like SDS, EDS, and CDS, which may only affect a single service.

An LDS NACK affects every service with traffic flowing through its port. For me, that's thousands of services on a single compute cluster.
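
For context, a rough, self-contained sketch of the NACK-isolation approach described above (hypothetical helpers, not any real xDS control-plane library): each resource is pushed in its own delta response, and a NACKed resource is quarantined rather than resent, so a bad Listener update only affects that one resource instead of every service behind the port:

```cpp
#include <functional>
#include <set>
#include <string>
#include <vector>

// Hypothetical resource and push hook; a real control plane would send an
// actual DeltaDiscoveryResponse per resource and wait for the ACK/NACK on the
// xDS stream before moving on.
struct Resource {
  std::string name;
  std::string serialized_config;
};
using PushOne = std::function<bool(const Resource&)>;  // true = ACK, false = NACK

// Push each resource in its own delta response so a NACK identifies exactly one
// offending resource; that resource is quarantined (not resent) while the rest
// of the configuration keeps flowing.
std::set<std::string> pushIsolatingNacks(const std::vector<Resource>& resources,
                                         const PushOne& push) {
  std::set<std::string> quarantined;
  for (const Resource& r : resources) {
    if (!push(r)) {
      quarantined.insert(r.name);  // the client keeps its last ACKed version of this resource
    }
  }
  return quarantined;
}
```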

@htuch
Member Author

htuch commented Apr 6, 2022

@phylake what typically causes these NACKs? I'm wondering whether, if you move to ECDS, enough of the config moves out of the Listener resource that this would make a material difference. If not, maybe we should reopen.

@vorishirne

I think the original goals we were shooting towards are satisfied by the aforementioned. Feel free to reopen if this is not the case.

We will never have on-demand updates of filter chains.

@phylake

phylake commented Apr 6, 2022

@htuch nothing typically, but I think I've been lucky in that regard. Diligent input validation and integration tests have kept them at bay so far, but it's something of a ticking time bomb from my perspective. I have Kubernetes CRDs (created by service developers) that I translate into Envoy protobufs. This is a lossy transformation, so it's difficult to know which input created an output that was NACKed, which is why I do the NACK handling described above. The more of Envoy's config surface area I expose via these CRDs, the more opportunity there is for me to miss something that Envoy might NACK. It's not a theoretical problem; it's a very practical problem that I've just been able to avoid so far in production.

ECDS looks to skip past quite a bit of configuration on a Listener, down to a FilterChain's Filters, where you'd find the TypedConfig. It could certainly help, but it's not what I'm looking for.

@htuch
Member Author

htuch commented Apr 7, 2022

Have you considered doing more extensive validation offline or via canary? I think what you describe is a valid reason to have finer-grained resources in the model, but other mechanisms to ensure that a NACK is unlikely to happen will pay off more broadly.

@phylake

phylake commented Apr 7, 2022

Not sure what you mean by offline. Yes, we canary and have non-prod environments that developers progress through. That, in addition to extensive unit, integration, and functional testing, all contributes to finding issues before they hit prod. That's not really the point, though: FDS plus the NACK handling I described was going to be the last measure for reducing the blast radius if all else fails. I have the granularity of resources I need everywhere else.

@htuch
Member Author

htuch commented Apr 7, 2022

I'm going to reopen as this is a valid consideration, plus if we did want to lazy load the filter chains, we would need this.

@htuch htuch reopened this Apr 7, 2022
@vorishirne

Again, can on-demand loading of filter chains be a part of this? If everything else is on-demand, can you please make this on-demand as well?

@lambdai
Contributor

lambdai commented Apr 22, 2022

I'd like to revisit this after the unified filter chain matcher is adopted in Istio.

That API enforces filter chain names. This change would greatly reduce the complexity of FDS.

@ssripadham

ssripadham commented Aug 24, 2022

I see a few benefits of developing FCDS.

  1. It allows us to model customizations within a single listener.
    Example:
    Listener
    FilterChain1: Some unique rules in fcds1.conf
    FilterChain2: Some unique rules in fcds2.conf
    ...
  2. Support breaking a single LDS file into multiple, more manageable FCDS files. Imagine these FCDS files holding many (1K+) lines of Envoy YAML.
  3. Filter chain discovery & filter-chain-only updates (as I understand it, this already exists in LDS itself).
  4. Filter chains can be loaded/unloaded on demand. This way, not all of the filter chains need to be held in process memory.
  5. No need to send the full LDS config to do a protobuf diff between the xDS server and the xDS client; smaller FCDS configs can be sent instead.
  6. Any other benefit?

Is this feature being actively worked on, @htuch?

@rakeshdatta

rakeshdatta commented Sep 13, 2022

Folks, this is the implementation PR for FCDS (SoTW). Could you please review?
Next steps:

  1. Test, and possibly hook in some more changes (the logic from the LDS filter-chain-only update), for filter-chain-level connection draining
  2. Delta updates for FCDS

cc @htuch @mattklein123 @ssripadham
