Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP: xDS mux unification #14496

Closed
wants to merge 34 commits into from
Closed

WIP: xDS mux unification #14496

wants to merge 34 commits into from

Conversation

dmitri-d
Copy link
Contributor

Fixes #11477.

  • reintroduces config: reinstate #8478 (unification of delta and SotW xDS), reverted by #8939 #8974
  • Ported "legacy" grpc mux and grpc_subscription_impl, to provide wire-level compatibility for existing server implementations
  • Legacy mux and subscription are used by default and controlled by the "envoy.reloadable_features.legacy_sotw_xds" runtime config feature.
  • Integration-level tests (ads, cds, etc) updated to test both legacy and unified muxes

TODO's

  • update runtime config docs
  • port unit-level tests for legacy mux and subscription

fredlas and others added 25 commits November 25, 2019 12:09
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
…unification

Signed-off-by: Fred Douglas <fredlas@google.com>
This reverts commit 30ce222.
This restoration was to test my theory that 8350 was the cause
of the hang. Now that we have established it wasn't, it will be
good to have this cleaner, more sensible behavior back in.

Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
…unification

Signed-off-by: Fred Douglas <fredlas@google.com>
…ing behavior, which treats any news about the fetch as precluding a timeout, as opposed to only wanting a successful one

Signed-off-by: Fred Douglas <fredlas@google.com>
…unification

Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
…unification

Signed-off-by: Fred Douglas <fredlas@google.com>
…unification

Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Fred Douglas <fredlas@google.com>
…unification

Signed-off-by: Fred Douglas <fredlas@google.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
…ementation that is wire-compatible with the legacy grpc mux used in non-delta xDS

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@repokitteh-read-only
Copy link

CC @envoyproxy/api-shepherds: Your approval is needed for changes made to (api/envoy[\w/]*/(v1alpha\d?|v1|v2alpha\d?|v2))|(api/envoy/type/(matcher/)?\w+.proto).
CC @envoyproxy/api-shepherds: Your approval is needed for changes made to api/envoy/.
API shepherd assignee is @lizan
CC @envoyproxy/api-watchers: FYI only for changes made to api/envoy/.

🐱

Caused by: #14496 was opened by dmitri-d.

see: more, trace.

@dmitri-d
Copy link
Contributor Author

ping @htuch, @snowp

Dmitri Dolguikh added 5 commits January 4, 2021 12:11
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Copy link
Member

@lizan lizan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just API review, can you check CI and then I will look into code.

@@ -51,7 +51,7 @@ message ApiConfigSource {
// the v2 protos is used.
REST = 1;

// gRPC v2 API.
// "State of the world" gRPC v2 API, using Discovery{Request,Response} protos.
Copy link
Member

@lizan lizan Jan 5, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove v2, as we're moving to v3. or actually revert this.

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @dmitri-d, this is epic. A couple of high-level questions:

  1. Did you end up root causing the original rollback? Do we know why things would be different this time around?
  2. Is there any way to break this PR up into smaller PRs or recommend a review strategy? This is a lot of churn and code which appears new (but maybe comes from elsewhere?). The GitHub UX makes it hard to track this and reason about correctness.

[Delta]DiscoveryRequest protos.

What does GrpcMux even do in this diagram? Just own things and pass through function calls?
Answer: 1) it sequences the requests and ACKs that the various type_urls send, 2) it handles the
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you use proper markdown enumerated lists?

@dmitri-d
Copy link
Contributor Author

dmitri-d commented Jan 7, 2021

Did you end up root causing the original rollback? Do we know why things would be different this time around?

I did not; my understanding is that @rgs1 and @fredlas couldn't find anything in the original work. @rgs offered help testing changes (here: #11477 (comment)), hopefully it's still on. I ported legacy (non-delta xDS) mux and subscription implementations into this branch, which should make the transition easier. The legacy mux is enabled by default.

Is there any way to break this PR up into smaller PRs or recommend a review strategy? This is a lot of churn and code which appears new (but maybe comes from elsewhere?).

The core of the changes is the original PR; however it had to be updated to accommodate things like subascription resumption (ScopedResume), ttl, etc. Some of the issues are due to naming: diff on source/common/config/grpc_mux_impl.cc is not going to be very useful, for example, as the new mux implementation lives there now.

I'd suggest looking at changes in source/common/config/legacy_grpc_mux_impl.* and source/common/config/legacy_grpc_subscription_impl.* and their use first: this are the original mux and subscription implementations that should guarantee compatibility with existing xDS server implementations.

A next good step, IMO, would be to look at delta xDS mux and subscription implementations and their use (here: source/common/config/delta_subscription_state.cc
and here

class GrpcMuxDelta
). These are, to a large extent, refactorings of current delta mux and state implementations and should have a familiar feel.

I'd leave SotW mux, subscription, and subscription state for last. I think TTL logic can be reviewed separately too, after the main functionality has been dealt with.

Hope this helps, let me know if there's anything else I can do to help with reviewing of this PR.

Base automatically changed from master to main January 15, 2021 23:02
@lizan lizan added the waiting label Jan 22, 2021
@htuch
Copy link
Member

htuch commented Feb 2, 2021

@adisuissa will help with a first pass, thanks!

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
@dmitri-d
Copy link
Contributor Author

dmitri-d commented Feb 5, 2021

@adisuissa: please let me know if there's anything I can do to make this review easier...

Signed-off-by: Dmitri Dolguikh <ddolguik@redhat.com>
Copy link
Member

@htuch htuch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, thanks for taking this on @dmitri-d. Following up on IM conversations, I can provide some high-level guidance here and @adisuissa will dive deeper. I think though we might want to consider splitting some of this into distinct PRs.

Some high-level suggestions to be able to land this somewhat incrementally and with better reviewability:

  1. An initial PR that just introduces the new runtime option in the subscription factory and cluster manager. It should switch between new/old variants but keep the same name for existing SotW and for the new GrpxMuxSotw just have some dummy implementation.
  2. Don't change names until a final PR for anything that doesn't need it. E.g. can we keep GrpxMuxImpl instead of GrpcMuxSotw for now? Can we keep NewGrpcMuxImpl instead of GrpcMuxDelta? These newer Sotw/Delta names should ideally be the names in the end state, but hopefully that is a trivial rename pass once we have the functional stuff sorted.

One thing that concerns me is that the amount of change in something like GrpcMuxImpl that isn't protected by the runtime flags is very high.

If we had as a starting point the guarantee that zero code changes in existing grpc_mux_impl.cc and new_grpc_mux_impl.cc, that these are the runtime default from the factories, and then we build up the replacements over a series of PRs, I think this would have higher safety guarantee and be easier to review.

The downside of this is that we will end up with three (or maybe four?) different gRPC muxes effectively during the transition, which is pretty horrendous but might be safest and the most incremental approach. The reason I'm making this case is that even something like grp_mux_impl.cc, which should be the legacy path, has significant delta in this PR.

This is just a suggestion, curious to hear from you and @adisuissa on this topic.

/**
* Adapter from typed Subscription to untyped GrpcMux. Also handles per-xDS API stats/logging.
*/
class LegacyGrpcSubscriptionImpl : public Subscription,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How come we need a legacy subscription? The subscription is largely independent of the gRPC mux itself, it would seem not a lot of logic should change here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Largely due to init_fetch_timeout timer moving to watches. I think having a dedicated implementation is easier/safer/more maintainable than having conditionals in one common implementation.

@dmitri-d
Copy link
Contributor Author

dmitri-d commented Feb 9, 2021

I agree that preserving the names of the classes would make reviewing much easier, will check how much work it is.

@adisuissa
Copy link
Contributor

+1 for incremental changes, and for creating a new unified mux without renaming the old ones.
Also, suggest to have a runtime flag as an opt-in for the next-gen (unified) mux.
Another possibility to consider is to introduce the unified mux as another GrpcMux extended class, and try to add its functionality and introduce incremental changes.

@@ -160,7 +164,7 @@ TEST_P(CdsIntegrationTest, CdsClusterUpDownUp) {
ASSERT_TRUE(codec_client_->waitForDisconnect());

// Tell Envoy that cluster_1 is back.
EXPECT_TRUE(compareDiscoveryRequest(Config::TypeUrl::get().Cluster, "42", {}, {}, {}));
EXPECT_TRUE(compareDiscoveryRequest(Config::TypeUrl::get().Cluster, "42", {}, {}, {}, false));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In many of the compareDiscoveryRequest calls the PR adds the 6th arg as false (which IIUC should be false by default). Would it be better to remove it from these calls? (less changes across all integration tests)

@github-actions
Copy link

This pull request has been automatically marked as stale because it has not had activity in the last 30 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 11, 2021
@@ -74,6 +74,7 @@ constexpr const char* runtime_features[] = {
"envoy.reloadable_features.http_transport_failure_reason_in_body",
"envoy.reloadable_features.http_upstream_wait_connect_response",
"envoy.reloadable_features.http2_skip_encoding_empty_trailers",
"envoy.reloadable_features.legacy_sotw_xds",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you give some details on the control plane facing changes that enabling/disabling this flag causes?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

closed in favour of #15473

@github-actions
Copy link

This pull request has been automatically closed because it has not had activity in the last 37 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@github-actions github-actions bot closed this Mar 20, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
api stale stalebot believes this issue/PR has not been touched recently v2-freeze
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Converge [New]GrpcMuxImpl
6 participants