init: order dynamic resource initialization to make RTDS always be first #10362

Merged: 13 commits, Apr 20, 2020

Conversation

yanavlasov
Contributor

The new order of initialization:

  1. Initialize all primary clusters
  2. Initialize RTDS
  3. Initialize secondary clusters
  4. Initialize the rest of dynamic resources
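
The ordering above can be sketched as a chain of completion callbacks, where each step starts only after the previous one reports that it is done. This is a minimal illustration, not the code from this PR: `InitSequencer`, `step`, and `initOrder` are hypothetical names, and every step completes synchronously here, whereas a real server would complete them asynchronously.

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical names throughout; none of these are Envoy APIs.
using Done = std::function<void()>;

struct InitSequencer {
  std::vector<std::string> log; // records the order in which steps ran

  // Each step receives a completion callback and invokes it when finished.
  // Here every step completes immediately; a real server would invoke the
  // callback later, once the resource has actually been initialized.
  void step(const std::string& name, const Done& done) {
    log.push_back(name);
    done();
  }

  // Nesting the callbacks enforces the order: a step cannot start before
  // the previous step has called its completion callback.
  void run() {
    step("primary clusters", [this] {
      step("RTDS", [this] {
        step("secondary clusters", [this] {
          step("remaining dynamic resources", [] {});
        });
      });
    });
  }
};

// Runs the sequence once and returns the recorded order.
std::vector<std::string> initOrder() {
  InitSequencer sequencer;
  sequencer.run();
  return sequencer.log;
}
```

Calling `initOrder()` returns the four step names in the order listed in the description.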

Risk Level: High (changes to initialization order)
Testing: Unit Tests, Integration Tests, (internal Google e2e tests)
Docs Changes: N/A
Release Notes: N/A
Fixes #9709

Signed-off-by: Yan Avlasov <yavlasov@google.com>

@yanavlasov
Contributor Author

This is to start the discussion about the correctness of this approach and to see whether I have missed any edge cases.
I still need to check whether any docs need to be changed.
Does this need release notes?

Contributor

@snowp snowp left a comment

I think this approach looks OK but this part of the code is pretty complicated so I'd love to hear what others have to say as well.

@@ -73,6 +73,15 @@ class ClusterManagerFactory;
/**
* Manages connection pools and load balancing for upstream clusters. The cluster manager is
* persistent and shared among multiple ongoing requests/connections.
* Cluster manager is initialed in two phases. In the first phase which begins at the construction
* all primary (i.e. not provisioned through xDS) clusters are initialized.
* After the first phase the RTDS (if configured) initialization begins. This allows runtime
Contributor

this sounds like its own phase, maybe we should say that there are 3 phases?

Member

+1

Contributor Author

@yanavlasov yanavlasov Apr 2, 2020

I wanted to avoid leaking the overall initialization order into the cluster manager, which is why I put 2 phases there.

  1. In the first phase primary clusters are brought up.
  2. The server does something else, which the cluster manager does not need to care about.
  3. Then the second phase begins, where secondary clusters are initialized.

From the cluster manager's perspective there are only two phases. I've updated the comment and moved most of it into InstanceImpl, where the order is (mostly) established.
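
A rough sketch of the two-phase shape described in this comment, assuming (as the doc comment under review does) that primary clusters are the ones not provisioned through xDS. `TwoPhaseClusterManager`, `ClusterConfig`, and `from_xds` are hypothetical names for illustration and are not the Envoy classes; only the method name `initializeSecondaryClusters` is taken from the diff.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical sketch; the class and field names are illustrative only.
struct ClusterConfig {
  std::string name;
  bool from_xds; // true for secondary clusters (provisioned through xDS)
};

class TwoPhaseClusterManager {
public:
  // Phase 1 begins at construction: primary clusters are initialized.
  explicit TwoPhaseClusterManager(std::vector<ClusterConfig> clusters)
      : clusters_(std::move(clusters)) {
    for (const auto& cluster : clusters_) {
      if (!cluster.from_xds) {
        initialized_.push_back(cluster.name);
      }
    }
  }

  // Phase 2 begins only when the owner (the server) asks for it. The manager
  // does not know what the owner did between the two phases.
  void initializeSecondaryClusters() {
    if (secondary_started_) {
      return; // a second call is a no-op in this sketch
    }
    secondary_started_ = true;
    for (const auto& cluster : clusters_) {
      if (cluster.from_xds) {
        initialized_.push_back(cluster.name);
      }
    }
  }

  // Names of the clusters initialized so far, in initialization order.
  const std::vector<std::string>& initialized() const { return initialized_; }

private:
  std::vector<ClusterConfig> clusters_;
  std::vector<std::string> initialized_;
  bool secondary_started_{false};
};
```

The point of the design is that nothing in the class mentions RTDS: the owner decides what happens between construction and the call to `initializeSecondaryClusters()`.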

* The second phase of cluster manager initialized begins after RTDS has initialized. In the second
* phase all secondary clusters are initialized and then the rest of the configuration provisioned
* through xDS.
* Please note: this order requires that RTDS is provisioned using a primary cluster. If RTDS is
Contributor

what happens if it's using a secondary cluster? or is this invariant enforced?

Contributor Author

No, it is not enforced right now. What would be the best way to do it?

Member

There are actually multiple restrictions here:

  1. RTDS must be available via a primary cluster.
  2. If RTDS happens to be configured with ADS, then ADS must also be available via a primary cluster.
  3. Various others, e.g. if a secondary cluster is configured with ADS for its EDS, then ADS must also be available via a primary cluster.

We can enforce these by throwing a config rejection exception on violation of these criteria at construction/config ingest.

Contributor Author

Yes, the invariants for the RTDS config are enforced. The ApiConfigSource must already be specified using primary clusters only (checked by Utility::checkApiConfigSourceSubscriptionBackingCluster). And RTDS provisioned through ADS will fail to initialize if ADS is using a secondary cluster, since secondary clusters are not present in the cluster manager when RTDS is initialized.
I have added server_test tests to check this.
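
To illustrate the kind of check discussed in this thread, here is a minimal sketch that rejects a config whose RTDS backing cluster is not a primary cluster. It is not the body of `Utility::checkApiConfigSourceSubscriptionBackingCluster`; `ConfigRejected`, `checkBackingClusterIsPrimary`, and `isAcceptedRtdsSource` are hypothetical names, and the set of primary cluster names is assumed to be known at config ingest.

```cpp
#include <set>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a config rejection exception.
struct ConfigRejected : std::runtime_error {
  using std::runtime_error::runtime_error;
};

// Throws unless `backing_cluster` is one of the primary clusters, i.e. one
// that is already available when RTDS starts initializing.
void checkBackingClusterIsPrimary(const std::set<std::string>& primary_clusters,
                                  const std::string& backing_cluster) {
  if (primary_clusters.count(backing_cluster) == 0) {
    throw ConfigRejected("RTDS backing cluster '" + backing_cluster +
                         "' is not a primary cluster");
  }
}

// Convenience wrapper for callers that prefer a boolean result.
bool isAcceptedRtdsSource(const std::set<std::string>& primary_clusters,
                          const std::string& backing_cluster) {
  try {
    checkBackingClusterIsPrimary(primary_clusters, backing_cluster);
    return true;
  } catch (const ConfigRejected&) {
    return false;
  }
}
```

The same lookup would apply to an ADS cluster that RTDS depends on: if that cluster is secondary, it is not yet in the set when RTDS initializes, so the check fails.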

Member

@htuch htuch left a comment

This looks like a well-structured fix to the problem and the right approach. I have a few documentation and convention nits; otherwise the implementation looks good.
/wait


@@ -178,7 +178,14 @@ void ClusterManagerInitHelper::maybeFinishInitialize() {

void ClusterManagerInitHelper::onStaticLoadComplete() {
ASSERT(state_ == State::Loading);
state_ = State::WaitingForStaticInitialize;
// After initialization of primary clusters has completed, transition to
Member

Why was state_ set to WaitingForStaticInitialize before, but now it waits for secondary initialization?

Contributor Author

@yanavlasov yanavlasov Apr 2, 2020

I've renamed the states to better reflect the cluster manager's initialization sequence.

// During this state we wait to start initializing secondary clusters. In this state all
// phase 1 clusters have completed initialization. Initialization of the secondary clusters
// is started by the `initializeSecondaryClusters` method.
WaitingForSecondaryInitialize,
Member

I'd be a fan of adding Rtds as a specific state.

Contributor Author

From my neophyte perspective this would break the abstraction: why should the cluster manager be concerned with RTDS and reflect it in its internal state? The way I wanted to code this is:

  1. Initialize primary clusters.
  2. Let the server do something else. (cluster manager is in the WaitingForSecondaryInitialize state).
  3. Initialize secondary clusters when told so by the server.

Member

I think the way you have it now is clean, without any mention of RTDS inside ClusterManager, resolved.
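
The state handling agreed on here can be sketched as a small state machine in which the cluster manager waits in one state until its owner tells it to continue, with no mention of RTDS. This is an assumption-laden illustration: of the names below, only `Loading`, `WaitingForSecondaryInitialize`, `onStaticLoadComplete`, and `initializeSecondaryClusters` appear in the diff quoted in this thread; `InitHelper`, the remaining states, and the remaining methods are hypothetical.

```cpp
#include <stdexcept>

// Hypothetical state set. Loading and WaitingForSecondaryInitialize appear in
// the diff under review; the other names are illustrative.
enum class State {
  Loading,
  WaitingForPrimaryInitialize,
  WaitingForSecondaryInitialize,
  InitializingSecondary,
  AllClustersInitialized,
};

class InitHelper {
public:
  State state() const { return state_; }

  // Static configuration has been loaded; primary clusters start initializing.
  void onStaticLoadComplete() {
    advance(State::Loading, State::WaitingForPrimaryInitialize);
  }

  // All primary clusters are ready; wait for the owner to continue.
  void onPrimaryInitialized() {
    advance(State::WaitingForPrimaryInitialize, State::WaitingForSecondaryInitialize);
  }

  // Called by the owner. The helper does not know what ran in between.
  void initializeSecondaryClusters() {
    advance(State::WaitingForSecondaryInitialize, State::InitializingSecondary);
  }

  // All secondary clusters are ready.
  void onSecondaryInitialized() {
    advance(State::InitializingSecondary, State::AllClustersInitialized);
  }

private:
  // Moves to `next` only if the helper is currently in `expected`.
  void advance(State expected, State next) {
    if (state_ != expected) {
      throw std::logic_error("unexpected initialization state");
    }
    state_ = next;
  }

  State state_{State::Loading};
};
```

In this sketch an out-of-order call throws, which stands in for the `ASSERT` on `state_` shown in the diff.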

Member

Thoughts on changing this to WaitingToStartSecondaryInitialization? I found this confusing on read through (not that what was there before was not confusing). Feel free to update others to make them more clear if that can be done. Perhaps WaitingToStartCdsInitialization, etc.?

Contributor Author

Renamed the states. Cleaned up the comments a bit as well.

@stale

stale bot commented Mar 23, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Mar 23, 2020
@stale

stale bot commented Mar 30, 2020

This pull request has been automatically closed because it has not had activity in the last 14 days. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot closed this Mar 30, 2020
Signed-off-by: Yan Avlasov <yavlasov@google.com>
@yanavlasov yanavlasov reopened this Apr 1, 2020
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Apr 1, 2020
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
@mattklein123 mattklein123 self-assigned this Apr 2, 2020
Member

@htuch htuch left a comment

Looks good, a few small comments and we can ship.
/wait

@stale

stale bot commented Apr 10, 2020

This pull request has been automatically marked as stale because it has not had activity in the last 7 days. It will be closed in 7 days if no further activity occurs. Please feel free to give a status update now, ping for review, or re-open when it's ready. Thank you for your contributions!

@stale stale bot added the stale stalebot believes this issue/PR has not been touched recently label Apr 10, 2020
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
@stale stale bot removed the stale stalebot believes this issue/PR has not been touched recently label Apr 11, 2020
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Member

@htuch htuch left a comment

LGTM, just one last Q.

}
};

INSTANTIATE_TEST_SUITE_P(IpVersionsClientTypeDelta, AdsIntegrationTestWithRtdsAndSecondaryClusters,
Member

Where do the secondary clusters come from?

Contributor Author

I've added a comment on line 947

Signed-off-by: Yan Avlasov <yavlasov@google.com>
htuch
htuch previously approved these changes Apr 14, 2020
Member

@htuch htuch left a comment

LGTM, thanks!

@azure-pipelines

Command 'retest' is not supported by Azure Pipelines.


@yanavlasov
Contributor Author

/azp run

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s), but failed to run 2 pipeline(s).

Member

@mattklein123 mattklein123 left a comment

Thanks, this is great. Just a few small comments.

/wait


Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: Yan Avlasov <yavlasov@google.com>
Member

@mattklein123 mattklein123 left a comment

Thanks! Can you merge master which should hopefully fix CI?

/wait

Signed-off-by: Yan Avlasov <yavlasov@google.com>
@mattklein123 mattklein123 merged commit aaba081 into envoyproxy:master Apr 20, 2020
penguingao pushed a commit to penguingao/envoy that referenced this pull request Apr 22, 2020
…rst (envoyproxy#10362)

The new order of initialization:
1. Initialize all primary clusters
2. Initialize RTDS
3. Initialize secondary clusters
4. Initialize the rest of dynamic resources

Signed-off-by: Yan Avlasov <yavlasov@google.com>
Signed-off-by: pengg <pengg@google.com>
rgs1 pushed a commit to rgs1/envoy that referenced this pull request Apr 23, 2020
…ys be first (envoyproxy#10362)"

This reverts commit aaba081.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
mattklein123 pushed a commit that referenced this pull request Apr 23, 2020
…ys be first (#10362)" (#10919)

This reverts commit aaba081.

Signed-off-by: Raul Gutierrez Segales <rgs@pinterest.com>
spenceral added a commit to spenceral/envoy that referenced this pull request Apr 27, 2020
Signed-off-by: Spencer Lewis <slewis@squareup.com>

* master:
  fault injection: add support for setting gRPC status (envoyproxy#10841)
  tests: tag tests that fail on Windows with fails_on_windows (envoyproxy#10940)
  Fix typo on Postgres Proxy documentation. (envoyproxy#10930)
  fuzz: improve header/data stop/continue modeling in HCM fuzzer. (envoyproxy#10931)
  gzip filter: allow setting zlib compressor's chunk size (envoyproxy#10508)
  http: replace vector/reserve with InlinedVector in codec helper (envoyproxy#10941)
  stats: add utilities to create stats from a vector of tokens, mixing dynamic and symbolic elements. (envoyproxy#10735)
  hcm: avoid invoking 100-continue handling on decode filter. (envoyproxy#10929)
  prometheus stats: Correctly group lines of the same metric name. (envoyproxy#10833)
  status: Fix ASAN error in Status payload handling (envoyproxy#10906)
  path: Fix merge slash for paths ending with slash and present query args (envoyproxy#10922)
  compressor filter: add benchmark (envoyproxy#10464)
  xray: expected_span_name is not captured by the lambda with MSVC (envoyproxy#10934)
  ci: update before purge in cleanup (envoyproxy#10938)
  tracer: Improve test coverage for x-ray (envoyproxy#10890)
  Revert "init: order dynamic resource initialization to make RTDS always be first (envoyproxy#10362)" (envoyproxy#10919)
@yanavlasov yanavlasov deleted the xds-order branch February 1, 2021 19:29
Successfully merging this pull request may close these issues.

RTDS should be fully warmed before ClusterManager initialization
4 participants