Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use fewer connections in the multi-version client (release-6.3) #4677

Conversation

sfc-gh-abeamon
Copy link
Collaborator

@sfc-gh-abeamon sfc-gh-abeamon commented Apr 16, 2021

This is a backport of #4667 and #4733, with various parts that aren't applicable removed. This is being backported to support upgrade scenarios with 6.3 that require lowering the cost of the multi-version client with two or more external clients.

This updates the multi-version client to monitor the protocol version of the cluster and use only the primary and one appropriately-versioned external client to connect.

As a result, this should eliminate the impact of incompatible clients on the cluster.

  • Completed 100K correctness tests with 3 pre-existing failures (2 timeouts in BackupCorrectnessClean and 1 failure in a SnapTestAttrition upgrade).
  • Passed 10K binding tester runs
  • Passed multi-threaded client tests

The following real-world tests were performed to verify that the client functioned and that we had the expected number of connections:

  • 5.0 external; cluster at 5.0
  • 5.1 and 6.3 external; cluster at 5.1
  • 5.2 external; cluster at 5.2
  • 6.0 external; cluster at 6.0
  • 6.1 external; cluster at 6.1
  • 6.2 and 6.3 external; cluster at 6.2
  • 6.2 and 6.3 external; cluster at 6.2 (local client disabled)
  • 6.2 and 6.3 external; cluster at 6.3
  • 6.2 and 6.3 external; cluster at 6.3 (local client disabled)
  • no externals; cluster at 6.3
  • 7.0 external; cluster at 7.0
  • 5.0 and 5.1 external; cluster upgraded from 5.0 to 5.1
  • 5.1, 5.2 and 6.3 external; cluster upgraded from 5.1 to 5.2
  • 5.1 and 6.3 external; cluster upgraded from 5.1 to 6.3
  • 6.2 and 6.3 external; cluster upgraded from 6.2 to 6.3
  • 6.2 and 7.0 external; cluster upgraded from 6.2 to 7.0
  • 6.2 and 7.0 external; cluster upgraded from 6.3 to 7.0
  • 6.2 and 6.3 external; cluster upgraded from 6.2 to 6.3 (local client disabled)
  • 6.3 external; cluster upgraded from 6.2 to 6.3 (no connection before upgrade)
  • 6.2 external; cluster upgraded from 6.1 to 6.2 (no connection after upgrade)
  • 6.1 external; cluster upgraded from 6.1 to 6.2 to 6.3 (no connection at 6.2)
  • 6.1 and 6.3 external; cluster upgraded from 6.1 to 6.2 to 6.3 (no connection at 6.2, local client disabled)
  • 5.0 and 6.2 external; cluster "downgraded" from 6.2 to 5.0
  • 5.1 and 6.2 external; cluster "downgraded" from 6.2 to 5.1
  • 5.0 external; cluster "downgraded" from 6.3 to 5.0
  • 5.1 external; cluster "downgraded" from 6.3 to 5.1
  • 6.2 external; cluster "downgraded" from 6.3 to 6.2
  • 7.0 external; cluster "downgraded" from 7.0 to 6.3 (does not work for unrelated reasons)

Code-Reviewer Section

The general guidelines can be found here.

Please check each of the following things and check all boxes before accepting a PR.

  • The PR has a description, explaining both the problem and the solution.
  • The description mentions which forms of testing were done and the testing seems reasonable.
  • Every function/class/actor that was touched is reasonably well documented.

For Release-Branches

If this PR is made against a release-branch, please also check the following:

  • This change/bugfix is a cherry-pick from the next younger branch (younger release-branch or master if this is the youngest branch)
  • There is a good reason why this PR needs to go into a release branch and this reason is documented (either in the description above or in a linked GitHub issue)

@sfc-gh-abeamon sfc-gh-abeamon changed the base branch from master to release-6.3 April 16, 2021 23:38
…I versions < 610). Fix reference counting of DLDatabase objects to avoid leaking the underlying database handle. Update release notes to note that clients older than 6.2 still create extra connections.
…ld versions. Update the C bindings implementation of get_server_protocol to convert the ProtocolVersion object into a uint64_t. Rename a misleading protocol version alias.
Use nullptr instead of NULL
Use const& for a parameter
Add some comments
…termine the protocol version in order to trigger leader monitoring communication.
…ed. Avoid using any state when cancelled. Fix race between setting up the protocol version monitor and destroying the database.
… in lambdas, instead capturing a full Reference.
@sfc-gh-abeamon sfc-gh-abeamon marked this pull request as ready for review May 1, 2021 01:45
@sfc-gh-abeamon
Copy link
Collaborator Author

sfc-gh-abeamon commented May 4, 2021

Note that the primary differences with the master branch are:

  1. Everything related to use of the new stable interface for getting the protocol version has been removed
  2. The MVC always uses a database created with the primary client to monitor the protocol version rather than trying to use the active one (Partially 000102d, but some other parts of this were in the original merge).
  3. The MVC resets failed connections to un-failed if it is monitoring them for the connect packet. This is to prevent a protection that can cause the connection to go unmonitored for long stretches. 54d6176

Copy link
Contributor

@vishesh vishesh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@@ -737,6 +737,13 @@ ACTOR Future<Void> connectionKeeper(Reference<Peer> self,

conn->close();
conn = Reference<IConnection>();

// Old versions will throw this error, and we don't want to forget their protocol versions.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I understood correctly it is supposed to be old version will not throw this error, and will hang up? https://github.com/apple/foundationdb/blob/master/fdbrpc/FlowTransport.actor.cpp#L1224

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What happens depends on what side of the connection you are on. If you are sending the connect packet, the connection will drop and you won't learn anything about the protocol version.

If you are receiving the connect packet, then it will hang up on the old connection (i.e. not send anything), but locally it will throw an error that cleans up the connection state. We do, however, learn what the protocol version is in this case, and we're choosing not to discard it by ignoring this error.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This differs from the newer versions, by the way, in the fact that new versions leave the connection open but don't exchange any real messages over it. The idea is that by leaving the connection open, we can cheaply keep track of the fact that the two sides are not compatible.

@vishesh vishesh merged commit b1c8bbf into apple:release-6.3 May 5, 2021
@sfc-gh-abeamon sfc-gh-abeamon deleted the feature-mvc-monitor-protocol-version-release-6.3 branch June 3, 2021 21:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants