Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GODRIVER-2828 Use topology version from Server instead of Connection in ProcessError. #1252

Merged

Conversation

matthewdale
Copy link
Collaborator

@matthewdale matthewdale commented May 9, 2023

GODRIVER-2828

Summary

  • Use the latest topology version from either the server or connection when processing errors in ProcessError.
  • Refactor the ProcessError test to express the expected initial and resulting state changes more clearly.
  • Add test cases for when the connection topology version is newer than the server topology version and when the topology process ID has changed.

Background & Motivation

While processing errors, the SDAM error handling function ProcessError attempts to compare the topology version on the error to the current topology version known by the Go driver, ignoring errors where the topology version is not newer than the known topology version. However, the Go driver currently uses the connection's topology version as the "known" topology version, which is never updated after the connection is created. As a result, ProcessError will act on errors from older connections when it should ignore them.

@matthewdale matthewdale force-pushed the godriver2828-sdam-top-version branch 4 times, most recently from 3e86d65 to 0c97b45 Compare May 16, 2023 17:59
@matthewdale matthewdale marked this pull request as ready for review May 16, 2023 18:10
Copy link
Collaborator

@qingyang-hu qingyang-hu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍

// ignore stale error
if desc.TopologyVersion.CompareToIncoming(cerr.TopologyVersion) >= 0 {
// Ignore errors that came from when the database was on a previous topology version.
if topologyVersion.CompareToIncoming(cerr.TopologyVersion) >= 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where the max{ServerTopology, ConnectionTopology} = ErrorTopology , why do we consider this a "previous topology version"? If this is correct behavior, can we add a comment explaining it?

// In the case where the max{serverDesc.TopologyVersion, connDesc.TopologyVersion} == cerr.TopologyVersion , 
// we consider cerr.Topology version a "previous topology version" because ...

Copy link
Collaborator Author

@matthewdale matthewdale May 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point, the current comment actually describes slightly different behavior because the logic also ignores errors that came from the current topology version.

I believe the reason we only act on errors where the error topology version is strictly greater than the local topology version is that we only want to process the first instance of that error for a given topology version, and ignore the rest.

For example consider the following sequence of events:

Step Action Driver TopologyVersion DB TopologyVersion
1 Initial state 1 1
2 DB elects new primary 1 2
3 Driver starts 4 concurrent writes to secondary, secondary returns "not writable primary" 1 2
4 Driver processes first "not writable primary" error with TopologyVersion=2, marks server as Kind=Unknown, updates its TopologyVersion 2 2
5 Driver processes 3 more "not writable primary" errors with TopologyVersion=2, ignores them 2 2

Notice that the driver only processes the first "not writable primary" error, marking the server role as "unknown" until it is updated by the monitoring routine. Subsequent errors are ignored based on topology version, which prevents the driver from re-processing errors for the same topology change event.

Keep in mind that we only check topology version for "node is recovering" or "not primary" error codes, which we expect to increment topology version on the database. The first time the driver sees one of those errors, it knows the database topology has changed and needs to re-discover the topology. The driver ignores topology version for all other kinds of errors.

For additional reference, the topology version comparison logic comes directly from the error handling pseudocode in the SDAM spec.

I'll update the comment with a summary of the above information to increase clarity.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Oh, I see. This great explanation! Thank you

// ignore stale error
if desc.TopologyVersion.CompareToIncoming(wcerr.TopologyVersion) >= 0 {
// Ignore errors that came from when the database was on a previous topology version.
if topologyVersion.CompareToIncoming(wcerr.TopologyVersion) >= 0 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the case where the max{ServerTopology, ConnectionTopology} = ErrorTopology , why do we consider this a "previous topology version"? If this is correct behavior, can we add a comment explaining it?

// In the case where the max{serverDesc.TopologyVersion, connDesc.TopologyVersion} == wcerr.TopologyVersion , 
// we consider cerr.Topology version a "previous topology version" because ...

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See this comment for a detailed answer.

@@ -434,9 +467,9 @@ func (s *Server) ProcessError(err error, conn driver.Connection) driver.ProcessE

res := driver.ServerMarkedUnknown
// If the node is shutting down or is older than 4.2, we synchronously clear the pool
if wcerr.NodeIsShuttingDown() || desc.WireVersion == nil || desc.WireVersion.Max < 8 {
if wcerr.NodeIsShuttingDown() || wireVersion == nil || wireVersion.Max < 8 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on creating a constant for "8"? Even though comparing wireVersions to their literal number is commonplace in the driver, it is a magic number which makes it sort of hard to interpret.

const wireVersionGEQ42 = 8

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, using a named constant would help with readability. Will update.

x/mongo/driver/topology/server_test.go Outdated Show resolved Hide resolved
}

func newProcessErrorTestConn(tv *description.TopologyVersion) *processErrorTestConn {
func newProcessErrorTestConn(wireVersion *description.VersionRange, stale bool) *processErrorTestConn {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we put this and "newServerDescription" in the namespace? i.e. either both as package test functions, or both as closure functions in the test function?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The processErrorTestConn can't be defined in the ProcessError test function because it needs to define methods to satisfy the driver.Connection interface. However, I can move newServerDescription to the package scope.

@@ -415,16 +448,16 @@ func (s *Server) ProcessError(err error, conn driver.Connection) driver.ProcessE

res := driver.ServerMarkedUnknown
// If the node is shutting down or is older than 4.2, we synchronously clear the pool
if cerr.NodeIsShuttingDown() || desc.WireVersion == nil || desc.WireVersion.Max < 8 {
if cerr.NodeIsShuttingDown() || wireVersion == nil || wireVersion.Max < 8 {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What are your thoughts on creating a constant for "8"? Even though comparing wireVersions to their literal number is commonplace in the driver, it is a magic number which makes it sort of hard to interpret.

const wireVersionGEQ42 = 8

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, using a named constant would help with readability. Will update.

x/mongo/driver/topology/server_test.go Outdated Show resolved Hide resolved
name: "stale not primary error",
startDescription: newServerDescription(description.RSPrimary, processID, 1, nil),
inputErr: driver.Error{
Code: 10107, // NotPrimary
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional: 10107 is "NotWritablePrimary", see here.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. That error is documented as "NotPrimary" in many places in the Go driver code. I'll correct it in these tests and then open a subsequent PR to correct it elsewhere in the Go driver.

matthewdale and others added 3 commits May 22, 2023 18:09
@matthewdale matthewdale force-pushed the godriver2828-sdam-top-version branch from dda23d3 to af7f854 Compare May 24, 2023 19:07
@matthewdale matthewdale merged commit 71edba3 into mongodb:master Jun 5, 2023
matthewdale added a commit that referenced this pull request Jun 5, 2023
…in ProcessError. (#1252)

Co-authored-by: Preston Vasquez <prestonvasquez@icloud.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants