Only release the lock when the cluster is reconciled #2117
Result of fdb-kubernetes-operator-pr on Linux CentOS 7
LGTM, though I was thinking it could be nice to include cluster.LockDuration in the TakeLock failure output, so users can see how long they should expect to wait in case of a problem (though I don't know how many people use multi-operator setups).
That makes sense, I'll add it 👍
The logger already contains the information, so there is no need to update the code: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/fdbclient/lock_client.go#L118-L147
e841c46 to 6d18fea
Description
In the past we added code to release the lock as soon as an operation was performed. Doing so increases the risk of race conditions when multiple operator instances manage a multi-region (or three-data-hall) FDB cluster. To reduce that risk, the operator now releases the lock only when the cluster is reconciled (or when the lock times out).
Type of change
Please select one of the options below.
Discussion
The general idea of the locking mechanism is to reduce the number of operations executed in parallel on the FDB cluster. If the lock is released directly after an operation, there is a risk that another operator instance immediately performs another operation that could be disruptive, e.g. excluding processes.
Testing
The e2e tests will be run by CI.
Documentation
Will be updated.
Follow-up