Only release the lock when the cluster is reconciled #2117


johscheuer
Member

Description

Previously, we added code to release the lock as soon as an operation had been performed. That increases the risk of race conditions when multiple operator instances manage a multi-region (or three-data-hall) FDB cluster. To reduce that risk, the operator now releases the lock only once the cluster is reconciled (or when the lock times out).

Type of change

Please select one of the options below.

  • Bug fix (non-breaking change which fixes an issue)

Discussion

The general idea of the locking mechanism is to limit the number of operations executed in parallel on the FDB cluster. If the lock is released immediately after an operation, another operator instance can immediately start another potentially disruptive operation, e.g. excluding processes.
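
The difference between the two release strategies can be sketched with a small deterministic simulation. Everything here is an assumption for illustration: the `lock` type, the `simulate` function, and the operation names are hypothetical and do not come from the operator's code base, and timeouts are left out to keep the example short.

```go
package main

import "fmt"

// lock is a hypothetical, simplified lock without timeouts.
type lock struct{ owner string }

// take grants the lock if it is free or already held by owner.
func (l *lock) take(owner string) bool {
	if l.owner != "" && l.owner != owner {
		return false
	}
	l.owner = owner
	return true
}

// release frees the lock if owner currently holds it.
func (l *lock) release(owner string) {
	if l.owner == owner {
		l.owner = ""
	}
}

// simulate interleaves the operations of two operator instances and returns
// the operations that were actually executed, in order.
func simulate(releaseAfterEachOperation bool) []string {
	l := &lock{}
	var executed []string

	// attempt runs one operation for an operator if it can take the lock.
	attempt := func(operator, operation string, reconciledAfterwards bool) {
		if !l.take(operator) {
			return // another instance holds the lock; retry in a later reconciliation
		}
		executed = append(executed, operator+": "+operation)
		if releaseAfterEachOperation || reconciledAfterwards {
			l.release(operator)
		}
	}

	attempt("operator-a", "exclude processes", false)
	attempt("operator-b", "exclude processes", false)
	attempt("operator-a", "remove processes", true)
	return executed
}

func main() {
	fmt.Println("release after each operation:", simulate(true))
	fmt.Println("release when reconciled:     ", simulate(false))
}
```

In the first run, operator-b's exclusion slips in between operator-a's two operations. In the second run, operator-b is blocked until operator-a has finished reconciling.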

Testing

The e2e tests will be run by CI.

Documentation

Will be updated.

Follow-up

@johscheuer johscheuer added the bug Something isn't working label Aug 27, 2024
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 5a6038c
  • Duration 2:58:37
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 15b4edc
  • Duration 2:56:52
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

Contributor

@nicmorales9 nicmorales9 left a comment


LGTM, though I was thinking it could be nice to output the cluster.LockDuration in the TakeLock failure output for users to see how long they should expect to wait in the case of a problem (though I don't know how many people use multi operator setups)

Member Author

@johscheuer johscheuer left a comment


> LGTM, though I was thinking it could be nice to output the cluster.LockDuration in the TakeLock failure output for users to see how long they should expect to wait in the case of a problem (though I don't know how many people use multi operator setups)

That makes sense, I'll add it 👍

Member Author

@johscheuer johscheuer left a comment


The logger already contains the information, so there is no need to update the code: https://github.com/FoundationDB/fdb-kubernetes-operator/blob/main/fdbclient/lock_client.go#L118-L147

@johscheuer johscheuer force-pushed the only-release-lock-when-cluster-reconciled branch from e841c46 to 6d18fea Compare August 28, 2024 13:11
@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: da4fb90
  • Duration 3:35:16
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: e841c46
  • Duration 3:22:57
  • Result: ❌ FAILED
  • Error: Error while executing command: if $fail_test; then exit 1; fi. Reason: exit status 1
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@foundationdb-ci

Result of fdb-kubernetes-operator-pr on Linux CentOS 7

  • Commit ID: 6d18fea
  • Duration 3:23:10
  • Result: ✅ SUCCEEDED
  • Error: N/A
  • Build Log terminal output (available for 30 days)
  • Build Workspace zip file of the working directory (available for 30 days)

@johscheuer johscheuer merged commit 2c4a55a into FoundationDB:main Aug 28, 2024
8 checks passed
@johscheuer johscheuer deleted the only-release-lock-when-cluster-reconciled branch August 28, 2024 17:07
Labels
bug Something isn't working

3 participants