Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

khepri_cluster: Fix race condition in the reset code #294

Merged
merged 1 commit into from
Sep 11, 2024

Conversation

dumbbell
Copy link
Member

Why

The khepri_cluster:reset() function is mainly used to uncluster a node. To make sure that both parties (the leaving node and the rest of the cluster) see the same cluster membership in the end, we perform two resets:

  1. We remove the leaving member from a remote member (if the node is clustered).
  2. We remove the leaving member from its own view of the cluster.

This way, if the leaving node was out-of-sync about the cluster membership because it lost its state for instance, we are sure that at the end, everyone agrees.

However, when the leaving node is removed using a remote member, that member will stop the leaving Ra server. Therefore, when we try the second remove on the leaving node, we might get a {error, noproc} error because the Ra process already exited.

How

We adopt the same solution as the error handling done with wait_for_leader(): if ra:remove_member() returns the noproc error, we consider that's ok and proceed with the reset.

This should fix a rare transient failure that I saw in CI but was never able to reproduce locally.

@dumbbell dumbbell added the bug Something isn't working label Sep 11, 2024
@dumbbell dumbbell added this to the v0.16.0 milestone Sep 11, 2024
@dumbbell dumbbell self-assigned this Sep 11, 2024
@dumbbell dumbbell force-pushed the fix-reset-race-condition branch from b5d503b to f495420 Compare September 11, 2024 16:15
[Why]
The `khepri_cluster:reset()` function is mainly used to uncluster a
node. To make sure that both parties (the leaving node and the rest of
the cluster) see the same cluster membership in the end, we perform two
resets:

1. We remove the leaving member from a remote member (if the node is
   clustered).
2. We remove the leaving member from its own view of the cluster.

This way, if the leaving node was out-of-sync about the cluster
membership because it lost its state for instance, we are sure that at
the end, everyone agrees.

However, when the leaving node is removed using a remote member, that
member will stop the leaving Ra server. Therefore, when we try the
second remove on the leaving node, we might get a `{error, noproc}`
error because the Ra process already exited.

[How]
We adopt the same solution as the error handling done with
`wait_for_leader()`: if `ra:remove_member()` returns the `noproc` error,
we consider that's ok and proceed with the reset.

This should fix a rare transient failure that I saw in CI but was never
able to reproduce locally.
@dumbbell dumbbell force-pushed the fix-reset-race-condition branch from f495420 to 8491582 Compare September 11, 2024 16:31
Copy link

codecov bot commented Sep 11, 2024

Codecov Report

Attention: Patch coverage is 0% with 5 lines in your changes missing coverage. Please review.

Project coverage is 89.53%. Comparing base (c5722bf) to head (8491582).
Report is 2 commits behind head on main.

Files with missing lines Patch % Lines
src/khepri_cluster.erl 0.00% 5 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #294      +/-   ##
==========================================
- Coverage   89.68%   89.53%   -0.15%     
==========================================
  Files          21       21              
  Lines        3188     3192       +4     
==========================================
- Hits         2859     2858       -1     
- Misses        329      334       +5     
Flag Coverage Δ
erlang-25 88.78% <0.00%> (+0.04%) ⬆️
erlang-26 89.44% <0.00%> (+0.04%) ⬆️
erlang-27 89.47% <0.00%> (-0.12%) ⬇️
os-ubuntu-latest 89.53% <0.00%> (-0.05%) ⬇️
os-windows-latest 89.53% <0.00%> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@dumbbell dumbbell marked this pull request as ready for review September 11, 2024 16:45
@dumbbell
Copy link
Member Author

The patch coverage is 0% because a follow-up changes to khepri_cluster will greatly increase the chance to hit this error (100% for me locally). This is how the problem was discovered. However, I wanted to commit this one first because CI will likely fail for that other patch.

@dumbbell dumbbell merged commit 80ef2a3 into main Sep 11, 2024
12 checks passed
@dumbbell dumbbell deleted the fix-reset-race-condition branch September 11, 2024 16:48
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants