fix cluster not able to spin up issue when disk usage exceeds threshold #15258
Merged: dblock merged 9 commits into opensearch-project:main from zane-neo:fix-cluster-unable-spin-up on Oct 16, 2024.
Commits (9):
- f38dba8: fix cluster not able to spin up issue when disk usage exceeds threshold (zane-neo)
- 5dbdab9: Add comment to changes (zane-neo)
- 4fcd310: Add UT to ensure the keepAliveThread starts before node starts (zane-neo)
- 2ad0126: remove unused imports (zane-neo)
- acebaf3: Fix forbidden API calls check failed issue (zane-neo)
- c29a02b: format code (zane-neo)
- 2b3cd47: format code (zane-neo)
- af9d70d: change setInstance method to static (zane-neo)
- 7f24452: Add countdownlatch in test to coordinate the thread to avoid concuree… (zane-neo)
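The commit messages above describe the heart of the change: start the keepAliveThread before node.start(), so that a failure during startup no longer takes the JVM down with it (commit 4fcd310 adds a unit test asserting exactly this ordering). A minimal sketch of the idea, assuming a Bootstrap-style startup flow; the class, interface, and field names below are illustrative stand-ins, not the actual OpenSearch code:

```java
import java.util.concurrent.CountDownLatch;

// Illustrative sketch only; NodeLike stands in for the real Node class.
public class BootstrapSketch {

    interface NodeLike {
        void start() throws Exception;
    }

    private final CountDownLatch keepAliveLatch = new CountDownLatch(1);
    private final Thread keepAliveThread = new Thread(() -> {
        try {
            // Block until shutdown is requested; as a non-daemon thread,
            // this alone is enough to keep the JVM alive.
            keepAliveLatch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
    }, "opensearch[keepAlive]");

    void start(NodeLike node) throws Exception {
        keepAliveThread.setDaemon(false);
        // Start the keep-alive thread *before* node.start(): if startup fails
        // (for example because disk usage is above the flood-stage watermark),
        // the JVM stays up so the failure can be diagnosed and the cluster
        // settings corrected, instead of the process exiting immediately.
        keepAliveThread.start();

        node.start(); // may throw; previously a failure here ended the JVM
    }

    void stop() {
        // Releasing the latch lets the keep-alive thread finish and the JVM exit.
        keepAliveLatch.countDown();
    }
}
```

Whether keeping the JVM alive after a failed node.start() is desirable at all is the main point of contention in the review thread below.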
Conversations
Naively speaking, if node.start() doesn't succeed, then there is nothing to keep alive, and the JVM shutting down seems like the right thing to do. It seems like this change could lead to a partially-started or not-at-all-started node running indefinitely, because the keepAliveThread will just keep it alive in some sort of zombie state after node.start() failed. What am I missing?
Naively responding, I've no objection to the JVM shutting down, but having spent way too many hours trying to figure out why the JVM shut down in various situations, I'd be really happy to have at least one thread faithfully logging whatever issues caused it to shut down, before shutting itself down.
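One way to address that, without changing the keep-alive behavior at all, would be to catch the startup failure, log it, and only then let it propagate so the JVM still exits with the cause on record. A hedged sketch of that idea; the stand-in types and logger setup are assumptions, not the actual bootstrap code:

```java
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Illustrative only; NodeLike stands in for the real Node class.
class LoggedStartup {

    interface NodeLike {
        void start() throws Exception;
    }

    private static final Logger logger = LogManager.getLogger(LoggedStartup.class);

    static void startOrDie(NodeLike node) throws Exception {
        try {
            node.start();
        } catch (Exception e) {
            // Record the reason before the JVM goes down, so the shutdown
            // is never silent.
            logger.error("node failed to start; shutting down", e);
            throw e;
        }
    }
}
```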
Not every node.start() failure means the JVM should quit. In the case from the corresponding issue, if we're able to keep the JVM running, the user can fix the problem simply by changing the cluster settings. So the fix is to give the user the ability to interact with the running (even partially-started) cluster. For the cases where node.start() fails and the cluster is not available at all, this fix has no real impact, because in the end the user needs to check the error/fatal logs to find the root cause. This change doesn't block users in any way, but without it, users in the first case are blocked. So overall I think this is a positive enhancement for users.
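For the disk-usage case specifically, the settings change a user could make against the still-running (even partially-started) node would look roughly like this, using the `_cluster/settings` API. The host, port, absence of authentication, and the chosen watermark value are assumptions for illustration:

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Illustrative: raise the flood-stage disk watermark on a local node so
// that allocation is no longer blocked by disk usage.
public class RaiseWatermark {
    public static void main(String[] args) throws Exception {
        String body = "{ \"transient\": {"
            + " \"cluster.routing.allocation.disk.watermark.flood_stage\": \"99%\" } }";

        HttpRequest request = HttpRequest.newBuilder()
            .uri(URI.create("http://localhost:9200/_cluster/settings"))
            .header("Content-Type", "application/json")
            .PUT(HttpRequest.BodyPublishers.ofString(body))
            .build();

        HttpResponse<String> response = HttpClient.newHttpClient()
            .send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.statusCode() + " " + response.body());
    }
}
```

Without the keep-alive change, the JVM exits as soon as node.start() fails, so there is nothing left to send this request to; that is the scenario the PR is trying to avoid.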
This seems like a really blunt change for an extremely specific case. In general, quitting the JVM and letting it restart is preferable to continuing to run in a partially-started state. Prior to this change, a transient failure in node.start() might automatically recover because the JVM would quit and whatever is monitoring the process would restart it. Now there is a risk (in my opinion) that the process will continue running in a non-functional, partially-started state that will require human intervention to resolve. How do we know that is not a risk here? The assumption that anything useful can be done with a partially started node was true in the specific case mentioned, but is not true in general.
@dbwiddis Not logging why the JVM shut down is clearly a problem. However, I'd argue this change might make things considerably worse. Instead of knowing there's a problem (because the process restarted), we'll instead have a partially-started node running in a crippled state with no indication that something went wrong during startup.
I can create the revert PR now, and I'll look into solution 2 as a long-term solution.
Thanks @zane-neo. Please also consider solution 1 and/or the bug I reported. One of those two issues should be fixed ASAP.
@andrross @dblock @dbwiddis Do you think we can conclude that no plugin should block OpenSearch from starting up, even when the plugin itself encounters an issue? If that is true, we can simply wrap the onNodeStarted method in a try/catch to avoid this. The pros of this fix are that it's simple and effective, and it prevents the same problem from recurring in other plugins.
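A minimal sketch of that try/catch idea, using simplified stand-ins for the plugin lifecycle types rather than the actual OpenSearch plugin API, so that one plugin failing in onNodeStarted is logged instead of aborting node startup:

```java
import java.util.List;
import org.apache.logging.log4j.LogManager;
import org.apache.logging.log4j.Logger;

// Simplified stand-ins for the real plugin lifecycle types.
interface LifecyclePlugin {
    String name();
    void onNodeStarted();
}

class PluginStartupNotifier {
    private static final Logger logger = LogManager.getLogger(PluginStartupNotifier.class);

    // Notify every plugin that the node has started; a failure in one plugin
    // is logged and does not prevent the remaining plugins, or the node,
    // from continuing.
    void notifyNodeStarted(List<LifecyclePlugin> plugins) {
        for (LifecyclePlugin plugin : plugins) {
            try {
                plugin.onNodeStarted();
            } catch (Exception e) {
                logger.error("plugin [" + plugin.name() + "] failed in onNodeStarted; continuing startup", e);
            }
        }
    }
}
```

As the next comment points out, swallowing these failures is not obviously safe for every plugin; a security plugin, for example, may need to fail the whole node rather than continue.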
@zane-neo I don't think this is true generally. For example, if the security plugin can't properly start up, it would be safer to fail than to come up with security disabled.
@andrross Thanks. I created this PR: opensearch-project/observability#1873 as a short-term fix, and I will continue exploring the long-term fix.