HDDS-12080: delete irrelevant RATIS/THREE pipeline's raft logs in case of DEAD datanode state #7697
What changes were proposed in this pull request?
Clear irrelevant RATIS/THREE pipelines on a datanode if its previous state on the SCM side was 'DEAD'.
In several situations a datanode cannot send heartbeat requests to SCM, and SCM then treats it as DEAD: it closes the node's pipelines and clears the node's command queue. The datanode therefore never receives the commands to close or clear its pipelines, but it can later receive a new command queue that creates a batch of new pipelines. The pipeline count keeps growing, and every node restart triggers reading of all the pipelines (i.e. raft groups), which can consume a lot of time and memory.
When a node has been in the DEAD state, its related pipelines are already closed and irrelevant, so it makes no sense to initialize their raft logs when the datanode starts or restarts. It therefore seems we could delete the pipeline/raft log directories when the previously saved state of the datanode is 'DEAD'.
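The cleanup proposed above could look roughly like the following. This is only a minimal sketch and not the code in this patch: `StalePipelineCleaner`, `SavedNodeState`, and `cleanIfPreviouslyDead` are hypothetical names, and the sketch assumes each pipeline (raft group) occupies one subdirectory of a single storage directory. It also does not distinguish RATIS/THREE pipelines from others, which a real implementation would have to do.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// Hypothetical helper, not part of Ozone: illustrates the idea only.
final class StalePipelineCleaner {

  // Hypothetical stand-in for the node state the datanode saved last.
  enum SavedNodeState { HEALTHY, STALE, DEAD }

  // Deletes every pipeline (raft group) subdirectory of ratisStorageDir
  // when the previously saved state is DEAD; otherwise does nothing.
  // Returns the names of the deleted directories.
  static List<String> cleanIfPreviouslyDead(SavedNodeState saved,
                                            Path ratisStorageDir) {
    List<String> deleted = new ArrayList<>();
    if (saved != SavedNodeState.DEAD || !Files.isDirectory(ratisStorageDir)) {
      return deleted;
    }
    try {
      // Collect the group directories first so nothing is deleted
      // while the directory listing is still open.
      List<Path> groupDirs;
      try (Stream<Path> entries = Files.list(ratisStorageDir)) {
        groupDirs = entries.filter(Files::isDirectory)
            .sorted()
            .collect(Collectors.toList());
      }
      for (Path group : groupDirs) {
        deleteRecursively(group);
        deleted.add(group.getFileName().toString());
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return deleted;
  }

  // Removes a directory tree, children before parents.
  private static void deleteRecursively(Path root) throws IOException {
    List<Path> paths;
    try (Stream<Path> walk = Files.walk(root)) {
      paths = walk.sorted(Comparator.reverseOrder())
          .collect(Collectors.toList());
    }
    for (Path p : paths) {
      Files.delete(p);
    }
  }
}
```

In this sketch the call would run once at datanode startup, before any raft group is loaded, so the stale groups are never read. For any saved state other than DEAD it leaves the storage directory untouched.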
What is the link to the Apache JIRA
https://issues.apache.org/jira/browse/HDDS-12080
How was this patch tested?
Manually. Test cases are still in development and open for discussion.