Description
This PR fixes an issue where the csm object occasionally stays in the failed state for close to an hour after the pods have deployed successfully. Sometimes, especially when the deployment takes longer, the final CSM update that occurs once the pods are running fails. The root cause is that, while the operator is processing this update, the cluster can be updated at the same time by something outside of the operator, causing a synchronization error that kills the update.
The fix is to add retry logic to the UpdateStatus function, using the RetryOnConflict function to retry when we get an error saying the csm object has been updated during our update. With this, the csm object goes into a succeeded state every time.
GitHub Issues
List the GitHub issues impacted by this PR:
Checklist:
How Has This Been Tested?
Installed PowerFlex 1,000 times on two different systems with ImagePullPolicy=Always (to lengthen deployment); the csm object never got stuck in a failed state. Without the fix, it got stuck in the failed state 30-40% of the time.
Installed PowerScale 1,000 times on two different systems with ImagePullPolicy=Always (to lengthen deployment); the csm object never got stuck in a failed state.
Successfully ran pflex + pscale e2e tests multiple times.