csm sync issue fixed #259

Merged
merged 3 commits into from
Jun 5, 2023

@jooseppi-luna jooseppi-luna commented May 31, 2023

Description

This PR fixes an issue where the csm object occasionally stays in the failed state for close to an hour after the pods have finished deploying successfully. Sometimes, especially when the deployment takes longer, the final CSM update that occurs once the pods are running fails. The root cause is that the cluster can be updated by something outside of the operator while we are processing our own update, causing a synchronization error that kills the update.

The fix is to add retry logic to the UpdateStatus function, using the RetryOnConflict function to retry when we get an error saying the csm object has been updated during our update. With this change, the csm object goes into a succeeded state every time.

GitHub Issues

List the GitHub issues impacted by this PR:

GitHub Issue #
dell/csm#816

Checklist:

  • I have performed a self-review of my own code to ensure there are no formatting, vetting, linting, or security issues
  • I have verified that new and existing unit tests pass locally with my changes
  • I have not allowed coverage numbers to degenerate
  • I have maintained at least 90% code coverage
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • I have added tests that prove my fix is effective or that my feature works
  • I have maintained backward compatibility

How Has This Been Tested?

Installed PowerFlex 1,000 times on two different systems with ImagePullPolicy=Always (to lengthen deployment); the csm object never got stuck in a failed state. Without the fix, it was getting stuck in the failed state 30-40% of the time.
Installed PowerScale 1,000 times on two different systems with ImagePullPolicy=Always (to lengthen deployment); the csm object never got stuck in a failed state.
Successfully ran the pflex and pscale e2e tests multiple times.


@JacobGros JacobGros left a comment

LGTM


@kumarkgosa kumarkgosa left a comment

lgtm

@jooseppi-luna jooseppi-luna merged commit 9ceaa98 into main Jun 5, 2023
@JacobGros JacobGros deleted the fix-failed-state branch July 18, 2023 18:07
ChristianAtDell added a commit that referenced this pull request Oct 15, 2024