Remove upgrade extension loop for the same goal state #1686
@@ -684,7 +684,6 @@ def __init__(self, ext_handler, protocol):
self.operation = None
self.pkg = None
self.pkg_file = None
self.is_upgrade = False
I'm curious why this property was added in the first place.
This PR brought it in. It seems the flag was only used to intentionally retry the same goal state in an upgrade scenario. I believe the intention was to add a retry in case of transient failures during upgrade, but we've since seen how messy and costly it is when the upgrade failure is non-transient (hint: OMS), so I don't think it's a good trade-off to keep this logic.
I can't think of a scenario where this might hit, but I'm worried about customers who might have taken a dependency on this "behavior", where we keep retrying the goal state if there's an extension upgrade available.
Anyway, I feel that even if someone did take a dependency on it, they can adapt to the correct behavior.
Yeah, basically. What I meant was: since we even had a test case that checked whether the disable-failed scenario would recover by itself without a new incarnation, maybe the previous owners told the extension publishers not to worry about transient issues because they would eventually be "auto-resolved", and as a result the publishers don't have enough retry logic in their own code.
@vrdmr I agree with your point about double checking the documentation and making it very explicit that we would only try it once per goalstate.
Codecov Report
@@ Coverage Diff @@
## develop #1686 +/- ##
==========================================
Coverage ? 67.34%
==========================================
Files ? 80
Lines ? 11432
Branches ? 1604
==========================================
Hits ? 7699
Misses ? 3393
Partials ? 340
self._assert_handler_status(protocol.report_vm_status, "NotReady", expected_ext_count=0, version="1.0.1")

@patch('azurelinuxagent.ga.exthandlers.HandlerManifest.get_disable_command')
def test__extension_upgrade_failure_when_prev_version_disable_fails_and_recovers(self, patch_get_disable_command,
I don't know how this test (test__extension_upgrade_failure_when_prev_version_disable_fails_and_recovers) was recovering before, maybe by using the is_upgrade flag, but perhaps you could keep it just to ensure that we have this case tested too.
Everything else LGTM!
Good point. This test was "recovering" by patching launch_command to ensure it's a no-op and simply running the exthandler sequence again for the same goal state, which is not valid anymore. However, there is value in having a test that exercises the recovery scenario of the upgrade sequence (this time on a new incarnation). I'll add that test. Thanks!
LGTM
This fix was much needed. Thanks. :)
LGTM.
Description
Currently, if we're upgrading an extension and the upgrade fails, we will retry the entire upgrade sequence every three seconds. This is triggered in the main loop of the extension handler thread.
Unfortunately, when one of the commands in the upgrade sequence keeps failing (e.g. the old extension's disable command) we are stuck in a never-ending loop of trying to upgrade the extension and failing.
This PR removes the retry logic for the upgrade scenario, meaning we will treat it as any other operation -- if the operation was already processed (either successfully or unsuccessfully) for the same goal state (same incarnation number), we will not process it again.
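As a rough illustration of the resulting behavior (not the agent's actual code), the sketch below processes a goal state at most once per incarnation, whether or not the upgrade succeeded, and only tries again when a new incarnation arrives. The names `ExtensionProcessor`, `process`, and `make_flaky_upgrade` are hypothetical and exist only for this example.

```python
class ExtensionProcessor:
    """Hypothetical stand-in for the extension handler loop; not the agent's real class."""

    def __init__(self, run_upgrade):
        self.run_upgrade = run_upgrade          # callable that raises on failure
        self.last_processed_incarnation = None  # goal state we already handled
        self.attempts = 0                       # how many times the upgrade actually ran

    def process(self, incarnation):
        # Same incarnation as last time: already handled, successfully or not.
        if incarnation == self.last_processed_incarnation:
            return "skipped"
        # Record the incarnation before running, so a failure is not retried
        # for this goal state.
        self.last_processed_incarnation = incarnation
        self.attempts += 1
        try:
            self.run_upgrade()
            return "succeeded"
        except Exception:
            return "failed"


def make_flaky_upgrade(failures):
    """Build an upgrade callable that fails `failures` times, then succeeds."""
    state = {"remaining": failures}

    def run():
        if state["remaining"] > 0:
            state["remaining"] -= 1
            raise RuntimeError("old extension's disable command failed")

    return run


processor = ExtensionProcessor(make_flaky_upgrade(failures=1))
# Incarnation 1 fails once and is then skipped; incarnation 2 triggers a fresh attempt.
results = [processor.process(1), processor.process(1), processor.process(2)]
print(results, processor.attempts)
```

With this shape, a persistently failing disable command costs one attempt per goal state instead of one attempt every three seconds, and recovery happens only on a new incarnation.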