
Remove upgrade extension loop for the same goal state #1686

Merged
3 commits merged on Nov 4, 2019

Conversation

Contributor

@pgombar pgombar commented Oct 28, 2019

Description

Currently, if we're upgrading an extension and the upgrade fails, we will retry the entire upgrade sequence every three seconds. This is triggered in the main loop of the extension handler thread.

Unfortunately, when one of the commands in the upgrade sequence keeps failing (e.g. the old extension's disable command) we are stuck in a never-ending loop of trying to upgrade the extension and failing.

This PR removes the retry logic for the upgrade scenario, meaning we will treat it as any other operation -- if the operation was already processed (either successfully or unsuccessfully) for the same goal state (same incarnation number), we will not process it again.
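To illustrate the once-per-goal-state behavior described above, here is a minimal sketch with hypothetical names (`last_processed_incarnation`, `should_process` are illustrative, not the agent's actual code; the real agent keys off the goal state's incarnation number):

```python
class ExtHandlerInstance:
    """Minimal sketch of once-per-goal-state processing (assumed names)."""

    def __init__(self):
        self.last_processed_incarnation = None  # hypothetical tracking field

    def should_process(self, incarnation):
        # Each goal state (incarnation) is processed at most once,
        # whether the previous attempt succeeded or failed.
        if incarnation == self.last_processed_incarnation:
            return False
        self.last_processed_incarnation = incarnation
        return True
```

Under this scheme a failed upgrade is not retried until the platform publishes a new goal state with a higher incarnation number.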


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • Except for special cases involving multiple contributors, the PR is started from a fork of the main repository, not a branch.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New unit tests were added for the changes made, and Travis CI is passing.

Quality of Code and Contribution Guidelines



@@ -684,7 +684,6 @@ def __init__(self, ext_handler, protocol):
         self.operation = None
         self.pkg = None
         self.pkg_file = None
-        self.is_upgrade = False
Contributor

I'm very curious to know why this property was added in the first place.

Contributor Author

This PR brought it in. It seems the flag was only used to intentionally retry the same goal state in the upgrade scenario. I believe the intention was to add a retry in case of transient failures during upgrade, but we've since seen how messy and costly it is when the upgrade failure is non-transient (hint: OMS), so I don't think it's a good trade-off to keep this logic.

Contributor

I can't think of a scenario where this might hit, but I'm afraid some customers might have taken a dependency on this "behavior" where we keep retrying the goal state if there's an extension upgrade available.
Anyway, even if someone did take such a dependency, they can adjust their extension to the correct behavior.

Member

@larohra: By taking a dependency on this "behavior", do you mean writing extensions assuming that GA would keep retrying till the end of time?

@pgombar, if it's not already in our documentation, we should also make it clear and concrete there that we only try once.

Contributor

Yeah, basically. What I meant was: since we even had a test case that was pretty much checking that the disable-failed scenario would recover itself without a new incarnation change, maybe the previous owners told the extension publishers not to worry about transient issues because they would be "auto-resolved" eventually, due to which they don't have enough retry logic in their code.

@vrdmr I agree with your point about double-checking the documentation and making it very explicit that we only try once per goal state.

codecov bot commented Oct 28, 2019

Codecov Report

❗ No coverage uploaded for pull request base (develop@8c108ec).
The diff coverage is 100%.


@@            Coverage Diff             @@
##             develop    #1686   +/-   ##
==========================================
  Coverage           ?   67.34%           
==========================================
  Files              ?       80           
  Lines              ?    11432           
  Branches           ?     1604           
==========================================
  Hits               ?     7699           
  Misses             ?     3393           
  Partials           ?      340
| Impacted Files | Coverage Δ |
| --- | --- |
| azurelinuxagent/ga/exthandlers.py | 84.01% <100%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8c108ec...6e23f55.

self._assert_handler_status(protocol.report_vm_status, "NotReady", expected_ext_count=0, version="1.0.1")

@patch('azurelinuxagent.ga.exthandlers.HandlerManifest.get_disable_command')
def test__extension_upgrade_failure_when_prev_version_disable_fails_and_recovers(self, patch_get_disable_command,
Contributor

I don't know how this test was recovering before (maybe by using the is_upgrade flag) - test__extension_upgrade_failure_when_prev_version_disable_fails_and_recovers - but maybe you could keep it just to ensure that we have this case tested too.

Everything else LGTM!

Contributor Author

@pgombar pgombar Oct 29, 2019

Good point, this test was "recovering" by patching launch_command to make it a no-op and simply re-running the exthandler sequence for the same goal state, which is no longer valid. However, there is value in having a test that exercises the recovery scenario of the upgrade sequence (this time on a new incarnation). I'll add that test. Thanks!
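A hypothetical sketch of that recovery scenario (assumed names; the real test uses the agent's protocol mocks, `patch`, and `_assert_handler_status` rather than this toy model):

```python
class FakeHandler:
    """Hypothetical stand-in for the extension handler: retries happen
    only when a new goal state (incarnation) arrives."""

    def __init__(self):
        self.processed = set()
        self.disable_fails = True   # simulates the old version's disable command failing
        self.status = None

    def run_goal_state(self, incarnation):
        if incarnation in self.processed:
            return  # same goal state: do not reprocess
        self.processed.add(incarnation)
        self.status = "NotReady" if self.disable_fails else "Ready"


handler = FakeHandler()
handler.run_goal_state(1)      # disable fails: handler reports NotReady
handler.run_goal_state(1)      # same incarnation: no retry, still NotReady
handler.disable_fails = False  # the failure clears up
handler.run_goal_state(2)      # new incarnation: upgrade recovers, Ready
```

The key assertion is that the status only changes once a new incarnation is processed, never by re-running the same goal state.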

Contributor

@larohra larohra left a comment

LGTM

Member

@vrdmr vrdmr left a comment

This fix was much needed. Thanks. :)

LGTM.

4 participants