
Remove upgrade extension loop for the same goal state #1686

Merged
3 commits merged on Nov 4, 2019

Conversation

Contributor

@pgombar pgombar commented Oct 28, 2019

Description

Currently, if we're upgrading an extension and the upgrade fails, we will retry the entire upgrade sequence every three seconds. This is triggered in the main loop of the extension handler thread.

Unfortunately, when one of the commands in the upgrade sequence keeps failing (e.g. the old extension's disable command) we are stuck in a never-ending loop of trying to upgrade the extension and failing.

This PR removes the retry logic for the upgrade scenario, meaning we will treat it as any other operation -- if the operation was already processed (either successfully or unsuccessfully) for the same goal state (same incarnation number), we will not process it again.
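To illustrate the once-per-goal-state behavior described above, here is a minimal sketch with hypothetical names (`last_processed_incarnation`, `should_process` are illustrative, not the agent's actual code; the real agent keys off the goal state's incarnation number):

```python
class ExtHandlerInstance:
    """Minimal sketch of once-per-goal-state processing (assumed names)."""

    def __init__(self):
        self.last_processed_incarnation = None  # hypothetical tracking field

    def should_process(self, incarnation):
        # Each goal state (incarnation) is processed at most once,
        # whether the previous attempt succeeded or failed.
        if incarnation == self.last_processed_incarnation:
            return False
        self.last_processed_incarnation = incarnation
        return True
```

Under this scheme a failed upgrade is not retried until the platform publishes a new goal state with a higher incarnation number.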


PR information

  • The title of the PR is clear and informative.
  • There are a small number of commits, each of which has an informative message. This means that previously merged commits do not appear in the history of the PR. For information on cleaning up the commits in your pull request, see this page.
  • Except for special cases involving multiple contributors, the PR is started from a fork of the main repository, not a branch.
  • If applicable, the PR references the bug/issue that it fixes in the description.
  • New unit tests were added for the changes made, and Travis CI is passing.

Quality of Code and Contribution Guidelines



@@ -684,7 +684,6 @@ def __init__(self, ext_handler, protocol):
         self.operation = None
         self.pkg = None
         self.pkg_file = None
-        self.is_upgrade = False
Contributor

I'm very curious to know why this property was added in the first place.

Contributor Author

This PR brought it in. It seems the flag was only used to intentionally retry the same goal state in the upgrade scenario. I believe the intention was to add a retry in case of transient failures during upgrade, but we've since seen how messy and costly it is when the upgrade failure is non-transient (hint: OMS), so I don't think it's a good trade-off to keep this logic.

Contributor

I can't think of a scenario where this might hit, but I'm afraid some customers might have taken a dependency on this "behavior" where we keep retrying the goal state if there's an extension upgrade available.
Anyway, even if someone did take such a dependency, they can adjust their extension to the correct behavior.

Member

@larohra: By taking a dependency on this "behavior", do you mean writing extensions assuming that GA would keep retrying till the end of time?

@pgombar, if it's not already in our documentation, we should also make it clear and concrete there that we only try once.

Contributor

Yeah, basically. What I meant was: since we even had a test case that was pretty much checking that the disable-failed scenario would recover itself without a new incarnation change, maybe the previous owners told the extension publishers not to worry about transient issues because they would be "auto-resolved" eventually, due to which they don't have enough retry logic in their code.

@vrdmr I agree with your point about double-checking the documentation and making it very explicit that we only try once per goal state.

codecov bot commented Oct 28, 2019

Codecov Report

❗ No coverage uploaded for pull request base (develop@8c108ec).
The diff coverage is 100%.


@@            Coverage Diff             @@
##             develop    #1686   +/-   ##
==========================================
  Coverage           ?   67.34%           
==========================================
  Files              ?       80           
  Lines              ?    11432           
  Branches           ?     1604           
==========================================
  Hits               ?     7699           
  Misses             ?     3393           
  Partials           ?      340
| Impacted Files | Coverage Δ |
| --- | --- |
| azurelinuxagent/ga/exthandlers.py | 84.01% <100%> (ø) |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 8c108ec...6e23f55.

self._assert_handler_status(protocol.report_vm_status, "NotReady", expected_ext_count=0, version="1.0.1")

@patch('azurelinuxagent.ga.exthandlers.HandlerManifest.get_disable_command')
def test__extension_upgrade_failure_when_prev_version_disable_fails_and_recovers(self, patch_get_disable_command,
Contributor

I don't know how this test was recovering before (maybe by using the is_upgrade flag) - test__extension_upgrade_failure_when_prev_version_disable_fails_and_recovers - but maybe you could keep it just to ensure that we have this case tested too.

Everything else LGTM!

Contributor Author

@pgombar pgombar Oct 29, 2019

Good point, this test was "recovering" by patching launch_command to make it a no-op and simply re-running the exthandler sequence for the same goal state, which is no longer valid. However, there is value in having a test that exercises the recovery scenario of the upgrade sequence (this time on a new incarnation). I'll add that test. Thanks!
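A hypothetical sketch of that recovery scenario (assumed names; the real test uses the agent's protocol mocks, `patch`, and `_assert_handler_status` rather than this toy model):

```python
class FakeHandler:
    """Hypothetical stand-in for the extension handler: retries happen
    only when a new goal state (incarnation) arrives."""

    def __init__(self):
        self.processed = set()
        self.disable_fails = True   # simulates the old version's disable command failing
        self.status = None

    def run_goal_state(self, incarnation):
        if incarnation in self.processed:
            return  # same goal state: do not reprocess
        self.processed.add(incarnation)
        self.status = "NotReady" if self.disable_fails else "Ready"


handler = FakeHandler()
handler.run_goal_state(1)      # disable fails: handler reports NotReady
handler.run_goal_state(1)      # same incarnation: no retry, still NotReady
handler.disable_fails = False  # the failure clears up
handler.run_goal_state(2)      # new incarnation: upgrade recovers, Ready
```

The key assertion is that the status only changes once a new incarnation is processed, never by re-running the same goal state.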

Contributor

@larohra larohra left a comment

LGTM

Member

@vrdmr vrdmr left a comment

This fix was much needed. Thanks. :)

LGTM.

4 participants