
VMSS: Enhance Upgrade Policy with Automatic OS upgrades and Rolling Upgrade Policy support #922

Merged (22 commits) on Oct 26, 2018

Conversation


@agolomoodysaada (Contributor) commented Mar 2, 2018

closes #714

Remaining Work

  • Support Rolling upgrade policy
  • Support health probe referencing in network_profile
  • Support automatic OS upgrades
  • Write passing tests

This is a placeholder PR to track the progress of this work. The code is ready to be reviewed; I tried to stay as backwards-compatible as possible.

Docs can be found here: https://github.com/Azure/vm-scale-sets/tree/master/preview/upgrade#how-to-configure-auto-updates

@agolomoodysaada force-pushed the upgrade_policy_rolling branch 4 times, most recently from f2424b5 to 5c2895e on March 2, 2018 01:11
@agolomoodysaada force-pushed the upgrade_policy_rolling branch from 5c2895e to 83b20f2 on March 2, 2018 16:02
@agolomoodysaada changed the title from "Enhance Upgrade Policy with Automatic OS upgrades and Rolling Upgrade Policy support" to "VMSS: Enhance Upgrade Policy with Automatic OS upgrades and Rolling Upgrade Policy support" Mar 2, 2018
@tombuildsstuff added the enhancement and service/vmss (Virtual Machine Scale Sets) labels Mar 2, 2018
@tombuildsstuff added this to the 1.3.0 milestone Mar 2, 2018

@tombuildsstuff (Contributor) left a comment


hey @agolomoodysaada

Thanks for this PR :)

I've taken a look through and this is off to a good start - there are two things we need to add before we can merge this:

  1. Switch from a Set -> a List (since all values are being returned)
  2. Adding an acceptance test to cover this functionality (which you've already got a checkbox for, I'm just stating it for posterity 😄)

Thanks!

},

"rolling_upgrade_policy": {
    Type: schema.TypeSet,
@tombuildsstuff commented Mar 5, 2018:


given all 4 of these are returned, and there's only one item available - can we swap this to a TypeList instead? It'll make it easier for people to reference #Resolved
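A minimal sketch of the suggested change, assuming the field names quoted in this thread (the field types and the default of 20 are taken from the docs further down; the rest of the schema is omitted):

// rolling_upgrade_policy as a single-item TypeList rather than a TypeSet:
// the API always returns all four values and only one block is allowed,
// so a list lets users reference attributes directly, e.g.
// rolling_upgrade_policy.0.max_batch_instance_percent
"rolling_upgrade_policy": {
    Type:     schema.TypeList,
    Optional: true,
    MaxItems: 1,
    Elem: &schema.Resource{
        Schema: map[string]*schema.Schema{
            "max_batch_instance_percent": {
                Type:     schema.TypeInt,
                Optional: true,
                Default:  20,
            },
            // ...the other three fields follow the same pattern
        },
    },
    // with a TypeList, the custom Set hash function
    // (resourceArmVirtualMachineScaleSetRollingUpgradePolicyHash) is no longer needed
},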

"pause_time_between_batches": {
Type: schema.TypeString,
Optional: true,
Default: "PT0S",
@tombuildsstuff commented Mar 5, 2018:


are these 4 defaults the default set by Azure? #Closed

@agolomoodysaada (Author) commented Mar 6, 2018:


I believe they are, yes. #Closed


@julienstroheker commented Jun 15, 2018:


Yes it is. #Closed

},
},
},
Set: resourceArmVirtualMachineScaleSetRollingUpgradePolicyHash,
@tombuildsstuff commented Mar 5, 2018:


(if we make this a List, we shouldn't need this Set function) #Resolved

* `max_batch_instance_percent` - (Optional) The maximum percent of total virtual machine instances that will be upgraded simultaneously by the rolling upgrade in one batch. As this is a maximum, unhealthy instances in previous or future batches can cause the percentage of instances in a batch to decrease to ensure higher reliability. Defaults to `20`.
* `max_unhealthy_instance_percent` - (Optional) The maximum percentage of the total virtual machine instances in the scale set that can be simultaneously unhealthy, either as a result of being upgraded, or by being found in an unhealthy state by the virtual machine health checks before the rolling upgrade aborts. This constraint will be checked prior to starting any batch. Defaults to `20`.
* `max_unhealthy_upgraded_instance_percent` - (Optional) The maximum percentage of upgraded virtual machine instances that can be found to be in an unhealthy state. This check will happen after each batch is upgraded. If this percentage is ever exceeded, the rolling update aborts. Defaults to `5`.
* `pause_time_between_batches` - (Optional) The wait time between completing the update for all virtual machines in one batch and starting the next batch. The duration should be specified in ISO 8601 duration format (https://en.wikipedia.org/wiki/ISO_8601#Durations). Defaults to 0 seconds, represented as `PT0S`.
@tombuildsstuff commented Mar 5, 2018:


minor: can we quote the 0? #Resolved

@tombuildsstuff modified the milestones: 1.3.0, 1.3.1 Mar 13, 2018
@tombuildsstuff modified the milestones: 1.3.1, 1.3.2 Mar 28, 2018
@tombuildsstuff modified the milestones: 1.3.2, 1.3.3 Apr 4, 2018
@tombuildsstuff modified the milestones: 1.3.3, 1.4.0 Apr 17, 2018
@tombuildsstuff modified the milestones: 1.4.0, 1.5.0 Apr 25, 2018
@tombuildsstuff modified the milestones: 1.5.0, 1.6.0 May 8, 2018
@tombuildsstuff modified the milestones: 1.6.0, 1.7.0 May 21, 2018
@katbyte modified the milestones: 1.7.0, Soon May 31, 2018

@katbyte (Collaborator) commented May 31, 2018

Hi @agolomoodysaada,

Just wondering if you were still working on this 🙂

@agolomoodysaada (Author) commented:

@katbyte, I honestly haven't had much time to finish up the tests for this work.
If you're interested in picking up this work, feel free to drop a comment here and take over.
Thank you!

@katbyte previously requested changes Aug 7, 2018

@katbyte (Collaborator) left a comment


@jeffreyCline, @JunyiYi

This PR still has failing tests that need to be resolved:

* azurerm_lb_rule.test: Error Creating/Updating LoadBalancer "acctestlb-8066726686886945008" (Resource Group "acctestrg-8066726686886945008"): network.LoadBalancersClient#CreateOrUpdate: Failure sending request: StatusCode=0 -- Original Error: autorest/azure: Service returned an error. Status=400 Code="CannotModifyRuleUsedByProbeUsedByVMSS" Message="Load balancer rule /subscriptions/*******/resourceGroups/acctestrg-8066726686886945008/providers/Microsoft.Network/loadBalancers/acctestlb-8066726686886945008/loadBalancingRules/LBRule cannot be modified because it references the load balancer probe /subscriptions/*******/resourceGroups/acctestrg-8066726686886945008/providers/Microsoft.Network/loadBalancers/acctestlb-8066726686886945008/probes/ssh-running-probe that used as health probe by VM scale set /subscriptions/*******/resourceGroups/acctestrg-8066726686886945008/providers/Microsoft.Compute/virtualMachineScaleSets/acctvmss-8066726686886945008. To make changes to this rule, please update VM scale set to remove the reference to the probe." Details=[]

@JunyiYi dismissed katbyte’s stale review August 7, 2018 23:53

Test case has been updated accordingly.

@katbyte removed the request for review from JunyiYi August 8, 2018 07:30

@katbyte (Collaborator) left a comment


@JunyiYi,

All tests are failing due to a panic:

Test ended in panic.

------- Stdout: -------
=== RUN   TestAccAzureRMVirtualMachineScaleSet_NonStandardCasing

------- Stderr: -------
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x0 pc=0x19fed31]

goroutine 629 [running]:
github.com/terraform-providers/terraform-provider-azurerm/azurerm.flattenAzureRmVirtualMachineScaleSetRollingUpgradePolicy(...)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/azurerm/resource_arm_virtual_machine_scale_set.go:1139
github.com/terraform-providers/terraform-provider-azurerm/azurerm.resourceArmVirtualMachineScaleSetRead(0xc42038d500, 0x20507a0, 0xc4204e4800, 0xc4204722e0, 0x0)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/azurerm/resource_arm_virtual_machine_scale_set.go:873 +0x23a1
github.com/terraform-providers/terraform-provider-azurerm/azurerm.resourceArmVirtualMachineScaleSetCreate(0xc42038d500, 0x20507a0, 0xc4204e4800, 0xc42038d500, 0x0)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/azurerm/resource_arm_virtual_machine_scale_set.go:828 +0xd4e
github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/helper/schema.(*Resource).Apply(0xc420354620, 0xc4202f9310, 0xc42062b480, 0x20507a0, 0xc4204e4800, 0xc4202cde01, 0x46, 0x0)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/helper/schema/resource.go:227 +0x364
github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/helper/schema.(*Provider).Apply(0xc420354c40, 0xc4204ae870, 0xc4202f9310, 0xc42062b480, 0x1, 0x54, 0x13)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/helper/schema/provider.go:283 +0xa4
github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform.(*EvalApply).Eval(0xc42087a900, 0x32225c0, 0xc4206b72b0, 0x2, 0x2, 0x208ae8e, 0x4)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform/eval_apply.go:57 +0x239
github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform.EvalRaw(0x3210c80, 0xc42087a900, 0x32225c0, 0xc4206b72b0, 0x0, 0x0, 0x0, 0x0)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform/eval.go:53 +0x173
github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform.(*EvalSequence).Eval(0xc420791000, 0x32225c0, 0xc4206b72b0, 0x2, 0x2, 0x208ae8e, 0x4)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform/eval_sequence.go:14 +0x7e
github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform.EvalRaw(0x3211740, 0xc420791000, 0x32225c0, 0xc4206b72b0, 0x1e18b80, 0x3223702, 0x1dad7e0, 0xc4203ff2a0)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform/eval.go:53 +0x173
github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform.Eval(0x3211740, 0xc420791000, 0x32225c0, 0xc4206b72b0, 0xc420791000, 0x3211740, 0xc420791000, 0x48f514)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform/eval.go:34 +0x4d
github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform.(*Graph).walk.func1(0x20135a0, 0xc42045a080, 0x0, 0x0)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/terraform/graph.go:126 +0xd0d
github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/dag.(*Walker).walkVertex(0xc42039d880, 0x20135a0, 0xc42045a080, 0xc420624840)
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/dag/walk.go:387 +0x3c1
created by github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/dag.(*Walker).Update
    /opt/teamcity-agent/work/6811c3a0aae0f499/src/github.com/terraform-providers/terraform-provider-azurerm/vendor/github.com/hashicorp/terraform/dag/walk.go:310 +0x1364

I have commented on the relevant parts of the code that require a nil-check.

@@ -1059,6 +1134,17 @@ func flattenAzureRmVirtualMachineScaleSetBootDiagnostics(bootDiagnostic *compute
    return []interface{}{b}
}

func flattenAzureRmVirtualMachineScaleSetRollingUpgradePolicy(rollingUpgradePolicy *compute.RollingUpgradePolicy) []interface{} {
    b := map[string]interface{}{
        "max_batch_instance_percent": *rollingUpgradePolicy.MaxBatchInstancePercent,
@katbyte commented Aug 8, 2018:


These all need to be nil-checked. #Resolved
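A sketch of what the nil-checked flatten might look like, assuming the field names quoted in this thread (`MaxUnhealthyInstancePercent` is inferred from the schema key and is an assumption):

// Guard against a nil policy and nil fields: the API omits the rolling
// upgrade policy entirely when upgrade_policy_mode is not Rolling.
func flattenAzureRmVirtualMachineScaleSetRollingUpgradePolicy(policy *compute.RollingUpgradePolicy) []interface{} {
    if policy == nil {
        return []interface{}{}
    }

    b := make(map[string]interface{})
    if v := policy.MaxBatchInstancePercent; v != nil {
        b["max_batch_instance_percent"] = *v
    }
    if v := policy.MaxUnhealthyInstancePercent; v != nil { // assumed field name
        b["max_unhealthy_instance_percent"] = *v
    }
    if v := policy.MaxUnhealthyUpgradedInstancePercent; v != nil {
        b["max_unhealthy_upgraded_instance_percent"] = *v
    }
    if v := policy.PauseTimeBetweenBatches; v != nil {
        b["pause_time_between_batches"] = *v
    }
    return []interface{}{b}
}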

MaxUnhealthyUpgradedInstancePercent: utils.Int32(int32(config["max_unhealthy_upgraded_instance_percent"].(int))),
PauseTimeBetweenBatches: utils.String(config["pause_time_between_batches"].(string)),
}
} else {
@katbyte commented Aug 8, 2018:


this else is redundant, the if statement above returns a value. #Resolved

Contributor


@kt@katbyte.me is right. Please take a look at this link:
https://golang.org/ref/spec#Return_statements

The return values are initialized to their zero values when entering the function, so the return nil here can be saved.
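For illustration, a simplified sketch of the shape being suggested (a hypothetical function, not the exact provider code):

// With an early return in the if branch, no else is needed: execution
// falls through to the final return only when the block is absent.
func expandRollingUpgradePolicy(configs []interface{}) *compute.RollingUpgradePolicy {
    if len(configs) > 0 {
        config := configs[0].(map[string]interface{})
        return &compute.RollingUpgradePolicy{
            PauseTimeBetweenBatches: utils.String(config["pause_time_between_batches"].(string)),
        }
    }
    return nil
}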




katbyte commented Aug 8, 2018

(screenshot: 2018-08-08 00:35)


@tombuildsstuff (Contributor) left a comment


hi @JunyiYi

Taking a quick look through this I noticed a couple of minor things in addition to @katbyte's comments - the main issue I spotted was that we should add an upgrade test to ensure the rolling_upgrade_policy block is removed (and fix up the nil checks)

Thanks!
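A rough sketch of the kind of upgrade test being asked for here, following the provider's usual acceptance-test shape (the test name and the two config helper functions are assumed, not taken from the PR):

func TestAccAzureRMVirtualMachineScaleSet_updateToManualRemovesRollingPolicy(t *testing.T) {
    resourceName := "azurerm_virtual_machine_scale_set.test"
    ri := acctest.RandInt()

    resource.Test(t, resource.TestCase{
        PreCheck:     func() { testAccPreCheck(t) },
        Providers:    testAccProviders,
        CheckDestroy: testCheckAzureRMVirtualMachineScaleSetDestroy,
        Steps: []resource.TestStep{
            {
                // Start with a Rolling policy...
                Config: testAccAzureRMVirtualMachineScaleSet_rollingUpgradePolicy(ri),
                Check: resource.ComposeTestCheckFunc(
                    resource.TestCheckResourceAttr(resourceName, "upgrade_policy_mode", "Rolling"),
                    resource.TestCheckResourceAttr(resourceName, "rolling_upgrade_policy.#", "1"),
                ),
            },
            {
                // ...then switch to Manual and assert the block is removed.
                Config: testAccAzureRMVirtualMachineScaleSet_manualUpgradePolicy(ri),
                Check: resource.ComposeTestCheckFunc(
                    resource.TestCheckResourceAttr(resourceName, "upgrade_policy_mode", "Manual"),
                    resource.TestCheckResourceAttr(resourceName, "rolling_upgrade_policy.#", "0"),
                ),
            },
        },
    })
}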

}

resource "azurerm_public_ip" "test" {
  name = "PublicIPForLB"
@tombuildsstuff commented Aug 8, 2018:


this should be acctestpip%d to get caught by the sweepers #Resolved

}

sku {
  name = "Standard_A0"
@tombuildsstuff commented Aug 8, 2018:


Standard_A0 -> Standard_F2 to reduce the time for this test #Resolved

new resource to be created. See [documentation](http://docs.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-placement-groups) for more information.
* `upgrade_policy_mode` - (Required) Specifies the mode of an upgrade to virtual machines in the scale set. Possible values are `Rolling`, `Manual` and `Automatic`. When choosing `Rolling`, you will need to set a health probe.
* `automatic_os_upgrade` - (Optional) Automatic OS patches can be applied by Azure to your scale set. This is particularly useful when `upgrade_policy_mode` is set to `Rolling`. Defaults to `false`.
* `rolling_upgrade_policy` - (Optional) Ignored unless `upgrade_policy_mode` is set to `Rolling`.
@tombuildsstuff commented Aug 8, 2018:


Can we make this:

* `rolling_upgrade_policy` - (Optional) A `rolling_upgrade_policy` block as defined below. This is only applicable when the `upgrade_policy_mode` is `Rolling`.

#Resolved

@hunterchris

Hello folks, is there any estimate of when this fix will be released?
Thanks!

@holderbaum

Hey there,

we are still desperately waiting for this feature, since its absence is keeping us from using scale sets at all.

What is your current estimate of when we can expect it to be merged?

Cheers
Jakob

@JunyiYi

JunyiYi commented Sep 6, 2018

Hi @hunterchris @holderbaum, the code here is waiting for a fix in the API; you can track the issue here: Azure/azure-rest-api-specs#3751. AFAIK, without the fix on the API side, we will not merge this.

@ghost added the size/XL label Sep 28, 2018
@katbyte added the upstream/microsoft (Indicates that there's an upstream issue blocking this issue/PR) label Sep 29, 2018
@JunyiYi removed the upstream/microsoft (Indicates that there's an upstream issue blocking this issue/PR) label Oct 1, 2018
@JunyiYi

JunyiYi commented Oct 1, 2018

Hi @katbyte @tombuildsstuff, according to an internal email thread, the API team said they will not fix it in the foreseeable future. So I have fixed it in the provider by:

  1. Making sure the user clears out rolling_upgrade_policy when upgrade_policy_mode is not Rolling
  2. Suppressing the rolling_upgrade_policy diff when the user removes it (while remotely it still holds the default values) under a non-Rolling upgrade_policy_mode

This is also how the API team handles it on the service side (they reset rolling_upgrade_policy to its default values when upgrade_policy_mode is not Rolling), according to their confirmation in the email thread, and they recommended we do the same in our code base. See the sketch below.

Test cases are added to cover this behavior as well.
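One plausible shape for the two fixes described above (a hypothetical sketch; the actual implementation in the PR may differ):

// 1. During create/update: reject a user-supplied rolling_upgrade_policy
//    block when upgrade_policy_mode is anything other than Rolling.
mode := d.Get("upgrade_policy_mode").(string)
if policyRaw, ok := d.GetOk("rolling_upgrade_policy"); ok && len(policyRaw.([]interface{})) > 0 {
    if !strings.EqualFold(mode, "Rolling") {
        return fmt.Errorf("`rolling_upgrade_policy` is only applicable when `upgrade_policy_mode` is set to `Rolling`")
    }
}

// 2. On the rolling_upgrade_policy schema: suppress the diff when the mode
//    is not Rolling, so the defaults returned by the API don't appear as drift.
DiffSuppressFunc: func(k, old, new string, d *schema.ResourceData) bool {
    return !strings.EqualFold(d.Get("upgrade_policy_mode").(string), "Rolling")
},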

@tombuildsstuff modified the milestones: Soon, 1.18.0 Oct 25, 2018
@katbyte assigned katbyte and unassigned JunyiYi Oct 25, 2018
@katbyte (Collaborator) commented Oct 26, 2018

@JunyiYi,

Thanks for the updates to the tests, they now pass:
(screenshot: passing test results, 2018-10-26 16:28)

@katbyte merged commit c2080ad into hashicorp:master Oct 26, 2018
katbyte added a commit that referenced this pull request Oct 26, 2018
@tombuildsstuff

hi @agolomoodysaada @julienstroheker @holderbaum

Just to let you know that this has been released as part of v1.18 of the AzureRM Provider (the full changelog is available here). You can upgrade to it by specifying the version in the provider block (as shown below) and then running `terraform init -upgrade`:

provider "azurerm" {
  version = "=1.18.0"
}

Thanks!

@ghost

ghost commented Mar 6, 2019

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.

If you feel this issue should be reopened, we encourage creating a new issue linking back to this one for added context. If you feel I made an error 🤖 🙉 , please reach out to my human friends 👉 hashibot-feedback@hashicorp.com. Thanks!

@ghost locked and limited conversation to collaborators Mar 6, 2019
Labels
enhancement service/vmss Virtual Machine Scale Sets size/XL
Development

Successfully merging this pull request may close these issues.

Feature Request: Support VMSS upgrade_policy_mode "Rolling"
9 participants