Azure Linux Agent fails to install extensions in Linux VMSS without settings field in the extension block #23688

Closed
1 task done
Laffs2k5 opened this issue Oct 25, 2023 · 17 comments
Labels: service/virtual-machine, upstream/microsoft/waiting-on-service-team, v/3.x

Comments

@Laffs2k5

Laffs2k5 commented Oct 25, 2023

Is there an existing issue for this?

  • I have searched the existing issues

Community Note

  • Please vote on this issue by adding a 👍 reaction to the original issue to help the community and maintainers prioritize this request
  • Please do not leave "+1" or "me too" comments, they generate extra noise for issue followers and do not help prioritize the request
  • If you are interested in working on this issue or have submitted a pull request, please leave a comment and review the contribution guide to help.

Terraform Version

1.5.6

AzureRM Provider Version

3.77.0

Affected Resource(s)/Data Source(s)

azurerm_linux_virtual_machine_scale_set

Terraform Configuration Files

resource "azurerm_linux_virtual_machine_scale_set" "runner" {
  # required fields go here

  # Health check on port 22
  # Required for automatic os upgrades
  extension {
    name                 = "port-22-health"
    publisher            = "Microsoft.ManagedServices"
    type                 = "ApplicationHealthLinux"
    type_handler_version = "1.0"
    settings = jsonencode({
      "protocol" = "tcp"
      "port"     = 22
    })
  }

  # Support logon with AAD
  extension {
    provision_after_extensions = ["port-22-health"]
    name                       = "azure-ad-ssh-login"
    publisher                  = "Microsoft.Azure.ActiveDirectory"
    type                       = "AADSSHLoginForLinux"
    type_handler_version       = "1.0"
  }

  # Install and configure as a GitHub Actions runner
  extension {
    provision_after_extensions = ["port-22-health", "azure-ad-ssh-login"]
    name                       = "custom-script-setup-as-github-actions-runner"
    publisher                  = "Microsoft.Azure.Extensions"
    type                       = "CustomScript"
    type_handler_version       = "2.1"
    protected_settings         = jsonencode({ "script" = base64gzip(local.provisioning_script) })
  }
}

Debug Output/Panic Output

NOTE: I know this is not Terraform/azurerm output, but please keep reading.
This is the log output of the Azure Linux Agent (waagent) inside one of the unhealthy VMs:

2023-10-25T11:56:08.127286Z WARNING ExtHandler ExtHandler dependsOn is an empty array for extension Microsoft.ManagedServices.ApplicationHealthLinux; setting the dependency level to 0
2023-10-25T11:56:08.131016Z ERROR ExtHandler ExtHandler Error fetching the goal state: [ProtocolError] Error fetching goal state
Inner error: [VmSettingsParseError] Error parsing vmSettings [HGAP: 1.0.8.143 Etag:10483117470961880950]: list index out of range
Traceback (most recent call last):
  File "bin/WALinuxAgent-2.9.1.1-py3.8.egg/azurelinuxagent/common/protocol/extensions_goal_state_from_vm_settings.py", line 55, in __init__
    self._parse_vm_settings(json_text)
  File "bin/WALinuxAgent-2.9.1.1-py3.8.egg/azurelinuxagent/common/protocol/extensions_goal_state_from_vm_settings.py", line 153, in _parse_vm_settings
    self._parse_extensions(vm_settings)
  File "bin/WALinuxAgent-2.9.1.1-py3.8.egg/azurelinuxagent/common/protocol/extensions_goal_state_from_vm_settings.py", line 412, in _parse_extensions
    self._parse_dependency_level(depends_on, extension)
  File "bin/WALinuxAgent-2.9.1.1-py3.8.egg/azurelinuxagent/common/protocol/extensions_goal_state_from_vm_settings.py", line 499, in _parse_dependency_level
    extension.settings[0].dependencyLevel = depends_on[0]['dependencyLevel']
IndexError: list index out of range

Expected Behaviour

The VMSS should enter running state, provision extensions and be healthy.

Actual Behaviour

The Azure Linux Agent (waagent) in each VM in the VMSS fails to install the extensions as configured, with the error shown above.

We have been running several Linux VMSSs (Ubuntu 20.04) with the given extension configuration for well over a year now.
The VMSSs are turned off during the night and boot up each morning. Except today. There have been no changes in the Azure resource, the Terraform code, or the provider version.

After a bit of digging around, it turns out that the Python code in waagent fails when an extension is declared without the settings field, as in this example for the extension named azure-ad-ssh-login.
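
The failing statement from the traceback can be reproduced in isolation. This is a minimal sketch, not the agent's actual code: `Extension`, `ExtensionSettings`, `parse_dependency_level` and `try_parse` are hypothetical stand-ins that mimic only the last line of the stack trace, on the assumption that an extension declared without settings ends up with an empty `settings` list.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class ExtensionSettings:
    # Hypothetical stand-in for one settings entry of an extension.
    dependencyLevel: int = 0


@dataclass
class Extension:
    # Hypothetical stand-in; assumes "no settings" means an empty list.
    name: str
    settings: List[ExtensionSettings] = field(default_factory=list)


def parse_dependency_level(depends_on: List[Dict[str, int]], extension: Extension) -> None:
    # Mirrors the statement at the bottom of the traceback.
    extension.settings[0].dependencyLevel = depends_on[0]["dependencyLevel"]


def try_parse(depends_on: List[Dict[str, int]], extension: Extension) -> str:
    # Run the parse step and report the outcome as a string.
    try:
        parse_dependency_level(depends_on, extension)
        return "ok"
    except IndexError as error:
        return "IndexError: " + str(error)


depends_on = [{"dependencyLevel": 1}]
with_settings = Extension("custom-script", [ExtensionSettings()])
without_settings = Extension("azure-ad-ssh-login")

result_with = try_parse(depends_on, with_settings)
result_without = try_parse(depends_on, without_settings)
print(result_with)
print(result_without)
```

Under that assumption, only the extension with an empty `settings` list fails, which matches the error in the log.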

The workaround turned out to be quite simple: add settings = jsonencode({}) to the extension declaration to help the python code.

Not sure what has changed to cause this, either in the ARM API or in waagent?
The current waagent release, `2.9.1.1`, is from April 2023. Not sure about

I am creating this issue so that someone can look into whether the settings field of the extension block of azurerm_linux_virtual_machine_scale_set (and possibly the Windows variant) should be made required. Or maybe the provider should default to adding an empty JSON object when the settings field is not declared?

Steps to Reproduce

  1. Provision an Ubuntu 20.04 VMSS with the extensions declaration as shown.
  2. terraform apply.
  3. Observe that the VMSS remains unhealthy.
  4. Optionally connect to one of the VM's and read the waagent log: tail -f /var/log/waagent.log
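
For step 4, the relevant lines can also be picked out of the log programmatically. The following is a small sketch only: `find_vm_settings_errors` is a hypothetical helper name, and the sample text is copied from the log excerpt in this issue.

```python
from typing import List


def find_vm_settings_errors(log_text: str) -> List[str]:
    """Return the log lines that mention VmSettingsParseError."""
    matches = []
    for line in log_text.splitlines():
        if "VmSettingsParseError" in line:
            matches.append(line.strip())
    return matches


# Two lines taken from the waagent log excerpt in this issue.
sample_log = (
    "2023-10-25T11:56:08.131016Z ERROR ExtHandler ExtHandler Error fetching "
    "the goal state: [ProtocolError] Error fetching goal state\n"
    "Inner error: [VmSettingsParseError] Error parsing vmSettings "
    "[HGAP: 1.0.8.143 Etag:10483117470961880950]: list index out of range\n"
)

errors = find_vm_settings_errors(sample_log)
for error_line in errors:
    print(error_line)
```

On a VM, the same function could be given the contents of /var/log/waagent.log instead of the sample text.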

Important Factoids

No response

References

The python stack trace from waagent points to code here: https://github.com/Azure/WALinuxAgent/blob/v2.9.1.1/azurelinuxagent/common/protocol/extensions_goal_state_from_vm_settings.py#L499

@github-actions github-actions bot added the v/3.x label Oct 25, 2023
@Laffs2k5
Author

Just observed that a downside to the workaround is that terraform detects the extension blocks as changed on each apply, so not really a clean workaround 🤷‍♀️

@vijaytdh

vijaytdh commented Oct 25, 2023

👍 We see the same issue with Windows VMSS, terraform fails with the error:

 Error: waiting for update of Windows Virtual Machine Scale Set "xxxx" (Resource Group "yyyyy"): Code="VMExtensionHandlerNonTransientError" Message="The handler for VM extension type 'Microsoft.Azure.ActiveDirectory.AADLoginForWindows' has reported terminal failure for VM extension 'AADLogin' with error message: 'Enable failed for plugin (name: Microsoft.Azure.ActiveDirectory.AADLoginForWindows, version 2.1.0.0) with exception Command C:\\Packages\\Plugins\\Microsoft.Azure.ActiveDirectory.AADLoginForWindows\\2.1.0.0\\AADLoginForWindowsHandler.exe of Microsoft.Azure.ActiveDirectory.AADLoginForWindows has exited with Exit code: 1'.\r\n    \r\n'Enable handler for the extension failed. More information on troubleshooting is available at https://aka.ms/vmextensionwindowstroubleshoot'" Target="6"

The logs on the VM show:

 "status":{
         "code":-2146233076,
         "formattedMessage":{
            "lang":"en-US",
            "message":"The data contract type 'AADLoginForWindowsHandler.HandlerSettings' cannot be deserialized because the required data member 'publicSettings' was not found."
         },
         "name":"Microsoft.Azure.ActiveDirectory.AADLoginForWindows",
         "operation":"Enable",
         "status":"error",
         "substatus":null
      },

We tried several different type handler versions and had the same issue.

If I deploy the extension from the Portal, which creates an ARM template with the following, it works:

"resources": [
        {
            "type": "Microsoft.Compute/virtualMachineScaleSets/extensions",
            "apiVersion": "2021-03-01",
            "name": "[concat(parameters('vmName'),'/AADLogin')]",
            "location": "[parameters('location')]",
            "properties": {
                "publisher": "Microsoft.Azure.ActiveDirectory",
                "type": "AADLoginForWindows",
                "typeHandlerVersion": "1.0",
                "autoUpgradeMinorVersion": true
            }
        }
    ]

This doesn't make sense, because it is the same config as via Terraform: we don't specify settings (which, under the hood, is the same thing as publicSettings).

We also assumed that this might be some API level change that has caused this.

The workaround suggested by @Laffs2k5 doesn't work for Windows; it fails with the error below because it expects a principal ID to enable Intune management:

 Error message: \"'mdmId' setting was not found. Please input the 'mdmId' setting. This setting is case sensitive\". More information on troubleshooting is available at https://aka.ms/vmextensionwindowstroubleshoot.

In our case they are Windows 2019 VMSSs, and the issue occurs with versions 3.52 and 3.65.0 of the azurerm Terraform provider, using Terraform 1.3.2.

@Laffs2k5
Author

Laffs2k5 commented Oct 25, 2023

Interesting that there's similar behavior with the Windows variant.

Here is an updated workaround for the Linux VMSS without false positives: just add a dummy value to the settings, e.g. settings = jsonencode({ dummy = "value" })

@vijaytdh

I can reproduce the issue with an Ubuntu 20.04 VMSS with Terraform 1.3.2 and version 3.65.0 of the azurerm provider. Also, the workaround does not work for me in this case. I am using the Custom Script, Azure Monitor Agent and Application Health extensions. I have also tried using only some of the extensions, but that didn't help: even with a single extension (the Azure Monitor Agent extension), it still fails with the goal state error.

The only thing I can think of is to try to use the ordering of extensions through the explicit dependencies as @Laffs2k5 has done.

@vijaytdh

So for Linux I think the fix is in this PR Azure/WALinuxAgent#2957

@rcskosir rcskosir added upstream upstream/microsoft Indicates that there's an upstream issue blocking this issue/PR and removed upstream labels Oct 26, 2023
@narrieta

narrieta commented Oct 30, 2023

@Laffs2k5 You have already pointed to the correct fix on the Agent. We started rolling out a workaround last Friday, but it will take a few weeks to reach all regions that are affected.

As far as workarounds, settings = jsonencode({}) won't make any difference. jsonencode({ dummy = "value" }) will work as long as the extension in question ignores or handles the dummy setting.

A safer workaround is to remove the deployAfterExtensions property. You can re-add it once we have this scenario working correctly.

The issue occurred because the service side was ignoring deployAfterExtensions when the extension has no settings. This issue on the service was fixed recently and ended up exposing this bug in the Agent.

Removing deployAfterExtensions should not affect the functionality in your template, since anyways it had been ignored on extensions with no settings until very recently.
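
To see which extensions in a configuration are affected, the rule above can be sketched as a small check. This is illustrative only: `extensions_needing_workaround` is a hypothetical helper, the dictionaries reuse the Terraform attribute names from the original report (assuming `provision_after_extensions` is what ends up as deployAfterExtensions), and treating protected settings alone as "having settings" is an assumption, not something confirmed in this thread.

```python
from typing import Any, Dict, List


def extensions_needing_workaround(extensions: List[Dict[str, Any]]) -> List[str]:
    """Names of extensions that declare dependencies but have no settings."""
    flagged = []
    for ext in extensions:
        has_dependencies = bool(ext.get("provision_after_extensions"))
        # Assumption: protected settings alone count as "having settings".
        # An empty settings object is treated the same as no settings.
        has_settings = bool(ext.get("settings")) or bool(ext.get("protected_settings"))
        if has_dependencies and not has_settings:
            flagged.append(ext["name"])
    return flagged


# Mirrors the three extension blocks from the original report.
extensions = [
    {"name": "port-22-health", "settings": {"protocol": "tcp", "port": 22}},
    {"name": "azure-ad-ssh-login", "provision_after_extensions": ["port-22-health"]},
    {
        "name": "custom-script-setup-as-github-actions-runner",
        "provision_after_extensions": ["port-22-health", "azure-ad-ssh-login"],
        "protected_settings": {"script": "<elided>"},
    },
]

flagged = extensions_needing_workaround(extensions)
print(flagged)
```

With the configuration from the report, only azure-ad-ssh-login is flagged under these assumptions.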

I can link to this issue when we have a full fix on the service and the agent.

Note that this is specific to the Linux Agent. What is the issue with the Windows Agent? I can relay it to the Windows team.

@vijaytdh

Thanks @narrieta - the issue with the Windows agent is very similar: the VMSS fails to provision with the error:

Multiple VM extensions failed to be provisioned on the VM. Please see the VM extension instance view for other failures. The first extension failed due to the error: The handler for VM extension type 'Microsoft.Azure.ActiveDirectory.AADLoginForWindows' has reported terminal failure for VM extension 'aad-login' with error message: 'Enable failed for plugin (name: Microsoft.Azure.ActiveDirectory.AADLoginForWindows, version 1.2.0.0) with exception Command C:\Packages\Plugins\Microsoft.Azure.ActiveDirectory.AADLoginForWindows\1.2.0.0\AADLoginForWindowsHandler.exe of Microsoft.Azure.ActiveDirectory.AADLoginForWindows has exited with Exit code: 1'. 'Enable handler for the extension failed. More information on troubleshooting is available at https://aka.ms/vmextensionwindowstroubleshoot'

When we then look at the logs on the VM it says:

{"status":{"code":-2146233076,"formattedMessage":{"lang":"en-US","message":"The data contract type 'AADLoginForWindowsHandler.HandlerSettings' cannot be deserialized because the required data member 'publicSettings' was not found."},"name":"Microsoft.Azure.ActiveDirectory.AADLoginForWindows","operation":"Enable","status":"error","substatus":null},"timestampUTC":"\/Date(1698242853229)\/","version":"1"}]

However, unlike with Linux VMSS, we cannot supply a value for the settings field, because this is used on Windows when the VM is a Windows 10 VM that is intended to be onboarded to Intune (see here)

We raised a support ticket but are struggling to convince them that this is due to a change in the backend service and interaction with the agent.

@narrieta

@vijaytdh thank you. I relayed this to the Windows team. Could you share the ID of the support ticket?

@vijaytdh

@narrieta great, thank you. The ticket TrackingID is 2310250050004606

@narrieta

@vijaytdh - I asked the Windows team to help with this ticket

@narrieta

@vijaytdh I talked to the Windows team and, unfortunately, there is no workaround on the Agent side. The service side is rolling back the changes and that should alleviate the issue both on Linux and Windows. If you are looking for a temporary workaround, removing deployAfterExtensions should work. Note that this would need to be removed only on extensions with no settings.

@vijaytdh

vijaytdh commented Nov 1, 2023

@narrieta thanks for following up on this and for the suggested workaround. It is good to hear that the service side change is being rolled back.

@vijaytdh

vijaytdh commented Nov 1, 2023

@narrieta I checked, and for Windows VMSS I am not even using deployAfterExtensions (I did have this earlier to see if forcing a strict ordering of the extensions would somehow work around the issue). Having said that, I just retried a deployment and it worked! 😃 So I think the rollback may have already happened (well, at least in East US2 and North Europe, the two regions I just tested with).

@rcskosir rcskosir added upstream/microsoft/waiting-on-service-team This label is applicable when waiting on the Microsoft Service Team and removed upstream/microsoft Indicates that there's an upstream issue blocking this issue/PR labels Jan 30, 2024
@rcskosir
Contributor

Thanks for taking the time to submit this issue. It looks like this has been resolved as of Azure/WALinuxAgent#2957 on the Azure side. As such, I am going to mark this issue as closed. If that is not the case, please provide additional information including the version in which you are still experiencing this issue, thanks!

@Laffs2k5
Author

For anyone stumbling over this issue: it's correct that a fix has been merged as stated. But at the time of writing (3 months after the fix was merged), an updated version of WALinuxAgent has yet to be deployed. My understanding is that the fix is part of the 2.10.0.7 release. Rollout to the various regions can be tracked on the WALinuxAgent releases page.

@alexpilon666

@Laffs2k5 even though the page says that there hasn't been a new version released, I can confirm that the issue has been fixed. I had been in contact with Microsoft on my end and did some testing after the technical support rep confirmed the fix had been rolled out, and it's been working great ever since.

No idea why the new version doesn't show in the releases though.

I'm going to lock this issue because it has been closed for 30 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Apr 25, 2024