WALinuxAgent doesn't download all .crt and .prv files from KeyVault #2750
Comments
For the example in the log you posted, at what time did you push the certificate update? The agent log does not show any operation updating certificates around 4:30. Each operation that updates certificates produces a new sequential ID that we call "incarnation". The certificate update closest to 4:30 was incarnation 63, at 1:36.
After that, there are no further operations updating certificates. The certificate is downloaded again after you restart the agent because, on service restart, the agent repeats the last operation. In this case, it repeated incarnation 63 and downloaded the certificate again.
Hello @narrieta, thanks for the explanation regarding incarnations; I didn't know that. We push the certificates about 10 minutes before the Ansible task that copies them. As you can see, incarnation 63 points to file F6E6FAACxxxxx and not to the missing one, F66D9xxxx, which means that file was not downloaded. We can't understand why the secrets are not downloaded during the push, but only after we restart the service. It seems that the push is not forcing the download of the secrets. Let me try to add more information here.
@CMalaquias17 I posted the wrong cert in my previous reply. F66D9 is also being downloaded as part of incarnation 63.
I see that Custom Script ran at this time.
This operation is coming via "FastTrack", which is a recent optimization to make extensions execute faster. FastTrack operations won't download the Key Vault certificates. As a workaround, you can consider copying the certificates originally downloaded instead of moving them to a different location.
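A minimal Python sketch of the copy-instead-of-move workaround, for illustration only (the deployment in this issue uses Ansible). It assumes the Linux agent's file naming, <UppercaseThumbprint>.crt and <UppercaseThumbprint>.prv under /var/lib/waagent; the function name, the thumbprint list, and the destination directory are hypothetical. The point is that the originals stay in place, so a later FastTrack operation, which does not re-download certificates, still finds them.

```python
import shutil
from pathlib import Path

# Assumed location where the Linux agent writes Key Vault certificates.
WAAGENT_DIR = Path("/var/lib/waagent")


def copy_certificates(thumbprints, dest_dir, src_dir=WAAGENT_DIR):
    """Copy <THUMBPRINT>.crt and <THUMBPRINT>.prv files into dest_dir.

    Hypothetical helper: the files in src_dir are never moved or deleted.
    Raises FileNotFoundError if an expected source file is missing.
    Returns the list of destination paths that were written.
    """
    src_dir = Path(src_dir)
    dest_dir = Path(dest_dir)
    dest_dir.mkdir(parents=True, exist_ok=True)
    copied = []
    for thumbprint in thumbprints:
        for suffix in (".crt", ".prv"):
            src = src_dir / f"{thumbprint.upper()}{suffix}"
            # copy2 also preserves permissions (the .prv key is root-only).
            dst = Path(shutil.copy2(src, dest_dir / src.name))
            copied.append(dst)
    return copied
```

If the playbook later needs to clean up, deleting only the copies in the destination directory keeps /var/lib/waagent intact for the next deployment.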
@narrieta we are not moving the certificates; we copy some of them and then delete only a few of them. Regarding FastTrack, which I was not aware of: when did this optimization start being used? I mean the date and time. This could explain why the ARM template and script worked for a very long time and started to fail a few months back. Is it possible to know when FastTrack started being used? Is FastTrack something that we need to activate, or is it activated in the background without any action needed? Another thing, if I may ask (and sorry for the very long questions): when you say FastTrack operations "won't download the keyvault certificates", does it mean that if we use that part in the ARM template, it will not download the certificates at that time? Thank you.
Ok, then you may be deleting the cert that the custom script needs. The F6xxx cert was downloaded at 2023-01-31T01:36:19.635993Z. FastTrack was enabled over several months starting from late 2022. No action is needed from users. The Key Vault certificates will be downloaded, although not on every single operation (if the operation uses FastTrack, they won't be downloaded). In your case, the certificates were downloaded on incarnation 63, which does not use FastTrack.
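The behaviour described in the replies so far can be summarised with a toy model. This is a hypothetical simplification written for this explanation, not WALinuxAgent code: `Operation`, `AgentModel`, and the certificate names are invented, and the model only encodes the three rules stated in the thread (regular operations download certificates, FastTrack operations do not, and a restart repeats the last regular operation, as incarnation 63 was repeated here).

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class Operation:
    """Hypothetical, simplified view of one operation sent to the agent."""
    name: str
    fast_track: bool      # True if the operation arrives via FastTrack
    certificates: tuple   # certificates referenced by the VM's secrets


@dataclass
class AgentModel:
    """Toy model of the behaviour described in this thread (not agent code)."""
    files: set = field(default_factory=set)   # certificates currently on disk
    last_regular: Optional[Operation] = None  # last non-FastTrack operation

    def process(self, op):
        # Only regular (non-FastTrack) operations download certificates.
        if not op.fast_track:
            self.last_regular = op
            self.files.update(op.certificates)

    def restart(self):
        # On service restart the last regular operation is repeated,
        # so its certificates are downloaded again.
        if self.last_regular is not None:
            self.files.update(self.last_regular.certificates)
```

With this model, deleting a certificate file after incarnation 63 and then running Custom Script over FastTrack leaves the file missing until the agent restarts, which matches the symptom reported in the issue.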
@narrieta I think you are right. The issue started at the end of November in only one region, and since late December it has spread to more than one region. We have been doing this for a very long time with the same lines of code, so FastTrack may be the explanation. Is there a way to avoid FastTrack? Is there something we can do, like changing the ARM templates or the code, to force the downloads? Thank you.
There are some operations that never use FastTrack and force a re-download of the certificates: adding a tag (az vm update --set tags.Tag1=Value1), re-applying (az vm reapply), etc. If not deleting the certificates is an option, I would recommend that. We can change the agent to download certificates when using FastTrack too, but we don't have another release coming in the next few months. As for rotating the certificates, what do you currently do? Any changes in the ARM template involving certificates won't use FastTrack. In your case, running Custom Script used FastTrack because the operation is not related to certificates.
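The two commands mentioned above can be scripted, for example from a deployment pipeline. The sketch below only builds and runs the single-VM Azure CLI commands quoted in the reply; the resource group and VM names, the tag name and value, and the helper names are placeholders, and scale-set equivalents are not covered.

```python
import subprocess


def refresh_commands(resource_group, vm_name, tag="Tag1", value="Value1"):
    """Return the az commands that, per this thread, never use FastTrack.

    Hypothetical helper. Either command makes the agent process a regular
    operation, which re-downloads the Key Vault certificates.
    """
    base = ["--resource-group", resource_group, "--name", vm_name]
    return [
        # Adding or changing a tag (generic --set update argument).
        ["az", "vm", "update", *base, "--set", f"tags.{tag}={value}"],
        # Re-applying the VM's current state.
        ["az", "vm", "reapply", *base],
    ]


def run(cmd):
    # check=True raises CalledProcessError if the az command fails.
    subprocess.run(cmd, check=True)
```

For example, `run(refresh_commands("my-rg", "Region1VM")[1])` would re-apply a VM named Region1VM in the placeholder resource group my-rg.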
@narrieta by rotating I mean changing the key of one certificate or adding a new one to the KV. How does the agent know that it has more certificates to download? So, does it mean that here (image below) we are not downloading anything?
@CMalaquias17 Yes, those are being downloaded. You can check incarnation 63.
So sometimes, when running the ARM template, it will download the certificates and sometimes it will not?
Fixed by #2761
Not sure if this is a bug, but I will try to explain as much as I can.
Environment: Virtual Machine Scale Set - Azure (West EU, West US, Japan, Asia, ...; all regions impacted)
We deploy VMSS in Azure using an ARM template and an Ansible playbook with some configurations.
Before running Ansible, we use the following to push certificates from Key Vault:
TEMPLATE
{
  "type": "Microsoft.Compute/virtualMachines",
  "name": "Region1VM",
  ...
  "properties": {
    ...
    "osProfile": {
      "computerName": "Region1VM",
      ...
      "secrets": [
        {
          "sourceVault": {
            "id": "[resourceId('Microsoft.KeyVault/vaults', 'Region1KeyVault')]"
          },
          "vaultCertificates": [
            {
              "certificateUrl": "[reference(resourceId('Microsoft.KeyVault/vaults/secrets', 'Region1KeyVault', 'SampleCertificateAsSecret')).secretUriWithVersion]",
              "certificateStore": "My"
            }
          ]
        }
      ]
    },
    ...
  }
}
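For Linux VMs, Azure's `vaultCertificates` documentation says `certificateStore` applies only to Windows; on Linux the certificate is written to /var/lib/waagent as <UppercaseThumbprint>.crt and the private key as <UppercaseThumbprint>.prv, which matches the .crt/.prv files discussed in this issue. When several certificates are pushed, the `secrets` array can be generated rather than hand-written. A sketch in Python, assuming the same `reference(...).secretUriWithVersion` expression as the template above; `vault_secrets` is a hypothetical helper, not part of any Azure SDK.

```python
import json


def vault_secrets(vault_name, secret_names, certificate_store=None):
    """Build an osProfile.secrets array for one Key Vault (sketch).

    Each entry mirrors the vaultCertificates item in the template above.
    certificateStore is only added when a store name is passed, since it
    is meaningful for Windows VMs only.
    """
    certs = []
    for name in secret_names:
        entry = {
            "certificateUrl": (
                "[reference(resourceId('Microsoft.KeyVault/vaults/secrets', "
                f"'{vault_name}', '{name}')).secretUriWithVersion]"
            )
        }
        if certificate_store:
            entry["certificateStore"] = certificate_store
        certs.append(entry)
    return [{
        "sourceVault": {
            "id": f"[resourceId('Microsoft.KeyVault/vaults', '{vault_name}')]"
        },
        "vaultCertificates": certs,
    }]


if __name__ == "__main__":
    # Print the fragment to paste into osProfile.secrets.
    print(json.dumps(vault_secrets("Region1KeyVault", ["SampleCertificateAsSecret"]), indent=2))
```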
CODE LINK:
https://devblogs.microsoft.com/premier-developer/centralized-vm-certificate-deployment-across-multiple-regions-with-arm-templates/#part-2-push-certificate-from-the-regional-key-vault-to-the-virtual-machine
After running this part of the ARM template, we copy some certificates from /var/lib/waagent to another location, but the copy fails with the error below:
"could not find or access '/var/lib/waagent/nameexample.prv'"
The problem is that the missing file should be downloaded during the push of the certificates from Key Vault, but this is not happening and the Ansible playbook crashes.
If we restart the waagent service, the file "nameexample.prv" is downloaded and Ansible no longer crashes.
The final lines of the Ansible code remove this file from the VM again, so the next deployment crashes again.
We have two workarounds:
FIRST - if we restart waagent after the crash, everything runs as expected
SECOND - if we don't delete the file at the end of the Ansible run
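Instead of letting the copy step fail with a generic "could not find or access" error, the playbook could check for the expected files first and then apply the FIRST workaround (restarting the agent) or stop with a precise message. A Python sketch of that check, assuming the <UppercaseThumbprint>.crt/.prv naming under /var/lib/waagent; the thumbprints and helper names are hypothetical. The agent's service name differs between distributions (for example walinuxagent on Ubuntu, waagent on others), so the restart itself is left out.

```python
from pathlib import Path

# Assumed location where the Linux agent writes Key Vault certificates.
WAAGENT_DIR = Path("/var/lib/waagent")


def missing_certificate_files(thumbprints, src_dir=WAAGENT_DIR):
    """Return the expected .crt/.prv paths that do not exist (sketch)."""
    src_dir = Path(src_dir)
    expected = [
        src_dir / f"{t.upper()}{suffix}"
        for t in thumbprints
        for suffix in (".crt", ".prv")
    ]
    return [p for p in expected if not p.exists()]


def ensure_certificates(thumbprints, src_dir=WAAGENT_DIR):
    """Raise a descriptive FileNotFoundError listing every missing file."""
    missing = missing_certificate_files(thumbprints, src_dir)
    if missing:
        names = ", ".join(p.name for p in missing)
        raise FileNotFoundError(f"missing under {src_dir}: {names}")
```

Listing all missing files at once makes it easy to tell whether a whole certificate or only its private key was removed by an earlier cleanup step.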
MAIN PROBLEM - this has been working like this for the last 8 months, but now we are getting these errors.
We don't understand why the agent doesn't download all the files from the KV, and why we always need to restart the service to get a "complete download", so to speak.
Additional context
Here is an agent log from the occurrence of the issue on 31 January at 4:30 PM:
Log file attached
waagent.log