
platforms: investigate support for Azure Stack Hub (azurestack) #476

Closed
cfBrianMiller opened this issue May 11, 2020 · 20 comments · Fixed by coreos/ignition#1007
Labels
area/platforms jira for syncing to jira

Comments

@cfBrianMiller

Hello,

When trying to boot this image, it fails with the following boot diagnostics.

[ 0.000000] Command line: BOOT_IMAGE=(hd0,gpt1)/ostree/rhcos-acecfdfafb8976a8675311239f88ec5442a47472a131f1d0c9113e11a8d2ac13/vmlinuz-4.18.0-147.5.1.el8_1.x86_64 ignition.firstboot rd.neednet=1 ip=dhcp,dhcp6 rhcos.root=crypt_rootfs console=tty0 console=ttyS0,115200n8 rd.luks.options=discard ostree=/ostree/boot.1/rhcos/acecfdfafb8976a8675311239f88ec5442a47472a131f1d0c9113e11a8d2ac13/0 ignition.platform.id=azure
...
[ 23.398199] ignition-setup[627]: File /usr/lib/ignition/platform/azure/base.ign does not exist.. Skipping copy
...
[ 29.878870] systemd[1]: ignition-fetch.service: Main process exited, code=exited, status=1/FAILURE
[ 29.878901] ignition[670]: drive status: OK
[ 29.878918] systemd[1]: ignition-fetch.service: Failed with result 'exit-code'.
[ 29.878957] systemd[1]: Failed to start Ignition (fetch).
[ 29.878986] ignition[670]: op(1): [started] mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure771149153"
[ 29.879035] systemd[1]: Dependency failed for Ignition Complete.
[ 29.879166] ignition[670]: op(1): [failed] mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure771149153": invalid argument
[ 29.879190] systemd[1]: Dependency failed for Initrd Default Target.
[ 29.879226] ignition[670]: failed to fetch config: failed to mount device "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure771149153": invalid argument
[ 29.879246] systemd[1]: initrd.target: Job initrd.target/start failed with result 'dependency'.
[ 29.879269] ignition[670]: failed to acquire config: failed to mount device "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure771149153": invalid argument
[ 29.879285] systemd[1]: initrd.target: Triggering OnFailure= dependencies.
[ 29.879321] ignition[670]: Ignition failed: failed to mount device "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure771149153": invalid argument

I am deploying to azure stack with this custom data string, (with a valid URL)

eyJpZ25pdGlvbiI6eyJ2ZXJzaW9uIjoiMi4yLjAiLCJjb25maWciOnsicmVwbGFjZSI6eyJzb3VyY2UiOiI8dmFsaWRfdXJsPiJ9fX19Cg==

I am unable to exec into the box to determine the exact error; however, this is what Microsoft support believes the issue is after troubleshooting CoreOS issues.

It seems the rhcos deployment vhd has Provisioning.DecodeCustomData set to n for Azure Stack. This property needs to be set to y during image preparation.

I am capable of testing fixes for this problem against an up-to-date Azure Stack.

Thank you.

@lucab
Contributor

lucab commented May 11, 2020

Thanks for the report.

A bunch of things to unpack here:

  • you seem to be experiencing an issue with RHCOS. Contrary to Fedora CoreOS, RHCOS only exists (as a component) in the context of OpenShift (as a product), so bugs and support are handled the traditional way via Bugzilla. Feel free to report this there.
  • the image you are using is for Azure proper. I don't personally know the details of Azure Stack, but if it uses a different userdata channel then the boot failure is legit, and Ignition/OS changes may be required in order to support that.
  • the underlying issue seems to be that the usual Azure "Virtual CD" is not available on the node. Does Provisioning.DecodeCustomData control that, or how is it related here?

@cgwalters
Member

As far as I know we haven't done any investigation of Azure Stack; we have #148 which is for the "main" Azure but we should probably break out a separate tracker for Azure Stack.

@darkmuggle darkmuggle self-assigned this Jun 9, 2020
@darkmuggle
Contributor

It seems the rhcos deployment vhd has Provisioning.DecodeCustomData set to n for Azure Stack. This property needs to be set to y during image preparation.

@cfBrianMiller per https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-linux?view=azs-2002#step-2-reference-cloud-inittxt-during-the-linux-vm-deployment, it would appear that this is a question of how you are deploying the image. I tested the 4.3 and 4.5 images on Azure proper and they worked.

To @lucab:

the underlying issue seems to be that the usual Azure "Virtual CD" is not available on the node. Does Provisioning.DecodeCustomData control that, or how is it related here?

Per https://docs.microsoft.com/en-us/azure/virtual-machines/custom-data:

On Linux OS's, custom data is passed to the VM via the ovf-env.xml file, which is copied to the /var/lib/waagent directory during provisioning. Newer versions of the Microsoft Azure Linux Agent will also copy the base64-encoded data to /var/lib/waagent/CustomData as well for convenience.

Provisioning.DecodeCustomData is an instruction to WALinuxAgent (https://github.com/Azure/WALinuxAgent#provisioningdecodecustomdata; see https://github.com/Azure/WALinuxAgent/blob/11d0881cd01e1bc5ff4f918c33701b60274c6e40/bin/waagent2.0#L4579-L4587) and is not relevant here. Ignition parses CustomData for Ignition data only and has no understanding of the Provisioning values.

I concur with @lucab that the Virtual CD is not being found and hence provisioning fails. See https://github.com/coreos/ignition/blob/master/internal/providers/azure/azure.go#L70, where it's looking for /dev/disk/by-id/ata-Virtual_CD (defined at https://github.com/coreos/ignition/blob/master/internal/providers/azure/azure.go#L37).

@cfBrianMiller if you have indeed deployed the VM with custom data properly, we would need at the very least:

  • console logs (journalctl --system)
  • /run/ignition*
  • find /dev/disk

On Azure proper I see the Virtual CD-ROM come up in the console logs:

Jun 09 17:16:01 localhost kernel: sd 3:0:1:0: [sdb] Attached SCSI disk
Jun 09 17:16:02 localhost kernel: ata2.00: ATAPI: Virtual CD, , max MWDMA2
Jun 09 17:16:02 localhost kernel: scsi 1:0:0:0: CD-ROM            Msft     Virtual CD/ROM   1.0  PQ: 0 ANSI: 5
Jun 09 17:16:02 localhost kernel: scsi 1:0:0:0: Attached scsi generic sg2 type 5
Jun 09 17:16:02 localhost kernel:  sda: sda1 sda2 sda3 sda4
Jun 09 17:16:02 localhost kernel: sd 2:0:0:0: [sda] Attached SCSI disk
Jun 09 17:16:02 localhost kernel: sr 1:0:0:0: [sr0] scsi3-mmc drive: 0x/0x tray
Jun 09 17:16:02 localhost kernel: cdrom: Uniform CD-ROM driver Revision: 3.20
Jun 09 17:16:02 localhost kernel: sr 1:0:0:0: Attached scsi CD-ROM sr

If no Virtual CD is showing up, then the question for Azure Stack is how we access the custom data and, more importantly, where that is documented. I did a deep dive into the documentation and the WALinuxAgent code, and from what I was able to glean the device should be there.

@darkmuggle darkmuggle added the jira for syncing to jira label Jun 9, 2020
@darkmuggle
Contributor

darkmuggle commented Jun 9, 2020

From the Azure Documentation, an image prepared for Azure proper should work. Microsoft, in an email thread, had indicated that Azure proper images should work on Azure Stack:
https://docs.microsoft.com/en-us/azure/virtual-machines/linux/create-upload-generic

@darkmuggle
Contributor

darkmuggle commented Jun 10, 2020

Over at https://bugzilla.redhat.com/attachment.cgi?id=1696551 a console log was provided that gave a whole lot more information:

[   14.795315] UDF-fs: warning (device sr0): udf_load_vrs: No VRS found
[   14.821361] UDF-fs: Scanning with blocksize 2048 failed
[   14.845200] UDF-fs: warning (device sr0): udf_load_vrs: No VRS found
[   14.870806] UDF-fs: Scanning with blocksize 4096 failed
[   14.893620] ignition[813]: op(1): [failed]   mounting "/dev/disk/by-id/ata-Virtual_CD" at "/tmp/ignition-azure549490584": invalid argument

These error messages are NOT found on Azure proper.

Based on the UDF source [1] the kernel is NOT locating the UDF VRS (Volume Recognition Sequence) and so the mount is returning EINVAL. In other words, the kernel is saying that Ignition asked for a UDF mount but whatever is on /dev/sr0 is not a UDF volume.

There are three potential cases:

  • the UDF volume for the custom data is corrupt (that would be a bug with AzureStack)
  • there is a bug in the UDF source (kernel bug)
  • the CDROM is not UDF and is something else (per [2] it should be UDF)

Can you attach a copy of the UDF volume?

Looking at how WALinuxAgent does the mount [3], it does a blind mount without specifying the filesystem; Ignition is more precise [4].

[1] https://git.kernel.org/pub/scm/linux/kernel/git/jwboyer/fedora.git/tree/fs/udf/super.c#n1970
[2] https://docs.microsoft.com/en-us/azure-stack/operator/azure-stack-redhat-create-upload-vhd?view=azs-2002
[3] https://github.com/Azure/WALinuxAgent/blob/develop/bin/waagent2.0#L581-L582
[4] https://github.com/coreos/ignition/blob/master/internal/providers/azure/azure.go#L69-L74

@darkmuggle
Contributor

We have confirmation from Microsoft that the UDF volume is, in fact, not a UDF volume: it's a generic ISO 9660. I have a draft fix proposed in Ignition that should allow Ignition to work on either Azure or Azure Stack.

@dustymabe
Member

Any chance they'll fix the documentation now?

@darkmuggle
Contributor

Any chance they'll fix the documentation now?

We can ask.

@darkmuggle
Contributor

A completely different issue in Afterburn has come up:

s)...[   32.768570] NetworkManager[568]: <info>  [1593115470.6525] dhcp4 (eth0): option private_245          => 'a8:3f:81:10'

And then:

[   64.908820] afterburn[658]: Jun 25 19:57:52.985 WARN Failed to get fabric address from DHCP: maximum number of retries (60) reached
[   64.988395] afterburn[658]: Jun 25 19:57:52.986 INFO Using fallback address
[   65.033307] afterburn[658]: Jun 25 19:57:52.986 INFO Fetching http://168.63.129.16/?comp=versions: Attempt #1
^M[     *] A start job is running for Afterburn Hostname (52s / no limit)
[   65.566088] afterburn[658]: Jun 25 19:57:53.643 INFO Fetch successful
[   65.621959] afterburn[658]: Jun 25 19:57:53.643 INFO Fetching http://168.63.129.16/machine/?comp=goalstate: Attempt #1
[   65.698749] afterburn[658]: Jun 25 19:57:53.651 INFO Fetch successful
[   65.747770] afterburn[658]: Jun 25 19:57:53.659 INFO Fetching http://169.254.169.254/metadata/instance/compute/name?api-version=2017-08-01&format=text: Attempt #1
[   65.942651] afterburn[658]: Jun 25 19:57:53.674 INFO Failed to fetch: 500 Internal Server Error

And ending with:

Displaying logs from failed units: afterburn-hostname.service
-- Logs begin at Thu 2020-06-25 20:04:16 UTC, end at Thu 2020-06-25 20:05:59 UTC. --
Jun 25 20:05:51 afterburn[655]: Jun 25 20:05:51.338 INFO Failed to fetch: 500 Internal Server Error
Jun 25 20:05:51 afterburn[655]: Error: failed to run
Jun 25 20:05:51 afterburn[655]: Caused by: writing hostname
Jun 25 20:05:51 afterburn[655]: Caused by: failed to get hostname
Jun 25 20:05:51 afterburn[655]: Caused by: maximum number of retries (10) reached
Jun 25 20:05:51 afterburn[655]: Caused by: failed to fetch: 500 Internal Server Error
Jun 25 20:05:51 systemd[1]: afterburn-hostname.service: Main process exited, code=exited, status=1/FAILURE
Jun 25 20:05:51 systemd[1]: afterburn-hostname.service: Failed with result 'exit-code'.
Jun 25 20:05:51 systemd[1]: Failed to start Afterburn Hostname.

The Ignition issue and now the Afterburn issue are two distinct differences, which raises the question of what other differences exist. In my opinion, we should reconsider whether Azure and AzureStack should be treated as the same platform.

@darkmuggle darkmuggle changed the title Boot fails on Azure Stack Ignition fails to provision on Azure Stack Jun 25, 2020
@dustymabe dustymabe added the status/pending-upstream-release Fixed upstream. Waiting on an upstream component source code release. label Jun 26, 2020
@dustymabe
Member

@darkmuggle - maybe a new issue for the afterburn bits? or if you want to go wide - an issue to discuss our approach to Azure vs AzureStack

@cfBrianMiller
Author

The problem child is definitely this line:

[   65.747770] afterburn[658]: Jun 25 19:57:53.659 INFO Fetching http://169.254.169.254/metadata/instance/compute/name?api-version=2017-08-01&format=text: Attempt #1

Azure Stack has different API versions; for compute it is 2017-12-01.
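In other words, the IMDS request Afterburn builds is correct except for the version query parameter; a tiny illustrative helper (not Afterburn's actual code, which is Rust; the 2017-12-01 value for Azure Stack is from the comment above):

```go
package main

import (
	"fmt"
	"net/url"
)

// imdsComputeURL builds the Azure IMDS compute-name request with the
// api-version as a parameter.
func imdsComputeURL(apiVersion string) string {
	q := url.Values{}
	q.Set("api-version", apiVersion)
	q.Set("format", "text")
	return "http://169.254.169.254/metadata/instance/compute/name?" + q.Encode()
}

func main() {
	fmt.Println(imdsComputeURL("2017-08-01")) // what the failing log shows
	fmt.Println(imdsComputeURL("2017-12-01")) // what Azure Stack's IMDS supports
}
```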

@lucab lucab changed the title Ignition fails to provision on Azure Stack platforms: investigate support for Azure Stack Jun 29, 2020
@lucab
Contributor

lucab commented Jun 29, 2020

Things we have discovered so far on Azure Stack:

Things still to discover:

  • where instances get their hostname from
  • whether SSH pubkeys works the same way as Azure
  • whether boot check-in works the same way as Azure

@lucab lucab reopened this Jun 29, 2020
@bgilbert bgilbert removed the status/pending-upstream-release Fixed upstream. Waiting on an upstream component source code release. label Jun 29, 2020
@jlebon
Member

jlebon commented Jun 29, 2020

Should coreos/ignition#1007 be reverted for the time being?

@bgilbert
Contributor

I'd say so, yes.

@darkmuggle
Contributor

darkmuggle commented Jun 29, 2020

Should coreos/ignition#1007 be reverted for the time being?

Is there any harm in leaving the code? We know the code works. And having this code will make the enablement easier.

Conceivably, enabling AzureStack as a separate platform from the Ignition side would look something akin to:

diff --git a/internal/platform/platform.go b/internal/platform/platform.go
index a5a4844..a9674a4 100644
--- a/internal/platform/platform.go
+++ b/internal/platform/platform.go
@@ -91,6 +91,10 @@ func init() {
                name:  "azure",
                fetch: azure.FetchConfig,
        })
+       configs.Register(Config{
+               name:  "azurestack",
+               fetch: azure.FetchConfig,
+       })
        configs.Register(Config{
                name:  "brightbox",
                fetch: openstack.FetchConfig,

The difficulties with AzureStack in Afterburn may be handled differently.

Also, the code now checks whether the volume is either UDF or ISO9660 before blindly attempting to mount it as a UDF volume.

@bgilbert
Contributor

bgilbert commented Jun 29, 2020

Is there any harm in leaving the code? We know the code works. And having this code will make the enablement easier.

If we release the code and later roll it back, we'll be making shipped code stricter, which in principle could break someone.

Why not just go ahead and add the separate platform ID to Ignition now? It should be straightforward to add a wrapper which enables ISO9660 only on Azure Stack.

@cgwalters
Member

Also coreos/coreos-assembler#1566

darkmuggle pushed a commit to coreos/ignition that referenced this issue Jun 29, 2020
This moves the iso9660 support out of Azure's provider and introduces a
new provider for AzureStack. Per [1], we found that AzureStack should be
treated as its own platform. The Azure provider now is configurable for
the filesystem types.

[1] coreos/fedora-coreos-tracker#476

Signed-off-by: Ben Howard <ben.howard@redhat.com>
@darkmuggle
Contributor

AzureStack is now a distinct platform for both Ignition and COSA. The word from back-channels is that we might have what we need toward August/September on the Afterburn side.

@lucab
Contributor

lucab commented Jul 10, 2020

"Azure Stack" is a whole product family, which spans a few verticals. My understanding is that here we are targeting "Azure Stack Hub" only for its computing on-demand capabilities. Re-titled accordingly.

@lucab lucab changed the title platforms: investigate support for Azure Stack platforms: investigate support for Azure Stack Hub (azurestack) Jul 10, 2020
@jlebon
Member

jlebon commented Apr 9, 2021

This is done now in coreos/afterburn#561 which is part of the v5.0.0 release.
