aarch64: aws: some instance types fail in UEFI #920
Comments
FWIW, I was able to successfully launch an instance of Fedora Cloud 34 on |
For RHCOS instances we have been using m6g instances and they have been fine |
Just booted RHCOS on
What is useful to know from this datapoint is that the FCOS and RHCOS grub configuration files are the exact same so at least we should be able to rule out some |
ok rebasing from RHCOS to FCOS seems to work so the Fedora kernel itself should be fine? |
That likely indicates that fcos is not setting up /boot correctly |
After some investigation I think what I'm seeing is that if I add |
Without further investigation (pending me working on it today) here are some thoughts:
|
Summary:
|
Why are you using DTB? You should be using ACPI. I suspect this is the underlying problem. The serial console on an ACPI system is provided by the ACPI SPCR entry and on a SBSA compliant system is generally not ttyS0. |
FTR I see the same (or similar)
|
FYI This line is the console provided by the ACPI tables. |
Does that mean it is using ACPI even though I'm trying to understand where we're going wrong here. If I try to force acpi on FCOS with
|
TBH it's hard to tell from the information provided. Would likely need to login to a system to see |
On first glance, if I'm not mistaken, issue is only showing up on the |
Correct
on the instance types it doesn't boot on, it never boots. It's consistent.
Sounds right, though I've only tested on |
This is a workaround to get console=ttyS0,115200n8 into the aarch64 AWS image. It does so by applying the following patch to gf-platformid:
```diff
diff --git a/usr/lib/coreos-assembler/gf-platformid b/usr/lib/coreos-assembler/gf-platformid
index 2912b322c..36d089651 100755
--- a/usr/lib/coreos-assembler/gf-platformid
+++ b/usr/lib/coreos-assembler/gf-platformid
@@ -46,7 +46,11 @@ blscfg_path=$(coreos_gf glob-expand /boot/loader/entries/ostree-*.conf)
 coreos_gf download "${blscfg_path}" "${tmpd}"/bls.conf
 # Remove any platformid currently there
 sed -i -e 's, ignition.platform.id=[a-zA-Z0-9]*,,g' "${tmpd}"/bls.conf
-sed -i -e 's,^\(options .*\),\1 ignition.platform.id='"${platformid}"',' "${tmpd}"/bls.conf
+if [ "${platformid}" == 'aws' ]; then
+    sed -i -e 's|^\(options .*\)|\1 ignition.platform.id='"${platformid}"' console=ttyS0,115200n8|' "${tmpd}"/bls.conf
+else
+    sed -i -e 's,^\(options .*\),\1 ignition.platform.id='"${platformid}"',' "${tmpd}"/bls.conf
+fi
 coreos_gf upload "${tmpd}"/bls.conf "${blscfg_path}"
 if [ "$basearch" = "s390x" ] ; then
```
Once coreos/fedora-coreos-config#1181 and coreos/coreos-assembler#2400 land then we won't need this any longer. This implements a fix for coreos/fedora-coreos-tracker#920
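For anyone who wants to see what that patch does to a boot entry without running it in a build, here is a minimal Python sketch of the same two substitutions. It is only a model of the sed logic quoted above: the function name `set_platform_kargs` and the sample BLS text are made up for illustration and are not part of coreos-assembler.

```python
import re


def set_platform_kargs(bls_text: str, platformid: str) -> str:
    """Model the two sed steps from the patch on the text of a BLS entry."""
    # Step 1: drop any platform ID that is already present
    # (same pattern as the first sed expression in the patch).
    text = re.sub(r" ignition\.platform\.id=[a-zA-Z0-9]*", "", bls_text)

    # Step 2: append the platform ID to the "options" line.
    # For the aws platform the patch also appends the serial console argument.
    extra = f" ignition.platform.id={platformid}"
    if platformid == "aws":
        extra += " console=ttyS0,115200n8"
    return re.sub(
        r"^(options .*)$",
        lambda m: m.group(1) + extra,
        text,
        flags=re.MULTILINE,
    )


# Hypothetical BLS entry, for illustration only.
sample = "title Example entry\noptions root=UUID=abc ignition.platform.id=qemu rw\n"
print(set_platform_kargs(sample, "aws"))
```

Note that the condition is on the platform ID alone, so as written the console argument is appended for every aws image the script processes, not only aarch64 ones.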
Now that we've hacked in console=ttyS0 (see coreos/fedora-coreos-tracker#920), we can start to upload/test there again.
Now that we've got console=ttyS0 in the aarch64 images they can boot on all aarch64 instance types (see [1]). The c6g.xlarge is not a bare metal instance type and thus will boot much faster, so let's go with that. [1] coreos/fedora-coreos-tracker#920
Still don't know the real root cause, but I applied a hack/fix in coreos/fedora-coreos-pipeline@ddd9da9 to get The more appropriate long term fix will land in coreos/coreos-assembler#2400 and coreos/fedora-coreos-config#1181 |
The fix for this went into stable stream release |
It looks right in the SPCR table. Disassembling the SPCR table with iasl gives the following:
/*
[000h 0000 4] Signature : "SPCR" [Serial Port Console Redirection Table]
[024h 0036 1] Interface Type : 00
[028h 0040 12] Serial Port Register : [Generic Address Structure]
[034h 0052 1] Interrupt Type : 08
Raw Table Data: Length 80 (0x50)
|
/sys/firmware/fdt has no extra info for the device tree, except the bootargs, which are built from the command line.
fdtdump /sys/firmware/fdt
**** fdtdump is a low-level debugging tool, not meant for general use.
/dts-v1/;
/ { |
So I think that even without specifying console=ttyS0, the Linux kernel should be able to use the default platform console, per the code path in the Linux tree. So it is a bug that the kernel cannot detect the default platform console unless one is specified explicitly. |
Thanks @pfliu. Any idea how we can track down the commit that introduced that bug? Or maybe you can help us report it upstream? Can you confirm you still see the same problem with the latest Fedora kernels? |
I found some extra things when removing console=ttyS0.
|
I will try with the latest upstream kernel, then I can report the result to the upstream. |
It turns out that in the Linux kernel, the function univ8250_console_match() cannot match the driver with the device reported by the ACPI SPCR table. The mismatch happens on the following statements:
Here @addr is the address reported in SPCR (0x90a0000), while port->mapbase == 0x8011c000. Consequently, the PCI serial driver fails to register the serial port as the default console. I think this should be an AWS platform bug.
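To make the mismatch concrete, here is a small Python model of the comparison as described in this comment. It is a simplified sketch, not the kernel's code: the `Port` class and `match_console` function are hypothetical names, and the port name `ttyS0` is assumed from the workaround used earlier in the thread. Only the two addresses come from the report.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Port:
    name: str
    mapbase: int  # MMIO base address the kernel currently has for the UART


def match_console(spcr_addr: int, ports: List[Port]) -> Optional[Port]:
    """Return the first port whose mapbase equals the SPCR address, else None."""
    for port in ports:
        if port.mapbase == spcr_addr:
            return port
    return None


SPCR_ADDR = 0x90A0000                 # address reported by the SPCR table
ports = [Port("ttyS0", 0x8011C000)]   # address the kernel ended up with

# No port has a mapbase equal to the SPCR address, so nothing matches
# and no default console gets registered from the SPCR entry.
print(match_console(SPCR_ADDR, ports))
```

In this model the lookup only succeeds if the address the kernel holds for the UART is the one the firmware advertised, which is exactly what does not hold on the failing instance types.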
First, a note about the "EFI stub: ERROR: FIRMWARE BUG: kernel image not aligned on 64k boundary" error: this is a bootloader bug (old grub2?). It's been fixed in Fedora/RH afaik.
About the "mapbase" mismatch: that could indicate that the kernel is incorrectly remapping the PCI device containing the UART, thus causing the mismatch. The "PCI Boot Configuration" _DSM function in ACPI should prevent it from doing so.
Can you tell me more about the instance type you used? I'll see if I can track this down.
So.... For (bad) historical reasons, the arm64 kernel will re-assign all PCI(e) devices at boot, it will not try to honor the existing firmware assignments at all, unless an ACPI _DSM entry tells it to. This behaviour is different from x86. I remember some debates about this ages ago, and I think the excuse for not fixing that was the existence of a couple of platforms with broken UEFI firmwares... So the kernel reassigns everything, and thus the address no longer matches. It's fundamentally a kernel issue. I'll work with EC2 to see if we can get the _DSM firmware entry changed, but I have suspicions that this is too disruptive to do on existing instance types since it will completely change their PCI resources layout. We do have it on metal, that I know. So I think we need to look at a more involved kernel fix. This is fundamentally a Linux issue. It should probably keep track of the console address in a form more appropriate than an ASCII string ... and have some kind of hook into the PCI code to "match" it with a discovered device and then adjust the address as the device gets remapped. |
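As a rough illustration of the "adjust the address as the device gets remapped" idea, here is a Python sketch of translating a firmware-assigned console address through a BAR move. This is not a proposed kernel patch: `BarMove` and `translate` are hypothetical names, and the example assumes the two addresses quoted in this thread are the old and new base of the same 4k BAR.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class BarMove:
    old_base: int  # base address assigned by firmware
    new_base: int  # base address after the kernel reassigned the BAR
    size: int      # BAR size in bytes


def translate(addr: int, moves: List[BarMove]) -> int:
    """Map a firmware-era address to its post-reassignment address."""
    for move in moves:
        if move.old_base <= addr < move.old_base + move.size:
            # Keep the offset within the BAR, swap the base.
            return move.new_base + (addr - move.old_base)
    # Address was not inside any moved BAR: leave it unchanged.
    return addr


# Assumed example: the UART BAR moved from the SPCR address to the
# address Linux assigned (values from this thread, 4k BAR assumed).
moves = [BarMove(old_base=0x90A0000, new_base=0x8011C000, size=0x1000)]
print(hex(translate(0x90A0000, moves)))
```

With a translation like this applied to the console address, a later comparison against the port's current base would succeed even after the BAR has moved.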
@ozbenh, I appreciate you looking into this issue. I used c6g.xlarge. I have not seen any connection between the two addresses 0x90a0000 and 0x8011c000. I thought at least both of them would have 0xc000 as the offset inside a PCI BAR. According to 1, The base address of the Serial Port register set is |
Hi @ozbenh 👋
In Fedora CoreOS here we're just using the latest GRUB2 in Fedora. Do you think the bug resurfaced? |
From which Fedora version ? This patch in Fedora's grub2 should fix it |
They don't need to be related. One is what UEFI assigned at boot, the other is what Linux assigned. The BAR is only 4k in size so there isn't even a 0xc000 offset there, that's just where it happened to be allocated.
Yeah though Linux can't really deal with these I think in any saner way either :-) I've started a conversation in the linux-pci and linux-arm-kernel mailing lists about this "arm64 PCI resource allocation issue", let's see where that goes. |
Sorry for the delay in getting back to you on this.
I think you're referring to https://src.fedoraproject.org/rpms/grub2/blob/rawhide/f/0190-arm64-Fix-EFI-loader-kernel-image-allocation.patch Yeah I'm looking at some console output from tests that ran today and I don't see This is kind of an old issue so some of the logs are from systems before the patch you mentioned was implemented. |
Right, I wrote that patch for Amazon Linux initially and sent it to the various distros, it took a while for it to percolate. Sadly that code is still hacked to death by the original shim support path so the patch can't really be upstreamed (upstream loader just uses UEFI load image). |
For some reason some instance types fail to launch. a1.metal seems fine. c6g.xlarge and a1.2xlarge fail with: