This repository has been archived by the owner on May 30, 2023. It is now read-only.

sys-kernel: enable IOMMU on arm64 #2235

Merged 1 commit on Oct 21, 2022

Conversation

Snaipe
Contributor

@Snaipe Snaipe commented Oct 17, 2022

sys-kernel: enable IOMMU on arm64

On Gigabyte R152-P31 arm64 servers, the Flatcar PXE images hang during the boot process, making them unusable, while Fedora CoreOS images work.

The kernel seems to start correctly, but it invariably ends up printing these messages and hanging:

ata1.00: qc timeout (cmd 0xec)
ahci 000c:01:00.0: AHCI controller unavailable!
pcieport 000c:00:01.0: AER: Uncorrected (Non-Fatal) error received: 000c:00:00.0
ata1.00: failed to IDENTIFY (I/O error, err_mask=0x4)
pcieport 000c:00:01.0: PCIe Bus Error: severity=Uncorrected (Non-Fatal), type=Transaction Layer, (Requester ID)
pcieport 000c:00:01.0:   device [1def:e101] error status/mask=00004000/00400000
pcieport 000c:00:01.0:    [14] CmpltTO                (First)
ahci 000c:01:00.0: AHCI controller unavailable!
ahci 000c:01:00.0: AER: can't recover (no error_detected callback)
pcieport 000c:00:01.0: AER: device recovery failed
pcieport 000c:00:01.0: AER: Multiple Uncorrected (Non-Fatal) error received: 000c:00:00.0

Enabling IOMMU seems to make the problem disappear.
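
For anyone triaging a similar hang from a saved console log, the signature above can be matched mechanically. This is a minimal sketch and not part of this change; the function name `has_ahci_aer_hang` and the choice of the two marker strings are assumptions made for illustration.

```shell
#!/bin/sh
# has_ahci_aer_hang LOGFILE
# Succeeds (exit 0) when LOGFILE contains both markers seen in the hang
# described above: a PCIe completion timeout (CmpltTO) reported through AER,
# and the "AHCI controller unavailable!" message.
# Hypothetical helper for illustration; not part of the Flatcar tooling.
has_ahci_aer_hang() {
    grep -q 'CmpltTO' "$1" && grep -q 'AHCI controller unavailable!' "$1"
}
```

Usage, assuming the console output was saved to `console.log`: `has_ahci_aer_hang console.log && echo "hang signature found"`.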

How to use

Build the kernel and boot it. This might be hard to test without real hardware, but the change did fix the hang on our Gigabyte R152-P31 server.

Testing done

Ran the following commands to build the kernel and initramfs:

$ ./checkout "stable-3227.2.2"
$ ./run_sdk_container -t -C ghcr.io/flatcar/flatcar-sdk-all:3227.0.0
$ emerge-arm64-usr coreos-modules
$ emerge-arm64-usr coreos-kernel 
$ ./build_image --board=arm64-usr --replace
$ ./image_to_vm.sh --from=../build/images/arm64-usr/latest --board=arm64-usr --format pxe
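
Before booting, it can help to confirm that the kernel config actually has IOMMU support turned on. The exact options this PR enables are not listed in this conversation, so the sketch below only matches any option with IOMMU or SMMU in its name; the function name `list_iommu_options` is hypothetical and the config file path is left to the caller.

```shell
#!/bin/sh
# list_iommu_options CONFIG_FILE
# Prints every IOMMU- or SMMU-related option that is built in (=y) or
# modular (=m) in a kernel .config-style file. Options that are
# "# ... is not set" are skipped because they do not start with CONFIG_.
# Hypothetical helper for illustration; the pattern is an assumption and
# is not taken from the diff of this PR.
list_iommu_options() {
    grep -E '^CONFIG_[A-Z0-9_]*(IOMMU|SMMU)[A-Z0-9_]*=[ym]$' "$1"
}
```

An empty result (and a non-zero exit status from `grep`) would suggest that the config being inspected does not enable any IOMMU driver.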

Then, after uploading both vmlinuz and the initramfs to our PXE server, I booted the machine with the following GRUB config:

linux flatcar3227.2.2/aarch64/vmlinuz ip=dhcp ipv6.disable=1 flatcar.first_boot=1 flatcar.autologin ignition.config.url=http://10.90.21.50/snaipe-debug.ign console=tty0 systemd.journald.forward_to_console=yes debug
initrd flatcar3227.2.2/aarch64/initrd.img

... and the following Ignition config:

{
  "ignition": {
    "config": {},
    "security": {
      "tls": {}
    },
    "timeouts": {},
    "version": "2.3.0"
  },
  "networkd": {},
  "passwd": {
    "users": [
      {
        "name": "core",
        "sshAuthorizedKeys": [
          "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAAIAbcwED..."
        ]
      }
    ]
  },
  "storage": {},
  "systemd": {}
}

With the config change, the kernel boots properly and I get to the login screen. Without it, the kernel eventually hangs and dumps stack traces every so often.
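
Once the machine reaches the login screen, one way to double-check that an IOMMU is really in use is to count the IOMMU groups the kernel exposes under `/sys/kernel/iommu_groups`. This is a sketch rather than part of the testing described above; the function name `iommu_group_count` and the optional directory argument (added so the function can be tried against a scratch directory) are assumptions.

```shell
#!/bin/sh
# iommu_group_count [DIR]
# Prints the number of IOMMU groups the kernel has created. DIR defaults to
# /sys/kernel/iommu_groups. A count of 0, or a missing directory, suggests
# that no IOMMU is active.
# Hypothetical helper for illustration; not part of the Flatcar tooling.
iommu_group_count() {
    dir=${1:-/sys/kernel/iommu_groups}
    if [ -d "$dir" ]; then
        # Each group shows up as one numbered entry under DIR.
        ls -1 "$dir" | wc -l | tr -d ' '
    else
        echo 0
    fi
}
```

On a booted system, calling `iommu_group_count` with no argument reads the real sysfs directory.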

  • Changelog entries added in the respective changelog/ directory (user-facing change, bug fix, security fix, update)
  • Inspected CI output for image differences: /boot and /usr size, packages, list files for any missing binaries, kernel modules, config files, etc.

Fixes flatcar/Flatcar#871.

@jepio
Contributor

jepio commented Oct 18, 2022

CI running here http://jenkins.infra.kinvolk.io:8080/job/container/job/packages_all_arches/574/ (internal only link, sorry).

@jepio
Contributor

jepio commented Oct 21, 2022

All tests pass. I reran EM because the first attempt timed out due to parallel test runs on another PR.

In case someone ends up reading this: yes, this might lower performance. If you prefer to trade off security for performance, look at running the IOMMU in lazy unmap mode or passthrough mode.
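
As a rough sketch of how those modes are usually selected: on arm64 the kernel documents the command-line parameters `iommu.strict=0` (lazy TLB invalidation) and `iommu.passthrough=1` (DMA bypasses the IOMMU by default), which could be appended to the `linux` line of the boot config. Mapping them onto the two modes mentioned above is my reading, so check the kernel parameter documentation for your kernel version. The helper below, with the hypothetical name `iommu_mode_from_cmdline`, reports which mode a given command line requests; "default" means neither parameter is set and the kernel's build-time default applies.

```shell
#!/bin/sh
# iommu_mode_from_cmdline CMDLINE
# Prints "passthrough", "lazy", "strict" or "default" depending on the
# iommu.passthrough= and iommu.strict= parameters found in CMDLINE.
# If a parameter appears more than once, the last occurrence wins.
# Hypothetical helper for illustration; not part of the Flatcar tooling.
iommu_mode_from_cmdline() {
    passthrough=""
    strict=""
    # $1 is left unquoted on purpose so that it splits on whitespace.
    for arg in $1; do
        case "$arg" in
            iommu.passthrough=*) passthrough=${arg#iommu.passthrough=} ;;
            iommu.strict=*)      strict=${arg#iommu.strict=} ;;
        esac
    done
    if [ "$passthrough" = 1 ]; then
        echo passthrough
    elif [ "$strict" = 0 ]; then
        echo lazy
    elif [ "$strict" = 1 ]; then
        echo strict
    else
        echo default
    fi
}
```

On a running system, `iommu_mode_from_cmdline "$(cat /proc/cmdline)"` shows what the current boot requested.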

@jepio jepio merged commit 56662f5 into flatcar-archive:main Oct 21, 2022
@jepio jepio mentioned this pull request Oct 24, 2022
Development

Successfully merging this pull request may close these issues.

arm64: unable to PXE boot, hang after AHCI errors