arm64: unable to PXE boot, hang after AHCI errors #871
Comments
Hi @Snaipe,
Just tried to boot the vanilla 3227.2.2 PXE images with I'm not sure why, but here or even in my own tests I'm not seeing any effect when setting coherent_pool=xyz; the boot logs still contain Anyway, here's FCOS' kconfig: fcos-kconfig.txt
Would you be able to test build a Flatcar image with the following value set in the kernel config:
That would be the source of the 16M vs 4M atomic pool difference. But to be honest, I have no idea why CMA would make a difference here, especially with assigning I/O space.
I thought I had done something wrong and had to rebuild the kernel twice to confirm, but that config didn't seem to have any effect. I'm a bit puzzled. I am also noticing that when I pass cma=0 to the kernel cmdline, I'm back at the same point as above, so maybe it's something else. I'm also noticing that I am consistently getting a BTF mismatch on the scsi_common module, so I think I might be doing something wrong. Is the guide up-to-date? I'm going to try enabling MODULE_ALLOW_BTF_MISMATCH for now to see if I can get this thing farther, but even better would be to actually reproduce the 3227.2.2 images.
That's weird. If I find some time, I'll check it myself.
Sorry, just noticed that the guide has some issues. Try patching the config in
Would you open a PR with the kconfig changes after you've checked which one works best?
Thanks, doing this right now.
Yep, will do.
Okay, this was very helpful -- the image I got reproduces the original problem, which means I'll be able to properly troubleshoot things. I can confirm that CMA might have been a red herring, as with all of the above configs set, I'm still seeing the problem.
What about enabling
Progress! Enabling IOMMU got me past the original failure. We're now blocking for about 300 seconds after the NICs get initialized and the disks get SCSI-attached, before it gives up and reboots. Boot logs. I'm going to try to bisect the kernel config differences with FCOS too in the meantime.
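For anyone attempting the same kind of comparison, the first step of bisecting two kernel configs is listing the options that differ. The following is a minimal sketch, not the tooling used in this thread: the function names `parse_kconfig` and `diff_kconfig` are hypothetical, and it assumes both files use the standard `.config` syntax (`CONFIG_X=value` and `# CONFIG_X is not set`).

```python
import re

# Standard .config line shapes: "CONFIG_FOO=y" and "# CONFIG_FOO is not set".
_SET = re.compile(r"^(CONFIG_\w+)=(.*)$")
_UNSET = re.compile(r"^# (CONFIG_\w+) is not set$")


def parse_kconfig(text):
    """Map option name -> value; explicitly disabled options map to 'n'."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        match = _SET.match(line)
        if match:
            options[match.group(1)] = match.group(2)
            continue
        match = _UNSET.match(line)
        if match:
            options[match.group(1)] = "n"
    return options


def diff_kconfig(ours, theirs):
    """Return {option: (our_value, their_value)} for every differing option.

    Assumption of this sketch: an option missing from a file counts as 'n'.
    """
    a, b = parse_kconfig(ours), parse_kconfig(theirs)
    return {
        name: (a.get(name, "n"), b.get(name, "n"))
        for name in sorted(set(a) | set(b))
        if a.get(name, "n") != b.get(name, "n")
    }
```

Feeding it the text of the Flatcar config and of fcos-kconfig.txt would give the candidate set to bisect. Applying half of that set at a time is only practical if each intermediate config still boots, and Kconfig dependencies may silently undo individual changes.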
Unfortunately I was not able to bisect the config diff to something that worked -- the initial set was nonfunctional from the get-go (can't even boot) and too large to efficiently trim to find something that worked. I did try investigating the DMA differences a bit further, but I wasn't really trusting the config at that point, so I applied this patch:
This didn't seem to have any effect on the default preallocation size, though it did confirm that coherent_pool was correctly being picked up. That said, I'm not sure this is related anymore, as I'm pretty sure the latest boot logs indicate that init is running pretty well. It looks like systemd is waiting on something and timing out, so I'm going to try and see if I can get it to dump some information.
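When checking whether size parameters such as cma= or coherent_pool= were passed as intended, it can help to normalize them to bytes before comparing against what the boot log reports. The sketch below is a simplified illustration: `parse_size` and `size_params` are hypothetical helpers, only plain sizes with an optional K/M/G suffix are handled, and the extended forms that cma= also accepts (for example a placement address after the size) are deliberately not supported.

```python
import re

# Binary multipliers for the suffixes handled by this sketch.
_SUFFIX = {"": 1, "K": 1 << 10, "M": 1 << 20, "G": 1 << 30}


def parse_size(value):
    """Convert '64M', '4096K' or '0' to bytes; raise ValueError otherwise."""
    match = re.fullmatch(r"(\d+)([KMG]?)", value.strip(), flags=re.IGNORECASE)
    if not match:
        raise ValueError(f"unrecognized size: {value!r}")
    return int(match.group(1)) * _SUFFIX[match.group(2).upper()]


def size_params(cmdline, names=("cma", "coherent_pool")):
    """Return {name: bytes} for each listed size parameter present in cmdline."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in names:
            found[key] = parse_size(value)
    return found
```

On a running system the input would typically be the contents of /proc/cmdline, e.g. `size_params(open("/proc/cmdline").read())`.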
I don't know which console you grab the logs from, but make sure you have this in the command line:
From serial through IPMI -- I've added forward_to_console=1 and I think it might be a config issue pulling something over the network. Currently troubleshooting, but it seems that the IOMMU configs were what really unwedged me. I'll report back if/once I get the config issues working and make a PR with the kconfig changes if nothing else is breaking.
I think you may want to set
I finally got it to work -- removing everything in the config except for a pretty bare bones ignition config, it managed to finally get all the way to login:
I guess we have some stuff to revise in our current configuration -- it was the same config used for older Flatcar deployments so I had assumed it would continue working, but perhaps not. At the very least, being able to boot means I finally have something to go on. I had some extra kernel configs enabled, so I'll try with just IOMMU first to see if I can identify the minimum set of configs to turn on, then put that in a pull request.
I've made a PR at flatcar-archive/coreos-overlay#2235 with the kconfig changes.
Thank you for this contribution @Snaipe. This fix will now progress through alpha->beta->stable.
Description
On Gigabyte arm64 servers, the Flatcar PXE images hang during the boot process, making them unusable, while Fedora CoreOS images work. We think we narrowed it down to CMA not being enabled in the kernel config, and enabling it generally and for DMA seems to get the boot process farther along.
Impact
Inability to use Flatcar images at all on Arm-based servers.
Environment and steps to reproduce
We are running the Flatcar PXE vmlinuz and initrd images on a bare metal arm64 server. This server is a Gigabyte R152-P31.
This seems to be happening with multiple (likely all) published Flatcar images for arm64. I have tested 3033.3.5, 3227.2.2, and 3374.0.0. We have yet to find an image that works.
We use the following grub config to boot these images (adapting the versions appropriately):
The kernel seems to start correctly, however it invariably ends up printing this message and hanging:
... with stack dumps being periodically printed out because the whole thing is blocked. Full logs for a 3227.2.2 boot: flatcar-3227.2.2-boot.log
Booting the kernel with modules_blacklist=ahci of course works and gets us a shell, but means no disks are visible.
One thing we have noticed is that Fedora CoreOS 36.20220918.3.0 has no issue getting past this point. Here is a diff of the boot logs for both:
flatcar-vs-fcos-boot.diff
One obvious difference is the kernel version (Flatcar uses 5.15 while FCOS uses 5.19), but we think we have narrowed it down to the following differences:
In particular, the DMA coherent pools are rather small and it's been accepted that the coherent pool can only work correctly with CMA enabled. The inability to allocate space for the offending PCI device during boot seems to corroborate this. On Flatcar, there is no space reserved for CMA while FCOS has 64MiB, and specifying cma=64M on the kernel cmdline prints that the argument is unrecognized, hinting that it's config-disabled.
I have built the kernel of Flatcar 3227.2.3 using the SDK, and enabled the following configs:
This seems to have gotten the boot process farther along, except for dracut having issues using iSCSI, but I think this may have more to do with my build than anything else. Boot logs for this are here: flatcar-with-cma.log. Currently investigating.
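To double-check that a rebuilt kernel really carries the intended options, one can report the state of a handful of options from the resulting config file. This is a sketch under stated assumptions: `option_states` is a hypothetical helper, and CONFIG_CMA, CONFIG_DMA_CMA and CONFIG_IOMMU_SUPPORT in the usage note are illustrative picks based on the description above, not the exact list of configs that was enabled.

```python
def option_states(kconfig_text, names):
    """Return {name: state} for each requested option.

    The state is the literal value ('y', 'm', '64', ...), 'n' when the file
    says '# NAME is not set', or 'absent' when the option is not mentioned.
    """
    suffix = " is not set"
    states = {name: "absent" for name in names}
    for line in kconfig_text.splitlines():
        line = line.strip()
        if line.startswith("# ") and line.endswith(suffix):
            name = line[2:-len(suffix)]
            if name in states:
                states[name] = "n"
        elif "=" in line and not line.startswith("#"):
            name, _, value = line.partition("=")
            if name in states:
                states[name] = value
    return states
```

For example, `option_states(text, ["CONFIG_CMA", "CONFIG_DMA_CMA", "CONFIG_IOMMU_SUPPORT"])` on the built config shows at a glance whether an option was dropped during the build, which is one way a config change can appear to have no effect.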
Expected behavior
We would have expected the boot to complete.