Boot issue on dedicated server with talos 1.7.0 since 1.7.0-alpha.1 (1.7.0-alpha.0 worked) #8743

Closed
WinterNis opened this issue May 15, 2024 · 13 comments · Fixed by siderolabs/pkgs#959

@WinterNis

Bug Report

Description

We are encountering boot issues with Talos v1.7.0 on an OVH dedicated server. We were previously able to boot Talos on this server using version 1.7.0-alpha.0.

Logs at boot with v1.7.0:

EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
EFI stub: Measured initrd data into PCR 9

And nothing after that. It just hangs.

This is similar to #8657, but we decided to open a separate issue since we are using different hardware and a different provider, and we were able to pinpoint the affected versions (see below).

We tried removing the console kernel extra args and got the same result (and yes, we checked in GRUB that the console argument was no longer present).

We hit the issue with Talos 1.7.0 and 1.7.1. Since we had previously run alpha and beta versions, we went back through them to narrow the problem down.
We were able to find the version where the issue starts:

talos 1.7.0-alpha.0 works
talos 1.7.0-alpha.1 and all later versions do not work

We first suspected a kernel version issue or a hardware incompatibility.
As far as we understand, v1.7.0-alpha.0 uses kernel v6.6.14 while v1.7.0-alpha.1 uses kernel v6.6.21.
To be honest, I tried to read through the changelogs of the kernel v6.6 patch releases, but this is too low-level for me and I could not find anything useful.

How can we debug this issue further? Thanks!

Logs

EFI stub: Loaded initrd from LINUX_EFI_INITRD_MEDIA_GUID device path
EFI stub: Measured initrd data into PCR 9

Environment

  • Talos version:
    Working version: v1.7.0-alpha.0
    Non-working version: v1.7.0-alpha.1 and onward

Dedicated server specs:
OVH scale a1 server: AMD EPYC GENOA 9124 - 16c/32t - 3 GHz/3.6 GHz, 128 GB RAM

The v1.7.0 version works on another (different) dedicated server that runs on an Intel Xeon-E 2388G - 8c/16t - 3.2 GHz/4.6 GHz

@smira
Member

smira commented May 16, 2024

This is a tough issue to look into.

There might be two issues here:

  • still something with the console: the EFI stub messages are printed via the EFI console, while all other messages are printed by the kernel via its own console driver, so I would double-check that. The best way is to enter the GRUB menu and check that there are no console args at all
  • something very incompatible that prevents Linux from booting, but I haven't seen such reports so far. I looked through the other changes to the kernel config, and I don't see anything. Does this machine work with other Linux distros that use a Linux 6.6+ kernel?
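
For the first check, a minimal sketch of what "no console args at all" means in practice: it lists every `console=` argument on a kernel command line. The function name and the sample command line are illustrative assumptions, not taken from this machine; on a system that does boot, the real command line can be read from `/proc/cmdline`.

```python
def console_args(cmdline: str) -> list[str]:
    """Return every console=... argument found on a kernel command line."""
    return [arg for arg in cmdline.split() if arg.startswith("console=")]


# Illustrative command line only (an assumption, not copied from the server).
example = "talos.platform=metal console=tty0 console=ttyS0"
print(console_args(example))
```

An empty list is the state to aim for before concluding that the console is not the culprit.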

@smira
Member

smira commented May 16, 2024

P.S. If it's possible, you could try booting in BIOS (non-UEFI) mode to see if that works.

@WinterNis
Author

Thanks for your answers. This is a tough one indeed.

  • We did enter GRUB and made sure there were no console arguments at all. We still had the issue.
  • We were able to access the BIOS (non-UEFI). Not sure what we can do in there to help, though. Exiting the BIOS then results in the same boot issue.
  • We are able to install Ubuntu 24.04 on the server, which runs kernel v6.8.
  • We tried on another scale a1 server (same specs, different server). We had the same issue.

> I haven't seen such reports so far

As far as I understand, #8657 does report compatibility issues, no?

@smira
Member

smira commented May 16, 2024

> As far as I understand, #8657 does report compatibility issues, no?

That issue (#8657) actually seems to be two issues: one is console args (there's an issue with the Linux kernel where, if console=ttyS0 is specified and there's no serial port, the kernel hangs on boot the same way as you report), and the other is related to the kernel panicking early on boot (but that is on QEMU).

> We were able to access the BIOS (non-UEFI). Not sure what we can do in there to help, though. Exiting the BIOS then results in the same boot issue.

There should be an option to boot in "legacy mode" (or something like that), which disables UEFI completely. I'm just curious if this is related to UEFI or not. Talos should work both ways, but still.

@smira
Member

smira commented May 16, 2024

I wonder if https://cateee.net/lkddb/web-lkddb/EFI_DISABLE_PCI_DMA.html might be the issue here; it was enabled in the alpha.1 version.

So I guess the experiment to try is adding efi=no_disable_early_pci_dma to the kernel command line, which might fix it.
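
To make the interaction between the build-time option and the command-line override concrete, here is a rough sketch. It assumes a simplified model: `CONFIG_EFI_DISABLE_PCI_DMA=y` sets the default, and the comma-separated `efi=` options `disable_early_pci_dma` and `no_disable_early_pci_dma` override it. The function name is hypothetical, and the kernel's real parsing may differ in edge cases.

```python
def early_pci_dma_disabled(kernel_config: str, cmdline: str) -> bool:
    """Best-effort guess at whether the EFI stub disables early PCI DMA.

    kernel_config is the text of a kernel .config file; cmdline is the
    kernel command line. Simplified model, see the assumptions above.
    """
    # Build-time default: enabled only if the option is set to "y".
    disabled = any(
        line.strip() == "CONFIG_EFI_DISABLE_PCI_DMA=y"
        for line in kernel_config.splitlines()
    )
    # Command-line overrides: efi= takes a comma-separated option list.
    for arg in cmdline.split():
        if arg.startswith("efi="):
            opts = set(arg[len("efi="):].split(","))
            if "disable_early_pci_dma" in opts:
                disabled = True
            if "no_disable_early_pci_dma" in opts:
                disabled = False
    return disabled
```

Under this model, a kernel built with the option enabled and booted with `efi=no_disable_early_pci_dma` behaves like one built without it, which is the experiment proposed here.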

@frezbo
Member

frezbo commented May 16, 2024

> I wonder if https://cateee.net/lkddb/web-lkddb/EFI_DISABLE_PCI_DMA.html might be the issue here; it was enabled in the alpha.1 version.
>
> So I guess the experiment to try is adding efi=no_disable_early_pci_dma to the kernel command line, which might fix it.

That could be it, since it broke booting on arm64.

@WinterNis
Author

Well, that actually solved the problem.

Passing efi=no_disable_early_pci_dma AND removing console arguments did the trick.

With the efi argument, but without removing the console arguments, we had the same issue.
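
As a sketch of the combination that worked for us, the following rewrites a kernel command line by dropping every `console=` argument and appending the EFI flag once. The helper name and the sample input are illustrative assumptions; this only manipulates a string and does not touch the bootloader.

```python
EFI_FLAG = "efi=no_disable_early_pci_dma"


def apply_workaround(cmdline: str) -> str:
    """Drop all console= arguments and make sure the EFI flag is present."""
    args = [arg for arg in cmdline.split() if not arg.startswith("console=")]
    if EFI_FLAG not in args:
        args.append(EFI_FLAG)
    return " ".join(args)


# Illustrative input only (an assumption, not our exact command line).
print(apply_workaround("talos.platform=metal console=ttyS0 console=tty0"))
```

Applying it twice gives the same result as applying it once, so it is safe to run against an already-patched command line.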

Thanks guys, you truly sniped this 🙏

What’s the best way of "fixing" this? Should we add the arguments in our custom images and leave it at that? Or would you consider removing the option because of compatibility issues?

@smira
Member

smira commented May 16, 2024

> What’s the best way of "fixing" this? Should we add the arguments in our custom images and leave it at that? Or would you consider removing the option because of compatibility issues?

Yes, you can use a custom kernel arg for now, and I believe it would still be fine (ignored) if we disable it by default in the kernel config.
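
One place such an argument could live is the machine configuration. The sketch below only builds a patch document as JSON, assuming the `machine.install.extraKernelArgs` field of the Talos machine config; the helper name is hypothetical, how the patch gets applied is not shown, and a custom image built some other way may need the argument set elsewhere.

```python
import json


def kernel_arg_patch(extra_args: list[str]) -> str:
    """Build a machine config patch that adds extra kernel arguments.

    Assumes the machine.install.extraKernelArgs field; adjust if your
    config version differs.
    """
    patch = {"machine": {"install": {"extraKernelArgs": extra_args}}}
    return json.dumps(patch, indent=2)


print(kernel_arg_patch(["efi=no_disable_early_pci_dma"]))
```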

@smira
Member

smira commented May 16, 2024

I will actually remove that kernel option; whoever wants it can use a kernel arg to enforce it, and e.g. Ubuntu doesn't enable it by default.

@smira
Member

smira commented May 16, 2024

console fix is expected in 1.8

@smira smira self-assigned this May 16, 2024
@WinterNis
Author

That’s good news. Thanks again for your help guys, truly appreciated 🙏

I will let you guys close this issue or keep it open until the fixes have landed.

smira added a commit to smira/pkgs that referenced this issue May 16, 2024
This effectively reverts siderolabs#899 completely.

Fixes siderolabs/talos#8743

Signed-off-by: Andrey Smirnov <andrey.smirnov@siderolabs.com>
(cherry picked from commit f414bbd)
@smira
Member

smira commented May 16, 2024

Talos 1.7.2 will have this fix included.

@buroa

buroa commented May 17, 2024

Thanks for this @smira. This pretty much breaks a ton of EFI boot processes. I saw this on the Mac minis as well and patched it inside my builds.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jul 17, 2024