Strange calls during network boot #649

Open
raddirad opened this issue Mar 28, 2024 · 40 comments

@raddirad

Hi

I have a problem with shim 15.8 and a Dell Latitude 5300 2-in-1 Notebook.
This notebook runs the latest firmware, 1.29.
It is connected to Ethernet via a Thunderbolt USB-C docking station.

We do a lot of netbooting with a current shim 15.8. This shim is signed by Microsoft, although the problem isn't Secure Boot related.

When Netbooting on this specific machine we get a strange request via TFTP

RRQ from 192.168.16.97 filename loader/shimx64.efi.signed
tftp: client does not accept options
RRQ from 192.168.16.97 filename loader/shimx64.efi.signed
RRQ from 192.168.16.97 filename oader/revocations.efi
sending NAK (1, File not found) to 192.168.16.97
RRQ from 192.168.16.97 filename loader/?USB

Then the system fails and boots in a SupportAssist mode by Dell.

To verify it's not related to our shim, I took the latest 15.8 shim from Canonical, with the same result.

Other systems, like the Dell 5430 or vSphere or Proxmox VMs, aren't affected. As of now, this is the only system I know of that has this issue.

Other systems request the grub binary as expected after the revocations.efi is not found.

@olifre

olifre commented Apr 12, 2024

I do observe something similar with Dell Latitude 3590, OptiPlex 3040 and others. Checking with tcpdump, I see:

1318    08:31:26,348827        TFTP    84    Read Request, File: grub2/revocations.efi, Transfer type: octet, blksize=512
1319    08:31:26,352904        TFTP    61    Error Code, Code: File not found, Message: File not found
1320    08:31:28,673738        TFTP    77    Read Request, File: grub2/�Onboard, Transfer type: octet, blksize=512
1321    08:31:28,680057        UDP    61    45932 → caci-lm(1554) Len=19

The last packet seems incorrectly parsed by wireshark, it also contains the message "File not found" in the raw part.

It feels like some kind of bad memory access: "Onboard" is one of the EFI boot options on my end, and probably the same holds true for "USB" in @raddirad's case. The strange character is 0xc2 in my case.

@olifre

olifre commented Apr 12, 2024

I've made some progress trying to understand the changes between 15.6 and 15.8.

Adding:

return EFI_SUCCESS;

right here:


(i.e. after the special case handling several devices), things work again with my affected systems. Of course, that's not a real solution, but it highlights how the bad loader name appears.

So it seems that the secondary_loader, which is learnt from the load_options, contains some garbage on Dell systems (likely just the human-readable name of the network boot option instead of the actual loader).

Since the garbage does not start with \0, it is not ignored. It worked in the past because shim hardcoded the default loader, i.e. grubx64.efi, which was fixed here:
a23e2f0
As a result, the bad value is now used instead of being ignored. It's not yet fully clear to me how this bad character enters the options (UEFI bug?) or what the best way to ignore it would be (ignore if non-printable characters are seen?).
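
One way to read the "ignore if non-printable characters are seen" idea is a plausibility check on the candidate loader name before it is used. The following is a minimal sketch, not shim code: `ucs2_char` and `loader_name_is_plausible` are hypothetical names, and accepting only printable ASCII is an assumption that could also reject legitimate non-ASCII file names.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the UEFI CHAR16 type (one UCS-2 code unit). */
typedef uint16_t ucs2_char;

/*
 * Hypothetical helper, not part of shim: returns true only if `name` is a
 * non-empty UCS-2 string that is NUL-terminated within `max_chars` code
 * units and consists solely of printable ASCII characters.
 */
bool loader_name_is_plausible(const ucs2_char *name, size_t max_chars)
{
        if (name == NULL || max_chars == 0 || name[0] == 0)
                return false;

        for (size_t i = 0; i < max_chars; i++) {
                if (name[i] == 0)
                        return true;    /* reached the terminator */
                if (name[i] < 0x20 || name[i] > 0x7e)
                        return false;   /* control character, 0x00c2, ... */
        }
        return false;                   /* no terminator within range */
}
```

With such a check, a name beginning with 0x00c2 (as seen on the affected Dell systems) would be rejected, while grubx64.efi would pass.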

@olifre

olifre commented Apr 12, 2024

After enabling debug = 1 and rebuilding shim, I could grab this:
[screenshot: shim_UEFI_PXE]
This appears to be the load_options one of our Dell systems provides, and it does not contain a file name, but the name of the option ("Onboard NIC(IPV4)"), which is not really useful as secondary_loader. Furthermore, it is prefixed with a strange 0xc2 character.

Since I am not an expert in guessing which other things may break, I'm not sure about the best approach to fix this (ignore loaders starting with non-ASCII characters, for example?).

If a patch is developed (or there is consensus on how this should be handled), I can test it in my environment.

@raddirad
Author

maybe @julian-klode or @vathpela could take a look at this?

Thanks in advance

@vathpela
Contributor

vathpela commented May 3, 2024

That looks like a Boot#### variable that efibootmgr would display like this:
* Onboard NIC(IPV4) PciRoot(0x0)/Pci(0x1c,0x0)/Pci(0x0,0x0)/MAC(d09466f5ac05,0)/IPv4(0.0.0.0,0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)..BO

I have absolutely no idea why this is being passed in the load options, but "is this a fully formed boot variable" is a thing that certainly could be tested for and ignored.

@vathpela
Contributor

vathpela commented May 3, 2024

Okay, to be fair, it's a slightly weird boot variable. The structure is like this pseudocode:

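struct efi_load_option_s {
        uint32_t attributes;             // LOAD_OPTION_* flags
        uint16_t file_path_list_length;  // size of file_path_list in bytes
        uint16_t description[];          // NUL-terminated UCS-2 string
        // Packed directly after the description, so not expressible as C members:
        //   uint8_t file_path_list[file_path_list_length];
        //   uint8_t optional_data[];    // whatever bytes remain
};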

So that's:
01 00 00 00 - attributes (LOAD_OPTION_ACTIVE)
c2 00 - File path length (0xc2)
4f 00 6e 00 62 00 6f 00 61 00 72 00 64 00 20 00 4e 00 49 00 43 00 28 00 49 00 50 00 56 00 34 00 29 00 00 00 - description "Onboard NIC(IPV4)"
Then 0x2a through 0xeb are the file path string, which is formed suspiciously. It starts with the device path in my previous comment, which goes from 0x3c to the 7f ff 04 00 at 0x82 which is the "end entire device path" marker. You would expect that would be the end except of course we've still got a lot of bytes left in our 0xc2 bytes of device path, and lo and behold there's just another device path there. It's a vendor specific message path, and it starts with some gibberish we can't decode, then another UCS-2 string that looks like a description of the ethernet port and the familiar 7f ff 04 00. No idea what the second device path is for at all. And then it ends in 00 00 42 4f, which is the (nonstandard) marker the boot services on this machine have crammed into the "optional data" to mark that it was created by the firmware.
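
The offsets in this walkthrough follow directly from the header fields. As a rough sketch in plain C, outside of shim (`parse_load_option_layout` and `struct load_option_layout` are hypothetical names), the layout can be computed like this:

```c
#include <stddef.h>
#include <stdint.h>

/* Offsets, from the start of the buffer, of the parts of a load option. */
struct load_option_layout {
        uint32_t attributes;
        size_t description_off;     /* always 6: right after the header */
        size_t file_path_list_off;  /* just past the description's NUL */
        size_t optional_data_off;   /* just past the file path list */
};

/* Read little-endian integers without assuming alignment. */
static uint16_t rd16(const uint8_t *p)
{
        return (uint16_t)(p[0] | (p[1] << 8));
}

static uint32_t rd32(const uint8_t *p)
{
        return (uint32_t)rd16(p) | ((uint32_t)rd16(p + 2) << 16);
}

/* Returns 0 on success, -1 if the buffer is too short or inconsistent. */
int parse_load_option_layout(const uint8_t *buf, size_t size,
                             struct load_option_layout *out)
{
        if (buf == NULL || out == NULL || size < 6)
                return -1;

        uint16_t fpl_len = rd16(buf + 4);
        size_t off = 6;

        /* Skip the UCS-2 description, including its 2-byte terminator. */
        for (;;) {
                if (size - off < 2)
                        return -1;
                uint16_t c = rd16(buf + off);
                off += 2;
                if (c == 0)
                        break;
        }

        if (size - off < fpl_len)
                return -1;

        out->attributes = rd32(buf);
        out->description_off = 6;
        out->file_path_list_off = off;
        out->optional_data_off = off + fpl_len;
        return 0;
}
```

For a buffer starting with the bytes listed above (6 header bytes, a 36-byte description including its terminator, then 0xc2 bytes of file path list), this gives a file path list starting at 0x2a and optional data starting at 0xec, which matches the "0x2a through 0xeb" range.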

@vathpela
Contributor

vathpela commented May 3, 2024

So in summary: 1) I have no idea why there's a boot variable hanging out here, 2) I have no idea why the device path list in the boot variable has this weird vendor device path, but 3) it is basically a reasonably well formed boot variable, and we could probably test for that, but I'd rather know why Dell is doing this, because it doesn't really seem like they should be.
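
A check for "is this a reasonably well-formed boot variable" could look roughly like the sketch below. This is an assumption about what such a test might involve, not shim's implementation, and `looks_like_load_option` is a hypothetical name. It tolerates a file path list that contains more than one device path, since the Dell variable above has two, and only requires that the list is made of consistent nodes and ends on an end-of-device-path node (7f ff 04 00).

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Read a little-endian 16-bit value without assuming alignment. */
static uint16_t le16(const uint8_t *p)
{
        return (uint16_t)(p[0] | (p[1] << 8));
}

/*
 * Hypothetical check, not shim's code: does `buf` look like a complete
 * EFI_LOAD_OPTION, i.e. the body of a Boot#### variable?
 */
bool looks_like_load_option(const uint8_t *buf, size_t size)
{
        /* 6-byte header + empty description (2) + one end node (4) */
        if (buf == NULL || size < 6 + 2 + 4)
                return false;

        size_t fpl_len = le16(buf + 4);
        size_t off = 6;

        /* Skip the NUL-terminated UCS-2 description. */
        for (;;) {
                if (size - off < 2)
                        return false;
                uint16_t c = le16(buf + off);
                off += 2;
                if (c == 0)
                        break;
        }

        if (fpl_len < 4 || size - off < fpl_len)
                return false;

        /* Walk the nodes: type, subtype, 16-bit length (header included). */
        size_t end = off + fpl_len;
        bool last_was_end = false;
        while (off < end) {
                if (end - off < 4)
                        return false;
                uint8_t type = buf[off];
                uint8_t subtype = buf[off + 1];
                size_t node_len = le16(buf + off + 2);
                if (node_len < 4 || node_len > end - off)
                        return false;
                last_was_end = (type == 0x7f && subtype == 0xff && node_len == 4);
                off += node_len;
        }
        return last_was_end;
}
```

A plain UCS-2 loader name such as grubx64.efi fails this check, because its bytes do not add up to a consistent header, description and node list. Whether silently ignoring load options that do pass is the right policy is a separate question.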

@pjwelsh

pjwelsh commented May 23, 2024

Was there a path chosen to help with this issue on the shim side? I have all Dell systems with this issue. My only choice at this time seems to be to downgrade shim-x64 to a 15.6 version.

@pjwelsh

pjwelsh commented May 23, 2024

Also, I know it affects at least the Dell OptiPlex 5040, 7040, 3060 and 5060 and the Latitude 5400. For me, it's any Dell desktop or laptop I've needed to PXE install to so far.

@nathan-omeara

> Was there a path chosen to help with this issue on the shim side? I have all Dell systems with this issue. My only choice at this time seems to be to downgrade shim-x64 to a 15.6 version.

Have you tried going into the UEFI settings and in the 'boot sequence' section, unchecking the 'onboard nic' and 'usb' choices? You can still use f12 to choose a single-boot target of usb or network boot, but if you are permanently netbooting systems that won't work, obviously.

@pjwelsh

pjwelsh commented May 23, 2024 via email

@raddirad
Author

> > Was there a path chosen to help with this issue on the shim side? I have all Dell systems with this issue. My only choice at this time seems to be to downgrade shim-x64 to a 15.6 version.
>
> Have you tried going into the UEFI settings and in the 'boot sequence' section, unchecking the 'onboard nic' and 'usb' choices? You can still use f12 to choose a single-boot target of usb or network boot, but if you are permanently netbooting systems that won't work, obviously.

We are installing the OS via PXE/UEFI Netboot, and thus disabling boot choices is not an option. In addition the shim doesn't even try to load the grub via PXE/UEFI Netboot and hangs at the error described in OP.
Last working version was 15.7

@nathan-omeara

> > > Was there a path chosen to help with this issue on the shim side? I have all Dell systems with this issue. My only choice at this time seems to be to downgrade shim-x64 to a 15.6 version.
> >
> > Have you tried going into the UEFI settings and in the 'boot sequence' section, unchecking the 'onboard nic' and 'usb' choices? You can still use f12 to choose a single-boot target of usb or network boot, but if you are permanently netbooting systems that won't work, obviously.
>
> We are installing the OS via PXE/UEFI Netboot, and thus disabling boot choices is not an option. In addition the shim doesn't even try to load the grub via PXE/UEFI Netboot and hangs at the error described in OP. Last working version was 15.7

Yes, if you are only installing the OS, you can press f12 to do a one-time boot to PXE, even when PXE is not in the 'boot sequence' list. That is how I have been able to work around this bug to install the OS via network boot.

@raddirad
Author

In our case we are loading the shim via PXE and this bug happens before the shim chainloads the grub via PXE.

@nathan-omeara

> In our case we are loading the shim via PXE and this bug happens before the shim chainloads the grub via PXE.

Yes, that is how this bug is occurring. I would still suggest you try the workaround. It isn't a great solution, but it seems to work, and still allows you to interactively network boot for OS install.

@raddirad
Author

OK, now I get it. Yeah, for me personally this is doable, but I can't tell our customers to do these things if they have a lot of affected devices.
This should be addressed by the shim team.

@olifre

olifre commented May 24, 2024

Indeed, thanks for the proposed workaround, in fact in our case we reinstall nodes without user interaction (i.e. by triggering a PXE boot remotely, by adding it to the boot order temporarily, then rebooting), so this does not help with the many distributed desktop machines we operate.

@nathan-omeara

a23e2f0

This is the commit that introduces this issue. If I revert it, I can boot my dell (that I finally got hands-on with) with the Onboard devices still in the boot sequence.

So, I'm guessing this is getting confused by the weird Dell boot entries, and screwing up the load path for grubx64.efi

@nathan-omeara

Possible fix:

shim/shim.c

Lines 1262 to 1263 in 0287c6b

        if (!use_fb && (efi_status == EFI_INVALID_PARAMETER ||
                        efi_status == EFI_NOT_FOUND)) {

Add EFI_TFTP_ERROR here:

        if (!use_fb && (efi_status == EFI_INVALID_PARAMETER ||
                        efi_status == EFI_NOT_FOUND ||
                        efi_status == EFI_TFTP_ERROR)) {

In my testing, this gets it booting over network again.
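
Pulled out into a standalone predicate, the proposed condition can be sketched as follows. The typedef and macros are stand-ins for the real UEFI headers (error codes are assumed to have the high bit set, as on 64-bit builds), and `should_fall_back` is a hypothetical name; in shim the test is written inline, as quoted above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-ins for the UEFI definitions, assuming a 64-bit build. */
typedef uint64_t efi_status_t;
#define EFI_ERR(n)            (0x8000000000000000ULL | (uint64_t)(n))
#define EFI_INVALID_PARAMETER EFI_ERR(2)
#define EFI_NOT_FOUND         EFI_ERR(14)
#define EFI_TFTP_ERROR        EFI_ERR(23)

/*
 * Hypothetical helper mirroring the quoted condition: retry with the
 * default loader only when `use_fb` is not set and the failure is one of
 * the listed "could not find or read the requested loader" errors.
 */
bool should_fall_back(efi_status_t status, bool use_fb)
{
        if (use_fb)
                return false;

        return status == EFI_INVALID_PARAMETER ||
               status == EFI_NOT_FOUND ||
               status == EFI_TFTP_ERROR;
}
```

Written this way, the difference between local and network boot is easy to see: a missing file on a filesystem typically surfaces as EFI_NOT_FOUND, while the same situation over TFTP surfaces as EFI_TFTP_ERROR and previously skipped the fallback.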

@pjwelsh

pjwelsh commented Jun 3, 2024

Any guess as to how long a change like this may take to make it into an updated release package?

@raddirad
Author

raddirad commented Jun 4, 2024

Maybe @vathpela, @jsetje or @julian-klode could say more on whether this might get merged upstream.

@jsetje
Collaborator

jsetje commented Jun 5, 2024

Thank you for getting my attention. Just testing for the extra error is probably reasonable, but I'm also curious why we get a variable that looks like that. Since I exposed this, I'll certainly help get a fix in.

@pjwelsh

pjwelsh commented Jun 5, 2024 via email

@jsetje
Collaborator

jsetje commented Jun 5, 2024

I started asking around to see if I could find a system to test this with, which made me wonder about a 2-in-1 with a built in NIC. So I looked at the original report again. I bet that this all has something to do with how the docking station brings the NIC in.

@nathan-omeara

It definitely isn't specific to docking stations.

I have one dell on-hand with the issue, a Latitude 5300 (built-in NIC, no external NIC). But I also have one other dell, one HP, and one MS surface that do not show this issue, all 3 of those using USB NICs only.

It's worth noting that USB boot technically has the same basic issue, but the fallback code kicks in on USB boot, because the error handling I pointed out handles the error when it's on a filesystem, it just doesn't handle it when it's on TFTP.

I also wonder if HTTP(s) boot would be another error code that would need to be added there, but my only on-hand device with HTTP(s) boot support is the dell that doesn't demonstrate this issue.

If you pay attention on USB boot, you can see the same error, followed by the message here:

shim/shim.c

Line 1265 in 0287c6b

L"start_image() returned %r, falling back to default loader\n",

This is what led me to try adding EFI_TFTP_ERROR to that statement.

@nathan-omeara

nathan-omeara commented Jun 5, 2024

Hmm, yeah, I forced a (similar?) error by renaming grubx64.efi on my HTTP boot server and booting my other Dell.
start_image() returned 00000023

I'm guessing it shows up as a bare number because 0x23 (35) is relatively new, and I'm using Fedora's shipping version of shim 15.8, which was probably compiled with an earlier version of gnu-efi.

So I'd suggest adding EFI_TFTP_ERROR and EFI_HTTP_ERROR to that fallback logic.

I certainly wouldn't object to fixing the parsing of the weird values (if there is an actual issue, and it isn't just Dell and Lenovo (and maybe others) doing something that breaks the standard) but harmonizing the fallback behavior between local filesystems and network boot makes sense to me.
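
The guess above is that the status printed as a bare number because the formatter had no name for it. A toy model of that behavior follows, with a deliberately incomplete name table. The table contents, the names `format_status` and `known`, and the 8-digit hex fallback are all assumptions chosen to mirror the output quoted above; this is not gnu-efi's actual code.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for the UEFI definitions, assuming a 64-bit build. */
typedef uint64_t efi_status_t;
#define EFI_ERR(n) (0x8000000000000000ULL | (uint64_t)(n))

struct status_name {
        efi_status_t code;
        const char *name;
};

/* Deliberately incomplete: models a table that predates code 35 (0x23). */
static const struct status_name known[] = {
        { EFI_ERR(2),  "Invalid Parameter" },
        { EFI_ERR(14), "Not Found" },
        { EFI_ERR(23), "TFTP Error" },
};

/* Writes the name if the code is known, else the low 32 bits in hex. */
void format_status(efi_status_t status, char *out, size_t out_len)
{
        for (size_t i = 0; i < sizeof known / sizeof known[0]; i++) {
                if (known[i].code == status) {
                        snprintf(out, out_len, "%s", known[i].name);
                        return;
                }
        }
        snprintf(out, out_len, "%08x", (unsigned)(status & 0xffffffffu));
}
```

In this model, code 35 has no table entry and comes out as 00000023, while code 23 is rendered by name.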

@jsetje
Collaborator

jsetje commented Jun 5, 2024

FWIW, we'll have to fix this forward. In addition to the patch that exposed this, we'll need non-hardcoded paths and names for UKIs. Hopefully I can get my hands on a setup that exposes this, but I'm also not opposed to keep trying unless we get a very specific error.

@raddirad
Author

raddirad commented Jun 5, 2024

> I started asking around to see if I could find a system to test this with, which made me wonder about a 2-in-1 with a built in NIC. So I looked at the original report again. I bet that this all has something to do with how the docking station brings the NIC in.

This is not related to the Dock. I tested the 2-in-1 and a working Dell device. The 2-in-1 failed, the other one succeeded.

@olifre mentioned other devices that show the same behaviour ("Dell Latitude 3590, OptiPlex 3040 and others")
Maybe @olifre can post the other ones and you might get access to one of those

@raddirad
Author

raddirad commented Jun 5, 2024

> FWIW, we'll have to fix this forward. In addition to the patch that exposed this, we'll need non-hardcoded paths and names for UKIs. Hopefully I can get my hands on a setup that exposes this, but I'm also not opposed to keep trying unless we get a very specific error.

If you want anything tested, I have access to the 2-in-1 convertible I mentioned in the OP. I can test new code.

@olifre

olifre commented Jun 5, 2024

> Maybe @olifre can post the other ones and you might get access to one of those

I can immediately add to the list:

  • Dell OptiPlex 3020
  • Dell OptiPlex 3080

After that, I stopped doing systematic testing, as testing other models (we have an assortment of Dell OptiPlex systems, but no other Latitudes at hand) would mean temporarily stealing them from active users to test them in our test network.

If you know which OptiPlex you could get your hands on, I can certainly try to grab that specific model, check it, and report back here.

Combining my list with the information provided by @pjwelsh above, I think the full known Dell list is:

  • Dell OptiPlex 3020, 3040, 5040, 7040, 3060, 5060, 3080
  • Dell Latitude 3590, 5300 2-in-1, 5400

From those numbers, it seems quite likely all the OptiPlex _020 to _080 are affected (at least).

@nathan-omeara

I am also able to test proposed patches. I even set up an additional signing key on the latitude 5300 so I can sign my builds and test with secure boot on.

@nathan-omeara

I was going to submit a PR with the changes I recommended above, but it won't compile with EFI_HTTP_ERROR without updating the submodule branch for gnu-efi, and it looks like that's more complicated than I had assumed.

@pjwelsh

pjwelsh commented Jun 20, 2024

Any progress on the PR submission?

@nathan-omeara

I could submit it without EFI_HTTP_ERROR until gnu-efi is updated. I'm not sure what you guys need to do to pull in a newer version of gnu-efi.

@pjwelsh

pjwelsh commented Jun 20, 2024 via email

@nathan-omeara

It's a submodule, and it looks like each release of shim has a specific gnu-efi revision/branch tagged:

shim/.gitmodules

Lines 1 to 4 in 0287c6b

[submodule "gnu-efi"]
path = gnu-efi
url = https://github.com/rhboot/gnu-efi.git
branch = shim-15.8

That gnu-efi shim-15.8 branch does not contain EFI_HTTP_ERROR yet, but the main branch has that.
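
Until the submodule is updated, one conventional C workaround would be a guarded local definition. This is only a sketch of that idea and not what the PR does. The stand-in typedef and `EFIERR` macro assume a 64-bit build, `is_network_load_error` is a hypothetical helper, and 35 is the value the UEFI specification assigns to EFI_HTTP_ERROR.

```c
#include <stdint.h>

/* Stand-ins for the definitions an older header would provide. */
typedef uint64_t efi_status_t;
#define EFIERR(n)      (0x8000000000000000ULL | (uint64_t)(n))
#define EFI_TFTP_ERROR EFIERR(23)

/*
 * Compatibility guard: define EFI_HTTP_ERROR only if the header in use
 * does not already provide it, so a newer header takes precedence.
 */
#ifndef EFI_HTTP_ERROR
#define EFI_HTTP_ERROR EFIERR(35)
#endif

/* Hypothetical helper: did loading fail in the network transport? */
int is_network_load_error(efi_status_t status)
{
        return status == EFI_TFTP_ERROR || status == EFI_HTTP_ERROR;
}
```

The guard becomes a no-op once the bundled gnu-efi defines the constant itself, at which point it could be dropped.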

nathan-omeara added a commit to nathan-omeara/shim that referenced this issue Jun 20, 2024
…boot

Only certain errors trigger fallback to the default loader name.  This change allows fallback when encountering `EFI_TFTP_ERROR` as well.  And ideally would also handle `EFI_HTTP_ERROR` the same way, but that requires updating gnu-efi to a version newer than the shim-15.8 branch.

This fixes the issue reported in rhboot#649 that prevents boot on some models of PC.

Signed-off-by: Nathan O'Meara <Nathan.OMeara@tanium.com>
@olifre

olifre commented Jun 21, 2024

For reference, I can confirm that PR #666 indeed fixes the issue for our machines when I recompile shim with that patch (we are using TFTP boot). Many thanks!

@raddirad
Author

I can also confirm that adding efi_status == EFI_TFTP_ERROR fixed the error on the Dell 5300 2-in-1 I mentioned in the OP.

@nathan-omeara

@jsetje @vathpela - Any thoughts on #666?

The fix works on my test system and for the two commenters above.

@raddirad
Author

raddirad commented Aug 1, 2024

ping @jsetje @vathpela

is there any update on #666 ?
