Strange calls during network boot #649

Open
raddirad opened this issue Mar 28, 2024 · 68 comments · May be fixed by #694

Comments

@raddirad

Hi

I have a problem with shim 15.8 and a Dell Latitude 5300 2-in-1 Notebook.
This notebook uses the latest firmware, 1.29.
It is connected to Ethernet via a Thunderbolt USB-C Docking Station.

We do a lot of netbooting with a current shim 15.8. This shim is signed by Microsoft, although the problem isn't Secure Boot related.

When netbooting on this specific machine, we get a strange request via TFTP:

RRQ from 192.168.16.97 filename loader/shimx64.efi.signed
tftp: client does not accept options
RRQ from 192.168.16.97 filename loader/shimx64.efi.signed
RRQ from 192.168.16.97 filename oader/revocations.efi
sending NAK (1, File not found) to 192.168.16.97
RRQ from 192.168.16.97 filename loader/?USB

Then the system fails and boots in a SupportAssist mode by Dell.

To verify it's not related to our shim, I took the latest 15.8 shim from Canonical, with the same result.

Other systems, like the Dell 5430 or vSphere or Proxmox VMs, aren't affected. So far, this is the only system I know of that has this issue.

Other systems request the grub binary as expected after the revocations.efi is not found.

@olifre

olifre commented Apr 12, 2024

I observe something similar with the Dell Latitude 3590, OptiPlex 3040 and others. Checking with tcpdump, I see:

1318    08:31:26,348827        TFTP    84    Read Request, File: grub2/revocations.efi, Transfer type: octet, blksize=512
1319    08:31:26,352904        TFTP    61    Error Code, Code: File not found, Message: File not found
1320    08:31:28,673738        TFTP    77    Read Request, File: grub2/�Onboard, Transfer type: octet, blksize=512
1321    08:31:28,680057        UDP    61    45932 → caci-lm(1554) Len=19

The last packet seems to be incorrectly parsed by Wireshark; it also contains the message "File not found" in the raw part.

It feels like some kind of bad memory access: "Onboard" is one of the EFI boot options on my end, and probably the same holds true for "USB" in @raddirad's case. The strange character is 0xc2 in my case.

@olifre

olifre commented Apr 12, 2024

I've made some progress trying to understand the changes between 15.6 and 15.8.

Adding:

return EFI_SUCCESS;

right here:


(i.e. after the special case handling several devices), things work again with my affected systems. Of course, that's not a real solution, but it highlights how the bad loader name appears.

So it seems that the secondary_loader, which is learnt from the load_options, contains some garbage on Dell systems (likely just the human-readable name of the network boot option instead of the actual loader).

Since the garbage does not start with \0, it is not ignored. It worked in the past because shim had hardcoded the default loader to be used, i.e. grubx64.efi, which was fixed here:
a23e2f0
This leads to the bad value being used instead of being ignored. It's not yet fully clear to me how this bad character enters the options (UEFI bug?) and what would be the best way to ignore it (ignore if non-printable characters are seen?).

@olifre

olifre commented Apr 12, 2024

After enabling debug = 1 and rebuilding shim, I could grab this:
[screenshot: shim_UEFI_PXE]
This appears to be the load_options one of our Dell systems provides, and it does not contain a file name, but the name of the option ("Onboard NIC (IPV4)") which is not really useful as secondary_loader. Furthermore, it is prefixed with a strange 0xc2 character.

Since I am not an expert in guessing which other things may break, I'm not sure about the best approach to fix this (ignore loaders starting with non-ASCII characters, for example?).

If a patch is developed (or there is consensus on how this should be handled), I can test it in my environment.

@raddirad
Author

maybe @julian-klode or @vathpela could take a look at this?

Thanks in advance

@vathpela
Contributor

vathpela commented May 3, 2024

That looks like a Boot#### variable that efibootmgr would display like this:
* Onboard NIC(IPV4) PciRoot(0x0)/Pci(0x1c,0x0)/Pci(0x0,0x0)/MAC(d09466f5ac05,0)/IPv4(0.0.0.0,0,DHCP,0.0.0.0,0.0.0.0,0.0.0.0)..BO

I have absolutely no idea why this is being passed in the load options, but "is this a fully formed boot variable" is a thing that certainly could be tested for and ignored.

@vathpela
Contributor

vathpela commented May 3, 2024

Okay, to be fair, it's a slightly weird boot variable. The structure is like this pseudocode:

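struct efi_load_option_s {
        uint32_t attributes;
        uint16_t file_path_list_length;  // size in bytes of file_path_list
        uint16_t description[];          // NUL-terminated UCS-2 string
        // Two more variable-length fields follow the description in memory.
        // C allows only one flexible array member, so they are shown as comments:
        // uint8_t file_path_list[];     // file_path_list_length bytes of device paths
        // uint8_t optional_data[];      // whatever remains after the path list
};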

So that's:
01 00 00 00 - attributes (LOAD_OPTION_ACTIVE)
c2 00 - File path length (0xc2)
4f 00 6e 00 62 00 6f 00 61 00 72 00 64 00 20 00 4e 00 49 00 43 00 28 00 49 00 50 00 56 00 34 00 29 00 00 00 - description "Onboard NIC(IPV4)"
Then 0x2a through 0xeb are the file path string, which is formed suspiciously. It starts with the device path in my previous comment, which goes from 0x3c to the 7f ff 04 00 at 0x82 which is the "end entire device path" marker. You would expect that would be the end except of course we've still got a lot of bytes left in our 0xc2 bytes of device path, and lo and behold there's just another device path there. It's a vendor specific message path, and it starts with some gibberish we can't decode, then another UCS-2 string that looks like a description of the ethernet port and the familiar 7f ff 04 00. No idea what the second device path is for at all. And then it ends in 00 00 42 4f, which is the (nonstandard) marker the boot services on this machine have crammed into the "optional data" to mark that it was created by the firmware.
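The field-by-field reading above can be sketched in C. This is only an illustration of the layout described in this comment, not code from shim; the names `load_option_header`, `rd16`, `rd32` and `parse_load_option_header` are made up for the example.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical summary of the fixed part of an EFI_LOAD_OPTION. */
struct load_option_header {
        uint32_t attributes;
        uint16_t file_path_list_length; /* bytes of device path data */
        size_t   description_offset;    /* where the UCS-2 description starts */
        size_t   file_path_offset;      /* first byte after the description NUL */
};

/* Read a little-endian 16-bit value. */
static uint16_t rd16(const uint8_t *p)
{
        return (uint16_t)(p[0] | (p[1] << 8));
}

/* Read a little-endian 32-bit value. */
static uint32_t rd32(const uint8_t *p)
{
        return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
               ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/*
 * Split the start of a load option into its fields.
 * Returns 0 on success, -1 if the buffer is too short or the
 * description has no NUL terminator.
 */
static int parse_load_option_header(const uint8_t *buf, size_t len,
                                    struct load_option_header *out)
{
        size_t i;

        if (len < 6)
                return -1;
        out->attributes = rd32(buf);
        out->file_path_list_length = rd16(buf + 4);
        out->description_offset = 6;

        /* The description is NUL-terminated UCS-2, so scan 16-bit units. */
        for (i = 6; i + 1 < len; i += 2) {
                if (rd16(buf + i) == 0) {
                        out->file_path_offset = i + 2;
                        return 0;
                }
        }
        return -1;
}
```

On the bytes quoted above, this would report attributes 1 and a file path list length of 0xc2, with the device path list starting right after the description's terminating NUL.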

@vathpela
Contributor

vathpela commented May 3, 2024

So in summary: 1) I have no idea why there's a boot variable hanging out here, 2) I have no idea why the device path list in the boot variable has this weird vendor device path, but 3) it is basically a reasonably well formed boot variable, and we could probably test for that, but I'd rather know why Dell is doing this, because it doesn't really seem like they should be.

@pjwelsh

pjwelsh commented May 23, 2024

Was there a path chosen to help with this issue on the shim side? I have all Dell systems with this issue. My only choice at this time seems to be to downgrade shim-x64 to a 15.6 version.

@pjwelsh

pjwelsh commented May 23, 2024

Also, I know it affects at least the Dell Optiplex 5040, 7040, 3060 and 5060 and Latitude 5400. For me, it's any Dell desktop or laptop I've needed to PXE install to so far.

@nathan-omeara

> Was there a path chosen to help with this issue on the shim side? I have all Dell systems with this issue. My only choice at this time seems to be to downgrade shim-x64 to a 15.6 version.

Have you tried going into the UEFI settings and in the 'boot sequence' section, unchecking the 'onboard nic' and 'usb' choices? You can still use f12 to choose a single-boot target of usb or network boot, but if you are permanently netbooting systems that won't work, obviously.

@pjwelsh

pjwelsh commented May 23, 2024 via email

@raddirad
Author

> Was there a path chosen to help with this issue on the shim side? I have all Dell systems with this issue. My only choice at this time seems to be to downgrade shim-x64 to a 15.6 version.
>
> Have you tried going into the UEFI settings and in the 'boot sequence' section, unchecking the 'onboard nic' and 'usb' choices? You can still use f12 to choose a single-boot target of usb or network boot, but if you are permanently netbooting systems that won't work, obviously.

We are installing the OS via PXE/UEFI Netboot, and thus disabling boot choices is not an option. In addition the shim doesn't even try to load the grub via PXE/UEFI Netboot and hangs at the error described in OP.
Last working version was 15.7

@nathan-omeara

> Was there a path chosen to help with this issue on the shim side? I have all Dell systems with this issue. My only choice at this time seems to be to downgrade shim-x64 to a 15.6 version.
>
> Have you tried going into the UEFI settings and in the 'boot sequence' section, unchecking the 'onboard nic' and 'usb' choices? You can still use f12 to choose a single-boot target of usb or network boot, but if you are permanently netbooting systems that won't work, obviously.
>
> We are installing the OS via PXE/UEFI Netboot, and thus disabling boot choices is not an option. In addition the shim doesn't even try to load the grub via PXE/UEFI Netboot and hangs at the error described in OP. Last working version was 15.7

Yes, if you are only installing the OS, you can press f12 to do a one-time boot to PXE, even when PXE is not in the 'boot sequence' list. That is how I have been able to work around this bug to install the OS via network boot.

@raddirad
Author

In our case we are loading the shim via PXE and this bug happens before the shim chainloads the grub via PXE.

@nathan-omeara

> In our case we are loading the shim via PXE and this bug happens before the shim chainloads the grub via PXE.

Yes, that is how this bug is occurring. I would still suggest you try the workaround. It isn't a great solution, but it seems to work, and still allows you to interactively network boot for OS install.

@raddirad
Author

OK, now I get it. Yeah, for me personally this is doable, but I can't tell our customers to do these things if they have a lot of affected devices.
This should be addressed by the shim team.

@olifre

olifre commented May 24, 2024

Indeed, thanks for the proposed workaround, in fact in our case we reinstall nodes without user interaction (i.e. by triggering a PXE boot remotely, by adding it to the boot order temporarily, then rebooting), so this does not help with the many distributed desktop machines we operate.

@nathan-omeara

a23e2f0

This is the commit that introduces this issue. If I revert it, I can boot my Dell (that I finally got hands-on with) with the Onboard devices still in the boot sequence.

So, I'm guessing this is getting confused by the weird Dell boot entries, and screwing up the load path for grubx64.efi

@nathan-omeara

Possible fix:

shim/shim.c

Lines 1262 to 1263 in 0287c6b

        if (!use_fb && (efi_status == EFI_INVALID_PARAMETER ||
                        efi_status == EFI_NOT_FOUND)) {

Add TFTP_ERROR here:

        if (!use_fb && (efi_status == EFI_INVALID_PARAMETER ||
                        efi_status == EFI_NOT_FOUND ||
                        efi_status == EFI_TFTP_ERROR)) {

In my testing, this gets it booting over network again.

@pjwelsh

pjwelsh commented Jun 3, 2024

Any guess as to how long a change like this may take to make it into an updated release package?

@raddirad
Author

raddirad commented Jun 4, 2024

Maybe @vathpela, @jsetje, or @julian-klode could say more on whether this might get upstream.

@jsetje
Collaborator

jsetje commented Jun 5, 2024

Thank you for getting my attention. Just testing for the extra error is probably reasonable, but I'm also curious why we get a variable that looks like that. Since I exposed this, I'll certainly help get a fix in.

@pjwelsh

pjwelsh commented Jun 5, 2024 via email

@jsetje
Collaborator

jsetje commented Jun 5, 2024

I started asking around to see if I could find a system to test this with, which made me wonder about a 2-in-1 with a built in NIC. So I looked at the original report again. I bet that this all has something to do with how the docking station brings the NIC in.

@nathan-omeara

It definitely isn't specific to docking stations.

I have one Dell on hand with the issue, a Latitude 5300 (built-in NIC, no external NIC). But I also have one other Dell, one HP, and one MS Surface that do not show this issue, all three of those using USB NICs only.

It's worth noting that USB boot technically has the same basic issue, but the fallback code kicks in on USB boot, because the error handling I pointed out handles the error when it's on a filesystem, it just doesn't handle it when it's on TFTP.

I also wonder if HTTP(S) boot would produce another error code that would need to be added there, but my only on-hand device with HTTP(S) boot support is the Dell that doesn't demonstrate this issue.

If you pay attention on USB boot, you can see the same error, followed by the message here:

shim/shim.c

Line 1265 in 0287c6b

L"start_image() returned %r, falling back to default loader\n",

This is what led me to try adding EFI_TFTP_ERROR to that statement.

@nathan-omeara

nathan-omeara commented Jun 5, 2024

Hmm, yeah, I forced a (similar?) error by renaming grubx64.efi on my HTTP boot server and booting my other Dell.
start_image() returned 00000023

I'm guessing the status is printed as a raw number because 0x23 (35) is relatively new, and I'm using Fedora's shipping version of shim 15.8, which was probably compiled with an earlier version of gnu_efi.

So I'd suggest adding EFI_TFTP_ERROR and EFI_HTTP_ERROR to that fallback logic.
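A standalone sketch of what that widened fallback test could look like follows. It is not the actual patch: the helper name `should_fall_back_to_default_loader` and the `efi_status_t` typedef are invented here, and the status codes are defined locally (using the 64-bit encoding from the UEFI specification's status code table) so the snippet compiles without gnu-efi headers.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Local stand-ins for the UEFI status codes on a 64-bit platform.
 * Error codes have the high bit set; the low bits are the code number.
 */
typedef uint64_t efi_status_t;
#define EFI_ERR(n)            (0x8000000000000000ULL | (n))
#define EFI_INVALID_PARAMETER EFI_ERR(2)
#define EFI_NOT_FOUND         EFI_ERR(14)
#define EFI_TFTP_ERROR        EFI_ERR(23)
#define EFI_HTTP_ERROR        EFI_ERR(35)   /* 35 == 0x23, as seen above */

/*
 * Hypothetical helper: decide whether a failed start_image() for the
 * requested second stage should be retried with the default loader.
 * use_fb mirrors the flag of the same name in the quoted condition.
 */
static bool should_fall_back_to_default_loader(bool use_fb,
                                               efi_status_t status)
{
        if (use_fb)
                return false;

        return status == EFI_INVALID_PARAMETER ||
               status == EFI_NOT_FOUND ||
               status == EFI_TFTP_ERROR ||
               status == EFI_HTTP_ERROR;
}
```

The idea is simply that a missing file looks different depending on the transport (filesystem, TFTP, HTTP), so the fallback condition lists each transport's "could not fetch it" status.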

I certainly wouldn't object to fixing the parsing of the weird values (if there is an actual issue, and it isn't just Dell and Lenovo (and maybe others) doing something that breaks the standard) but harmonizing the fallback behavior between local filesystems and network boot makes sense to me.

@jsetje
Collaborator

jsetje commented Jun 5, 2024

FWIW, we'll have to fix this forward. In addition to the patch that exposed this, we'll need non-hardcoded paths and names for UKIs. Hopefully I can get my hands on a setup that exposes this, but I'm also not opposed to keep trying unless we get a very specific error.

@raddirad
Author

raddirad commented Jun 5, 2024

> I started asking around to see if I could find a system to test this with, which made me wonder about a 2-in-1 with a built in NIC. So I looked at the original report again. I bet that this all has something to do with how the docking station brings the NIC in.

This is not related to the Dock. I tested the 2-in-1 and a working Dell device. The 2-in-1 failed, the other one succeeded.

@olifre mentioned other devices that show the same behaviour ("Dell Latitude 3590, OptiPlex 3040 and others")
Maybe @olifre can post the other ones and you might get access to one of those

@raddirad
Author

raddirad commented Jun 5, 2024

> FWIW, we'll have to fix this forward. In addition to the patch that exposed this, we'll need non-hardcoded paths and names for UKIs. Hopefully I can get my hands on a setup that exposes this, but I'm also not opposed to keep trying unless we get a very specific error.

If you want anything tested, I have access to the 2-in-1 convertible I mentioned in the OP. I can test new code.

@olifre

olifre commented Jun 5, 2024

> Maybe @olifre can post the other ones and you might get access to one of those

I can immediately add to the list:

  • Dell OptiPlex 3020
  • Dell OptiPlex 3080

After that, I stopped doing systematic testing, as testing other models (we have an assortment of Dell OptiPlex systems, but no other Latitudes at hand) would mean temporarily stealing them from active users to test them in our test network.

I can certainly try to grab a specific model if you know you can get your hands on a particular OptiPlex, check it, and report back here.

Combining my list with the information provided by @pjwelsh above, I think the full known Dell list is:

  • Dell OptiPlex 3020, 3040, 5040, 7040, 3060, 5060, 3080
  • Dell Latitude 3590, 5300 2-in-1, 5400

From those numbers, it seems quite likely all the OptiPlex _020 to _080 are affected (at least).

@raddirad
Author

It's not stupid if it works. However, I have seen devices requesting different names, �USB for example.

@MarkusSpier

For our two Dell test clients (both OptiPlex systems, one a touch all-in-one), it works.
Maybe you can also use the filename êusb for the USB scenario.

@dbnicholson

Could either of you dump out the raw boot option data and attach it here? I'd like to poke at it in code instead of trying to interpret the hexdump in my head. You can just copy the appropriate /sys/firmware/efi/efivars/BootXXXX-8be4df61-93ca-11d2-aa0d-00e098032b8c file corresponding to the right boot option. Look at the output of efibootmgr to see which one it is. You could also base64 encode it like base64 /sys/firmware/efi/efivars/BootXXXX-8be4df61-93ca-11d2-aa0d-00e098032b8c > bootopt.b64 and upload that.

@raddirad
Author

raddirad commented Oct 2, 2024

So I did this on an OptiPlex 3050

efibootmgr -v
BootCurrent: 0012
Timeout: 2 seconds
BootOrder: 0013,0014,0015,0016,0017,0012,000A,000A,0012,0019
Boot0000* Windows Boot Manager	HD(1,GPT,cc9ed1a0-b28c-4713-9cbf-a3af67ae85d0,0x800,0xfa000)/File(\EFI\Microsoft\Boot\bootmgfw.efi)WINDOWS.........x...B.C.D.O.B.J.E.C.T.=.{.9.d.e.a.8.6.2.c.-.5.c.d.d.-.4.e.7.0.-.a.c.c.1.-.f.3.2.b.3.4.4.d.4.7.9.5.}....................
Boot000A  Windows Boot Manager	VenHw(99e275e7-75a0-4b37-a2e6-c5385e6c00cb)
Boot000B  Diskette Drive	BBS(Floppy,Diskette Drive,0x0)..BO
Boot000D  USB Storage Device	BBS(USB,USB Storage Device,0x0)..BO
Boot000E  CD/DVD/CD-RW Drive	BBS(CDROM,CD/DVD/CD-RW Drive,0x0)..BO
Boot000F* Onboard NIC	BBS(Network,Realtek PXE B01 D00,0x0)..BO
Boot0010* P0: Samsung SSD 850 PRO 128GB	BBS(HD,P0: Samsung SSD 850 PRO 128GB,0x0)..BO
Boot0012* Onboard NIC(IPV4)	PciRoot(0x0)/Pci(0x1c,0x0)/Pci(0x0,0x0)/MAC(d89ef37f465b,0)/IPv4(0.0.0.00.0.0.0,0,0)..BO
Boot0013* Diskette Drive	BBS(Floppy,Diskette Drive,0x0)..BO
Boot0014* Internal HDD	BBS(HD,Internal HDD,0x0)..BO
Boot0015* USB Storage Device	BBS(USB,USB Storage Device,0x0)..BO
Boot0016* CD/DVD/CD-RW Drive	BBS(CDROM,CD/DVD/CD-RW Drive,0x0)..BO
Boot0017* Onboard NIC	BBS(Network,Realtek PXE B01 D00,0x0)..BO
Boot0019* Onboard NIC(IPV6)	PciRoot(0x0)/Pci(0x1c,0x0)/Pci(0x0,0x0)/MAC(d89ef37f465b,0)/IPv6([::]:<->[::]:,0,0)..BO

There are several entries with an "Onboard" prefix, so here are all of them:

Boot000F* Onboard NIC

base64 /sys/firmware/efi/efivars/Boot000F-8be4df61-93ca-11d2-aa0d-00e098032b8c 
BwAAAAEAAAB+AE8AbgBiAG8AYQByAGQAIABOAEkAQwAAAAUBHAAGAAAAUmVhbHRlayBQWEUgQjAx
IEQwMAB//wQAAQQaAK6EsR31gXJOhUQrqwwsrFwBAAACAAB//wQAAQQ8AO9HZC3JO6BBrBlNUdAb
TOZSAGUAYQBsAHQAZQBrACAAUABYAEUAIABCADAAMQAgAEQAMAAwAAAAf/8EAAAAQk8=

Boot0012* Onboard NIC(IPV4)

base64 /sys/firmware/efi/efivars/Boot0012-8be4df61-93ca-11d2-aa0d-00e098032b8c 
BwAAAAEAAADCAE8AbgBiAG8AYQByAGQAIABOAEkAQwAoAEkAUABWADQAKQAAAAIBDADQQQMKAAAA
AAEBBgAAHAEBBgAAAAMLJQDYnvN/RlsAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADDBsAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAB//wQAAQRiAO9HZC3JO6BBrBlNUdAbTOZJAFAANAAgAFIAZQBh
AGwAdABlAGsAIABQAEMASQBlACAARwBCAEUAIABGAGEAbQBpAGwAeQAgAEMAbwBuAHQAcgBvAGwA
bABlAHIAAAB//wQAAABCTw==

Boot0017* Onboard NIC

base64 /sys/firmware/efi/efivars/Boot0017-8be4df61-93ca-11d2-aa0d-00e098032b8c 
BwAAAAEAAAB+AE8AbgBiAG8AYQByAGQAIABOAEkAQwAAAAUBHAAGAAAAUmVhbHRlayBQWEUgQjAx
IEQwMAB//wQAAQQaAK6EsR31gXJOhUQrqwwsrFwBAAACAAB//wQAAQQ8AO9HZC3JO6BBrBlNUdAb
TOZSAGUAYQBsAHQAZQBrACAAUABYAEUAIABCADAAMQAgAEQAMAAwAAAAf/8EAAAAQk8=

Boot0019* Onboard NIC(IPV6)

base64 /sys/firmware/efi/efivars/Boot0019-8be4df61-93ca-11d2-aa0d-00e098032b8c 
BwAAAAEAAADjAE8AbgBiAG8AYQByAGQAIABOAEkAQwAoAEkAUABWADYAKQAAAAIBDADQQQMKAAAA
AAEBBgAAHAEBBgAAAAMLJQDYnvN/RlsAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADDTwAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAB//wQA
AQRiAO9HZC3JO6BBrBlNUdAbTOZJAFAANgAgAFIAZQBhAGwAdABlAGsAIABQAEMASQBlACAARwBC
AEUAIABGAGEAbQBpAGwAeQAgAEMAbwBuAHQAcgBvAGwAbABlAHIAAAB//wQAAABCTw==

@nathan-omeara

And to add some data points, the two entries matching "Onboard" on my Latitude 5300:
Boot0003* Onboard NIC(IPV4):

BwAAAAEAAADQAE8AbgBiAG8AYQByAGQAIABOAEkAQwAoAEkAUABWADQAKQAAAAIBDADQQQMKAAAA
AAEBBgAGHwMLJQAs6n8Kn2kAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADDBsAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAB//wQAAQR2AO9HZC3JO6BBrBlNUdAbTOZQAFgARQAgAEkAUAA0ACAASQBu
AHQAZQBsACgAUgApACAARQB0AGgAZQByAG4AZQB0ACAAQwBvAG4AbgBlAGMAdABpAG8AbgAgACgA
NgApACAASQAyADEAOQAtAEwATQAAAH//BAAAAEJP

Boot0004* Onboard NIC(IPV6)

BwAAAAEAAADxAE8AbgBiAG8AYQByAGQAIABOAEkAQwAoAEkAUABWADYAKQAAAAIBDADQQQMKAAAA
AAEBBgAGHwMLJQAs6n8Kn2kAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAADDTwAAAAAAAAAAAAA
AAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAAQAAAAAAAAAAAAAAAAAAAAAB//wQAAQR2AO9H
ZC3JO6BBrBlNUdAbTOZQAFgARQAgAEkAUAA2ACAASQBuAHQAZQBsACgAUgApACAARQB0AGgAZQBy
AG4AZQB0ACAAQwBvAG4AbgBlAGMAdABpAG8AbgAgACgANgApACAASQAyADEAOQAtAEwATQAAAH//
BAAAAEJP

When this device is encountering this error, the boot filename sent in the TFTP request (that should be grubx64.efi) is ÐOnboard (where 0xd0 precedes "Onboard"), and I notice that in the hex representation of that var, on the 5300, 0xd000 precedes the utf-16-le encoding of "Onboard NIC(IPV4)".

This is readily visible in a packet capture of the TFTP download request.

That is slightly different from the OptiPlex example above, so I wonder if the TFTP file request in @raddirad's example would have 0xc2 in front of "Onboard" in the TFTP request?

@nathan-omeara

Per this comment #649 (comment), it seems that the d0/c2/etc are the length of the boot option's file_path_list[] entry.

It's possible that the length in my example just happens to match what is sent in the filename.. so I would be curious to see if it's different with different lengths.

@raddirad
Author

raddirad commented Oct 2, 2024

@nathan-omeara could you tell me how to get those hex values? I would like to provide that information.

@nathan-omeara

nathan-omeara commented Oct 2, 2024

Adding a small example of what I see in Wireshark: [screenshot]

I just run Wireshark on my TFTP server, and filter the displayed packets to tftp (I also include dhcp just to help with some troubleshooting):
[screenshot]

Edit: And to be clear, all you have to do is click on the 'source file' in the packet dissector pane to highlight the exact bytes in the hex dump pane.

If you instead wanted to capture using tcpdump and provide the raw dump file, I or someone else could load it into Wireshark and look, but there's a risk of capturing other sensitive data from your network that way. (Though you could probably lower the chances of that by only capturing packets with a destination of UDP port 69.)

@raddirad
Author

raddirad commented Oct 2, 2024

So in my case on the OptiPlex 3050 it is indeed 0xc2:
[screenshot]

@nathan-omeara

nathan-omeara commented Oct 2, 2024

So, it seems like something probably has an off-by-one error in the parser.

Maybe not.. "Onboard" shouldn't even be the path looked at, so it's probably more complicated than an off-by-one error.

@nathan-omeara

I happened to get hands-on with an Optiplex 7070 today, and verified it looks the same as my Latitude 5300: Length is 0xd0, and the character preceding "Onboard" in the TFTP request is also 0xd0.

@nathan-omeara

I had some time to do some poking around again - made a build with verbose enabled, and added some extra debug prints.

I made no real progress, I dumped some boot options from machines that work, and the basic structure seems the same, the option length value (0xd0 on my broken machines) does correctly point at the end of the end marker 0x7FFF0400 in every case, so I don't understand why specifically these Dells confuse the algorithm and end up with the length in the option name (or why it's trying to use the option name as a file path string at all, I feel like that should fail and not work in any of these cases).

I think I would need to enroll my custom signing cert in the UEFI of a working machine so I can run the verbose shim on them as well to make any more progress, and I won't have time to do that any time soon.

So I am still thinking my proposed fix over in #666 is still a good thing to do, even if this boot option parsing can be fixed.

@dbnicholson

> I made no real progress, I dumped some boot options from machines that work, and the basic structure seems the same, the option length value (0xd0 on my broken machines) does correctly point at the end of the end marker 0x7FFF0400 in every case, so I don't understand why specifically these Dells confuse the algorithm and end up with the length in the option name (or why it's trying to use the option name as a file path string at all, I feel like that should fail and not work in any of these cases).

It's not Dell in this case, but rather shim. What's happening (I'm pretty sure), is that shim tries to parse the load option. But it's a very strange load option. Here's what it looks like as a regular EFI variable:

00000000  07 00 00 00 01 00 00 00  d0 00 4f 00 6e 00 62 00  |..........O.n.b.|
00000010  6f 00 61 00 72 00 64 00  20 00 4e 00 49 00 43 00  |o.a.r.d. .N.I.C.|
00000020  28 00 49 00 50 00 56 00  34 00 29 00 00 00 02 01  |(.I.P.V.4.).....|
00000030  0c 00 d0 41 03 0a 00 00  00 00 01 01 06 00 06 1f  |...A............|
00000040  03 0b 25 00 2c ea 7f 0a  9f 69 00 00 00 00 00 00  |..%.,....i......|
00000050  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000060  00 00 00 00 00 03 0c 1b  00 00 00 00 00 00 00 00  |................|
00000070  00 00 00 00 00 00 00 00  00 00 00 00 00 00 00 00  |................|
00000080  7f ff 04 00 01 04 76 00  ef 47 64 2d c9 3b a0 41  |......v..Gd-.;.A|
00000090  ac 19 4d 51 d0 1b 4c e6  50 00 58 00 45 00 20 00  |..MQ..L.P.X.E. .|
000000a0  49 00 50 00 34 00 20 00  49 00 6e 00 74 00 65 00  |I.P.4. .I.n.t.e.|
000000b0  6c 00 28 00 52 00 29 00  20 00 45 00 74 00 68 00  |l.(.R.). .E.t.h.|
000000c0  65 00 72 00 6e 00 65 00  74 00 20 00 43 00 6f 00  |e.r.n.e.t. .C.o.|
000000d0  6e 00 6e 00 65 00 63 00  74 00 69 00 6f 00 6e 00  |n.n.e.c.t.i.o.n.|
000000e0  20 00 28 00 36 00 29 00  20 00 49 00 32 00 31 00  | .(.6.). .I.2.1.|
000000f0  39 00 2d 00 4c 00 4d 00  00 00 7f ff 04 00 00 00  |9.-.L.M.........|
00000100  42 4f                                             |BO|
00000102

Ignore the first 4 bytes (07 00 00 00) as those are the EFI variable attributes. There are 2 of the 0x7FFF0400 end of device path nodes. When shim sees the first one, it stops parsing. Then it checks whether the spot it got to corresponds to what the load option says is the length of the path list (d0 00, or 208 in little-endian). Since the device path length points after the second end of device path node rather than the first, it thinks it's not a valid load option.

When it thinks it's not an actual load option, it treats it like it's just a path and tries its best to strip it out after skipping an initial path as a workaround for some EFI shell misadventures. In your case, the first "path" is 01 00 00 00 where the 00 00 would be a NUL terminating the string in UCS-2. The second "path" that it ends up matching on is everything from d0 to the 00 00 NUL after the Onboard NIC description.
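To see how "ÐOnboard" can fall out of this, here is a small illustrative C sketch (not shim's code) that reads the load option bytes as UCS-2 and keeps only the low byte of each character. The function name `ucs2_word_to_bytes` is hypothetical, and stopping at the first space is an assumption made to match the filename observed in the TFTP request.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Copy one UCS-2 "word" starting at byte offset off into out as 8-bit
 * characters. Stops at a NUL or, as an assumption for this sketch, at a
 * space. out is always NUL-terminated when outlen > 0.
 * Returns the number of characters copied.
 */
static size_t ucs2_word_to_bytes(const uint8_t *buf, size_t len, size_t off,
                                 uint8_t *out, size_t outlen)
{
        size_t n = 0;

        if (outlen == 0)
                return 0;

        while (off + 1 < len && n + 1 < outlen) {
                /* UCS-2 code units are little-endian 16-bit values. */
                uint16_t c = (uint16_t)(buf[off] | (buf[off + 1] << 8));

                if (c == 0 || c == ' ')
                        break;
                out[n++] = (uint8_t)c;  /* keep only the low byte */
                off += 2;
        }
        out[n] = '\0';
        return n;
}
```

Fed the load option from the dump (starting at 01 00 00 00, i.e. after the variable attributes), offset 0 yields a one-character string 0x01 followed by a NUL, and offset 4 yields 0xd0 followed by "Onboard", which matches the filename seen on the wire.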

So, I think the fix is to ignore the case where the load option end of device path node was found before the device path length said it would. That's unusual but valid. What would end up happening in that case is shim would try to parse the actual load option optional data it was looking for. Those are the 00 00 42 4f bytes at the end. I don't know what those are supposed to represent, but they'd be treated by shim as an empty string since it would find the leading NUL and you would keep the default second stage of using grub. I have a patch to try that out.
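The difference between the strict check and the proposed relaxed one can be sketched as follows. This is a simplified model of the logic described in this comment, not the code in shim or in the patch; `find_end_entire`, `path_list_ok_strict` and `path_list_ok_relaxed` are names invented for the illustration.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define END_DEVICE_PATH_TYPE 0x7f
#define END_ENTIRE_SUBTYPE   0xff

/*
 * Walk device path nodes inside the file path list. Each node is
 * type (1 byte), subtype (1 byte), length (2 bytes, little-endian,
 * counting the 4-byte header). On success, *end is the offset just
 * past the first End Entire Device Path node.
 */
static bool find_end_entire(const uint8_t *fplist, size_t fplistlen,
                            size_t *end)
{
        size_t i = 0;

        while (fplistlen - i >= 4) {
                uint8_t type = fplist[i];
                uint8_t subtype = fplist[i + 1];
                size_t nodelen = (size_t)fplist[i + 2] |
                                 ((size_t)fplist[i + 3] << 8);

                /* A node must cover its header and fit in the list. */
                if (nodelen < 4 || nodelen > fplistlen - i)
                        return false;
                i += nodelen;
                if (type == END_DEVICE_PATH_TYPE &&
                    subtype == END_ENTIRE_SUBTYPE) {
                        *end = i;
                        return true;
                }
        }
        return false;
}

/* Strict: the end node must be the very last thing in the list. */
static bool path_list_ok_strict(const uint8_t *fplist, size_t fplistlen)
{
        size_t end;

        return find_end_entire(fplist, fplistlen, &end) && end == fplistlen;
}

/* Relaxed: an end node anywhere inside the list is good enough. */
static bool path_list_ok_relaxed(const uint8_t *fplist, size_t fplistlen)
{
        size_t end;

        return find_end_entire(fplist, fplistlen, &end);
}
```

With a Dell-style list holding two device paths, each closed by 7f ff 04 00, the strict variant rejects the option because the first end node stops short of the declared length, while the relaxed variant accepts it and lets the caller read the optional data that follows the declared length.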

> I think I would need to enroll my custom signing cert in the UEFI of a working machine so I can run the verbose shim on them as well to make any more progress, and I won't have time to do that any time soon.

That or temporarily turn off secure boot. Shim does all the same things in that case except for validating the image it's going to start.

> So I am still thinking my proposed fix over in #666 is still a good thing to do, even if this boot option parsing can be fixed.

I think it is a good idea, but I think the implementation would be better if it converted actual not-found cases to EFI_NOT_FOUND.

dbnicholson added a commit to dbnicholson/shim that referenced this issue Oct 2, 2024
When looking for load option optional data, the parser asserts that the
byte after the end of device path node is the same as what the file path
length says it should be. While unusual, it is valid if the end of
device path node comes before the end of the file path list.

That supports some unusual Dell load options where there are two device
paths in the list but the first is terminated by an End Entire Device
Path. Maybe they intended to use an End Device Path Instance node there?
Who knows. Either way, treating it as invalid ends up trying to read
paths from the beginning of the option with obviously poor results.

Fixes: rhboot#649

Signed-off-by: Dan Nicholson <dbn@endlessos.org>
@dbnicholson

If anyone wants to give #694 a spin, that would be great.

@dbnicholson

> If anyone wants to give #694 a spin, that would be great.

If you're going to try that as just a patch, then you'll also need 0287c6b.

@nathan-omeara

That fits what I was seeing, I was able to confirm that we exit get_load_option_optional_data() here:

if (i != fplistlen)

with i less than 208.

I feel like that function should be a loop until i >= fplistlen, instead of a single pass, but I was second guessing myself.

@nathan-omeara

> If anyone wants to give #694 a spin, that would be great.

I checked out your branch and built it, and can confirm it requests grubx64.efi instead of a junk filename.

I see your patch is not looping on get_load_option_optional_data(), but I assume since there's only two entries, first we fail with the initial pass here:

shim/load-options.c

Lines 421 to 443 in e064e7d

        efi_status = get_load_option_optional_data(li->LoadOptions,
                                                   li->LoadOptionsSize,
                                                   &li->LoadOptions,
                                                   &li->LoadOptionsSize);
        if (EFI_ERROR(efi_status)) {
                /*
                 * it's not an EFI_LOAD_OPTION, so it's probably just a string
                 * or list of strings.
                 *
                 * UEFI shell copies the whole line of the command into
                 * LoadOptions. We ignore the first string, i.e. the name of this
                 * program in this case.
                 */
                loader_str = split_load_options(li->LoadOptions,
                                                li->LoadOptionsSize,
                                                &remaining,
                                                &remaining_size);
                if (loader_str && is_our_path(li, loader_str)) {
                        li->LoadOptions = remaining;
                        li->LoadOptionsSize = remaining_size;
                }
        }

and then when it runs split_load_options again here:

shim/load-options.c

Lines 445 to 446 in e064e7d

        loader_str = split_load_options(li->LoadOptions, li->LoadOptionsSize,
                                        &remaining, &remaining_size);

it's parsing the second option, and that gets us the right data?

I will do a deeper dive tomorrow, but that's my quick impression.

@dbnicholson

What's happening is get_load_option_optional_data doesn't fail with my patch. Which is what it's supposed to do since it's a valid (albeit weird) load option. The double split_load_options is a special case for the EFI shell's weird entries. Shim is assuming that if get_load_option_optional_data succeeded that the load options data doesn't need that special treatment.

The whole point of parse_load_options is to find if there's optional data. If there is optional data, then shim tries to use it to determine the second stage loader instead of always using grub. In older shim releases it didn't bother with this, but that prevented legitimate use cases like fwupd's where it wants to create a boot entry specifying to execute it as the second stage.

parse_load_options is trying to parse the EFI_LOADED_IMAGE LoadOptions field to see if there's a different second stage to use. What the spec says is that LoadOptions should correspond to the OptionalData field of the loaded EFI_LOAD_OPTION. Unfortunately, some firmware like Dell's stuffs an entire EFI_LOAD_OPTION in there.

What get_load_option_optional_data does is try to determine whether LoadOptions corresponds to a full EFI_LOAD_OPTION. If it looks like one, then it just wants to get the OptionalData field at the end. If it doesn't look like a full EFI_LOAD_OPTION, then it tries to parse the LoadOptions data as is. In either case, it's trying to parse OptionalData. The only difference is where in the LoadOptions data to start deciphering. split_load_options is what does the deciphering of the optional data. Without my patch, it was starting at the beginning of the LoadOptions data since get_load_option_optional_data failed.
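
To make the "where does OptionalData start" question concrete, here is a minimal sketch of locating OptionalData in a buffer that is assumed to hold a complete EFI_LOAD_OPTION. The function name find_optional_data is hypothetical and this is not shim's implementation (in particular, it does not validate the device path nodes); only the field layout (UINT32 Attributes, UINT16 FilePathListLength, NUL-terminated UCS-2 Description, FilePathList, OptionalData) comes from the UEFI spec.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper, not shim's code: find OptionalData inside a buffer
 * assumed to contain a full EFI_LOAD_OPTION. Returns 0 on success and -1
 * if the buffer is too short to hold the fields it claims to have.
 */
static int find_optional_data(const uint8_t *buf, size_t size,
                              const uint8_t **opt, size_t *opt_size)
{
    assert(buf != NULL && opt != NULL && opt_size != NULL);

    const size_t header = 4 + 2; /* Attributes + FilePathListLength */
    if (size < header)
        return -1;

    /* FilePathListLength is little-endian and may be unaligned. */
    size_t fplistlen = (size_t)buf[4] | ((size_t)buf[5] << 8);

    /* Skip the Description, including its 2-byte UCS-2 NUL terminator. */
    size_t pos = header;
    for (;;) {
        if (size - pos < 2)
            return -1; /* ran out of data before the terminator */
        uint16_t ch = (uint16_t)(buf[pos] | (buf[pos + 1] << 8));
        pos += 2;
        if (ch == 0)
            break;
    }

    /* OptionalData begins fplistlen bytes after the Description. */
    if (size - pos < fplistlen)
        return -1;
    pos += fplistlen;

    *opt = buf + pos;
    *opt_size = size - pos;
    return 0;
}
```

When the firmware follows the spec, LoadOptions is already just the OptionalData and none of this skipping is needed. When the firmware passes the whole EFI_LOAD_OPTION, as described above for Dell, the parser has to skip everything up to that offset first.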

Not sure about this part of my earlier analysis:

When it thinks it's not an actual load option, it treats it like it's just a path and tries its best to strip it out after skipping an initial path as a workaround for some EFI shell misadventures. In your case, the first "path" is 01 00 00 00 where the 00 00 would be a NUL terminating the string in UCS-2. The second "path" that it ends up matching on is everything from d0 to the 00 00 NUL after the Onboard NIC description.

It should only skip the first path if it appears to match the path of the loaded image. I doubt the path of the loaded image looks like 01 00 00 00 in your case. Regardless, it's clearly parsing in the wrong spot.
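
As a rough illustration of why 01 00 00 00 reads as a "path" when the data is treated as a list of strings: scanning for the first UCS-2 NUL stops right after the 4-byte Attributes field. The helper below is hypothetical and is not shim's split_load_options; it only shows the terminator scan.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical illustration, not shim's split_load_options(): return the
 * byte length of the first UCS-2 string in buf, including its 2-byte NUL
 * terminator, or 0 if no terminator is found.
 */
static size_t first_ucs2_string_len(const uint8_t *buf, size_t size)
{
    assert(buf != NULL || size == 0);

    /* Step through the buffer one 16-bit code unit at a time. */
    for (size_t i = 0; i + 1 < size; i += 2) {
        if (buf[i] == 0 && buf[i + 1] == 0)
            return i + 2;
    }
    return 0;
}
```

For a buffer starting with 01 00 00 00, the first code unit is 0x0001 and the second is the NUL, so the first "string" is 4 bytes long and the next one begins in the middle of the load option header.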

@dbnicholson

I feel like that function should be a loop until i >= fplistlen, instead of a single pass, but I was second guessing myself.

I considered that, but I think it's more correct the way I have it. All that get_load_option_optional_data is trying to do is determine if the data is a full EFI_LOAD_OPTION. If that's the case, the optional data comes after fplistlen bytes of device path data. Once the end device path node has been found within the fplistlen bytes, it's proven that it's a real EFI_LOAD_OPTION. It's weird, but there's nothing wrong with sticking more data between the end of device path node and where the optional data begins. Per the spec:

The FilePathList[0] is specific to the device type. Other device paths may optionally exist in the FilePathList, but their usage is OSV specific.

It would be nice to determine that all the paths in the device path list were valid, but shim isn't a load option linter. It's just trying to determine how to use the data based on its shape.
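
A minimal sketch of the single-pass idea, assuming the standard device path node header (UINT8 Type, UINT8 SubType, little-endian UINT16 Length) and the End Entire Device Path node (type 0x7f, subtype 0xff). The function name is hypothetical and this is not the code from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define END_DEVICE_PATH_TYPE 0x7f
#define END_ENTIRE_SUBTYPE   0xff

/*
 * Hypothetical sketch: return the offset just past the first End Entire
 * Device Path node found within the first fplistlen bytes, or 0 if a node
 * is malformed or no end node is found.
 */
static size_t first_device_path_end(const uint8_t *fpl, size_t fplistlen)
{
    assert(fpl != NULL || fplistlen == 0);

    size_t i = 0;
    while (fplistlen - i >= 4) {
        size_t len = (size_t)fpl[i + 2] | ((size_t)fpl[i + 3] << 8);

        /* A node must cover its own header and fit in what remains. */
        if (len < 4 || len > fplistlen - i)
            return 0;
        if (fpl[i] == END_DEVICE_PATH_TYPE &&
            fpl[i + 1] == END_ENTIRE_SUBTYPE)
            return i + len;
        i += len;
    }
    return 0;
}
```

A nonzero result may be smaller than fplistlen when the list holds more than one device path. Under the reasoning above that is still acceptable: the optional data is taken to start fplistlen bytes after the beginning of the FilePathList, not at the returned offset.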

@dbnicholson

I took a closer look at the file not found fallback, and I think #695 is a nicer way to handle it. It's completely untested, though.

@christoph-at-unicon

It's been a few months since this was reported, and various ways to deal with this have been proposed. However, there's still no official solution, i.e. a commit in https://github.com/rhboot/shim. I'd like to bring this to a conclusion. Can I help with that? I have various Dell devices available for testing. Our customers use them a lot, and of course I'd prefer things to be smooth for them.

@raddirad
Author

raddirad commented Nov 6, 2024

maybe @vathpela @jsetje or @julian-klode could say more on this

@raddirad
Author

raddirad commented Nov 6, 2024

or maybe @aronowski could help with bringing this to upstream

@ewimar

ewimar commented Dec 19, 2024

We at Landesmedienzentrum Baden-Württemberg, a federal authority in Germany, support about 2000 schools with their school IT networks. We rely on opsi, a canny open-source solution, for OS and software deployment. Some schools have been affected by the issues mentioned here in #649 and they are still unable to (re)install computers in their network.
We would greatly appreciate any progress on #649 and #666 to finally get a patch review on #428.
Thanks and cheers to all contributors!

@pjwelsh

pjwelsh commented Dec 20, 2024 via email

@nathan-omeara

Originally, I thought just removing network boot and USB boot from the automatic boot order fixed it, because it does on my test laptop. Then someone told me it didn't work on theirs of the same model. I believe the deciding factor is this: if you have a BIOS/firmware admin password set, it doesn't matter whether you remove those from the automatic boot order; it still confuses shim and fails to netboot.

So in many secure environments, there is no 'acceptable' answer to this other than booting from USB instead.
