config is valid at installation but invalid at runtime when disabling config persistence #3676

Closed
OGKevin opened this issue May 26, 2021 · 6 comments · Fixed by #3688

Comments

@OGKevin
Contributor

OGKevin commented May 26, 2021

Bug Report

When using the following config:

cluster:
  ca:
    crt: <crt>
    key: ""
  controlPlane:
    endpoint: <endpoint>
  network:
    dnsDomain: cluster.local
    podSubnets:
    - 10.244.0.0/16
    serviceSubnets:
    - 10.96.0.0/12
  token: <token>
debug: false
machine:
  install:
    disk: /dev/mmcblk0
  kubelet:
    extraArgs:
      node-ip: 192.168.2.210
      node-labels: metal.sidero.dev/uuid=00c03114-0000-0000-0000-e45f011f4f30
  network:
    interfaces:
    - cidr: 192.168.2.210/24
      interface: wg0
      wireguard:
        peers:
        - allowedIPs:
          - 0.0.0.0/0
          endpoint: 83.84.110.84:51820
          publicKey: <key>
        privateKey: <key>
    - dhcp: true
      interface: eht0
  token: <token>
  type: worker
persist: false
version: v1alpha1

The installation of Talos succeeds. However, after rebooting into Talos, the config is re-fetched because of persist: false, config validation fails, and the node ends up in a boot loop.

When I change persist to true, re-install Talos, and reboot into Talos, it works as expected.

Logs

Searching the code for the error message, I found https://github.com/talos-systems/go-smbios/blob/d3a32bea731a0c2a60ce7f5eae60253300ef27e1/smbios/smbios.go#L60, which is a perfect match:
failed to decode structures: unexpected EOF

Environment

  • booting via Sidero
  • Talos version: v0.10.3
  • Platform: metal rpi_4

Kernel args

kernel:
      args:
      - init_on_alloc=1
      - initrd=initramfs.xz
      - ip=dhcp
      - slab_nomerge
      - pti=on
      - talos.board=rpi_4
      - talos.shutdown=poweroff
      - talos.config=http://192.168.1.200:30005/configdata?uuid=
      - talos.platform=metal
@smira
Member

smira commented May 26, 2021

Kevin, do you have a screenshot or log of the boot?

SMBIOS shouldn't be touched on config validation, there must be something else here.

@OGKevin
Contributor Author

OGKevin commented May 26, 2021

Unfortunately, I can't provide logs or a screenshot. What I can provide is a literal screenshot, though 😀
IMG_1847

@smira
Member

smira commented May 26, 2021

I think there's something else at play here, and it's not related to the machine configuration at all.

What is happening is that Talos fails to load SMBIOS data while trying to fill in the ?uuid= argument.

First of all, do you know where that UUID 00c03114-* is coming from? Sidero doesn't pass it filled in.

Do you have different versions of Talos in the PXE boot environment in Sidero and in the installer config? It feels like this should never have worked even on the first boot, unless some recent change broke SMBIOS reading on Raspberry Pi.

@OGKevin
Contributor Author

OGKevin commented May 26, 2021

> Do you have different versions of Talos in PXE boot environment in Sidero and in installer config?

Nope, Sidero is configured to pull the v0.10.3 kernel and initramfs. Nothing is specified in the install config, so it also defaults to v0.10.3.

> First of all, do you know where that UUID 00c03114-* is coming from?

That's a good and interesting question. Did you ask because it's printed before it's actually fetched 😅 ?

https://github.com/talos-systems/talos/blob/22f375300c1cc1d95db540afd510a21b66d7c8a3/internal/app/machined/pkg/runtime/v1alpha1/platform/metal/metal.go#L47-L68

I would need to do some digging to come up with the answer.

@smira
Member

smira commented May 26, 2021

> Did you ask because it's printed before it's actually fetched 😅 ?

Yes, because Sidero passes ?uuid= without the value, and the code which fails is supposed to fill in the value.

So I wonder if we have a bug in the SMBIOS library on RPi or arm64, because you could be the first one to test this :)

@OGKevin
Contributor Author

OGKevin commented May 26, 2021

Yeah, but I wonder how, because

log.Printf("fetching machine config from: %q", *option)

seems to print the correct URL, and the code that follows tries to fetch the UUID even though it already has it 🤔 So in theory, changing the code to check whether the uuid value is already non-empty would bypass SMBIOS entirely. But then I'm also wondering where and how the UUID is populated in the first place.

I'll do some digging and see what I'll find.

OGKevin added a commit to OGKevin/talos that referenced this issue May 27, 2021
During boot sequence, if `talos.config`'s url has the uuid parameter, the uuid
value is retrieved via SMBIOS. However, at this part of the code it can happen
that the uuid is already set and valid. If this is the case, instead of
re-fetching the uuid, the one that is already set can be used.

closes siderolabs#3676
talos-bot pushed a commit that referenced this issue May 31, 2021
During boot sequence, if `talos.config`'s url has the uuid parameter, the uuid
value is retrieved via SMBIOS. However, at this part of the code it can happen
that the uuid is already set and valid. If this is the case, instead of
re-fetching the uuid, the one that is already set can be used.

closes #3676

Signed-off-by: Kevin Hellemun <17928966+OGKevin@users.noreply.github.com>
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Jun 25, 2024