This issue was moved to a discussion.

Talos install error - couldn't get current server API group list: - tls: internal error #1401

Closed
DavidIlie opened this issue Apr 2, 2024 · 14 comments

Comments

@DavidIlie

Following the pathway to install Talos, I have an issue where the cluster does seem to set up, but the workers do not join and the masters show these errors:

error refreshing pod status, where the error is related to TLS (tls: internal error)
and controller failed errors as well
image

This is what I see in my terminal while setting it up:
image

@onedr0p
Owner

onedr0p commented Apr 2, 2024

I think this is expected due to etcd not being bootstrapped?

    bootstrap-install:
      desc: Install the Talos cluster
      dir: "{{.TALOS_DIR}}"
      cmds:
        - echo "Installing Talos... ignore the errors and be patient"
        - until talhelper gencommand bootstrap | bash; do sleep 10; done
        - sleep 10
      preconditions:
        - { msg: "Missing talhelper config file", sh: "test -f {{.TALHELPER_CONFIG_FILE}}" }

It doesn't seem like the above was run. What point in the task commands did you get to, and were there any errors on the client side?
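For readers following along: the `until ...; do sleep 10; done` loop in the task above retries forever, which can hide a bootstrap that will never succeed. A bounded variant might look like the sketch below. The `retry` helper is hypothetical and not part of the template, and the attempt count and delay shown in the usage comment are arbitrary assumptions.

```shell
#!/bin/sh
# retry MAX DELAY CMD [ARGS...]
# Hypothetical helper (not part of the template): run CMD until it succeeds,
# giving up after MAX attempts and sleeping DELAY seconds between attempts.
retry() {
  _retry_max="$1"
  _retry_delay="$2"
  shift 2
  _retry_attempt=1
  while :; do
    # Success: stop retrying and report success to the caller.
    if "$@"; then
      return 0
    fi
    # Out of attempts: report the failure on stderr instead of looping forever.
    if [ "$_retry_attempt" -ge "$_retry_max" ]; then
      echo "giving up after $_retry_attempt attempts: $*" >&2
      return 1
    fi
    _retry_attempt=$((_retry_attempt + 1))
    sleep "$_retry_delay"
  done
}

# Possible usage mirroring the task above (assumes talhelper is on PATH;
# 30 attempts and a 10 second delay are arbitrary):
# retry 30 10 sh -c 'talhelper gencommand bootstrap | bash'
```

With a limit in place, a failed run ends with a visible error on the client side rather than an endless stream of retries.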

@bojanraic

In addition to @onedr0p's comments, when I tried to install Talos manually, it took some time for the master to be ready.
In both of the screenshots, uptime is only a few minutes so maybe it hasn't finished bootstrapping yet.
I went back to k3s in the meantime, but please update us on your Talos install progress via this template and I may take another stab at it when time permits.
Good luck!

@DavidIlie
Author

I left it running the whole night yesterday and the same thing happened. I am also fairly sure that all the scripts are running, and before the node first reboots/loads there are errors about something like an "admin" certificate.

Any ideas?

@onedr0p
Owner

onedr0p commented Apr 2, 2024

Maybe give it another shot when you have a moment? Not sure what happened here, to be honest; it could be a ton of different issues, from misconfig to network issues to anything else really :/

The important bits of the config that can really go wrong if not set right are the network and disk selectors.

@DavidIlie
Author

Disk selectors work, I believe; data is being written to the disk, and the network is working on all the nodes, I believe.

The error is just the "tls: internal error" every time the masters try to fetch something from their own localhost IP

@DavidIlie
Author

The bootstrap first begins with these errors in the console

image

But I believe that's when the nodes get rebooted, as it then boots and continues until kubelet is healthy, but the error is back:

image

And then my terminal tries to connect to the VIP and nothing happens

image

@onedr0p
Owner

onedr0p commented Apr 3, 2024

I saw this in your previous config (sorry this is all I have to go on from #1398 (comment))

    networkInterfaces:
      - deviceSelector:
          hardwareAddr: ""

That should be the node's MAC address; are you sure this is populated? It should be in xx:xx:xx:xx:xx:xx format and be unique per node.

# talos_nic: "" # (Required: Talos) MAC address of the NIC for this node (talosctl get links -n <ip> --insecure)
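As an illustration of the two constraints mentioned here (xx:xx:xx:xx:xx:xx format, unique per node), a quick local check could be sketched as below. This is not the validation that was added to the template; `is_valid_mac` and `duplicate_macs` are hypothetical helper names, and the addresses in the usage comment are made-up examples.

```shell
#!/bin/sh
# is_valid_mac MAC
# Hypothetical helper: succeed only if MAC is six colon-separated hex octets.
is_valid_mac() {
  printf '%s\n' "$1" | grep -Eq '^([0-9A-Fa-f]{2}:){5}[0-9A-Fa-f]{2}$'
}

# duplicate_macs MAC...
# Hypothetical helper: print each MAC that appears more than once,
# compared case-insensitively (addresses are lowercased first).
duplicate_macs() {
  for _mac in "$@"; do
    printf '%s\n' "$_mac" | tr 'A-F' 'a-f'
  done | sort | uniq -d
}

# Possible usage with made-up addresses:
# is_valid_mac "bc:24:11:aa:bb:01" || echo "bad talos_nic value" >&2
# duplicate_macs "bc:24:11:aa:bb:01" "bc:24:11:aa:bb:02"
```

An empty string, as in the `hardwareAddr: ""` snippet above, fails the format check, which is the case the question is trying to rule out.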

@onedr0p
Owner

onedr0p commented Apr 3, 2024

I added validation on talos_nic here to hopefully catch this for other people in the future.

@DavidIlie
Author

I already populated those, I just redacted them when I sent it here. Every single value is present

@DavidIlie
Author

2024-04-04.00-02-53.mp4

This is a recording of what happens

@onedr0p
Owner

onedr0p commented Apr 3, 2024

I wonder if you need to use a different type of network selector in the Talos/talhelper config or change something in the NIC settings on the VM in Proxmox?

I just hand-held someone through the whole repo who is using bare-metal nodes, and we had success after figuring out they were not setting the correct value for talos_nic, which led me to commit validation on that.

@DavidIlie
Author

Do you have an example of what I would need to do?

@onedr0p
Owner

onedr0p commented Apr 3, 2024

I am probably not the best person to ask about that as I do not use any hypervisors in my life right now 😄

Maybe a good start is to review the Talos Proxmox docs and see if everything lines up there and with the rendered config here.

@onedr0p
Owner

onedr0p commented Apr 3, 2024

Keep in mind there are a bunch of different network selectors you can use, so maybe the MAC address is not the best with PVE? I dunno.

Repository owner locked and limited conversation to collaborators Apr 5, 2024
@onedr0p onedr0p converted this issue into discussion #1407 Apr 5, 2024
