Unable to deploy EKS-A to vSphere cluster #7954

Open
galvesribeiro opened this issue Apr 10, 2024 · 8 comments


galvesribeiro commented Apr 10, 2024

What happened:
Unable to deploy EKS-A on ESXi 8 U1.

What you expected to happen:
The initial cluster to be deployed

How to reproduce it (as minimally and precisely as possible):
Just follow the process from the documentation to deploy the initial cluster.

When it tries to deploy the first etcd VM from the templates, the VM is created, but shortly after creation it is removed and I see the following error:

Screenshot of the vSphere task error:

A specified parameter was not correct: spec.config.deviceChange[0].operation

I've tried multiple Bottlerocket versions (Kubernetes 1.26 through 1.29) and all of them fail. I also tried on two completely separate ESXi/vSphere clusters with the same results.

Environment:
Latest EKS-A CLI (from brew) on macOS Sonoma (fully updated) deploying to ESXi/vCenter/vSAN 8U1.

@Darth-Weider

I would check the spec.datastore setting first. Can you post your cluster manifest?


galvesribeiro commented Apr 12, 2024

@Darth-Weider thanks for the reply.

For those having similar issues, here is a TL;DR:

  1. Set VSphereMachineConfig.spec.cloneMode to linkedClone AND remove the diskGiB field, which is added by default when you run the generate command on the CLI.
  2. pods.cidrBlocks and services.cidrBlocks must NOT collide with the DHCP range either. The DHCP range is ONLY used for the VM IPs.

I've finally figured out what is going on here. There were a few things that weren't really clear when reading the docs (see the sketch at the end of this comment):

  1. When VSphereMachineConfig.spec.cloneMode is not set and diskGiB is set to anything (as it is after the generate command runs), deployment throws that error. When we set cloneMode to linkedClone, validation then failed, saying we shouldn't set diskGiB. We removed diskGiB and it worked; the images were deployed just fine.
  2. The next failure was a complaint that the control plane IP was not unique, when I'm pretty sure it was: I had created a single VLAN/subnet (10.254.0.0/16) specifically for EKS-A, excluded .1 through .100 from the DHCP range (with .1 as the gateway), and made .10 the control plane VIP, yet it kept saying .10 was in use. I then ran with --skip-ip-verification, as suggested in another issue here, and it got past that check, but the etcd nodes never became ready and the process kept looping waiting for them. It turns out the documentation doesn't make clear that pods.cidrBlocks/services.cidrBlocks must be a network that (1) doesn't collide with the host and other subnets on your physical network AND (2) is not part of the DHCP range (because of (1)). As soon as I created a VLAN with 10.170.0.0/24, made the control plane VIP .10, set the DHCP range to .100-.154, and kept pods.cidrBlocks at 10.254.0.0/16 (a non-routable address space on the physical network), it just worked. The source of the confusion is that other Kubernetes distros I've used rely on CNIs where pods and/or services get IPs on the underlying network, but that is not the case with EKS-A.
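
A minimal sketch of the two changes, using the field names from the manifests posted later in this thread and the values described in this comment (only the relevant fields are shown; the CIDRs are environment-specific):

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu-etcd          # the same change applies to the CP and worker machine configs
spec:
  cloneMode: linkedClone     # with linkedClone, the diskGiB field must be removed
                             # (the generate command adds diskGiB by default)
  datastore: vsandatastore
  osFamily: bottlerocket
---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: awsemu
spec:
  clusterNetwork:
    pods:
      cidrBlocks:
      - 10.254.0.0/16        # not routable on the physical network and outside the VM DHCP range
    services:
      cidrBlocks:
      - 10.96.0.0/12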


galvesribeiro commented Apr 13, 2024

The fun fact is that this is not consistent. I've created the same config multiple times in the same environment, and sometimes the process fails at the end during "Creating EKS-A namespace" with "The connection to the server localhost:8080 was refused", which makes no sense since I don't have anything listening on localhost:8080.

Here is the config:

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: awsemu
spec:
  clusterNetwork:
    cniConfig:
      cilium: {}
    pods:
      cidrBlocks:
      - 172.18.0.0/16
    services:
      cidrBlocks:
      - 10.96.0.0/12
  controlPlaneConfiguration:
    count: 3
    endpoint:
      host: "172.16.1.1"
    machineGroupRef:
      kind: VSphereMachineConfig
      name: awsemu-cp
  datacenterRef:
    kind: VSphereDatacenterConfig
    name: datacenter
  externalEtcdConfiguration:
    count: 3
    machineGroupRef:
      kind: VSphereMachineConfig
      name: awsemu-etcd
  kubernetesVersion: "1.29"
  managementCluster:
    name: awsemu
  workerNodeGroupConfigurations:
  - count: 1
    machineGroupRef:
      kind: VSphereMachineConfig
      name: awsemu
    name: md-0

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereDatacenterConfig
metadata:
  name: datacenter
spec:
  datacenter: datacenter
  insecure: false
  network: workload
  server: 192.168.8.12
  thumbprint: "27:44:A2:74:89:B4:D3:4E:97:30:D7:AF:3B:88:06:F4:08:0C:4F:D7"

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu-cp
spec:
  cloneMode: linkedClone
  datastore: vsandatastore
  folder: Kubernetes/Management/Control Plane
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /datacenter/host/hwcluster/Resources
  storagePolicyName: ""
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDGidVzdPHSLPNq7i4+r1AD2bfAQmEC8NmZM1V0vN7jMIW2QZSflL2LrCpGk0969FHesOUTM1x61B5oYepsLjYgSKDC2mNxIg2jZONPYCg30fxE5vOxWUJObCGuc4trKfz9DLPx7+C3fGgXQaFmnugMgRbqYurdrr8HDeXsavwN361x/MesKpY4E26SBt/RG/sZEssVnzeIPbM8S9LDOX62znFYIXRlgmmx9un68TqQpMti6CnIWUlYwx90MJkV0avL5BeSg9ex3JxYH1THQw3tcj5gyh9GY9yWVxXA7bs3wh5vd8JAJEtPpeqaafRaqXfBFWzC3/L21GxVCwgvGAjovhdDGk3vn6PNRKf4b1MydHnVK7/lZnpNpenDYCszSEebkS5joqehpkaJ4eED1ACvJeh/0urupu47RMN6DcwLUR7j3o7sxcXZK31lecgogC7yvC5eZGK/B6rwHyV3xX7KaVcfabJJeiiJgrb2cKesiKDFgR8DlQ+sUrdwUIcsxsoOskYZJQuvH/h2Gi7lZv71uABnQLvcAeF6OSj7vnrsQ7oUKdcJhAfoRdJCOEt1PtgyDfe2WJ9gH3KRbuHxnNVyQKNZaI5OtEPCxlPIyXbGQnsTwZ1AiWj/RYbj3DP3aCM3Iu7Lg7z/dVGSnRfWJk0zdcZekGch0O43H0EX7611kQ==

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu
spec:
  cloneMode: linkedClone
  datastore: vsandatastore
  folder: Kubernetes/Management/Worker Nodes
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /datacenter/host/hwcluster/Resources
  storagePolicyName: ""
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDGidVzdPHSLPNq7i4+r1AD2bfAQmEC8NmZM1V0vN7jMIW2QZSflL2LrCpGk0969FHesOUTM1x61B5oYepsLjYgSKDC2mNxIg2jZONPYCg30fxE5vOxWUJObCGuc4trKfz9DLPx7+C3fGgXQaFmnugMgRbqYurdrr8HDeXsavwN361x/MesKpY4E26SBt/RG/sZEssVnzeIPbM8S9LDOX62znFYIXRlgmmx9un68TqQpMti6CnIWUlYwx90MJkV0avL5BeSg9ex3JxYH1THQw3tcj5gyh9GY9yWVxXA7bs3wh5vd8JAJEtPpeqaafRaqXfBFWzC3/L21GxVCwgvGAjovhdDGk3vn6PNRKf4b1MydHnVK7/lZnpNpenDYCszSEebkS5joqehpkaJ4eED1ACvJeh/0urupu47RMN6DcwLUR7j3o7sxcXZK31lecgogC7yvC5eZGK/B6rwHyV3xX7KaVcfabJJeiiJgrb2cKesiKDFgR8DlQ+sUrdwUIcsxsoOskYZJQuvH/h2Gi7lZv71uABnQLvcAeF6OSj7vnrsQ7oUKdcJhAfoRdJCOEt1PtgyDfe2WJ9gH3KRbuHxnNVyQKNZaI5OtEPCxlPIyXbGQnsTwZ1AiWj/RYbj3DP3aCM3Iu7Lg7z/dVGSnRfWJk0zdcZekGch0O43H0EX7611kQ==

---
apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu-etcd
spec:
  cloneMode: linkedClone
  datastore: vsandatastore
  folder: Kubernetes/Management/ETCD
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /datacenter/host/hwcluster/Resources
  storagePolicyName: ""
  users:
  - name: ec2-user
    sshAuthorizedKeys:
    - ssh-rsa AAAAB3NzaC1yc2EAAAADAQABAAACAQDGidVzdPHSLPNq7i4+r1AD2bfAQmEC8NmZM1V0vN7jMIW2QZSflL2LrCpGk0969FHesOUTM1x61B5oYepsLjYgSKDC2mNxIg2jZONPYCg30fxE5vOxWUJObCGuc4trKfz9DLPx7+C3fGgXQaFmnugMgRbqYurdrr8HDeXsavwN361x/MesKpY4E26SBt/RG/sZEssVnzeIPbM8S9LDOX62znFYIXRlgmmx9un68TqQpMti6CnIWUlYwx90MJkV0avL5BeSg9ex3JxYH1THQw3tcj5gyh9GY9yWVxXA7bs3wh5vd8JAJEtPpeqaafRaqXfBFWzC3/L21GxVCwgvGAjovhdDGk3vn6PNRKf4b1MydHnVK7/lZnpNpenDYCszSEebkS5joqehpkaJ4eED1ACvJeh/0urupu47RMN6DcwLUR7j3o7sxcXZK31lecgogC7yvC5eZGK/B6rwHyV3xX7KaVcfabJJeiiJgrb2cKesiKDFgR8DlQ+sUrdwUIcsxsoOskYZJQuvH/h2Gi7lZv71uABnQLvcAeF6OSj7vnrsQ7oUKdcJhAfoRdJCOEt1PtgyDfe2WJ9gH3KRbuHxnNVyQKNZaI5OtEPCxlPIyXbGQnsTwZ1AiWj/RYbj3DP3aCM3Iu7Lg7z/dVGSnRfWJk0zdcZekGch0O43H0EX7611kQ==

---

This also leaves behind all the VMs it created, and leaves the cluster in a state where it isn't ready and can't be deleted with eksctl, so all we can do is manually stop and delete each VM...

@Darth-Weider

@galvesribeiro Can you try fullClone instead of linkedClone? Also, the CP node IP address is set to "172.16.1.1"? Is that your VLAN gateway IP? And does your EKS-A VLAN have access to your vCenter API endpoint?

@galvesribeiro

@Darth-Weider

Can you try fullClone instead of linkedClone?

fullClone is what was causing vSphere to fail with that message, as you can see in the screenshot (A specified parameter was not correct: spec.config.deviceChange[0].operation). I was only able to get past it and deploy the VMs with linkedClone. Otherwise, that error appears in vSphere and the EKS-A CLI keeps looping, "waiting" for etcd to get ready, which clearly will never happen 😄.

Also, the CP node IP address is set to "172.16.1.1"? Is that your VLAN gateway IP?

No. The network is laid out as follows (see the sketch at the end of this comment for how it maps onto the manifest):

  • Address space: 172.16.0.0/16
  • Gateway/DNS: 172.16.0.1
  • DHCP range: 172.16.2.1-254
  • CP: 172.16.1.1

And does your EKS-A VLAN have access to your vCenter API endpoint?

Yep. vCenter is at 192.168.8.12, which is routable through the 172.16.0.1 gateway.
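
For reference, this layout maps onto the Cluster manifest posted earlier roughly like this (only the control plane endpoint is shown):

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: Cluster
metadata:
  name: awsemu
spec:
  controlPlaneConfiguration:
    endpoint:
      host: "172.16.1.1"   # static CP VIP: inside 172.16.0.0/16, outside the 172.16.2.1-254 DHCP range, and not the 172.16.0.1 gateway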

@bsmithtm

@galvesribeiro I believe the problem you had during "Creating EKS-A namespace" ("The connection to the server localhost:8080 was refused") is unrelated, and I've had experience with it too.

I posted my results over in #8123 (comment) where the poster ran into the exact same error.

Pretty sure there's a race condition toward the end of the entire cluster creation process. It's supposed to write the kubeconfig for your new EKSA cluster to your local filesystem, then start using that kubeconfig to connect to the new cluster and lift all the EKSA controllers and CRDs into it. The very first action it takes to do that is to make the eksa-system namespace.

If the kubeconfig file doesn't exist yet, that command will fail, and because there is no kubeconfig set, it falls back to the kubectl default, which is to connect to a local Kubernetes API at localhost:8080.

Sometimes when I ran it the kubeconfig would be there first, and it would succeed. Other times it wouldn't, and it would bomb with this error. That's why it's inconsistent.

The way I solved it was to fork the EKSA plugin and add a 30s sleep before it tries to run commands against the new cluster 😏. That has worked for me every time since. Another poster on that ticket suggested ensuring that the Kubernetes versions match exactly between kubectl and the cluster, which I haven't yet tested (all my clusters are working), but you could try that as well.

Hope that helps!

@galvesribeiro

Interesting idea. The thing is that, consistently, changing from fullClone to linkedClone made it work.

If the delay, as you suggest, is the culprit, I'd guess it should be handled like the many other "awaits" that happen across the steps.

The overall feeling, unfortunately, is that EKS-A is not being given much attention and/or maintenance if this issue has been open for that long.

But thanks for the reply, I'll have a look at forking.

@galvesribeiro

Hey folks!

Just to remind the team, this is still an issue.

Today we needed to increase the /tmp directory on the Bottlerocket VM disk. We aren't able to do that on running nodes, so we thought "let's set diskGiB in the manifest of the workload nodes". The problem is (and the docs confirm this) that the disk size can only be changed when using fullClone; with linkedClone we can't set it.

So the logical action was: "yes, it will be a full copy of the disk, but at least we can use the bigger disk; let's change to fullClone and increase diskGiB". Well, I wish that had worked, but it fails with the same error ("A specified parameter was not correct: spec.config.deviceChange[0].operation") when performing the upgrade.
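
For context, here is a sketch of the change we attempted, based on the worker machine config posted above (the diskGiB value is only an illustrative number, not the exact size we used):

apiVersion: anywhere.eks.amazonaws.com/v1alpha1
kind: VSphereMachineConfig
metadata:
  name: awsemu
spec:
  cloneMode: fullClone       # switched from linkedClone so that diskGiB can be set
  diskGiB: 60                # illustrative size; the upgrade then fails with the deviceChange[0].operation error
  datastore: vsandatastore
  folder: Kubernetes/Management/Worker Nodes
  memoryMiB: 8192
  numCPUs: 2
  osFamily: bottlerocket
  resourcePool: /datacenter/host/hwcluster/Resources
  storagePolicyName: ""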

I wish the EKS team would have a look at this. The issue is from April and we haven't gotten even a reply on what is wrong, or any proper workaround, yet...
