
EKS v0.19.5 Creating cluster in Docker fails at some point #8123

Closed
abregar opened this issue May 9, 2024 · 5 comments


abregar commented May 9, 2024

Filing this as a problem report. I tried to initialize a dev cluster on macOS (Sonoma 14.4.1) with Docker Desktop v4.30.0, as the documentation suggests, with a higher verbosity level:

eksctl anywhere create cluster -f $CLUSTER_NAME.yaml -v 9

Using the latest release, as shown in the log:

Initializing long running container     {"name": "eksa_1715243480381280000", "image": "public.ecr.aws/eks-anywhere/cli-tools:v0.19.5-eks-a-65"}

Initialization goes well: the containers for the control plane, load balancer, etcd, etc. are created successfully. But the creation process then stops at this point:

2024-05-09T10:52:50.466+0200    V1      cleaning up temporary namespace  for diagnostic collectors      {"namespace": "eksa-diagnostics"}
2024-05-09T10:52:50.466+0200    V5      Retrier:        {"timeout": "2562047h47m16.854775807s", "backoffFactor": null}
2024-05-09T10:52:50.466+0200    V6      Executing command       {"cmd": "/usr/local/bin/docker exec -i eksa_1715244428714146000 kubectl delete namespace eksa-diagnostics --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig"}
2024-05-09T10:52:55.641+0200    V5      Retry execution successful      {"retries": 1, "duration": "5.175007875s"}
2024-05-09T10:52:55.642+0200    V4      Task finished   {"task_name": "collect-cluster-diagnostics", "duration": "17.227805209s"}
2024-05-09T10:52:55.642+0200    V4      ----------------------------------
2024-05-09T10:52:55.642+0200    V4      Saving checkpoint       {"file": "mgmt-checkpoint.yaml"}
2024-05-09T10:52:55.643+0200    V4      Tasks completed {"duration": "5m38.393764542s"}
2024-05-09T10:52:55.643+0200    V3      Cleaning up long running container      {"name": "eksa_1715244428714146000"}
2024-05-09T10:52:55.643+0200    V6      Executing command       {"cmd": "/usr/local/bin/docker rm -f -v eksa_1715244428714146000"}
Error: creating namespace eksa-system: The connection to the server localhost:8080 was refused - did you specify the right host or port?

To me, it looks like the temporary container is removed too early, and the script then does not handle the missing kubeconfig.

So, my questions: is this considered a bug, is there a quick workaround, and is it possible to continue the cluster creation procedure from the failing point?
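One way to confirm the hypothesis above is a quick post-mortem check: see whether the generated kubeconfig survived and whether any long-running cli-tools container (`eksa_<timestamp>`) is still alive. This is a hypothetical diagnostic sketch; the kubeconfig path is taken from the log above, and the checks are guarded so they are safe to run anywhere.

```shell
# Path from the log above ("mgmt/mgmt-eks-a-cluster.kubeconfig").
KUBECONFIG_FILE="mgmt/mgmt-eks-a-cluster.kubeconfig"

# Did eksctl leave the generated kubeconfig behind?
if [ -f "$KUBECONFIG_FILE" ]; then
  echo "kubeconfig present: $KUBECONFIG_FILE"
else
  echo "kubeconfig missing: $KUBECONFIG_FILE"
fi

# Is any long-running cli-tools container still running?
if command -v docker >/dev/null 2>&1; then
  docker ps --filter "name=eksa_" --format '{{.Names}}'
else
  echo "docker not available"
fi
```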

sp1999 (Member) commented May 14, 2024

Hey @abregar, did you have the KUBECONFIG env variable set when creating the cluster? If so, can you unset it and try creating the cluster again?
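The suggestion above boils down to retrying with a clean environment. A minimal sketch, using the cluster spec file and `-v 9` flag from the original report, and guarded so it is safe to run even where eksctl is not installed:

```shell
# Make sure no stale kubeconfig leaks into the bootstrap flow.
unset KUBECONFIG
echo "KUBECONFIG is now: ${KUBECONFIG:-<unset>}"

# Only attempt the create when eksctl is actually on PATH.
if command -v eksctl >/dev/null 2>&1; then
  eksctl anywhere create cluster -f "$CLUSTER_NAME.yaml" -v 9
else
  echo "eksctl not installed; skipping"
fi
```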

abregar (Author) commented May 23, 2024

No, KUBECONFIG was not set. I also tried the new release, EKS Anywhere v0.19.6, and it still fails for me at the same point.
Any other hints on what I should check or modify?

bsmithtm commented
I have the same setup and the same result:

  • Running eksctl anywhere create cluster from a Mac to create a standalone cluster in vSphere
  • Bootstrap Kind cluster works properly, applies my Cluster and VSphere* objects, controllers are healthy
  • Actual VMs are started properly in vSphere and are healthy

My run fails at the same step: using the new kubeconfig for the EKS-A cluster (named galactica) to create the eksa-system namespace and move the bootstrapped resources from the Kind cluster into the actual Kubernetes cluster. The Overview describes the step as

Moves the Cluster API and EKS-A core components from the bootstrap cluster to the EKS Anywhere cluster

        {"T":1718777530703686000,"M":"Creating EKS-A namespace"}
        {"T":1718777530704099000,"M":"Executing command","cmd":"/usr/local/bin/docker exec -i eksa_1718777136668786000 kubectl get namespace eksa-system --kubeconfig galactica/galactica-eks-a-cluster.kubeconfig"}
        {"T":1718777530809852000,"M":"docker","stderr":"The connection to the server localhost:8080 was refused - did you specify the right host or port?\n"}
        {"T":1718777530810000000,"M":"Executing command","cmd":"/usr/local/bin/docker exec -i eksa_1718777136668786000 kubectl create namespace eksa-system --kubeconfig galactica/galactica-eks-a-cluster.kubeconfig"}
        {"T":1718777530919670000,"M":"docker","stderr":"The connection to the server localhost:8080 was refused - did you specify the right host or port?\n"}
        {"T":1718777530919780000,"M":"Task finished","task_name":"workload-cluster-init","duration":"4m53.567563542s"}

Looking at timestamps, it seems like the new EKSA kubeconfig is created fractions of a second before this task attempts to use it:

2024-06-18 23:12:10.703604808 -0700 galactica-eks-a-cluster.kubeconfig created
2024-06-18 23:12:10.704099000 -0700 kubectl get namespace eksa-system
2024-06-18 23:12:10.810000000 -0700 kubectl create namespace eksa-system

So perhaps the kubectl command running inside the eksa_1718777136668786000 container doesn't have access to the new EKSA kubeconfig? The container was still running after these timestamps, because I can see it generating the support bundle from which I got these logs. However, it's not there now, so I can't see which local dirs might have been mounted into it.
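For a future run, while the long-running cli-tools container is still alive, its bind mounts can be listed to see whether the cluster directory (and therefore the new kubeconfig) is visible inside the container. A hypothetical sketch; the container name comes from the log above and would need to be adjusted to the current run:

```shell
# Container name from the log above; replace with the current eksa_<timestamp>.
CONTAINER="eksa_1718777136668786000"

if command -v docker >/dev/null 2>&1 && docker inspect "$CONTAINER" >/dev/null 2>&1; then
  # Print each host-path -> container-path bind mount.
  docker inspect --format \
    '{{range .Mounts}}{{printf "%s -> %s\n" .Source .Destination}}{{end}}' \
    "$CONTAINER"
else
  echo "container $CONTAINER is not running (or docker is unavailable)"
fi
```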

amitmavgupta (Contributor) commented Jul 10, 2024

@bsmithtm I ran into this issue and spent the entire day triaging it. I am on Ubuntu 22.04 with Docker.

Installing the latest stable kubectl instead of the version matching your EKS-A cluster's actual Kubernetes version creates a client/server mismatch, which breaks things.

To isolate this, I am running 1.29 on both the client and the server, which has worked well.
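A quick way to spot the skew described above is to print the kubectl client version and compare it with the server version of the target cluster. A minimal sketch, guarded so it runs even where kubectl is absent; the kubeconfig path in the comment is illustrative, taken from the logs earlier in the thread:

```shell
if command -v kubectl >/dev/null 2>&1; then
  # Client side only; always works, no cluster needed.
  kubectl version --client
  # Against a reachable cluster, compare with the server side, e.g.:
  #   kubectl --kubeconfig mgmt/mgmt-eks-a-cluster.kubeconfig version
else
  echo "kubectl not installed"
fi
# Client and server minor versions should match (e.g. both 1.29, as above).
```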

abregar (Author) commented Aug 2, 2024

Tried today on macOS (Sonoma 14.6) with Docker Desktop v4.33.0 and the latest EKS Anywhere release, v0.20.2, and:

2024-08-02T14:47:44.497+0200 V0 🎉 Cluster created!

Since no one linked anything to this ticket, I'm not sure what resolved it, but FYI for anyone interested.
