
Timed out waiting for all machines to be exist #8824

Closed
adilGhaffarDev opened this issue Jun 8, 2023 · 19 comments · Fixed by #9125
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/flake Categorizes issue or PR as related to a flaky test. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@adilGhaffarDev
Contributor

adilGhaffarDev commented Jun 8, 2023

Which jobs are flaking?

  • periodic-cluster-api-e2e-mink8s-release-1-3
  • periodic-cluster-api-e2e-main
  • periodic-cluster-api-e2e-release-1-4

e.g. https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-3/1666214582626029568

Which tests are flaking?

  • When testing clusterctl upgrades (v1.x=>current) Should create a management cluster and then upgrade all the providers

Since when has it been flaking?

Minor flakes since 04-06-2023

Testgrid link

https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#2822326a66dd24850a9d

Reason for failure (if possible)

To be analyzed.

Anything else we need to know?

No response

Label(s) to be applied

/kind flake

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 8, 2023
@adilGhaffarDev adilGhaffarDev changed the title from "Timed out waiting for nodes to be created for MachineDeployment clusterctl-upgrade" to "Timed out waiting for all machines to be exist" Jun 8, 2023
@fabriziopandini
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 12, 2023
@killianmuldoon
Contributor

/help

@k8s-ci-robot
Contributor

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 12, 2023
@chrischdi
Member

chrischdi commented Jul 14, 2023

I'll try to take a look into this.

/assign

Persistent triage link https://storage.googleapis.com/k8s-triage/index.html?date=2023-07-13&job=.*-cluster-api-.*&xjob=.*-provider-.*#a20c32c92add5bfec5f5 (Edit: this link does not match the issue)

@sbueringer
Member

Just looked at one of the cases. I think there's a realistic chance this is the same issue as here: #8786 (comment)

(can be verified by looking for preflight errors in the MachineSet and then checking if KCP has a status version)
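
For anyone triaging similar runs, a rough sketch of that check (the namespace comes from the logs below; the resource names will differ per run, and whether preflight errors show up in describe output is an assumption):

$ kubectl -n clusterctl-upgrade describe machineset | grep -i preflight
$ kubectl -n clusterctl-upgrade get kubeadmcontrolplane -o jsonpath='{.items[0].status.version}'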

@chrischdi

This comment was marked as off-topic.

@chrischdi
Member

chrischdi commented Jul 21, 2023

Note: this could be a different issue than #8786; the query linked above may lead to different issues.

Analysing the prowjob linked in the first post:

It looks like the control plane container is not able to start and CAPD's container creation/start does not work:

CAPD log:

E0606 23:24:39.060065       1 controller.go:326] "Reconciler error" err="failed to create worker DockerMachine: error starting container \"clusterctl-upgrade-oh686r-control-plane-6hdqw\": Error response from daemon: driver failed programming external connectivity on endpoint clusterctl-upgrade-oh686r-control-plane-6hdqw (637e7efe77a6abeb44ff037b39458562423c693c10f122156d32be2fc9081eda): Bind for 127.0.0.1:33369 failed: port is already allocated" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" dockerMachine="clusterctl-upgrade/clusterctl-upgrade-oh686r-control-plane-rs4c7" namespace="clusterctl-upgrade" name="clusterctl-upgrade-oh686r-control-plane-rs4c7" reconcileID=ea60290d-b9c5-428f-8659-1cbf819797b0
...
E0606 23:24:39.841361       1 controller.go:326] "Reconciler error" err="failed to exec DockerMachine bootstrap: failed to run cloud config: stdout:  stderr: : error creating container exec: Error response from daemon: Container b787bae2a0df7cd9272a1ebc84209e448f9ce2e34131d3aa673c47202cdf7943 is not running" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" dockerMachine="clusterctl-upgrade/clusterctl-upgrade-oh686r-control-plane-rs4c7" namespace="clusterctl-upgrade" name="clusterctl-upgrade-oh686r-control-plane-rs4c7" reconcileID=4d7ed318-faad-44f8-a7dd-0b2a07851588

Same information in docker log

time="2023-06-06T23:24:39.058365704Z" level=error msg="b787bae2a0df7cd9272a1ebc84209e448f9ce2e34131d3aa673c47202cdf7943 cleanup: failed to delete container from containerd: no such container"
time="2023-06-06T23:24:39.058475121Z" level=error msg="Handler for POST /v1.41/containers/b787bae2a0df7cd9272a1ebc84209e448f9ce2e34131d3aa673c47202cdf7943/start returned error: driver failed programming external connectivity on endpoint clusterctl-upgrade-oh686r-control-plane-6hdqw (637e7efe77a6abeb44ff037b39458562423c693c10f122156d32be2fc9081eda): Bind for 127.0.0.1:33369 failed: port is already allocated"

Updated link from the first comment: https://storage.googleapis.com/k8s-triage/index.html?date=2023-06-10&job=.*-cluster-api-.*&xjob=.*-provider-.*#2822326a66dd24850a9d

An additional, more flexible query to find the issue independent of the query id:
here

@chrischdi
Member

chrischdi commented Jul 21, 2023

So the issue here is:

  • The CAPD node container for the ControlPlane node tries to start, but the port it tries to use is already in use.
  • Because of that, the container for the node cannot be created.

@sbueringer
Member

> but the pod

I assume port?

Ah interesting. Am I seeing correctly that we hand over 0 as a host port to docker?

This would suggest that Docker itself should pick a random port (?) (maybe I'm looking at the wrong code)
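
If it helps to sanity-check that behaviour: handing over 0 (or an empty value) as the host port does make Docker pick a free ephemeral port. A quick illustration with a throwaway container (image and port are placeholders):

$ docker run -d --name port-demo -p 127.0.0.1:0:6443 registry.k8s.io/pause:3.9
$ docker port port-demo 6443    # prints something like 127.0.0.1:49153; the host port varies per run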

@sbueringer
Member

sbueringer commented Jul 24, 2023

Assuming I'm looking at the right code, I wonder if we should just implement a retry (e.g. via requeue) and be done with it :) (+ surface in the logs that we're retrying)

P.S. Given that we just fixed #8786, I'm not sure we have a clear signal right now on how often this specific issue occurs.

@chrischdi
Member

A simple requeue is not enough in this case. We also have to delete the container.

Sidenote: reproducible via:

$ docker create --name foo -p 31333:8080 golang:1.20.6 tail -f /dev/null
$ docker create --name bar -p 31333:8080 golang:1.20.6 tail -f /dev/null
$ docker start foo
$ docker start bar
Error response from daemon: driver failed programming external connectivity on endpoint bar (6277eb12151633aff6a016eefa0c328c4c596831d1abda8d41751c14d114a778): Bind for 0.0.0.0:31333 failed: port is already allocated
Error: failed to start containers: bar
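
Continuing the repro to illustrate why a plain retry is not enough: the host-port binding is fixed at create time, so the container that failed to start has to be removed and recreated before a retry can succeed (just a sketch of the idea, not the actual change made in #9125):

$ docker start bar        # retrying the same container fails again, it is still bound to 31333
Error response from daemon: driver failed programming external connectivity on endpoint bar (...): Bind for 0.0.0.0:31333 failed: port is already allocated
Error: failed to start containers: bar
$ docker rm bar           # remove the container that never started
$ docker create --name bar -p 31334:8080 golang:1.20.6 tail -f /dev/null
$ docker start bar
bar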

@killianmuldoon
Contributor

/reopen

To assess if this has fixed the underlying flakes and to track the cherry-picks:
#9131
#9130

@k8s-ci-robot
Contributor

@killianmuldoon: Reopened this issue.

In response to this:

/reopen

To assess if this has fixed the underlying flakes and to track the cherry-picks:
#9131
#9130

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Aug 7, 2023
@chrischdi
Member

chrischdi commented Aug 8, 2023

Link to check if the issue still exists on main (because the cherry-picks will only get merged after)

The PR got merged at 07/08/2023, 11:56:31 UTC.
We have not seen the issue since 06/08/2023, 02:37:28 UTC xref.
Before that, the issue appeared on main roughly every ~5 days.

I'll postpone checking whether it is fixed on main until Wednesday, 16th August, before merging the cherry-picks. This gives us 9 days to see if we got rid of the issue on main.

@chrischdi
Member

Note: after merging the cherry-picks, we should also cherry-pick #9139 on top.

@killianmuldoon
Contributor

I think we can close this now - if the same issue pops up we can take another look, but this error message is the result of a number of different possible underlying errors.

Thanks again for fixing this @chrischdi!

/close

@chrischdi
Member

There was only one occurrence of this flake, at 16/08/2023, 05:25:47 xref.

However, that occurrence was during the clusterctl upgrade tests while CAPI v1.0.5 was running.

So there has been no occurrence since merging the fix.

Also, the cherry-picks got merged now:

/close

@k8s-ci-robot
Contributor

@killianmuldoon: Closing this issue.

In response to this:

I think we can close this now - if the same issue pops up we can take another look, but this error message is the result of a number of different possible underlying errors.

Thanks again for fixing this @chrischdi!

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
