
Timed out waiting for all machines to be exist #8824

Closed
adilGhaffarDev opened this issue Jun 8, 2023 · 19 comments · Fixed by #9125
Assignees
Labels
help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. kind/flake Categorizes issue or PR as related to a flaky test. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@adilGhaffarDev
Contributor

adilGhaffarDev commented Jun 8, 2023

Which jobs are flaking?

  • periodic-cluster-api-e2e-mink8s-release-1-3
  • periodic-cluster-api-e2e-main
  • periodic-cluster-api-e2e-release-1-4

e.g. https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/periodic-cluster-api-e2e-mink8s-release-1-3/1666214582626029568

Which tests are flaking?

  • When testing clusterctl upgrades (v1.x=>current) Should create a management cluster and then upgrade all the providers

Since when has it been flaking?

Minor flakes since 04-06-2023

Testgrid link

https://storage.googleapis.com/k8s-triage/index.html?job=.*-cluster-api-.*&xjob=.*-provider-.*#2822326a66dd24850a9d

Reason for failure (if possible)

To be analyzed.

Anything else we need to know?

No response

Label(s) to be applied

/kind flake

@k8s-ci-robot k8s-ci-robot added kind/flake Categorizes issue or PR as related to a flaky test. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 8, 2023
@adilGhaffarDev adilGhaffarDev changed the title from "Timed out waiting for nodes to be created for MachineDeployment clusterctl-upgrade" to "Timed out waiting for all machines to be exist" Jun 8, 2023
@fabriziopandini
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jun 12, 2023
@killianmuldoon
Contributor

/help

@k8s-ci-robot
Contributor

@killianmuldoon:
This request has been marked as needing help from a contributor.

Guidelines

Please ensure that the issue body includes answers to the following questions:

  • Why are we solving this issue?
  • To address this issue, are there any code changes? If there are code changes, what needs to be done in the code and what places can the assignee treat as reference points?
  • Does this issue have zero to low barrier of entry?
  • How can the assignee reach out to you for help?

For more details on the requirements of such an issue, please see here and ensure that they are met.

If this request no longer meets these requirements, the label can be removed
by commenting with the /remove-help command.

In response to this:

/help

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the help wanted Denotes an issue that needs help from a contributor. Must meet "help wanted" guidelines. label Jul 12, 2023
@chrischdi
Member

chrischdi commented Jul 14, 2023

I'll try to take a look into this.

/assign

Persistent triage link https://storage.googleapis.com/k8s-triage/index.html?date=2023-07-13&job=.*-cluster-api-.*&xjob=.*-provider-.*#a20c32c92add5bfec5f5 (Edit: this link does not match the issue)

@sbueringer
Member

Just looked at one of the cases. I think there's a realistic chance this is the same issue as here: #8786 (comment)

(can be verified by looking for preflight errors in the MachineSet and then checking if KCP has a status version)
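
For anyone triaging similar runs, a rough sketch of that check (the namespace comes from the logs below; the resource names will differ per run, and whether preflight errors show up in describe output is an assumption):

$ kubectl -n clusterctl-upgrade describe machineset | grep -i preflight
$ kubectl -n clusterctl-upgrade get kubeadmcontrolplane -o jsonpath='{.items[0].status.version}'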

@chrischdi

This comment was marked as off-topic.

@chrischdi
Member

chrischdi commented Jul 21, 2023

Note: this could be a different issue than #8786; the query linked above may lead to different issues.

Analysing the prowjob linked in the first post:

It looks like the control plane container is not able to start and CAPD's container creation/start does not work:

CAPD log:

E0606 23:24:39.060065       1 controller.go:326] "Reconciler error" err="failed to create worker DockerMachine: error starting container \"clusterctl-upgrade-oh686r-control-plane-6hdqw\": Error response from daemon: driver failed programming external connectivity on endpoint clusterctl-upgrade-oh686r-control-plane-6hdqw (637e7efe77a6abeb44ff037b39458562423c693c10f122156d32be2fc9081eda): Bind for 127.0.0.1:33369 failed: port is already allocated" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" dockerMachine="clusterctl-upgrade/clusterctl-upgrade-oh686r-control-plane-rs4c7" namespace="clusterctl-upgrade" name="clusterctl-upgrade-oh686r-control-plane-rs4c7" reconcileID=ea60290d-b9c5-428f-8659-1cbf819797b0
...
E0606 23:24:39.841361       1 controller.go:326] "Reconciler error" err="failed to exec DockerMachine bootstrap: failed to run cloud config: stdout:  stderr: : error creating container exec: Error response from daemon: Container b787bae2a0df7cd9272a1ebc84209e448f9ce2e34131d3aa673c47202cdf7943 is not running" controller="dockermachine" controllerGroup="infrastructure.cluster.x-k8s.io" controllerKind="DockerMachine" dockerMachine="clusterctl-upgrade/clusterctl-upgrade-oh686r-control-plane-rs4c7" namespace="clusterctl-upgrade" name="clusterctl-upgrade-oh686r-control-plane-rs4c7" reconcileID=4d7ed318-faad-44f8-a7dd-0b2a07851588

Same information in docker log

time="2023-06-06T23:24:39.058365704Z" level=error msg="b787bae2a0df7cd9272a1ebc84209e448f9ce2e34131d3aa673c47202cdf7943 cleanup: failed to delete container from containerd: no such container"
time="2023-06-06T23:24:39.058475121Z" level=error msg="Handler for POST /v1.41/containers/b787bae2a0df7cd9272a1ebc84209e448f9ce2e34131d3aa673c47202cdf7943/start returned error: driver failed programming external connectivity on endpoint clusterctl-upgrade-oh686r-control-plane-6hdqw (637e7efe77a6abeb44ff037b39458562423c693c10f122156d32be2fc9081eda): Bind for 127.0.0.1:33369 failed: port is already allocated"

Updated link from the first comment: https://storage.googleapis.com/k8s-triage/index.html?date=2023-06-10&job=.*-cluster-api-.*&xjob=.*-provider-.*#2822326a66dd24850a9d

An additional, more flexible query to find the issue independent of the query id:
here

@chrischdi
Member

chrischdi commented Jul 21, 2023

So the issue here is:

  • The CAPD node container for the ControlPlane node tries to start, but the port it tries to use is already in use.
  • Because of that, the container for the node cannot be created.

@sbueringer
Member

> but the pod

I assume port?

Ah interesting. Am I seeing correctly that we hand over 0 as a host port to docker?

This would suggest that Docker itself should pick a random port (?) (maybe I'm looking at the wrong code)
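
If it helps to sanity-check that behaviour: handing over 0 (or an empty value) as the host port does make Docker pick a free ephemeral port. A quick illustration with a throwaway container (image and port are placeholders):

$ docker run -d --name port-demo -p 127.0.0.1:0:6443 registry.k8s.io/pause:3.9
$ docker port port-demo 6443    # prints something like 127.0.0.1:49153; the host port varies per run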

@sbueringer
Member

sbueringer commented Jul 24, 2023

Assuming I'm looking at the right code, I wonder if we should just implement a retry (e.g. via requeue) and be done with it :) (+ surface in the logs that we're retrying)

P.S. Given that we just fixed #8786, I'm not sure we have a clear signal right now on how often this specific issue occurs.

@chrischdi
Member

A simple requeue is not enough in this case. We also have to delete the container.

Sidenote: reproducible via:

$ docker create --name foo -p 31333:8080 golang:1.20.6 tail -f /dev/null
$ docker create --name bar -p 31333:8080 golang:1.20.6 tail -f /dev/null
$ docker start foo
$ docker start bar
Error response from daemon: driver failed programming external connectivity on endpoint bar (6277eb12151633aff6a016eefa0c328c4c596831d1abda8d41751c14d114a778): Bind for 0.0.0.0:31333 failed: port is already allocated
Error: failed to start containers: bar
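
Continuing the repro to illustrate why a plain retry is not enough: the host-port binding is fixed at create time, so the container that failed to start has to be removed and recreated before a retry can succeed (just a sketch of the idea, not the actual change made in #9125):

$ docker start bar        # retrying the same container fails again, it is still bound to 31333
Error response from daemon: driver failed programming external connectivity on endpoint bar (...): Bind for 0.0.0.0:31333 failed: port is already allocated
Error: failed to start containers: bar
$ docker rm bar           # remove the container that never started
$ docker create --name bar -p 31334:8080 golang:1.20.6 tail -f /dev/null
$ docker start bar
bar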

@killianmuldoon
Contributor

/reopen

To assess if this has fixed the underlying flakes and to track the cherry-picks:
#9131
#9130

@k8s-ci-robot
Contributor

@killianmuldoon: Reopened this issue.

In response to this:

/reopen

To assess if this has fixed the underlying flakes and to track the cherry-picks:
#9131
#9130

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot reopened this Aug 7, 2023
@chrischdi
Member

chrischdi commented Aug 8, 2023

Link to check if the issue still exists on main (because the cherry-picks will only get merged after)

The PR got merged at 07/08/2023, 11:56:31 UTC.
We have not seen the issue since 06/08/2023, 02:37:28 UTC xref.
Before that, the issue appeared on main roughly every ~5 days.

I'll postpone checking whether it is fixed on main until Wednesday, 16th August, before merging the cherry-picks. This gives us 9 days to see if we got rid of the issue on main.

@chrischdi
Member

Note: after merging the cherry-picks, we should also cherry-pick #9139 on top.

@killianmuldoon
Contributor

I think we can close this now - if the same issue pops up we can take another look, but this error message is the result of a number of different possible underlying errors.

Thanks again for fixing this @chrischdi!

/close

@chrischdi
Member

There was only one occurrence of this flake, at 16/08/2023, 05:25:47 xref.

However, that occurrence was during the clusterctl upgrade tests while CAPI v1.0.5 was running.

So there has been no occurrence since merging the fix.

Also, the cherry-picks got merged now:

/close

@k8s-ci-robot
Contributor

@killianmuldoon: Closing this issue.

In response to this:

I think we can close this now - if the same issue pops up we can take another look, but this error message is the result of a number of different possible underlying errors.

Thanks again for fixing this @chrischdi!

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
