
Disk and File CI jobs fail to configure management cluster fully #1596

Closed
jsturtevant opened this issue Aug 10, 2021 · 8 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jsturtevant
Contributor

/kind bug

What steps did you take and what happened:

During the management cluster configuration steps, the system doesn't seem to be fully online. I noticed this only happens with the Disk and File CI jobs. The error messages aren't identical, but they look like they may be related (if not, we can split them out).

The Disk and File jobs use a different entry point than the conformance jobs, so there could be a difference in the scripts:

Disk and File: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/scripts/ci-entrypoint.sh
Conformance jobs: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/scripts/ci-conformance.sh

kubectl apply -f templates/addons/calico-resource-set.yaml
Error from server (InternalError): error when creating "templates/addons/calico-resource-set.yaml": Internal error occurred: failed calling webhook "default.clusterresourceset.addons.cluster.x-k8s.io": Post https://capi-webhook-service.capi-system.svc:443/mutate-addons-cluster-x-k8s-io-v1alpha4-clusterresourceset?timeout=10s: dial tcp 10.97.59.108:443: connect: connection refused
mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
error: An error occurred while waiting for the condition to be satisfied: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
Unable to connect to the server: net/http: TLS handshake timeout

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-disk-vmss-master/1423754824020660224
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-file-1-22/1424062353263038464
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-disk-1-21/1424189442528120832
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-file-vmss-1-20/1422249304768122880
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-disk-vmss-1-19/1424061850206605312
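
A retry loop around that first apply would probably mask the race while the webhook comes up. Rough sketch only; the retry count and sleep below are arbitrary assumptions and not taken from ci-entrypoint.sh:

# Hypothetical workaround: retry the ClusterResourceSet apply until the CAPI webhook answers.
# The 30x10s budget is an arbitrary assumption, not part of the actual script.
applied=false
for i in $(seq 1 30); do
  if kubectl apply -f templates/addons/calico-resource-set.yaml; then
    applied=true
    break
  fi
  echo "calico-resource-set apply failed (attempt ${i}), retrying in 10s..."
  sleep 10
done
[ "${applied}" = "true" ] || { echo "calico-resource-set apply never succeeded"; exit 1; }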

What did you expect to happen:
Management cluster to come online

Anything else you would like to add:

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 10, 2021
@CecileRobertMichon
Contributor

This should be fixed by kubernetes-sigs/cluster-api#4989 once we pull in a newer CAPI version, which will make sure the webhook is ready before marking the deployment as ready (we already wait for the deployment to be ready before proceeding).
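
Roughly what happens today, for illustration (the deployment name below is an example of what the script waits on, not a quote from it):

# The script waits for the CAPI deployments to report Available, e.g.:
kubectl wait --for=condition=Available --timeout=5m -n capi-system deployment/capi-controller-manager
# The Deployment can be Available while the capi-webhook-service endpoint behind
# default.clusterresourceset.addons.cluster.x-k8s.io is still not serving, which is
# the window the failing apply lands in; kubernetes-sigs/cluster-api#4989 closes it
# by not reporting ready until the webhook actually responds.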

@jsturtevant
Contributor Author

Any ideas on why it is occurring only in the Disk and File jobs?

@CecileRobertMichon
Contributor

The conformance jobs create the cluster resource set for the CNI per cluster, when each workload cluster is created. This probably gives the webhook a few more seconds to become ready, which is why we don't run into this error there. The ci-entrypoint script installs the calico resource set right after creating the management cluster, as soon as the deployments are "available".
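
For context, a simplified sketch of the two orderings, plus one way the entrypoint could wait for the webhook endpoints before applying (the service and namespace names come from the error above; the polling loop is only an illustration, not the actual script contents):

# ci-conformance.sh (simplified): init management cluster -> create workload cluster
#                                 -> apply the CNI ClusterResourceSet per cluster
# ci-entrypoint.sh  (simplified): init management cluster -> apply the CNI
#                                 ClusterResourceSet immediately
# Illustrative guard: wait until the webhook Service has a ready endpoint, then apply.
until kubectl get endpoints capi-webhook-service -n capi-system \
    -o jsonpath='{.subsets[0].addresses[0].ip}' 2>/dev/null | grep -q .; do
  sleep 5
done
kubectl apply -f templates/addons/calico-resource-set.yaml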

@CecileRobertMichon
Contributor

Should be fixed by #1619 and #1620

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 18, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
