
Disk and File CI jobs fail to configure management cluster fully #1596

Closed
jsturtevant opened this issue Aug 10, 2021 · 8 comments

Labels
kind/bug Categorizes issue or PR as related to a bug. lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@jsturtevant
Contributor

/kind bug

What steps did you take and what happened:

During the management cluster configuration steps, the system doesn't seem to be fully online. I noticed this only happens with the Disk and File CI jobs. The error messages aren't identical, but they look like they may be related (if not, we can split them out).

The Disk and File jobs use a different entry point than the conformance jobs, so there could be a difference in the scripts:

Disk and File: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/scripts/ci-entrypoint.sh
Conformance jobs: https://github.com/kubernetes-sigs/cluster-api-provider-azure/blob/master/scripts/ci-conformance.sh

kubectl apply -f templates/addons/calico-resource-set.yaml
Error from server (InternalError): error when creating "templates/addons/calico-resource-set.yaml": Internal error occurred: failed calling webhook "default.clusterresourceset.addons.cluster.x-k8s.io": Post https://capi-webhook-service.capi-system.svc:443/mutate-addons-cluster-x-k8s-io-v1alpha4-clusterresourceset?timeout=10s: dial tcp 10.97.59.108:443: connect: connection refused
mutatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
validatingwebhookconfiguration.admissionregistration.k8s.io/cert-manager-webhook created
error: An error occurred while waiting for the condition to be satisfied: an error on the server ("unable to decode an event from the watch stream: http2: client connection lost") has prevented the request from succeeding
Unable to connect to the server: net/http: TLS handshake timeout

https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-disk-vmss-master/1423754824020660224
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-file-1-22/1424062353263038464
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-disk-1-21/1424189442528120832
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-file-vmss-1-20/1422249304768122880
https://prow.k8s.io/view/gs/kubernetes-jenkins/logs/capz-azure-disk-vmss-1-19/1424061850206605312
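
A retry loop around that first apply would probably mask the race while the webhook comes up. Rough sketch only; the retry count and sleep below are arbitrary assumptions and not taken from ci-entrypoint.sh:

# Hypothetical workaround: retry the ClusterResourceSet apply until the CAPI webhook answers.
# The 30x10s budget is an arbitrary assumption, not part of the actual script.
applied=false
for i in $(seq 1 30); do
  if kubectl apply -f templates/addons/calico-resource-set.yaml; then
    applied=true
    break
  fi
  echo "calico-resource-set apply failed (attempt ${i}), retrying in 10s..."
  sleep 10
done
[ "${applied}" = "true" ] || { echo "calico-resource-set apply never succeeded"; exit 1; }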

What did you expect to happen:
Management cluster to come online

Anything else you would like to add:

Environment:

  • cluster-api-provider-azure version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Aug 10, 2021
@CecileRobertMichon
Contributor

This should be fixed by kubernetes-sigs/cluster-api#4989 once we pull in a newer CAPI version, which will make sure the webhook is ready before marking the deployment as ready (we already wait for the deployment to be ready before proceeding).
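
Roughly what happens today, for illustration (the deployment name below is an example of what the script waits on, not a quote from it):

# The script waits for the CAPI deployments to report Available, e.g.:
kubectl wait --for=condition=Available --timeout=5m -n capi-system deployment/capi-controller-manager
# The Deployment can be Available while the capi-webhook-service endpoint behind
# default.clusterresourceset.addons.cluster.x-k8s.io is still not serving, which is
# the window the failing apply lands in; kubernetes-sigs/cluster-api#4989 closes it
# by not reporting ready until the webhook actually responds.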

@jsturtevant
Contributor Author

Any ideas on why it is occurring only in the Disk and File jobs?

@CecileRobertMichon
Contributor

The conformance jobs create the cluster resource set for the CNI per cluster, when each workload cluster is created. This probably gives the webhook a few more seconds to become ready, which is why we don't run into this error there. The ci-entrypoint script installs the calico resource set right after creating the management cluster, as soon as the deployments are "available".
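
For context, a simplified sketch of the two orderings, plus one way the entrypoint could wait for the webhook endpoints before applying (the service and namespace names come from the error above; the polling loop is only an illustration, not the actual script contents):

# ci-conformance.sh (simplified): init management cluster -> create workload cluster
#                                 -> apply the CNI ClusterResourceSet per cluster
# ci-entrypoint.sh  (simplified): init management cluster -> apply the CNI
#                                 ClusterResourceSet immediately
# Illustrative guard: wait until the webhook Service has a ready endpoint, then apply.
until kubectl get endpoints capi-webhook-service -n capi-system \
    -o jsonpath='{.subsets[0].addresses[0].ip}' 2>/dev/null | grep -q .; do
  sleep 5
done
kubectl apply -f templates/addons/calico-resource-set.yaml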

@CecileRobertMichon
Contributor

Should be fixed by #1619 and #1620

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Nov 17, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Dec 18, 2021
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue.

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue or PR with /reopen
  • Mark this issue or PR as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
