
CAPBk hangs, possibly in slow API server situations? #2073

Closed
jayunit100 opened this issue Jan 15, 2020 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@jayunit100
Contributor

jayunit100 commented Jan 15, 2020

What steps did you take and what happened:

This is a sub-issue of #1855, because the solution to it will likely involve a liveness check :)

  • I had a cluster that had been running for a "long" time (not long-long, but about 3 days).
  • The first CAPV cluster I made on it went fine.
  • Context: a few times I did observe etcd watch failures in my management cluster.
  • The next cluster creation did not go fine: deletion events as well as creation events hung or failed.
  • Then I tried to make a new cluster via CAPV, and nothing happened.
  • Finally, I decided to try deleting the CAPBK pod. The pod did not delete easily (maybe it somehow got into a hung state); I had to use --grace-period=1.

After restarting my CAPBK pod, I found that the cluster came to life instantly, and all VMs came up with proper configs.

So I figure this is related to some kind of problem where CAPBK hangs if your API server is slow. Somehow restarting CAPBK fixes things.

I assume CAPBK isn't DoS'ing the API server, but I won't rule that out until someone tells me to stop guessing.

What did you expect to happen:

CAPBK should die (and get restarted) if it's in a hung state or otherwise not capable of doing its job.
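For reference, a minimal sketch of how that could be wired with controller-runtime's healthz support (assuming a controller-runtime version that has the manager probe support; the bind address, check name, and use of healthz.Ping are illustrative assumptions, not what CAPBK does today). A Deployment livenessProbe pointed at /healthz on that port would then let the kubelet restart a hung pod:

// Hypothetical sketch: expose a /healthz endpoint so a livenessProbe can restart a hung controller.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// HealthProbeBindAddress is an assumed value; it just needs to match the probe's port.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":9440",
	})
	if err != nil {
		os.Exit(1)
	}

	// healthz.Ping always answers ok; a real check could verify the API server is reachable.
	if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}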

Anything else you would like to add:

I don't have logs anymore for this CAPBK instance, but I do have a witness, @fabriziopandini :)

Environment:

  • Cluster-api version: v1alpha2
  • Kubernetes version: (use kubectl version): 1.16
  • OS (e.g. from /etc/os-release): Ubuntu 18

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 15, 2020
@jayunit100
Contributor Author

I'm OK closing this, but I just wanted to go on record with a concrete use case, in case it plays into the priority of tackling the broader issue.

@ncdc ncdc added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Jan 15, 2020
@ncdc ncdc added this to the Next milestone Jan 15, 2020
@chuckha
Contributor

chuckha commented Jan 16, 2020

This sounds a lot like something @akutz ran into. He found that, I believe, client-go had an infinite timeout when connecting to the API server. Is it possible this is hanging on creating a client to a non-responsive or perhaps very slow API server as you suggest?
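For context, a minimal sketch of what a finite client-go timeout looks like (the in-cluster config and the 30-second value are arbitrary assumptions, not what CAPBK actually configures):

package main

import (
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newClientset() (*kubernetes.Clientset, error) {
	// Assumes the controller runs in-cluster; a kubeconfig-based rest.Config works the same way.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}

	// rest.Config.Timeout defaults to zero, which means no timeout at all, so a slow or
	// unresponsive API server can leave a request hanging indefinitely. 30s is just an example.
	cfg.Timeout = 30 * time.Second

	return kubernetes.NewForConfig(cfg)
}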

@jayunit100
Contributor Author

Not sure. Plausible, though, because my infra was a little slow when I saw this.

@jayunit100
Contributor Author

jayunit100 commented Feb 5, 2020

related:

I think I just saw an issue with capv-manager, where restarting the capv-controller-manager pod immediately caused my cluster creation to resume again.

It continually says 'Successfully Reconciled' but doesn't actually create a new VM, and then restarting it creates a new VM. I saw something similar a few weeks ago where restarting CABPK actually unblocked a stopped-up management plane.

The CAPV controller manager spams that the resource is not patched

I didn't capture all the logs, but I did manage to capture this being repeated a lot in capv-controller-manager:

I0205 21:06:47.006142       1 vspheremachine_controller.go:258] capv-controller-manager/vspheremachine-controller/default/management-cluster-controlplane-0 "level"=0 "msg"="resource is not patched"  "local-resource-version"="322477" "remote-resource-version"="322477"

CABPK logs: nothing suspicious...

And I captured all the logs for CABPK during this time period here.

I0204 20:18:45.328133       1 kubeadmconfig_controller.go:138] KubeadmConfigReconciler "level"=0 "msg"="ignoring config for an already ready machine" "kubeadmconfig"={"Namespace":"default","Name":"workload-cluster-5-controlplane-0"} "machine-name"="workload-cluster-5-controlplane-0" 
I0204 20:18:45.328149       1 controller.go:242] controller-runtime/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="kubeadmconfig" "request"={"Namespace":"default","Name":"workload-cluster-5-controlplane-0"}
I0204 20:18:45.328197       1 kubeadmconfig_controller.go:138] KubeadmConfigReconciler "level"=0 "msg"="ignoring config for an already ready machine" "kubeadmconfig"={"Namespace":"default","Name":"workload-cluster-5-md-0-klgqk"} "machine-name"="workload-cluster-5-md-0-79b76cbf6-tnt4r" 
I0204 20:18:45.328349       1 controller.go:242] controller-runtime/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="kubeadmconfig" "request"={"Namespace":"default","Name":"workload-cluster-5-md-0-klgqk"}
I0204 20:18:45.328511       1 reflector.go:243] pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98: forcing resync
I0204 20:18:45.328553       1 controller.go:242] controller-runtime/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="kubeadmconfig" "request"={"Namespace":"default","Name":"management-cluster-controlplane-0"}

Just FYI, if anyone is curious about this issue.

@vincepri
Member

@jayunit100 do we have any action items here, or should we close this one out?

@fabriziopandini
Member

IMO a possible action item is to check that all the client-go / controller-runtime clients have a finite timeout.
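A sketch of what that check might translate to in code (assuming the clients are built from a shared rest.Config; the 10-second value is arbitrary): set the timeout once on the config, and clients derived from it inherit it.

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newClient() (client.Client, error) {
	// However the rest.Config is loaded (in-cluster, kubeconfig, ...), set a finite Timeout on it.
	cfg := ctrl.GetConfigOrDie()

	// Zero means "no timeout"; give every request a finite deadline instead.
	cfg.Timeout = 10 * time.Second

	// A controller-runtime client built from this config inherits the timeout.
	return client.New(cfg, client.Options{})
}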

@vincepri
Member

Duplicate of #2993

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

Duplicate of #2993

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
