
CAPBk hangs, possibly in slow API server situations? #2073

Closed
jayunit100 opened this issue Jan 15, 2020 · 8 comments
Labels
kind/bug Categorizes issue or PR as related to a bug. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done.

Comments

@jayunit100
Contributor

jayunit100 commented Jan 15, 2020

What steps did you take and what happened:

This is a sub-issue of #1855, because the solution to it will likely involve a liveness check :)

  • I had a cluster that had been running for a "long" time (not long-long, but about 3 days).
  • The first CAPV cluster I made on it went fine.
  • Context: a few times I did observe etcd watch failures in my management cluster.
  • The next cluster creation did not go fine: deletion events as well as creation events hung or failed.
  • Then I tried to make a new cluster via CAPV, and nothing happened.
  • Finally, I decided to try deleting the CAPBK pod. The pod did not delete easily (maybe it somehow got into a hung state); I had to use --grace-period=1.

After restarting my CAPBK pod, I found that the cluster came to life instantly, and all VMs came up with proper configs.

So I figure this is related to some kind of problem where CAPBK hangs if your API server is slow. Somehow restarting CAPBK fixes things.

I assume CAPBK isn't DoS'ing the API server, but I won't rule that out until someone tells me to stop guessing.

What did you expect to happen:

CAPBK should die (and get restarted) if it's in a hung state or otherwise not capable of doing its job.
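For reference, a minimal sketch of how that could be wired with controller-runtime's healthz support (assuming a controller-runtime version that has the manager probe support; the bind address, check name, and use of healthz.Ping are illustrative assumptions, not what CAPBK does today). A Deployment livenessProbe pointed at /healthz on that port would then let the kubelet restart a hung pod:

// Hypothetical sketch: expose a /healthz endpoint so a livenessProbe can restart a hung controller.
package main

import (
	"os"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/healthz"
)

func main() {
	// HealthProbeBindAddress is an assumed value; it just needs to match the probe's port.
	mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
		HealthProbeBindAddress: ":9440",
	})
	if err != nil {
		os.Exit(1)
	}

	// healthz.Ping always answers ok; a real check could verify the API server is reachable.
	if err := mgr.AddHealthzCheck("ping", healthz.Ping); err != nil {
		os.Exit(1)
	}

	if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
		os.Exit(1)
	}
}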

Anything else you would like to add:

I don't have logs anymore for this CAPBK instance, but I do have a witness, @fabriziopandini :)

Environment:

  • Cluster-api version: v1alpha2
  • Kubernetes version: (use kubectl version): 1.16
  • OS (e.g. from /etc/os-release): Ubuntu 18

/kind bug

@k8s-ci-robot k8s-ci-robot added the kind/bug Categorizes issue or PR as related to a bug. label Jan 15, 2020
@jayunit100
Contributor Author

I'm OK closing this, but I just wanted to go on record with a concrete use case, in case it plays into the priority of tackling the broader issue.

@ncdc ncdc added the priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. label Jan 15, 2020
@ncdc ncdc added this to the Next milestone Jan 15, 2020
@chuckha
Contributor

chuckha commented Jan 16, 2020

This sounds a lot like something @akutz ran into. He found that, I believe, client-go had an infinite timeout when connecting to the API server. Is it possible this is hanging on creating a client to a non-responsive or perhaps very slow API server as you suggest?
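For context, a minimal sketch of what a finite client-go timeout looks like (the in-cluster config and the 30-second value are arbitrary assumptions, not what CAPBK actually configures):

package main

import (
	"time"

	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/rest"
)

func newClientset() (*kubernetes.Clientset, error) {
	// Assumes the controller runs in-cluster; a kubeconfig-based rest.Config works the same way.
	cfg, err := rest.InClusterConfig()
	if err != nil {
		return nil, err
	}

	// rest.Config.Timeout defaults to zero, which means no timeout at all, so a slow or
	// unresponsive API server can leave a request hanging indefinitely. 30s is just an example.
	cfg.Timeout = 30 * time.Second

	return kubernetes.NewForConfig(cfg)
}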

@jayunit100
Contributor Author

Not sure. Plausible, though, because my infra was a little slow when I saw this.

@jayunit100
Contributor Author

jayunit100 commented Feb 5, 2020

related:

I think I just saw an issue with capv-manager, where restarting the capv-controller-manager pod immediately caused my cluster creation to resume again.

It continually says 'Successfully Reconciled' but doesn't actually create a new VM, and then restarting it creates a new VM. I saw something similar a few weeks ago where restarting CABPK actually unblocked a stopped-up management plane.

The CAPV controller manager spams that the resource is not patched

I didn't capture all the logs, but I did manage to capture this being repeated a lot in capv-controller-manager:

I0205 21:06:47.006142       1 vspheremachine_controller.go:258] capv-controller-manager/vspheremachine-controller/default/management-cluster-controlplane-0 "level"=0 "msg"="resource is not patched"  "local-resource-version"="322477" "remote-resource-version"="322477"

CABPK logs: nothing suspicious...

And I captured all the logs for CABPK during this time period here.

I0204 20:18:45.328133       1 kubeadmconfig_controller.go:138] KubeadmConfigReconciler "level"=0 "msg"="ignoring config for an already ready machine" "kubeadmconfig"={"Namespace":"default","Name":"workload-cluster-5-controlplane-0"} "machine-name"="workload-cluster-5-controlplane-0" 
I0204 20:18:45.328149       1 controller.go:242] controller-runtime/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="kubeadmconfig" "request"={"Namespace":"default","Name":"workload-cluster-5-controlplane-0"}
I0204 20:18:45.328197       1 kubeadmconfig_controller.go:138] KubeadmConfigReconciler "level"=0 "msg"="ignoring config for an already ready machine" "kubeadmconfig"={"Namespace":"default","Name":"workload-cluster-5-md-0-klgqk"} "machine-name"="workload-cluster-5-md-0-79b76cbf6-tnt4r" 
I0204 20:18:45.328349       1 controller.go:242] controller-runtime/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="kubeadmconfig" "request"={"Namespace":"default","Name":"workload-cluster-5-md-0-klgqk"}
I0204 20:18:45.328511       1 reflector.go:243] pkg/mod/k8s.io/client-go@v0.0.0-20190918200256-06eb1244587a/tools/cache/reflector.go:98: forcing resync
I0204 20:18:45.328553       1 controller.go:242] controller-runtime/controller "level"=1 "msg"="Successfully Reconciled"  "controller"="kubeadmconfig" "request"={"Namespace":"default","Name":"management-cluster-controlplane-0"}

Just FYI, if anyone is curious about this issue.

@vincepri
Member

@jayunit100 do we have any action items here, or should we close this one out?

@fabriziopandini
Member

IMO a possible action item is to check that all the client-go / controller-runtime clients have a finite timeout.
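A sketch of what that check might translate to in code (assuming the clients are built from a shared rest.Config; the 10-second value is arbitrary): set the timeout once on the config, and clients derived from it inherit it.

package main

import (
	"time"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func newClient() (client.Client, error) {
	// However the rest.Config is loaded (in-cluster, kubeconfig, ...), set a finite Timeout on it.
	cfg := ctrl.GetConfigOrDie()

	// Zero means "no timeout"; give every request a finite deadline instead.
	cfg.Timeout = 10 * time.Second

	// A controller-runtime client built from this config inherits the timeout.
	return client.New(cfg, client.Options{})
}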

@vincepri
Member

Duplicate of #2993

/close

@k8s-ci-robot
Contributor

@vincepri: Closing this issue.

In response to this:

Duplicate of #2993

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
