[cAdvisor] cluster creation with v0.8.x and Kubernetes built from source fails on some hosts #1569
Comments
/assign
Setting GOFLAGS= (empty string) may fix this; I suspect it's the upstream providerless build regressing.
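For anyone who wants to try that, something like the following (a sketch only; it assumes kind forwards GOFLAGS through to the Kubernetes build and that the node image is built from a local k/k checkout):

```sh
# Sketch of the suggested workaround: clear GOFLAGS just for the node-image build.
# Assumes kind picks up the Kubernetes checkout at $(go env GOPATH)/src/k8s.io/kubernetes.
GOFLAGS="" kind build node-image
kind create cluster --image kindest/node:latest
```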
/lifecycle active
I have this reproduced at least; still determining the root cause. A side effect of the root cause is clearly that the API server never comes up, which breaks everything else.
This does not appear to be the GOFLAGS / providerless build.
Seems to work at v1.19.0-alpha.0 but not v1.19.0-alpha.1, at least on this Mac.
Started a proper bisect.
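Roughly this, for reference (a sketch of the bisect; it assumes the k/k checkout lives at `$(go env GOPATH)/src/k8s.io/kubernetes`, where kind finds it by default):

```sh
# Bisect kubernetes/kubernetes between the last known-good and first known-bad tags,
# rebuilding the node image and retrying cluster creation at each step.
cd "$(go env GOPATH)/src/k8s.io/kubernetes"
git bisect start v1.19.0-alpha.1 v1.19.0-alpha.0   # bad, then good

# Repeat at each bisect step:
kind delete cluster || true
if kind build node-image && kind create cluster --image kindest/node:latest; then
  git bisect good
else
  git bisect bad
fi
```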
On a Linux environment that builds faster but has similar issues:
Any hints in the kubelet logs?
Grabbing them now.
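For reference, two ways to pull them (assuming the default `kind-control-plane` node name):

```sh
# Dump all node logs (including kubelet.log and the systemd journal) to a local directory:
kind export logs ./kind-logs

# Or tail the kubelet unit directly inside the control-plane node container
# (assumes the default container name "kind-control-plane"):
docker exec kind-control-plane journalctl -u kubelet --no-pager | tail -n 200
```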
On a gLinux workstation:
Ahhh, I see a familiar log line; this is probably https://kubernetes.slack.com/archives/CEKK1KTN2/p1586971733013400?thread_ts=1586856163.478400&cid=CEKK1KTN2 ... I was overloaded and forgot about this :( Perhaps we're now detecting available CPU differently, which would make sense given the cAdvisor upgrade. ... great
@dims helped me hunt down kubernetes/kubernetes#89859
We're going to need a cAdvisor upgrade in k/k, which is blocked on the klog => klog v2 migration working out between the various repos ... I'm not sure if there's a good workaround in kind; possibly specifying the resources manually.
What's the ETA?
I thought go modules allowed having multiple versions of the same dependency/module in the same module.
There's no ETA currently, but I'll be pushing to get this sorted before 1.19 is released, at least.
I'm not actually clear on what went wrong here yet, just what I've been told about the state of things upstream at a high level.
This bug isn't kind specific and definitely needs to get fixed, but it's difficult to give an ETA currently. I think the problem is that cAdvisor is pulled into multiple downstream deps that are NOT module enabled, so we may have to upgrade those too; I think it switched to the klog v2 API, which they may not be prepared for, or something like that (btw, Kubernetes builds with modules disabled for ... reasons).
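On the multiple-versions question: modules can coexist across major versions (klog and klog/v2 are effectively different modules), but within one major version the whole build shares a single selected version, so every consumer has to be compatible with it. A rough way to see what's selected (a sketch; k/k itself builds with vendoring, so module commands there can behave a bit differently):

```sh
# List the module versions selected for the build and pick out klog / cAdvisor.
# In a vendored repo like k/k you may need -mod=mod for module-aware commands.
cd "$(go env GOPATH)/src/k8s.io/kubernetes"
go list -m all | grep -E 'k8s.io/klog|github.com/google/cadvisor'
```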
Not the first time I've seen cAdvisor causing issues: https://github.blog/2019-11-21-debugging-network-stalls-on-kubernetes/
Update: there are some PRs in flight regarding klog. A rollback doesn't seem to be an option; it's in too many repos and they will want to roll forward. I'm going to try to devote some more time to helping get these in soon.
Kubernetes is on klog v2 now; I haven't checked yet whether we managed to get cAdvisor updated enough.
@BenTheElder yes it did! google/cadvisor@8af10c6...6a8d614
Thanks dims. Validating the fix today 👍
I have just successfully created a cluster from the latest Kubernetes master branch. Thanks!
Thanks for confirming! FYI @howardjohn
I still see this:
This is with kind v0.8.1 and Kubernetes v1.19.0-beta.0.320+3fc7831cd8a704.
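In case it helps cross-check: the node image records the Kubernetes version it was built from, so you can confirm which build you're actually running (assuming the image still ships `/kind/version`):

```sh
# Print the Kubernetes version baked into a node image.
docker run --rm --entrypoint cat kindest/node:latest /kind/version
```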
Same issue in a different context, if it helps us narrow it down: kubernetes/kubernetes#91795
kubernetes/kubernetes#89859 is tracking the 0 CPUs being reported, though it seems there's more than one bug in the new NUMA detection in cAdvisor. There's a patch out right now that I need to vet in our environment; perhaps it also fixes it in your context? I've also been tracking down a containerd regression ...
Patch: google/cadvisor#2567
Progress on the cAdvisor patch, but it's now counting disabled cores (HT disabled) in num_cores (which Kubernetes uses), so it's not quite there yet. Once that's done it still has to get pulled into Kubernetes.
I've validated a working fix in cAdvisor in this follow-up: google/cadvisor#2579. It's not merged yet, and then we'll need to pull it into Kubernetes.
Discussed the situation we're in managing this dependency with the k8s-code-organization project today: I'm not sure we have a clear solution yet, but we at least have a pair of not-so-great options:
Pending kubernetes/kubernetes#91366, after further discussion with SIG Node recently.
kubernetes/kubernetes#91366 is poised to merge with the fix, likely within the next day (it's in the queue).
kubernetes/kubernetes#91366 merged 6 hours ago.
Confirmed that this is fixed.
What happened:
I cloned the Kubernetes repo on my dev machine (a Mac) at `$(go env GOPATH)/src/k8s.io/kubernetes`. I successfully ran `kind build node-image`, which picked up the latest Kubernetes master branch commit (0a6c826d3e92dae8f20d6199d0ac7deeca9eed71). Then I ran `kind create cluster --image kindest/node:latest`, and it failed: apparently the kubelet never replies to `GET https://kind-control-plane:6443/healthz?timeout=10s` requests.

What you expected to happen:
I expected the cluster to boot successfully.
How to reproduce it (as minimally and precisely as possible):
As explained above.
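A condensed sketch of those steps (it assumes kind picks up the checkout at `$(go env GOPATH)/src/k8s.io/kubernetes` by default):

```sh
# Build a node image from the local Kubernetes master checkout, then create a cluster from it.
cd "$(go env GOPATH)/src/k8s.io/kubernetes"
git checkout master   # commit 0a6c826d3e92dae8f20d6199d0ac7deeca9eed71 at the time of reporting
kind build node-image
kind create cluster --image kindest/node:latest
```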
Anything else we need to know?:
If I simply run `kind create cluster`, a Kubernetes v1.18.2 cluster gets created successfully. Following the logs from the node container when running `kind create cluster --image kindest/node:latest`, notice that there are some "Failed to..." messages; they show up even for the successful `kind create cluster` case.

Environment:
- kind version (use `kind version`): both `v0.8.0` and `v0.8.1`
- Kubernetes version (use `kubectl version`): kubectl is `v1.18.0`; Kubernetes is at commit 0a6c826d3e92dae8f20d6199d0ac7deeca9eed71 from master (the latest commit at the time of this writing)
- Docker version (use `docker info`): `19.03.8`
- OS (e.g. from `/etc/os-release`): Mac OS X 10.14.6