"failed to init node with kubeadm" when using new base image #884

Closed
howardjohn opened this issue Sep 30, 2019 · 9 comments
Labels
kind/bug (Categorizes issue or PR as related to a bug.) · kind/support (Categorizes issue or PR as a support question.)

Comments

howardjohn (Contributor) commented Sep 30, 2019

What happened:
kind create cluster fails with Error: failed to create cluster: failed to init node with kubeadm: exit status 1

What you expected to happen:
kind create cluster does not fail

How to reproduce it (as minimally and precisely as possible):

Applying this pod to a Kubernetes cluster should do it:

apiVersion: v1
kind: Pod
metadata:
  annotations:
    prow.k8s.io/job: e2e_cni
    testgrid-dashboards: istio_cni
  creationTimestamp: null
  labels:
    created-by-prow: "true"
    prow.k8s.io/id: 61509b6f-e324-11e9-a948-e86a6428716e
    prow.k8s.io/job: e2e_cni
    prow.k8s.io/refs.org: istio
    prow.k8s.io/refs.pull: "183"
    prow.k8s.io/refs.repo: cni
    prow.k8s.io/type: presubmit
  name: 61509b6f-e324-11e9-a948-e86a6428716e
spec:
  automountServiceAccountToken: false
  containers:
  - command:
    - /tools/entrypoint
    env:
    - name: GOPROXY
      value: https://proxy.golang.org
    - name: BUILD_WITH_CONTAINER
      value: "0"
    - name: ARTIFACTS
      value: /logs/artifacts
    - name: BUILD_ID
    - name: BUILD_NUMBER
    - name: GOPATH
      value: /home/prow/go
    - name: JOB_NAME
      value: e2e_cni
    - name: JOB_SPEC
      value: '{"type":"presubmit","job":"e2e_cni","prowjobid":"61509b6f-e324-11e9-a948-e86a6428716e","refs":{"org":"istio","repo":"cni","base_ref":"master","base_sha":"774dea7a1d3872bff22c7dd06fd10a16be552301","pulls":[{"number":183,"author":"sdake","sha":"0955af677b7da69c0ed3af34c5c1eabe4e0c4167"}],"path_alias":"istio.io/cni"},"extra_refs":[{"org":"istio","repo":"istio","base_ref":"master","path_alias":"istio.io/istio"}]}'
    - name: JOB_TYPE
      value: presubmit
    - name: PROW_JOB_ID
      value: 61509b6f-e324-11e9-a948-e86a6428716e
    - name: PULL_BASE_REF
      value: master
    - name: PULL_BASE_SHA
      value: 774dea7a1d3872bff22c7dd06fd10a16be552301
    - name: PULL_NUMBER
      value: "183"
    - name: PULL_PULL_SHA
      value: 0955af677b7da69c0ed3af34c5c1eabe4e0c4167
    - name: PULL_REFS
      value: master:774dea7a1d3872bff22c7dd06fd10a16be552301,183:0955af677b7da69c0ed3af34c5c1eabe4e0c4167
    - name: REPO_NAME
      value: cni
    - name: REPO_OWNER
      value: istio
    - name: ENTRYPOINT_OPTIONS
      value: '{"timeout":7200000000000,"grace_period":15000000000,"artifact_dir":"/logs/artifacts","args":["entrypoint","make","prow-e2e"],"process_log":"/logs/process-log.txt","marker_file":"/logs/marker-file.txt","metadata_file":"/logs/artifacts/metadata.json"}'
    image: gcr.io/istio-testing/build-tools:2019-09-29T15-31-13
    name: test
    resources:
      limits:
        cpu: "3"
        memory: 24Gi
      requests:
        cpu: 500m
        memory: 3Gi
    securityContext:
      privileged: true
    volumeMounts:
    - mountPath: /lib/modules
      name: modules
      readOnly: true
    - mountPath: /sys/fs/cgroup
      name: cgroup
    - mountPath: /logs
      name: logs
    - mountPath: /tools
      name: tools
    - mountPath: /home/prow/go
      name: code
    workingDir: /home/prow/go/src/istio.io/cni
  - command:
    - /sidecar
    env:
    - name: JOB_SPEC
      value: '{"type":"presubmit","job":"e2e_cni","prowjobid":"61509b6f-e324-11e9-a948-e86a6428716e","refs":{"org":"istio","repo":"cni","base_ref":"master","base_sha":"774dea7a1d3872bff22c7dd06fd10a16be552301","pulls":[{"number":183,"author":"sdake","sha":"0955af677b7da69c0ed3af34c5c1eabe4e0c4167"}],"path_alias":"istio.io/cni"},"extra_refs":[{"org":"istio","repo":"istio","base_ref":"master","path_alias":"istio.io/istio"}]}'
    - name: SIDECAR_OPTIONS
      value: '{"gcs_options":{"items":["/logs/artifacts"],"bucket":"istio-prow","path_strategy":"explicit","local_output_dir":"/output","dry_run":false},"entries":[{"args":["entrypoint","make","prow-e2e"],"process_log":"/logs/process-log.txt","marker_file":"/logs/marker-file.txt","metadata_file":"/logs/artifacts/metadata.json"}]}'
    image: gcr.io/k8s-prow/sidecar:v20190927-a9e239ad8
    name: sidecar
    resources: {}
    volumeMounts:
    - mountPath: /logs
      name: logs
    - mountPath: /output
      name: output
  initContainers:
  - command:
    - /clonerefs
    env:
    - name: CLONEREFS_OPTIONS
      value: '{"src_root":"/home/prow/go","log":"/logs/clone.json","git_user_name":"ci-robot","git_user_email":"ci-robot@k8s.io","refs":[{"org":"istio","repo":"cni","base_ref":"master","base_sha":"774dea7a1d3872bff22c7dd06fd10a16be552301","pulls":[{"number":183,"author":"sdake","sha":"0955af677b7da69c0ed3af34c5c1eabe4e0c4167"}],"path_alias":"istio.io/cni"},{"org":"istio","repo":"istio","base_ref":"master","path_alias":"istio.io/istio"}]}'
    image: gcr.io/k8s-prow/clonerefs:v20190927-a9e239ad8
    name: clonerefs
    resources: {}
    volumeMounts:
    - mountPath: /logs
      name: logs
    - mountPath: /home/prow/go
      name: code
  - command:
    - /initupload
    env:
    - name: INITUPLOAD_OPTIONS
      value: '{"bucket":"istio-prow","path_strategy":"explicit","local_output_dir":"/output","dry_run":false,"log":"/logs/clone.json"}'
    - name: JOB_SPEC
      value: '{"type":"presubmit","job":"e2e_cni","prowjobid":"61509b6f-e324-11e9-a948-e86a6428716e","refs":{"org":"istio","repo":"cni","base_ref":"master","base_sha":"774dea7a1d3872bff22c7dd06fd10a16be552301","pulls":[{"number":183,"author":"sdake","sha":"0955af677b7da69c0ed3af34c5c1eabe4e0c4167"}],"path_alias":"istio.io/cni"},"extra_refs":[{"org":"istio","repo":"istio","base_ref":"master","path_alias":"istio.io/istio"}]}'
    image: gcr.io/k8s-prow/initupload:v20190927-a9e239ad8
    name: initupload
    resources: {}
    volumeMounts:
    - mountPath: /logs
      name: logs
    - mountPath: /output
      name: output
  - args:
    - /entrypoint
    - /tools/entrypoint
    command:
    - /bin/cp
    image: gcr.io/k8s-prow/entrypoint:v20190927-a9e239ad8
    name: place-entrypoint
    resources: {}
    volumeMounts:
    - mountPath: /tools
      name: tools
  restartPolicy: Never
  volumes:
  - hostPath:
      path: /lib/modules
      type: Directory
    name: modules
  - hostPath:
      path: /sys/fs/cgroup
      type: Directory
    name: cgroup
  - emptyDir: {}
    name: logs
  - emptyDir: {}
    name: tools
  - hostPath:
      path: /tmp/prowjob-out-e2e_cni-189561415
    name: output
  - emptyDir: {}
    name: code
status: {}

Anything else we need to know?:
We are running this in Prow on GKE. The only variable compared to our other setups (we currently run everything in kind with no issues) is the image we are using. With the previous image we run service docker start and then kind create cluster; with the new image we run daemon -U -- dockerd -s=vfs instead.
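In shell terms, the difference is roughly this (a sketch, not our exact entrypoint scripts):

# Previous image: start the Docker service, then bring up the cluster
service docker start
kind create cluster

# New image: run dockerd in the background with the vfs storage driver
daemon -U -- dockerd -s=vfs
kind create cluster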

Environment:

  • kind version: (use kind version): 0.5.1
  • Kubernetes version: (use kubectl version): Prow runs on GKE 1.13; the kind clusters run Kubernetes 1.15
  • Docker version: (use docker info): 19.03.2
  • OS (e.g. from /etc/os-release): COS (Container-Optimized OS)

Dump of logs: kind.tar.gz

Interesting logs:

Sep 30 01:51:36 istio-testing-control-plane kubelet[131]: F0930 01:51:36.497515     131 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory
Sep 30 01:51:36 istio-testing-control-plane containerd[136]: time="2019-09-30T01:51:36.505924860Z" level=error msg="Failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"
Sep 30 01:51:39 istio-testing-control-plane containerd[136]: time="2019-09-30T01:51:39.903081575Z" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:5823a8d719ed767a806f3685c8c4ccaaba0bc9d968252640b01c6a9d517b3618: failed to mount /var/lib/containerd/tmpmounts/containerd-mount955794128: invalid argument: unknown" key="extract-882784908-NUE7 sha256:053f1d6024bdf8b23b06c50930f89a88ee199316243636b46b98bdce14e2378b"
Sep 30 01:51:39 istio-testing-control-plane containerd[136]: time="2019-09-30T01:51:39.907893721Z" level=error msg="PullImage \"k8s.gcr.io/kube-controller-manager:v1.15.3\" failed" error="failed to pull and unpack image \"k8s.gcr.io/kube-controller-manager:v1.15.3\": failed to unpack image on snapshotter overlayfs: failed to extract layer sha256:5823a8d719ed767a806f3685c8c4ccaaba0bc9d968252640b01c6a9d517b3618: failed to mount /var/lib/containerd/tmpmounts/containerd-mount955794128: invalid argument: unknown"

Any help debugging this would be appreciated.

We can run docker run hello-world from within the pod.

howardjohn added the kind/bug label on Sep 30, 2019
BenTheElder (Member) commented:

To be clear, this is not a kind base image; this is the image you are running kind inside of (in a GKE pod?).

BenTheElder (Member) commented:

Can you share the podspec / image?

Is there a reason you are using vfs instead of overlay on an emptyDir or similar?

BenTheElder added the kind/support label on Sep 30, 2019
howardjohn (Contributor, Author) commented:

Yes, this is the image in GKE; we are not using a special kind base image. The pod spec is in the issue (collapsed).

I'm not exactly sure why vfs is used. I can try not using it.

BenTheElder (Member) commented:

Sep 30 01:51:36 istio-testing-control-plane kubelet[131]: F0930 01:51:36.497515 131 server.go:198] failed to load Kubelet config file /var/lib/kubelet/config.yaml, error failed to read kubelet config file "/var/lib/kubelet/config.yaml", error: open /var/lib/kubelet/config.yaml: no such file or directory

Sep 30 01:51:36 istio-testing-control-plane containerd[136]: time="2019-09-30T01:51:36.505924860Z" level=error msg="Failed to load cni during init, please check CRI plugin status before setting up network for pods" error="cni config load failed: no network config found in /etc/cni/net.d: cni plugin not initialized: failed to load cni config"

These are both normal on any cluster using kubeadm; these files simply haven't been written yet, and kubelet etc. are crash-looping while they wait. It's part of the design of kubeadm.

Sep 30 01:51:39 istio-testing-control-plane containerd[136]: time="2019-09-30T01:51:39.903081575Z" level=info msg="apply failure, attempting cleanup" error="failed to extract layer sha256:5823a8d719ed767a806f3685c8c4ccaaba0bc9d968252640b01c6a9d517b3618: failed to mount /var/lib/containerd/tmpmounts/containerd-mount955794128: invalid argument: unknown" key="extract-882784908-NUE7 sha256:053f1d6024bdf8b23b06c50930f89a88ee199316243636b46b98bdce14e2378b"
Sep 30 01:51:39 istio-testing-control-plane containerd[136]: time="2019-09-30T01:51:39.9078937

This looks like the actual issue and is likely due to switching to vfs (which Docker does not recommend for production use, and which behaves quite differently from the other drivers).

Looking at your podspec, it looks like the Docker data root is inside the (pod) container filesystem. This is probably going to be wildly slow, especially when combined with vfs; overlay(2) is generally the recommended docker/containerd storage driver.

#303 -- the note about /var/lib/docker applies.

volumeMounts:
- name: docker-root 
  mountPath: /var/lib/docker
volumes:
- name: docker-root
  emptyDir: {}
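For the storage-driver half, a rough sketch of launching dockerd without vfs (dockerd defaults to overlay2 on most modern kernels, or you can pass the driver explicitly; the daemon wrapper is just whatever your image already uses):

# instead of: daemon -U -- dockerd -s=vfs
daemon -U -- dockerd --storage-driver=overlay2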

BenTheElder self-assigned this on Sep 30, 2019
howardjohn (Contributor, Author) commented:

Ok, so mounting /var/lib/docker and turning off vfs fixes this. I think the /var/lib/docker mount is the actual fix; removing vfs just seems like a good idea in general, since removing vfs alone didn't seem to fix it.

I think this is because /var/lib/docker doesn't actually exist in the new image, presumably due to some aggressive optimizations of the Docker image size. Not sure if this will cause other problems, but it seems like it might.

We actually never mount /var/lib/docker to an emptyDir; I will try this on our existing tests and see whether it affects performance.

Thanks yet again for your help!

BenTheElder (Member) commented Sep 30, 2019

Ok, so mounting /var/lib/docker and turning off vfs fixes this. I think the /var/lib/docker mount is the actual fix; removing vfs just seems like a good idea in general, since removing vfs alone didn't seem to fix it.

ACK, we probably had overlay on vfs on overlay(2) 😅 (or roughly: overlay on overlay, which is not going to work).

I think this is because /var/lib/docker doesn't actually exist in the new image, presumably due to some aggressive optimizations of the Docker image size. Not sure if this will cause other problems, but it seems like it might.

The directory / volume itself shouldn't affect the image size materially.

We actually never mount /var/lib/docker to an emptyDir; I will try this on our existing tests and see whether it affects performance.

If it is declared as a VOLUME you should get roughly the equivalent without adding an emptyDir to the pod, I think (?)
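For illustration only, declaring it in the image's Dockerfile would look roughly like this (the base image here is a stand-in, not the actual build-tools base):

FROM alpine:3.10
# ... rest of the image setup (docker, kind, tooling) ...
# Declare the Docker data root as an anonymous volume;
# roughly what the emptyDir mount gives you on the pod side
VOLUME /var/lib/docker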

Thanks yet again for your help!

Thanks for the detailed issue!

cippaciong commented:

Hello, I'm not sure if the errors are related, but I'm facing the same error message when trying to create a cluster on my laptop.
I don't have any special configuration; I just installed kind 0.5.1 and tried to create a cluster with kind create cluster.
Here are the logs: kind_logs.tar.gz
Please let me know if you need additional information.

BenTheElder (Member) commented:

@cippaciong please file a new support issue to track this; it's unlikely to be related to what we discussed here.

Have you also checked https://kind.sigs.k8s.io/docs/user/known-issues/ ?

cippaciong commented:

@BenTheElder Thanks, I checked the known issues and moved from btrfs to overlay2, but the error persists. I opened a new issue: #889
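(A quick way to confirm which storage driver Docker is actually using, in case it helps others hitting this; docker info reports it.)

# prints the active storage driver, e.g. overlay2, btrfs, or vfs
docker info --format '{{.Driver}}'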
