bgrant0607 edited this page Oct 21, 2014 · 49 revisions

Tips that may help you debug why Kubernetes isn't working.

Of course, also take a look at the documentation, especially the getting-started guides.

When asking for help, please indicate your hosting platform (GCE, Vagrant, etc.) and OS distribution (Debian, CoreOS, Fedora, etc.).

Checking logs

Depending on the Linux distribution, the logs of system components, including Docker, will be in /var/log or /tmp, or can be accessed using journalctl on systemd-based systems, such as Fedora, RHEL7, or CoreOS. Salt logs on minions are in /var/log/salt/minion.
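For example, on a systemd-based system the relevant logs can be pulled with journalctl (a sketch; unit names vary by distribution and setup):

```shell
# Logs for Docker and the kubelet on systemd-based systems
# (Fedora, RHEL7, CoreOS); unit names may differ on your distro.
journalctl -u docker
journalctl -u kubelet

# Elsewhere, look under /var/log, e.g. the Salt minion log:
tail -n 100 /var/log/salt/minion
```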

If the logs don't show anything useful, try turning on verbose logging for the Kubernetes component you suspect has a problem. See https://github.com/golang/glog for details on the logging flags.
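For example, glog's -v flag raises the overall verbosity and -vmodule overrides it per source file (an illustrative sketch; the component names and levels are examples, not a recommendation):

```shell
# Run the apiserver with verbose (level 4) logging:
apiserver -v=4

# Run the kubelet at level 2, but log kubelet source files at level 4:
kubelet -v=2 -vmodule=kubelet=4
```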

You can see what containers have been created on a node using docker ps -a.

You can see what's happening in Kubernetes with cluster/kubecfg.sh list events.

Basic

  • Ensure all backend components are running
    • on master: apiserver, controller, scheduler, etcd
      • IMPORTANT: Some older turnup instructions don't include the scheduler. Ensure the scheduler is running on the master host.
    • on nodes: proxy, kubelet, docker
  • Ensure all k8s components have --etcd_servers set correctly on the command line (if not, you should see error messages in their logs)
    • If it's not set, your networking setup may be broken, since the value is usually initialized from the IP address of kubernetes-master, as in cluster/saltbase/salt/apiserver/default

By symptom

  • dev-build-and-up.sh waits forever at Waiting for cluster initialization
    • Try cluster/kube-down.sh and hack/dev-build-and-up.sh again
      • If it still hangs, ctrl-c and try hack/dev-build-and-push.sh
      • Check whether all the VMs exist -- typically one master VM and N minions
        • If so, check whether you can ssh into them
        • Check serial console output, if available
    • If it still doesn't work, see provider-specific issues below
  • dev-build-and-up.sh reports Docker failed to install on kubernetes-minion-1
    • Verify that you can ssh into the minions
    • Check /var/log/salt/minion to see what part of the installation failed
  • SSL23_GET_SERVER_HELLO:sslv3 alert handshake failure in the minion's salt log
    • Try python -c "import urllib2; req = urllib2.Request('https://get.docker.io/gpg'); response = urllib2.urlopen(req); print response.read()". If it fails with the above message, then Docker enabled SNI on docker.io and/or redirected to docker.com (which has SNI enabled).
  • kubecfg cannot reach apiserver
    • Ensure KUBERNETES_MASTER or KUBE_MASTER_IP is set, or use -h
    • Ensure apiserver is running
      • Check that the process is running on the master
      • Check its logs
  • You were able to create a replicationController but see no pods
    • The replication controller didn't create the pods. Check that the controller is running, and look at its logs.
  • kubecfg hangs forever or a pod is in state Waiting forever
    • Check whether hosts are being assigned to your pods. If not, then they aren't being scheduled. If they are, check Kubelet and Docker logs.
    • Ensure kubelet is looking in the right place in etcd for its pods. If you see something like DEBUG: get /registry/hosts/127.0.0.1/kubelet in a kubelet's logs, then check whether the apiserver is using the same name or IP for that minion. If not, check the value of the --hostname_override command-line flag on kubelet.
    • It could also be that the image fetch is not working (e.g., because you mistyped the image name). Check Docker logs.
  • apiserver reports Error synchronizing container: Get http://:10250/podInfo?podID=foo: dial tcp :10250: connection refused
    • Just means that pod foo has not yet been scheduled (see #1285)
    • Check whether the scheduler is running properly
    • If the scheduler is running, possibly no minion addresses were passed to the apiserver using --machines (see hack/local-cluster-up.sh for an example)
  • Cannot connect to the container
    • Try to telnet to the minion at its service port, and/or to the pod's IP and port
    • Check whether the container has been created in Docker: sudo docker ps -a
      • If you don't see the container, there could be a problem with the pod configuration, image, Docker, or Kubelet
      • If you see containers created every 10 seconds, then container creation is failing or the container's process is failing
  • Why does PUT return {"kind":"Status","creationTimestamp":null,"apiVersion":"v1beta1","status":"failure","message":"replicationController \"fooController\" cannot be updated: 105: Key already exists (/registry/controllers/fooController) [25464]","reason":"conflict","details":{"id":"fooController","kind":"replicationController"},"code":409}?
    • We use resourceVersion for optimistic concurrency. The value assigned by the system at the last mutation of the object must be provided when performing an update, in order to prevent accidentally clobbering a concurrent update. kubecfg achieves this by doing a GET of the object, extracting the resourceVersion, and inserting it into the JSON of the PUT; this defeats the purpose of the concurrency control, but works for single-user scenarios.
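The telnet checks in the "Cannot connect to the container" bullet above can be scripted. A minimal sketch (the can_connect helper and the example address are hypothetical, not part of Kubernetes):

```python
import socket

def can_connect(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds."""
    try:
        sock = socket.create_connection((host, port), timeout)
        sock.close()
        return True
    except (socket.error, OSError):
        return False

# Example: probe a pod IP and service port (placeholder values):
# print(can_connect("10.244.1.3", 8080))
```

If the service port on the minion is reachable but the pod's port is not, suspect the proxy; if neither is reachable, check whether the container exists at all with docker ps -a.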
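The resourceVersion scheme described in the last bullet can be illustrated with a toy sketch. The Registry class below is a hypothetical in-memory stand-in for the apiserver's storage, not real Kubernetes code:

```python
class Conflict(Exception):
    """Raised when an update carries a stale resourceVersion (HTTP 409)."""
    pass

class Registry:
    """Toy object store with resourceVersion-based optimistic concurrency."""

    def __init__(self):
        self.objects = {}  # name -> (resourceVersion, spec)

    def get(self, name):
        return self.objects[name]

    def put(self, name, spec, resource_version=None):
        if name in self.objects:
            current_version, _ = self.objects[name]
            # Reject updates that don't carry the current version.
            if resource_version != current_version:
                raise Conflict("%s cannot be updated: version mismatch" % name)
            new_version = current_version + 1
        else:
            new_version = 1
        self.objects[name] = (new_version, spec)
        return new_version

# kubecfg-style update: GET the object, reuse its resourceVersion, then PUT.
registry = Registry()
registry.put("fooController", {"replicas": 2})
version, spec = registry.get("fooController")
registry.put("fooController", {"replicas": 3}, resource_version=version)
```

A PUT that omits the version, or reuses an old one, raises Conflict, which is the 409 response shown above.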

Build problems

make clean

or

rm -rf Godeps/_workspace/pkg output _output

Networking problems

TODO

Other provider-specific issues

TODO

GCE

  • Ensure you can ssh to an instance, which may require enabling billing and/or creating an ssh key. Create an instance if you don't have one, then use gcutil ssh to ssh into it.
  • gcutil listfirewalls ; gcutil getfirewall default-ssh
    • If default-ssh doesn't exist, do gcutil addfirewall --description "SSH allowed from anywhere" --allowed=tcp:22 default-ssh
  • gcutil listnetworks