Troubleshooting

This is a guide on how to troubleshoot issues related to the Cluster API provider for vSphere (CAPV).

Debugging issues

This section describes how to debug issues that occur while trying to deploy a new cluster with clusterctl and CAPV.

Bootstrapping with logging

The first step to figuring out what went wrong is to increase the logging.

Adjusting log levels

There are three places to adjust the log level when bootstrapping a cluster.

Adjusting the CAPI manager log level

The following steps may be used to adjust the CAPI manager's log level:

  1. Open the provider-components.yaml file, ex. ./out/management-cluster/provider-components.yaml

  2. Search for cluster-api/cluster-api-controller

  3. Modify the pod spec for the CAPI manager to indicate where to send logs and the log level:

    spec:
      containers:
      - args:
        - --logtostderr
        - -v=6
        command:
        - /manager
        image: us.gcr.io/k8s-artifacts-prod/cluster-api/cluster-api-controller:v0.1.7
        name: manager

A log level of six should provide additional information useful for figuring out most issues.

Adjusting the CAPV manager log level

The following steps may be used to adjust the CAPV manager's log level:

  1. Open the provider-components.yaml file, ex. ./out/management-cluster/provider-components.yaml

  2. Search for cluster-api-provider-vsphere

  3. Modify the pod spec for the CAPV manager to indicate the log level:

    spec:
      containers:
      - args:
        - --logtostderr
        - -v=6
        command:
        - /manager
        env:
        - name: NODE_NAME
          valueFrom:
            fieldRef:
              fieldPath: spec.nodeName
        image: gcr.io/cluster-api-provider-vsphere/ci/manager:latest
        name: manager

A log level of six should provide additional information useful for figuring out most issues.
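Before re-running clusterctl, it can help to confirm that the edited manifest really carries the verbosity flag for each manager. The following is a minimal sketch, not part of CAPV or clusterctl: the function name show_log_levels is hypothetical, and it assumes the flag is written as a YAML list item of the form "- -v=N", as in the pod specs above.

```shell
# Print every "- -v=N" container argument found in a manifest, with line numbers.
# show_log_levels is a hypothetical helper name used only for this sketch.
show_log_levels() {
  # "--" ends option parsing so the pattern is never mistaken for a grep flag.
  grep -n -E -- '^[[:space:]]*- -v=[0-9]+[[:space:]]*$' "$1"
}
```

For example, show_log_levels ./out/management-cluster/provider-components.yaml should print one matching line for each manager whose log level was adjusted.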

Adjusting the clusterctl log level

The clusterctl log level may be specified when running clusterctl:

clusterctl create cluster \
  -a ./out/management-cluster/addons.yaml \
  -c ./out/management-cluster/cluster.yaml \
  -m ./out/management-cluster/machines.yaml \
  -p ./out/management-cluster/provider-components.yaml \
  --kubeconfig-out ./out/management-cluster/kubeconfig \
  --provider vsphere \
  --bootstrap-type kind \
  -v 6

The last line of the above command, -v 6, tells clusterctl to log messages at level six. This should provide additional information that may be used to diagnose issues.

Accessing the logs in the bootstrap cluster

The clusterctl logs are client-side only. The more interesting information is occurring inside of the bootstrap cluster. This section describes how to access the logs in the bootstrap cluster.

Exporting the kubeconfig

To make the subsequent steps easier, please go ahead and export a KUBECONFIG environment variable to point to the bootstrap cluster that is or will be running via Kind:

export KUBECONFIG=$(kind get kubeconfig-path --name=clusterapi)

Following the CAPI manager logs

The following command may be used to follow the logs from the CAPI manager:

$ while ! kubectl \
  -n cluster-api-system logs cluster-api-controller-manager-0 -f || \
  true; do sleep 1; done
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
I0726 18:52:10.267212       1 main.go:65] Registering Components
I0726 18:52:10.269442       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="machinedeployment-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"selector":{},"template":{"metadata":{"creationTimestamp":null},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{},"versions":{"kubelet":""}}}},"status":{}}}
I0726 18:52:10.269819       1 controller.go:121] kubebuilder/controller "level"=0 "msg"="Starting EventSource"  "controller"="machinedeployment-controller" "source"={"Type":{"metadata":{"creationTimestamp":null},"spec":{"selector":{},"template":{"metadata":{"creationTimestamp":null},"spec":{"metadata":{"creationTimestamp":null},"providerSpec":{},"versions":{"kubelet":""}}}},"status":{"replicas":0}}}

The above command immediately begins trying to follow the CAPI manager log, even before the bootstrap cluster and the CAPI manager pod exist. Once the latter is finally available, the command will start following its log.

Following the CAPV manager logs

To follow the logs from the CAPV manager, use the following command:

$ while ! kubectl \
  -n vsphere-provider-system logs vsphere-provider-controller-manager-0 \
  -f || true; do sleep 1; done
The connection to the server localhost:8080 was refused - did you specify the right host or port?
The connection to the server localhost:8080 was refused - did you specify the right host or port?
Error from server (NotFound): namespaces "vsphere-provider-system" not found
Error from server (NotFound): namespaces "vsphere-provider-system" not found
Error from server (NotFound): pods "vsphere-provider-controller-manager-0" not found
Error from server (NotFound): pods "vsphere-provider-controller-manager-0" not found
Error from server (BadRequest): container "manager" in pod "vsphere-provider-controller-manager-0" is waiting to start: ContainerCreating
Error from server (BadRequest): container "manager" in pod "vsphere-provider-controller-manager-0" is waiting to start: ContainerCreating
I0726 18:30:17.988872       1 main.go:92] Cluster-api objects are synchronized every 10m0s
I0726 18:30:17.989473       1 main.go:93] The default requeue period is 20s
I0726 18:30:18.066304       1 round_trippers.go:405] GET https://10.96.0.1:443/api?timeout=32s 200 OK in 76 milliseconds

The above command immediately begins trying to follow the CAPV manager log, even before the bootstrap cluster and the CAPV manager pod exist. Once the latter is finally available, the command will start following its log.

Following Kubernetes core component logs

Solving issues may also require accessing the logs from the bootstrap cluster's core components:

The API server:

kubectl -n kube-system logs kube-apiserver-clusterapi-control-plane -f

The controller manager:

kubectl -n kube-system logs kube-controller-manager-clusterapi-control-plane -f

The scheduler:

kubectl -n kube-system logs kube-scheduler-clusterapi-control-plane -f
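When the logs need to be attached to a bug report rather than followed live, the three commands above can be wrapped in a loop that saves a snapshot of each log. This is only a sketch: save_core_logs and the KUBECTL variable are hypothetical names introduced here, and the pod names assume the Kind cluster is named clusterapi as in the commands above.

```shell
# Save a snapshot of each core component's log from the Kind bootstrap cluster.
# KUBECTL and save_core_logs are hypothetical names introduced for this sketch.
KUBECTL="${KUBECTL:-kubectl}"

save_core_logs() (
  out_dir="${1:-.}"
  mkdir -p "$out_dir" || exit 1
  for component in kube-apiserver kube-controller-manager kube-scheduler; do
    # Pod names follow the pattern used above: <component>-clusterapi-control-plane
    "$KUBECTL" -n kube-system logs "${component}-clusterapi-control-plane" \
      > "${out_dir}/${component}.log" 2>&1 ||
      echo "could not read the ${component} log" >&2
  done
)
```

Calling save_core_logs ./bootstrap-logs would write one .log file per component into that directory, using whatever cluster KUBECONFIG currently points to.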

Common issues

This section contains issues commonly encountered by people using CAPV.

Ensure prerequisites are up to date

The Getting Started guide lists the prerequisites for deploying clusters with CAPV. Make sure those prerequisites, such as clusterctl, kubectl, kind, etc. are up to date.

Missing manifest files during bootstrap phase

If you previously tested an older version of CAPV, you may be using an out-of-date manifest Docker image. You can remedy this by removing your existing CAPV manifest image with docker rmi gcr.io/cluster-api-provider-vsphere/release/manifests:latest, or by updating the command to specify a specific manifest image, for example:

$ docker run --rm \
>   -v "$(pwd)":/out \
>   -v "$(pwd)/envvars.txt":/envvars.txt:ro \
>   gcr.io/cluster-api-provider-vsphere/release/manifests:0.5.2-alpha.1 \
>   -c management-cluster

This will ensure that the desired image is being used.

envvars.txt is a directory

When generating the YAML manifest the following error may occur:

$ docker run --rm \
>   -v "$(pwd)":/out \
>   -v "$(pwd)/envvars.txt":/envvars.txt:ro \
>   gcr.io/cluster-api-provider-vsphere/release/manifests:latest \
>   -c management-cluster
/build/hack/generate-yaml.sh: line 90: source: /envvars.txt: is a directory

This means that "$(pwd)/envvars.txt" does not refer to an existing file on the local host. Instead of bind mounting a file into the container, Docker created a new directory on the local host at the path "$(pwd)/envvars.txt" and bind mounted it into the container.

Make sure the path to the envvars.txt file is correct before using it to generate the YAML manifests.
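One way to guard against this is to test the path before invoking Docker. The following is a minimal sketch; require_file is a hypothetical helper name, not something shipped with CAPV.

```shell
# Refuse to continue unless the given path is an existing regular file.
# require_file is a hypothetical helper name used only in this sketch.
require_file() {
  if [ -d "$1" ]; then
    echo "error: $1 is a directory, not a file" >&2
    return 1
  elif [ ! -f "$1" ]; then
    echo "error: $1 does not exist" >&2
    return 1
  fi
}
```

Prefixing the generation command with require_file "$(pwd)/envvars.txt" && stops the run before Docker has a chance to create the directory. If an earlier run already created an empty envvars.txt directory, it needs to be removed before the real file can be put in its place.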

Failed to retrieve kubeconfig secret

When bootstrapping the management cluster, the vSphere manager log may emit errors similar to the following:

E0726 17:12:54.812485       1 actuator.go:217] [cluster-actuator]/cluster.k8s.io/v1alpha1/default/v0.4.0-beta.2 "msg"="target cluster is not ready" "error"="unable to get client for target cluster: failed to retrieve kubeconfig secret for Cluster \"management-cluster\" in namespace \"default\": secret not found"

The above error does not mean there is a problem. Kubernetes components operate in a reconciliation model -- a control loop attempts to reconcile the desired state over and over until it is achieved or a timeout occurs.

The error message simply indicates that the first control plane node for the target cluster has not yet come online and provided the information necessary to generate the kubeconfig for the target cluster.

It is quite typical to see many errors in Kubernetes service logs, from the API server, to the controller manager, to the kubelet -- the errors eventually clear as the expected configurations are provided and the desired state is reached.
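As a rough analogy for the reconciliation model, the loop below retries a command until it succeeds or an attempt budget runs out, logging a "not ready" message on every failed pass. It is only an illustration: real controllers are not shell scripts, and the names reconcile_until and RECONCILE_DELAY are hypothetical.

```shell
# Retry a command until it succeeds or the attempt budget is exhausted.
# reconcile_until and RECONCILE_DELAY are hypothetical names for this sketch.
reconcile_until() (
  max_attempts="$1"
  shift
  attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    if "$@"; then
      echo "desired state reached after ${attempt} attempt(s)"
      exit 0
    fi
    # A failed pass is logged as an error even though the loop simply tries again.
    echo "attempt ${attempt}: not ready yet, will retry" >&2
    attempt=$((attempt + 1))
    sleep "${RECONCILE_DELAY:-1}"
  done
  echo "timed out after ${max_attempts} attempts" >&2
  exit 1
)
```

Errors printed by the early passes are expected noise; only the final timeout indicates that something actually needs attention.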

Timed out while failing to retrieve kubeconfig secret

When clusterctl times out waiting for the management cluster to come online, and the vSphere manager log repeats failed to retrieve kubeconfig secret for Cluster over and over again, it means there was an error bringing the management cluster's first control plane node online. Possible reasons include:

Cannot access the vSphere endpoint

Two common causes for a failed deployment are related to accessing the remote vSphere endpoint:

  1. The host from which clusterctl is executed cannot reach the vSphere endpoint to which the management cluster is being deployed.
  2. The provided vSphere credentials are invalid.

A quick way to validate both access and the credentials is using the program govc or its container image, vmware/govc:

# Define the vSphere's endpoint and access information.
$ export GOVC_URL="myvcenter.com" GOVC_USERNAME="username" GOVC_PASSWORD="password"

# Use "govc" to list the contents of the vSphere endpoint.
$ docker run --rm \
  -e GOVC_URL -e GOVC_USERNAME -e GOVC_PASSWORD \
  vmware/govc \
  ls -k
/my-datacenter/vm
/my-datacenter/network
/my-datacenter/host
/my-datacenter/datastore

If the above command fails then there is an issue with accessing the vSphere endpoint, and it must be corrected before clusterctl will succeed.

A VM with the same name already exists

Deployed VMs get their names from the names of the machines in machines.yaml and machineset.yaml. If a VM with the same name already exists in the same location as one of the VMs that would be created by a new cluster, then the new cluster will fail to deploy and the CAPV manager log will include an error similar to the following:

I0726 18:52:48.920975       1 util.go:195] default-logger/cluster.k8s.io/v1alpha1/default/v0.4.0-beta.2/management-cluster-controlplane-1/task-231288 "level"=2 "msg"="task failed"  "description-id"="VirtualMachine.clone"

Use the govc image to check to see if there is a VM with the same name:

$ docker run --rm \
  -e GOVC_URL -e GOVC_USERNAME -e GOVC_PASSWORD \
  vmware/govc \
  vm.info -k management-cluster-controlplane-1
Name:           management-cluster-controlplane-1
  Path:         /my-datacenter/vm/management-cluster-controlplane-1
  UUID:         4230a650-c92a-d99a-d9f7-fa2fd770e536
  Guest name:   Other 3.x or later Linux (64-bit)
  Memory:       2048MB
  CPU:          2 vCPU(s)
  Power state:  poweredOn
  Boot time:    <nil>
  IP address:   5.6.7.8
  Host:         1.2.3.4

A static IP address must include the segment length

Another common error is to omit the segment length when using a static IP address. The following example shows a correct configuration:

network:
  devices:
  - networkName: "sddc-cgw-network-6"
    gateway4: 192.168.6.1
    ipAddrs:
    - 192.168.6.20/24
    nameservers:
    - 1.1.1.1
    - 1.0.0.1

The above network configuration defines a static IP address, 192.168.6.20, and includes the required segment length, /24. Without the segment length, clusterctl will time out waiting for the control plane to come online.
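A quick check of each ipAddrs entry before deploying can catch this mistake early. The sketch below is an assumption-laden helper, not part of CAPV: has_segment_length is a hypothetical name, and it only verifies that an IPv4-style entry ends in a numeric suffix between 0 and 32.

```shell
# Succeed only if the address carries a numeric IPv4 segment length (0-32).
# has_segment_length is a hypothetical helper name for this sketch.
has_segment_length() {
  case "$1" in
    */*) seg_len="${1##*/}" ;;   # keep only the text after the last slash
    *)   return 1 ;;             # no slash at all: the segment length is missing
  esac
  case "$seg_len" in
    '' | *[!0-9]*) return 1 ;;   # empty or non-numeric suffix
  esac
  [ "$seg_len" -le 32 ]
}
```

With this helper, has_segment_length 192.168.6.20/24 succeeds, while has_segment_length 192.168.6.20 fails.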

Multiple networks

A machine with multiple networks may cause the bootstrap process to fail for various reasons.

Multiple default routes

A machine that defines two networks may lead to failure if both networks use DHCP and two default routes are defined on the guest. For example:

network:
  devices:
  - networkName: "sddc-cgw-network-5"
    dhcp4: true
  - networkName: "sddc-cgw-network-6"
    dhcp4: true

The above network configuration from a machine definition includes two network devices, both using DHCP. This likely causes two default routes to be defined on the guest, meaning it's not possible to determine the default IPv4 address that should be used by Kubernetes.
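If the guest is reachable, the number of default routes can be counted from its routing table. The following sketch assumes the guest has the iproute2 ip command installed; count_default_routes is a hypothetical helper name that reads the output of ip -4 route show on standard input.

```shell
# Count IPv4 default routes in "ip -4 route show" output read from stdin.
# count_default_routes is a hypothetical helper name for this sketch.
count_default_routes() {
  # Default routes are the lines that start with the word "default".
  grep -c '^default '
}
```

On the guest, running ip -4 route show | count_default_routes and getting a value greater than 1 would suggest the situation described above.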

Preferring an IP address

Another reason a machine with two networks can lead to failure is that the order in which IP addresses are returned externally from a VM is not guaranteed to match the order seen when they are inspected inside the guest. The solution is to define a preferred CIDR -- the network segment that contains the IP that the kubeadm bootstrap process selected for the API server. For example:

network:
  preferredAPIServerCidr: "192.168.5.0/24"
  devices:
  - networkName: "sddc-cgw-network-6"
    ipAddrs:
    - 192.168.6.20/24
  - networkName: "sddc-cgw-network-5"
    dhcp4: true

The above network definition specifies the CIDR that contains the IP address bound to the Kubernetes API server on the guest.
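To double-check that a given address really falls inside the preferred CIDR, the membership test can be done with plain shell arithmetic. This is a minimal sketch for IPv4 only; ip_to_int and ip_in_cidr are hypothetical helper names, it assumes a shell with 64-bit arithmetic such as bash, and 192.168.5.17 in the usage note is an invented DHCP address.

```shell
# Convert a dotted-quad IPv4 address to a 32-bit integer.
# ip_to_int and ip_in_cidr are hypothetical helper names for this sketch.
ip_to_int() (
  IFS=.
  set -- $1   # split the address into its four octets
  echo $(( ($1 << 24) | ($2 << 16) | ($3 << 8) | $4 ))
)

# Succeed if IPv4 address $1 falls inside CIDR $2 (for example 192.168.5.0/24).
ip_in_cidr() (
  addr=$(ip_to_int "$1")
  net=$(ip_to_int "${2%/*}")
  bits="${2#*/}"
  # Build the netmask from the segment length, truncated to 32 bits.
  mask=$(( (0xFFFFFFFF << (32 - bits)) & 0xFFFFFFFF ))
  [ $(( addr & mask )) -eq $(( net & mask )) ]
)
```

With the configuration above, ip_in_cidr 192.168.6.20 192.168.5.0/24 fails, so the static address on sddc-cgw-network-6 is not the preferred one, whereas a DHCP lease such as 192.168.5.17 on sddc-cgw-network-5 would match.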

Network Time Protocol (NTP) related problems causing Kubernetes CA related problems

During the bootstrapping process a CA certificate is transferred to the new VM. This CA certificate has a "not valid before" date associated with it. If the ESXi host does not have NTP properly configured, the VM's clock may be behind that date, and the kubeadm bootstrapping process may fail with an error similar to the following in /var/log/cloud-init-output.log on the VM:

[certs] Using certificateDir folder "/etc/kubernetes/pki"
error execution phase certs/ca: failure loading ca certificate: failed to load certificate: the certificate is not valid yet

The solution is to either properly configure NTP on the ESXi host (see the VMware KB article "Configuring Network Time Protocol (NTP) on an ESXi host using the vSphere Web Client (57147)") or add an NTP server block to the KubeadmConfig:

spec:
  ntp:
    enabled: true
    servers:
      - 192.168.2.1

Machine object stuck in a provisioning state

This section discusses issues that can cause a Machine object to be stuck in a provisioning state.

kubectl get machine
NAME                             PROVIDERID   PHASE
capi-quickstart-controlplane-0                provisioning

To troubleshoot this type of scenario, the capv-controller-manager logs are a good starting point. These logs can be retrieved using kubectl logs capv-controller-manager-88f646758-nj8fs -n capv-system

VM folder does not exist

One scenario in which a Machine object fails to provision and is stuck in a provisioning state is when the VM folder specified in the manifest does not exist. The following error messages can be seen in the capv-controller-manager logs:

kubectl logs capv-controller-manager-88f646758-nj8fs -n capv-system

I0219 15:15:28.802698       1 vspheremachine_controller.go:249] capv-controller-manager/vspheremachine-controller/default/capi-quickstart-controlplane-0 "level"=0 "msg"="resource patch was not required"  "local-resource-version"="63276" "remote-resource-version"="63276"
E0219 15:15:28.802789       1 controller.go:218] controller-runtime/controller "msg"="Reconciler error" "error"="failed to reconcile VM: unable to get folder for \"infrastructure.cluster.x-k8s.io/v1alpha2, Kind=VSphereCluster default/capi-quickstart/capi-quickstart-controlplane-0\": folder 'clusterapiVM' not found"  "controller"="vspheremachine" "request"={"Namespace":"default","Name":"capi-quickstart-controlplane-0"}

To resolve this error, create a VM folder with the name specified in the manifest. This can be done using the vCenter UI or govc. For example, for the error above, govc folder.create /Datacenter/vm/clusterapiVM resolves the issue.