vSphere cloud provider fails to attach a volume due to Unable to find VM by UUID #58927

Closed
Xuxe opened this issue Jan 28, 2018 · 37 comments · Fixed by #59519
Labels
kind/bug Categorizes issue or PR as related to a bug. sig/storage Categorizes an issue or PR as relevant to SIG Storage.

Comments


Xuxe commented Jan 28, 2018

Is this a BUG REPORT or FEATURE REQUEST?:
/kind bug

What happened:
I'm trying to deploy a Kubernetes v1.9.2 cluster on vSphere with the vSphere-Cloud-Provider.
All works fine except the persistent volume attachment.
With Kubernetes v1.8.7 the attachment works fine on the same VMs and vSphere environment.

What you expected to happen:
The volume attachment should work, without errors such as "Cannot find node "kbnnode01" in cache. Node not found!!!" or "[datacenter.go:78] Unable to find VM by UUID. VM UUID: f7f53642-5cc2-ced1-37f1-c6b04522a27e" appearing in the kube-controller-manager log.

How to reproduce it (as minimally and precisely as possible):
Deploy a Kubernetes cluster via Kubespray on vSphere, based on the official Kubernetes docs and Kubespray docs.

Then try to deploy the official vSphere-Cloud-Provider test pods (persistent volume) provided here:
https://github.com/kubernetes/kubernetes/tree/master/examples/volumes/vsphere

Anything else we need to know?:
With Kubernetes v1.8.7 the volume attachment works on the same VMs and vSphere environment.
Both the CoreOS and vanilla Hyperkube images from Google are affected.
The issue remains after I added the VM UUID to the cloud config.

Based on the VMware docs for the vSphere-Cloud-Provider (https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/existing.html), the provider should pick the UUID from /sys/class/dmi/id/product_serial if vm-uuid is unset or empty. But this ID is different from the IDs in my logs, so it can't find the VM in vSphere.
product_serial from kbnnode01 is 42365F38-CF20-C79C-80D3-52363D75A0EF and from the logs it is
f7f53642-5cc2-ced1-37f1-c6b04522a27e

Environment:

  • Kubernetes version (use kubectl version): v1.9.2
  • Cloud provider or hardware configuration: vSphere-Cloud-Provider on ESXi 6.5 and vCenter 6.5 with latest updates
  • OS (e.g. from /etc/os-release): Ubuntu 16.04
  • Kernel (e.g. uname -a): Linux kbnmaster01 4.4.0-112-generic
  • Install tools: Kubespray
  • Others: Used the RBAC role provided in issue #57279 (vSphere cloud provider broken with controller-manager v1.9.0) to fix the "Failed to list *v1.Node: nodes is forbidden" error

My cloud config:

[Global]
datacenter = "Falkenstein"
datastore = "datastore1"
insecure-flag = 1
password = "SECRET"
port = 443
server = "vcenter.xnet.local"
user = "kubernetes_svc@vsphere.local"
working-dir = "/Falkenstein/vm/Kubernetes/"
vm-uuid =

[Disk]
scsicontrollertype = pvscsi
@k8s-ci-robot added the needs-sig, kind/bug, and sig/storage labels and removed the needs-sig label on Jan 28, 2018
@Xuxe changed the title from "vSphere Cloud Provider fails to attach a volume due to Unable to find VM by UUID" to "vSphere cloud provider fails to attach a volume due to Unable to find VM by UUID" on Jan 29, 2018

divyenpatel commented Jan 29, 2018

@Xuxe vm-uuid in vsphere.conf is no longer used in the Kubernetes 1.9.2 release. We removed the unused code in PR #58230.

You should remove vm-uuid from the vsphere.conf file.

Can you provide the output of the following?

kubectl describe node kbnnode01 | grep  "System UUID"

From the kbnnode01 VM, provide the contents of the following file.

# cat /sys/class/dmi/id/product_serial

For your reference, look at the following code
https://github.com/kubernetes/kubernetes/blob/v1.9.2/pkg/cloudprovider/providers/vsphere/nodemanager.go#L77

nodeUUID := node.Status.NodeInfo.SystemUUID

Later we find the VM using this UUID.
https://github.com/kubernetes/kubernetes/blob/v1.9.2/pkg/cloudprovider/providers/vsphere/nodemanager.go#L165

vm, err := res.datacenter.GetVMByUUID(ctx, nodeUUID)

So nodeUUID should match the UUID set in /sys/class/dmi/id/product_serial; otherwise we will not be able to find the node VM.
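
For illustration, here is a minimal sketch of what that lookup amounts to, written against the govmomi library (which the cloud provider uses underneath). The vCenter URL, credentials, datacenter, and UUID are placeholders taken from this thread; this is not the provider's exact code.

```
// lookup_by_uuid.go - hedged sketch: find a node VM in vCenter by its BIOS UUID.
package main

import (
	"context"
	"fmt"
	"net/url"

	"github.com/vmware/govmomi"
	"github.com/vmware/govmomi/find"
	"github.com/vmware/govmomi/object"
)

func main() {
	ctx := context.Background()

	// Placeholder vCenter endpoint and credentials from the cloud config above.
	u, _ := url.Parse("https://kubernetes_svc%40vsphere.local:SECRET@vcenter.xnet.local/sdk")
	c, err := govmomi.NewClient(ctx, u, true) // true mirrors insecure-flag = 1
	if err != nil {
		panic(err)
	}

	finder := find.NewFinder(c.Client, true)
	dc, err := finder.Datacenter(ctx, "Falkenstein")
	if err != nil {
		panic(err)
	}

	// The node's SystemUUID as reported by `kubectl describe node`.
	nodeUUID := "F7F53642-5CC2-CED1-37F1-C6B04522A27E"

	// FindByUuid with instanceUuid=nil matches the BIOS UUID (uuid.bios),
	// which is why this value has to line up with what vCenter knows.
	si := object.NewSearchIndex(c.Client)
	ref, err := si.FindByUuid(ctx, dc, nodeUUID, true, nil)
	if err != nil || ref == nil {
		fmt.Println("Unable to find VM by UUID:", nodeUUID)
		return
	}
	fmt.Println("Found VM:", ref.Reference().Value)
}
```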

Regarding the kubespray documentation (https://github.com/kubernetes-incubator/kubespray/blob/master/docs/vsphere.md): it is not applicable to the 1.9 releases.

Please refer to https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/existing.html for up-to-date documentation.

@divyenpatel (Member)

/assign divyenpatel


Xuxe commented Jan 29, 2018

Hey @divyenpatel! Thanks so far :)

I already found the deprecated "vm-uuid" in an issue while browsing the VMware Kubernetes repo and updated my config. But the issue persists.

[Global]
datacenters = "Falkenstein"
insecure-flag = 1
password = "SECRET"
port = 443
user = "kubernetes_svc@vsphere.local"

[VirtualCenter "vcenter.xnet.local"]
[Workspace]
datacenter = "Falkenstein"
server = "vcenter.xnet.local"
folder = "Kubernetes"
default-datastore = "datastore1"
resourcepool-path = "K8s-Pool"

[Disk]
scsicontrollertype = pvscsi

The outputs you requested:

Worker:

root@kbnmaster01:~# kubectl describe node kbnnode01 | grep  "System UUID"
 System UUID:                F7F53642-5CC2-CED1-37F1-C6B04522A27E
root@kbnnode01:~# cat /sys/class/dmi/id/product_serial
VMware-42 36 f5 f7 c2 5c d1 ce-37 f1 c6 b0 45 22 a2 7e

Master:

root@kbnmaster01:~# kubectl describe node kbnmaster01 | grep  "System UUID"
 System UUID:                385F3642-20CF-9CC7-80D3-52363D75A0EF
root@kbnmaster01:~# cat /sys/class/dmi/id/product_serial
VMware-42 36 5f 38 cf 20 c7 9c-80 d3 52 36 3d 75 a0 ef

As you can see, the IDs are different. Am I missing something here?

@divyenpatel (Member)

@Xuxe we need to figure out how the nodes are getting registered with this UUID.
Can you share the parameters kubespray sets for the kubelet service on the nodes?


Xuxe commented Jan 29, 2018

@divyenpatel

Here you are!

The systemd file:

[Unit]
Description=Kubernetes Kubelet Server
Documentation=https://github.com/GoogleCloudPlatform/kubernetes
After=docker.service
Wants=docker.socket

[Service]
EnvironmentFile=-/etc/kubernetes/kubelet.env
ExecStartPre=-/bin/mkdir -p /var/lib/kubelet/volume-plugins
ExecStart=/usr/local/bin/kubelet \
                $KUBE_LOGTOSTDERR \
                $KUBE_LOG_LEVEL \
                $KUBELET_API_SERVER \
                $KUBELET_ADDRESS \
                $KUBELET_PORT \
                $KUBELET_HOSTNAME \
                $KUBE_ALLOW_PRIV \
                $KUBELET_ARGS \
                $DOCKER_SOCKET \
                $KUBELET_NETWORK_PLUGIN \
                $KUBELET_VOLUME_PLUGIN \
                $KUBELET_CLOUDPROVIDER
Restart=always
RestartSec=10s

[Install]
WantedBy=multi-user.target

And the environment file for the master kubelet:

root@kbnmaster01:/etc/kubernetes/manifests# cat /etc/kubernetes/kubelet.env
# logging to stderr means we get it in the systemd journal
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=2"
# The address for the info server to serve on (set to 0.0.0.0 or "" for all interfaces)
KUBELET_ADDRESS="--address=0.0.0.0 --node-ip=10.125.75.170"
# The port for the info server to serve on
# KUBELET_PORT="--port=10250"
# You may leave this blank to use the actual hostname
KUBELET_HOSTNAME="--hostname-override=kbnmaster01"






KUBELET_ARGS="--pod-manifest-path=/etc/kubernetes/manifests \
--cadvisor-port=0 \
--pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.0 \
--node-status-update-frequency=10s \
--docker-disable-shared-pid=True \
--client-ca-file=/etc/kubernetes/ssl/ca.pem \
--tls-cert-file=/etc/kubernetes/ssl/node-kbnmaster01.pem \
--tls-private-key-file=/etc/kubernetes/ssl/node-kbnmaster01-key.pem \
--anonymous-auth=false \
--cgroup-driver=cgroupfs \
--cgroups-per-qos=True \
--fail-swap-on=True \
--enforce-node-allocatable=""  --cluster-dns=10.233.0.3 --cluster-domain=cluster.local --resolv-conf=/etc/resolv.conf --kubeconfig=/etc/kubernetes/node-kubeconfig.yaml --register-with-taints=node-role.kubernetes.io/master=:NoSchedule --kube-reserved cpu=200m,memory=512M --node-labels=node-role.kubernetes.io/master=true  --feature-gates=Initializers=False,PersistentLocalVolumes=False  "
KUBELET_NETWORK_PLUGIN="--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"

KUBELET_VOLUME_PLUGIN="--volume-plugin-dir=/var/lib/kubelet/volume-plugins"

# Should this cluster be allowed to run privileged docker containers
KUBE_ALLOW_PRIV="--allow-privileged=true"
KUBELET_CLOUDPROVIDER="--cloud-provider=vsphere --cloud-config=/etc/kubernetes/cloud_config"

PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin

The env file for the worker node kubelet:

root@kbnnode01:~# cat /etc/kubernetes/kubelet.env
# logging to stderr means we get it in the systemd journal
KUBE_LOGTOSTDERR="--logtostderr=true"
KUBE_LOG_LEVEL="--v=2"
# The address for the info server to serve on (set to 0.0.0.0 or "" for all interfaces)
KUBELET_ADDRESS="--address=0.0.0.0 --node-ip=10.125.75.171"
# The port for the info server to serve on
# KUBELET_PORT="--port=10250"
# You may leave this blank to use the actual hostname
KUBELET_HOSTNAME="--hostname-override=kbnnode01"






KUBELET_ARGS="--pod-manifest-path=/etc/kubernetes/manifests \
--cadvisor-port=0 \
--pod-infra-container-image=gcr.io/google_containers/pause-amd64:3.0 \
--node-status-update-frequency=10s \
--docker-disable-shared-pid=True \
--client-ca-file=/etc/kubernetes/ssl/ca.pem \
--tls-cert-file=/etc/kubernetes/ssl/node-kbnnode01.pem \
--tls-private-key-file=/etc/kubernetes/ssl/node-kbnnode01-key.pem \
--anonymous-auth=false \
--cgroup-driver=cgroupfs \
--cgroups-per-qos=True \
--fail-swap-on=True \
--enforce-node-allocatable=""  --cluster-dns=10.233.0.3 --cluster-domain=cluster.local --resolv-conf=/etc/resolv.conf --kubeconfig=/etc/kubernetes/node-kubeconfig.yaml --kube-reserved cpu=100m,memory=256M --node-labels=node-role.kubernetes.io/node=true  --feature-gates=Initializers=False,PersistentLocalVolumes=False  "
KUBELET_NETWORK_PLUGIN="--network-plugin=cni --cni-conf-dir=/etc/cni/net.d --cni-bin-dir=/opt/cni/bin"

KUBELET_VOLUME_PLUGIN="--volume-plugin-dir=/var/lib/kubelet/volume-plugins"

# Should this cluster be allowed to run privileged docker containers
KUBE_ALLOW_PRIV="--allow-privileged=true"
KUBELET_CLOUDPROVIDER="--cloud-provider=vsphere --cloud-config=/etc/kubernetes/cloud_config"

PATH=/usr/local/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin:/sbin:/bin


sberd commented Jan 30, 2018

Hi, I have the same issue. I deployed the cluster using kubeadm (upgraded from 1.8.x to 1.9.2) and the VMware guide https://vmware.github.io/vsphere-storage-for-kubernetes/documentation/existing.html
The node System UUID is equal to /sys/class/dmi/id/product_uuid, not product_serial.

@divyenpatel (Member)

@Xuxe I have installed a Kubernetes cluster using kubespray.

root@node-1:~# kubectl get nodes
NAME      STATUS    ROLES     AGE       VERSION
node-1    Ready     master    1h        v1.9.2+coreos.0
node-2    Ready     node      1h        v1.9.2+coreos.0
node-3    Ready     node      1h        v1.9.2+coreos.0
node-4    Ready     node      1h        v1.9.2+coreos.0

I did not face any "Unable to find VM by UUID" issue while attaching a volume to a node VM.
I see the System UUID set in the node's System Info matches both product_uuid and product_serial.

root@node-1:~# cat /sys/class/dmi/id/product_uuid
421C2FF3-C92F-05E6-16EF-E07D97394978
root@node-1:~# cat /sys/class/dmi/id/product_serial 
VMware-42 1c 2f f3 c9 2f 05 e6-16 ef e0 7d 97 39 49 78
root@node-1:~# kubectl describe node node-1 | grep "System UUID"
 System UUID:                421C2FF3-C92F-05E6-16EF-E07D97394978

The vSphere Cloud Provider is also able to locate the node VM using the System UUID. I verified that volume attachment to the node VM was successful.

{"log":"I0130 04:42:19.436744       1 operation_generator.go:308] AttachVolume.Attach succeeded for volume \"pvc-bad2c362-0577-11e8-bd99-0050569c2f8c\" (UniqueName: \"kubernetes.io/vsphere-volume/[vsanDatastore] f1ec5f5a-5c90-ce74-8d69-02002a623c85/kubernetes-dynamic-pvc-bad2c362-0577-11e8-bd99-0050569c2f8c.vmdk\") from node \"node-3\" \n","stream":"stderr","time":"2018-01-30T04:42:19.437109739Z"}

@divyenpatel (Member)

@sberd

Node System UUID is equal /sys/class/dmi/id/product_uuid, not product_serial.

For some Linux distros we observed that /sys/class/dmi/id/product_uuid was not reported correctly, so we changed the logic to read the UUID from product_serial - see https://github.com/kubernetes/kubernetes/pull/45311

In the 1.9 release, VCP no longer reads and parses this UUID; VCP just relies on the System UUID set on the node.
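
For context, here is a minimal sketch (not the upstream code from that PR) of how the "VMware-…" string in product_serial can be normalized into the canonical 8-4-4-4-12 form that vCenter recognizes:

```
// serial_to_uuid.go - hedged illustration of normalizing product_serial.
package main

import (
	"fmt"
	"strings"
)

// serialToUUID strips the "VMware-" prefix, drops spaces and the middle dash,
// and reformats the 32 hex digits as a canonical UUID string.
func serialToUUID(serial string) string {
	s := strings.TrimPrefix(strings.TrimSpace(serial), "VMware-")
	s = strings.NewReplacer(" ", "", "-", "").Replace(s)
	s = strings.ToUpper(s)
	if len(s) != 32 {
		return "" // not a well-formed VMware serial
	}
	return fmt.Sprintf("%s-%s-%s-%s-%s", s[0:8], s[8:12], s[12:16], s[16:20], s[20:32])
}

func main() {
	// Example taken from kbnmaster01 earlier in this thread.
	fmt.Println(serialToUUID("VMware-42 36 5f 38 cf 20 c7 9c-80 d3 52 36 3d 75 a0 ef"))
	// prints 42365F38-CF20-C79C-80D3-52363D75A0EF
}
```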


sberd commented Jan 30, 2018

In my case the UUIDs are different:

[root@dpcloud01-et.ftc.ru ~]
# cat /sys/class/dmi/id/product_uuid
736E0342-777A-25A3-F32F-D3728309CFA8
[root@dpcloud01-et.ftc.ru ~]
# cat /sys/class/dmi/id/product_serial 
VMware-42 03 6e 73 7a 77 a3 25-f3 2f d3 72 83 09 cf a8
[root@dpcloud01-et.ftc.ru ~]
#  kubectl describe node dpcloud01-et | grep "System UUID"
 System UUID:                736E0342-777A-25A3-F32F-D3728309CFA8
[root@dpcloud01-et.ftc.ru ~]

Maybe it's the VMware template settings.

@divyenpatel (Member)

@sberd Can you share OS details? Do you have VMware Tools installed on the VM?


Xuxe commented Jan 30, 2018

@divyenpatel

One thing I have observed is that my UUID from the dmidecode command is different from /sys/class/dmi/id/product_serial.

From my master:

huebju@kbnmaster01:~$ sudo cat /sys/class/dmi/id/product_serial
VMware-42 36 5f 38 cf 20 c7 9c-80 d3 52 36 3d 75 a0 ef
huebju@kbnmaster01:~$ sudo dmidecode | grep "UUID"
        UUID: 385F3642-20CF-9CC7-80D3-52363D75A0EF
huebju@kbnmaster01:~$ kubectl describe node kbnmaster01 | grep "System UUID"
 System UUID:                385F3642-20CF-9CC7-80D3-52363D75A0EF
huebju@kbnmaster01:~$

My product_uuid matches the System UUID from kubectl describe:

huebju@kbnmaster01:~$ sudo cat /sys/class/dmi/id/product_uuid
385F3642-20CF-9CC7-80D3-52363D75A0EF
root@kbnmaster01:~# /usr/bin/vmtoolsd -v
VMware Tools daemon, version 10.0.7.52125 (build-3227872)

I have installed the open-vm-tools and they are active.
What could cause this issue?


sberd commented Jan 30, 2018

@divyenpatel Is it enough?

CentOS Linux release 7.4.1708 (Core) 
open-vm-tools.x86_64       10.1.5-3.el7 

@divyenpatel (Member)

Thank you @Xuxe @sberd

One thing we are sure of: we are not able to locate VMs in vCenter using the UUID set in /sys/class/dmi/id/product_uuid.

Let me get back to you after consulting vSphere experts on this. We need to know in which situations /sys/class/dmi/id/product_uuid and /sys/class/dmi/id/product_serial can differ or stay in sync.


Xuxe commented Jan 30, 2018

@divyenpatel

@sberd Which ESXi compatibility do you use?

I discovered that the issue seems to come from ESXi compatibility 6.5+.

I tested a CoreOS production ISO to rule out an Ubuntu-specific issue. With ESXi compatibility v11 (6.0), product_serial and product_uuid match. But on another CoreOS VM with ESXi compatibility v13 (6.5+), the serial and UUID do not match.

Photon OS with hardware compatibility v11 also works; serial and UUID match.
I can't test the v13 OVA for now because I get an error during OVA import. I have to check this.


sberd commented Jan 30, 2018

@Xuxe @divyenpatel
I have the same compatibility 6.5+.


divyenpatel commented Jan 30, 2018

@Xuxe
Interesting. I have "ESXi 6.0 and later (VM version 11)" compatibility set on the node VMs and on the template VM used for cloning node VMs.

I quickly tried upgrading compatibility to "ESXi 6.5 and later (VM version 13)" on a VM after cloning it from the template VM.

I see the serial and UUID match.

# cat /sys/class/dmi/id/product_uuid
421C7FD2-223C-0287-FF6A-D12CB461DFA2
# cat /sys/class/dmi/id/product_serial
VMware-42 1c 7f d2 22 3c 02 87-ff 6a d1 2c b4 61 df a2

Guest OS is Ubuntu Linux (64-bit). I will check this out with CoreOS.


patschi commented Jan 30, 2018

Environment
I was also testing some stuff with:

  • VMware vSphere Hypervisor 6.5
    • Host patchlevel:
      [root@vmw-hv-01:~] vmware -vl
      VMware ESXi 6.5.0 build-5969303
      VMware ESXi 6.5.0 Update 1
      
  • Ubuntu 16.04 Desktop
    • with VM version v13, so ESXi 6.5 and later

Test
The output of product_uuid does not seem to match product_serial:

root@lnx-ubuntu-gui-1604:~# cat /sys/class/dmi/id/product_uuid
CE9F4D56-BC1B-C44E-1621-AAC76A4FA94D
root@lnx-ubuntu-gui-1604:~# cat /sys/class/dmi/id/product_serial
VMware-56 4d 9f ce 1b bc 4e c4-16 21 aa c7 6a 4f a9 4d

It seems the UUIDs are being saved within the .vmx file of the virtual machine:

uuid.bios = "56 4d 9f ce 1b bc 4e c4-16 21 aa c7 6a 4f a9 4d"
uuid.location = "56 4d 9f ce 1b bc 4e c4-16 21 aa c7 6a 4f a9 4d"

So I guess when upgrading the VM version from anything older to the latest v13, the UUIDs within the VM stay the same, as they are saved within the .vmx file on the host. According to the VMware KB article:

The UUID is based on the physical computer's identifier and the path to the virtual machine's  
configuration file. This UUID is generated when you power on or reset the virtual machine.  
As long as you do not move or copy the virtual machine to another location,  
the UUID remains constant.  

I'm just wondering where the different value of product_uuid comes from...


Xuxe commented Jan 30, 2018

@patschi @divyenpatel

Yep, after upgrading my v11 CoreOS box to v13, the UUID stays constant and both serial and UUID match.
So it seems to be a bug(?) in ESXi and compatibility v13 itself?

But at least a temporary workaround was found.

Patchlevel:

[root@esxi01:~] vmware -vl
VMware ESXi 6.5.0 build-6765664
VMware ESXi 6.5.0 Update 1

@divyenpatel (Member)

@Xuxe I am not able to reproduce the issue you have observed with "ESXi 6.0 and later (VM version 11)" compatibility on CoreOS.

Deployed CoreOS using the OVA - https://stable.release.core-os.net/amd64-usr/current/coreos_production_vmware_ova.ova

Core OS Version

localhost ~ # uname -a
Linux localhost 4.14.11-coreos #1 SMP Fri Jan 5 11:00:14 UTC 2018 x86_64 Intel(R) Xeon(R) CPU E5-2680 v4 @ 2.40GHz GenuineIntel GNU/Linux
localhost ~ # cat /etc/os-release
NAME="Container Linux by CoreOS"
ID=coreos
VERSION=1576.5.0
VERSION_ID=1576.5.0
BUILD_ID=2018-01-05-1121
PRETTY_NAME="Container Linux by CoreOS 1576.5.0 (Ladybug)"
ANSI_COLOR="38;5;75"
HOME_URL="https://coreos.com/"
BUG_REPORT_URL="https://issues.coreos.com"
COREOS_BOARD="amd64-usr"

Checked /sys/class/dmi/id/product_uuid and /sys/class/dmi/id/product_serial in the VM.
Both are in sync.

localhost ~ # cat /sys/class/dmi/id/product_uuid
421C827E-0915-AAF2-B154-1FAD4A6297FB
localhost ~ # cat /sys/class/dmi/id/product_serial
VMware-42 1c 82 7e 09 15 aa f2-b1 54 1f ad 4a 62 97 fb

Both the /sys/class/dmi/id/product_uuid and /sys/class/dmi/id/product_serial files are updated with new UUIDs in new clones created from this VM.

localhost ~ # cat /sys/class/dmi/id/product_uuid
421C77A8-8EF3-11BB-A764-9DF719B74686
localhost ~ # cat /sys/class/dmi/id/product_serial
VMware-42 1c 77 a8 8e f3 11 bb-a7 64 9d f7 19 b7 46 86

I see no issue with "ESXi 6.0 and later (VM version 11)" compatibility.

How are you cloning VMs to create the node VMs? I am manually cloning VMs from vCenter.


Xuxe commented Jan 30, 2018

@divyenpatel

I have not cloned my Ubuntu VMs; both are fresh setups.
CoreOS was also not cloned; it was installed via ISO, not the OVA.

To be clear, if you upgrade a VM from "6.0 and later" to 6.5 there is no issue (the IDs stay constant and correct; I have not tested cloning an upgraded VM).

Please try to create a new VM based on "6.5 and later" from the beginning; you should then see that the IDs don't match.
As you can see, @patschi above has the same issue with a VM based on "6.5 and later" from the beginning.


sberd commented Jan 31, 2018

@divyenpatel @Xuxe
Hi all, I recreated the VM with compatibility 6.0, and the numbers are equal.

[root@centos-test ~]# cat /sys/class/dmi/id/product_uuid
42037CBD-4CA8-BC5B-5509-0FC4D4B3223F
[root@centos-test ~]# cat /sys/class/dmi/id/product_serial
VMware-42 03 7c bd 4c a8 bc 5b-55 09 0f c4 d4 b3 22 3f
[root@centos-test ~]# cat /sys/class/dmi/id/product_serial | sed -e 's/^VMware-//' -e 's/-/ /' | awk '{ print toupper($1$2$3$4 "-" $5$6 "-" $7$8 "-" $9$10 "-" $11$12$13$14$15$16) }'
42037CBD-4CA8-BC5B-5509-0FC4D4B3223F 


sberd commented Jan 31, 2018

As a workaround I changed uuid.bios on the VM, and everything worked well. Maybe it's not a good solution for production, but it was my test installation.

@divyenpatel (Member)

@sberd @Xuxe
We observed this issue while testing VCP on SUSE Linux 12.3.

We installed SLE 12.3 from ISO with ESXi 6.5 and later (VM version 13) compatibility. product_serial and product_uuid did not match.

Reinstalled SLE 12.3 from ISO with ESXi 6.0 and later (VM version 11) compatibility. Verified product_serial and product_uuid matched.

@divyenpatel (Member)

Maybe it's not a good solution for production, but it was my test installation.

@sberd Agreed. The Kubernetes vSphere Cloud Provider team will come up with a solution and update this issue.

cc: @kubernetes/vmware

@divyenpatel (Member)

We are working on this issue. Here is the internal VMware Issue: vmware-archive#450

@plenderyou

I hit the same issue on a clean install of Kubernetes. I rebuilt the machines using the old compatibility level and everything seemed to work; I could provision disks. One question though: is disk.enableUUID still required? It seemed to work without it.

@divyenpatel (Member)

Is disk.enableUUID still required? It seemed to work without it.

@plenderyou
Yes, we need to set disk.enableUUID to 1 so that disks can be identified in the container host by their UUID.
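
For reference, one way to set this on a node VM is with the govc CLI (assuming govc is available; the same key can also be added to the .vmx directly or set in the vSphere UI under VM Options > Advanced). The VM name below is just a placeholder from this thread:

govc vm.change -vm kbnnode01 -e="disk.enableUUID=1"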

@divyenpatel (Member)

/assign abrarshivani

@divyenpatel (Member)

/unassign divyenpatel

@divyenpatel (Member)

/area platform/vsphere

@divyenpatel (Member)

/sig area/platform/vsphere

k8s-github-robot pushed a commit that referenced this issue Feb 8, 2018
Automatic merge from submit-queue. If you want to cherry-pick this change to another branch, please follow the instructions here: https://github.com/kubernetes/community/blob/master/contributors/devel/cherry-picks.md

Report InstanceID for vSphere Cloud Provider as UUID obtained from product_serial file 

**What this PR does / why we need it**:
vSphere Cloud Provider is not able to find the nodes for VMs created on vSphere 6.5. Kubelet fetches the SystemUUID from the file ```/sys/class/dmi/id/product_uuid```. vSphere Cloud Provider uses this UUID as the VM identifier to get node information from vCenter. vCenter 6.5 doesn't recognize these UUIDs; as a result, the nodes are not found.

The UUID present in the file ```/sys/class/dmi/id/product_serial``` is recognized by vCenter. Yet Kubelet doesn't report it. Therefore, in this PR the InstanceID is reported as the UUID fetched from the file
```/sys/class/dmi/id/product_serial```.

**Which issue(s) this PR fixes** *(optional, in `fixes #<issue number>(, fixes #<issue_number>, ...)` format, will close the issue(s) when PR gets merged)*:
Fixes #58927

**Special notes for your reviewer**:
Internal review here: vmware-archive#452

Tested:
Launched K8s cluster using kubeadm (Used Ubuntu VM compatible with vSphere version 6.5.)
_**Note: Installed Ubuntu from ISO**_
Observed following:
```
Master
> cat /sys/class/dmi/id/product_uuid
743F0E42-84EA-A2F9-7736-6106BB5DBF6B

> cat /sys/class/dmi/id/product_serial
VMware-42 0e 3f 74 ea 84 f9 a2-77 36 61 06 bb 5d bf 6b

Node
> cat /sys/class/dmi/id/product_uuid
956E0E42-CC9D-3D89-9757-F27CEB539B76

> cat /sys/class/dmi/id/product_serial
VMware-42 0e 6e 95 9d cc 89 3d-97 57 f2 7c eb 53 9b 76
```
With this fix the controller manager was able to find the nodes.
**controller manager logs**
```
{"log":"I0205 22:43:00.106416       1 nodemanager.go:183] Found node ubuntu-node as vm=VirtualMachine:vm-95 in vc=10.161.120.115 and datacenter=vcqaDC\n","stream":"stderr","time":"2018-02-05T22:43:00.421010375Z"}
```


**Release note**:

```release-note
vSphere Cloud Provider supports VMs provisioned on vSphere 6.5
```
@divyenpatel (Member)

@abrarshivani we should cherry-pick this change to the 1.9 release.


Harsun commented Oct 7, 2019

Looks like the issue is still present with VM compatibility 6.7.0 U3.
For guest OS Ubuntu 18.04.3, with Kubernetes 1.15.3 and Kubespray 2.11.0,
unfortunately the UUID and product_serial are not the same.

@dheerajinganti

I see the System UUID set in the node's System Info matches product_uuid and product_serial; however, it does not match the ProviderID of the node. Due to this, I am getting a "no VM found" error and the node is unable to mount the volume. Below are the outputs for ProviderID and SystemUUID.

All nodes were working fine, but since last week I started getting the "no VM found" error for just two worker nodes. How do I update the node ProviderID with the right UUID?

kubectl describe nodes | grep UUID

System UUID: 420B0194-7D17-98FD-BED3-42E7B8C0234D
System UUID: 420BE0F9-6E13-06E8-193B-05D28876E082
System UUID: 420B68A9-961F-993F-2740-AEF065F19C48
System UUID: 8FCF0B42-F2A2-54D1-329B-575D9BC5E19C
System UUID: 420B78CF-33E0-600A-E11F-EBC16D99C91E
System UUID: 18480B42-604D-A838-3245-833DA9B5DFF3

kubectl describe nodes | grep ProviderID

ProviderID: vsphere://420b0194-7d17-98fd-bed3-42e7b8c0234d
ProviderID: vsphere://420be0f9-6e13-06e8-193b-05d28876e082
ProviderID: vsphere://420b68a9-961f-993f-2740-aef065f19c48
ProviderID: vsphere://420bcf8f-a2f2-d154-329b-575d9bc5e19c
ProviderID: vsphere://420b78cf-33e0-600a-e11f-ebc16d99c91e
ProviderID: vsphere://420b4818-4d60-38a8-3245-833da9b5dff3

kubectl get nodes -o json | jq '.items[]|[.metadata.name, .spec.providerID, .status.nodeInfo.systemUUID]'

[
"k8s-kubespray-master-0",
"vsphere://420b0194-7d17-98fd-bed3-42e7b8c0234d",
"420B0194-7D17-98FD-BED3-42E7B8C0234D"
]
[
"k8s-kubespray-master-1",
"vsphere://420be0f9-6e13-06e8-193b-05d28876e082",
"420BE0F9-6E13-06E8-193B-05D28876E082"
]
[
"k8s-kubespray-master-2",
"vsphere://420b68a9-961f-993f-2740-aef065f19c48",
"420B68A9-961F-993F-2740-AEF065F19C48"
]
[
"k8s-kubespray-worker-0",
"vsphere://420bcf8f-a2f2-d154-329b-575d9bc5e19c",
"8FCF0B42-F2A2-54D1-329B-575D9BC5E19C"
]
[
"k8s-kubespray-worker-1",
"vsphere://420b78cf-33e0-600a-e11f-ebc16d99c91e",
"420B78CF-33E0-600A-E11F-EBC16D99C91E"
]
[
"k8s-kubespray-worker-2",
"vsphere://420b4818-4d60-38a8-3245-833da9b5dff3",
"18480B42-604D-A838-3245-833DA9B5DFF3"
]


VariableDeclared commented Nov 7, 2019

This issue is still present in vSphere 6.5; it seems that in some cases product_uuid is the source of truth and in other cases product_serial.

There is one pattern I could spot reading through this thread, though: the byte pairs are reversed.

The first group of eight hex digits is reversed pairwise, and then the next two groups of four are reversed pairwise. After that, the last sixteen hex digits are in the correct order.
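
To make that pattern concrete, here is a small sketch (written for this comment, not code from any component) that applies the pairwise reversal to patschi's product_uuid example from earlier in the thread:

```
// uuid_byteswap.go - hedged illustration of the mixed-endian reversal.
package main

import (
	"fmt"
	"strings"
)

// swapUUIDEndianness reverses the bytes of the first three groups of a UUID
// and leaves the last two groups untouched. It assumes a well-formed
// 8-4-4-4-12 UUID string.
func swapUUIDEndianness(uuid string) string {
	parts := strings.Split(strings.ToUpper(uuid), "-")
	rev := func(s string) string {
		out := ""
		for i := len(s); i > 0; i -= 2 {
			out += s[i-2 : i]
		}
		return out
	}
	return strings.Join([]string{rev(parts[0]), rev(parts[1]), rev(parts[2]), parts[3], parts[4]}, "-")
}

func main() {
	// product_uuid reported inside the guest (example from this thread).
	fmt.Println(swapUUIDEndianness("CE9F4D56-BC1B-C44E-1621-AAC76A4FA94D"))
	// prints 564D9FCE-1BBC-4EC4-1621-AAC76A4FA94D, i.e. the value encoded in product_serial
}
```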


hamelg commented Nov 7, 2019

Us too, we encounter this issue.
It seems to be related to the SMBIOS version.

With SMBIOS 2.4, we get the same IDs.
With SMBIOS 2.7, the IDs are different.

You can see your SMBIOS version with:
dmidecode | grep SMBIOS

We are running RHEL 7.6 on ESXi 6.5.0.

Here are some notes related to this issue:
https://kb.vmware.com/s/article/53609
https://access.redhat.com/solutions/82203


Tejeev commented Nov 21, 2019

We see a very similar issue when F5 BIG-IP nodes are in the cluster. Removing the nodes resolves the issue. Not sure if our particular situation is a K8s, vSphere, or BIG-IP bug.


jorioux commented Feb 25, 2020

We have this issue with CentOS 7 and vSphere 6.7u3. We also have an F5 BIG-IP VM on that VMware cluster; maybe it's linked, but I don't see how! This is a deep one.
