
Failed to stage volume to node #2269

Closed
xiaoping8385 opened this issue Apr 10, 2024 · 19 comments
Labels
lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed.

Comments

@xiaoping8385

xiaoping8385 commented Apr 10, 2024

What happened:
When we run several pods on a k8s cluster, two of the pods running on the same node cannot start.

describe pod

root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl describe pod sample-statefulset-1
Name:             sample-statefulset-1
Namespace:        od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf
Priority:         0
Service Account:  stateful-set-sa
Node:             849574ed-eef0-49a3-aef2-a03c1737cd4a/10.0.16.20
Start Time:       Tue, 09 Apr 2024 13:35:44 +0000
Labels:           app=pv-ex
                  controller-revision-hash=sample-statefulset-b77598d95
                  statefulset.kubernetes.io/pod-name=sample-statefulset-1
Annotations:      <none>
Status:           Pending
IP:               
IPs:              <none>
Controlled By:    StatefulSet/sample-statefulset
Containers:
  pv-ex:
    Container ID:   
    Image:          gcr.io/cf-pks-releng-a1/pksrelengci/persistent-example:latest
    Image ID:       
    Port:           3000/TCP
    Host Port:      0/TCP
    State:          Waiting
      Reason:       ContainerCreating
    Ready:          False
    Restart Count:  0
    Environment:
      LOG_FILE:  /usr/share/pv-ex/log.txt
    Mounts:
      /usr/share/pv-ex from www (rw)
      /var/run/secrets/kubernetes.io/serviceaccount from kube-api-access-2rs44 (ro)
Conditions:
  Type              Status
  Initialized       True 
  Ready             False 
  ContainersReady   False 
  PodScheduled      True 
Volumes:
  www:
    Type:       PersistentVolumeClaim (a reference to a PersistentVolumeClaim in the same namespace)
    ClaimName:  www-sample-statefulset-1
    ReadOnly:   false
  kube-api-access-2rs44:
    Type:                    Projected (a volume that contains injected data from multiple sources)
    TokenExpirationSeconds:  3607
    ConfigMapName:           kube-root-ca.crt
    ConfigMapOptional:       <nil>
    DownwardAPI:             true
QoS Class:                   BestEffort
Node-Selectors:              <none>
Tolerations:                 node.kubernetes.io/not-ready:NoExecute op=Exists for 300s
                             node.kubernetes.io/unreachable:NoExecute op=Exists for 300s
Events:
  Type     Reason       Age                    From     Message
  ----     ------       ----                   ----     -------
  Warning  FailedMount  6m21s (x273 over 18h)  kubelet  MountVolume.MountDevice failed for volume "pvc-53087397-597f-41aa-bccb-e2d8ce450626" : rpc error: code = DeadlineExceeded desc = context deadline exceeded
  Warning  FailedMount  97s (x482 over 18h)    kubelet  Unable to attach or mount volumes: unmounted volumes=[www], unattached volumes=[], failed to process volumes=[]: timed out waiting for the condition


describe pv:
 kubectl describe pv pvc-53087397-597f-41aa-bccb-e2d8ce450626
Name:              pvc-53087397-597f-41aa-bccb-e2d8ce450626
Labels:            <none>
Annotations:       pv.kubernetes.io/migrated-to: disk.csi.azure.com
                   pv.kubernetes.io/provisioned-by: kubernetes.io/azure-disk
                   volume.kubernetes.io/provisioner-deletion-secret-name: 
                   volume.kubernetes.io/provisioner-deletion-secret-namespace: 
Finalizers:        [kubernetes.io/pv-protection external-attacher/disk-csi-azure-com]
StorageClass:      ci-storage
Status:            Bound
Claim:             od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf/www-sample-statefulset-1
Reclaim Policy:    Delete
Access Modes:      RWO
VolumeMode:        Filesystem
Capacity:          1Gi
Node Affinity:     
  Required Terms:  
    Term 0:        topology.disk.csi.azure.com/zone in [westus2-1]
Message:           
Source:
    Type:         AzureDisk (an Azure Data Disk mount on the host and bind mount to the pod)
    DiskName:     pvc-53087397-597f-41aa-bccb-e2d8ce450626
    DiskURI:      /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
    Kind:         Managed
    FSType:       
    CachingMode:  ReadWrite
    ReadOnly:     false
Events:           <none>

kubectl describe volumeattachment csi-5e61ec3524c386fbcdbfa765f2c0b089c359348ef4961e9d4e77ff0134b12aa7
Name:         csi-5e61ec3524c386fbcdbfa765f2c0b089c359348ef4961e9d4e77ff0134b12aa7
Namespace:    
Labels:       <none>
Annotations:  csi.alpha.kubernetes.io/node-id: 849574ed-eef0-49a3-aef2-a03c1737cd4a
API Version:  storage.k8s.io/v1
Kind:         VolumeAttachment
Metadata:
  Creation Timestamp:  2024-04-09T13:35:44Z
  Finalizers:
    external-attacher/disk-csi-azure-com
  Resource Version:  102334
  UID:               15ff9764-3ab6-4c99-b809-f2a26cb08362
Spec:
  Attacher:   disk.csi.azure.com
  Node Name:  849574ed-eef0-49a3-aef2-a03c1737cd4a
  Source:
    Persistent Volume Name:  pvc-53087397-597f-41aa-bccb-e2d8ce450626
Status:
  Attached:  true
  Attachment Metadata:
    LUN:                  2
    Diskiopsreadwrite:    500
    Diskmbpsreadwrite:    60
    Networkaccesspolicy:  AllowAll
    Skuname:              Standard_LRS
Events:                   <none>

csicontroller.log
kubectl logs  csi-azuredisk-controller-b5cd56b5f-85wcc  -n kube-system | grep pvc-53087397-597f-41aa-bccb-e2d8ce450626 
Defaulted container "csi-provisioner" out of: csi-provisioner, csi-attacher, csi-snapshotter, csi-resizer, liveness-probe, azuredisk
I0409 13:35:42.159495       1 controller.go:923] successfully created PV pvc-53087397-597f-41aa-bccb-e2d8ce450626 for PVC www-sample-statefulset-1 and csi volume name /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
I0409 13:35:42.159599       1 controller.go:1442] provision "od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf/www-sample-statefulset-1" class "ci-storage": volume "pvc-53087397-597f-41aa-bccb-e2d8ce450626" provisioned
I0409 13:35:42.169628       1 event.go:298] Event(v1.ObjectReference{Kind:"PersistentVolumeClaim", Namespace:"od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf", Name:"www-sample-statefulset-1", UID:"53087397-597f-41aa-bccb-e2d8ce450626", APIVersion:"v1", ResourceVersion:"102231", FieldPath:""}): type: 'Normal' reason: 'ProvisioningSucceeded' Successfully provisioned volume pvc-53087397-597f-41aa-bccb-e2d8ce450626
root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl logs  csi-azuredisk-controller-b5cd56b5f-85wcc  -n kube-system -c azuredisk | grep pvc-53087397-597f-41aa-bccb-e2d8ce450626 
I0409 13:35:39.617879       1 utils.go:78] GRPC request: {"accessibility_requirements":{"preferred":[{"segments":{"topology.disk.csi.azure.com/zone":"westus2-1","topology.kubernetes.io/zone":"westus2-1"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-2","topology.kubernetes.io/zone":"westus2-2"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-3","topology.kubernetes.io/zone":"westus2-3"}}],"requisite":[{"segments":{"topology.disk.csi.azure.com/zone":"westus2-3","topology.kubernetes.io/zone":"westus2-3"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-2","topology.kubernetes.io/zone":"westus2-2"}},{"segments":{"topology.disk.csi.azure.com/zone":"westus2-1","topology.kubernetes.io/zone":"westus2-1"}}]},"capacity_range":{"required_bytes":1073741824},"name":"pvc-53087397-597f-41aa-bccb-e2d8ce450626","parameters":{"csi.storage.k8s.io/pv/name":"pvc-53087397-597f-41aa-bccb-e2d8ce450626","csi.storage.k8s.io/pvc/name":"www-sample-statefulset-1","csi.storage.k8s.io/pvc/namespace":"od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf","kind":"managed","storageaccounttype":"Standard_LRS"},"volume_capabilities":[{"AccessType":{"Mount":{}},"access_mode":{"mode":7}}]}
I0409 13:35:39.618148       1 controllerserver.go:198] begin to create azure disk(pvc-53087397-597f-41aa-bccb-e2d8ce450626) account type(Standard_LRS) rg(obsidian-flier) location(westus2) size(1) diskZone(westus2-1) maxShares(0)
I0409 13:35:39.618168       1 azure_managedDiskController.go:98] azureDisk - creating new managed Name:pvc-53087397-597f-41aa-bccb-e2d8ce450626 StorageAccountType:Standard_LRS Size:1
I0409 13:35:42.158743       1 azure_managedDiskController.go:288] azureDisk - created new MD Name:pvc-53087397-597f-41aa-bccb-e2d8ce450626 StorageAccountType:Standard_LRS Size:1
I0409 13:35:42.158780       1 controllerserver.go:294] create azure disk(pvc-53087397-597f-41aa-bccb-e2d8ce450626) account type(Standard_LRS) rg(obsidian-flier) location(westus2) size(1) tags(map[kubernetes.io-created-for-pv-name:pvc-53087397-597f-41aa-bccb-e2d8ce450626 kubernetes.io-created-for-pvc-name:www-sample-statefulset-1 kubernetes.io-created-for-pvc-namespace:od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf]) successfully
I0409 13:35:42.158843       1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=2.540643032 request="azuredisk_csi_driver_controller_create_volume" resource_group="obsidian-flier" subscription_id="066b2cf2-eca1-4e23-98b4-ce94c72b99b3" source="disk.csi.azure.com" volumeid="/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626" result_code="succeeded"
I0409 13:35:42.158867       1 utils.go:84] GRPC response: {"volume":{"accessible_topology":[{"segments":{"topology.disk.csi.azure.com/zone":"westus2-1"}}],"capacity_bytes":1073741824,"content_source":{"Type":null},"volume_context":{"csi.storage.k8s.io/pv/name":"pvc-53087397-597f-41aa-bccb-e2d8ce450626","csi.storage.k8s.io/pvc/name":"www-sample-statefulset-1","csi.storage.k8s.io/pvc/namespace":"od-pks-testing-8ca861f7-71ae-4058-b5b0-f04db02e25cf","kind":"Managed","requestedsizegib":"1","storageaccounttype":"Standard_LRS"},"volume_id":"/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626"}}
I0409 13:35:44.853034       1 utils.go:78] GRPC request: {"node_id":"849574ed-eef0-49a3-aef2-a03c1737cd4a","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"ReadWrite","fstype":"","kind":"Managed"},"volume_id":"/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626"}
I0409 13:35:44.891972       1 controllerserver.go:401] GetDiskLun returned: cannot find Lun for disk pvc-53087397-597f-41aa-bccb-e2d8ce450626. Initiating attaching volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a (vmState Succeeded).
I0409 13:35:44.892007       1 controllerserver.go:426] Trying to attach volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a
I0409 13:35:44.892032       1 azure_controller_common.go:255] wait 1000ms for more requests on node 849574ed-eef0-49a3-aef2-a03c1737cd4a, current disk attach: /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
I0409 13:35:45.892758       1 azure_controller_common.go:269] Trying to attach volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 lun 2 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a, diskMap len:1, map[/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourcegroups/obsidian-flier/providers/microsoft.compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626:%!s(*provider.AttachDiskOptions=&{ReadWrite pvc-53087397-597f-41aa-bccb-e2d8ce450626  false 2})]
I0409 13:35:45.892866       1 azure_controller_standard.go:103] azureDisk - update(obsidian-flier): vm(849574ed-eef0-49a3-aef2-a03c1737cd4a) - attach disk list(map[/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourcegroups/obsidian-flier/providers/microsoft.compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626:%!s(*provider.AttachDiskOptions=&{ReadWrite pvc-53087397-597f-41aa-bccb-e2d8ce450626  false 2})])
I0409 13:35:46.399029       1 azure_controller_standard.go:116] azureDisk - update(obsidian-flier): vm(849574ed-eef0-49a3-aef2-a03c1737cd4a) - attach disk list(map[/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourcegroups/obsidian-flier/providers/microsoft.compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626:%!s(*provider.AttachDiskOptions=&{ReadWrite pvc-53087397-597f-41aa-bccb-e2d8ce450626  false 2})]) returned with <nil>
I0409 13:35:56.607788       1 azure_controller_common.go:558] azureDisk - find disk: lun 2 name pvc-53087397-597f-41aa-bccb-e2d8ce450626 uri /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626
I0409 13:35:56.607819       1 controllerserver.go:436] Attach operation successful: volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 attached to node 849574ed-eef0-49a3-aef2-a03c1737cd4a.
I0409 13:35:56.607841       1 controllerserver.go:456] attach volume /subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626 to node 849574ed-eef0-49a3-aef2-a03c1737cd4a successfully
I0409 13:35:56.607904       1 azure_metrics.go:115] "Observed Request Latency" latency_seconds=11.715902039 request="azuredisk_csi_driver_controller_publish_volume" resource_group="obsidian-flier" subscription_id="066b2cf2-eca1-4e23-98b4-ce94c72b99b3" source="disk.csi.azure.com" volumeid="/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626" node="849574ed-eef0-49a3-aef2-a03c1737cd4a" result_code="succeeded"
root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl logs csi-azuredisk-node-4ts6c -n kube-system | grep  pvc-53087397-597f-41aa-bccb-e2d8ce450626
Defaulted container "liveness-probe" out of: liveness-probe, node-driver-registrar, azuredisk
root@b972e187-7d97-4933-4479-6f99f744c6a7:/tmp/build/88f1b62d# kubectl logs csi-azuredisk-node-4ts6c -n kube-system -c azuredisk| grep  pvc-53087397-597f-41aa-bccb-e2d8ce450626


CSInode.log 
I0410 06:48:42.302185       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly
I0410 06:48:42.302208       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:0 directly
I0410 06:48:42.302213       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:1 directly
I0410 06:48:42.302218       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:2 directly
I0410 06:48:42.302222       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 4:0:1:0 directly
I0410 06:48:42.302226       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 5:0:0:0 directly
I0410 06:48:43.302264       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly
I0410 06:48:43.302287       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:0 directly
I0410 06:48:43.302292       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:1 directly
I0410 06:48:43.302296       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:2 directly
I0410 06:48:43.302300       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 4:0:1:0 directly
I0410 06:48:43.302304       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 5:0:0:0 directly
I0410 06:48:44.302389       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly
I0410 06:48:44.302414       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:0 directly
I0410 06:48:44.302419       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:1 directly
I0410 06:48:44.302423       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 2:0:0:2 directly
I0410 06:48:44.302426       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 4:0:1:0 directly
I0410 06:48:44.302430       1 azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 5:0:0:0 directly
I0410 06:48:44.729773       1 utils.go:77] GRPC call: /csi.v1.Node/NodeStageVolume
I0410 06:48:44.729791       1 utils.go:78] GRPC request: {"publish_context":{"LUN":"2","diskiopsreadwrite":"500","diskmbpsreadwrite":"60","networkaccesspolicy":"AllowAll","skuname":"Standard_LRS"},"staging_target_path":"/var/vcap/data/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/ad2407d46e6a6d65ad8f32685e88d02e2ff433ef4fb076f20ad79eefded4d947/globalmount","volume_capability":{"AccessType":{"Mount":{}},"access_mode":{"mode":7}},"volume_context":{"cachingmode":"ReadWrite","fstype":"","kind":"Managed"},"volume_id":"/subscriptions/066b2cf2-eca1-4e23-98b4-ce94c72b99b3/resourceGroups/obsidian-flier/providers/Microsoft.Compute/disks/pvc-53087397-597f-41aa-bccb-e2d8ce450626"}

What you expected to happen:
The pod should be running.
How to reproduce it:
This happens occasionally; recreating the worker node usually resolves the issue.
Anything else we need to know?:

Environment:

  • CSI Driver version:
kubectl get po -n kube-system -o yaml | grep mcr | grep azuredisk
image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
      imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:d001cbb89e1e56eeed62575bcc9558fc00044fefb0177ea52b6311c152e5239c
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
      imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:d001cbb89e1e56eeed62575bcc9558fc00044fefb0177ea52b6311c152e5239c
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
      imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:d001cbb89e1e56eeed62575bcc9558fc00044fefb0177ea52b6311c152e5239c
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
      image: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi:v1.28.2
      imageID: mcr.microsoft.com/oss/kubernetes-csi/azuredisk-csi@sha256:d001cbb89e1e56eeed62575bcc9558fc00044fefb0177ea52b6311c152e5239c
  • Kubernetes version (use kubectl version):
  • kubectl version
WARNING: This version information is deprecated and will be replaced with the output from kubectl version --short.  Use --output=yaml|json to get the full version.
Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.12", GitCommit:"12031002905c0410706974560cbdf2dad9278919", GitTreeState:"clean", BuildDate:"2024-03-15T02:15:31Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
Kustomize Version: v5.0.1
Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.12+vmware.1", GitCommit:"2eb649beaf2d31b1873b8ff68a1bffe984f8ec05", GitTreeState:"clean", BuildDate:"2024-03-27T07:10:53Z", GoVersion:"go1.21.8", Compiler:"gc", Platform:"linux/amd64"}
  • OS (e.g. from /etc/os-release):
worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:~$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.4 LTS"
NAME="Ubuntu"
VERSION_ID="22.04"
VERSION="22.04.4 LTS (Jammy Jellyfish)"
VERSION_CODENAME=jammy
ID=ubuntu
ID_LIKE=debian
HOME_URL="https://www.ubuntu.com/"
SUPPORT_URL="https://help.ubuntu.com/"
BUG_REPORT_URL="https://bugs.launchpad.net/ubuntu/"
PRIVACY_POLICY_URL="https://www.ubuntu.com/legal/terms-and-policies/privacy-policy"
UBUNTU_CODENAME=jammy
  • Kernel (e.g. uname -a):
Linux 849574ed-eef0-49a3-aef2-a03c1737cd4a 5.15.0-101-generic #111-Ubuntu SMP Tue Mar 5 20:16:58 UTC 2024 x86_64 x86_64 x86_64 GNU/Linux
  • Install tools:
  • Others:
@xiaoping8385
Author

After SSHing to the worker node, we can see the devices (sdd, sde) are present, but not staged.

worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:~$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
fd0      2:0    1     4K  0 disk 
sda      8:0    0   403G  0 disk 
├─sda1   8:1    0   4.8G  0 part /home
│                                /
└─sda2   8:2    0 398.2G  0 part /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/host-var-log-antrea/antrea-ovs/1
                                 /var/vcap/data/kubelet/pods/9f022aaf-d7b0-477a-b153-9c59404a403f/volume-subpaths/antrea-config/antrea-controller/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/antrea-agent/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/install-cni/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-agent-tweaker-config/antrea-agent-tweaker/0
                                 /var/tmp
                                 /tmp
                                 /opt
                                 /var/opt
                                 /var/log
                                 /var/vcap/data
sdb      8:16   0    64G  0 disk 
└─sdb1   8:17   0    64G  0 part 
sdc      8:32   0    50G  0 disk 
└─sdc1   8:33   0    50G  0 part /var/vcap/store
sdd      8:48   0     1G  0 disk 
sde      8:64   0     1G  0 disk 
sr0     11:0    1   648K  0 rom  



worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:/sys/bus/scsi/devices$ ls -al
total 0
drwxr-xr-x 2 root root 0 Apr  9 08:53 .
drwxr-xr-x 4 root root 0 Apr  9 08:53 ..
lrwxrwxrwx 1 root root 0 Apr  9 08:53 1:0:0:0 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host1/target1:0:0/1:0:0:0
lrwxrwxrwx 1 root root 0 Apr  9 09:01 2:0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0/2:0:0:0
lrwxrwxrwx 1 root root 0 Apr  9 13:35 2:0:0:1 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0/2:0:0:1
lrwxrwxrwx 1 root root 0 Apr  9 13:35 2:0:0:2 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0/2:0:0:2
lrwxrwxrwx 1 root root 0 Apr  9 08:53 4:0:1:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host4/target4:0:1/4:0:1:0
lrwxrwxrwx 1 root root 0 Apr  9 08:53 5:0:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host5/target5:0:0/5:0:0:0
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host0 -> ../../../devices/pci0000:00/0000:00:07.1/ata1/host0
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host1 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host1
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host2 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host3 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781a-1e82-4818-a1c3-63d806ec15bb/host3
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host4 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host4
lrwxrwxrwx 1 root root 0 Apr  9 08:53 host5 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host5
lrwxrwxrwx 1 root root 0 Apr  9 08:53 target1:0:0 -> ../../../devices/pci0000:00/0000:00:07.1/ata2/host1/target1:0:0
lrwxrwxrwx 1 root root 0 Apr 10 07:21 target2:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/f8b3781b-1e82-4818-a1c3-63d806ec15bb/host2/target2:0:0
lrwxrwxrwx 1 root root 0 Apr  9 08:53 target4:0:1 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0001-8899-0000-000000000000/host4/target4:0:1
lrwxrwxrwx 1 root root 0 Apr  9 08:53 target5:0:0 -> ../../../devices/LNXSYSTM:00/LNXSYBUS:00/PNP0A03:00/device:07/VMBUS:01/00000000-0000-8899-0000-000000000000/host5/target5:0:0
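These sysfs entries can be cross-checked by hand: an Azure data disk reports vendor MSFT and model "Virtual Disk" under its H:C:T:L entry, and the LUN is the last field of the name. A minimal sketch, assuming the standard sysfs layout (the `find_azure_disk` helper name is ours, for illustration only):

```shell
# find_azure_disk: list kernel block device names for Azure data disks at a
# given LUN by reading sysfs (hypothetical helper, for illustration).
# $1 = scsi devices dir (normally /sys/bus/scsi/devices), $2 = LUN.
find_azure_disk() {
  sys=$1; lun=$2
  for d in "$sys"/*:*:*:"$lun"; do
    [ -r "$d/vendor" ] && [ -r "$d/model" ] || continue
    vendor=$(tr -d ' ' < "$d/vendor" | tr '[:lower:]' '[:upper:]')
    model=$(sed 's/ *$//' "$d/model" | tr '[:lower:]' '[:upper:]')
    # Azure VHDs report vendor MSFT and model "Virtual Disk"
    [ "$vendor" = "MSFT" ] || continue
    [ "$model" = "VIRTUAL DISK" ] || continue
    ls "$d/block"   # kernel block device name, e.g. sdd
  done
  return 0
}

# For the VolumeAttachment above, LUN is 2:
find_azure_disk /sys/bus/scsi/devices 2
```

If this prints a device (e.g. sdd) while the pod still fails to stage, the disk is visible to the kernel and the problem lies in the driver's device-resolution path rather than in attachment.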

@andyzhangx
Member

andyzhangx commented Apr 10, 2024

azure_common_linux.go:121] /dev/disk/azure is not populated, now try to parse 1:0:0:0 directly

From the error message, it seems you are missing the udev rule on the node. Follow the guide here to generate the udev rule on the node: #1631 (comment). Azure VMs should have this built-in rule by default; is your node type special? @xiaoping8385
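To confirm what this message implies on a given node, one can check the conventional udev rule locations and whether /dev/disk/azure is populated. A minimal sketch (the `check_azure_udev` helper is hypothetical, and rule locations vary by image):

```shell
# check_azure_udev: report whether any azure udev rules file exists in a
# directory (hypothetical helper, for illustration). $1 = rules directory.
check_azure_udev() {
  dir=$1
  if ls "$dir" 2>/dev/null | grep -qi azure; then
    echo "azure udev rule present in $dir"
  else
    echo "no azure udev rule in $dir"
  fi
}

# The two conventional locations for udev rules:
check_azure_udev /etc/udev/rules.d
check_azure_udev /usr/lib/udev/rules.d

# If the rule is active, the driver expects per-LUN links here:
ls -l /dev/disk/azure/scsi1/ 2>/dev/null || echo "/dev/disk/azure/scsi1 not populated"
```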

@xiaoping8385
Author

xiaoping8385 commented Apr 11, 2024

There is no such rule, and not all nodes have the same issue. For example, we found that one of the nodes can mount the PV successfully, and it also has no such built-in rule.
On the normal node:


> resize2fs /dev/sdd
resize2fs 1.46.5 (30-Dec-2021)
The filesystem is already 262144 (4k) blocks long.  Nothing to do!


worker/722ad0af-9e98-4c46-b47d-44ba282a5960:~$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
fd0      2:0    1     4K  0 disk 
sda      8:0    0    64G  0 disk 
└─sda1   8:1    0    64G  0 part 
sdb      8:16   0   403G  0 disk 
├─sdb1   8:17   0   4.8G  0 part /home
│                                /
└─sdb2   8:18   0 398.2G  0 part /var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/host-var-log-antrea/antrea-ovs/1
                                 /var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/antrea-config/antrea-agent/0
                                 /var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/antrea-config/install-cni/0
                                 /var/vcap/data/kubelet/pods/5986f253-7cd5-4b7a-ad6b-f6e8c8c0f681/volume-subpaths/antrea-agent-tweaker-config/antrea-agent-tweaker/0
                                 /var/tmp
                                 /tmp
                                 /opt
                                 /var/opt
                                 /var/log
                                 /var/vcap/data
sdc      8:32   0    50G  0 disk 
└─sdc1   8:33   0    50G  0 part /var/vcap/store
sdd      8:48   0     1G  0 disk /var/vcap/data/kubelet/pods/fa2d11c4-ef95-47ef-b2b8-7b5fbf8ed279/volumes/kubernetes.io~csi/pvc-684f4e73-681d-4164-97d7-5440bafc950b/mount
                                 /var/vcap/data/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/c4d484f06606c7650a1b33528b67e8221f3c74b68b45db9bd301ad189c62a2eb/globalmount
sr0     11:0    1   648K  0 rom  
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:~$ cd /etc/udev/rules.d/
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:/etc/udev/rules.d$ ls
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:/etc/udev/rules.d$ ls -al
total 8
drwxr-xr-x 2 root root 4096 Apr  7  2022 .
drwxr-xr-x 4 root root 4096 Mar 18 12:14 ..
worker/722ad0af-9e98-4c46-b47d-44ba282a5960:/etc/udev/rules.d$ 

On the problematic node:

worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:~$ lsblk
NAME   MAJ:MIN RM   SIZE RO TYPE MOUNTPOINTS
fd0      2:0    1     4K  0 disk 
sda      8:0    0   403G  0 disk 
├─sda1   8:1    0   4.8G  0 part /home
│                                /
└─sda2   8:2    0 398.2G  0 part /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/host-var-log-antrea/antrea-ovs/1
                                 /var/vcap/data/kubelet/pods/9f022aaf-d7b0-477a-b153-9c59404a403f/volume-subpaths/antrea-config/antrea-controller/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/antrea-agent/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-config/install-cni/0
                                 /var/vcap/data/kubelet/pods/1db7b8bf-91dc-4526-a47d-50bb0705cfd4/volume-subpaths/antrea-agent-tweaker-config/antrea-agent-tweaker/0
                                 /var/tmp
                                 /tmp
                                 /opt
                                 /var/opt
                                 /var/log
                                 /var/vcap/data
sdb      8:16   0    64G  0 disk 
└─sdb1   8:17   0    64G  0 part 
sdc      8:32   0    50G  0 disk 
└─sdc1   8:33   0    50G  0 part /var/vcap/store
sdd      8:48   0     1G  0 disk 
sde      8:64   0     1G  0 disk 

worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:/var/vcap/bosh_ssh/bosh_79bac66d3614401# resize2fs /dev/sdd
resize2fs 1.46.5 (30-Dec-2021)
resize2fs: Bad magic number in super-block while trying to open /dev/sdd
Couldn't find valid filesystem superblock.

worker/0ebcd467-c9d1-457a-8931-84520ab05cc9:/etc/udev/rules.d# resize2fs /dev/sde
resize2fs 1.46.5 (30-Dec-2021)
resize2fs: Bad magic number in super-block while trying to open /dev/sde
Couldn't find valid filesystem superblock.

Both nodes are created from the same template.

@andyzhangx
Member

Here is the data disk parsing logic. The reliable way is to create that udev rule on the node, which creates device links under /dev/disk/azure:

if len(azureDisks) == 0 {
	klog.V(4).Infof("/dev/disk/azure is not populated, now try to parse %v directly", name)
	target, err := strconv.Atoi(arr[0])
	if err != nil {
		klog.Errorf("failed to parse target from %v (%v), err %v", arr[0], name, err)
		continue
	}
	// as observed, targets 0-3 are used by OS disks. Skip them
	if target <= 3 {
		continue
	}
}
// extract LUN from the path.
// LUN is the last index of the array, i.e. 1 in /sys/bus/scsi/devices/3:0:0:1
l, err := strconv.Atoi(arr[3])
if err != nil {
	// unknown path format, continue to read the next one
	klog.V(4).Infof("azure disk - failed to parse lun from %v (%v), err %v", arr[3], name, err)
	continue
}
if lun == l {
	// find the matching LUN
	// read vendor and model to ensure it is a VHD disk
	vendorPath := filepath.Join(sysPath, name, "vendor")
	vendorBytes, err := io.ReadFile(vendorPath)
	if err != nil {
		klog.Errorf("failed to read device vendor, err: %v", err)
		continue
	}
	vendor := strings.TrimSpace(string(vendorBytes))
	if strings.ToUpper(vendor) != "MSFT" {
		klog.V(4).Infof("vendor doesn't match VHD, got %s", vendor)
		continue
	}
	modelPath := filepath.Join(sysPath, name, "model")
	modelBytes, err := io.ReadFile(modelPath)
	if err != nil {
		klog.Errorf("failed to read device model, err: %v", err)
		continue
	}
	model := strings.TrimSpace(string(modelBytes))
	if strings.ToUpper(model) != "VIRTUAL DISK" {
		klog.V(4).Infof("model doesn't match VHD, got %s", model)
		continue
	}
	// find a disk, validate name
	dir := filepath.Join(sysPath, name, "block")
	if dev, err := io.ReadDir(dir); err == nil {
		found := false
		devName := dev[0].Name()
		for _, diskName := range azureDisks {
			klog.V(12).Infof("azureDisk - validating disk %q with sys disk %q", devName, diskName)
			if devName == diskName {
				found = true
				break
			}
		}
		if !found {
			devLinkPaths := []string{"/dev/disk/azure/scsi1/", "/dev/disk/by-id/"}
			for _, devLinkPath := range devLinkPaths {
				diskPath, err := getDiskLinkByDevName(io, devLinkPath, devName)
				if err == nil {
					klog.V(4).Infof("azureDisk - found %s by %s under %s", diskPath, devName, devLinkPath)
					return diskPath, nil
				}
				klog.Warningf("azureDisk - getDiskLinkByDevName by %s under %s failed, error: %v", devName, devLinkPath, err)
			}
			return "/dev/" + devName, nil
		}
	}
}
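In short, the fallback above derives the LUN from the sysfs SCSI address (host:channel:target:lun) and skips targets 0-3, which are used by OS disks. A minimal standalone sketch of that parsing (the helper name `parseSCSIAddress` is hypothetical, not part of the driver):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseSCSIAddress splits a sysfs SCSI address such as "3:0:0:1"
// (host:channel:target:lun) and returns the target and LUN.
func parseSCSIAddress(addr string) (target, lun int, err error) {
	arr := strings.Split(addr, ":")
	if len(arr) != 4 {
		return 0, 0, fmt.Errorf("unexpected SCSI address format: %q", addr)
	}
	if target, err = strconv.Atoi(arr[0]); err != nil {
		return 0, 0, err
	}
	if lun, err = strconv.Atoi(arr[3]); err != nil {
		return 0, 0, err
	}
	return target, lun, nil
}

func main() {
	// "3:0:0:1" corresponds to /sys/bus/scsi/devices/3:0:0:1 in the
	// driver's comment above.
	target, lun, err := parseSCSIAddress("3:0:0:1")
	if err != nil {
		panic(err)
	}
	fmt.Println("target:", target, "lun:", lun)
}
```

This only illustrates the string parsing; the driver additionally validates the vendor/model attributes before trusting the device, as shown above.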

@xiaoping8385
Author

xiaoping8385 commented Apr 11, 2024

Who should be responsible for creating the udev rule? I thought the VM is created by the Azure cloud. And that doesn't explain why some of the nodes work and some don't.

@andyzhangx
Member

Can you try creating the udev rule on one node and check whether it works? @xiaoping8385 If it works, I can provide you a DaemonSet that creates the udev rule on every node.

@xiaoping8385
Author

So I need to create a rules file (66-azure-storage.rules) as described in the #1631 comment and apply that?

@andyzhangx
Member

@xiaoping8385 Just create that rules file and it will take effect automatically; then reschedule the pod with the disk volume onto that node.
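For context, a rule of this kind matches Azure virtual disks by their SCSI vendor/model attributes and creates stable symlinks under /dev/disk/azure, which is the path the driver probes first. The excerpt below is illustrative only; install the canonical 66-azure-storage.rules referenced in #1631 verbatim rather than this sketch:

```
# Illustrative only -- use the canonical 66-azure-storage.rules instead.
# Match block devices whose SCSI attributes identify an Azure virtual disk
# and create a stable symlink under /dev/disk/azure/ (symlink name here is
# a placeholder; the real rules compute per-LUN link names).
SUBSYSTEM=="block", ATTRS{vendor}=="Msft", ATTRS{model}=="Virtual Disk", SYMLINK+="disk/azure/example-data-disk"
```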

@xiaoping8385
Author

@andyzhangx It still doesn't work. I deployed the rules file, deleted the pod, and also restarted csi-azuredisk-node, but the pod is still not running. Would it be possible to have a Zoom call or chat?

@andyzhangx
Member

@xiaoping8385 Can you run the following command on the node and try again?

This command reloads the udev rules and triggers the creation of any new device nodes that match them.

sudo udevadm control --reload-rules && sudo udevadm trigger

@xiaoping8385
Author

It finally works, thanks. One more question: how should we prevent this issue in the future, or do we need an extra deployment? What's the root cause of this issue?

@andyzhangx
Member

It finally works, thanks. One more question: how should we prevent this issue in the future, or do we need an extra deployment? What's the root cause of this issue?

@xiaoping8385 Per this doc, the Linux VM should have udev rules:
https://learn.microsoft.com/en-us/troubleshoot/azure/virtual-machines/linux/troubleshoot-device-names-problems#solution

What is your k8s cluster? Are you manually setting up the k8s cluster based on Azure Ubuntu VMs?

@xiaoping8385
Author

@andyzhangx Sorry for the late reply. Our k8s cluster is actually created by the BOSH director, which has a CPI that talks to the Azure platform to create VMs. I don't think there are udev rules in our environments. As I mentioned before, we have two nodes without such a rule, but on one node the volume can be successfully staged while on the other it can't. Are there ways other than udev rules that help to stage the volume?

@andyzhangx
Member

@andyzhangx Sorry for the late reply. Our k8s cluster is actually created by the BOSH director, which has a CPI that talks to the Azure platform to create VMs. I don't think there are udev rules in our environments. As I mentioned before, we have two nodes without such a rule, but on one node the volume can be successfully staged while on the other it can't. Are there ways other than udev rules that help to stage the volume?

@xiaoping8385 Using udev rules is the most reliable way to detect a data disk correctly. Are you able to create a DaemonSet that writes the udev rules on every node?
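A DaemonSet for this purpose typically ships the rules via a ConfigMap, copies them into the host's /etc/udev/rules.d, and reloads udev in the host namespaces. A minimal sketch, assuming a ConfigMap named azure-udev-rules containing 66-azure-storage.rules (all names are hypothetical, and the image is only illustrative -- it must provide nsenter):

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: azure-udev-rules-installer
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: azure-udev-rules-installer
  template:
    metadata:
      labels:
        app: azure-udev-rules-installer
    spec:
      hostPID: true
      containers:
        - name: installer
          image: busybox:1.36
          securityContext:
            privileged: true
          command:
            - /bin/sh
            - -c
            # Copy the rules onto the host, reload udev there, then stay idle
            # so the DaemonSet pod keeps running.
            - |
              cp /rules/66-azure-storage.rules /host-udev/66-azure-storage.rules
              nsenter -t 1 -m -u -i -n udevadm control --reload-rules
              nsenter -t 1 -m -u -i -n udevadm trigger
              sleep infinity
          volumeMounts:
            - name: rules
              mountPath: /rules
            - name: host-udev
              mountPath: /host-udev
      volumes:
        - name: rules
          configMap:
            name: azure-udev-rules
        - name: host-udev
          hostPath:
            path: /etc/udev/rules.d
```

hostPID plus nsenter is used so that udevadm runs against the host's udev daemon rather than inside the container; a node reboot would achieve the same effect less gracefully.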

@xiaoping8385
Author

Sometimes, yes, since this issue only happens occasionally.

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot k8s-ci-robot added the lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. label Jul 21, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot k8s-ci-robot added lifecycle/rotten Denotes an issue or PR that has aged beyond stale and will be auto-closed. and removed lifecycle/stale Denotes an issue or PR has remained open with no activity and has become stale. labels Aug 20, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

@k8s-ci-robot k8s-ci-robot closed this as not planned Won't fix, can't repro, duplicate, stale Sep 19, 2024
@k8s-ci-robot
Contributor

@k8s-triage-robot: Closing this issue, marking it as "Not Planned".

In response to this:

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues and PRs.

This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Reopen this issue with /reopen
  • Mark this issue as fresh with /remove-lifecycle rotten
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/close not-planned

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
