Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

disk lookup by LUN ID can match a wrong disk #2034

Closed
RomanBednar opened this issue Oct 30, 2023 · 7 comments
Closed

disk lookup by LUN ID can match a wrong disk #2034

RomanBednar opened this issue Oct 30, 2023 · 7 comments

Comments

@RomanBednar
Copy link
Contributor

What happened:

When the driver is trying to match a disk by LUN ID in findDiskByLunWithConstraint it can end up matching a wrong disk if system is configured with multiple iSCSI disks with same suffix, for example /sys/bus/scsi/devices/6:0:0:0 and /sys/bus/scsi/devices/4:0:1:0 - both have :0 suffix which is what the driver uses for matching.

In the case we observed 6:0:0:0 was the disk expected to match but instead 4:0:1:0 was matched. Probably because there are two search paths and if the LUN ID 0 lookup fails for /dev/disk/azure/scsi1/ but succeeds for /dev/disk/by-id/ it might return incorrect disk as there are no further checks.

The disk with 4:0:1:0 in our case appears to be a Bitlocker Encryption Key (BEK) Volume which contained partitions and caused mount-utils to fail on format detection which returns "unknown data, probably partitions" in such case:

2023-09-01T13:55:38.139067886Z W0901 13:55:38.139019       1 azure_common_linux.go:180] azureDisk - getDiskLinkByDevName by sda under /dev/disk/azure/scsi1/ failed, error: device name(sda) is not found under /dev/disk/azure/scsi1/
2023-09-01T13:55:38.142227132Z I0901 13:55:38.142183       1 nodeserver.go:159] NodeStageVolume: formatting /dev/disk/by-id/scsi-14d53465420202020aef00e7eb172024e97cbb5e3b3d21e04 and mounting at /var/lib/kubelet/plugins/kubernetes.io/csi/disk.csi.azure.com/27afb2c50859602ff458e1ca914c4ac28aedab21e97344c0c0812f86b7f9c
42e/globalmount with mount options([])
2023-09-01T13:55:38.147728710Z W0901 13:55:38.147691       1 mount_linux.go:464] Configured to mount disk /dev/disk/by-id/scsi-14d53465420202020aef00e7eb172024e97cbb5e3b3d21e04 as unknown data, probably partitions but current format is ext4, things might break
2023-09-01T13:55:38.159282176Z I0901 13:55:38.159227       1 mount_linux.go:389] `fsck` error fsck from util-linux 2.32.1
2023-09-01T13:55:38.159282176Z fsck.ext2: Bad magic number in super-block while trying to open /dev/sda

Not sure why the code is written this way, but would it be possible to change this so the /dev/disk/azure/scsi1/ path is searched first and any possible fallback to by-id is more sophisticated to ensure we match correct disk? Current code will match almost anything by-id which is not safe as shown above.

What you expected to happen:

Driver should match correct disk for systems with multiple iSCSI disks with same suffix.

How to reproduce it:

Not sure - in our case it happened during upgrade when previous volumes were provisioned with in-tree plugin, but we don't have any evidence on when the BEK Volume was added. And as @jsafrane pointed out, the in-tree plugin is using the same device discovery so it's unclear why we did not hit this issue before.

Anything else we need to know?:

Environment:

  • CSI Driver version:
  • Kubernetes version (use kubectl version):
  • OS (e.g. from /etc/os-release):
  • Kernel (e.g. uname -a):
  • Install tools:
  • Others:
@RomanBednar
Copy link
Contributor Author

@andyzhangx - there's an older similar issue which was closed without a solution: #584

@andyzhangx
Copy link
Member

@andyzhangx - there's an older similar issue which was closed without a solution: #584

@RomanBednar thanks for the reminder, I will check try to fix it in Nov.

@andyzhangx
Copy link
Member

@RomanBednar can you set udev rule on every node? thus /dev/disk/azure/scsi1/ is populated correctly, that's a must on Azure

#1631 (comment)

@andyzhangx
Copy link
Member

close this issue since /dev/disk/azure/scsi1/ is NOT populated correctly, you need to set udev rules on every Azure node

@RomanBednar
Copy link
Contributor Author

@andyzhangx Sorry for late reply, we were waiting for customer feedback. They're using this set of udev rules currently and still hit the issue: https://github.com/Azure/WALinuxAgent/blob/master/config/66-azure-storage.rules

And the /dev/disk/azure/scsi1/ path appears to be populated correctly:

/dev/disk/azure/scsi1:
total 0
drwxr-xr-x. 2 0 0 280 Dec 11 15:30 .
drwxr-xr-x. 3 0 0 200 Dec 11 15:28 ..
lrwxrwxrwx. 1 0 0  12 Dec 11 15:28 lun0 -> ../../../sdd
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun1 -> ../../../sde
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun10 -> ../../../sdm
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun11 -> ../../../sdi
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun2 -> ../../../sdl
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun3 -> ../../../sdn
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun4 -> ../../../sdk
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun5 -> ../../../sdo
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun6 -> ../../../sdj
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun7 -> ../../../sdh
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun8 -> ../../../sdf
lrwxrwxrwx. 1 0 0  12 Dec 11 15:32 lun9 -> ../../../sdg

If I understand it right, the reason to have the udev rules was so that the driver matches the disk in /dev/disk/azure/scsi1 path correct?

But when the driver logs this warning: azureDisk - getDiskLinkByDevName by sda under /dev/disk/azure/scsi1/ failed, the disk (sda) it selected and returned for mounting (/dev/sda) is already wrong. This could be that the LUN ID is matched earlier in findDiskByLunWithConstraint under /sys/ (e.g. /sys/bus/scsi/devices/4:0:0:0) and resolved to sda (which is a BEK volume for us) - so having /dev/disk/azure/scsi1/ populated does not help with this.

@andyzhangx
Copy link
Member

@RomanBednar that's incorrect, if /dev/disk/azure/scsi1/ is populated and there is at least one disk under that path, then it won't go to following code snippet:

if len(azureDisks) == 0 {
klog.V(4).Infof("/dev/disk/azure is not populated, now try to parse %v directly", name)
target, err := strconv.Atoi(arr[0])
if err != nil {
klog.Errorf("failed to parse target from %v (%v), err %v", arr[0], name, err)
continue
}
// as observed, targets 0-3 are used by OS disks. Skip them
if target <= 3 {
continue
}
}

This azure disk csi driver will always find disk under /dev/disk/azure/scsi1/, and other way, .e.g. /sys/bus/scsi/devices/4:0:0:0 is not reliable, we should avoid using other way to find the disk.

@RomanBednar
Copy link
Contributor Author

@andyzhangx I might have described it wrong, you're right there is no matching done for devices under /sys but what is happening is that this function returns a disk which it did not match under /dev/disk/azure/scsi1/ but returns /dev/sdX regardless (whatever it's currently searching for), while it should have been filtered out (and next disk should have been tried, eventually matching the right disk).

What we have under /dev/disk/azure/scsi1/ are all disks that driver can use for volumes and there is no system disk (like BEK volume) - this should be correct right? But what happens is that it reads the scsi links and resolves to disk name (like sda), then fails to find it in /dev/disk/azure/scsi1/ and proceeds to return /dev/sda (incorrect) with a warning.

Maybe there's a different issue that I don't see yet? Because we have the right udev rules, the correct path is populated and still we're getting a wrong disk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants