Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

blockdev: query physical sector size, not logical #1056

Closed
wants to merge 1 commit into from

Conversation

jlebon
Copy link
Member

@jlebon jlebon commented Nov 25, 2022

When determining whether to install the 512b or 4k image, we should be comparing against the physical sector size of the drive, not its logical sector size. Some 4k drives report themselves as having a logical sector size of 512 bytes for backwards compatibility with OSes and software that can't do 4k blocks (these are also called "512e" drives). However, actually using the drive as a 512b one provides reduced performance because the drive needs to translate all requests.

For comparison, mkfs.xfs will use a sector size of 4096 on 512e drives:

https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/mkfs/xfs_mkfs.c?id=3498b68025ce#n1932

One can also experimentally verify this:

$ qemu-img create -f qcow2 storage-512.qcow2 1G
$ qemu-img create -f qcow2 storage-512e.qcow2 1G
$ qemu-img create -f qcow2 storage-4kn.qcow2 1G
$ cosa run -- \
    -drive if=none,id=storage-512,format=qcow2,file=storage-512.qcow2 \
    -device virtio-blk,drive=storage-512,physical_block_size=512,logical_block_size=512,serial=storage-512 \
    -drive if=none,id=storage-512e,format=qcow2,file=storage-512e.qcow2 \
    -device virtio-blk,drive=storage-512e,physical_block_size=4096,logical_block_size=512,serial=storage-512e \
    -drive if=none,id=storage-4kn,format=qcow2,file=storage-4kn.qcow2 \
    -device virtio-blk,drive=storage-4kn,physical_block_size=4096,logical_block_size=4096,serial=storage-4kn
[core@cosa-devsh ~]$ sudo mkfs.xfs /dev/disk/by-id/virtio-storage-512 | grep sectsz
         =                       sectsz=512   attr=2, projid32bit=1
         =                       sectsz=512   sunit=0 blks, lazy-count=1
[core@cosa-devsh ~]$ sudo mkfs.xfs /dev/disk/by-id/virtio-storage-512e | grep sectsz
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
[core@cosa-devsh ~]$ sudo mkfs.xfs /dev/disk/by-id/virtio-storage-4kn | grep sectsz
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       sectsz=4096  sunit=1 blks, lazy-count=1

To match this, we only need to change get_sector_size_for_path to use BLKPBSZGET instead of BLKSSZGET.

When determining whether to install the 512b or 4k image, we should be
comparing against the physical sector size of the drive, not its logical
sector size. Some 4k drives report themselves as having a logical sector
size of 512 bytes for backwards compatibility with OSes and software
that can't do 4k blocks (these are also called "512e" drives). However,
actually using the drive as a 512b one provides reduced performance
because the drive needs to translate all requests.

For comparison, `mkfs.xfs` will use a sector size of 4096 on 512e
drives:

https://git.kernel.org/pub/scm/fs/xfs/xfsprogs-dev.git/tree/mkfs/xfs_mkfs.c?id=3498b68025ce#n1932

One can also experimentally verify this:

```
$ qemu-img create -f qcow2 storage-512.qcow2 1G
$ qemu-img create -f qcow2 storage-512e.qcow2 1G
$ qemu-img create -f qcow2 storage-4kn.qcow2 1G
$ cosa run -- \
    -drive if=none,id=storage-512,format=qcow2,file=storage-512.qcow2 \
    -device virtio-blk,drive=storage-512,physical_block_size=512,logical_block_size=512,serial=storage-512 \
    -drive if=none,id=storage-512e,format=qcow2,file=storage-512e.qcow2 \
    -device virtio-blk,drive=storage-512e,physical_block_size=4096,logical_block_size=512,serial=storage-512e \
    -drive if=none,id=storage-4kn,format=qcow2,file=storage-4kn.qcow2 \
    -device virtio-blk,drive=storage-4kn,physical_block_size=4096,logical_block_size=4096,serial=storage-4kn
[core@cosa-devsh ~]$ sudo mkfs.xfs /dev/disk/by-id/virtio-storage-512 | grep sectsz
         =                       sectsz=512   attr=2, projid32bit=1
         =                       sectsz=512   sunit=0 blks, lazy-count=1
[core@cosa-devsh ~]$ sudo mkfs.xfs /dev/disk/by-id/virtio-storage-512e | grep sectsz
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
[core@cosa-devsh ~]$ sudo mkfs.xfs /dev/disk/by-id/virtio-storage-4kn | grep sectsz
         =                       sectsz=4096  attr=2, projid32bit=1
         =                       sectsz=4096  sunit=1 blks, lazy-count=1
```

To match this, we only need to change `get_sector_size_for_path` to use
`BLKPBSZGET` instead of `BLKSSZGET`.
@dustymabe
Copy link
Member

Nice work investigating this and writing a very detailed commit message here!

@prestist
Copy link
Contributor

Really good job! nice catch!

@jlebon
Copy link
Member Author

jlebon commented Dec 1, 2022

OK, after learning more about GPT, I think this is incorrect actually. The GPT is located at LBA 1, which is based on the logical sector size of the disk, not physical. So even on a 512e disk, it needs to be at offset 512. This patch will make it so that we install the 4Kn image, where the GPT is located at offset 4096. So the kernel will be unable to read the partition table because it expects it at offset 512.

So we're in this awkward situation where for optimal performance on 512e disks, we want the partition table of the 512b image, but the XFS filesystem of the 4k image. I guess we could create a third metal512e disk, but yuck. This feels a lot like coreos/fedora-coreos-tracker#1183; the current behaviour works but it's not optimal. When adding the "auto-detect and reprovision" logic to fix that issue, we could glob this issue in there too so that we reprovision also when we detect a 512e disk.

Anyway, the workaround short-term is also the same (manually reprovision the rootfs like this).

@dustymabe
Copy link
Member

I wonder if we need a separate tracker issue for this problem. We could still fix it the same way as coreos/fedora-coreos-tracker#1183, but a complete description of the problem with all surrounding context would be really nice for our future selves (and others).

@jlebon jlebon deleted the pr/use-physical-block-size branch April 19, 2023 02:25
@jlebon
Copy link
Member Author

jlebon commented Sep 8, 2023

I wonder if we need a separate tracker issue for this problem. We could still fix it the same way as coreos/fedora-coreos-tracker#1183, but a complete description of the problem with all surrounding context would be really nice for our future selves (and others).

I think we can consider this issue as "fixed" also by coreos/fedora-coreos-config#2320. It would be too disruptive to always reprovision if we detect a 512e disk. I think a reasonable approach would be to reprovision only when the disk is over a certain size, as a proxy for "this is a long-running instance". The autosave-xfs work implements just that.

@dustymabe
Copy link
Member

Noting here that we bumped the "certain size" threshold in coreos/fedora-coreos-config#2565

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

4 participants