Ceph: ceph-volume lvm batch support #225

Open · junousi opened this issue Jun 17, 2023 · 6 comments

junousi (Contributor) commented Jun 17, 2023

This is part feature request, part reminder for myself, as I could probably whip up a PR at some point.

According to The Literature (1*, 2*), putting a DB/WAL on an NVMe might not be the only way to utilize an NVMe. When storing data on one, having a single OSD span the entire device can be sub-optimal; it is better to split it at least a little bit.

Enter LVM batches:

A) ceph-volume lvm batch --osds-per-device 4 /dev/nvme2n1

or with a separate DB device:

B) ceph-volume lvm batch --osds-per-device 4 /dev/nvme2n1 --db-devices /dev/nvme0n1

or with multiple data devices:

C) ceph-volume lvm batch --osds-per-device 4 /dev/nvme2n1 /dev/nvme3n1 --db-devices /dev/nvme0n1

I suppose the input data could look something like:

pve_ceph_osds:
  # current form factor
  - device: /dev/sdc
  # proposed form factor 
  - lvm_batch:
      osds_per_device: 4
      devices:
        - /dev/nvme2n1
        - /dev/nvme3n1
      db_devices:
        - /dev/nvme0n1

...but this is up for debate, of course.

I have tested scenario A) by running the command manually server-side, then adding just - device: /dev/nvme2n1 to pve_ceph_osds and running the playbook. It works fine. This is on the latest PVE, Ceph Quincy, and the latest version of the role. But it would be handy to have the role control the batch creation, hence this issue.

IIUC, when offloading the DB to a dedicated device, once a batch is built there is no possibility to add more devices later, because all of the DB device's space is consumed and split evenly at build time. Adding a device to an existing batch would therefore probably mean a teardown and rebuild of the entire batch, so it is better to have spare cluster capacity for such scenarios.


1* https://forum.proxmox.com/threads/recommended-way-of-creating-multiple-osds-per-nvme-disk.52252/
2* https://www.reddit.com/r/ceph/comments/jnyxgm/how_do_you_create_multiple_osds_per_disk_with/

lae (Owner) commented Jun 17, 2023

Sounds like a good idea.

Some suggestions for implementation using the suggested schema (though they should still apply in any case):

- pve_ceph_osds[*].device entries should still remain valid/functional
- there should be an early role variable check to ensure that devices in pve_ceph_osds[*].lvm_batch.devices[*] and pve_ceph_osds[*].device are all unique and don't overlap (see the sketch below)
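
A minimal sketch of such a check, assuming the schema proposed above (the task, its placement, and the filter chain are purely illustrative and not something the role has today):

# Hypothetical pre-flight check against the proposed pve_ceph_osds schema.
- name: Ensure OSD devices are not duplicated between standalone and lvm_batch entries
  ansible.builtin.assert:
    that:
      - (_single_devices | intersect(_batch_devices)) | length == 0
      - _batch_devices | unique | length == _batch_devices | length
    fail_msg: "pve_ceph_osds contains overlapping or duplicate device entries"
  vars:
    _single_devices: "{{ pve_ceph_osds | selectattr('device', 'defined') | map(attribute='device') | list }}"
    _batch_devices: "{{ pve_ceph_osds | selectattr('lvm_batch', 'defined') | map(attribute='lvm_batch.devices') | flatten }}"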

I think, ideally, if we're going to mix use of pveceph and ceph-volume for OSD management, we should switch to writing a module for it, since it should be easier to safely reason about actions and idempotency with Python rather than using set_fact and all that [0]. But I guess adding a task prior to Create Ceph OSDs for lvm batch entries, and then producing a list for the Create Ceph OSDs task, could be one method? (if I'm understanding the procedure you describe correctly)
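
As a rough illustration of that second approach (the Create Ceph OSDs task name comes from the role; everything else here, including the loop and flag wiring, is only a sketch):

# Hypothetical task running before "Create Ceph OSDs" to handle lvm_batch entries.
- name: Create Ceph OSDs via ceph-volume lvm batch
  ansible.builtin.command: >-
    ceph-volume lvm batch --yes
    --osds-per-device {{ item.lvm_batch.osds_per_device }}
    {{ item.lvm_batch.devices | join(' ') }}
    {% if item.lvm_batch.db_devices is defined %}--db-devices {{ item.lvm_batch.db_devices | join(' ') }}{% endif %}
  loop: "{{ pve_ceph_osds | selectattr('lvm_batch', 'defined') | list }}"
  # Idempotency is not handled here; see the --report idea further down.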

Hm...

After skimming through The Literature [1], I'm wondering if trying to interface over lvm batch may be unnecessary, or rather should be its own exclusive role variable/task that is used to generate a list of OSDs and values for flags [2] to pass to ceph-volume/pveceph. The idempotency aspect [3] of ceph-volume lvm batch seems like it might be useful to interface over, too, and it looks like you could "plan" to have more data devices by specifying --(block-db|block-wal|journal)-slots to some expected number of devices when using batch.

Anyway, the reason why I think it might be unnecessary is that...the following seems to suggest osds_per_device just specifies a worker count for accessing a device rather than physically splitting it?

> Number of osd daemons per “DATA” device. To fully utilize nvme devices multiple osds are required. Can be used to split dual-actuator devices across 2 OSDs, by setting the option to 2. [4]

Which seems to suggest to me that maybe there's another way of configuring it outside of using lvm batch. But given the general benefits of lvm batch, we might as well use it. Alternatively...drive groups [5], but I feel like that would definitely be pushing the scope of this role. 😅
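
On the idempotency point [3], one hedged option would be to consult the batch report before acting; --report and --format json are real ceph-volume flags, while the surrounding task logic is only an assumption:

# Hypothetical dry run: ask ceph-volume what it would create, without changing anything.
- name: Report planned OSDs for each lvm_batch entry
  ansible.builtin.command: >-
    ceph-volume lvm batch --report --format json
    --osds-per-device {{ item.lvm_batch.osds_per_device }}
    {{ item.lvm_batch.devices | join(' ') }}
  register: _batch_report
  changed_when: false
  loop: "{{ pve_ceph_osds | selectattr('lvm_batch', 'defined') | list }}"

A follow-up task could then run the actual batch only when the report lists OSDs that would be created.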

I'll think about it some more but if you open a PR I'll take a look and evaluate.

[0] https://github.com/lae/ansible-role-proxmox/blob/develop/tasks/ceph.yml#L39
[1] https://docs.ceph.com/en/latest/ceph-volume/lvm/batch/
[2] https://docs.ceph.com/en/latest/ceph-volume/lvm/batch/#json-reporting
[3] https://docs.ceph.com/en/latest/ceph-volume/lvm/batch/#idempotency-and-disk-replacements
[4] https://docs.ceph.com/en/latest/cephadm/services/osd/#ceph.deployment.drive_group.DriveGroupSpec.osds_per_device
[5] https://docs.ceph.com/en/latest/cephadm/services/osd/#drivegroups

junousi (Author) commented Jun 17, 2023

> After skimming through The Literature [1], I'm wondering if trying to interface over lvm batch may be unnecessary

It absolutely can be, and rather than trying to implement something ASAP, I think it's a good idea to let this issue brew for a while in case other Ceph users want to give feedback.

For example, simply augmenting support for the pveceph osd ... -db_size parameter would probably allow similar control over the DB grain sizes. In a similar fashion, some LVM wrapper/boilerplate could be considered for a pveceph-only solution to partitioning the data part. It's irritatingly convenient that ceph-volume seems to do all of these things automagically :)
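
Purely for illustration, plumbing a DB size through the current per-device form factor might look roughly like this; db_size is a hypothetical key (pveceph osd create does take a -db_size option, in GiB), not something the role understands today:

pve_ceph_osds:
  - device: /dev/nvme2n1
    block.db: /dev/nvme0n1
    db_size: 64   # hypothetical key; would map to `pveceph osd create ... -db_size 64` (GiB)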

Thanks for the schema pointers. I was also contemplating something like:

pve_ceph_osds:
  - device: /dev/nvme2n1
    block.db: /dev/nvme0n1
    lvm_batch: true
    osds_per_device: 4
  - device: /dev/nvme3n1
    block.db: /dev/nvme0n1
    lvm_batch: true
    osds_per_device: 4

...which would retain more of the existing structure, but needlessly replicate information.

And also something like:

pve_ceph_osds:
  - device: /dev/nvme2n1
    lvm_batch: LABEL
pve_ceph_osds_batches:
  - LABEL:
      osds_per_device: 4
      block.db: /dev/nvme0n1

...which I suppose is a bit unintuitive.

junousi (Author) commented Jun 17, 2023

> Anyway, the reason why I think it might be unnecessary is that...the following seems to suggest osds_per_device just specifies a worker count for accessing a device rather than physically splitting it?

Regarding this, I can mention from brief testing that lvm batch --osds-per-device 4 definitely generated something ... tangible:

12    ssd   0.36389          osd.12              up   1.00000  1.00000 # these existed before
13    ssd   0.36389          osd.13              up   1.00000  1.00000 # these existed before
14    ssd   0.36389          osd.14              up   1.00000  1.00000 # these existed before
15    ssd   0.36389          osd.15              up   1.00000  1.00000 # these existed before
16    ssd   0.36389          osd.16              up   1.00000  1.00000 # these existed before
17    ssd   0.36389          osd.17              up   1.00000  1.00000 # these existed before
18    ssd   0.90959          osd.18              up   1.00000  1.00000 # these spawned from LVM batch op
19    ssd   0.90959          osd.19              up   1.00000  1.00000 # these spawned from LVM batch op
20    ssd   0.90959          osd.20              up   1.00000  1.00000 # these spawned from LVM batch op
21    ssd   0.90959          osd.21              up   1.00000  1.00000 # these spawned from LVM batch op

lae (Owner) commented Jun 17, 2023

> is now showing additional osd.18 through osd.21 per ceph osd tree.

Right, yeah. From my interpretation those are just extra OSD daemons running for the same device (hence the worker term I used), which brings performance improvements. So it's a number that could theoretically be modified at any time without modifying the associated disk, I think.

junousi (Author) commented Jun 19, 2023

> is now showing additional osd.18 through osd.21 per ceph osd tree.
>
> Right, yeah. From my interpretation those are just extra OSD daemons running for the same device (hence the worker term I used), which brings performance improvements. So it's a number that could theoretically be modified at any time without modifying the associated disk, I think.

Computer says no (but also...yes? regarding the ls-by-host part):

# ceph-volume lvm list /dev/nvme0n1 --format json|grep -i '\"ceph.osd_'|awk -F'"' '{print $4}'
4ab98f5e-be73-411c-ac7d-da3bfef8a85a
18
4cf2509a-fed0-4019-928a-4df3be8fed89
19
8418902d-4277-401b-ada7-3ebf4a58f411
20
c436bde0-9e20-4ee2-b28d-54af30caaa36
21
# lvs | fgrep -f <(ceph-volume lvm list /dev/nvme0n1 --format json|grep -i '\"ceph.osd_fsid'|awk -F'"' '{print $4}')
  osd-block-4ab98f5e-be73-411c-ac7d-da3bfef8a85a ceph-8fa4913d-d31b-4fff-9540-82e2f5a77166 -wi-ao---- <931.48g                                                    
  osd-block-4cf2509a-fed0-4019-928a-4df3be8fed89 ceph-8fa4913d-d31b-4fff-9540-82e2f5a77166 -wi-ao---- <931.48g                                                    
  osd-block-8418902d-4277-401b-ada7-3ebf4a58f411 ceph-8fa4913d-d31b-4fff-9540-82e2f5a77166 -wi-ao---- <931.48g                                                    
  osd-block-c436bde0-9e20-4ee2-b28d-54af30caaa36 ceph-8fa4913d-d31b-4fff-9540-82e2f5a77166 -wi-ao---- <931.48g                                                    
# ceph device ls-by-host $(hostname)|grep nvme
WD_BLACK_AN1500_WUBT21180201                     nvme0n1  osd.18 osd.19 osd.20 osd.21                  
# ceph-volume lvm batch --osds-per-device 2 /dev/nvme0n1
--> DEPRECATION NOTICE
--> You are using the legacy automatic disk sorting behavior
--> The Pacific release will change the default to --no-auto
--> passed data devices: 1 physical, 0 LVM
--> relative data size: 0.5
--> All data devices are unavailable

Total OSDs: 0

  Type            Path                                                    LV Size         % of device
--> The above OSDs would be created if the operation continues
--> do you want to proceed? (yes/no) yes
#
# # (nothing happens when answering yes)

If one batch op has already carved the NVMe into ~4x900 GB, then even if the user were willing to take the hit from a rebalance, or mitigate it with noout or whatever, the tool is not magical enough to rearrange the NVMe into e.g. the 2x1800 GB attempted above.
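
For completeness, the teardown + rebuild mentioned earlier would presumably look something like the following destructive sketch; ceph-volume lvm zap and its --destroy flag are real, but the task wiring (and the assumption that the affected OSDs have already been removed from the cluster) is only an illustration:

# Hypothetical and destructive: wipe the old batch LVs, then re-run the batch with the new layout.
- name: Zap the existing OSD LVs on the device
  ansible.builtin.command: ceph-volume lvm zap --destroy /dev/nvme0n1
  # The OSDs on the device would have to be stopped and purged from the cluster first.

- name: Re-create the batch with the new split
  ansible.builtin.command: ceph-volume lvm batch --yes --osds-per-device 2 /dev/nvme0n1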

I dunno, maybe this is a bit more involved than I initially thought :) I pushed some stuff here, but I will now be AFK for quite some time before I can look at this further.

lae (Owner) commented Jun 19, 2023

Oh! Okay, yeah, that does seem to indicate my understanding of osds_per_device from just the docs was incorrect. (I have, like, no environment — or funds to make one, to be honest — to be able to test/experiment with this, unfortunately....)
