Poor RBD performance as LIO-TCMU iSCSI target #359
Comments
We are just starting to investigate performance. One known issue is that for LIO and open-iscsi you need to have node.session.cmds_max match the LIO default_cmdsn_depth setting. If they are not the same, then there seems to be a bug on the initiator side where IOs are requeued and do not get retried quickly like normal. There is another issue for latency/IOPs type of tests where one command slows others. The attached patch is a hack around it but it needs work because it can cause extra switches. For target_core_user there are other issues like its memory allocation in the main path, but you might not be hitting that with the fio arguments you are using. |
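The cmds_max / default_cmdsn_depth mismatch described above can be checked mechanically. The following is a minimal Python sketch, not part of tcmu-runner or open-iscsi. The helper names `parse_iscsid_settings` and `cmds_max_matches` are hypothetical. It assumes the initiator settings are available as `key = value` text in the style of iscsid.conf, and that the target's default_cmdsn_depth is already known.

```python
def parse_iscsid_settings(text):
    """Parse iscsid.conf-style text ("key = value" lines, "#" comments) into a dict."""
    settings = {}
    for raw in text.splitlines():
        line = raw.split("#", 1)[0].strip()  # drop comments and surrounding whitespace
        if "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def cmds_max_matches(initiator_conf_text, target_cmdsn_depth):
    """Return True only if node.session.cmds_max equals the target's default_cmdsn_depth."""
    settings = parse_iscsid_settings(initiator_conf_text)
    cmds_max = settings.get("node.session.cmds_max")
    if cmds_max is None:
        return False  # setting not present, so a match cannot be confirmed
    return int(cmds_max) == int(target_cmdsn_depth)


# Initiator values reported in this issue, checked against default_cmdsn_depth=512.
reported = "node.session.cmds_max = 2048\nnode.session.queue_depth = 1024\n"
print(cmds_max_matches(reported, 512))   # mismatch: 2048 vs 512
print(cmds_max_matches("node.session.cmds_max = 512\n", 512))
```

With the values from the original report the check fails, which is the mismatch the comment above warns about.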
Thank you. I retested with tcmu-runner-1.3.0 and set node.session.cmds_max to match the LIO default_cmdsn_depth. Performance improved somewhat (18.8K IOPS) and is now the same as with TGT.
When I test with the patch, the performance (32K IOPS) approaches that of the RBD itself. |
Thanks for testing.
Yeah, the patch needs some cleanup, because of what you notice below.
It is used during failover/failback and recovery to make sure IOs are not being executed in the handler modules (handler_rbd, handler_glfs, etc) when we execute a callout like lock() or (re)open(). So ideally, we have these issues:
|
@Github641234230 I noticed that your kernel version is 3.10.0-693.11.6.el7.x86_64. |
@mikechristie If our product can only use CentOS 7.4 (3.10.0-693.11.6.el7.x86_64) and I want to do HA, what should I do? Which patch can I use? I've tried using targetcli to export RBDs to all gateway nodes. On the iSCSI client side, I use dm-multipath to discover them, and it works well (both active/active and active/passive). Is there any problem with using this method for HA? |
For upstream tcmu-runner/ceph-iscsi-cli HA support you have to use a RHEL 7.5 beta or newer kernel, or this kernel: https://github.com/ceph/ceph-client. HA is only supported with active/passive. You must use the settings here: http://docs.ceph.com/docs/master/rbd/iscsi-initiators/. Just because dm-multipath lets you set up active/active does not mean it is safe. You can end up with data corruption. Use the settings in the docs. If you are doing single node (non-HA), then you can do active/active across multiple portals on that one node. |
@MIZZ122 If you have other questions about active/active, can you open a new issue or discuss it in the issue for active/active? This issue is for perf only. |
@mikechristie Any update on this issue? |
@lxbsz was testing it out for gluster with the perf team. lxbsz, did it help? Did you make the changes I requested, and were they needed, or was it ok to always just complete right away? It looks like you probably got busy with resize, so I can do the changes. Are you guys still working with the perf team, so we can get them tested? |
@mikechristie Yes, we are testing this together with the perf team. The environment is a PostgreSQL database running on a Gluster Block Volume in a CNS environment. We tried: 1) changing node.session.cmds_max to match the LIO default_cmdsn_depth; 2) applying https://github.com/open-iscsi/tcmu-runner/files/1654757/runner-dont-wait.txt; 3) changing default_cmdsn_depth to 64. We are preparing to do more testing on this later. These days we are busy with the RHGS release. |
Ok, assume this is back on me. |
We will test this by mixing them up later once we have enough time. |
Can I use this patch (https://github.com/open-iscsi/tcmu-runner/files/1654757/runner-dont-wait.txt) in a production ESXi environment? |
It is perfectly safe crash wise but might cause other regressions. If you can test, I can give you a patch later this week that makes it configurable so we can try to figure out if there is some balance between the 2 extreme settings being used with and without the patch or if it might need to be configurable for the type of workload. |
OK, I'm waiting for the patch and instructions on how to test it (ceph, tcmu-runner, fio). |
In my test environment for Ceph RBD, TGT performs better than LIO-TCMU. The tests were 4K rand_write, 4K rand_read, 64K write, and 64K read. TGT outperformed LIO, which is strange. |
@mikechristie My tests: 4M seq write with one LIO gw for one RBD; 4M seq read with one LIO gw for one RBD; 4M seq write with one LIO gw for four RBDs; 4M seq read with one LIO gw for four RBDs. I get poor throughput for the SCSI disk using TCMU. Does this have something to do with the
as you say? |
@mikechristie Has the runner-dont-wait.txt patch already been merged to 1.4RC1? |
Yes. |
@mikechristie I am having a performance issue with EC RBD as the backend store. I am using 1.4RC1. KRBD seq write speed is about 600MB/s. TCMU+RBD seq write speed is around 30MB/s. |
Hi @shadowlinyf, did you test it again afterwards? Is TCMU performance still very poor? |
I am now seeing the same performance problem. With fio directly on RBD the result was about 500MB/s; with TCMU (user:rbd) the fio result was about 15MB/s. This performance is too poor. My environment: kernel 5.0.4, tcmu-runner latest release 1.4.1, ceph 12.2.11. |
I hit the same performance problem. I seem to have solved my problem, although there are other performance issues. |
Hi, I am a senior university student and I've been learning Ceph and iSCSI recently.
I'm using fio to test RBD performance, but I see performance degradation when using RBDs with LIO-TCMU.
My tests compare three cases: the RBD as a target using LIO-TCMU, the RBD itself (no iSCSI or LIO-TCMU), and the RBD as a target using TGT.
Details about the test environment:
I use targetcli (or tgtadm) to create the target device and an initiator to log in to it. Then I use fio to test the device.
1) The performance of the RBD itself (no iSCSI or LIO-TCMU)
rbd create image-10 --size 102400 (rbd default features = 3)
fio test config
performance: 35-40 K IOPS
2) The performance of the RBD as a target using TGT
create lun:
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store rbd/image-10 --bstype rbd
initiator
iscsiadm -m node --targetname iqn.2018-01.com.example02:iscsi -p 192.168.x.x:3260 -l
the lun was mounted as /dev/sdw
fio test
performance: 18-20K IOPS
3) The performance of the RBD as a target using LIO-TCMU
Use targetcli to create the LUN, with TPG default_cmdsn_depth=512.
initiator side
node.session.cmds_max = 2048
node.session.queue_depth = 1024
/dev/sdv backend image-10
performance: 7K IOPS
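To put the three results side by side, the arithmetic below expresses each iSCSI result as a fraction of the raw RBD result. It is a small Python sketch using only the IOPS figures quoted in this post. The function name `relative_range` is hypothetical, and the ranges are treated as simple lower and upper bounds.

```python
def relative_range(measured, baseline):
    """Return (lowest, highest) possible ratio of a measured IOPS range to a baseline range."""
    measured_low, measured_high = measured
    baseline_low, baseline_high = baseline
    # The worst case divides the low measurement by the high baseline, and vice versa.
    return measured_low / baseline_high, measured_high / baseline_low


RBD = (35_000, 40_000)        # 1) RBD itself: 35-40K IOPS
TGT = (18_000, 20_000)        # 2) TGT: 18-20K IOPS
LIO_TCMU = (7_000, 7_000)     # 3) LIO-TCMU: 7K IOPS

for name, result in (("TGT", TGT), ("LIO-TCMU", LIO_TCMU)):
    low, high = relative_range(result, RBD)
    print(f"{name}: {low:.1%} to {high:.1%} of raw RBD IOPS")
```

By these figures TGT reaches roughly half of the raw RBD IOPS, and LIO-TCMU about a fifth or less.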
I found reports of an issue similar to mine, but I still haven't found the cause:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-October/044021.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-December/045347.html
Thanks for any help anyone can provide!