Poor RBD performance as LIO-TCMU iSCSI target #359

Open
DongyuanPan opened this issue Jan 23, 2018 · 23 comments

@DongyuanPan

Hi~ I am a senior university student and I've been learning Ceph and iSCSI recently.

I'm using fio to test the performance of RBD, but I see a performance degradation when exporting RBDs through LIO-TCMU.

My test compares three cases: the performance of the RBD exported as a target through LIO-TCMU, the performance of the RBD itself (no iSCSI or LIO-TCMU), and the performance of the RBD exported as a target through TGT.

Details about the test environment:

  • Single node test "cluster" (osd pool default size = 1) with Ceph (version 12.2.2)
  • CentOS 7.4 (3.10.0-693.11.6.el7.x86_64)
  • fio-2.99, tcmu-runner-1.3.0-rc4
  • 16 OSDs with osd_objectstore = bluestore
  • rbd default features = 3

I use targetcli (or tgtadm) to create the target device and log in to it from the initiator. Then I use fio to test the device.
1) The performance of the RBD itself (no iSCSI or LIO-TCMU)
rbd create image-10 --size 102400 (rbd default features = 3)
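
For reference, this feature mask is normally controlled by the "rbd default features" option in ceph.conf; a minimal sketch, assuming it is placed in the [client] section:

[client]
rbd default features = 3    # 3 = layering (1) + striping (2)
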
fio test config

[global]
#logging
#write_iops_log=write_iops_log
#write_bw_log=write_bw_log
#write_lat_log=write_lat_log
ioengine=rbd
clientname=admin
pool=rbd
rbdname=image-10
rw=randwrite
bs=4k
numjobs=4
buffered=0
runtime=180
group_reporting=1

[rbd_iodepth32]
iodepth=128
#write_iops_log=write_rbd_default_feature_one
#log_avg_msec=1000

performance: 35-40 K IOPS

2) The performance of the RBD as a target using TGT
Create the LUN:
tgtadm --lld iscsi --mode logicalunit --op new --tid 1 --lun 1 --backing-store rbd/image-10 --bstype rbd
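
The target itself is not shown above; a minimal sketch of how it would typically have been created and opened up with tgtadm beforehand (the tid, IQN, and open ACL here are assumptions):

tgtadm --lld iscsi --mode target --op new --tid 1 --targetname iqn.2018-01.com.example02:iscsi
tgtadm --lld iscsi --mode target --op bind --tid 1 --initiator-address ALL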

Initiator login:
iscsiadm -m node --targetname iqn.2018-01.com.example02:iscsi -p 192.168.x.x:3260 -l

The LUN appears on the initiator as /dev/sdw.

fio test

[global]
bs=4k
ioengine=libaio
iodepth=128
direct=1
#sync=1
runtime=30
size=60G
buffered=0
#directory=/mnt
numjobs=4
filename=/dev/sdw
group_reporting=1

[rand-write]
time_based
write_iops_log=write_tgt_default_feature_three
log_avg_msec=1000
rw=randwrite
#stonewall

performance: 18-20K IOPS

3) The performance of the RBD as a target using LIO-TCMU
I use targetcli to create the LUN, with the TPG's default_cmdsn_depth set to 512 (a rough sketch of the setup follows the initiator settings below).
Initiator side settings:
node.session.cmds_max = 2048
node.session.queue_depth = 1024
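
A rough sketch of that setup; the backstore name, cfgstring, and IQN are assumptions, and the exact user:rbd create syntax can vary between targetcli versions:

targetcli /backstores/user:rbd create name=image-10 size=100G cfgstring=rbd/image-10
targetcli /iscsi create iqn.2018-01.com.example02:iscsi
targetcli /iscsi/iqn.2018-01.com.example02:iscsi/tpg1/luns create /backstores/user:rbd/image-10
targetcli /iscsi/iqn.2018-01.com.example02:iscsi/tpg1 set attribute default_cmdsn_depth=512

The node.session.* values above would go into /etc/iscsi/iscsid.conf (or be applied per node with iscsiadm -m node ... -o update) before logging in from the initiator.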

fio test config
[global]
bs=4k
ioengine=libaio
iodepth=128
direct=1
#sync=1
runtime=180
size=50G
buffered=0
#directory=/mnt
numjobs=4
filename=/dev/sdv
group_reporting=1

[rand-write]
time_based
write_iops_log=write_tgt_default_feature_three
log_avg_msec=1000
rw=randwrite
#stonewall

/dev/sdv is backed by image-10.

performance: 7K IOPS

I found reports of a similar issue, but I still haven't found the cause:
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-October/044021.html
http://lists.ceph.com/pipermail/ceph-users-ceph.com/2014-December/045347.html

Thanks for any help anyone can provide!

@mikechristie
Collaborator

We are just starting to investigate performance.

One known issue is that for LIO and open-iscsi you need to have node.session.cmds_max match the LIO default_cmdsn_depth setting. If they are not the same, then there seems to be a bug on the initiator side where IOs are requeued and do not get retried quickly like normal.
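
As a rough example of lining the two up (the IQN, portal, and the value 512 here are placeholders):

# target side: set the TPG queue depth
targetcli /iscsi/<target_iqn>/tpg1 set attribute default_cmdsn_depth=512
# initiator side: make the session queue match, then log out and back in
iscsiadm -m node -T <target_iqn> -p <portal> -o update -n node.session.cmds_max -v 512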

There is another issue for latency/IOPS-type tests where one command slows down others. The attached patch

runner-dont-wait.txt

is a hack around it, but it needs work because it can cause extra context switches.

For target_core_user there are other issues like its memory allocation in the main path, but you might not be hitting that with the fio arguments you are using.

@DongyuanPan
Author

Thank you~
@mikechristie

I retested with tcmu-runner-1.3.0 after setting node.session.cmds_max to match the LIO default_cmdsn_depth. There is some improvement in performance (18.8K IOPS), and it is now the same as using TGT.
Is this level of performance (the same as TGT) normal for tcmu-runner-1.3.0 without further optimization?

There is another issue for latency/IOPS-type tests where one command slows down others. The attached patch runner-dont-wait.txt is a hack around it, but it needs work because it can cause extra context switches.

If I test with the patch, the performance (32K IOPS) approaches that of the RBD itself.
But is the patch only meant for testing?
The wakeup argument is determined by aio_track->tracked_aio_ops. Must AIO be tracked? What might happen if I do not track AIO? Can this parameter be specified by the user?

@mikechristie
Collaborator

Thanks for testing.

If I test with the patch, the performance (32K IOPS) approaches that of the RBD itself. But is the patch only meant for testing?

Yeah, the patch needs some cleanup, because of what you notice below.

The wakeup argument is determined by aio_track->tracked_aio_ops. Must AIO be tracked? What might happen if I do not track AIO? Can this parameter be specified by the user?

It is used during failover/failback and recovery to make sure IOs are not being executed in the handler modules (handler_rbd, handler_glfs, etc) when we execute a callout like lock() or (re)open().

So, ideally, we would address these issues:

  1. In aio_command_finish we do not want to batch commands like we do today. We can either completely drop the batching like in the patch attached in the previous comment, or we can try to add some setting to try and limit how long we wait before calling tcmulib_processing_complete. For example we could do something like:

if (!wakeup && current_batch_wait > batch_timeout)
        tcmulib_processing_complete(dev);

  2. We would like to remove the track_lock from the main IO path, but we still need a way to make sure IO is not running on the handler when we do the lock/open callouts. We can maybe replace the aio_wait_for_empty_queue calls with tcmu_flush_device calls.

@MIZZ122

MIZZ122 commented Jan 25, 2018

@Github641234230 I noticed that your kernel version is 3.10.0-693.11.6.el7.x86_64.
Did you add any patches to your kernel?
Are you going to do HA?
My kernel is 3.10.0-693.11.6.el7.x86_64, with tcmu-runner-1.3.0-rc4.
I get an IO error when I modify the kernel parameter enable = 1.

@MIZZ122

MIZZ122 commented Jan 25, 2018

@mikechristie If our product can only use CentOS 7.4 (3.10.0-693.11.6.el7.x86_64) and I want to do HA, what should I do? Which patch can I use?

I've tried using targetcli to export the RBDs on all gateway nodes. On the iSCSI client side, I use dm-multipath to discover them and it works well (both active/active and active/passive). Is there any problem using this method for HA?
And issue #356 says Active/Active is not supported. I am very confused.

@mikechristie
Collaborator

@MIZZ122

For upstream tcmu-runner/ceph-iscsi-cli HA support you have to use RHEL 7.5 beta or newer kernel or this kernel:

https://github.com/ceph/ceph-client.

HA is only supported with active/passive. You must use the settings here

http://docs.ceph.com/docs/master/rbd/iscsi-initiators/

Just because dm-multipath lets you set up active/active does not mean it is safe. You can end up with data corruption. Use the settings in the docs (a rough sketch of the multipath configuration is shown below).
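
For reference, the linked docs configure dm-multipath for failover (active/passive); a rough sketch of that kind of multipath.conf entry follows, but take the exact values from the docs:

devices {
        device {
                vendor                 "LIO-ORG"
                hardware_handler       "1 alua"
                path_grouping_policy   "failover"
                path_checker           "tur"
                prio                   "alua"
        }
}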

If you are doing single node (non HA) then you can do active/active across multiple portals on that one node.

@mikechristie
Collaborator

@MIZZ122 if you have other questions about active/active, can you open a new issue or discuss them in the existing issue for active/active? This issue is for perf only.

@dillaman
Collaborator

@mikechristie Any update on this issue?

@mikechristie
Collaborator

@lxbsz was testing it out for gluster with the perf team. lxbsz, did it help, did you make the changes I requested, and were they needed, or was it ok to just always complete right away?

It looks like you probably got busy with resize so I can do the changes. Are you guys working with the perf team still, so we can get them tested?

@lxbsz
Collaborator

lxbsz commented Mar 16, 2018

@mikechristie Yes, we and the perf team are testing this together.

The environment is a PostgreSQL database running on a Gluster block volume in a CNS environment.

1. By changing node.session.cmds_max to match the LIO default_cmdsn_depth:
the performance improved only slightly, about 5%.

2. By applying https://github.com/open-iscsi/tcmu-runner/files/1654757/runner-dont-wait.txt:
the performance improved about 10%.

3. By changing default_cmdsn_depth to 64 (see the example command after this list):
the performance improved about 27%.
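
As a rough example, the depth change in item 3 can be applied per TPG with targetcli (the IQN here is a placeholder):

targetcli /iscsi/<target_iqn>/tpg1 set attribute default_cmdsn_depth=64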

So we are preparing to do more testing on this later. These days we are busy with the RHGS release.

@mikechristie
Collaborator

Ok, assume this is back on me.

@lxbsz
Collaborator

lxbsz commented Mar 16, 2018

We will test combinations of these changes later, once we have enough time.

@serjponomarev

Can I use this patch (https://github.com/open-iscsi/tcmu-runner/files/1654757/runner-dont-wait.txt) in a production ESXi environment?
If it's not recommended, how can I help you investigate the performance issue so it can be fixed?
I have all the needed hardware.

@mikechristie
Collaborator

It is perfectly safe crash-wise, but it might cause other regressions. If you can test, I can give you a patch later this week that makes the behavior configurable, so we can try to figure out whether there is some balance between the two extreme settings (with and without the patch), or whether it needs to be configurable per workload type.

@serjponomarev

OK, I am waiting for the patch and instructions on how to test it (ceph, tcmu-runner, fio).

@DongyuanPan
Author

DongyuanPan commented Apr 4, 2018

In my Ceph RBD test environment, TGT performance is better than LIO-TCMU.
So I created an IBLOCK backstore from a /dev/sda block device with targetcli, in order to test LIO performance without tcmu/tcmu-runner (a rough sketch of the commands is shown below).
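
A rough sketch of that IBLOCK export; the backstore name and IQN here are assumptions:

targetcli /backstores/block create name=ssd0 dev=/dev/sda
targetcli /iscsi create iqn.2018-04.com.example:iblock-test
targetcli /iscsi/iqn.2018-04.com.example:iblock-test/tpg1/luns create /backstores/block/ssd0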

4K rand_write
LIO+SSD DISK -> IOPS=48.9k, BW=191MiB/s
TGT+SSD DISK -> IOPS=49.2k, BW=192MiB/s

4K rand_read
LIO+SSD DISK -> IOPS=44.9k, BW=175MiB/s
TGT+SSD DISK -> IOPS=46.5k, BW=182MiB/s

64K write
LIO+SSD DISK ->IOPS=6221, BW=389MiB/s
TGT+SSD DISK -> IOPS=9100, BW=569MiB/s

64K read
LIO+SSD DISK ->IOPS=8389, BW=524MiB/s
TGT+SSD DISK ->IOPS=19.3k, BW=1208MiB/s

The performance of TGT is better than LIO's, which is strange.
Thanks for any help anyone can provide!

@wwba

wwba commented Apr 9, 2018

@mikechristie
In my Ceph cluster, the throughput of the SCSI disks is much lower than the RBDs'.
I run the LIO iSCSI gateway in a VM with kernel version 4.16.0-0.rc6. In the VM, I compared the performance of tcmu-runner with KRBD using fio (sync=1, -ioengine=psync -bs=4M -numjobs=10); an invocation along those lines is shown below.
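
Presumably each run looked something like the following (the device path, job name, and runtime are assumptions):

fio -name=seq-write -filename=/dev/sdX -rw=write -bs=4M -numjobs=10 -ioengine=psync -sync=1 -direct=1 -runtime=60 -group_reporting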

4M seq write & one LIO gw for a RBD
KRBD
BW=409MiB, avg lat = 97ms

LIO + TCMU
BW=131MiB, avg lat = 305ms

TGT+rbd_bs
BW=362MiB, avg lat = 110ms

4M seq read & one LIO gw for a RBD
KRBD
BW=1571MiB, avg lat = 25ms

LIO + TCMU
BW=256MiB, avg lat = 155ms

TGT+rbd_bs
BW=1556MiB, avg lat = 26ms

4M seq write & one LIO gw for four RBDs
KRBD
BW=205MiB, avg lat = 190ms

LIO + TCMU
BW=42MiB, avg lat = 921ms

TGT+rbd_bs
BW=193MiB, avg lat = 206ms

4M seq read & one LIO gw for four RBDs
KRBD
BW=416MiB, avg lat = 96ms

LIO + TCMU
BW=148MiB, avg lat = 270ms

TGT+rbd_bs
BW=397MiB, avg lat = 100ms

I get poor throughput for the SCSI disk when using TCMU. Does this have something to do with what you said earlier, that for target_core_user there are other issues like its memory allocation in the main path?

@shadowlinyf

@mikechristie Has the runner-dont-wait.txt patch already been merged to 1.4RC1?

@mikechristie
Collaborator

Yes.

@shadowlinyf

@mikechristie I am having a performance issue with an EC RBD as the backend store. I am using 1.4RC1. KRBD sequential write speed is about 600MB/s; TCMU+RBD sequential write speed is around 30MB/s.

@NUABO

NUABO commented Sep 18, 2018

Hi @shadowlinyf, did you test it again afterwards? Is TCMU still performing very poorly?

@Allenscript

Now I am seeing the same kind of performance: fio against the RBD directly gives about 500MB/s, but through TCMU with a user:rbd backstore the fio result is about 15MB/s, which is far too poor. My environment: kernel 5.0.4, tcmu-runner latest release 1.4.1, Ceph 12.2.11.

@deng-ruixuan

Now I am seeing the same kind of performance: fio against the RBD directly gives about 500MB/s, but through TCMU with a user:rbd backstore the fio result is about 15MB/s, which is far too poor. My environment: kernel 5.0.4, tcmu-runner latest release 1.4.1, Ceph 12.2.11.

I hit the same kind of performance issue and seem to have solved my problem, although other performance issues remain.
You can try using gwcli to set the following parameters for the disk:
/disks> reconfigure blockpool/image01 hw_max_sectors 8192
/disks> reconfigure blockpool/image01 max_data_area_mb 128
After setting these, the performance of TCMU can approach the performance of librbd in HDD scenarios.
