
tcmur: improve batch kernel wake up notifications #392

Merged
merged 1 commit into open-iscsi:master from wip-batch-wake-ups on Apr 6, 2018

Conversation

dillaman
Collaborator

Only fire a single wake up concurrently. If additional AIO
completions arrive while the kernel is being woken up, the
wake up will fire again.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>

@dillaman
Collaborator Author

This is a compromise between the current implementation and the solution proposed in #359. It serializes the wake ups to the kernel but attempts to batch if multiple completions arrive while waking the kernel up.

In my development environment, it improves 4K small IO from ~5K IOPS to ~8.5K IOPS.
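At a high level, the tracking works like this (a minimal sketch; the function and field names mirror the patch, but the bodies here are illustrative rather than the literal diff, and the pthread cleanup handlers are omitted):

#include <pthread.h>

/* Assumes struct tcmu_track_aio has an int pending_wakeups protected by
 * a pthread_mutex_t track_lock, as in the fragments quoted below. */

static void track_aio_request_finish(struct tcmur_device *rdev, int *wake_up)
{
	struct tcmu_track_aio *aio_track = &rdev->track_queue;

	pthread_mutex_lock(&aio_track->track_lock);
	/* Only the thread that moves the counter off zero fires a wake up;
	 * everyone else piggybacks on the in-flight one. */
	*wake_up = (aio_track->pending_wakeups++ == 0) ? 1 : 0;
	pthread_mutex_unlock(&aio_track->track_lock);
}

static void track_aio_wakeup_finish(struct tcmur_device *rdev, int *wake_up)
{
	struct tcmu_track_aio *aio_track = &rdev->track_queue;

	pthread_mutex_lock(&aio_track->track_lock);
	if (aio_track->pending_wakeups > 1) {
		/* More completions arrived while the kernel was being woken
		 * up: batch them all into one more wake up. */
		aio_track->pending_wakeups = 1;
		*wake_up = 1;
	} else {
		aio_track->pending_wakeups = 0;
		*wake_up = 0;
	}
	pthread_mutex_unlock(&aio_track->track_lock);
}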

@dillaman changed the title from "tcmur: batch kernel wake up notifications" to "tcmur: improve batch kernel wake up notifications" on Mar 30, 2018
@mikechristie
Collaborator

Adding @lxbsz who just pinged me on this. You lucked out! The ceph team did it for gluster again :)

{
struct tcmu_track_aio *aio_track = &rdev->track_queue;

pthread_cleanup_push(_cleanup_mutex_lock, (void *)&aio_track->track_lock);

I don't think the cleanup handler is necessary here. It's only needed when there is a cancellation point in the critical section. My reading of POSIX (and the Linux man page) doesn't list assert as a cancellation point, but I could be wrong... It may write output, which could block (and hence should count as a cancellation point), but if the assert fires it will exit the application anyway, so it's probably moot.
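For context, this is the pattern in question (a minimal sketch, not the patch itself; the wrapper function name and the assert body are assumptions for illustration):

#include <assert.h>
#include <pthread.h>

/* Cleanup handler: unlocks the mutex if the thread is cancelled while
 * holding it. This only matters if the critical section can actually
 * reach a cancellation point. */
static void _cleanup_mutex_lock(void *arg)
{
	pthread_mutex_unlock(arg);
}

/* Hypothetical wrapper showing the push/lock ... unlock/pop pairing. */
static void track_sanity_check(struct tcmu_track_aio *aio_track)
{
	pthread_cleanup_push(_cleanup_mutex_lock, &aio_track->track_lock);
	pthread_mutex_lock(&aio_track->track_lock);

	/* POSIX does not list assert() as a cancellation point, so nothing
	 * in this critical section should be cancellable. */
	assert(aio_track->pending_wakeups >= 0);

	pthread_mutex_unlock(&aio_track->track_lock);
	pthread_cleanup_pop(0);
}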

Collaborator


Yeah, it's probably not needed since the daemon will crash if we hit the assert.

I think we are sometimes overly careful. We can do a cleanup patch later.

@serjponomarev

In my test environment this patch increases performance by about 3x and decreases latency by about 3x.

FIO pattern: --bs=4k --iodepth=32 --numjobs=4 --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio (--rw=randwrite, --rw=randread)
tcmu-release -> randread iops 7019, lat (msec): min=5, max=226, avg=22.64, stdev=18.78
tcmu-release -> randwrite iops 3395, lat (msec): min=11, max=244, avg=38.80, stdev= 6.09

tcmu-this-patch -> randread iops 19.1k, lat (usec): min=906, max=37268, avg=6657.18, stdev=1972.39
tcmu-this-patch -> randwrite iops 8406, lat (usec): min=2575, max=83831, avg=15722.03, stdev=6089.71

@lxbsz
Collaborator

lxbsz commented Apr 2, 2018

@mikechristie Yeah, this is what we really need. Thanks very much to @dillaman for the great work.

I will test this later.

@zhuozh

zhuozh commented Apr 3, 2018

Hi, all. I tested this patch with one 1T LUN; 8K randwrite IOPS roughly doubled (~121 IOPS to ~243 IOPS). I will create more LUNs to stress the whole cluster and report back the results.

@DongyuanPan

Hi, I tested the performance with this patch and did some comparisons.
My environment:
CentOS 7.4 (3.10.0-693.11.6.el7.x86_64)
ceph version 10.2.5
fio 2.99

RBD 4k rand_write
Tcmu-runner 1.3.0 -> 14k IOPS
Tcmu-runner with the patch -> 18.3k IOPS
Tgt -> 15k IOPS

RBD 4k rand_read
Tcmu-runner 1.3.0 -> 23k IOPS
Tcmu-runner with the patch -> 30.7k IOPS
Tgt -> 22k IOPS

RBD 64k write
Tcmu-runner 1.3.0 -> 181MiB/s
Tcmu-runner with the patch -> 327MiB/s
Tgt -> 427MiB/s

RBD 64k read
Tcmu-runner 1.3.0 -> 393MiB/s
Tcmu-runner with the patch -> 535MiB/s
Tgt -> 1442MiB/s

@lxbsz
Collaborator

lxbsz commented Apr 4, 2018

I have tested this with the gluster-block product:
Without this:
IOPS:
read : io=311108KB, bw=5181.5KB/s, iops=1294, runt= 60043msec
write: io=310624KB, bw=5173.4KB/s, iops=1292, runt= 60043msec
BW:
read : io=2237.0MB, bw=37944KB/s, iops=35, runt= 60371msec
write: io=2005.0MB, bw=33921KB/s, iops=31, runt= 60526msec

With this patch:
IOPS:
read : io=351892KB, bw=5861.6KB/s, iops=1464, runt= 60039msec
write: io=351424KB, bw=5853.3KB/s, iops=1462, runt= 60039msec
BW:
read : io=2353.0MB, bw=39934KB/s, iops=36, runt= 60337msec
write: io=2403.0MB, bw=40698KB/s, iops=37, runt= 60461msec

IOPS improved by about 13% and BW improved by about 10-20%.

@zhuozh

zhuozh commented Apr 4, 2018

@serjponomarev @Github641234230 Your IOPS/BW numbers are very high. What's the LUN size in your tests? What is your cluster HW configuration? Since RBD is thin-provisioned, did you dd the LUN before the test to make sure the space is really allocated?
By the way, with a single LUN my IOPS roughly doubles, but with vdbench writing multiple LUNs simultaneously it only improves about 10% in total.

@DongyuanPan

@zhuozh LUN size = 100G. Yes, the space is really allocated for the RBD.
Details about the test environment:
Single node test "cluster"
CentOS 7.4 (3.10.0-693.11.6.el7.x86_64)
16 OSD
800G SSD

rbd info
rbd image 'image-10':
size 102400 MB in 25600 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.455f82ae8944a
format: 2
features: layering
flags:

fio 2.99

[global]
bs=4k
ioengine=libaio
iodepth=128
direct=1
#sync=1
runtime=300
size=60G
buffered=0
#directory=/mnt
numjobs=4
filename=/dev/sdv
group_reporting=1

[randwrite]
time_based
#write_iops_log=
#log_avg_msec=1000
rw=randwrite
#stonewall

@serjponomarev

@zhuozh in my tests:
3 x physical host with ESXi, Intel Xeon 2696, 10G Mellanox and Samsung EVO 960
Ceph - v12.2.3
RBD size - 100Gb (before tests, space was allocated by writing 4M blocks)
3 x osd host -> ESXi VM (8vCPU, 8Gb RAM, 10G public/cluster network), OS - CentOS 7.4, default kernel.
osd storage - VMDK (50Gb, eager zeroed) on Samsung EVO 960
3 x OSD 50Gb on each OSD host, total 9 OSD x 50 Gb
replica count 3
Ceph all debug disabled

tcmulib_processing_complete(dev);
track_aio_wakeup_finish(rdev, &wake_up);
}
}
Collaborator


I think we could hit a race:

  1. pending_wakeups = 0. thread1 calls track_aio_request_finish, pending_wakeups is incremented to 1, and wake_up=1 is returned. It does its call to tcmur_command_complete, then falls into the while loop and does tcmulib_processing_complete.

  2. thread2 then calls track_aio_request_finish and increments pending_wakeups=2. wake_up=0 is returned, so it is relying on thread1 to call tcmulib_processing_complete.

  3. thread1 then calls track_aio_wakeup_finish. It sees aio_track->pending_wakeups > 1, sets pending_wakeups=1, and returns wake_up=1.

  4. thread1 is super fast: it loops again, makes the tcmulib_processing_complete call and the track_aio_wakeup_finish call, which should have completed the processing of thread2's cmd. pending_wakeups=1, so it is now set to 0 and wake_up=0 is returned. thread1 is now done and returns.

  5. thread2 finally calls tcmur_command_complete (or it could have been in the middle of calling it but not yet done updating the ring buffer). wake_up was 0 for it, so it does not go into the loop above, and tcmulib_processing_complete is never called for its command.

If the race analysis is correct, I think we just need to do something like the attached, where we hold the device completion lock while updating the aio track calls as well as doing the tcmulib_command_complete. aio_command_finish is just a little funky in that it straddles both the aio code and the tcmur cmd handler code, so my change here is a little gross. (The ordering in question is sketched below the attachment.)

Jasons392-plus-my-extra-dumb-locking.txt
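For readers following the thread, the completion path being discussed looks roughly like this (a hypothetical reconstruction assembled from the fragments quoted in this review; the exact signature and the tcmu_get_daemon_dev_private helper are assumptions, not the literal diff):

void aio_command_finish(struct tcmu_device *dev, struct tcmulib_cmd *cmd,
			int rc)
{
	struct tcmur_device *rdev = tcmu_get_daemon_dev_private(dev);
	int wake_up;

	/* Racy order: the wake-up counter is bumped before the completion
	 * is published to the ring, so another thread can consume the last
	 * wake up before this command's completion is visible (step 5
	 * above). */
	track_aio_request_finish(rdev, &wake_up);
	tcmur_command_complete(dev, cmd, rc);
	while (wake_up) {
		tcmulib_processing_complete(dev);
		track_aio_wakeup_finish(rdev, &wake_up);
	}
}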

Collaborator Author


Couldn't the call to tcmur_command_complete and track_aio_request_finish be swapped?

Collaborator


> Couldn't the call to tcmur_command_complete and track_aio_request_finish be swapped?

Lol, yeah that is much better! My eyes were probably going cross-eyed :)

Collaborator Author


OK -- swapped the two function calls.
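The swap publishes the completion to the ring before the wake-up counter is bumped, so whichever thread consumes the final wake up is guaranteed to see it. In terms of the sketch above, the reordering looks like this (illustrative, not the literal commit):

-	track_aio_request_finish(rdev, &wake_up);
 	tcmur_command_complete(dev, cmd, rc);
+	track_aio_request_finish(rdev, &wake_up);
 	while (wake_up) {
 		tcmulib_processing_complete(dev);
 		track_aio_wakeup_finish(rdev, &wake_up);
 	}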

@mikechristie
Collaborator

Thanks.

@mikechristie merged commit 0058479 into open-iscsi:master on Apr 6, 2018
@dillaman deleted the wip-batch-wake-ups branch on April 6, 2018 20:15