
tcmur: improve batch kernel wake up notifications #392

Merged
merged 1 commit into open-iscsi:master from wip-batch-wake-ups on Apr 6, 2018

Conversation

dillaman
Collaborator

Only fire a single wake up concurrently. If additional AIO
completions arrive while the kernel is being woken up, the
wake up will fire again.

Signed-off-by: Jason Dillaman <dillaman@redhat.com>

@dillaman
Collaborator Author

This is a compromise between the current implementation and the solution proposed in #359. It serializes the wake ups to the kernel but attempts to batch if multiple completions arrive while waking the kernel up.

In my development environment, it improves 4K small IO from ~5K IOPS to ~8.5K IOPS.
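At a high level, the tracking works like this (a minimal sketch; the function and field names mirror the patch, but the bodies here are illustrative rather than the literal diff, and the pthread cleanup handlers are omitted):

#include <pthread.h>

/* Assumes struct tcmu_track_aio has an int pending_wakeups protected by
 * a pthread_mutex_t track_lock, as in the fragments quoted below. */

static void track_aio_request_finish(struct tcmur_device *rdev, int *wake_up)
{
	struct tcmu_track_aio *aio_track = &rdev->track_queue;

	pthread_mutex_lock(&aio_track->track_lock);
	/* Only the thread that moves the counter off zero fires a wake up;
	 * everyone else piggybacks on the in-flight one. */
	*wake_up = (aio_track->pending_wakeups++ == 0) ? 1 : 0;
	pthread_mutex_unlock(&aio_track->track_lock);
}

static void track_aio_wakeup_finish(struct tcmur_device *rdev, int *wake_up)
{
	struct tcmu_track_aio *aio_track = &rdev->track_queue;

	pthread_mutex_lock(&aio_track->track_lock);
	if (aio_track->pending_wakeups > 1) {
		/* More completions arrived while the kernel was being woken
		 * up: batch them all into one more wake up. */
		aio_track->pending_wakeups = 1;
		*wake_up = 1;
	} else {
		aio_track->pending_wakeups = 0;
		*wake_up = 0;
	}
	pthread_mutex_unlock(&aio_track->track_lock);
}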

@dillaman changed the title from "tcmur: batch kernel wake up notifications" to "tcmur: improve batch kernel wake up notifications" on Mar 30, 2018
@mikechristie
Collaborator

Adding @lxbsz who just pinged me on this. You lucked out! The ceph team did it for gluster again :)

{
struct tcmu_track_aio *aio_track = &rdev->track_queue;

pthread_cleanup_push(_cleanup_mutex_lock, (void *)&aio_track->track_lock);

I don't think the cleanup handler is necessary here. It's only needed when there is a cancellation point in the critical section. My reading of POSIX (and the Linux man page) doesn't list assert as a cancellation point, but I could be wrong... It may write output, which could block (and hence should count as a cancellation point), but if the assert fires it will exit the application anyway, so it's probably moot.
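For context, this is the pattern in question (a minimal sketch, not the patch itself; the wrapper function name and the assert body are assumptions for illustration):

#include <assert.h>
#include <pthread.h>

/* Cleanup handler: unlocks the mutex if the thread is cancelled while
 * holding it. This only matters if the critical section can actually
 * reach a cancellation point. */
static void _cleanup_mutex_lock(void *arg)
{
	pthread_mutex_unlock(arg);
}

/* Hypothetical wrapper showing the push/lock ... unlock/pop pairing. */
static void track_sanity_check(struct tcmu_track_aio *aio_track)
{
	pthread_cleanup_push(_cleanup_mutex_lock, &aio_track->track_lock);
	pthread_mutex_lock(&aio_track->track_lock);

	/* POSIX does not list assert() as a cancellation point, so nothing
	 * in this critical section should be cancellable. */
	assert(aio_track->pending_wakeups >= 0);

	pthread_mutex_unlock(&aio_track->track_lock);
	pthread_cleanup_pop(0);
}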

Collaborator


Yeah, it's probably not needed since the daemon will crash if we hit the assert.

I think we are sometimes overly careful. We can do a cleanup patch later.

@serjponomarev

In my test environment this patch increases performance by about 3x and decreases latency by about 3x.

FIO pattern: --bs=4k --iodepth=32 --numjobs=4 --direct=1 --refill_buffers --norandommap --randrepeat=0 --ioengine=libaio (--rw=randwrite, --rw=randread)
tcmu-release -> randread iops 7019, lat (msec): min=5, max=226, avg=22.64, stdev=18.78
tcmu-release -> randwrite iops 3395, lat (msec): min=11, max=244, avg=38.80, stdev= 6.09

tcmu-this-patch -> randread iops 19.1k, lat (usec): min=906, max=37268, avg=6657.18, stdev=1972.39
tcmu-this-patch -> randwrite iops 8406, lat (usec): min=2575, max=83831, avg=15722.03, stdev=6089.71

@lxbsz
Collaborator

lxbsz commented Apr 2, 2018

@mikechristie Yeah, this is what we really need. Thanks very much to @dillaman for the great work.

I will test this later.

@zhuozh

zhuozh commented Apr 3, 2018

Hi, all. I tested this patch with one 1T LUN; 8K randwrite IOPS roughly doubled (~121 IOPS to ~243 IOPS). I will create more LUNs to stress the whole cluster and report back the results.

@DongyuanPan

Hi, I tested the performance with this patch and did some comparisons.
My environment:
CentOS 7.4 (3.10.0-693.11.6.el7.x86_64)
ceph version 10.2.5
fio 2.99

RBD 4k rand_write
Tcmu-runner 1.3.0 -> 14k IOPS
Tcmu-runner with the patch -> 18.3k IOPS
Tgt -> 15k IOPS

RBD 4k rand_read
Tcmu-runner 1.3.0 -> 23k IOPS
Tcmu-runner with the patch -> 30.7k IOPS
Tgt -> 22k IOPS

RBD 64k write
Tcmu-runner 1.3.0 -> 181MiB/s
Tcmu-runner with the patch -> 327MiB/s
Tgt -> 427MiB/s

RBD 64k read
Tcmu-runner 1.3.0 -> 393MiB/s
Tcmu-runner with the patch -> 535MiB/s
Tgt -> 1442MiB/s

@lxbsz
Collaborator

lxbsz commented Apr 4, 2018

I have tested this with the gluster-block product:
Without this:
IOPS:
read : io=311108KB, bw=5181.5KB/s, iops=1294, runt= 60043msec
write: io=310624KB, bw=5173.4KB/s, iops=1292, runt= 60043msec
BW:
read : io=2237.0MB, bw=37944KB/s, iops=35, runt= 60371msec
write: io=2005.0MB, bw=33921KB/s, iops=31, runt= 60526msec

With this patch:
IOPS:
read : io=351892KB, bw=5861.6KB/s, iops=1464, runt= 60039msec
write: io=351424KB, bw=5853.3KB/s, iops=1462, runt= 60039msec
BW:
read : io=2353.0MB, bw=39934KB/s, iops=36, runt= 60337msec
write: io=2403.0MB, bw=40698KB/s, iops=37, runt= 60461msec

IOPS improved by about 13% and BW improved by about 10-20%.

@zhuozh

zhuozh commented Apr 4, 2018

@serjponomarev @Github641234230 Your IOPS/BW numbers are very high. What's the LUN size in your tests? What is your cluster HW configuration? Since RBD is thin-provisioned, did you dd the LUN before the test to make sure the space is really allocated?
By the way, with a single LUN my IOPS roughly doubles, but with vdbench writing multiple LUNs simultaneously it only improves about 10% in total.

@DongyuanPan

@zhuozh LUN size = 100G. Yes, the space is really allocated for the RBD.
Details about the test environment:
Single node test "cluster"
CentOS 7.4 (3.10.0-693.11.6.el7.x86_64)
16 OSD
800G SSD

rbd info
rbd image 'image-10':
size 102400 MB in 25600 objects
order 22 (4096 kB objects)
block_name_prefix: rbd_data.455f82ae8944a
format: 2
features: layering
flags:

fio 2.99

[global]
bs=4k
ioengine=libaio
iodepth=128
direct=1
#sync=1
runtime=300
size=60G
buffered=0
#directory=/mnt
numjobs=4
filename=/dev/sdv
group_reporting=1

[randwrite]
time_based
#write_iops_log=
#log_avg_msec=1000
rw=randwrite
#stonewall

@serjponomarev

@zhuozh in my tests:
3 x physical host with ESXi, Intel Xeon 2696, 10G Mellanox and Samsung EVO 960
Ceph - v12.2.3
RBD size - 100Gb (before tests, space was allocated by writing 4M blocks)
3 x osd host -> ESXi VM (8vCPU, 8Gb RAM, 10G public/cluster network), OS - CentOS 7.4, default kernel.
osd storage - VMDK (50Gb, eager zeroed) on Samsung EVO 960
3 x OSD 50Gb on each OSD host, total 9 OSD x 50 Gb
replica count 3
Ceph all debug disabled

tcmulib_processing_complete(dev);
track_aio_wakeup_finish(rdev, &wake_up);
}
}
Collaborator


I think we could hit a race:

  1. pending_wakeups = 0. thread1 calls track_aio_request_finish, pending_wakeups is incremented to 1, and wake_up=1 is returned. It does its call to tcmur_command_complete, then falls into the while loop and does tcmulib_processing_complete.

  2. thread2 then calls track_aio_request_finish and increments pending_wakeups=2. wake_up=0 is returned, so it is relying on thread1 to call tcmulib_processing_complete.

  3. thread1 then calls track_aio_wakeup_finish. It sees aio_track->pending_wakeups > 1, sets pending_wakeups=1, and returns wake_up=1.

  4. thread1 is super fast: it loops again, makes the tcmulib_processing_complete call and the track_aio_wakeup_finish call, which should have completed the processing of thread2's cmd. pending_wakeups=1, so it is now set to 0 and wake_up=0 is returned. thread1 is now done and returns.

  5. thread2 finally calls tcmur_command_complete (or it could have been in the middle of calling it but not yet done updating the ring buffer). wake_up was 0 for it, so it does not go into the loop above, and tcmulib_processing_complete is never called for its command.

If the race analysis is correct, I think we just need to do something like the attached, where we hold the device completion lock while updating the aio track calls as well as doing the tcmulib_command_complete. aio_command_finish is just a little funky in that it straddles both the aio code and the tcmur cmd handler code, so my change here is a little gross. (The ordering in question is sketched below the attachment.)

Jasons392-plus-my-extra-dumb-locking.txt
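For readers following the thread, the completion path being discussed looks roughly like this (a hypothetical reconstruction assembled from the fragments quoted in this review; the exact signature and the tcmu_get_daemon_dev_private helper are assumptions, not the literal diff):

void aio_command_finish(struct tcmu_device *dev, struct tcmulib_cmd *cmd,
			int rc)
{
	struct tcmur_device *rdev = tcmu_get_daemon_dev_private(dev);
	int wake_up;

	/* Racy order: the wake-up counter is bumped before the completion
	 * is published to the ring, so another thread can consume the last
	 * wake up before this command's completion is visible (step 5
	 * above). */
	track_aio_request_finish(rdev, &wake_up);
	tcmur_command_complete(dev, cmd, rc);
	while (wake_up) {
		tcmulib_processing_complete(dev);
		track_aio_wakeup_finish(rdev, &wake_up);
	}
}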

Collaborator Author


Couldn't the call to tcmur_command_complete and track_aio_request_finish be swapped?

Collaborator


> Couldn't the call to tcmur_command_complete and track_aio_request_finish be swapped?

Lol, yeah that is much better! My eyes were probably going cross-eyed :)

Collaborator Author


OK -- swapped the two function calls.
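The swap publishes the completion to the ring before the wake-up counter is bumped, so whichever thread consumes the final wake up is guaranteed to see it. In terms of the sketch above, the reordering looks like this (illustrative, not the literal commit):

-	track_aio_request_finish(rdev, &wake_up);
 	tcmur_command_complete(dev, cmd, rc);
+	track_aio_request_finish(rdev, &wake_up);
 	while (wake_up) {
 		tcmulib_processing_complete(dev);
 		track_aio_wakeup_finish(rdev, &wake_up);
 	}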

@mikechristie
Collaborator

Thanks.

@mikechristie merged commit 0058479 into open-iscsi:master on Apr 6, 2018
@dillaman deleted the wip-batch-wake-ups branch on April 6, 2018 20:15