sync writes very slow despite presence of SLOG #2373
I'm going to create a very isolated, simple test case for this and post results tonight.
@dswartz It would be useful if you could include some basic profiling data from your testing. Does 'iostat -mx' on the host show the disk to be saturated? Is the system CPU bound? etc.
I didn't get a chance to do this last night. I'm dubious of I/O
Test methodology: 3GHz Pentium-D (dual-core), 8GB RAM, 640GB SATA disk, and the Intel SSD as SLOG. Sequential read: 58MB/sec. Attached is a 10-second snapshot using 'iostat -mx'. I am going to re-run
You may be generating a sequential write workload in the VM, but by the time it gets to ZFS it seems to be 4k synchronous writes. According to iostat your SSD is sustaining roughly 4000 4k writes per second, which isn't too shabby. It will be interesting to see what the workload under Illumos looks like.
A comment on the ZoL stats: the Intel spec sheet claims up to 19K IOPS. Even if that is BS, something is obviously wrong with how we are scheduling the writes. It should not be possible for a single gigabit NFS client to saturate a high-quality SLOG device like the S3700.
@dswartz The really interesting bit here is that under OmniOS the writes to the SLOG device are far larger, roughly 64k. So it only takes 1000/s or so to saturate the Gigabit link. I suspect the performance issue you're seeing here is due primarily to a difference in the NFS implementation. It appears that the Linux server and your NFS client are negotiating a wsize of 4k. While on the other hand the OmniOS server and your NFS client are negotiating a wsize of 64k. That difference in synchronous request size would completely explain what you're seeing. If you're up for two more experiments I'd try the following independent tests.
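One way to test the wsize theory from a Linux client is to mount the same export twice with forced write sizes and compare synchronous sequential throughput with dd. This is only a sketch; the server name and export path below are placeholders.

```shell
# Hypothetical host/export names; requires root and an NFS export to test against.
mkdir -p /mnt/w4k /mnt/w64k
mount -t nfs -o sync,wsize=4096  server:/tank/test /mnt/w4k
mount -t nfs -o sync,wsize=65536 server:/tank/test /mnt/w64k

# Identical 1GB sequential writes; only the forced wsize differs.
dd if=/dev/zero of=/mnt/w4k/file  bs=512k count=2048
dd if=/dev/zero of=/mnt/w64k/file bs=512k count=2048

# Confirm the wsize actually in effect for each mount.
grep '/mnt/w' /proc/mounts
```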
Interesting. I rebooted the ZoL disk. From what I can tell, there is no (obvious) way to change the NFS client parameters for vSphere, so I would need to set the export parameters on the CentOS ZoL server. I apparently can't do that via the sharenfs attribute, so how do I proceed? Also curious as to why CentOS and OmniOS negotiated such different sizes...
Out of curiosity, how do you infer the wsize being negotiated and used?
So I tried mounting the share from a virtual Ubuntu, and did 4GB of writes:
4K wsize: 4294967296 bytes (4.3 GB) copied, 246.53 s, 17.4 MB/s
64K wsize: 4294967296 bytes (4.3 GB) copied, 73.6705 s, 58.3 MB/s
default wsize (unknown?): 4294967296 bytes (4.3 GB) copied, 119.104 s, 36.1 MB/s
Which jibes (roughly at least) with your theory. From what I can tell
@dswartz I was able to infer what was likely going on based on the average request size seen by the server. When mounting nfs synchronously each of those write requests will be written immediately to the SLOG. So if you're seeing all 4k IO (avgrq-sz = 8) then nfs must be making small 4k synchronous writes. The only reason I'm aware of that nfs would do this is if the request size was negotiated to 4k. Now why nfs would be negotiating the request size to 4k I'm not at all sure. That would be a question for the maintainers of the Linux nfs kernel server. My understanding is that the client and server should negotiate at connect time the largest request size supported by both the client and server. The man page says:
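For reference, iostat's avgrq-sz column is reported in 512-byte sectors, so the conversion behind this inference is:

```shell
# avgrq-sz is in 512-byte sectors: ~8 sectors = 4 KiB requests, ~128 = 64 KiB.
echo "8.19 128" | awk '{ printf "%.1f KiB  %.0f KiB\n", $1*512/1024, $2*512/1024 }'
```

An avgrq-sz near 8 on the SLOG device is therefore what confirms 4k synchronous writes.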
Yeah, I saw this too. Puzzled as to what is happening differently when
There's got to be something else going on. From what I can tell, vSphere is sending 512KB NFS writes, regardless of sync mode. Here's an example where it was slow (16MB/sec):
10.0.0.4.1940350015 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
(this is from tcpdump) I then set sync=disabled and the tcpdump output didn't look any different (as far as I could see, anyway...) Still digging...
Have you tried running nfsstat on the server? That might make it clearer what the NFS workload is. Although the tcpdump is pretty convincing. It does sound like more investigation is needed.
So more digging has turned up this: ESXi forces the NFS client file block
time dd if=/dev/SSDA of=/test/foo/bar ibs=64K obs=4K
SSDA is another random SSD not being used for anything else - intent was
[root@centos-vsa1 ~]# time dd if=/dev/sdd of=/test/vsphere/foo ibs=512K
Pretty much what I see via crystaldiskmark. If the methodology here
I will repeat this when I get home and can reboot the testbed on OmniOS.
@dswartz It depends on exactly what the NFS server is issuing to ZFS. If it's making 128 4K synchronous write calls there's nothing really we can do, because for each individual 4K write it's asking that it be done synchronously, so we can't return until it's done. That said, someone should really profile this on the kernel side to see exactly what the NFS kernel server is doing. Only then will we have an idea of what can be done to improve things.
Maybe I wasn't clear. I get that the whole collection needs to be
@dswartz Right, I understand. Just keep in mind there's a layering on the servers. ZFS will only do what the NFS server asks it to do. If it asks us to do 128 4k synchronous writes, that's what we have to do. If it asks us to do a single 512k write, we'll do that instead. Someone needs to determine exactly what the NFS server is requesting and we can go from there.
Okay, I did some digging into the Linux NFS server, and found some debug
root@sphinx:~# time dd if=/dev/sda of=/mnt/foo bs=512K count=1
I had enabled nfsd debugging on the CentOS ZoL server, and after the above:
nfsd_dispatch: vers 3 proc 4
e.g. precisely one WRITE request, which completed successfully. So there
Forgot to mention: the Ubuntu NFS client mount was synchronous...
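For anyone reproducing this, the nfsd debug tracing mentioned above can be toggled with rpcdebug (part of nfs-utils). A sketch, requiring root:

```shell
# Turn on nfsd procedure-dispatch logging, run the workload, then turn it off.
rpcdebug -m nfsd -s proc      # or '-s all' for everything
# ... run the dd over NFS here ...
dmesg | grep nfsd_dispatch    # one log line per RPC dispatched to nfsd
rpcdebug -m nfsd -c proc
```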
Your post interested me as I was about to try something similar on my server. I set up a filesystem on my 2x2 "RAID10" ZFS pool, to test without a SLOG. I see better performance than you obtain; I consistently get 15 to 20MB/s (synchronous) over gigabit ethernet, which while not great, is better than you are getting. There has to be something different in what we are doing.
Server: The NFS v3 server is an Ubuntu 12.04 box with ZoL 0.6.2. The filesystem is exported via /etc/exports, not by ZFS parameters.
cat /etc/exports
/zebu/tmp 192.168.24.0/24(rw,insecure,no_subtree_check)
Client: The NFS client is an Ubuntu 10.04 system. Nothing special was done to the mount command:
mount -o sync zebra:/zebu/tmp /mnt/ZZ
cat /proc/mounts | grep ZZ
note the wsize
$ dd if=/tmp/B of=/mnt/ZZ/C3 bs=512k
Basically I get twice your performance with fewer vdevs and no SLOG, so what are the differences between the two setups?
cat /proc/version
I am at 0.6.2, the current tagged version; you are at HEAD. I used /etc/exports to export the filesystem on the server; how do you export on the server? Was your pool or filesystem set up with some funky parameters? Is your ashift correct for your disks? I know a "works for me" is not helpful ;o(, but there has to be something in your setup which is causing your poor performance, and I thought I would share to see if this helps the developers to suggest something.
Lots of interesting info. Here's the thing though: it has nothing to do with my pool. I have a 3x2 SAS pool and it works fine with sync=standard using the Intel S3700 as SLOG under OmniOS. If I boot CentOS with ZoL on the exact same hardware, the write throughput to the pool over gigabit goes from 80+MB/sec to 10MB/sec or so. The test info I have been posting about is a testbed with a single SATA drive and the Intel SSD as SLOG - it sucks the same way until and unless I boot OmniOS, then it's fine again (e.g. gigabit is the limiting factor.) I haven't tried CentOS with my production pool and on-pool ZIL, so it's entirely possible I'd get about 20MB/sec like you. Notwithstanding that, it's still 1/4 or less of what OmniOS is delivering. It sure looks like somehow we are splitting up the 512KB sync write into a crapload of smaller writes to the SLOG, and it's hitting its IOPS limit. I'm not trying to be a jerk here, but this is NOT a problem only I happen to have. I will bet you anything you can reproduce it trivially, assuming you have an SSD to use for a SLOG. Create a pool on a single disk, add the SSD as SLOG, share it out via NFS (using ZoL), mount it synchronously from your Linux client and do a several-hundred-MB write using 'dd', and you will see crap write performance. Boot from OmniOS (possibly other OpenSolaris distros, I haven't tested that) and repeat the exact same test. Sustained write performance will go up by a factor of 4 or more.
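A sketch of the repro recipe described above. The device names and server name are placeholders, the pool commands destroy data on those devices, and root plus an installed ZoL are assumed.

```shell
# Server side: single-disk pool with an SSD SLOG, exported via NFS.
zpool create testpool /dev/sdb log /dev/sdc
zfs create -o sharenfs=rw testpool/sync

# Client side: synchronous mount, then a several-hundred-MB sequential write.
mount -t nfs -o sync server:/testpool/sync /mnt/t
dd if=/dev/zero of=/mnt/t/bigfile bs=512k count=1024
```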
I can fully understand; whether it is 8MB/s or 20MB/s, it's NOT 90MB/s. When I look at the iostats for my server, it appears to be writing in 128k chunks (bytes/s / IOPS), which is the record size of the filesystem. This is a typical iostat -dxm 10 output:
Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
I get an avgrq-sz of 220ish, or ~110kB. Again, this is without a SLOG. In the original jpeg you posted with the iostat output, is sdc the SLOG device? I ask because that is being written to in 4k chunks (avgrq-sz = 8.19). So without a SLOG my results don't seem to be broken up into 4k writes, but in your case with a SLOG they are. (Guessing here.) In looking at your OmniOS iostat, it seems to be writing ~128k chunks to the SLOG (78978kB/s / 676.5 w/s). So it looks like the answer is in how ZoL and OmniOS write to the SLOG, but again I'm guessing. Perhaps the developers might be able to see how to make the SLOG write in bigger chunks. So if you try your test without a SLOG, does it work more like my tests (i.e. 128k chunks)? Anyway good luck, hope you get an answer. I'll not waste more of your time with wild guesses ;o)
Well, this is really annoying. I think I may have been chasing a parked car. I did a mount from the production OmniOS server to the ZoL server and did 100MB sync writes. About 24MB/sec. I then removed the Intel SSD as SLOG, formatted it with ext4, and shared it out. Mounted that filesystem to OmniOS and repeated. About 30MB/sec! I've been groveling through the kernel NFSD code and it looks like it might be breaking up the buffer passed by nfsd to the VFS layer into smaller (page-sized?) chunks. So it looks like it will loop, writing 4096-byte blocks to the SLOG. What I don't understand is why they are not being coalesced into bigger blocks?
So, this is a bummer. I was looking at google results for do_loop_readv_writev to see if I could prove or disprove that nfsd is breaking up a 512KB write into 4KB chunks, which is what is killing SLOG performance. Here is an excerpt from issue #1790:
Oct 15 23:54:21 fs2 kernel: [] zpl_write_common+0x52/0x70 [zfs]
so it sure looks like the answer is yes. I understand this is not zfs' fault, but this is a major performance hit compared to a competing platform (opensolaris flavor). Where do we go from here?
@dswartz Nice find! That neatly explains what's going on here and what we need to do to fix it. Let me explain. The readv() and writev() system calls are implemented in one of two ways on Linux. If the underlying filesystem provides the aio_read and aio_write callbacks then the async IO interfaces will be used. The entire large IO will be passed to the filesystem as a vector and the caller can block until it's complete. This would allow us to do the optimal thing and issue larger IOs to the disk. Unfortunately, the async IO callbacks haven't been implemented yet for ZoL. In this case the Linux kernel falls back to compatibility code: it calls the do_loop_readv_writev function, which in turn calls the normal read/write callbacks for each chunk of the vector IO. In this case those chunks appear to be 4k because of the page size. The fix is for us to spend the time and get the asynchronous IO interfaces implemented, see #223. This gives us one more reason to prioritize getting that done. In the short term I don't think there's a quick fix. You may want to run OmniOS if all your IO is going to be synchronous. At least until we can resolve this properly.
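The per-page chunking can be mimicked with dd's re-blocking: one large input block rewritten as 4 KiB output records is analogous to what the do_loop_readv_writev fallback does to a single large vectored write.

```shell
# One 512 KiB input block written as 4 KiB output blocks: dd reports 128
# output records, i.e. 128 separate write() calls for one large buffer.
dd if=/dev/zero of=/tmp/chunk_demo ibs=512k obs=4k count=1 2>&1 | grep 'records out'
rm -f /tmp/chunk_demo
# -> 128+0 records out
```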
Okay, back to this issue now, since Ryao's AIO patch seems to have helped NFS sync writes a lot. Still not nearly as good as OmniOS. As promised, I'm moving my updates from that pull request. As you may recall, when reading from SSD #1 and writing to a file on a sync=always dataset on another SSD-backed pool (with Intel S3700 as SLOG), I was seeing it unable to exceed 1K IOPS. I just repeated the same test with a fresh install of OmniOS and see this:
root@omnios2:~# time dd if=/test/sync/foo of=/test/sync/foo2 bs=1M count=8K
pool   alloc  free   read  write  read  write
test   321K   119G   0     2.80K  0     183M
Note it got to almost 3K IOPS and the aggregate write rate to the data pool was almost 200MB/sec.
@dswartz With the AIO patch applied it would be useful to gather data from
will do...
Since my 'production' pool lives off an LSI 6Gb HBA I decided to re-run. Create a compression=lz4 dataset on the pool with sync=standard. The numbers were actually pretty good now - sustained sequential write of
I actually have the crystaldiskmark screenshot as well as the iostat -mx output for the ZoL/AIO run. Since I can upload those now, I am doing so...
zfs with aio
zfs without aio
Sorry that got messed up. The screenshots? The first is with AIO, the second is stock ZoL. The text captures? The first is 'zpool iostat -v' and the second 'iostat -mx' as requested.
The good news is the
Any thoughts as to what I can try tweaking next?
@dswartz Well, the read activity during the write isn't good. Do you recall seeing the same amount of read activity during the write test under OmniOS?
I don't believe so, no. I think I saw this earlier with ZoL, but skipped over it.
@dswartz The read-modify-write behavior will introduce some latency since the writes must block on the reads. That probably has a significant impact and might explain why the disk isn't saturated. Although I'd expect the same issue on OmniOS.
Hmmm, with 8K records the write perf went back down to about 50MB/sec. I'm not sure I understand why the RMW would hurt here. For a spinner, sure, but this is a high-performance SSD, so latency should be pretty close to zero, no? As in, IOPS should be the limiting factor? Or something like that?
nfsd uses do_readv_writev() to implement fops->read and fops->write. do_readv_writev() will attempt to read/write using fops->aio_read and fops->aio_write, but it will fall back to fops->read and fops->write when AIO is not available. However, the fallback will perform a call for each individual data page. Since our default recordsize is 128KB, sequential operations on NFS will generate 32 DMU transactions where only 1 transaction was needed. That was unnecessary overhead, so we implement fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented in userland's libc by intelligently mapping it to VOP_WRITE, VOP_READ and VOP_FSYNC. Linux implements AIO inside the kernel itself. Linux filesystems therefore must implement their own AIO logic, and nearly all of them implement fops->aio_write synchronously. Consequently, they do not implement aio_fsync(). However, since the ZPL works by mapping Linux's VFS calls to the functions implementing Illumos' VFS operations, we instead implement AIO in the kernel by mapping the operations to the VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation synchronous to make software that expects this behavior safe. However, there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously, and since the majority of userland software using AIO should be cross platform, expectations of synchronous behavior should not be a problem.
2. We would hurt the performance of programs that use POSIX interfaces properly while simultaneously encouraging the creation of more non-compliant software.
3. The broader community concluded that userland software should be patched to properly use POSIX interfaces instead of implementing hacks in filesystems to cater to broken software. This concept is best described as the O_PONIES debate.
4. Making an asynchronous write synchronous is a non sequitur.

Any software dependent on synchronous aio_write behavior will suffer data loss on ZFSOnLinux in a kernel panic / system failure of at most zfs_txg_timeout seconds, which by default is 5 seconds. This seems like a reasonable consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself, where nfsd does not pass O_SYNC on files opened with it and instead relies on open()/write()/close() to enforce synchronous behavior when the flush is only guaranteed on last close. Exporting any filesystem that does not implement AIO via NFS risks data loss in the event of a kernel panic / system failure when something else is also accessing the file. Exporting any filesystem that implements AIO the way this patch does bears similar risk. However, it seems reasonable to forgo crippling our AIO implementation in favor of developing patches to fix this problem in Linux's nfsd, for the reasons stated earlier. In the interim, the risk will remain. Failing to implement AIO will not change the problem that nfsd created, so there is no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return `AIO_NOTCANCELED` under this implementation. It is possible to implement aio_cancel by deferring work to taskqs and using `kiocb_set_cancel_fn()` to set a callback function for cancelling work sent to taskqs, but the simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

The only programs on my system that are capable of using `aio_cancel()` are QEMU, beecrypt and fio, according to a recursive grep of my system's `/usr/src/debug`. That suggests that `aio_cancel()` users are rare. Implementing aio_cancel() is left to a future date, when it is clear that there are consumers that benefit from its implementation to justify the work.

Closes: openzfs#223 openzfs#2373
Signed-off-by: Richard Yao <ryao@gentoo.org>
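The 32-transaction figure in the commit message is simply the default recordsize divided by the page size:

```shell
# 128 KiB recordsize split into 4 KiB (page-sized) calls = 32 DMU transactions.
echo $(( (128 * 1024) / 4096 ))
```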
@dswartz Is this a NUMA system? Do any of the following block device tuning knobs help?
https://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf
No, it was a cheapo Pentium-D CPU on an Intel motherboard...
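A sketch of the kind of block-layer knobs being referred to. The device name is a placeholder for the SLOG device, the values are examples only, and root is required; see the linked slides for the rationale behind each.

```shell
# Tune sysfs queue parameters for the SLOG device (here assumed to be sdc).
echo noop > /sys/block/sdc/queue/scheduler    # skip the elevator on SSDs
echo 0    > /sys/block/sdc/queue/add_random   # don't feed the entropy pool
echo 2    > /sys/block/sdc/queue/rq_affinity  # complete IO on the issuing CPU
echo 1024 > /sys/block/sdc/queue/nr_requests  # allow a deeper request queue
```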
Lastly, it is important to know that handling of the iovec updates differs between Illumos and Linux in the implementation of read/write. On Linux, it is the VFS' responsibility, while on Illumos it is the filesystem's responsibility. We take the intermediate solution of copying the iovec so that the ZFS code can update it like on Solaris while leaving the originals alone. This imposes some overhead. We could always revisit this should profiling show that the allocations are a problem.

Closes: openzfs#223 openzfs#2373
Signed-off-by: Richard Yao <ryao@gentoo.org>
I am likewise seeing poor NFS performance with an SSD SLOG and sync writes. Good to find some answers at least. |
@dswartz Support for AIO has now been merged in to master which should help your performance. If we need to make additional performance improvements let's open a new issue to track them. |
Cool! |
Trimmed down from my post to the zfs-discuss mailing list. RAID10 array on a JBOD chassis. Dataset shared to vSphere using NFS (and therefore forced sync mode). Got a good SLOG SSD (Intel s3700). With this as a log device, over gigabit, I get 100MB/sec read and only 13MB/sec write using CrystalDiskMark from a win7 virtual client. If I boot a latest-and-greatest OmniOS instead, on the same exact HW (literally using the same pool, dataset, etc), I get 90MB/sec. 'zpool iostat -v' does indicate writes to the SLOG, so I am at a loss as to what is wrong, but this makes ZoL unusable for this use case for me. I found issue #1012, but it isn't clear (to me at least) if this is the same thing.
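For anyone wanting to reproduce this without a Windows VM and CrystalDiskMark, a quick sketch with dd follows. The paths are examples only (written to /tmp here so it runs anywhere); point them at a file on the NFS-backed dataset under test. `oflag=dsync` forces a synchronous commit per write, so the 4K run approximates the small-wsize workload the hypervisor generates:

```shell
# Large synchronous blocks: should approach wire/SLOG bandwidth.
dd if=/dev/zero of=/tmp/sync_big.bin bs=1M count=16 oflag=dsync

# 4K synchronous blocks: throughput collapses to roughly
# (SLOG write IOPS x 4K) if per-request size is the bottleneck.
dd if=/dev/zero of=/tmp/sync_small.bin bs=4k count=1024 oflag=dsync

# On a real NFS client, also confirm the negotiated wsize:
#   grep ' nfs ' /proc/mounts
```

Comparing the two reported rates should show whether the slowdown tracks request size rather than the SLOG device itself.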