
sync writes very slow despite presence of SLOG #2373

Closed
dswartz opened this issue Jun 8, 2014 · 53 comments
Labels
Type: Performance (performance improvement or performance problem)
Comments

@dswartz
Contributor

dswartz commented Jun 8, 2014

Trimmed down from my post to the zfs-discuss mailing list. RAID10 array on a JBOD chassis. Dataset shared to vsphere using NFS (and therefore forced sync mode). Got a good SLOG SSD (intel s3700). With this as a log device, over gigabit, I get 100 MB/sec reads and only 13 MB/sec writes using crystaldiskmark from a win7 virtual client. If I boot a latest-and-greatest omnios instead, on the exact same HW (literally using the same pool, dataset, etc), I get 90 MB/sec. 'zpool iostat -v' does indicate writes to the SLOG, so I am at a loss as to what is wrong, but this makes ZoL unusable for this use case for me. I found issue #1012, but it isn't clear (to me at least) if this is the same thing.

@dswartz
Contributor Author

dswartz commented Jun 9, 2014

I'm going to create a very isolated, simple test case for this and post results tonight.

@behlendorf behlendorf added this to the 0.7.0 milestone Jun 10, 2014
@behlendorf
Contributor

@dswartz It would be useful if you could include some basic profiling data from your testing. Does 'iostat -mx' of the host show the disk to be saturated? Is the system CPU bound? etc.
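For reference, that kind of capture might look like the following (interval and sample counts are arbitrary; assumes the sysstat package provides iostat and mpstat):

```
# run on the ZoL server while the client benchmark is going
iostat -mx 10 30 > /tmp/iostat-mx.log &     # per-disk MB/s, IO/s, avgrq-sz, %util
mpstat -P ALL 10 30 > /tmp/mpstat.log &     # per-CPU utilization, to spot a CPU bottleneck
wait
```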

@dswartz
Contributor Author

dswartz commented Jun 10, 2014

> @dswartz It would be useful if you could include some basic profiling data from your testing. Does 'iostat -mx' of the host show the disk to be saturated? Is the system CPU bound? etc.

I didn't get a chance to do this last night. I'm dubious of I/O
throttling, since it is a single gigabit-limited client talking to a 3x2
raid10 on a quad-core, HT-enabled xeon with 16GB ram. Also, if I set
sync=disabled, everything is fine. I will get the above for you tonight,
as well as a comparison against omnios. Note the testbed will be a single
7200RPM sata disk with the intel SLOG, since my JBOD is back in
production.

@dswartz
Contributor Author

dswartz commented Jun 11, 2014

Test methodology:

3 GHz Pentium-D (dual-core). 8GB RAM. 640GB SATA disk, and the intel SLOG
device, connected to an M1015 HBA. Client is win7 running on vsphere 5.5
with the test disk mounted from ZoL using sync-mode NFS. Test results with
CentOS 6.5:

Sequential read: 58MB/sec
Sequential write: 9MB/sec!

Attached is 10-second snapshot using 'iostat -mx'. I am going to re-run
using omnios now...

@dswartz
Contributor Author

dswartz commented Jun 11, 2014

Weird, jpeg screenshot disappeared? Attaching here...
[iostat screenshot]

@behlendorf
Contributor

You may be generating a sequential write workload in the VM, but by the time it gets to zfs it seems to be 4k synchronous writes. According to iostat your ssd is sustaining roughly 4000 4k writes per second (about 16 MB/sec, which lines up with the throughput you're seeing), which isn't too shabby. It will be interesting to see what the workload under Illumos looks like.

@dswartz
Contributor Author

dswartz commented Jun 11, 2014

Well, interesting. Here is the omnios info. Sequential read: 103MB/sec, sequential write: 81MB/sec. Attached is the iostat output (the command line args are somewhat different from Linux's, but I think it has what you want?) Note that the SLOG is the c8t55 WWN...
[omnios iostat screenshot]

@dswartz
Contributor Author

dswartz commented Jun 11, 2014

A comment on the ZoL stats. The intel spec sheet claims up to 19K IOPS. Even if that is BS, something is obviously wrong with how we are scheduling the writes. It should not be possible for a single gigabit NFS client to saturate a high quality SLOG device like the s3700.

@behlendorf
Contributor

@dswartz The really interesting bit here is that under OmniOS the writes to the SLOG device are far larger, roughly 64k. So it only takes 1000/s or so to saturate the Gigabit link.

I suspect the performance issue you're seeing here is due primarily to a difference in the NFS implementation. It appears that the Linux server and your NFS client are negotiating a wsize of 4k. While on the other hand the OmniOS server and your NFS client are negotiating a wsize of 64k. That difference in synchronous request size would completely explain what you're seeing.

If you're up for two more experiments I'd try the following independent tests.

  1. Force an rsize and wsize of 64k on the client. Add rsize=65536,wsize=65536 to your client's NFS mount options. What you should see on the Linux server is an avgrq-sz of close to 128 sectors (512 bytes each). You should also see much better write performance.

  2. Instead of running this test with zfs, use ext4 and create an external journal device on the ssd. Make sure the file system is configured to use the data=journal option so the writes go to the journal first. Also make sure you use the same client mount options as in your original tests. I'm interested to see if the NFS server/client negotiate a different request size. (Both experiments are sketched below.)
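A rough sketch of both experiments, with placeholder device names and export paths (not the exact commands used in this thread):

```
# 1) force a 64k request size from a Linux client
mount -t nfs -o sync,rsize=65536,wsize=65536 server:/test/vsphere /mnt/nfstest

# 2) ext4 with an external journal on the ssd, exported instead of zfs
mke2fs -O journal_dev /dev/sdX           # turn the ssd (or a partition of it) into a journal device
mkfs.ext4 -J device=/dev/sdX /dev/sdY    # data disk uses the external journal
mount -o data=journal /dev/sdY /export/ext4test
```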

@dswartz
Contributor Author

dswartz commented Jun 11, 2014

Interesting. I rebooted back onto the ZoL disk. From what I can tell, there is no (obvious) way to change the nfs client parameters for vsphere, so I would need to set the export parameters on the CentOS ZoL server. I apparently can't do that via the sharenfs attribute, so how do I proceed? Also curious as to why CentOS and OmniOS negotiated such different sizes...

@dswartz
Contributor Author

dswartz commented Jun 11, 2014

Out of curiosity, how do you infer the wsize being negotiated and used
from the respective iostat outputs? A bit annoyed that the nfs server
doesn't appear to have a way to force the wsize. Even more annoyed that
vsphere doesn't seem to have a way to override this. I'm puzzled as to
why the same client is negotiating such disparate sizes with the two
server OS's...

@dswartz
Contributor Author

dswartz commented Jun 11, 2014

So I tried mounting the share from a virtual ubuntu, and did 4GB of write
to it with different parameters:

4K wsize

4294967296 bytes (4.3 GB) copied, 246.53 s, 17.4 MB/s

64K wsize

4294967296 bytes (4.3 GB) copied, 73.6705 s, 58.3 MB/s

default wsize (unknown?)

4294967296 bytes (4.3 GB) copied, 119.104 s, 36.1 MB/s

Which jibes (roughly at least) with your theory. From what I can tell
(please correct me if I am wrong), if the client specifies nothing, the
server's default rules. It sounds like Linux (at least CentOS) defaults
to a much smaller value than OpenSolaris derivatives. As I said earlier,
vsphere apparently provides no way to specify nfs client tweaks like this.
Is there a way I can change the default settings on the CentOS end?
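For reference, a comparison like the one above can be scripted along these lines (paths, the server name, and the use of /dev/zero are assumptions, not the exact test run here):

```
for w in 4096 65536; do
    mount -t nfs -o sync,wsize=$w centos-vsa1:/test/vsphere /mnt/nfstest
    dd if=/dev/zero of=/mnt/nfstest/testfile bs=1M count=4096   # ~4.3 GB
    umount /mnt/nfstest
done
```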

@behlendorf
Contributor

@dswartz I was able to infer what was likely going on based on the average request size seen by the server. When mounting nfs synchronously each of those write requests will be written immediately to the SLOG. So if you're seeing all 4k IO (avgrq-sz = 8) then nfs must be making small 4k synchronous writes. The only reason I'm aware of that nfs would do this is if the request size was negotiated to 4k.

Now why nfs would be negotiating the request size to 4k I'm not at all sure. That would be a question for the maintainers of the Linux nfs kernel server. My understanding is that the client and server should negotiate at connect time the largest request size supported by both the client and server. The man page says:

       wsize=n        The  maximum  number  of bytes per network WRITE request
                      that the NFS client can send when writing data to a file
                      on  an  NFS server. The actual data payload size of each
                      NFS WRITE request is equal to or smaller than the  wsize
                      setting.  The  largest  write  payload  supported by the
                      Linux NFS client is 1,048,576 bytes (one megabyte).

                      Similar to rsize , the wsize value is a  positive  inte-
                      gral  multiple  of  1024.   Specified wsize values lower
                      than 1024 are replaced with  4096;  values  larger  than
                      1048576  are replaced with 1048576. If a specified value
                      is within the supported range  but  not  a  multiple  of
                      1024,  it  is  rounded  down  to the nearest multiple of
                      1024.

                      If a wsize value is not specified, or if  the  specified
                      wsize  value  is  larger  than  the  maximum that either
                      client or server can  support,  the  client  and  server
                      negotiate  the  largest  wsize  value that they can both
                      support.

                      The wsize mount option as specified on the mount(8) com-
                      mand  line  appears  in the /etc/mtab file. However, the
                      effective wsize  value  negotiated  by  the  client  and
                      server is reported in the /proc/mounts file.
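In other words, the negotiated values can be checked on the client after mounting, e.g.:

```
# /etc/mtab shows what was requested; /proc/mounts shows what was actually negotiated
grep nfs /proc/mounts
# ... rsize=65536,wsize=65536 ...   (values shown here are only illustrative)
```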

@dswartz
Contributor Author

dswartz commented Jun 11, 2014

> @dswartz I was able to infer what was likely going on based on the average request size seen by the server. When mounting nfs synchronously each of those write requests will be written immediately to the SLOG. So if you're seeing all 4k IO (avgrq-sz = 8) then nfs must be making small 4k synchronous writes. The only reason I'm aware of that nfs would do this is if the request size was negotiated to 4k.
>
> Now why nfs would be negotiating the request size to 4k I'm not at all sure. That would be a question for the maintainers of the Linux nfs kernel server. My understanding is that the client and server should negotiate at connect time the largest request size supported by both the client and server.

Yeah, I saw this too. Puzzled as to what is happening differently when
vsphere is talking to omnios vs linux. Need to dig some more...

@dswartz
Contributor Author

dswartz commented Jun 11, 2014

There's got to be something else going on. From what I can tell, vsphere is sending 512KB nfs writes, regardless of sync mode. Here's an example where it was slow (16MB/sec).

10.0.0.4.1940350015 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
10.0.0.4.1940350016 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
10.0.0.4.1940350017 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
10.0.0.4.1940350018 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
10.0.0.4.1940350019 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
10.0.0.4.1940350020 > 10.0.0.31.2049: 1444 write fh 524288 (524288)
10.0.0.4.1940350022 > 10.0.0.31.2049: 1444 write fh 524288 (524288)

(this is from tcpdump) I then set sync=disabled and the tcpdump output didn't look any different (as far as I could see, anyway...) Still digging...
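For reference, a trace like the one above can be captured with something along these lines (the interface name is a placeholder):

```
# full-size capture of NFS traffic so tcpdump can decode the WRITE calls
tcpdump -nn -s 0 -i eth0 port 2049 | grep -i ' write '
```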

@behlendorf
Contributor

Have you tried running nfsstat on the server? That might make it clearer what the nfs workload is. Although the tcpdump is pretty convincing.

It does sound like more investigation is needed.
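A simple way to do that is to snapshot the server-side counters before and after a test run and diff the WRITE/COMMIT counts (assumes the nfsstat tool from nfs-utils):

```
nfsstat -s > /tmp/nfsstat.before
# ... run the client benchmark ...
nfsstat -s > /tmp/nfsstat.after
diff /tmp/nfsstat.before /tmp/nfsstat.after
```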

@dswartz
Contributor Author

dswartz commented Jun 12, 2014

So more digging has turned up this: ESXi forces the NFS client file block
size to be reported as 4KB. So in the tcpdump trace I posted, it sends
512KB of total data, in the form of (presumably) 128 4KB blocks. When
they get to the NFS server, it then has to do 128 synchronous writes to
the SLOG. I have consistently seen the SLOG pegging at around 5K IOPS. I
am thinking this is what is killing the write performance from ESXi. Here
is where I am confused. ESXi is doing synchronous NFS writes, sending
over a big batch (512KB) of data, in the form of 4KB "blocks". My
understanding is that the NFS server ACKs the client's write when all the
data is safely on stable storage, correct? This means all 128 4KB writes.
Is there a reason the SLOG writes are not being coalesced? I infer they
are not, because this is all being written sequentially (unless we are not
allocating from the ZIL sequentially?) The NOOP scheduler supposedly does
write coalescing, so I don't know why that wouldn't help. I am not home
right now (where the testbed is), so my next step will have to wait until
this evening. Namely, try the same test on the OmniOS system. I verified
my hypothesis about 4K vs 512K by doing this:

time dd if=/dev/SSDA of=/test/foo/bar ibs=512K obs=4K

SSDA is another random SSD not being used for anything else - intent was
to have a very high-speed source of data. foo is a dataset on the sata
1-drive pool, with the intel s3700 as SLOG. When I run this, I get the
following:

[root@centos-vsa1 ~]# time dd if=/dev/sdd of=/test/vsphere/foo ibs=512K
obs=4K
^C2394+0 records in
306366+0 records out
1254875136 bytes (1.3 GB) copied, 125.488 s, 10.0 MB/s

Pretty much what I see via crystaldiskmark. If the methodology here
wasn't clear, the input blocks are 512KB, to match what the NFS client is
sending over, and the output blocks are 4KB, to match what the 'nfs block
size' is.

I will repeat this when I get home and can reboot the testbed on OmniOS.

@behlendorf
Contributor

@dswartz It depends on exactly what the NFS server is issuing to ZFS. If it's making 128 4K synchronous write calls there's nothing really we can do, because for each individual 4K write it's asking that it be done synchronously, so we can't return until it's done. That said, someone should really profile this on the kernel side to see exactly what the NFS kernel server is doing. Only then will we have an idea of what can be done to improve things.

@dswartz
Contributor Author

dswartz commented Jun 12, 2014

> @dswartz It depends on exactly what the NFS server is issuing to ZFS. If it's making 128 4K synchronous write calls there's nothing really we can do, because for each individual 4K write it's asking that it be done synchronously, so we can't return until it's done. That said, someone should really profile this on the kernel side to see exactly what the NFS kernel server is doing. Only then will we have an idea of what can be done to improve things.

Maybe I wasn't clear. I get that the whole collection needs to be
complete before we ACK to the NFS client. What I am not sure of is
whether we really need to be doing 128 discrete writes? I need to find
out if OmniOS is suffering the same slow speed when I do the non-NFS 'dd'
test. Stay tuned...

@behlendorf
Contributor

@dswartz Right, I understand. Just keep in mind there's layering on the server. ZFS will only do what the NFS server asks it to do. If it asks us to do 128 4k synchronous writes, that's what we have to do. If it asks us to do a single 512k write, we'll do that instead. Someone needs to determine exactly what the NFS server is requesting and we can go from there.

@dswartz
Contributor Author

dswartz commented Jun 12, 2014

> @dswartz Right, I understand. Just keep in mind there's layering on the server. ZFS will only do what the NFS server asks it to do. If it asks us to do 128 4k synchronous writes, that's what we have to do. If it asks us to do a single 512k write, we'll do that instead. Someone needs to determine exactly what the NFS server is requesting and we can go from there.

Okay, I did some digging into the Linux NFS server, and found some debug
flags I could turn on. I then did a V3 mount from my ubuntu VM to the ZoL
server and did:

root@sphinx:~# time dd if=/dev/sda of=/mnt/foo bs=512K count=1
1+0 records in
1+0 records out
524288 bytes (524 kB) copied, 0.0808596 s, 6.5 MB/s

I had enabled nfsd debugging on the CentOS ZoL server, and after the above
completed, did 'dmesg'. Here is the interesting part:

nfsd_dispatch: vers 3 proc 4
nfsd: ACCESS(3) 20: 00060001 41573fb7 a98bec00 00000000 00000000
00000000 0x1f
nfsd: fh_verify(20: 00060001 41573fb7 a98bec00 00000000 00000000 00000000)
nfsd_dispatch: vers 3 proc 1
nfsd: GETATTR(3) 32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a
nfsd: fh_verify(32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a)
nfsd_dispatch: vers 3 proc 4
nfsd: ACCESS(3) 32: 01060001 41573fb7 a98bec00 00000000 00000000
00b8000a 0x2d
nfsd: fh_verify(32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a)
nfsd_dispatch: vers 3 proc 2
nfsd: SETATTR(3) 32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a
nfsd: fh_verify(32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a)
nfsd_dispatch: vers 3 proc 7
nfsd: WRITE(3) 32: 01060001 41573fb7 a98bec00 00000000 00000000
00b8000a 524288 bytes at 0 stable
nfsd: fh_verify(32: 01060001 41573fb7 a98bec00 00000000 00000000 00b8000a)
nfsd: write complete host_err=524288

i.e. precisely one WRITE request, which completed successfully. So there
were not in fact 128 4K writes (the 4K blocksize thing seems to have been
a red herring.) nfsd is getting a single 512K block of data and writing it
to the file on the dataset in question, but something below it is doing
the writes suboptimally. And because it is synchronous, it goes through
the SLOG, and seems to be thrashing it good and hard. I will be happy to
dig some more, if you can point me in the right direction, but at this
point, nfsd looks to be exonerated, no?
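For reference, the nfsd debug output above can be produced with something like the following (this is an assumption about which mechanism was used; rpcdebug ships with nfs-utils):

```
rpcdebug -m nfsd -s all    # enable nfsd debug messages
# ... run the dd over the sync NFS mount ...
dmesg | tail -n 60         # the WRITE(3)/fh_verify trace shows up in the kernel log
rpcdebug -m nfsd -c all    # turn the debugging back off
```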

@dswartz
Contributor Author

dswartz commented Jun 12, 2014

Forgot to mention: the ubuntu nfs client mount was synchronous...

@ColdCanuck
Contributor

Your post interested me as I was about to try something similar on my server. I set up a filesystem on my 2x2 "RAID10" ZFS pool, to test without a SLOG. I see better performance than you obtain: I consistently get 15 to 20 MB/s (synchronous) over gigabit ethernet, which, while not great, is better than you are getting. There has to be something different in what we are doing.

Server

The NFS V3 server is an Ubuntu 12.04 box with ZoL 0.6.2. The filesystem is exported via /etc/exports, not by ZFS parameters.

cat /etc/exports

/zebu/tmp 192.168.24.0/24(rw,insecure,no_subtree_check)

Client

The NFS client is an Ubuntu 10.04 system. Nothing special was done to the mount command:

mount -o sync zebra:/zebu/tmp /mnt/ZZ

cat /proc/mounts | grep ZZ
zebra:/zebu/tmp /mnt/ZZ nfs rw,sync,relatime,vers=3,rsize=1048576,wsize=1048576,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,mountaddr=192.168.24.7,mountvers=3,mountproto=tcp,addr=192.168.24.7 0 0

note the wsize

$dd if=/tmp/B of=/mnt/ZZ/C3 bs=512k
2048+0 records in
2048+0 records out
1073741824 bytes (1.1 GB) copied, 63.1314 s, 17.0 MB/s

Basically I get twice your performance with fewer vdevs and no SLOG, so what are the differences between the two setups?

cat /proc/version
Linux version 3.8.0-38-generic (buildd@lamiak) (gcc version 4.6.3 (Ubuntu/Linaro 4.6.3-1ubuntu5) ) #56~precise1-Ubuntu SMP Thu Mar 13 16:22:48 UTC 2014

I am at 0.6.2, the current tagged version; you are at HEAD.

I used /etc/exports to export the filesystem on the server; how do you export on the server?

Was your pool or filesystem set up with some funky parameters? Is your ashift correct for your disks?

I know a "works for me" is not helpful ;o(, but there has to be something in your setup which is causing your poor performance, and I thought I would share to see if this helps the developers to suggest something.

@dswartz
Contributor Author

dswartz commented Jun 12, 2014

Lots of interesting info. Here's the thing, though: it has nothing to do with my pool. I have a 3x2 SAS pool and it works fine with sync=standard using the intel s3700 as SLOG under OmniOS. If I boot CentOS with ZoL on the exact same hardware, the write throughput to the pool over gigabit goes from 80+ MB/sec to 10 MB/sec or so. The test info I have been posting about is a testbed with a single SATA drive and the intel SSD as SLOG - it sucks the same way until and unless I boot OmniOS, then it's fine again (i.e. gigabit is the limiting factor.) I haven't tried CentOS with my production pool and an on-pool ZIL, so it's entirely possible I'd get about 20 MB/sec like you. Notwithstanding that, it's still 1/4 or less of what OmniOS is delivering.

It sure looks like somehow we are splitting up the 512KB sync write into a crapload of smaller writes to the SLOG, and it's hitting its IOPS limit. I'm not trying to be a jerk here, but this is NOT a problem only I happen to have. I will bet you anything you can reproduce it trivially, assuming you have an SSD to use for an SLOG: create a pool on a single disk, add the SSD as SLOG, share it out via NFS (using ZoL), mount it synchronously from your linux client and do a several-hundred-MB write using 'dd', and you will see crap write performance. Boot from OmniOS (possibly other OpenSolaris distros, haven't tested that) and repeat the exact same test. Sustained write performance will go up by a factor of 4 or more.
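For anyone wanting to try it, the repro described above amounts to roughly this (pool name, device names, and sizes are placeholders):

```
# on the server
zpool create test /dev/sdY            # single-disk pool
zpool add test log /dev/sdX           # ssd as SLOG
zfs create test/vsphere
zfs set sharenfs=on test/vsphere

# on a Linux client
mount -t nfs -o sync server:/test/vsphere /mnt/t
dd if=/dev/zero of=/mnt/t/bigfile bs=1M count=512   # several hundred MB of sync writes
```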

@ColdCanuck
Contributor

I can fully understand; whether it is 8MB/s or 20MB/s, it's NOT 90MB/s.

When I look at the iostats for my server, it appears to be writing in 128k chunks (bytes/s divided by IOPS), which is the recordsize of the filesystem.

This is a typical iostat -dxm 10 output:

Device: rrqm/s wrqm/s r/s w/s rMB/s wMB/s avgrq-sz avgqu-sz await r_await w_await svctm %util
sda 0.00 9.60 0.00 185.70 0.00 20.10 221.72 1.42 7.64 0.00 7.64 2.31 42.96
sdb 0.00 10.10 0.00 185.30 0.00 20.10 222.20 1.42 7.64 0.00 7.64 2.37 43.96
sdc 0.00 10.40 0.00 193.00 0.00 21.02 223.03 1.54 8.00 0.00 8.00 2.27 43.84
sdd 0.00 10.60 0.00 191.40 0.00 21.02 224.89 1.54 8.04 0.00 8.04 2.39 45.76

I get an avgrq-sz of 220ish or 110kB. Again this is without a SLOG.

In the original jpeg you posted with the iostat output, is sdc the SLOG device? I ask because that is being written to in 4k chunks (avgrq_sz = 8.19). So without a SLOG my results don't seem to be broken up into 4k writes, but in your case with an SLOG they are. (guessing here)

In looking at your OmniOS iostat, it seems to be writing ~128k chunks to the SLOG (78978kB/s / 676.5 w/s)

So it looks like the answer is in how ZOL and OmniOS write to the SLOG but again I’m guessing. Perhaps the developers might be able to see how to make the SLOG write in bigger chunks.

So if you try your test without an SLOG, does it work more like my tests (i.e. 128k chunks)?

Anyway good luck, hope you get an answer. I’ll not waste more of your time with wild guesses ;o)

@dswartz
Contributor Author

dswartz commented Jun 13, 2014

Well, this is really annoying. I think I may have been chasing a parked car. I did a mount from the production OmniOS server to the ZoL server and did 100MB sync writes. About 24MB/sec. I then removed the intel ssd as SLOG, formatted it with ext4, and shared it out. Mounted that filesystem on OmniOS and repeated. About 30MB/sec! I've been groveling through the kernel NFSD code and it looks like it might be breaking up the buffer passed by nfsd to the vfs layer into smaller (page-sized?) chunks? So it looks like it will loop, writing 4096-byte blocks to the SLOG. What I don't understand is why they are not being coalesced into bigger blocks?

@dswartz
Contributor Author

dswartz commented Jun 13, 2014

So, this is a bummer. I was looking at google results for do_loop_readv_writev to see if I could prove or disprove that nfsd is breaking up the 512KB write into 4KB chunks, which is what is killing SLOG performance. Here is an excerpt from issue #1790:

Oct 15 23:54:21 fs2 kernel: [] zpl_write_common+0x52/0x70 [zfs]
Oct 15 23:54:21 fs2 kernel: [] zpl_write+0x68/0xa0 [zfs]
Oct 15 23:54:21 fs2 kernel: [] ? zpl_write+0x0/0xa0 [zfs]
Oct 15 23:54:21 fs2 kernel: [] do_loop_readv_writev+0x59/0x90
Oct 15 23:54:21 fs2 kernel: [] do_readv_writev+0x1e6/0x1f0

so it sure looks like the answer is yes. I understand this is not zfs' fault, but this is a major performance hit compared to a competing platform (opensolaris flavor). Where do we go from here?

@behlendorf
Contributor

@dswartz Nice find! That neatly explains what's going on here and what we need to do to fix it. Let me explain.

The readv() and writev() system calls are implemented in one of two ways on Linux. If the underlying filesystem provides the aio_read and aio_write callbacks then the async IO interfaces will be used. The entire large IO will be passed to the filesystem as a vector and the caller can block until it's complete. This would allow us to do the optimal thing and issue larger IOs to the disk.

Unfortunately, the async IO callbacks haven't been implemented yet for ZoL. In this case the Linux kernel falls back to compatibility code: it calls the do_loop_readv_writev function, which in turn calls the normal read/write callbacks for each chunk of the vectored IO. In this case those chunks appear to be 4k because of the page size.

The fix is for us to spend the time and get the asynchronous IO interfaces implemented, see #223. This gives us one more reason to prioritize getting that done. In the short term I don't think there's a quick fix. You may want to run OmniOS if all your IO is going to be synchronous. At least until we can resolve this properly.

@dswartz
Contributor Author

dswartz commented Aug 14, 2014

Okay, back to this issue now, since Ryao's AIO patch seems to have helped nfs sync writes a lot. Still not nearly as good as omnios. As promised, I'm moving my updates from that pull request. As you may recall, when reading from ssd #1 and writing to a file on a sync=always dataset on another ssd-backed pool (with intel s3700 as SLOG), I was seeing it unable to exceed 1K IOPS. I just repeated the same test with a fresh install of omnios and see this:

root@omnios2:~# time dd if=/test/sync/foo of=/test/sync/foo2 bs=1M count=8K
8192+0 records in
8192+0 records out
8589934592 bytes (8.6 GB) copied, 47.5814 s, 181 MB/s

                         capacity     operations    bandwidth
pool                    alloc   free   read  write   read  write
----------------------  -----  -----  -----  -----  -----  -----
test                     321K   119G      0  2.80K      0   183M
  c6t5002538550038176d0  321K   119G      0    116      0   341K
logs                        -      -      -      -      -      -
  c6t55CD2E404B4CD14Fd0  256M  92.8G      0  2.69K      0   183M


Note it got to almost 3K IOPS and the aggregate write rate to the data pool was almost 200MB/sec.

@behlendorf
Contributor

@dswartz With the AIO patch applied it would be useful to gather data from iostat -mx to see the drive utilization, IO/s, and average request size. Also just checking the system to see if we're CPU bound would be good. Then we'll have an idea where to look next.

@dswartz
Contributor Author

dswartz commented Aug 14, 2014

> @dswartz With the AIO patch applied it would be useful to gather data from iostat -mx to see the drive utilization, IO/s, and average request size. Also just checking the system to see if we're CPU bound would be good. Then we'll have an idea where to look next.

will do...

@dswartz
Contributor Author

dswartz commented Aug 15, 2014

Since my 'production' pool lives off an LSI 6gb HBA I decided to re-run
the NFS test using that. Methodology:

Create a compression=lz4 dataset on the pool with sync=standard.
CentOS7 on a random sata drive. Data pool on a samsung 840 SSD.
SLOG on a 20% slice of intel s3700.
Mount dataset in vsphere, and add a 32GB vmdk to a win7 VM.
Run crystaldiskmark.

The numbers were actually pretty good now - sustained sequential write of
just under 70MB/sec. I will post the iostat info Brian requested later.
I want to reboot the same config with omnios and re-test with the exact
same config and post those numbers too. Later this evening...
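Roughly, the dataset side of that methodology looks like this (device paths and the dataset name are assumed for illustration):

```
zpool create test /dev/disk/by-id/ata-Samsung_SSD_840_PRO_Series_S12PNEACA01937F   # data pool
zpool add test log /dev/disk/by-id/wwn-0x55cd2e404b4cd14f    # ~20% slice of the intel s3700 as SLOG
zfs create -o compression=lz4 -o sync=standard test/vsphere
zfs set sharenfs=on test/vsphere
```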

@dswartz
Contributor Author

dswartz commented Aug 15, 2014

I actually have the crystaldiskmark screenshot as well as the iostat -mx output for the ZoL/AIO run. Since I can upload those now, I am doing so...

zfs with aio: [crystaldiskmark screenshot]

zfs without aio: [crystaldiskmark screenshot]

                                           capacity     operations    bandwidth
pool                                    alloc   free   read  write   read  write
--------------------------------------  -----  -----  -----  -----  -----  -----
test                                    4.04G   115G    265  1.48K  33.1M   134M
  ata-Samsung_SSD_840_PRO_Series_S12PNEACA01937F  4.04G   115G    265    584  33.1M  64.3M
logs                                        -      -      -      -      -      -
  wwn-0x55cd2e404b4cd14f                 224M  92.8G      0    928      0  69.9M
--------------------------------------  -----  -----  -----  -----  -----  -----
Device:         rrqm/s   wrqm/s     r/s     w/s    rMB/s    wMB/s avgrq-sz avgqu-sz   await r_await w_await  svctm  %util
sdc               0.00     0.00  262.50  578.70    32.81    64.23   236.26     0.52    0.62    0.86    0.50   0.43  36.50

@dswartz
Contributor Author

dswartz commented Aug 15, 2014

Sorry that got messed up. The screenshots? The first is with AIO, the second is stock ZoL.

The text captures? The first is 'zpool iostat -v' and the second 'iostat -mx' as requested.

@behlendorf
Contributor

The good news is the avgrq-sz is now basically where we need it to be. And we're also clearly not yet saturating the disk so there's still significant room for improvement.

@dswartz
Contributor Author

dswartz commented Aug 15, 2014

> The good news is the avgrq-sz is now basically where we need it to be. And we're also clearly not yet saturating the disk so there's still significant room for improvement.

Any thoughts as to what I can try tweaking next?

@behlendorf
Contributor

@dswartz Well the read activity during the write isn't good. Do you recall seeing the same amount of read activity during the write test under OmniOS?

@dswartz
Contributor Author

dswartz commented Aug 15, 2014

> @dswartz Well the read activity during the write isn't good. Do you recall seeing the same amount of read activity during the write test under OmniOS?

I don't believe so, no. I think I saw this earlier with ZoL, but skipped over it.
It seems to be some kind of RMW artifact due to the default 128KB
recordsize on the dataset (no idea why omnios doesn't get this.) OTOH,
I'm not sure how this is hurting us, since the aggregate R/W for the data
disk is barely 100MB/sec and it's a samsung 840 PRO, which can do several
times that. I can try changing the recordsize to, say, 8KB and see (I seem
to recall a NexentaStor FAQ a couple of years ago recommending that
NFS targets of vsphere have much smaller recordsizes...) I will try that
when I get home...
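For reference, the recordsize experiment is a one-liner; note it only affects blocks written after the change (the dataset name is a placeholder):

```
zfs set recordsize=8K test/vsphere
zfs get recordsize test/vsphere     # verify
```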

@behlendorf
Contributor

@dswartz The read-modify-write behavior will introduce some latency since the writes must block on the reads. That probably has a significant impact and might explain why the disk isn't saturated. Although I'd expect the same issue on OmniOS.

@dswartz
Contributor Author

dswartz commented Aug 16, 2014

Hmmm, with 8K records the write perf went back down to about 50MB/sec. I'm not sure I understand why the RMW would hurt here. For a spinner, sure, but this is a high-performance SSD, so latency should be pretty close to zero, no? As in IOPS should be the limiting factor? Or something like that?

ryao added a commit to ryao/zfs that referenced this issue Sep 2, 2014
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fall back to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC.  Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.

2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.

3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.

4. Making an asynchronous write synchronous is a non sequitur.

Any software dependent on synchronous aio_write behavior will suffer
data loss on ZFSOnLinux in a kernel panic / system failure of at most
zfs_txg_timeout seconds, which by default is 5 seconds. This seems like
a reasonable consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself
where nfsd does not pass O_SYNC on files opened with it and instead
relies on an open()/write()/close() to enforce synchronous behavior when
the flush is only guaranteed on last close.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio, according to a recursive grep of my
system's `/usr/src/debug`. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from its implementation to
justify the work.

Closes:
openzfs#223
openzfs#2373

Signed-off-by: Richard Yao <ryao@gentoo.org>
ryao added a commit to ryao/zfs that referenced this issue Sep 3, 2014
ryao added a commit to ryao/zfs that referenced this issue Sep 3, 2014
@ryao
Contributor

ryao commented Sep 3, 2014

@dswartz Is this a NUMA system? Do any of the following block device tuning knobs help?

echo 0 > /sys/block/[device]/queue/add_random
echo 2 > /sys/block/[device]/queue/rq_affinity

https://events.linuxfoundation.org/sites/events/files/eeus13_shelton.pdf

@dswartz
Contributor Author

dswartz commented Sep 3, 2014

> @dswartz Is this a NUMA system? Do any of the following block device tuning knobs help?
>
> echo 0 > /sys/block/[device]/queue/add_random
> echo 2 > /sys/block/[device]/queue/rq_affinity

No, it was a cheapo Pentium-D CPU on an intel motherboard...

ryao added a commit to ryao/zfs that referenced this issue Sep 3, 2014
ryao added a commit to ryao/zfs that referenced this issue Sep 3, 2014
@erocm123

erocm123 commented Sep 4, 2014

I am likewise seeing poor NFS performance with an SSD SLOG and sync writes. Good to find some answers at least.

@behlendorf
Contributor

@dswartz Support for AIO has now been merged into master, which should help your performance. If we need to make additional performance improvements let's open a new issue to track them.

@behlendorf behlendorf modified the milestones: 0.6.4, 0.7.0 Sep 5, 2014
@dswartz
Contributor Author

dswartz commented Sep 6, 2014

> @dswartz Support for AIO has now been merged into master, which should help your performance. If we need to make additional performance improvements let's open a new issue to track them.

Cool!

DeHackEd pushed a commit to DeHackEd/zfs that referenced this issue Sep 18, 2014
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fallback to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC.  Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.

2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.

3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.

4. Making an asynchronous write synchronous is non sequitur.

Any software dependent on synchronous aio_write behavior will suffer data
loss on ZFSOnLinux in a kernel panic / system failure: at most the last
zfs_txg_timeout seconds of data, which by default is 5 seconds. This seems
like a reasonable consequence of using non-compliant software.
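
For reference, a minimal sketch (not part of the commit; the file path and
buffer are placeholders) of the POSIX-compliant pattern the message advocates:
software that needs durability follows aio_write() with an explicit
aio_fsync() rather than assuming the write itself is synchronous. Link with
-lrt on older glibc.

```
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>

int
main(void)
{
	static char buf[4096] = "example payload";
	struct aiocb cb = { 0 };
	const struct aiocb *list[1] = { &cb };
	int fd = open("/tmp/aio-demo", O_CREAT | O_WRONLY | O_TRUNC, 0644);

	if (fd < 0) {
		perror("open");
		return (1);
	}

	cb.aio_fildes = fd;
	cb.aio_buf = buf;
	cb.aio_nbytes = sizeof (buf);
	cb.aio_offset = 0;

	/* Queue the write; completion alone does NOT imply stable storage. */
	if (aio_write(&cb) != 0) {
		perror("aio_write");
		return (1);
	}
	while (aio_error(&cb) == EINPROGRESS)
		aio_suspend(list, 1, NULL);
	if (aio_return(&cb) < 0) {
		perror("aio_write completion");
		return (1);
	}

	/* Durability comes from an explicit aio_fsync() (or fsync()). */
	if (aio_fsync(O_SYNC, &cb) != 0) {
		perror("aio_fsync");
		return (1);
	}
	while (aio_error(&cb) == EINPROGRESS)
		aio_suspend(list, 1, NULL);

	return (aio_return(&cb) == 0 ? 0 : 1);
}
```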

It should be noted that this is also a problem in the kernel itself,
where nfsd does not pass O_SYNC on files opened with it and instead
relies on an open()/write()/close() sequence to enforce synchronous
behavior, when the flush is only guaranteed on last close.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and using `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html
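
To illustrate what that means for a caller (an example of my own, not from the
commit), POSIX aio_cancel() may simply report AIO_NOTCANCELED, in which case
the caller falls back to waiting for the request to complete:

```
#include <aio.h>
#include <errno.h>
#include <stdio.h>

/*
 * Illustrative only: try to cancel an in-flight request 'cb' that was
 * queued earlier with aio_read()/aio_write() against descriptor 'fd'.
 */
static void
cancel_or_wait(int fd, struct aiocb *cb)
{
	const struct aiocb *list[1] = { cb };

	switch (aio_cancel(fd, cb)) {
	case AIO_CANCELED:
		printf("request canceled\n");
		break;
	case AIO_ALLDONE:
		printf("request had already completed\n");
		break;
	case AIO_NOTCANCELED:
		/* The implementation declined to cancel; wait it out. */
		while (aio_error(cb) == EINPROGRESS)
			aio_suspend(list, 1, NULL);
		break;
	default:
		perror("aio_cancel");
	}
}
```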

The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio, according to a recursive grep of my
system's `/usr/src/debug`. That suggests that `aio_cancel()` users are
rare. Implementing aio_cancel() is left to a future date when it is
clear that there are consumers that benefit from its implementation to
justify the work.

Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux,
it is the VFS' responsibility while on Illumos, it is the filesystem's
responsibility.  We take the intermediate solution of copying the iovec
so that the ZFS code can update it like on Solaris while leaving the
originals alone. This imposes some overhead. We could always revisit
this should profiling show that the allocations are a problem.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#223
Closes openzfs#2373
ryao added a commit to ryao/zfs that referenced this issue Oct 8, 2014
behlendorf pushed a commit to behlendorf/zfs that referenced this issue Jun 9, 2015
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fall back to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That is unnecessary overhead, so we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC.  Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.

One might be inclined to implement fops->aio_write synchronously to make
software that expects this behavior safe. However, there are several
reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.

2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.

3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.

4. Making an asynchronous write synchronous is a non sequitur.

Any software dependent on synchronous aio_write behavior will suffer data
loss on ZFSOnLinux in a kernel panic / system failure: at most the last
zfs_txg_timeout seconds of data, which by default is 5 seconds. This seems
like a reasonable consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself
where nfsd does not respect O_SYNC on files and assumes synchronous
behavior from do_readv_writev(), even though its fallback clearly does
not enforce it.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure. Exporting any file
system that implements AIO the way this patch does bears similar risk.
However, it seems reasonable to forgo crippling our AIO implementation
in favor of developing patches to fix this problem in Linux's nfsd for
the reasons stated earlier. In the interim, the risk will remain.
Failing to implement AIO will not change the problem that nfsd created,
so there is no reason for nfsd's mistake to block our implementation of
AIO.

Closes:
openzfs#223
openzfs#2373

Signed-off-by: Richard Yao <ryao@gentoo.org>