Direct IO #224

Closed
behlendorf opened this issue May 3, 2011 · 61 comments
Labels
Type: Feature Feature request or new feature

Comments

@behlendorf
Contributor

The direct IO handlers have not yet been implemented. Supporting direct IO would have been a problem a few years back because of how ZFS copies everything into the ARC cache. However, ZFS recently gained a zero-copy interface which we may be able to leverage for direct IO support.

@ghost

ghost commented Aug 14, 2012

Hmm, why not do it this way: have O_DIRECT always succeed? Does it matter that ZFS copies everything into the ARC cache? Just fake it a bit towards the OS; it shouldn't hurt that much... Oh, and that is just my freak idea.

@uejji

uejji commented Oct 29, 2012

Unable to start mysqld with InnoDB databases living in a ZFS dataset. Is this related to this issue?

Using ppa:zfs-native/stable on Precise with the Quantal kernel.

Here is the system and dataset info, followed by a log snippet from /var/log/syslog:

root@HumanFish:/# uname -a
Linux HumanFish.net 3.5.0-17-generic #28-Ubuntu SMP Tue Oct 9 19:31:23 UTC 2012 x86_64 x86_64 x86_64 GNU/Linux

root@HumanFish:/# lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description: Ubuntu 12.04.1 LTS
Release: 12.04
Codename: precise

root@HumanFish:/# zfs get all zpool/mysql
NAME PROPERTY VALUE SOURCE
zpool/mysql type filesystem -
zpool/mysql creation Mon Oct 29 9:44 2012 -
zpool/mysql used 71M -
zpool/mysql available 204G -
zpool/mysql referenced 71M -
zpool/mysql compressratio 3.38x -
zpool/mysql mounted yes -
zpool/mysql quota none default
zpool/mysql reservation none default
zpool/mysql recordsize 128K default
zpool/mysql mountpoint /var/lib/mysql local
zpool/mysql sharenfs off default
zpool/mysql checksum on default
zpool/mysql compression lzjb local
zpool/mysql atime on default
zpool/mysql devices on default
zpool/mysql exec on default
zpool/mysql setuid on default
zpool/mysql readonly off default
zpool/mysql zoned off default
zpool/mysql snapdir hidden default
zpool/mysql aclinherit restricted default
zpool/mysql canmount on default
zpool/mysql xattr on default
zpool/mysql copies 2 local
zpool/mysql version 5 -
zpool/mysql utf8only off -
zpool/mysql normalization none -
zpool/mysql casesensitivity sensitive -
zpool/mysql vscan off default
zpool/mysql nbmand off default
zpool/mysql sharesmb off default
zpool/mysql refquota none default
zpool/mysql refreservation none default
zpool/mysql primarycache all default
zpool/mysql secondarycache all default
zpool/mysql usedbysnapshots 0 -
zpool/mysql usedbydataset 71M -
zpool/mysql usedbychildren 0 -
zpool/mysql usedbyrefreservation 0 -
zpool/mysql logbias latency default
zpool/mysql dedup off default
zpool/mysql mlslabel none default
zpool/mysql sync standard default
zpool/mysql refcompressratio 3.38x -
zpool/mysql written 71M -

Oct 29 09:45:37 HumanFish mysqld_safe: Starting mysqld daemon with databases from /var/lib/mysql
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: The InnoDB memory heap is disabled
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Mutexes and rw_locks use GCC atomic builtins
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Compressed tables use zlib 1.2.3.4
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Using Linux native AIO
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Initializing buffer pool, size = 256.0M
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Completed initialization of buffer pool
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Failed to set O_DIRECT on file ./ibdata1: OPEN: Invalid argument, continuing anyway
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: O_DIRECT is known to result in 'Invalid argument' on Linux on tmpfs, see MySQL Bug#26662
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Failed to set O_DIRECT on file ./ibdata1: OPEN: Invalid argument, continuing anyway
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: O_DIRECT is known to result in 'Invalid argument' on Linux on tmpfs, see MySQL Bug#26662
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: highest supported file format is Barracuda.
Oct 29 09:45:37 HumanFish mysqld: 121029 9:45:37 InnoDB: Operating system error number 22 in a file operation.
Oct 29 09:45:37 HumanFish mysqld: InnoDB: Error number 22 means 'Invalid argument'.
Oct 29 09:45:37 HumanFish mysqld: InnoDB: Some operating system error numbers are described at
Oct 29 09:45:37 HumanFish mysqld: InnoDB: http://dev.mysql.com/doc/refman/5.5/en/operating-system-error-codes.html
Oct 29 09:45:37 HumanFish mysqld: InnoDB: File name ./ib_logfile0
Oct 29 09:45:37 HumanFish mysqld: InnoDB: File operation call: 'aio write'.
Oct 29 09:45:37 HumanFish mysqld: InnoDB: Cannot continue operation.
Oct 29 09:45:37 HumanFish mysqld_safe: mysqld from pid file /var/run/mysqld/mysqld.pid ended
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: 0 processes alive and '/usr/bin/mysqladmin --defaults-file=/etc/mysql/debian.cnf ping' resulted in
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: #7/usr/bin/mysqladmin: connect to server at 'localhost' failed
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: error: 'Can't connect to local MySQL server through socket '/var/run/mysqld/mysqld.sock' (2)'
Oct 29 09:46:07 HumanFish /etc/init.d/mysql[19687]: Check that mysqld is running and that the socket: '/var/run/mysqld/mysqld.sock' exists!

@behlendorf
Contributor Author

@uejji I'm no mysql expert, but this is more related to #223. We don't yet support AIO; in this case most applications fall back to the normal I/O syscalls.

@uejji

uejji commented Oct 29, 2012

@behlendorf I see. The errors about O_DIRECT in the log led me here through a Google search. I'll watch that issue in the meantime.

Thanks.

@behlendorf
Contributor Author

@uejji See http://forum.percona.com/index.php?t=msg&goto=7577&S=0d0bff59d914393490d494ffaa9205a5 for a workaround to the aio issue.

@uejji

uejji commented Oct 29, 2012

@behlendorf The innodb_use_native_aio option didn't exist by default in my.cnf, but adding it manually worked fine.

Thanks for locating the workaround for me. I guess the eventual goal will be that it's no longer necessary.
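For anyone else who lands here, the workaround boils down to a small my.cnf fragment along these lines. This is a sketch based on the linked Percona thread; innodb_use_native_aio is the stock MySQL 5.5 option, and the value shown is the one that thread recommends:

```
[mysqld]
# Disable Linux native AIO so InnoDB falls back to its simulated AIO
# implementation, avoiding the native AIO path ZoL did not support.
innodb_use_native_aio = 0
```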

@pavel-odintsov

Any news about O_DIRECT support?

@pquan

pquan commented Apr 21, 2014

It can't be honored since ZFS is double-buffered; O_DIRECT makes no sense here anyway. O_SYNC is a better way.


@pruiz

pruiz commented Apr 21, 2014

Well, then a flag which allows ignoring O_DIRECT requests (without failing) could be a plus in some situations.

I know this can be dangerous in some situations, but there are others where it is an acceptable assumption; also, non-advanced users could be notified by emitting some kind of warning when such a flag is set.

@pruiz

pruiz commented Apr 21, 2014

Another option would be providing a flag with three options (ignore, dsync, sync), which would mean:

  • ignore => Simply ignore the O_DIRECT flag and perform the request as a standard one.
  • dsync => Assume O_DIRECT == O_DSYNC.
  • sync => Assume O_DIRECT == O_SYNC.

Greets

@behlendorf
Contributor Author

Making the behavior of O_DIRECT configurable with a property sounds like it may be a reasonable approach. However, we should be careful not to muddle the meaning of O_DIRECT.

The O_DIRECT flag only indicates that all the kernel caching should be bypassed. Data should be transferred directly to or from the user space process to the physical device. Unlike O_SYNC it makes no guarantees about the durability of the data on disk.

Given those requirements I could see a property which allows the following behavior:

  • disable => O_DIRECT is not supported; opening with it fails, as it does today.
  • ignore => Simply ignore the O_DIRECT flag and perform the request as a standard one.
  • enable => Never cache these blocks in the ARC. We can't avoid copies which might be made in the pipeline, but we can disable the caching.
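Purely to illustrate how such a property might look from the command line (the property name and values here are hypothetical; nothing like this existed at the time):

```
# Hypothetical property name and values, for illustration only.
zfs set directio=enable tank/mysql
zfs get directio tank/mysql
```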

@pruiz

pruiz commented Apr 23, 2014

That sounds pretty neat, and would allow some scenarios that aren't supported right now, even if each comes with its own tradeoffs. ;)

@maci0
Contributor

maci0 commented May 2, 2014

Newer versions of virt-manager want to use cache=none as the default for qemu virtual disk images, which in turn means qemu tries to use O_DIRECT and libvirt throws errors.
The error messages will confuse most users who aren't aware that ZoL doesn't support O_DIRECT yet.
+1 for any kind of solution
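A common workaround at the time was to pick a cache mode other than none for disks stored on ZFS datasets; for example, a libvirt disk stanza along these lines (the file path and image format are made up for the example):

```
<disk type='file' device='disk'>
  <!-- cache='none' implies O_DIRECT; writeback avoids it at the cost of
       buffering guest writes in the host page cache / ARC. -->
  <driver name='qemu' type='qcow2' cache='writeback'/>
  <source file='/tank/vm/guest.qcow2'/>
  <target dev='vda' bus='virtio'/>
</disk>
```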

ryao referenced this issue Sep 8, 2014
nfsd uses do_readv_writev() to implement fops->read and fops->write.
do_readv_writev() will attempt to read/write using fops->aio_read and
fops->aio_write, but it will fallback to fops->read and fops->write when
AIO is not available. However, the fallback will perform a call for each
individual data page. Since our default recordsize is 128KB, sequential
operations on NFS will generate 32 DMU transactions where only 1
transaction was needed. That was unnecessary overhead and we implement
fops->aio_read and fops->aio_write to eliminate it.

ZFS originated in OpenSolaris, where the AIO API is entirely implemented
in userland's libc by intelligently mapping them to VOP_WRITE, VOP_READ
and VOP_FSYNC.  Linux implements AIO inside the kernel itself. Linux
filesystems therefore must implement their own AIO logic and nearly all
of them implement fops->aio_write synchronously. Consequently, they do
not implement aio_fsync(). However, since the ZPL works by mapping
Linux's VFS calls to the functions implementing Illumos' VFS operations,
we instead implement AIO in the kernel by mapping the operations to the
VOP_READ, VOP_WRITE and VOP_FSYNC equivalents. We therefore implement
fops->aio_fsync.

One might be inclined to make our fops->aio_write implementation
synchronous to make software that expects this behavior safe. However,
there are several reasons not to do this:

1. Other platforms do not implement aio_write() synchronously and since
the majority of userland software using AIO should be cross platform,
expectations of synchronous behavior should not be a problem.

2. We would hurt the performance of programs that use POSIX interfaces
properly while simultaneously encouraging the creation of more
non-compliant software.

3. The broader community concluded that userland software should be
patched to properly use POSIX interfaces instead of implementing hacks
in filesystems to cater to broken software. This concept is best
described as the O_PONIES debate.

4. Making an asynchronous write synchronous is a non sequitur.

Any software dependent on synchronous aio_write behavior will, in the event
of a kernel panic / system failure, lose at most zfs_txg_timeout seconds of
writes on ZFSOnLinux (5 seconds by default). This seems like a reasonable
consequence of using non-compliant software.

It should be noted that this is also a problem in the kernel itself
where nfsd does not pass O_SYNC on files opened with it and instead
relies on an open()/write()/close() sequence to enforce synchronous behavior,
when the flush is only guaranteed on the last close.

Exporting any filesystem that does not implement AIO via NFS risks data
loss in the event of a kernel panic / system failure when something else
is also accessing the file. Exporting any file system that implements
AIO the way this patch does bears similar risk. However, it seems
reasonable to forgo crippling our AIO implementation in favor of
developing patches to fix this problem in Linux's nfsd for the reasons
stated earlier. In the interim, the risk will remain. Failing to
implement AIO will not change the problem that nfsd created, so there is
no reason for nfsd's mistake to block our implementation of AIO.

It also should be noted that `aio_cancel()` will always return
`AIO_NOTCANCELED` under this implementation. It is possible to implement
aio_cancel by deferring work to taskqs and use `kiocb_set_cancel_fn()`
to set a callback function for cancelling work sent to taskqs, but the
simpler approach is allowed by the specification:

```
Which operations are cancelable is implementation-defined.
```

http://pubs.opengroup.org/onlinepubs/009695399/functions/aio_cancel.html

The only programs on my system that are capable of using `aio_cancel()`
are QEMU, beecrypt and fio, according to a recursive grep of my system's
`/usr/src/debug`. That suggests that `aio_cancel()` users are rare.
Implementing aio_cancel() is left to a future date, when it is clear that
there are consumers that would benefit enough from its implementation to
justify the work.

Lastly, it is important to know that handling of the iovec updates differs
between Illumos and Linux in the implementation of read/write. On Linux,
it is the VFS' responsibility while on Illumos, it is the filesystem's
responsibility.  We take the intermediate solution of copying the iovec
so that the ZFS code can update it like on Solaris while leaving the
originals alone. This imposes some overhead. We could always revisit
this should profiling show that the allocations are a problem.

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes #223
Closes #2373
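The pattern the commit message argues for (do not assume aio_write() is synchronous; request durability explicitly) looks roughly like this in portable C. This is only a minimal sketch; the file name and block size are arbitrary, and on Linux it links with -lrt:

```
#include <aio.h>
#include <errno.h>
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        static char buf[4096];
        memset(buf, 'x', sizeof(buf));

        int fd = open("/tmp/aio-demo.dat", O_CREAT | O_WRONLY | O_TRUNC, 0644);
        if (fd < 0) { perror("open"); return 1; }

        /* Queue an asynchronous write. */
        struct aiocb cb;
        memset(&cb, 0, sizeof(cb));
        cb.aio_fildes = fd;
        cb.aio_buf = buf;
        cb.aio_nbytes = sizeof(buf);
        cb.aio_offset = 0;
        if (aio_write(&cb) != 0) { perror("aio_write"); return 1; }

        /* Wait for the write to complete. */
        const struct aiocb *list[1] = { &cb };
        while (aio_error(&cb) == EINPROGRESS)
                aio_suspend(list, 1, NULL);
        if (aio_return(&cb) < 0) { perror("aio_write result"); return 1; }

        /* Explicitly ask for the data to reach stable storage instead of
         * assuming the asynchronous write was durable. */
        struct aiocb sync_cb;
        memset(&sync_cb, 0, sizeof(sync_cb));
        sync_cb.aio_fildes = fd;
        if (aio_fsync(O_SYNC, &sync_cb) != 0) { perror("aio_fsync"); return 1; }
        const struct aiocb *slist[1] = { &sync_cb };
        while (aio_error(&sync_cb) == EINPROGRESS)
                aio_suspend(slist, 1, NULL);

        close(fd);
        return 0;
}
```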
@behlendorf behlendorf removed this from the 0.8.0 milestone Oct 3, 2014
@mgancarzdsi

👍 For this. I've been experimenting with oVirt as a virtualization manager and I'd love to use ZFS for its data stores, but as far as I understand, I can't add it as a local data store due to this issue.

@behlendorf behlendorf added this to the 0.6.4 milestone Nov 20, 2014
@maci0
Contributor

maci0 commented Jan 9, 2015

The solution in illumos KVM is rather crude too:
https://github.com/joyent/illumos-kvm-cmd/blob/master/block/raw-posix.c#L97

@pavel-odintsov

It's still better than a "silently ignore O_DIRECT" approach.

@behlendorf behlendorf modified the milestones: 0.6.5, 0.6.4 Feb 17, 2015
@behlendorf
Contributor Author

After investigating what it will take to support this I'm bumping this functionality from the 0.6.4 tag. To add this functionality we must implement the address_space_operations.direct_IO callback for the ZPL. This will allow us to pin in memory the pages the application has passed in for IO; the IO can then be performed directly to those pages. It will also require us to add an additional interface to the DMU which accepts a struct iov_iter. While this work isn't particularly difficult, it's also not critical functionality and we don't want it to hold up the next release.
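For a rough idea of the shape of such a shim, here is a minimal sketch. It assumes a kernel where aops->direct_IO takes a (struct kiocb *, struct iov_iter *) pair, and it assumes iter-based buffered read/write helpers already exist; the zpl_iter_read/zpl_iter_write names are placeholders, not necessarily the real ZoL symbols:

```
#include <linux/fs.h>
#include <linux/uio.h>

/* Assumed to exist: the filesystem's iter-based buffered read/write paths. */
extern ssize_t zpl_iter_read(struct kiocb *kiocb, struct iov_iter *to);
extern ssize_t zpl_iter_write(struct kiocb *kiocb, struct iov_iter *from);

/* Hypothetical shim: route O_DIRECT requests through the existing
 * buffered paths instead of failing the open with EINVAL. */
static ssize_t
zpl_direct_IO(struct kiocb *kiocb, struct iov_iter *iter)
{
	if (iov_iter_rw(iter) == WRITE)
		return (zpl_iter_write(kiocb, iter));

	return (zpl_iter_read(kiocb, iter));
}

static const struct address_space_operations zpl_aops = {
	/* ...existing operations... */
	.direct_IO	= zpl_direct_IO,
};
```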

@ryao
Contributor

ryao commented Jul 23, 2015

@behlendorf We cannot just pin the user pages. We also need to mark them CoW so that userland cannot modify them while they are being read. Otherwise, we risk writing incorrect checksums. In the case of compression, userland modification of the pages while the compression algorithm runs would result in undefined behavior and might pose a security risk.

That said, I have a commit that implements O_DIRECT by mapping it to userspace here:

a08c76a

It was written after a user asked for the patch and it is not meant to be merged, but the commit message has a discussion of what O_DIRECT actually means that I will reproduce below:

DirectIO via the O_DIRECT flag was originally introduced in XFS by IRIX
for database workloads. Its purpose was to allow the database to bypass
the page and buffer caches to prevent unnecessary IO operations (e.g.
readahead) while preventing contention for system memory between the
database and kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS in Linux are as follows:

1. O_DIRECT requires IOs be aligned to backing device's sector size.
2. O_DIRECT performs unbuffered IO operations between user memory and block
device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS because there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
that prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encounter another hurdle.
Specifically, avoiding the userland-to-kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior while for
checksums, this would imply we write incorrect checksums to disk.  It
would be possible to avoid those issues if we modify the page tables to
make any changes by userland to memory trigger page faults and perform
CoW operations.  However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Mac OS X does not
implement O_DIRECT, but it does implement F_NOCACHE, which is similar
to #2 in that it prevents new data from being cached. AIX relaxes #3 by
only committing the file data to disk. Metadata updates required should
the operations make the file larger are asynchronous unless O_DSYNC is
specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Those that do not implement it
return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need to
be finished and we would need to utilize the page tables to make operations
on the userland pages safe.

References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

@Bronek

Bronek commented Jul 26, 2015

@ryao thanks for the writeup. I use ZFS zvols as backing storage for qemu, and am also vaguely familiar with how databases perform IO. Mapping direct_IO to AIO is definitely a good first step, but it would be great if both 2. and 3. eventually received attention as well.

Regarding 2., CoW definitely seems like a good direction. I would also expect lower memory utilisation and possibly other performance gains from O_DIRECT if (and only if) compression is not enabled.

Regarding 3., that's an interesting one. One can rather trivially (although not cheaply) increase IO subsystem performance by attaching an NVMe PCIe-backed SLOG device; it would be great if the ZIL could be used (if configured so; an extra option would be needed) as the primary backing storage for O_DIRECT writes rather than indirect logging. This would help preserve the benefits of a fast SLOG device, i.e. very low latency of synchronous writes, while at the same time guaranteeing data safety and low memory utilisation (the primary goals of O_DIRECT in the scenarios I am familiar with).

@au-phiware

@behlendorf acf0ade seems unrelated to Direct IO... did you mean to close this one?

@behlendorf
Contributor Author

@au-phiware whoops, no I did not. It was accidentally caused by merging the SPL and its history into the ZFS repository. See PR #7556; we'll probably have a few more of these.

@pkramme

pkramme commented Jul 29, 2018

What is the progress on this? I tried installing oVirt, but, as oVirt needs direct IO, the installation failed.

My workaround is to use a ZVOL with XFS in it.

@shodanshok
Contributor

@behlendorf any update on the matter? I agree that O_DIRECT can be implemented by simply treating it as a hint ZFS can ignore. As a further step, it would be great if O_DIRECT requests polluted the ARC as little as possible (basically what you suggested in your comment on 22 Apr 2014).

@behlendorf
Contributor Author

No one I'm aware of is working on this. If someone would like to take a crack at it, I'm happy to help with the design, which could initially be the basic one described above, and to review the changes.

@shodanshok
Contributor

shodanshok commented Aug 17, 2018

@behlendorf Well, I just tried a small C program with O_DIRECT [1] on FreeBSD 11.x and it really does seem that O_DIRECT is ignored: writes go into the ARC and are served from it when the data is read back. ZFS compression for the dataset is off.

This does not surprise me: O_DIRECT implies zero memory copies and/or DMA from user memory to the disks themselves. While this should be possible with a conventional filesystem, with CoW + checksums (and anything else which transforms data in flight, e.g. compression) it becomes very difficult.

# Before running the test program:
ARC Size:                               0.09%   1.14    MiB
        Target Size: (Adaptive)         100.00% 1.20    GiB
        Min Size (Hard Limit):          12.50%  153.30  MiB
        Max Size (High Water):          8:1     1.20    GiB

# After running it:
ARC Size:                               48.65%  596.61  MiB
        Target Size: (Adaptive)         100.00% 1.20    GiB
        Min Size (Hard Limit):          12.50%  153.30  MiB
        Max Size (High Water):          8:1     1.20    GiB

# Reading the just-written file shows the data is served from the ARC (i.e. too fast to be coming from disk)
root@freebsd:~ # dd if=/tank/test.img of=/dev/null bs=1M
512+0 records in
512+0 records out
536870912 bytes transferred in 0.188852 secs (2842809718 bytes/sec)

[1] Test program:

root@freebsd:~ # cat test.c
#define _GNU_SOURCE
#include <string.h>
#include <stdlib.h>
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>
#define BLOCKSIZE 128*1024

int main()
{
        void *buffer;
        int i = 0;
        ssize_t w = 0;

        buffer = malloc(BLOCKSIZE);
        memset(buffer, 48, BLOCKSIZE);

        /* O_CREAT requires a mode argument. */
        int f = open("/tank/test.img", O_CREAT|O_TRUNC|O_WRONLY|O_DIRECT, 0644);
        if (f < 0) {
                perror("open");
                return 1;
        }

        /* Write 512 MiB in 128 KiB blocks (matching the dataset recordsize). */
        for (i = 0; i < 512*8; i++) {
                w = write(f, buffer, BLOCKSIZE);
                if (w != BLOCKSIZE) {
                        perror("write");
                        break;
                }
        }
        close(f);
        free(buffer);
        return 0;
}

Am I correct? Would a simple "ignore O_DIRECT" policy be acceptable in current ZoL?
Thanks.

@shodanshok
Contributor

@ryao If I understand it correctly, your old patch a08c76a basically ignores O_DIRECT by serving reads/writes using normal AIO functions.

Any chance of updating it for newer ZFS / kernel releases?

@behlendorf
Contributor Author

behlendorf commented Aug 17, 2018

Would be a simple "ignore O_DIRECT" policy be acceptable in current ZoL?

Something very similar would be acceptable. The approach taken by @ryao is an excellent start, but we need to incorporate two additional changes to stay consistent with the intent of O_DIRECT. Otherwise we risk breaking existing applications which depend on this behavior.

  • O_DIRECT should behave as if O_SYNC were set.
    [edit] Requirement removed, the open(2) man page explicitly says this is not guaranteed.

  • As @rlaager suggested, O_DIRECT IOs should imply primarycache=metadata for those blocks.

Both of these should be relatively easy to implement since all of the needed functionality already exists.

@rlaager
Member

rlaager commented Aug 17, 2018

I disagree with the idea that O_DIRECT should imply O_SYNC. My interest here is with cache=none for qemu. As was already mentioned, qemu has cache=none (O_DIRECT) and cache=directsync (both O_DSYNC and O_DIRECT). If someone wants both O_DIRECT and O_SYNC, they can ask for both.

@shodanshok
Contributor

shodanshok commented Aug 17, 2018

As @rlaager suggested, O_DIRECT should not imply O_SYNC. From the open() man page:

O_DIRECT (since Linux 2.4.10)
Try to minimize cache effects of the I/O to and from this
file. In general this will degrade performance, but it is
useful in special situations, such as when applications do
their own caching. File I/O is done directly to/from user-
space buffers. The O_DIRECT flag on its own makes an effort
to transfer data synchronously, but does not give the
guarantees of the O_SYNC flag that data and necessary metadata
are transferred. To guarantee synchronous I/O, O_SYNC must be
used in addition to O_DIRECT. See NOTES below for further
discussion.

In other words, O_DIRECT is somewhat similar to O_DSYNC, but without the necessary I/O barriers (i.e. fsync() and ATA FLUSH/FUA) to really commit the data to stable storage immediately.

For a first implementation, even simply ignoring O_DIRECT (similar to FreeBSD) would be better than the current behavior (where open() with O_DIRECT fails with an error). If we can wire O_DIRECT to primarycache=metadata, however, that would be great.
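To make the distinction concrete, an application that wants both the caching hint and a durability guarantee asks for both flags explicitly. Here is a minimal sketch; the 4 KiB alignment, file path, and sizes are assumptions, and many Linux filesystems require O_DIRECT buffers and offsets to be aligned to the device sector size:

```
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int main(void)
{
        void *buf;

        /* O_DIRECT usually requires sector-aligned memory. */
        if (posix_memalign(&buf, 4096, 4096) != 0) {
                perror("posix_memalign");
                return 1;
        }
        memset(buf, 'x', 4096);

        /* O_DIRECT is only the caching hint; O_DSYNC adds the guarantee
         * that the write has reached stable storage before returning. */
        int fd = open("/tank/direct.dat",
            O_CREAT | O_WRONLY | O_TRUNC | O_DIRECT | O_DSYNC, 0644);
        if (fd < 0) {
                perror("open");
                return 1;
        }
        if (write(fd, buf, 4096) != 4096)
                perror("write");

        close(fd);
        free(buf);
        return 0;
}
```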

@behlendorf
Contributor Author

Thanks for explicitly calling out what the man page has to say about this. Given that, I agree we just want the minimal caching behavior.

behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 21, 2018
Direct IO via the O_DIRECT flag was originally introduced in XFS by
IRIX for database workloads. Its purpose was to allow the database
to bypass the page and buffer caches to prevent unnecessary IO
operations (e.g.  readahead) while preventing contention for system
memory between the database and kernel caches.

On Illumos, there is a library function called directio(3C) that
allows user space to provide a hint to the file system that Direct IO
is useful, but the file system is free to ignore it. The semantics
are also entirely a file system decision. Those that do not
implement it return ENOTTY.

Since the semantics were never defined in any standard, O_DIRECT is
implemented such that it conforms to the behavior described in the
Linux open(2) man page as follows.

    1.  Minimize cache effects of the I/O.

    By design the ARC is already scan-resistant which helps mitigate
    the need for special O_DIRECT handling.  Data which is only
    accessed once will be the first to be evicted from the cache.
    This behavior is consistent with Illumos and FreeBSD.

    Future performance work may wish to investigate the benefits of
    immediately evicting data from the cache which has been read or
    written with the O_DIRECT flag.  Functionally this behavior is
    very similar to applying the 'primarycache=metadata' property
    per open file.

    2. O_DIRECT _MAY_ impose restrictions on IO alignment and length.

    No additional alignment or length restrictions are imposed.

    3. O_DIRECT _MAY_ perform unbuffered IO operations directly
       between user memory and block device.

    No unbuffered IO operations are currently supported.  In order
    to support features such as transparent compression, encryption,
    and checksumming a copy must be made to transform the data.

    4. O_DIRECT _MAY_ imply O_DSYNC (XFS).

    O_DIRECT does not imply O_DSYNC for ZFS.  Callers must provide
    O_DSYNC to request synchronous semantics.

    5. O_DIRECT _MAY_ disable file locking that serializes IO
       operations.  Applications should avoid mixing O_DIRECT
       and normal IO or mmap(2) IO to the same file.  This is
       particularly true for overlapping regions.

    All I/O in ZFS is locked for correctness and this locking is not
    disabled by O_DIRECT.  However, concurrently mixing O_DIRECT,
    mmap(2), and normal I/O on the same file is not recommended.

This change is implemented by layering the aops->direct_IO operations
on the existing AIO operations.  Code already existed in ZFS on Linux
for bypassing the page cache when O_DIRECT is specified.

References:
  * http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
  * https://blogs.oracle.com/roch/entry/zfs_and_directio
  * https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
  * https://illumos.org/man/3c/directio

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#224
@behlendorf behlendorf mentioned this issue Aug 21, 2018
@behlendorf
Contributor Author

I've opened #7823 with an updated version of @ryao's original patch. It does not implement the primarycache=metadata suggestion and instead behaves the same way as Illumos and FreeBSD. After some investigation I decided there were additional complexities which needed more analysis and would be better tackled at a later date. See the PR for additional details.

behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 21, 2018
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 22, 2018
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 22, 2018
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 23, 2018
behlendorf added a commit to behlendorf/zfs that referenced this issue Aug 23, 2018
sdimitro pushed a commit to sdimitro/zfs that referenced this issue Feb 18, 2022
…zfs#224)

The object-block mapping can change while a GET is in progress.  After
the object-block map changes, the old object can be deleted.  If the GET
is still in progress, the GET may fail with "object does not exist", and
the agent will panic.

The fix is to retry failed GETs if the object-block mapping for the
requested block has changed (i.e. if we will be getting a different
object).