Direct IO support #7823
Conversation
Force-pushed from 2d2584f to a7796d6
@behlendorf We've focused too much on cache behavior/pollution. That certainly feels important, but there is another issue of robustness/correctness in the case of failures: DIO is as much about error handling. You want a 1:1 correspondence between IOs submitted and error responses. If you submit hundreds of IOs, possibly asynchronously from many threads (databases do this), and for some reason some of them don't make it (I'm not sure under what circumstances that might happen with ZFS), the application needs to know which IOs failed and which succeeded.
@shodanshok: @cwedgwood To me it seems that in ZFS, where data writes can be transformed (i.e. compression, encryption, etc.) and will be written in recordsize chunks, per-I/O error reporting is extremely difficult and invasive. I think this patch, enabling seamless use of DirectIO-only applications, is the way to go.
@cwedgwood Could you be more specific about what exactly you are proposing? Applications can pass the [edit] I should add that one thing I did leave out of this PR was to force all
In addition to @shodanshok's comments, ZFS coalesces I/Os, further diminishing the ability to directly attribute an individual error to an individual I/O. @behlendorf I like the approach. Normally I'd advocate for a docs change, but for this specific
Direct IO via the O_DIRECT flag was originally introduced in XFS by IRIX for database workloads. Its purpose was to allow the database to bypass the page and buffer caches to prevent unnecessary IO operations (e.g. readahead) while preventing contention for system memory between the database and kernel caches.

On Illumos, there is a library function called directio(3C) that allows user space to provide a hint to the file system that Direct IO is useful, but the file system is free to ignore it. The semantics are also entirely a file system decision. Those that do not implement it return ENOTTY.

Since the semantics were never defined in any standard, O_DIRECT is implemented such that it conforms to the behavior described in the Linux open(2) man page as follows.

1. Minimize cache effects of the I/O.

   By design the ARC is already scan-resistant, which helps mitigate the need for special O_DIRECT handling. Data which is only accessed once will be the first to be evicted from the cache. This behavior is consistent with Illumos and FreeBSD.

   Future performance work may wish to investigate the benefits of immediately evicting data from the cache which has been read or written with the O_DIRECT flag. Functionally this behavior is very similar to applying the 'primarycache=metadata' property per open file.

2. O_DIRECT _MAY_ impose restrictions on IO alignment and length.

   No additional alignment or length restrictions are imposed.

3. O_DIRECT _MAY_ perform unbuffered IO operations directly between user memory and block device.

   No unbuffered IO operations are currently supported. In order to support features such as transparent compression, encryption, and checksumming a copy must be made to transform the data.

4. O_DIRECT _MAY_ imply O_DSYNC (XFS).

   O_DIRECT does not imply O_DSYNC for ZFS. Callers must provide O_DSYNC to request synchronous semantics.

5. O_DIRECT _MAY_ disable file locking that serializes IO operations.

   Applications should avoid mixing O_DIRECT and normal IO or mmap(2) IO to the same file. This is particularly true for overlapping regions. All I/O in ZFS is locked for correctness and this locking is not disabled by O_DIRECT. However, concurrently mixing O_DIRECT, mmap(2), and normal I/O on the same file is not recommended.

This change is implemented by layering the aops->direct_IO operations on the existing AIO operations. Code already existed in ZFS on Linux for bypassing the page cache when O_DIRECT is specified.

References:
* http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
* https://blogs.oracle.com/roch/entry/zfs_and_directio
* https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
* https://illumos.org/man/3c/directio

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#224
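The alignment point in the commit message (ZFS imposes no additional alignment restrictions, unlike many filesystems) can be illustrated with a small userspace sketch. This is not code from the PR; the path, function name, and 4 KiB size below are hypothetical, and the sketch aligns defensively with posix_memalign() so it also works on filesystems that do enforce alignment. It falls back to a buffered open where O_DIRECT is unsupported (e.g. tmpfs returns EINVAL):

```c
#define _GNU_SOURCE  /* for O_DIRECT on Linux */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define ALIGN 4096  /* a common logical block size */

/* Write one aligned block with O_DIRECT.
 * Returns 0 on success, -1 on failure. */
int direct_write_demo(const char *path)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC | O_DIRECT, 0644);
    if (fd < 0 && errno == EINVAL)  /* filesystem rejects O_DIRECT */
        fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    void *buf = NULL;
    if (posix_memalign(&buf, ALIGN, ALIGN) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 'z', ALIGN);

    /* Aligned buffer, offset, and length keep this portable across
     * filesystems; on ZFS the alignment is not strictly required. */
    ssize_t n = pwrite(fd, buf, ALIGN, 0);
    free(buf);
    close(fd);
    return (n == (ssize_t)ALIGN) ? 0 : -1;
}
```

Call direct_write_demo() with a path on the filesystem under test; a buffered open is silently used where Direct IO is unavailable, so treat this as a sketch rather than a strict O_DIRECT probe.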
Codecov Report
@@ Coverage Diff @@
## master #7823 +/- ##
==========================================
+ Coverage 78.42% 78.46% +0.04%
==========================================
Files 374 374
Lines 112907 112902 -5
==========================================
+ Hits 88548 88592 +44
+ Misses 24359 24310 -49
Continue to review full report at Codecov.
@behlendorf Thanks! Do you know which version will be the next to include O_DIRECT?
@yakirgb O_DIRECT will be added to 0.8. Though it should be pretty easy to cherry-pick back to 0.7.x if you want to run a custom build. |
How can one disable O_DIRECT support for a specific pool or dataset, if necessary?
Motivation and Context
Direct IO via the O_DIRECT flag was originally introduced in XFS by
IRIX for database workloads. Its purpose was to allow the database
to bypass the page and buffer caches to prevent unnecessary IO
operations (e.g. readahead) while preventing contention for system
memory between the database and kernel caches.
Since it was originally introduced it has become a standard flag which
is implemented, in some form, by the majority of Linux filesystems.
Failure to support this flag can result in compatibility problems.
Issue #224.
Description
Since the semantics were never defined in any standard, O_DIRECT is
implemented such that it conforms to the behavior described in the
Linux open(2) man page as follows.
1. Minimize cache effects of the I/O.

   By design the ARC is already scan-resistant, which helps mitigate the need for special O_DIRECT handling. Data which is only accessed once will be the first to be evicted from the cache. This behavior is consistent with Illumos and FreeBSD.

   Future performance work may wish to investigate the benefits of immediately evicting data from the cache which has been read or written with the O_DIRECT flag. Functionally this behavior is very similar to applying the 'primarycache=metadata' property per open file.

2. O_DIRECT MAY impose restrictions on IO alignment and length.

   No additional alignment or length restrictions are imposed.

3. O_DIRECT MAY perform unbuffered IO operations directly between user memory and block device.

   No unbuffered IO operations are currently supported. In order to support features such as transparent compression, encryption, and checksumming a copy must be made to transform the data.

4. O_DIRECT MAY imply O_DSYNC (XFS).

   O_DIRECT does not imply O_DSYNC for ZFS. Callers must provide O_DSYNC to request synchronous semantics.

5. O_DIRECT MAY disable file locking that serializes IO operations.

   Applications should avoid mixing O_DIRECT and normal IO or mmap(2) IO to the same file. This is particularly true for overlapping regions. All I/O in ZFS is locked for correctness and this locking is not disabled by O_DIRECT. However, concurrently mixing O_DIRECT, mmap(2), and normal I/O on the same file is not recommended.
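The point that O_DIRECT does not imply O_DSYNC on ZFS is easy to get wrong when porting from XFS, so here is a hedged sketch of requesting durable Direct IO writes. The path, function names, and block size are hypothetical (not from the PR), and the sketch retries without O_DIRECT where the filesystem rejects it:

```c
#define _GNU_SOURCE  /* for O_DIRECT on Linux */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

/* Open with the requested flags, retrying without O_DIRECT if the
 * filesystem rejects it (EINVAL), so the sketch stays portable. */
static int open_maybe_direct(const char *path, int flags)
{
    int fd = open(path, flags | O_DIRECT, 0644);
    if (fd < 0 && errno == EINVAL)
        fd = open(path, flags, 0644);
    return fd;
}

/* On ZFS, O_DIRECT alone requests cache bypass only. For stable-on-
 * return semantics, add O_DSYNC at open (as here) or call fdatasync()
 * after the write. Returns 0 on success, -1 on failure. */
int durable_direct_write(const char *path)
{
    int fd = open_maybe_direct(path,
                               O_WRONLY | O_CREAT | O_TRUNC | O_DSYNC);
    if (fd < 0)
        return -1;

    void *buf = NULL;
    if (posix_memalign(&buf, BLK, BLK) != 0) {
        close(fd);
        return -1;
    }
    memset(buf, 0xAB, BLK);

    ssize_t n = pwrite(fd, buf, BLK, 0);  /* returns once data is stable */
    free(buf);
    close(fd);
    return (n == (ssize_t)BLK) ? 0 : -1;
}
```

Without the O_DSYNC flag (or an explicit fdatasync()), a successful pwrite() here carries no durability guarantee on ZFS, exactly as item 4 describes.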
This change is implemented by layering the aops->direct_IO operations
on the existing AIO operations. Code already existed in ZFS on Linux
for bypassing the page cache when O_DIRECT is specified.
References:
Original patch proposed by Richard Yao.
How Has This Been Tested?
All testing was performed using the 4.18.0-1.el7.elrepo.x86_64 kernel.
Compatibility code was added to support kernels back to 2.6.32 for RHEL6.
Five new test cases were added which use the fio(1) verify mode to check the most commonly used IO library functions. The O_SYNC and O_DIRECT flags are tested for all of these tests, and the tests may be extended as needed in the future.
All asynchronous IO and direct IO test cases included with xfstests were
manually run and passed.
Additional build and test results are still needed for a wider range of kernels.
The CI bots and new fio test cases will cover the majority of kernels used by
current distributions but additional manual testing would be welcome.
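For readers without fio at hand, the verification the tests rely on can be approximated in a few lines of C: write a known pattern through O_DIRECT, read it back, and compare. This is an illustrative sketch, not the PR's test code; the path, names, and 4 KiB size are hypothetical, and it falls back to buffered IO where O_DIRECT is unsupported:

```c
#define _GNU_SOURCE  /* for O_DIRECT on Linux */
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BLK 4096

static int open_fallback(const char *path, int flags)
{
    int fd = open(path, flags | O_DIRECT, 0644);
    if (fd < 0 && errno == EINVAL)  /* O_DIRECT unsupported here */
        fd = open(path, flags, 0644);
    return fd;
}

/* Write a known pattern, read it back, and compare -- roughly what
 * fio's verify mode automates. Returns 0 if the data round-trips. */
int verify_roundtrip(const char *path)
{
    void *wbuf = NULL, *rbuf = NULL;
    ssize_t n;
    int fd, rc = -1;

    if (posix_memalign(&wbuf, BLK, BLK) != 0)
        return -1;
    if (posix_memalign(&rbuf, BLK, BLK) != 0) {
        free(wbuf);
        return -1;
    }
    for (int i = 0; i < BLK; i++)
        ((unsigned char *)wbuf)[i] = (unsigned char)(i & 0xFF);

    fd = open_fallback(path, O_WRONLY | O_CREAT | O_TRUNC);
    if (fd < 0)
        goto out;
    n = pwrite(fd, wbuf, BLK, 0);
    close(fd);
    if (n != (ssize_t)BLK)
        goto out;

    fd = open_fallback(path, O_RDONLY);
    if (fd < 0)
        goto out;
    n = pread(fd, rbuf, BLK, 0);
    close(fd);
    if (n == (ssize_t)BLK && memcmp(wbuf, rbuf, BLK) == 0)
        rc = 0;  /* pattern survived the Direct IO round trip */
out:
    free(wbuf);
    free(rbuf);
    return rc;
}
```

A real test run should also vary offsets, lengths, and concurrency the way the fio jobs do; this sketch only demonstrates the single-block case.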
Types of changes
Checklist:
Signed-off-by.