Direct IO support #7823

Merged
merged 1 commit into from
Aug 27, 2018

behlendorf
Contributor

Motivation and Context

Direct IO via the O_DIRECT flag was originally introduced in XFS by
IRIX for database workloads. Its purpose was to allow the database
to bypass the page and buffer caches to prevent unnecessary IO
operations (e.g. readahead) while preventing contention for system
memory between the database and kernel caches.

Since it was originally introduced it has become a standard flag which
is implemented, in some form, by the majority of Linux filesystems.
Failure to support this flag can result in compatibility problems.

Issue #224.

Description

Since the semantics were never defined in any standard, O_DIRECT is
implemented such that it conforms to the behavior described in the
Linux open(2) man page as follows.

  1. Minimize cache effects of the I/O.

    By design the ARC is already scan-resistant which helps mitigate
    the need for special O_DIRECT handling. Data which is only
    accessed once will be the first to be evicted from the cache.
    This behavior is consistent with Illumos and FreeBSD.

    Future performance work may wish to investigate the benefits of
    immediately evicting data from the cache which has been read or
    written with the O_DIRECT flag. Functionally this behavior is
    very similar to applying the 'primarycache=metadata' property
    per open file.

  2. O_DIRECT MAY impose restrictions on IO alignment and length.

    No additional alignment or length restrictions are imposed.

  3. O_DIRECT MAY perform unbuffered IO operations directly
    between user memory and block device.

    No unbuffered IO operations are currently supported. In order
    to support features such as transparent compression, encryption,
    and checksumming a copy must be made to transform the data.

  4. O_DIRECT MAY imply O_DSYNC (XFS).

    O_DIRECT does not imply O_DSYNC for ZFS. Callers must provide
    O_DSYNC to request synchronous semantics.

  5. O_DIRECT MAY disable file locking that serializes IO
    operations. Applications should avoid mixing O_DIRECT
    and normal IO or mmap(2) IO to the same file. This is
    particularly true for overlapping regions.

    All I/O in ZFS is locked for correctness and this locking is not
    disabled by O_DIRECT. However, concurrently mixing O_DIRECT,
    mmap(2), and normal I/O on the same file is not recommended.

This change is implemented by layering the aops->direct_IO operations
on the existing AIO operations. Code already existed in ZFS on Linux
for bypassing the page cache when O_DIRECT is specified.
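
As a rough illustration of what the semantics above mean for a caller, the sketch below opens a file with O_DIRECT, adds O_DSYNC explicitly (since O_DIRECT does not imply it on ZFS), and falls back to buffered IO when the filesystem rejects the flag with EINVAL. The helper names `open_direct` and `roundtrip` and the 4096-byte `BLOCK` size are assumptions made for this example and are not part of the patch. ZFS imposes no alignment restrictions, but the buffers are page-aligned here so the same code also works on filesystems that do.

```python
import errno
import mmap
import os

BLOCK = 4096  # assumed IO size and alignment; ZFS itself does not require one


def open_direct(path, sync=True):
    """Open path read/write with O_DIRECT, optionally adding O_DSYNC.

    Returns (fd, used_direct). Falls back to buffered IO if the
    filesystem rejects O_DIRECT with EINVAL.
    """
    flags = os.O_RDWR | os.O_CREAT
    if sync:
        flags |= os.O_DSYNC  # must be requested explicitly on ZFS
    direct = getattr(os, "O_DIRECT", 0)  # not defined on every platform
    if direct:
        try:
            return os.open(path, flags | direct, 0o600), True
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise
    return os.open(path, flags, 0o600), False


def roundtrip(path, payload):
    """Write payload (at most BLOCK bytes) at offset 0 and read it back."""
    fd, used_direct = open_direct(path)
    try:
        out_buf = mmap.mmap(-1, BLOCK)  # anonymous mappings are page-aligned
        out_buf[:len(payload)] = payload
        os.pwrite(fd, out_buf, 0)
        in_buf = mmap.mmap(-1, BLOCK)
        os.preadv(fd, [in_buf], 0)
        return in_buf[:len(payload)], used_direct
    finally:
        os.close(fd)
```

Calling `roundtrip("/tank/test.bin", b"hello")` (a hypothetical path) returns the bytes read back together with a flag saying whether O_DIRECT was actually accepted.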

References:

Original patch proposed by Richard Yao.

How Has This Been Tested?

All testing was performed using the 4.18.0-1.el7.elrepo.x86_64 kernel.
Compatibility code was added to support kernels back to 2.6.32 for RHEL6.

  • Five new test cases were added which use the fio(1) verify mode to check
    the most commonly used IO library functions. The O_SYNC and O_DIRECT
    flags are tested in all of these test cases, which may be extended as needed in the future.

  • All asynchronous IO and direct IO test cases included with xfstests were
    manually run and passed.

Additional build and test results are still needed for a wider range of kernels.
The CI bots and new fio test cases will cover the majority of kernels used by
current distributions but additional manual testing would be welcome.
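
For readers unfamiliar with verify-style testing, the following is a minimal sketch of the general idea: write blocks that carry their own offset and checksum, then read them back and check both. It is not fio's implementation and not one of the new test cases. The header layout, the `BLOCK` size, and the function names are assumptions chosen for this example.

```python
import mmap
import os
import struct
import zlib

BLOCK = 4096                   # assumed block size for the example
HEADER = struct.Struct("<QI")  # block offset, CRC32 of the payload


def write_pattern(fd, nblocks):
    """Write nblocks self-describing blocks starting at offset 0."""
    buf = mmap.mmap(-1, BLOCK)  # page-aligned, so O_DIRECT fds also work
    for i in range(nblocks):
        offset = i * BLOCK
        payload = os.urandom(BLOCK - HEADER.size)
        buf[:] = HEADER.pack(offset, zlib.crc32(payload)) + payload
        os.pwrite(fd, buf, offset)


def verify_pattern(fd, nblocks):
    """Return the offsets of blocks whose header or checksum is wrong."""
    buf = mmap.mmap(-1, BLOCK)
    bad = []
    for i in range(nblocks):
        offset = i * BLOCK
        got = os.preadv(fd, [buf], offset)
        stored_offset, stored_crc = HEADER.unpack_from(buf, 0)
        payload = buf[HEADER.size:]
        if (got != BLOCK or stored_offset != offset
                or zlib.crc32(payload) != stored_crc):
            bad.append(offset)
    return bad
```

The file descriptor can be opened with any combination of O_SYNC and O_DIRECT; an empty list from `verify_pattern` means every block read back matched what was written.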

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • All new and existing tests passed.
  • All commit messages are properly formatted and contain Signed-off-by.
  • Change has been approved by a ZFS on Linux member.

@behlendorf behlendorf requested a review from ryao August 21, 2018 22:47
@behlendorf behlendorf mentioned this pull request Aug 21, 2018
@behlendorf behlendorf force-pushed the direct-io branch 2 times, most recently from 2d2584f to a7796d6 Compare August 22, 2018 00:26
@cwedgwood
Contributor

@behlendorf we've focused too much on cache behavior/pollution; certainly whilst that feels very important there is another issue of robustness/correctness in the case of failures

DIO is as much about error handling

you want a 1:1 correspondence to IOs submitted and error responses

if you submit 100s of IOs, possibly asynchronously from many threads (databases do this), and for some reason some of them don't make it (I'm not sure under what circumstances that might happen with ZFS), the application needs to know which IOs failed and which succeeded

@shodanshok
Contributor

@cwedgwood To me it seems that in ZFS, where data writes can be transformed (i.e. compression, encryption, etc.) and are written in recordsize chunks, per-I/O error reporting is extremely difficult and invasive. I think this patch, which enables seamless use of Direct IO-only applications, is the way to go.

@behlendorf
Contributor Author

behlendorf commented Aug 22, 2018

@cwedgwood could you be more specific? What exactly are you proposing? Applications can pass the O_SYNC flag along with O_DIRECT if they require a 1:1 correspondence between IOs submitted and error responses.

[edit] I should add that one thing I did leave out of this PR was to force all O_DIRECT IO to be indirectly logged by the ZIL. This may be desirable for database workloads to prevent small blocks from being written twice.
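
A minimal sketch of the O_SYNC plus O_DIRECT suggestion above, assuming a thread pool that issues one `pwrite` per block and records the outcome of each call separately. The names `open_sync_direct` and `write_blocks`, the `BLOCK` size, and the worker count are illustrative assumptions. The fallback to O_SYNC alone only exists so the example also runs on filesystems that reject O_DIRECT.

```python
import errno
import mmap
import os
from concurrent.futures import ThreadPoolExecutor

BLOCK = 4096  # assumed block size, also used as the buffer alignment


def open_sync_direct(path):
    """Open with O_SYNC | O_DIRECT, dropping O_DIRECT if it is rejected."""
    flags = os.O_RDWR | os.O_CREAT | os.O_SYNC
    direct = getattr(os, "O_DIRECT", 0)
    if direct:
        try:
            return os.open(path, flags | direct, 0o600)
        except OSError as exc:
            if exc.errno != errno.EINVAL:
                raise
    return os.open(path, flags, 0o600)


def write_blocks(fd, count, fill=b"x", workers=8):
    """Write count blocks concurrently; return {index: None or OSError}."""
    arena = mmap.mmap(-1, count * BLOCK)  # page-aligned backing memory
    arena[:] = fill * (count * BLOCK)     # fill must be a single byte
    view = memoryview(arena)

    def one(index):
        start = index * BLOCK
        try:
            done = os.pwrite(fd, view[start:start + BLOCK], start)
            if done != BLOCK:
                raise OSError(errno.EIO, "short write")
            return index, None
        except OSError as exc:
            return index, exc  # the error stays tied to this one IO

    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(one, range(count)))
```

Because each synchronous call returns only after its own write has completed or failed, the returned dictionary tells the application exactly which blocks need to be retried.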

@richardelling
Contributor

In addition to @shodanshok's comments, ZFS coalesces I/Os, further diminishing the ability to directly attribute an I/O with an I/O.

@behlendorf I like the approach. Normally, I'd advocate for a docs change, but for this specific
case, if it "just works" for those apps that set the flags, I'm inclined to not attract more attention
to the implementation details thus avoiding directio bikeshedding.

Direct IO via the O_DIRECT flag was originally introduced in XFS by
IRIX for database workloads. Its purpose was to allow the database
to bypass the page and buffer caches to prevent unnecessary IO
operations (e.g.  readahead) while preventing contention for system
memory between the database and kernel caches.

On Illumos, there is a library function called directio(3C) that
allows user space to provide a hint to the file system that Direct IO
is useful, but the file system is free to ignore it. The semantics
are also entirely a file system decision. Those that do not
implement it return ENOTTY.

Since the semantics were never defined in any standard, O_DIRECT is
implemented such that it conforms to the behavior described in the
Linux open(2) man page as follows.

    1.  Minimize cache effects of the I/O.

    By design the ARC is already scan-resistant which helps mitigate
    the need for special O_DIRECT handling.  Data which is only
    accessed once will be the first to be evicted from the cache.
    This behavior is consistent with Illumos and FreeBSD.

    Future performance work may wish to investigate the benefits of
    immediately evicting data from the cache which has been read or
    written with the O_DIRECT flag.  Functionally this behavior is
    very similar to applying the 'primarycache=metadata' property
    per open file.

    2. O_DIRECT _MAY_ impose restrictions on IO alignment and length.

    No additional alignment or length restrictions are imposed.

    3. O_DIRECT _MAY_ perform unbuffered IO operations directly
       between user memory and block device.

    No unbuffered IO operations are currently supported.  In order
    to support features such as transparent compression, encryption,
    and checksumming a copy must be made to transform the data.

    4. O_DIRECT _MAY_ imply O_DSYNC (XFS).

    O_DIRECT does not imply O_DSYNC for ZFS.  Callers must provide
    O_DSYNC to request synchronous semantics.

    5. O_DIRECT _MAY_ disable file locking that serializes IO
       operations.  Applications should avoid mixing O_DIRECT
       and normal IO or mmap(2) IO to the same file.  This is
       particularly true for overlapping regions.

    All I/O in ZFS is locked for correctness and this locking is not
    disabled by O_DIRECT.  However, concurrently mixing O_DIRECT,
    mmap(2), and normal I/O on the same file is not recommended.

This change is implemented by layering the aops->direct_IO operations
on the existing AIO operations.  Code already existed in ZFS on Linux
for bypassing the page cache when O_DIRECT is specified.

References:
  * http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
  * https://blogs.oracle.com/roch/entry/zfs_and_directio
  * https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
  * https://illumos.org/man/3c/directio

Signed-off-by: Richard Yao <ryao@gentoo.org>
Signed-off-by: Brian Behlendorf <behlendorf1@llnl.gov>
Closes openzfs#224
@codecov

codecov bot commented Aug 23, 2018

Codecov Report

Merging #7823 into master will increase coverage by 0.04%.
The diff coverage is 0%.

@@            Coverage Diff             @@
##           master    #7823      +/-   ##
==========================================
+ Coverage   78.42%   78.46%   +0.04%     
==========================================
  Files         374      374              
  Lines      112907   112902       -5     
==========================================
+ Hits        88548    88592      +44     
+ Misses      24359    24310      -49
Flag Coverage Δ
#kernel 78.79% <0%> (-0.05%) ⬇️
#user 67.72% <ø> (+0.43%) ⬆️

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) Reviewed labels Aug 26, 2018
@behlendorf behlendorf merged commit a584ef2 into openzfs:master Aug 27, 2018
@yakirgb

yakirgb commented Aug 30, 2018

@behlendorf thanks! Do you know which version will be the first to include O_DIRECT?
I want to install the 0.7.10 rpm with O_DIRECT.

@behlendorf
Contributor Author

@yakirgb O_DIRECT will be added to 0.8. Though it should be pretty easy to cherry-pick back to 0.7.x if you want to run a custom build.

@rnz

rnz commented Feb 2, 2021

How can I disable O_DIRECT support for a specific pool or dataset, if necessary?
