From a08c76a8ad63c28384ead72b53a3d7ef73f39357 Mon Sep 17 00:00:00 2001
From: Richard Yao
Date: Wed, 22 Jul 2015 18:42:01 -0400
Subject: [PATCH] DirectIO support

DirectIO via the O_DIRECT flag was originally introduced in XFS on IRIX
for database workloads. Its purpose was to let the database bypass the
page and buffer caches, both to prevent unnecessary IO operations (e.g.
readahead) and to prevent contention for system memory between the
database and the kernel caches.

Unfortunately, the semantics were never defined in any standard. The
semantics of O_DIRECT in XFS on Linux are as follows:

1. O_DIRECT requires IOs to be aligned to the backing device's sector
   size.
2. O_DIRECT performs unbuffered IO operations between user memory and
   the block device (DMA when the block device is physical hardware).
3. O_DIRECT implies O_DSYNC.
4. O_DIRECT disables any locking that would serialize IO operations.

The first is not possible in ZFS because there is no backing device in
the general case.

The second is not possible in ZFS in the presence of compression because
compression prevents us from doing DMA from user pages. If we relax the
requirement in the case of compression, we encounter another hurdle.
Specifically, avoiding the userland to kernel copy risks other userland
threads modifying buffers during compression and checksum computations.
For compressed data, this would cause undefined behavior, while for
checksums, it would mean that we write incorrect checksums to disk. It
would be possible to avoid those issues if we modified the page tables
so that any change by userland to that memory triggers a page fault and
a CoW operation. However, it is unclear if it is wise for a filesystem
driver to do this.

The third is doable, but we would need to make ZIL perform indirect
logging to avoid writing the data twice.

The fourth is already done for all IO in ZFS.

Other Linux filesystems such as ext4 do not follow #3. Mac OS X does not
implement O_DIRECT, but it does implement F_NOCACHE, which is similar to
#2 in that it prevents new data from being cached. AIX relaxes #3 by
only committing the file data to disk. Metadata updates that become
necessary when an operation makes the file larger are asynchronous
unless O_DSYNC is specified.

On Solaris and Illumos, there is a library function called directio(3C)
that allows userspace to provide a hint to the filesystem that DirectIO
is useful, but the filesystem is free to ignore it. The semantics are
also entirely a filesystem decision. Filesystems that do not implement
it return ENOTTY.

Given the lack of standardization and ZFS' heritage, one solution to
provide compatibility with userland processes that expect DirectIO is to
treat DirectIO as a hint that we ignore. This can be done trivially by
implementing a shim that maps aops->direct_IO to AIO. There is also
already code in ZoL for bypassing the page cache when O_DIRECT is
specified, but it has been inert until now.

If it turns out that it is acceptable for a filesystem driver to
interact with the page tables, the scatter-gather list work will need to
be finished and we would need to utilize the page tables to make
operations on the userland pages safe.

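To make semantic #1 concrete from the userland side, the following is a
minimal sketch and is not part of this patch. It checks whether a single
request meets an O_DIRECT-style alignment rule, assuming the buffer
address, file offset and length must all be multiples of the sector
size. The helper names (dio_is_pow2, dio_request_is_aligned, dio_alloc)
and any sector size passed in are assumptions made for illustration.

```c
#define _POSIX_C_SOURCE 200809L	/* for posix_memalign() */
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>
#include <sys/types.h>

/* Hypothetical helper: true if sector is a nonzero power of two. */
static bool
dio_is_pow2(size_t sector)
{
	return (sector != 0 && (sector & (sector - 1)) == 0);
}

/*
 * Hypothetical helper: true if the buffer address, the file offset and
 * the length are all multiples of the (assumed) sector size.
 */
static bool
dio_request_is_aligned(const void *buf, off_t offset, size_t len,
    size_t sector)
{
	size_t mask;

	assert(dio_is_pow2(sector));
	mask = sector - 1;

	if (offset < 0)
		return (false);

	return (((uintptr_t)buf & mask) == 0 &&
	    ((uint64_t)offset & mask) == 0 &&
	    (len & mask) == 0);
}

/*
 * Hypothetical helper: allocate a buffer whose address is sector
 * aligned. Returns NULL on failure; release it with free().
 */
static void *
dio_alloc(size_t len, size_t sector)
{
	void *buf = NULL;

	if (posix_memalign(&buf, sector, len) != 0)
		return (NULL);
	return (buf);
}
```

A caller could run dio_request_is_aligned(buf, offset, len, 512) before
issuing pwrite() on a descriptor opened with O_DIRECT and fall back to
buffered IO when it returns false. Since this patch treats the flag as a
hint, ZFS itself should not need such a check.
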
References:
http://xfs.org/docs/xfsdocs-xml-dev/XFS_User_Guide/tmp/en-US/html/ch02s09.html
https://blogs.oracle.com/roch/entry/zfs_and_directio
https://ext4.wiki.kernel.org/index.php/Clarifying_Direct_IO's_Semantics
https://illumos.org/man/3c/directio
https://developer.apple.com/library/mac/#documentation/Darwin/Reference/ManPages/man2/fcntl.2.html
https://lists.apple.com/archives/filesystem-dev/2007/Sep/msg00010.html

Signed-off-by: Richard Yao
---
 config/kernel-vfs-direct_IO.m4 | 72 ++++++++++++++++++++++++++++++++++
 config/kernel.m4               |  1 +
 module/zfs/zpl_file.c          | 27 +++++++++++++
 3 files changed, 100 insertions(+)
 create mode 100644 config/kernel-vfs-direct_IO.m4

diff --git a/config/kernel-vfs-direct_IO.m4 b/config/kernel-vfs-direct_IO.m4
new file mode 100644
index 000000000000..87797f4fc8fa
--- /dev/null
+++ b/config/kernel-vfs-direct_IO.m4
@@ -0,0 +1,72 @@
+dnl #
+dnl # Linux 4.1.x API change
+dnl #
+AC_DEFUN([ZFS_AC_KERNEL_VFS_DIRECT_IO],
+	[AC_MSG_CHECKING([whether fops->direct_IO() uses iov_iter without rw])
+	ZFS_LINUX_TRY_COMPILE([
+		#include <linux/fs.h>
+
+		ssize_t test_direct_IO(struct kiocb *kiocb,
+		    struct iov_iter *iter, loff_t offset)
+		    { return 0; }
+
+		static const struct address_space_operations
+		    fops __attribute__ ((unused)) = {
+		    .direct_IO = test_direct_IO,
+		};
+	],[
+	],[
+		AC_MSG_RESULT(yes)
+		AC_DEFINE(HAVE_VFS_DIRECT_IO_ITER, 1,
+		    [fops->direct_IO() uses iov_iter without rw])
+	],[
+		AC_MSG_RESULT(no)
+		dnl #
+		dnl # Linux 3.16.x API change
+		dnl #
+		AC_MSG_CHECKING([whether fops->direct_IO() uses iov_iter with rw])
+		ZFS_LINUX_TRY_COMPILE([
+			#include <linux/fs.h>
+
+			ssize_t test_direct_IO(int rw, struct kiocb *kiocb,
+			    struct iov_iter *iter, loff_t offset)
+			    { return 0; }
+
+			static const struct address_space_operations
+			    fops __attribute__ ((unused)) = {
+			    .direct_IO = test_direct_IO,
+			};
+		],[
+		],[
+			AC_MSG_RESULT(yes)
+			AC_DEFINE(HAVE_VFS_DIRECT_IO_ITER_RW, 1,
+			    [fops->direct_IO() uses iov_iter with rw])
+		],[
+			AC_MSG_RESULT(no)
+			dnl #
+			dnl # Ancient Linux API (predates git)
+			dnl #
+			AC_MSG_CHECKING([whether fops->direct_IO() uses iovec])
+			ZFS_LINUX_TRY_COMPILE([
+				#include <linux/fs.h>
+				ssize_t test_direct_IO(int rw,
+				    struct kiocb *kiocb,
+				    const struct iovec *iov, loff_t offset,
+				    unsigned long nr_segs)
+				    { return 0; }
+
+				static const struct address_space_operations
+				    fops __attribute__ ((unused)) = {
+				    .direct_IO = test_direct_IO,
+				};
+			],[
+			],[
+				AC_MSG_RESULT(yes)
+				AC_DEFINE(HAVE_VFS_DIRECT_IO_IOVEC, 1,
+				    [fops->direct_IO() uses iovec])
+			],[
+				AC_MSG_ERROR(no)
+			])
+		])
+	])
+])
diff --git a/config/kernel.m4 b/config/kernel.m4
index 8e8922ec7b88..f4eb7c01fec6 100644
--- a/config/kernel.m4
+++ b/config/kernel.m4
@@ -100,6 +100,7 @@ AC_DEFUN([ZFS_AC_CONFIG_KERNEL], [
 	ZFS_AC_KERNEL_LSEEK_EXECUTE
 	ZFS_AC_KERNEL_VFS_ITERATE
 	ZFS_AC_KERNEL_VFS_RW_ITERATE
+	ZFS_AC_KERNEL_VFS_DIRECT_IO
 
 	AS_IF([test "$LINUX_OBJ" != "$LINUX"], [
 		KERNELMAKE_PARAMS="$KERNELMAKE_PARAMS O=$LINUX_OBJ"
diff --git a/module/zfs/zpl_file.c b/module/zfs/zpl_file.c
index 5471140122ac..e91ede5c8346 100644
--- a/module/zfs/zpl_file.c
+++ b/module/zfs/zpl_file.c
@@ -396,6 +396,32 @@ zpl_aio_write(struct kiocb *kiocb, const struct iovec *iovp,
 }
 #endif /* HAVE_VFS_RW_ITERATE */
 
+static ssize_t
+#if defined(HAVE_VFS_DIRECT_IO_IOVEC)
+zpl_direct_IO(int rw, struct kiocb *kiocb, const struct iovec *iovp,
+    loff_t pos, unsigned long nr_segs)
+{
+#else
+#if defined(HAVE_VFS_DIRECT_IO_ITER_RW)
+zpl_direct_IO(int rw, struct kiocb *kiocb, struct iov_iter *from,
+    loff_t pos)
+{
+#elif defined(HAVE_VFS_DIRECT_IO_ITER)
+zpl_direct_IO(struct kiocb *kiocb, struct iov_iter *from, loff_t pos)
+{
+	int rw = iov_iter_rw(from);
+#else
+#error "No function prototype found for DirectIO"
+#endif
+	const struct iovec *iovp = from->iov;
+	unsigned long nr_segs = from->nr_segs;
+#endif
+	if (rw == WRITE)
+		return (zpl_iter_write_common(kiocb, iovp, nr_segs, kiocb->ki_nbytes));
+	else
+		return (zpl_iter_read_common(kiocb, iovp, nr_segs, kiocb->ki_nbytes));
+}
+
 static loff_t
 zpl_llseek(struct file *filp, loff_t offset, int whence)
 {
@@ -799,6 +825,7 @@ const struct address_space_operations zpl_address_space_operations = {
 	.readpage = zpl_readpage,
 	.writepage = zpl_writepage,
 	.writepages = zpl_writepages,
+	.direct_IO = zpl_direct_IO,
 };
 
 const struct file_operations zpl_file_operations = {