Support fallocate(2) #326
It is potentially difficult to meaningfully implement fallocate() for ZFS, or any true COW filesystem. The intent of fallocate is to pre-allocate/reserve space for later use, but with a COW filesystem the pre-allocated blocks cannot be overwritten without allocating new blocks, writing into the new blocks, and releasing the old blocks (if not pinned by snapshots). In all cases, having fallocated blocks (with some new flag that marks them as zeroed) cannot be any better than simply reserving some blocks out of those available for the pool, and somehow crediting a dnode with the ability to allocate from those reserved blocks.
Exactly. Implementing this correctly would be tricky and perhaps not that valuable since fallocate(2) is Linux-specific. I would expect most developers to use the more portable posix_fallocate() which presumably falls back to an alternate approach when fallocate(2) isn't available. I'm not aware of any code which will be too inconvenienced by not having fallocate(2) available... other than xfstests apparently.
Well, you could in theory do something tricky like just creating a sparse file of the correct size. This would avoid the wasted space of storing the zeroed-out data that wouldn't be reusable anyway due to COW. It would unfortunately break the contract that you won't get ENOSPC, but you can't give that guarantee with COW, and you would be less likely to hit it after using an enhanced posix_fallocate() since it wouldn't be wasting space on the zeroed pages. Out of curiosity, would there be any difference in the final on-disk layout of a sparse file that is filled in vs a file that is first allocated by zero-filling? I work on mongodb and we use posix_fallocate to quickly allocate large files that we can then mmap. It seems to be the quickest way to preallocate files and have a high probability of contiguous allocations (which again isn't possible due to COW). While I doubt anyone will try to run mongodb on zfs-linux anytime soon (my interest in the project is for a home server), I just wanted to give feedback from a user-space developer's point of view.
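For reference, the pattern described above looks roughly like this (a minimal sketch of posix_fallocate() followed by mmap(); the file name and size are illustrative, not taken from MongoDB):

```c
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const off_t size = 64 * 1024 * 1024;   /* illustrative 64 MiB data file */
    int fd = open("datafile.0", O_RDWR | O_CREAT, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Preallocate the whole file up front; on filesystems with native
     * support this is fast and tends to produce contiguous extents. */
    int err = posix_fallocate(fd, 0, size);
    if (err != 0) {
        fprintf(stderr, "posix_fallocate: %s\n", strerror(err));
        close(fd);
        return 1;
    }

    /* Map the preallocated file and write into it directly. */
    void *map = mmap(NULL, (size_t)size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED) {
        perror("mmap");
        close(fd);
        return 1;
    }
    memset(map, 0xab, 4096);                /* touch the first page */

    munmap(map, (size_t)size);
    close(fd);
    return 0;
}
```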
Commit cb2d190 should have closed this.
I was leaving this issue open because the referenced commit only added support for FALLOC_FL_PUNCH_HOLE. There are still other fallocate flags which are not yet handled.
@dechamps this doesn't seem to be working for 3.6.x. Looking at your patch for this it looks like this is expected. Is there an update for recent kernels?
3 = fd
My patch only implements `FALLOC_FL_PUNCH_HOLE`.
@dechamps thanks for that clarification. Even with FALLOC_FL_PUNCH_HOLE (only):
02 = mode, FALLOC_FL_PUNCH_HOLE
Apparently fallocate is still not supported on ZFS?
@RJVB
You can always use:
As of 0.6.4 the `FALLOC_FL_PUNCH_HOLE` operation is supported.
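For anyone finding this thread, hole punching on Linux is requested like this (a minimal sketch; note that FALLOC_FL_PUNCH_HOLE must be combined with FALLOC_FL_KEEP_SIZE, and the file name and offsets here are made up):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("somefile", O_RDWR);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Deallocate 1 MiB starting at offset 4 MiB; the file size is kept. */
    if (fallocate(fd, FALLOC_FL_PUNCH_HOLE | FALLOC_FL_KEEP_SIZE,
                  4 * 1024 * 1024, 1024 * 1024) == -1) {
        perror("fallocate(FALLOC_FL_PUNCH_HOLE)");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}
```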
I'm using 0.7.2-1, and I noticed that if you run the following:

```c
//usr/bin/env make -s "${0%.*}" && ./"${0%.*}" "$@"; s=$?; rm ./"${0%.*}"; exit $s

#include <fcntl.h>
#include <stdio.h>
#include <errno.h>
#include <string.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("./randomfile", O_WRONLY | O_CREAT, S_IRUSR | S_IWUSR);
    if (fd == -1) {
        perror("open()");
        return 1;
    }
    int status = posix_fallocate(fd, 0, 100);
    if (status != 0) {
        printf("%s\n", strerror(status));
    }
    close(fd);
    return 0;
}
```

Running the above on an empty or non-existent file works fine, but as soon as you run it again, it fails with EBADF. This is a bit strange behaviour.
@CMCDragonkai that does seem odd. Can you please open a new issue with the above comment so we can track and fix it?
Is allocating disk space (set mode = 0) supported now? BTW, will
No, because ZFS's copy-on-write semantics just plain don't allow that.
@behlendorf While it is not possible (due to CoW) to have a fully working `fallocate`, how would you consider implementing a "fake" fallocation, where the call simply returns success (for example by creating a sparse file of the requested size)? This would spare applications[1] from writing the entire file with zeroes just to preallocate it.

[1] One such application is virt-manager: RAW disk images are, by default, fully fallocated. Depending on disk size, this means GB or TB of null data (zeroes) written to HDDs/SSDs.
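From user space, the proposed "fake" allocation would behave roughly like the sketch below: the file is extended to the requested size without writing any data blocks. This is only an illustration of the intended semantics (file name and size are made up), not existing ZFS code:

```c
#define _FILE_OFFSET_BITS 64
#include <fcntl.h>
#include <stdio.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    const off_t size = (off_t)10 << 30;     /* illustrative 10 GiB disk image */
    int fd = open("disk.raw", O_RDWR | O_CREAT, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* A "fake" fallocate: extend the file to the requested size without
     * writing any data blocks.  The file is sparse, so no zeroes hit the
     * disk, but readers still see a file of the full size. */
    if (ftruncate(fd, size) == -1) {
        perror("ftruncate");
        close(fd);
        return 1;
    }

    close(fd);
    return 0;
}
```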
I'd say that makes sense.
According to my fallocate man page:

> The default operation (i.e., mode is zero) of fallocate() allocates and initializes to zero the disk space within the range specified by offset and len. The file size (as reported by stat(2)) will be changed if offset+len is greater than the file size.
I don't see how that is different from writing `len` zeros to disk starting at `offset`; am I missing something? Either way, maybe doing that write inside the ZFS driver means it can be done more efficiently than when using userland calls?
@RJVB On filesystems supporting `fallocate`, the call basically gives you two things:

1. space reservation: subsequent writes to the preallocated range are not supposed to fail with ENOSPC;
2. fast file allocation: as no user data are written (and only some very terse metadata are flushed to disk), `fallocate` returns almost immediately, enabling very fast file allocations.

Point n.1 (space reservation) is at odds with ZFS because, being a CoW filesystem, it by its very nature continuously allocates new data blocks while keeping track of past ones via snapshots. This means that you can't really count on the preallocated blocks; to get a real reservation you would have to tap into the reservation and/or quota properties. However, if I remember correctly, these properties only apply to an entire dataset, rather than to a single file.

And here comes point n.2 - fast file allocation. On platforms where `fallocate` is not available, the libc function can force a full file allocation by writing zeroes for all its length. This is very slow, causes unnecessary wear on SSDs *and* is basically useless on ZFS. A sketch of that fallback is shown below.

Opinions are welcomed!
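To illustrate point 2: when the kernel call is unavailable, a portable posix_fallocate() fallback ends up doing something like the sketch below, which is exactly the slow, write-heavy behaviour that buys nothing on a CoW filesystem. This is a deliberately simplified illustration, not any libc's actual code:

```c
#define _XOPEN_SOURCE 700
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Simplified illustration of a userland fallback: force block allocation
 * by writing one zero byte into every block of the requested range.
 * Real implementations are more careful (they avoid clobbering existing
 * data), but the I/O cost is what matters here. */
static int fallback_fallocate(int fd, off_t offset, off_t len, off_t blocksize)
{
    char zero = 0;

    for (off_t pos = offset; pos < offset + len; pos += blocksize) {
        if (pwrite(fd, &zero, 1, pos) != 1)
            return errno;
    }
    /* Make sure the very last byte of the range is allocated as well. */
    if (len > 0 && pwrite(fd, &zero, 1, offset + len - 1) != 1)
        return errno;
    return 0;
}
```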
> as no user data are written (and only some very terse metadata are flushed to disk), `fallocate` returns almost immediately, enabling very fast file allocations.

You'll notice that the manual does mention writing 0s to disk.
And as to COW systems being at odds with space reservation: btrfs supports it and AFAIK that's a COW filesystem too.
> tap into the reservation and/or quota properties. However, if I remember correctly, these properties only apply to an entire dataset, rather than to a single file.

That doesn't matter for the space reservation aspect, right? Available space is not a per-file property but one of the filesystem/dataset. OK, the file size may not show up as one might expect, but aren't we all used to open files not always showing their exact size, while available disk space does show the actual value?
> the libc function can force a full file allocation by writing zeroes for all its length. This is very slow, causes unnecessary wear on SSDs *and* is basically useless on ZFS.

Why is it basically useless? You still get the space reservation, no? My point is that doing the write in the driver might be more efficient. I don't disagree with your suggestion, but software that nowadays falls back to the brute-force method because fallocate() fails might start behaving unexpectedly. Maybe a driver parameter that can be controlled at runtime could activate a low-level actual write-zeroes-to-disk implementation?
Another solution would be to implement a per-file reservation attribute, something like a simulated file size (which can only be larger than the actual file size) which is taken into account for the determination of available disk space (but not for used disk space?). I really fail to see how this would not provide a usable implementation. There are probably combinations with existing file size/content and the fallocate offset/size parameter that I can't get my head around, but you should always be able to let the fallocate() call fail if you detect one of those.
That wouldn't be the 1st ZFS feature that is developed on ZoL first and only then presented for upstreaming to OpenZFS. And this one might be easy to get upstreamed; I presume *BSD have fallocate too (or could use some form of it). And FWIW, the fake fallocate feature would probably have to be presented for upstreaming too.
fallocate on BTRFS behaves differently than on non-CoW filesystems: while it really allocates blocks for the selected file, any rewrite (after the first write) triggers a new block allocation. This means that file fragmentation is only slightly reduced, and it can potentially expose some rough corners with a near-full filesystem.
If you tap into the existing quota/reservation system (which, anyway, operates on datasets rather than single files), yes, I'll end up with a working space reservation. But if you only count on the fallocated reserved blocks, any snapshot pinning old data can effectively cause an out-of-space condition even when writing to fallocated files. Something similar to this: fallocate a file, snapshot the dataset, then rewrite the file; the rewrite must allocate new blocks because the old ones are pinned by the snapshot, so it can still fail with ENOSPC.
I really fail to see why a user-space application should fail when presented with a sparse file rather than a preallocated one. However, as you suggest, simply let the option be user selectable. In short, while a BTRFS-like fallocate would be the ideal solution, even a fake (user-selectable) implementation would be desirable.
@von-copec We are only talking about fallocate() in the ZFS POSIX layer (ZPL) filesystems. If someone puts ext4 on top of a zvol or a file, then ext4 still behaves exactly as it always has.
I understand, I was attempting to emphasize that the behavior of another filesystem layer versus the ZPL would be considered correct when it is on top of a (sparse) ZVOL, and so the ZPL doing the same thing would be the "same amount of correctness".
An update on this topic. In the course of implementing fallocate(mode=0) for Lustre-on-ZFS (https://review.whamcloud.com/36506) the `dmu_prealloc()` function came up as a possible building block.

Several open questions exist, since there is absolutely no documentation anywhere about this code:

This seems to be a path toward implementing fully-featured ZFS `fallocate()` support.

If this doesn't work out, it still seems practical to go the easy route, for which I've made a simple patch that implements what was previously described here and could hopefully be landed with a minimum of fuss. I don't have any idea how long it would take the dmu_prealloc() approach to finish, but it would need the changes in my patch anyway.
> Several open questions exist, since there is absolutely no documentation anywhere about this code:

Probably an open door, but have you tried to answer your questions by poking around in an Illumos implementation?
Yes, the Illumos implementation references this function exactly once, in the code path referenced above, but no actual comments exist in the code that describe these functions.
> Yes, the Illumos implementation references this function exactly once, in the code path referenced above, but no actual comments exist in the code that describe these functions.

I meant empirically, triggering the situations/behaviours you have questions about.
@adilger Am I right that this preallocation would use the preallocated blocks for the first write only? If so, this seems somewhat similar to the BTRFS approach, and I am missing why an application (Lustre, in this case) should expect true preallocation from ZFS at all.

Disabling compression and checksums seems a way too high price to pay for the very limited benefit (if any) which can be obtained by "true" preallocation on ZFS. Considering how applications actually use fallocate, wouldn't a simple check for available space, followed by creating a sparse file, be good enough?
@shodanshok, I understand and agree that all of those issues exist. Lustre is a distributed parallel filesystem that layers on top of ZFS, so it isn't the thing that is generating the fallocate() request. It is merely passing on the fallocate() request from a higher-level application down to ZFS, after possibly remapping the arguments appropriately.
I've essentially done exactly that with my PR #10408. However, while this probably works fine for a large majority of use cases, it would fail if e.g. an application is trying to fallocate multiple files in advance of writing, or in parallel, but there is not actually enough free space in the filesystem. In that case, each individual fallocate() call would verify enough space is available, but the aggregate of those calls is not available. Fixing this would need "write once" semantics for reserved blocks (similar to what `dmu_prealloc()` might provide).
Implement semi-compatible functionality for mode=0 (preallocation) and mode=FALLOC_FL_KEEP_SIZE (preallocation beyond EOF) for ZPL.

Since ZFS does COW and snapshots, preallocating blocks for a file cannot guarantee that writes to the file will not run out of space. Even if the first overwrite was guaranteed, it would not handle any later overwrite of blocks due to COW, so strict compliance is futile. Instead, make a best-effort check that at least enough free space is currently available in the pool (with a bit of margin), then create a sparse file of the requested size and continue on with life.

This does not handle all cases (e.g. several fallocate() calls before writing into the files when the filesystem is nearly full), which would require a more complex mechanism to be implemented, probably based on a modified version of dmu_prealloc(), but is usable as-is.

A new module option zfs_fallocate_reserve_percent is used to control the reserve margin for any single fallocate call. By default, this is 110% of the requested preallocation size, so an additional 10% of available space is reserved for overhead to allow the application a good chance of finishing the write when the fallocate() succeeds.

If the heuristics of this basic fallocate implementation are not desirable, the old non-functional behavior of returning EOPNOTSUPP for calls can be restored by setting zfs_fallocate_reserve_percent=0.

The parameter of zfs_statvfs() is changed to take an inode instead of a dentry, since no dentry is available in zfs_fallocate_common().

A few tests from @behlendorf cover basic fallocate functionality.

Reviewed-by: Richard Laager <rlaager@wiktel.com>
Reviewed-by: Arshad Hussain <arshad.super@gmail.com>
Reviewed-by: Matthew Ahrens <mahrens@delphix.com>
Co-authored-by: Brian Behlendorf <behlendorf1@llnl.gov>
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Issue #326
Closes #10408
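In outline, the check described in this commit message amounts to something like the following. `zfs_fallocate_reserve_percent` is the module option named above; `available_bytes`, `requested_len` and the function itself are illustrative names, not the actual ZPL code:

```c
#include <errno.h>
#include <stdint.h>

/* Module option named in the commit message above: percent of the requested
 * size that must be free; 0 restores the old EOPNOTSUPP behaviour. */
static unsigned int zfs_fallocate_reserve_percent = 110;

/* Best-effort mode=0 check: returns 0 when the preallocation may proceed
 * (by simply extending the file sparsely), or a negative errno. */
static int fallocate_reserve_check(uint64_t available_bytes, uint64_t requested_len)
{
    if (zfs_fallocate_reserve_percent == 0)
        return -EOPNOTSUPP;                 /* feature disabled */

    /* Require the requested length plus a margin (110% => 10% slack). */
    uint64_t need = requested_len / 100 * zfs_fallocate_reserve_percent;

    return (need <= available_bytes) ? 0 : -ENOSPC;
}
```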
Closing. As discussed above, basic `fallocate(2)` support has been added.
This bug is fixed in MariaDB 10.1.48, 10.2.35, 10.3.26, 10.4.16, 10.5.7 by MariaDB Pull Request #1658, i.e. by adding fall-back logic for the EOPNOTSUPP error code.
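The fall-back pattern referred to here is, in outline, the following (a generic sketch, not MariaDB's actual code; the helper name is made up):

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Generic fall-back: try the native call first; if the filesystem reports
 * EOPNOTSUPP (as ZFS used to for mode=0), degrade gracefully instead of
 * failing the whole operation.  Here the degraded path just extends the
 * file without reserving blocks. */
static int allocate_file(int fd, off_t offset, off_t len)
{
    if (fallocate(fd, 0, offset, len) == 0)
        return 0;
    if (errno != EOPNOTSUPP)
        return -1;
    return ftruncate(fd, offset + len);     /* best effort, no reservation */
}
```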
ZFS does not support fallocate[1], which means that if the `pre-allocate` option is set to true, qBittorrent will error when adding torrent files.

[1] openzfs/zfs#326
Observed by xfstests 075: fallocate(2) is not yet supported.
"fallocate is used to preallocate blocks to a file. For filesystems which support the fallocate system call, this is done quickly by allocating blocks and marking them as uninitialized, requiring no IO to the data blocks. This is much faster than creating a file by filling it with zeros."
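A minimal reproducer of the unsupported call looks like this (a sketch; `./testfile` is assumed to live on the ZFS filesystem under test):

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

int main(void) {
    int fd = open("./testfile", O_RDWR | O_CREAT, 0644);
    if (fd == -1) {
        perror("open");
        return 1;
    }

    /* Plain preallocation (mode = 0); at the time this issue was filed,
     * ZFS returned EOPNOTSUPP for this call. */
    if (fallocate(fd, 0, 0, 1024 * 1024) == -1)
        perror("fallocate");

    close(fd);
    return 0;
}
```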