Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Implement FIDUPERANGE for Linux. #15393

Open
wants to merge 3 commits into
base: master
Choose a base branch
from
Open

Conversation

dberlin
Copy link
Contributor

@dberlin dberlin commented Oct 10, 2023

This resolves issue #11065 by implementing FIDUPERANGE on Linux.

It is a draft because i need a little help from someone a little more familiar with ZFS code to
see if there is a better way to do the locking. (once it's all ready, i'll close this PR and create one with
the proper template).

FIDUPERANGE takes a src file, offset, length, and then an array of places to duplicate the range into.
The result status is per-destination.

For this, the correct locking (or at least one version) looks like this (IMHO, of course :P):

for each destination to duplicate a range into:
A. read lock the src range/offset
B. write lock the dst range/offset
C. compare the src and dest ranges to ensure they are the same.
D. if same, clone src into dst
E. release write lock.
F. release read lock.

Because the results are per-destination, it's actually okay if the src file changes after F but before the next iteration - we will simply report that the ranges for remaining destination are no longer the same when we compare them, and not remap.

Here are the immediate issues:

  1. zfs_read (which i use to read and compare the files) wants to hold the read lock during C above. That is not possible since we have it write locked and they are not recursive locks. It's not obvious to me if zfs_read is the wrong thing to use here, or if i should create a zfs_read_already_locked that takes the locks, verifies they are right for the range, and proceeds without further locking.

  2. zpl_clone_file_range has the same issue (it wants to write lock the destination).

Note: We could hold the read lock for all iterations of the loop, and avoid re-reading the src range for each destination.
I have not bothered to explore this tradeoff.

This resolves issue openzfs#11065 by implementing FIDUPERANGE on Linux.
Copy link
Contributor

@oromenahar oromenahar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

didn't checked the pipeline and didn't though much about it. But this need some changes I think.

Comment on lines 150 to 154
if (flags & REMAP_FILE_DEDUP) {
/* Zero length means to clone everything to the end of the file
*/
if (len == 0)
len = i_size_read(file_inode(src_file)) - src_off;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
if (flags & REMAP_FILE_DEDUP) {
/* Zero length means to clone everything to the end of the file
*/
if (len == 0)
len = i_size_read(file_inode(src_file)) - src_off;
/* Zero length means to clone everything to the end of the file */
if (len == 0)
len = i_size_read(file_inode(src_file)) - src_off;
if (flags & REMAP_FILE_DEDUP) {

this must stay on top of the REMAP_FILE_DEDUP condition. Because if the REMAP_FILE_DEDUP-flag isn't set we still need to calculate the len to the end of the file.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep. I already fixed that but push the update.

@behlendorf behlendorf added the Status: Code Review Needed Ready for review and testing label Oct 10, 2023
@dberlin
Copy link
Contributor Author

dberlin commented Oct 11, 2023

didn't checked the pipeline and didn't though much about it. But this need some changes I think.

Also, it definitely needs changes, as without fixing the locking issue mentioned (IE as written here) it will currently deadlock in zfs_read.

For testing, i have changed it locally to have a small race condition that i never trigger instead - The race being as described above - without being able to hold the write lock over the entire compare + clone period, there is a race between compare and clone.

Put another way - if you want a testable version to play around with, i'm happy to push changes that are complete but have the race condition instead.

@dberlin
Copy link
Contributor Author

dberlin commented Oct 11, 2023

I have pushed a version that introduces locked versions of zfs_read and zfs_clone_range and uses them.
I have not attempted to common out most of the the existing functions.

This version is untested (working on it), and simply meant to see what the initial changes would look like.
I'll test it and refine it.

@dberlin
Copy link
Contributor Author

dberlin commented Oct 13, 2023

I have verified the current version at least trivially works, in the sense that it calls clone range on the correct ranges, and clone range says it cloned bytes, and the results have the same checksums/share blocks/etc.

@dberlin
Copy link
Contributor Author

dberlin commented Oct 15, 2023

Okay, i figured out how to use zdb to show the actual blocks being used.
I can verify that prior to FIDEDUPERANGE, they have different blocks, and after, all files share the same blocks.

So it's just the space usage that is not showing it through any method, and i can verify that this occurs with normal block cloning.

@behlendorf behlendorf added the Type: Feature Feature request or new feature label Dec 5, 2023
Comment on lines +1564 to +1572
*/
int
zfs_clone_range_locked(znode_t *inzp, uint64_t *inoffp, znode_t *outzp,
uint64_t *outoffp, uint64_t *lenp, cred_t *cr, zfs_locked_range_t *src_zlr,
zfs_locked_range_t *dst_zlr)
{
zfsvfs_t *inzfsvfs, *outzfsvfs;
objset_t *inos, *outos;
dmu_buf_impl_t *db;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like to copy zfs_clone_range to a new function. For example: encryption check is not done here, because it was added to zfs_clone_range but not to this function. To copy all of the "new" checks to this function is bad and should be avoided if possible IMO. Based on your comment, I would suggest to pass a boolen_t to zfs_clone_range, which can be used to check if the lock is already held. Maybe it's also possible to check if the range locks are already held.

Copy link
Contributor

@oromenahar oromenahar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this can work, but the duplicated code need to be removed if possible?

@dberlin
Copy link
Contributor Author

dberlin commented Dec 8, 2023

I think this can work, but the duplicated code need to be removed if possible?

Yeah, I have no problem doing that, i just didn't want to go down the path if folks have objections.

I'll clean it up

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Status: Code Review Needed Ready for review and testing Type: Feature Feature request or new feature
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants