Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Wrong b2sum stored in xattr #439

Closed
phiresky opened this issue Sep 24, 2020 · 8 comments
Closed

Wrong b2sum stored in xattr #439

phiresky opened this issue Sep 24, 2020 · 8 comments

Comments

@phiresky
Copy link

I recently noticed that rmlint wasn't finding some duplicate files when using --xattr, but it was finding them without it. I used getfattr to look at the values, and one of each pair of files had a wrong user.rmlint.blake2b.cksum= stored in the file.

One reason I can think of is that when using -Q / -q in conjunction with --xattr, the wrong blake2b sum is stored in the xattrs with the same attribute name, which cannot be detected later, so you get false negatives.

I couldn't find any command in my history that actually ran rmlint -q on these files, but it's the only thing I could think and verified that it can happen. Are there any other ways wrong hashes in xattr could happen? These are image files so they are basically immutable, I never modify them, and they have the same mtime.

@sahib
Copy link
Owner

sahib commented Sep 24, 2020

Does this happen reliably, i.e. always the same files get the wrong xattr? Also, does it also fail if you copy those files in a common directory and run rmlint on them there?

@phiresky
Copy link
Author

I tried to make a minimal reproduction but couldn't manage to so far. It was definitely the case for a lot of .ARW (raw camera image) files, but I didn't try running rmlint with xattr on just a pair of files where it happened before (only without --xattr).

@vb0
Copy link

vb0 commented Nov 27, 2020

I am getting the same as here and also reported in the other mentioned #436 . In my case there it isn't anything else involved like ntfs, wsl, Raspberry Pi, etc. - it's a plain x86 box (albeit only with 8GB RAM and quite a few files) with one single (non-root and not much used, just archival for pics and videos) drive - 6TB btrfs volume. All native with Ubuntu 20.10. Small runs will store the expected b2sum in user.rmlint.blake2b.cksum but the large/complete run on the volume (which I've done only once) created as far as I can tell only wrong check-sums. The files aren't changed in any way, most of them for more than 10 years (and I have md5sums for lots of them and indeed they didn't change). I don't have ECC RAM but the box is really stable and has some other btrfs volumes and I'm doing quite a bit of file shuffling, if there would be some hardware issue that can corrupt checksums btrfs (and not only) would cry bloody murder.

As I can't reproduce it with a few files I'll have to wipe the xattrs and try again for the whole volume (this will take quite a while).

@vb0
Copy link

vb0 commented Nov 27, 2020

Ok, I've found the problem - it's coming from -C and more specifically from --write-unfinished.
Basically if the files are similar in size but differ early enough so they don't have to be read completely a (useless and misleading in the future) partial checksum will be written in user.rmlint.blake2b.cksum. Simplest way to reproduce below.

dd if=/dev/zero of=10m bs=1000000 count=10
echo "a" | cat - 10m > a10m
echo "b" | cat - 10m > b10m
rmlint -C
b2sum *
getfattr -n user.rmlint.blake2b.cksum *

@sahib
Copy link
Owner

sahib commented Dec 18, 2020

Indeed, half finished files write checksums and I agree that this is pretty confusing. Since the next rmlint run can't tell if those were unfinished it's also dangerous. I now pushed a commit ( 8c2784e ) that should make the situation a bit better - it also forbids writing unfinished checksums if -q / -Q was used.

I will decide later what to do with --write-unfinished and uipdate the docs accordingly.

Can you guys check if this fixes it for you?

@phiresky
Copy link
Author

Wow that's a really obvious reason for this to be the case, nice :) I thought it was some complicated bug. I did not think about --xattr implying --write-unfinished.

@vb0
Copy link

vb0 commented Jan 20, 2021

This works, no unfinished checksum is written at least for the simple cases I can easily control and where I've been able to reproduce it previously (just tried with the latest from develop branch). Thanks a lot!

I think some users that have already the bad check-sums will need some cleanup, I ended up with a veeeery crude and probably easy to break:

getfattr -R -h --absolute-names --no-dereference . > /tmp/to_remove
grep "#" /to_remove | cut -c9- | xargs -d '\n' xattr -d user.rmlint.blake2b.cksum
grep "#" /tmp/to_remove | cut -c9- | xargs -d '\n' xattr -d user.rmlint.blake2b.mtime

@SeeSpotRun
Copy link
Collaborator

In https://github.com/sahib/rmlint/tree/develop branch, --write-unfinished is now deprecated.
New options --hash-unmatched and --hash-uniques provide a more robust alternative. Both only write completed checksums.

So now rmlint --xattr --hash-uniques will hash every (non-zero-size) file in the search path.

And rmlint --xattr --hash-unmatched will only hash files that might have had a duplicate. This generally means there was another file with the exact same size.

Both options are slower than the previous --write-unfinished, especially --hash-uniques, but the previous ambiguity is resolved.

Closing issue. Feel free to re-open if you want to continue the conversation.

cebtenzzre added a commit that referenced this issue Jun 20, 2023
Credit to Chris (@sahib) for the original fix. This version only skips
writing checksums when -q/-Q are used, and sets clamp_is_used for both
absolute and relative clamping.

There are still false-positives and false-negatives lurking in this
logic, but I'd rather fix features later than disable them now.
--write-unfinished can provide a significant speedup.

Related to #439
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants