
Print only warnings when a data integrity problem occurs (in conjunction with crontab reports to root) #29

Open
szel opened this issue Jan 8, 2020 · 35 comments

Comments


szel commented Jan 8, 2020

Hello!
I have found scorch while researching data integrity tools available for Linux and one of the users mentioned your script.

The script works beautifully and as expected. Thank you for creating it; great job!

Would it be possible to adjust it so it would only print information when there is an actual data integrity problem ("FAILED")?

When used with crontab this would allow setting up a cron job; by default, cron sends an e-mail to the user (i.e. root) when there is any output on STDOUT.

Right now the script also shows information about "CHANGED" hashes, which defeats the purpose without additional scripting.

What do you think?


Wintersdark commented Jan 8, 2020 via email


Wintersdark commented Jan 8, 2020 via email


trapexit commented Jan 8, 2020

You need to provide more detail. How are you running the script?

There is a difference between "CHANGED" and "FAILED". "CHANGED" means the file is no longer the same: its metadata differs, as determined by the "--diff-fields" argument. "FAILED" means the hash is different but the size (and the other diff fields) are the same.
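For readers skimming the thread, the distinction can be sketched in a few lines of Python. The classify() helper and the dictionary shape are illustrative assumptions, not scorch's actual code:

```python
# Illustrative sketch of scorch's CHANGED vs FAILED distinction.
# The classify() helper and field names are hypothetical, not scorch's code.

def classify(old, new, diff_fields=("size", "mtime")):
    # Any difference in the configured --diff-fields means the file was
    # modified through normal use: report CHANGED.
    for field in diff_fields:
        if old[field] != new[field]:
            return "CHANGED"
    # Metadata matches but the hash differs: the contents were altered
    # silently, so report FAILED (possible bitrot).
    if old["digest"] != new["digest"]:
        return "FAILED"
    return "OK"

old = {"size": 100, "mtime": 1000, "digest": "aaa"}
print(classify(old, {"size": 120, "mtime": 2000, "digest": "bbb"}))  # CHANGED
print(classify(old, {"size": 100, "mtime": 1000, "digest": "bbb"}))  # FAILED
print(classify(old, {"size": 100, "mtime": 1000, "digest": "aaa"}))  # OK
```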


trapexit commented Jan 8, 2020

If you don't want to find changed files (which could be risky) then set the "diff-fields" to an empty string.


trapexit commented Jan 8, 2020

Actually, it looks like it skips to the hash check if the diff check doesn't return anything, so those files will show as FAILED instead of CHANGED. I could either change the verbosity setup so changes only print when verbosity is set high enough, or add another argument to ignore changes altogether. Let me look.


trapexit commented Jan 8, 2020

As for regex... it's just regular python regex and should work with any common regex pattern. What are you trying to do?


trapexit commented Jan 8, 2020

Maybe store mtime along with the hash, and have an option to ignore changes
if the mtime on the file is newer than in the db?

mtime, inode, size, and mode are stored in the DB. That's how it determines a file "changed". A better way of not printing changes when they're unnecessary data for you would be to change the verbosity settings so it's optional. By default it only prints CHANGED and FAILED. I could change the default to just FAILED and move CHANGED to verbose, I guess, though that would mean moving the current verbose output up a level.

An alternative would be to run an "update" first to refresh the metadata on changed files, then run the "check". "update" is cheaper: it only does the file diff check, not the hash check (though it does rehash updated files).


trapexit commented Jan 8, 2020

That said... if you don't care about NFO files then just add -F -f ".*.nfo" and it'll only process files that don't match the pattern (i.e. files not ending in ".nfo").


szel commented Jan 8, 2020

Motivation

I was hoping to use scorch purely as a bitrot / corruption detection tool, with cron sending STDOUT to root's e-mail, and normal rsync backups to another server so I can recover in time.

My files change on a daily basis; I use rsync's "mtime / size" comparison to detect changes, and an external backup in combination with hardlinks to create external snapshots of the data.

The use case would be the following (all commands run as the root user; there is nothing on the machine but my data):

1. Initialize checksums for /home

# scorch -d /var/lib/scorch.db add /home

2. Create cron script for periodic data scrubbing

# test -d /root/cron || mkdir -p /root/cron
# cd /root/cron
# cat << EOF > ./scorch.sh
#!/bin/sh
scorch -D 'size,mtime' -d /var/lib/scorch.db check+update /home
scorch -D 'size,mtime' -d /var/lib/scorch.db append /home
scorch -D 'size,mtime' -d /var/lib/scorch.db cleanup /home
EOF
# chmod +x ./scorch.sh
# chmod o-rwx ./scorch.sh

3. Install cron to run once per month

# crontab -e
# m h  dom mon dow   command
# run data integrity (bitrot) checks on 15th of each month at 03:00 and report to root e-mail
0  3   15 * *		/root/cron/scorch.sh

If there is anything on the script's STDOUT, it will be sent to root's e-mail (my machine has /etc/aliases configured and e-mail working).

4. Configure external backups and keep data since last data scrub

In order to recover when bitrot is detected, we need to know which files are good, and we need proper backups.

Here is a sample rsync script that I use:

#!/bin/bash

backup_date=$(date +%Y-%m-%d)

# configure where to keep the last backup and daily archives / snapshots
local_backup_dir="/mnt/backup/home/current"
local_archive_dir="/mnt/backup/home/archives"

# default to success; overwritten only if rsync fails
return_status=0

# show commands and stop on first error
set -x
set -e

# show current date
date

# backup via ssh and rsync
ionice -c 3 rsync -ave ssh --numeric-ids --delete --rsync-path="ionice -c 3 rsync" \
root@SERVER_IP:/home/ "${local_backup_dir}/" || return_status="$?"

# ignore "vanished files" (rsync exit code 24)
if [ "$return_status" -eq 24 ]; then
    return_status=0
fi

# show when backup finished copying data
date

backup_time=$(date +%Y%m%d%H%M)

# create snapshot from current backup using hard links
cp -al "$local_backup_dir" "$local_archive_dir/$backup_date"
touch -m -t "$backup_time" "$local_archive_dir/$backup_date"

# purge archives / snapshots older than 90 days
find "$local_archive_dir" -mindepth 1 -maxdepth 1 -mtime +90 -type d | xargs -I {} rm -rf {}

# show when the script ended
date

# propagate rsync's exit code
exit "$return_status"


szel commented Jan 8, 2020

I have updated my original answer.

For me any of the following would be great:

  1. The ability to launch scorch in a mode where it only logs to STDOUT or STDERR when bitrot / data corruption is detected.
  2. Making scorch log FAILED events to STDERR.

If you have any other questions please don't hesitate to ask :-)


Wintersdark commented Jan 8, 2020 via email


trapexit commented Jan 8, 2020

You can use --maxdata and/or --maxactions (with --sort=random) to run checks more regularly. You'd catch things more quickly.

I prefer to talk in terms of general features rather than implementations. If I understand you right, you don't care about "changes"? You have lots of legitimate changes, and all those changes drown out the FAILED values?

BTW... check+update doesn't update on FAILED, only on "CHANGED".


trapexit commented Jan 8, 2020

re regex: ^.*\.(ext1|ext2|ext3)$

That will match those extensions. If you want to negate the regex easily, just use -F.
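As a quick sanity check, this is how that pattern behaves in plain Python; the filenames are made up, and inverting the boolean stands in for scorch's -F flag:

```python
import re

# The extension-matching pattern from the comment above.
pattern = re.compile(r'^.*\.(ext1|ext2|ext3)$')

files = ["movie.ext1", "notes.txt", "clip.ext3", "ext1"]

# Normal matching: keep files that match the pattern.
matched = [f for f in files if pattern.match(f)]
# Negated matching (what -F does): keep files that do NOT match.
negated = [f for f in files if not pattern.match(f)]

print(matched)  # ['movie.ext1', 'clip.ext3']
print(negated)  # ['notes.txt', 'ext1']
```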


trapexit commented Jan 8, 2020

re changes: if you want to index the files but don't care about changes (or want them emailed at a different time), you could use different databases for them and/or use the filter. You can also use the null hash if you don't even care about bitrot checks, given they change so often.


trapexit commented Jan 8, 2020

What I'm getting at with these suggestions is that I'm not sure it's a good idea to ignore changes wholesale. It is better to have different checks for different use cases, which I think is mostly possible given the available features. Printing to stderr feels wrong because it's not an app failure. grep prints failed file opens to stderr because that's an error in the workflow, not in the data. A FAILED hash check is not a workflow issue. That said, I do have app errors printing to stdout, which I should fix.
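The stdout/stderr split being argued for can be sketched as follows; the report_* function names are hypothetical, not scorch's API:

```python
import sys

def report_finding(path, state):
    # A CHANGED/FAILED result is information about the data, not a program
    # fault, so it belongs on stdout where cron's mail pipeline sees it.
    print(f"{state} {path}")

def report_app_error(msg):
    # An unreadable database or a bad argument is a fault of the program
    # itself, so it belongs on stderr.
    print(f"error: {msg}", file=sys.stderr)

report_finding("/home/user/archive.iso", "FAILED")  # -> stdout
report_app_error("unable to open database")         # -> stderr
```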


szel commented Jan 8, 2020

@trapexit

From the default options of scorch I suspect that the original intent of the program was to detect file corruption in immutable files and perhaps let you know which files were modified recently.

Do you see scorch supporting a scenario where files are modified on a daily basis (e.g. .doc or .odt documents), and just focusing on file corruption where it can be detected?


trapexit commented Jan 8, 2020

I'm not sure I understand. If the file is modified regularly then the whole "silent" part of "silent corruption detection" is no longer a thing, because corruption will be found when the file is used/updated. scorch is for files that aren't consumed or changed regularly (similar to SnapRAID).

I can make it so it doesn't report changed files, but I'm not sure I understand why you would be adding them in the first place. Clearly, if they change regularly then they aren't sitting around at risk of bitrot the way an archived file or media generally is.

Can you explain to me what you're trying to accomplish? What is the workflow? Do you want it to index but ignore files that have changed? Would you want it to "update" them but not print it? What would the tool do? How would you run it?


szel commented Jan 9, 2020

Sorry! It is difficult to communicate effectively about such a deep subject in just a few words.

When I found your project, I got the impression that scorch is a generic tool to detect file corruption on a best-effort basis, and the README.md was written in such a detailed and precise manner that I thought it would be a good candidate.

I didn't realize you intended scorch as a checksum tool for immutable files; this came up when I started testing it.

My case is the following:

  • I have a mix of immutable data (90%) and mutable data (10%).
  • The mutable data changes just a few times a day.
  • Primary data is served from an mdadm RAID1 array, with an mdadm array sync check once per month.
  • Data is regularly backed up to a different server.
  • I want to spot file corruption problems, but I need scorch to skip files that have changed since the last check, because it cannot verify their consistency (when programs open files, the modification time changes).

Allowing this would make scorch a very universal file corruption detection tool, and I believe more people could use it.

You have found a good niche for this kind of tool because (IMHO):

  • it would allow detecting file corruption problems on any filesystem on a best-effort basis (if I only wanted to checksum immutable files, I would go for the md5deep/hashdeep project);
  • you cannot use SnapRAID without a parity disk, which costs money, and on top of that you still need a normal backup;
  • people who want realtime protection will go for ZFS;
  • BTRFS's future on Linux is not really clear, and Linux filesystems are surprisingly lacking when it comes to data integrity.

To me the only problem now is that if I want to do this automatically, I need to ignore the output of the "update" phase, and I cannot distinguish program errors from data errors:

  scorch -D 'size,mtime' -d /var/lib/scorch.db update /home 1>/dev/null
  scorch -D 'size,mtime' -d /var/lib/scorch.db append /home
  scorch -D 'size,mtime' -d /var/lib/scorch.db cleanup /home
  scorch -D 'size,mtime' -d /var/lib/scorch.db check /home

An option to silence "CHANGED" files would be a godsend in my case. What do you think?


trapexit commented Jan 9, 2020

When I found your project I was given the impression that the scorch is generic tool to detect file corruption on best effort basis and README.md was written in such detail and precise manner I thought it would be a good candidate.

This is what I don't quite understand. I don't know what you mean by "generic tool to detect file corruption". It is one. However, detecting data corruption in files that change regularly is not possible without a detailed understanding of the format itself. Hashes are for general detection. If the file changes by a single bit, the hash will change. So if you have "live" files that are changed regularly, keeping hashes around has no utility; they will never be useful. Only after files stop being changed does storing the hash make sense.

I can change the verbosity settings to make "changed" a different level, but I'm still unsure why you index files that knowingly, regularly change. Is it just in case they stop changing in the future? Wouldn't that situation be better served with an mtime timeout: only process files which haven't been modified for a certain period of time? I have the restrict feature for similar workflows. Some people set files they expect not to change to read-only, or set the sticky bit to indicate a file is special, so the restrict option lets them filter out files that don't match. An mtime timeout could work similarly. But that is only if I'm understanding why you are indexing regularly changing files. Your intent here is what eludes me: why index the whole of your home directory, including files you know regularly change (and for which hashing is therefore not useful in bitrot detection)?
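The mtime-timeout idea could look roughly like this; select_stable(), the (path, mtime) input shape, and the 30-day default are assumptions for illustration, not an existing scorch feature:

```python
# Hypothetical mtime-timeout filter: only consider files for indexing
# once they have been unmodified for a minimum period.

def select_stable(entries, min_age_days=30, now=0):
    # entries: iterable of (path, mtime) pairs, e.g. gathered via os.scandir()
    cutoff = now - min_age_days * 86400
    return [path for path, mtime in entries if mtime <= cutoff]

now = 100 * 86400  # a fixed "current time" for the example
entries = [("old.iso", 10 * 86400),    # untouched for 90 days -> index it
           ("fresh.doc", 99 * 86400)]  # modified yesterday    -> skip it
print(select_stable(entries, min_age_days=30, now=now))  # ['old.iso']
```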


trapexit commented Jan 9, 2020

I'll change the verbosity settings; update should have been silent by default anyway. But what I'm really trying to understand is why you index files for which indexing isn't useful. Is it just because your data is scattered across paths and you don't have a way to define some metric for which files to index and which not to? That is a fine reason, but it would help me a lot if the reasons could be articulated. Right now I feel like I'm guessing.


trapexit commented Jan 9, 2020

regarding hashdeep

The problem with those tools is they don't fit the problem space of the typical data hoarder. There needs to be a way to distinguish a file that has legitimately changed from data that was silently corrupted, so I don't get false positives when I rip my BluRay, replace the DVD rip of some movie, and want to use the same filename. You need a workflow that enables ease of automation and general use. And when you use a tool like mergerfs, you want the ability to index your data and easily see which files you are missing if a drive dies. There are a number of other features and use cases scorch provides.

@trapexit trapexit closed this as completed Jan 9, 2020
@trapexit trapexit reopened this Jan 9, 2020

trapexit commented Jan 9, 2020

Hit the wrong button...

@trapexit

Can you try the version from the verbosity branch?

@trapexit

bump


azurefreecovid commented Jun 10, 2020

Hi there @trapexit. Thanks for the great software.

I've been trying to use it in a use case similar to the OP's. I have lots of legitimate changes on the filesystem I'm running it on, but I also want to be able to detect bitrot in the files mixed in that don't change.

I'm scripting the use of Scorch and deciding what to do based on the exit code I receive back from Scorch.

Is there any way to distinguish between files that have legitimately changed on the filesystem (which I don't care about) and bitrot (which I do care about)? Both seem to result in an exit code of 4.

Would it be possible to add an additional exit code just for file content changes without the metadata changing (i.e. bitrot)? This would be extremely useful for me.

My current workflow is (feel free to let me know if I'm using Scorch wrong):

scorch -D 'size,mtime' -d ./scorch.db check+update /data
scorch -D 'size,mtime' -d ./scorch.db append /data
scorch -D 'size,mtime' -d ./scorch.db cleanup /data

I have tried running scorch update then scorch check, but in the intervening period files can change, so I still get errors.

If you don't like the idea of a separate error code for file corruption (as distinct from file changes), would it be possible to add a new command, scorch update+check, which updates the metadata first and then does the hash check?

@azurefreecovid

For anyone playing along at home: Scorch is very well written software and easy to modify. I was able to add the functionality I described above (the additional exit code) extremely easily.

If you'd like to add the functionality yourself, you can with the following diff (patch):

diff --git a/scorch b/scorch
index 1523811..fab284e 100755
--- a/scorch
+++ b/scorch
@@ -50,6 +50,8 @@ ERROR_DIGEST_MISMATCH = 4
 ERROR_FOUND = 8
 ERROR_NOT_FOUND = 16
 ERROR_INTERRUPTED = 32
+ERROR_FILE_CORRUPTION = 64
+
 
 
 class Options(object):
@@ -671,7 +673,7 @@ def inst_check(opts,path,db,dbremove,update=False):
 
                 newfi.digest = hash_file(filepath,oldfi.digest)
                 if newfi.digest != oldfi.digest:
-                    err = err | ERROR_DIGEST_MISMATCH
+                    err = err | ERROR_DIGEST_MISMATCH | ERROR_FILE_CORRUPTION
                     oldfi.state   = 'F'
                     oldfi.checked = time.time()
                     if not opts.verbose:

You'll now get an exit code of 64 or above (68 if you have no other errors) when you have a file whose contents have changed (i.e. a different hash value) but whose metadata has remained the same.
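A wrapper script consuming these bit-flag exit codes might decode them like this; summarize() is a hypothetical helper, with flag values mirroring the constants in the diff:

```python
# Decode scorch-style bit-flag exit codes in a monitoring wrapper.
# summarize() is hypothetical; the flag values mirror the diff above.

ERROR_DIGEST_MISMATCH = 4
ERROR_FILE_CORRUPTION = 64

def summarize(exit_code):
    if exit_code & ERROR_FILE_CORRUPTION:
        # hash changed while metadata stayed the same: real corruption
        return "corruption"
    if exit_code & ERROR_DIGEST_MISMATCH:
        # hash changed along with metadata: a legitimate change
        return "changed"
    return "clean"

print(summarize(68))  # 64 | 4 -> 'corruption'
print(summarize(4))   #         'changed'
print(summarize(0))   #         'clean'
```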

@trapexit it would be great if you could incorporate this change or something similar, based on exit codes into the mainline of Scorch - that way I don't have to keep the diff and others could easily benefit (assuming someone else finds it useful).

Thanks again for the great software!!

@trapexit

Please don't hijack threads / issues.

As for your request: it makes more sense to keep in line with the behavior of the software and have a change flag. A digest mismatch is file corruption; the interpretation of that digest based on other data determines whether it's "changed". I'll look at it.


azurefreecovid commented Jun 10, 2020

Firstly, sorry for the thread hijack. It did seem to me that my issue / use case was very similar (almost identical) to the OP's. I'll open a new issue now, which can be found here

Thanks for looking at it! Very much appreciated, and the idea of a flag to ignore metadata changes (when you are expecting them, for example) would definitely work for me.


szel commented Jun 11, 2020

@azurefreecovid your use case is exactly the same as mine, but I think the author of scorch has a different vision for the program (or a reason for it to be this way).


szel commented Jun 11, 2020

This is what I don't quite understand. I don't know what you mean by "generic tool to detect file corruption".
...
Why index the whole of your home directory including files you know regularly change (and therefore hashing is not useful in bitrot detection).

Because most files never change, and it is useful to have an automatic (no human interaction) way to detect that something is wrong with your data, on a best-effort basis (check what can be checked), on native Linux ext2/3/4 filesystems. Not everyone wants to use ZFS on Linux.

It is also useful not to have to separate the data into mutable/immutable collections, and not to have to run manual data checks myself (which I would fail to do regularly and reliably).

@trapexit

@azurefreecovid your use case is exact the same as mine but I think that the author of scorch has different vision for the program (or a reason for it to be this way).

? I just didn't consider the return code for changed vs failed. Obviously I focus on that distinction, given the high level of configuration around this feature. There are two separate things: one is the return code, the other is how verbose things are. Not wanting them conflated isn't some rebuttal of the request. Thread hijacking confuses the conversation and complicates tracking.

Because most files never change and it is useful to have automatic (no human interaction) way to detect that there is something wrong with your data on best effort basis (check what I can) on native Linux ext2/3/4 filesystems. Not everyone wants to use ZFS on Linux.

I know that, and that's why I wrote the "changes" logic. That's why I don't understand what you mean by "generic tool to detect file corruption": it is one. The concern of what gets printed is entirely separate. And I made changes 6 months ago and no one ever commented on them.


szel commented Jun 11, 2020

Well, from the discussion I also learned that you intended STDERR for "program errors", not data corruption errors, so I think the idea from the issue title may not fit how scorch is written (different usage) and we possibly can't do anything.

Hm... so @trapexit, if we would like to have this issue resolved, what would you like us to do? Is there anything we can do?

@trapexit

Well from the discussion I also learned that you purposed STDERR for "program errors" - not data corruption errors so I think the idea from issue title may not be how scorch is written (different usage) and we possibly can't do anything.

I don't understand what you're talking about. I made the changes to the branch as you requested.

Hm.. so @trapexit - If we would like to have this issue gone - what would you like us to do? Is there anything we can do?

You could test the changes I made for you back in January and tell me if it was sufficient.

I've made a bunch of changes since then so I'll need to port the changes over.


szel commented Jun 11, 2020

@trapexit Sorry!

I missed the message, assumed the changes didn't fit Scorch's purpose, and just gave up.

I better get to testing then :-)

@azurefreecovid

@azurefreecovid your use case is exactly the same as mine, but I think the author of scorch has a different vision for the program (or a reason for it to be this way).

? I just didn't consider the return code for changed vs failed. Obviously I focus on that distinction, given the high level of configuration around this feature. There are two separate things: one is the return code, the other is how verbose things are. Not wanting them conflated isn't some rebuttal of the request. Thread hijacking confuses the conversation and complicates tracking.

Completely understand. I can see you are very committed to Scorch and to helping meet users' requirements. Thanks again!
