Print only warnings when data integrity problem occurs (conjunction with crontab reports to root) #29
Comments
How do you determine if a file has a data integrity issue vs it's been
changed? I've waffled about this myself. For my media files, it's fine -
if one changed, that's indicative of bitrot as I don't change them.
However, the .NFO files will occasionally change with software changes, and
that prompts thousands of CHANGED notifications. I don't really think
there is a solution to this - the file either matches its hash or it
doesn't.
At least in my case I think excluding data that can change entirely is
probably the way to go. There's a way to pass regex filters to Scorch to
include/exclude, but I've not figured out how to format the expressions on
the command line yet. Sadly there are no examples in the docs :(
…On Wed., Jan. 8, 2020, 12:16 a.m. szel, ***@***.***> wrote:
Hello!
I have found scorch while researching data integrity tools available for
Linux, and one of the users mentioned your script.
The script works beautifully and as expected. Thank you for
creating it - great job!
Would it be possible to adjust it so it only prints information when
there is an actual data integrity problem ("FAILED")?
When used with crontab, this would make it possible to set up a cron job,
since cron by default sends e-mail to the user (i.e. root) when there is
any output on STDOUT.
Right now the script also shows information about "CHANGED" hashes, which
defeats the purpose without additional scripting.
What do you think?
|
Maybe store mtime along with the hash, and have an option to ignore changes
if the mtime on the file is newer than in the db?
|
You need to provide more detail. How are you running the script? There is a difference between "CHANGED" and "FAILED". "CHANGED" means the file is no longer the same thing or its metadata is different, as determined by the "--diff-fields" argument. "FAILED" means the hash is different but the size (and other diff fields) are the same. |
If you don't want to find changed files (which could be risky) then set the "diff-fields" to an empty string. |
Actually. Looks like it skips to hash checks if the diff check doesn't return, so those files will show as failed instead of changed. I guess I could either change the verbosity setup so changes only print if it is set high enough, or add another argument to ignore changes altogether. Let me look. |
As for regex... it's just regular python regex and should work with any common regex pattern. What are you trying to do? |
mtime, inode, size, and mode are stored in the DB. That's how it determines it "changed". The better way of not having it print out changes if it is unnecessary data for you would be changing the verbosity settings so it is optional. By default it only prints changed and failed. I could change it to be just failed on default and move changed to verbose I guess. Though that would mean I'd need to move current verbose up a level. An alternative would be to run an "update" first to refresh the metadata on changed files then run the "check". "update" is cheaper. It only does the file diff check. Not the hash check (though it does rehash updated files). |
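To make the CHANGED / FAILED distinction described above concrete, here is a minimal sketch of the decision as I read it from this thread. It is not scorch's actual code: the `Record` layout, the `classify` function, and the default `diff_fields` value are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """Metadata and digest for one file (hypothetical layout, not scorch's schema)."""
    size: int
    mtime: float
    inode: int
    mode: int
    digest: str


def classify(stored: Record, current: Record, diff_fields=("size",)) -> str:
    """Return 'CHANGED', 'FAILED', or 'OK' for one file.

    diff_fields plays the role of the --diff-fields argument mentioned in
    the thread; the default used here is an assumption.
    """
    # Any difference in the selected metadata fields means the file was modified.
    for field in diff_fields:
        if getattr(stored, field) != getattr(current, field):
            return "CHANGED"
    # The selected metadata is identical, so a digest mismatch points at
    # silent corruption rather than a legitimate edit.
    if stored.digest != current.digest:
        return "FAILED"
    return "OK"
```

Note that with an empty `diff_fields` every digest mismatch falls through to "FAILED", which matches the behaviour described earlier for an empty "diff-fields" string.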
That said... if you don't care about NFO files then just add -F -f ".*.nfo" and it'll only process files that don't end in ".nfo". |
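Since the filter is described as plain Python regex, the pattern can be tried out in Python before putting it on the command line. The sketch below is only an illustration: the `keep` helper is hypothetical, and whether scorch applies the pattern with `re.match`, `re.search`, or something else is an assumption on my part. It also shows why escaping the dot and anchoring the end can matter.

```python
import re

# Loose pattern as written in the comment above: the second "." matches any character.
LOOSE = re.compile(r".*.nfo")
# Stricter variant: literal dot, anchored at the end, case-insensitive.
NFO = re.compile(r".*\.nfo$", re.IGNORECASE)


def keep(path: str, pattern: re.Pattern = NFO, negate: bool = True) -> bool:
    """Return True if the path should be processed.

    negate=True mimics a negated filter: process everything that does NOT match.
    """
    matched = pattern.match(path) is not None
    return not matched if negate else matched


paths = ["/media/movie.mkv", "/media/movie.nfo", "/media/info.txt"]
print([p for p in paths if keep(p)])            # ['/media/movie.mkv', '/media/info.txt']
print(bool(LOOSE.match("/media/info.txt")))     # True: "info" satisfies ".nfo"
```

If the tool matches from the start of the path without requiring a full match, the loose pattern would also exclude a file like info.txt, so the stricter form is probably the safer one to pass.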
Motivation
I was hoping to use scorch only as a bitrot / corruption detection tool, with cron sending STDOUT to root's e-mail and normal rsync backups to another server so I can recover in time. My files change on a daily basis; I use rsync's "mtime / size" comparison to detect changes, and an external backup combined with hardlinks to create external snapshots of the data. The use case would be the following (all commands run as the root user, on a machine where there is nothing else but my data):
1. Initialize checksums for /home
2. Create cron script for periodic data scrubbing
3. Install cron to run once per month
If there is anything on the script's STDOUT, it will be sent to root's e-mail (my machine has /etc/aliases configured and e-mail working).
4. Configure external backups and keep data since last data scrub
In order to recover when bitrot is detected, we need to know which files are good, and we need proper backups. Here is a sample rsync script that I use:
|
I have updated my original answer. For me any of the following would be great:
If you have any other questions please don't hesitate to ask :-) |
Just exclude specific extensions. Both to avoid CHANGED notifications for
files that may change, and also for performance purposes - I only check X
files at random each run, so I'd prefer just checking files I care about.
I'm moderately familiar with regex overall and that's not a super
complicated thing to google, but I've not been able to get it to work.
Presumably I'm just formatting it wrong.
…On Wed., Jan. 8, 2020, 5:17 a.m. trapexit, ***@***.***> wrote:
As for regex... it's just regular python regex and should work with any
common regex pattern. What are you trying to do?
|
You can use I prefer to talk in terms of general features rather than implementations. If I understand you right, you don't care about "changes"? You have lots of legitimate changes, and all those changes make it harder to see FAILED values? BTW... if |
re regex: That will match those extensions. If you want to easily negate the regex just use |
re changes: if you want to index the files but don't care about changes (or want them emailed at a different time) you could use different databases for them and/or use the filter. Can also use the null hash if you don't even care about bitrot checks given they change so often. |
What I'm getting at with these suggestions is that I'm not sure it's a good idea to ignore changes wholesale. It is better to have different checks for different use cases, which I think is mostly possible with the existing features. Printing to stderr feels wrong because it's not an app failure. grep prints failed file opens to stderr because that is an error in the workflow, not in the data. A FAILED hash check is not a workflow issue. That said, I do have app errors printing to stdout, which I should fix. |
From the default options of scorch I suspect that the original intent of the program was to detect file corruption on immutable files and perhaps let you know which files were modified recently. Do you see scorch supporting a scenario where files are modified on a daily basis (e.g. .doc or .odt documents) and it just focuses on file corruption where it can be detected? |
I'm not sure I understand. If the file is modified regularly then the whole "silent" part of "silent corruption detection" is no longer a thing because it will be found when used/updated. scorch is for files that aren't consumed or changed regularly (similar to SnapRAID). I can make it so it doesn't report changed files but I'm not sure I understand why you would be adding them in the first place. Clearly if they change regularly then they aren't sitting around and at risk of bitrot in the same way an archived file or media generally is. Can you explain to me what you're trying to accomplish? What is the workflow? Do you want it to index but ignore files that have changed? Would you want it to "update" them but not print it? What would the tool do? How would you run it? |
Sorry! It is difficult to communicate effectively about such a deep subject in just a few words. When I found your project I was given the impression that scorch is a generic tool to detect file corruption on a best-effort basis, and README.md was written in such a detailed and precise manner that I thought it would be a good candidate. I did not realize you meant scorch to be a checksum tool for immutable files; this only came up when I started testing it. My case is the following:
Allowing this would make it possible to create a very universal file corruption detection tool, and I believe more people could use it. You have found a good niche for that kind of tool because (IMHO):
To me the only problem now is that if I want this done automatically, I need to ignore the output of the "update" phase, and I cannot distinguish program errors from data:
Option to silence "CHANGED" files would be a godsend in my case. What do you think? |
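Until such an option exists, the "additional scripting" mentioned in the original request could look roughly like the sketch below: a filter that keeps only the lines mentioning FAILED, so that cron has nothing to mail when there are no failures. The `filter_report` function is hypothetical, and the assumption that each result line contains its status word ("CHANGED", "FAILED") is mine; the sample report format is made up for illustration and is not scorch's real output.

```python
def filter_report(report: str, marker: str = "FAILED") -> str:
    """Return only the lines of a report that mention the marker.

    Assumes (hypothetically) that every result line carries its status word.
    An empty return value means there is nothing for cron to mail.
    """
    kept = [line for line in report.splitlines(keepends=True) if marker in line]
    return "".join(kept)


# Made-up sample report; the real output format may differ.
sample = "CHANGED: /home/a.odt\nFAILED: /home/b.iso\nCHANGED: /home/c.doc\n"
print(filter_report(sample), end="")   # FAILED: /home/b.iso
```

In a cron script, the report would come from the checker's standard output instead of the sample string. This does nothing about telling program errors apart from data results, which is the other half of the problem raised above.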
This is what I don't quite understand. I don't know what you mean by "generic tool to detect file corruption". It is. However, detecting data corruption in files that change regularly is not possible without a detailed understanding of the format itself. Hashes are for general detection. If the file changes by a single bit, the hash will change. So if you have "live" files that are changed regularly, keeping hashes around has no utility. They will never be useful. Only after the files stop being changed does storing the hash make sense. I can change the verbosity settings to make "changed" a different level, but I'm still unsure why you index files that knowingly, regularly change. Is it just in case they don't change in the future? Wouldn't that situation be better served with an mtime timeout? Only process files which haven't been modified for a certain period of time? I have the |
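The "mtime timeout" idea above, processing only files that have not been modified for a certain period, can be sketched independently of scorch. This is a minimal illustration under my own assumptions: `settled_files` and `min_age_days` are hypothetical names, not scorch options.

```python
import os
import time


def settled_files(root, min_age_days=30.0, now=None):
    """Yield paths under root whose mtime is at least min_age_days in the past."""
    now = time.time() if now is None else now
    cutoff = now - min_age_days * 86400  # seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                mtime = os.stat(path).st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime <= cutoff:
                yield path
```

Only the files this yields would be worth hashing and checking; anything modified more recently is still "live", and a stored hash for it would go stale anyway.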
I'll change the verbosity settings. update should have been silent by default anyway. But what I'm really trying to understand is why you index files for which indexing isn't useful. Is it just because you have data stored scattershot across the paths and don't have the ability to define some metric to know which to index and which not to index? That is a fine reason, but it'd help me a lot in addressing this if the reasons could be articulated. Right now I feel like I'm guessing. |
Regarding hashdeep: the problem with those tools is that they don't fit the problem space of the typical data hoarder. There needs to be a way to indicate that a file has changed versus having been silently corrupted, so I don't get false positives when I rip my BluRay and replace the DVD rip of some movie and want to use the same filename. You need a workflow that enables ease of automation and general use, for instance when you use a tool like mergerfs and want the ability to index your data and easily see which files you are missing if a drive dies. There are a number of other features and use cases scorch provides. |
Hit the wrong button... |
Can you try the version from the |
bump |
Hi there @trapexit. Thanks for the great software. I've been trying to use it in a similar use case to the OP of this thread. I have lots of legitimate changes on the filesystem that I'm running it on, but I also want to be able to detect bitrot on files mixed in that don't change. I'm scripting the use of Scorch and deciding what to do based on the exit code I receive back from it. Is there any way to distinguish between files that have legitimately changed on the file system (which I don't care about) and bitrot (which I do care about)? Both seem to result in an exit code of 4. Would it be possible to add an additional exit code that is just for file content changes without the metadata changing (e.g. bitrot)? This would be extremely useful for me. My current workflow is (feel free to let me know if I'm using Scorch wrong):
I have tried running If you didn't like the idea of a separate error code for file corruption, rather than file changes and corruption, would it be possible to add new command |
For anyone playing along at home: Scorch is very well written software and easy to modify. I was able to add the functionality I described above (the additional exit code) extremely easily. If you'd like to add the functionality yourself, you can with the following diff (patch):
You'll now get an exit code of 64 or above (68 if you have no other errors) when you have a file whose contents have changed (ie different hash value), but the metadata has remained the same. @trapexit it would be great if you could incorporate this change or something similar, based on exit codes into the mainline of Scorch - that way I don't have to keep the diff and others could easily benefit (assuming someone else finds it useful). Thanks again for the great software!! |
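For scripting against those exit codes, a small sketch of how the value could be tested follows. It assumes the codes are bit flags that get OR-ed together, which is consistent with "64 or above (68 if you have no other errors)" but is still my reading. The constant names are hypothetical, and the 64 flag exists only with the local patch described above, not in mainline Scorch as far as this thread shows.

```python
# Hypothetical names for the values mentioned in this thread.
CHANGE_OR_FAILURE = 4   # reported for both CHANGED and FAILED files
DIGEST_MISMATCH = 64    # added only by the commenter's local patch


def has_bitrot(exit_code: int) -> bool:
    """True if the (patched) digest-mismatch flag is set in the exit code."""
    return bool(exit_code & DIGEST_MISMATCH)


for code in (0, 4, 68):
    print(code, has_bitrot(code))
# 0 False
# 4 False
# 68 True
```

A wrapper script would pass it the checker's return code, for example the `returncode` attribute of the result from `subprocess.run`, and alert only when it returns True.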
Please don't hijack threads / issues. As for your request: it makes more sense to keep in line with the behavior of the software and have a change flag. digest mismatch is file corruption. The interpretation of that digest based on other data determines if it's "changed". I'll look at it. |
Firstly, sorry for the thread hijack. It did seem to me that my issue / use case was very similar (almost identical) to the OP's. I'll open a new issue now, which can be found here. Thanks for looking at it! Very much appreciated, and the idea of a flag to ignore metadata changes (when you are expecting them, for example) would definitely work for me. |
@azurefreecovid your use case is exactly the same as mine, but I think that the author of scorch has a different vision for the program (or a reason for it to be this way). |
Because most files never change and it is useful to have automatic (no human interaction) way to detect that there is something wrong with your data on best effort basis (check what I can) on native Linux ext2/3/4 filesystems. Not everyone wants to use ZFS on Linux. It is also useful not to separate the data by mutable/immutable collections and not to have to run manual data checks myself (which I will fail to do regularly and reliably). |
? I just didn't consider the return code for changed vs failed. Obviously I focus on that distinction, given the high level of configuration around this feature. There are two separate things: one is a return code, the other is how verbose things are. Not wanting them conflated isn't some rebuttal of the request. Thread hijacking confuses the conversation and complicates tracking.
I know that. And that's why I wrote the "changes" logic. That's why I don't understand what you mean by "generic tool to detect file corruption." It is. The concern of what gets printed is entirely separate. And I made changes 6 months ago and no one ever commented on it. |
Well from the discussion I also learned that you purposed STDERR for "program errors" - not data corruption errors so I think the idea from issue title may not be how scorch is written (different usage) and we possibly can't do anything. Hm.. so @trapexit - If we would like to have this issue gone - what would you like us to do? Is there anything we can do? |
I don't understand what you're talking about. I made the changes to the branch as you requested.
You could test the changes I made for you back in January and tell me if it was sufficient. I've made a bunch of changes since then so I'll need to port the changes over. |
@trapexit Sorry! I missed the message and assumed the changes are not fitting Scorch's purpose and just gave up. I better get to testing then :-) |
Completely understand. I can see you are very committed to Scorch and to helping meet users' requirements, thanks again! |
Hello!
I have found scorch while researching data integrity tools available for Linux, and one of the users mentioned your script.
The script works beautifully and as expected. Thank you for creating it - great job!
Would it be possible to adjust it so it only prints information when there is an actual data integrity problem ("FAILED")?
When used with crontab, this would make it possible to set up a cron job, since cron by default sends e-mail to the user (i.e. root) when there is any output on STDOUT.
Right now the script also shows information about "CHANGED" hashes, which defeats the purpose without additional scripting.
What do you think?