
Support for optional inclusion of filesize in hash files #65

Open
a-raccoon opened this issue Oct 30, 2019 · 11 comments

Comments

@a-raccoon

I wish to request that HashCheck support both the presence and the absence of the filesize (in bytes) stored within hash files (e.g., .sha512). The filesize would be the second of three columns, between the filehash and the filename. It would be contextually identifiable, since it would be a plain integer value rather than a fully qualified path, and it would be followed by a fully qualified path. This should avoid compatibility issues.

I need this so that other software using these hash files can quickly disqualify a file as a match when its filesize doesn't match, and can help locate a moved or renamed file by looking for a matching filesize.

Format as proposed: hash filesize filepath

FFEDCBA9876543210123456789ABCDEF 1234567890 *path\filename.ext
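
As a rough illustration (not HashCheck's actual parser), here is a short Python sketch of a reader that accepts both the usual two-column lines and the proposed three-column lines, relying only on the "plain integer before the path" rule described above:

import os
import re

# Minimal sketch: accept "<hash> *<path>" and the proposed "<hash> <filesize> *<path>".
# The filesize column is recognized simply by being a plain integer before the path.
LINE_RE = re.compile(r'^(?P<hash>[0-9A-Fa-f]+)\s+(?:(?P<size>\d+)\s+)?\*?(?P<path>.+)$')

def parse_line(line):
    m = LINE_RE.match(line.strip())
    if m is None:
        return None
    size = int(m.group('size')) if m.group('size') else None
    return m.group('hash'), size, m.group('path')

def size_precheck(line):
    """Return False early when the recorded size disagrees with the file on disk."""
    parsed = parse_line(line)
    if parsed is None:
        return False
    _, size, path = parsed
    if size is not None and os.path.isfile(path) and os.path.getsize(path) != size:
        return False   # size mismatch: skip hashing entirely
    return True        # otherwise fall through to the normal hash comparison

print(parse_line(r"FFEDCBA9876543210123456789ABCDEF 1234567890 *path\filename.ext"))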

@LanceUMatthews

LanceUMatthews commented Oct 30, 2019

I, too, have thought that including the file size in a signature/checksum file would provide an early exit when the length doesn't match, but haven't been able to find a hash utility or file format that captures that metadata. I know that Wireshark signature files, for example, include the file lengths...

Wireshark-win64-3.0.6.exe: 59276232 bytes
SHA256(Wireshark-win64-3.0.6.exe)=6cd2b1474d5a031b85fca00538d45487144b36d8e1db1d565fd35d251ac261d0
RIPEMD160(Wireshark-win64-3.0.6.exe)=765f452efdc88e511291980398bd60e83bd06b3f
SHA1(Wireshark-win64-3.0.6.exe)=0ec54cf0d67ad5fd6583dd8d39e7d5fc68cfbae7

...and the format of the hash lines is consistent with the output of running openssl <algorithm> <path>, but I don't know if that signature file is, as a whole, any kind of standard format that's meant to be machine-readable and, if so, by what application. (The presence of documentation text further down suggests not.)

As for this proposal, are you asking that HashCheck produce hash files that include file length information, or just that it be smart enough to read and verify such information, if present? The former would break compatibility with other checksum verification utilities.

@USMA56795

I've seen at least one system (CallidusCloud, now SAP Commissions) which wanted filesize in its hashfile checks, so it's not unheard of.

For compatibility purposes, adding filesize should probably be done via a flag when calling HashCheck.

@a-raccoon
Author

a-raccoon commented Nov 4, 2019

LanceUMatthews: Allow me to preface by observing that every piece of hash-checking software mentions the simple caveat that there is NO standardized, agreed-upon format. And I hope that there never will be.

That said, yes, I want HashCheck to both write and read hash files that contain the filesize parameter, with the user choosing this as an option. Heck, allow the user to specify their own user-defined format(s) to read and write if you want to get super advanced, e.g. $sha512hex $filesizebytes $filepathrelative.
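
Purely as an illustration of the user-defined-format idea, a toy Python sketch that expands placeholders like the ones named above (the placeholder names are just the examples from this comment, not an existing HashCheck feature):

import hashlib
import os
from string import Template

# Toy sketch: expand a user-defined line template for one file.
# $sha512hex, $filesizebytes, and $filepathrelative are hypothetical placeholder names.
def format_entry(template: str, path: str, root: str = ".") -> str:
    with open(path, "rb") as f:
        digest = hashlib.sha512(f.read()).hexdigest()
    return Template(template).substitute(
        sha512hex=digest,
        filesizebytes=os.path.getsize(path),
        filepathrelative=os.path.relpath(path, root),
    )

# e.g. format_entry("$sha512hex $filesizebytes $filepathrelative", "path/filename.ext")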

The file formats are easy to convert with a simple script, if a future librarian needs to change things to make them compatible with the software being used in 2045. But conversion can only happen IF the filesize exists in the file. You can't convert something that is utterly missing.

Hashcheck is the only software that recognizes a *.sha512 file extension. And all of these extensions, including .sfv and .md5, are arbitrary whimsical formats. Let's not worry about what other people are doing, and become a leader.

@a-raccoon
Author

On a side note, I'm fleshing out a design for arbitrary metadata files to make my work as an archivist easier, adopting JSON, XML, YAML, and quoted/unquoted CSV, TSV, and SSV as valid means of reading, writing, importing, and exporting metadata. This will allow me to draw from multiple sources (hash files, ffprobe output, web-scraped data, etc.) into flat or structured files for manipulating and tracking changes to files and folders, comparing duplicates, revisions, and so on.

Any simple file format that somebody uses or invents can easily be scripted around and imported / exported.

@LanceUMatthews

I raised the issue of compatibility because having HashCheck change a known format and give it the same extension, as you've specified, would certainly make those files unreadable to all the programs out there that already handle that format/extension. Perhaps some people only use HashCheck in isolation, in which case it could use whatever format it wants, but one of the useful things about it is that, like any competent hashing utility, it reads and writes hash files in a format typical of and compatible with other hashing utilities. This isn't about software written in 2045; it's about compatibility with software that exists right now.

Also, I disagree that hash files are "arbitrary" formats. Yes, they may have been arbitrary when someone invented them, but they're well-established now. How often do you encounter *.<algorithm> or <ALGORITHM>SUMS files that can't be read by some hashing utility (for reasons other than unexpected line endings, character encoding, etc.)? I'm pretty fanatical about verifying downloads, backups, etc. and I don't think I've ever had that happen, and that's because everyone seems to do pretty well conforming to a simple, de facto standard. If someone provided a file for general consumption and named it after a ubiquitous format even though they had extended it in incompatible ways then I'd sooner call that misleading than leading, and being told "Oh, just run this script to transform it into the format you expected it to be" wouldn't improve my opinion, either.

If you're going to use a different, incompatible format then my point is that it should have a different name (extension), too. (Although, at that point you might be better off creating something entirely new without the limitations of an <ALGORITHM>SUMS-derived format.) Perhaps something like .hashcheck or .hc-<algorithm> or .<algorithm>+length. Then there'll be no surprises when other hashing software can't read that file, and it will give the user a fighting chance at finding (the) software that does.


By the way, I notice that HashCheck as well as GNU sha512sum both ignore lines that start with #. I don't know for how many hash utilities that is true, but perhaps an alternative would be to use comments to store the file lengths, like this...

# Length=1234567890 Path=path\filename.ext
FFEDCBA9876543210123456789ABCDEF *path\filename.ext

This would have the following benefits for compatibility:

  • Applications that support comments and file length verification would try to parse the length from any comments and fall back on the usual hash verification behavior if the length matches or is not found.
  • Applications that support comments but not file length verification would ignore the comments and maintain the usual hash verification behavior.
  • Applications that don't support comments would fail to parse the comment line. As long as hash file parsing continues when a malformed line is detected (which is my experience with the GNU hash utilities), the usual hash verification behavior should, in theory, be maintained since the hash lines are identical to those of a hash file with no length information.
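
For illustration only, a small Python sketch of a reader along the lines of the first bullet; the "# Length=<bytes> Path=<path>" layout is just the suggestion above, not something existing utilities emit:

import os
import re

# Collect expected lengths from comment lines; ordinary hash lines are untouched.
COMMENT_RE = re.compile(r'^#\s*Length=(?P<length>\d+)\s+Path=(?P<path>.+)$')

def read_lengths(hash_file):
    lengths = {}
    with open(hash_file, encoding='utf-8') as f:
        for raw in f:
            m = COMMENT_RE.match(raw.strip())
            if m:
                lengths[m.group('path')] = int(m.group('length'))
    return lengths

def length_ok(path, lengths):
    expected = lengths.get(path)
    if expected is None:
        return True        # no length recorded: fall back to hashing as usual
    return os.path.getsize(path) == expected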

@ThermoMan

Using comment fields for data storage is a dark path.

I second (or third by now) the earlier suggestion that the file size be included via a flag, but I'd go one step further: the new flag should not just control the size, it should control the format of the file. Using a structured format of any kind (JSON, XML, etc.) with named values would future-proof the file.

Instead of PacketSenderPortable.sha256 with

0665dc3bf7848952a6bc98f701b4e3c3172bc49ddebbe14748708a1ea95e8df9 *PacketSenderPortable\bearer\qgenericbearer.dll
753a746d1d99ff037554ec466618c2bbadf47145259fdc387323cf8ae8d01942 *PacketSenderPortable\bearer\qnativewifibearer.dll
a6350bf3f10fbc97b85ddf5bc7e4d4bf9a566f001101ba0d9a653fbcb5aec344 *PacketSenderPortable\D3Dcompiler_47.dll
8177f97513213526df2cf6184d8ff986c675afb514d4e68a404010521b880643 *PacketSenderPortable\gpl-2.0.txt

We'd have PacketSenderPortable.checksum with

{
  "file 1":{
    "path":"PacketSenderPortable\\bearer\\qgenericbearer.dll",
    "size":89600,
    "sha256":"0665dc3bf7848952a6bc98f701b4e3c3172bc49ddebbe14748708a1ea95e8df9 "
  },
  "file 2":{
    "path":"PacketSenderPortable\\bearer\\qnativewifibearer.dll",
    "size":82432,
    "sha256":"753a746d1d99ff037554ec466618c2bbadf47145259fdc387323cf8ae8d01942"
  },
  "file 3":{
    "path":"PacketSenderPortable\\D3Dcompiler_47.dll",
    "size":3733504,
    "sha256":"a6350bf3f10fbc97b85ddf5bc7e4d4bf9a566f001101ba0d9a653fbcb5aec344"
  },
  "file 4":{
    "path":"PacketSenderPortable\\gpl-2.0.txt",
    "size":18092,
    "sha256":"8177f97513213526df2cf6184d8ff986c675afb514d4e68a404010521b880643"
  }
}

Then adding any new fields will break nothing, and what they are will be obvious without having to reference a document to figure out what order the fields are in or which computation methods were or were not included.
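
As a sketch of how a reader might consume such a manifest (the .checksum extension, the "file N" keys, and the field names are this proposal's, not an established format), checking the size before bothering to hash:

import hashlib
import json
import os

def verify_manifest(manifest_path):
    # Load the hypothetical JSON manifest and verify each entry.
    with open(manifest_path, encoding="utf-8") as f:
        manifest = json.load(f)
    for key, entry in manifest.items():
        path = entry["path"]
        if not os.path.isfile(path):
            print(f"{key}: MISSING {path}")
            continue
        if os.path.getsize(path) != entry["size"]:
            print(f"{key}: SIZE MISMATCH {path}")   # cheap early exit, no hashing needed
            continue
        h = hashlib.sha256()
        with open(path, "rb") as data:
            for chunk in iter(lambda: data.read(1 << 20), b""):
                h.update(chunk)
        ok = h.hexdigest() == entry["sha256"]
        print(f"{key}: {'OK' if ok else 'HASH MISMATCH'} {path}")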

Just my 2¢

@a-raccoon
Author

a-raccoon commented Nov 5, 2019

I'll throw out there that the STANDARD format for hash files is just a single line containing nothing else except the hex value hash of the file that was downloaded. This is true for md5, sha1, sha256 that you might find accompanying a file on a download page, with the filename of the hash file matching that of the download file.

ALL OTHER variations to this standard are deviations, including the format Hashcheck is using right now. Software can only reasonably expect the hash value, only one hash value, and no file size, and no filename or path.

One of the drawbacks of using structured files is their complexity for simple script parsing. The one-line-per-item format is a lot easier to manipulate in scripts and spreadsheets. I'm not specifically asking for the elimination of SSV, just as I'm against the sanctification of the current hash-filepath format.

Allow the user to know best and pursue their needs.

I agree that using comment lines to convey data is a dark path.

We are talking about 2045, not just 1995 or 2005 or 2015. People are excellent at adopting and adapting to new "standards" when they enter circulation. If somebody is using 20-year-old hash software that can't read HashCheck-generated user-defined files, or refuses to, then the user can decide to upgrade their 20-year-old software, write an interpreter / conversion script for themselves, or email the author of the competing software.

@LanceUMatthews

Embedding extra information in comments is certainly not ideal, but given the lack of extensibility in the pervasive <hash> *<path> format I think adding that information to comments in a backwards-compatible way would be preferable to producing similar-but-still-different output that existing readers couldn't read at all.

As I said and @ThermoMan describes, an entirely new format that allows that flexibility would be better still than either of those options. At that point I don't think a flag is necessary but just another drop-down type to select in the Save As dialog; there may need to be options, though, to specify which computed hash(es) HashCheck will store when saving such a file (if not all of them) and which to verify when reading such a file (if not all of them when multiple are present).

XML or JSON are what I had in mind, too, although to lighten the parsing requirements it could be CSV...

"Path","Method","Value"
"path\filename.ext","Length","1234567890"
"path\filename.ext","MD5","FFEDCBA9876543210123456789ABCDEF"

...(which would have some redundant Path column values) or key-value/.ini-style...

[path\filename.ext]
Length=1234567890
MD5=FFEDCBA9876543210123456789ABCDEF
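
For what it's worth, such an .ini-style layout could be read with a stock parser; a quick Python sketch, where the Length and per-algorithm keys are just this comment's proposal:

import configparser

# One section per file path; Length and MD5 are example keys, not an existing format.
sample = """\
[path\\filename.ext]
Length=1234567890
MD5=FFEDCBA9876543210123456789ABCDEF
"""

parser = configparser.ConfigParser()
parser.read_string(sample)
for path in parser.sections():
    length = parser.getint(path, "Length", fallback=None)
    md5 = parser.get(path, "MD5", fallback=None)
    print(path, length, md5)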

Another attribute to work in there, whatever the containing format, might be the specifier for binary vs. text-mode hashes, either per-file (e.g. ReadMode=Binary) or per-hash (e.g. MD5(Binary)=FFEDCBA9876543210123456789ABCDEF).

@a-raccoon
Author

@LanceUMatthews what do you consider to be a text-mode hash? Exclude all \r (0x0D) and \n (0x0A) characters from the data stream being hashed? Or just exclude \r characters? I don't know this to be a hashing mechanism used in the wild.

CSV, TSV, and SSV, quoted and unquoted, each have equal merits; this should be a setting the user can define. SSV, space-separated values, is what HashCheck currently uses. It can probably interpret TSV, but I haven't tested.

@LanceUMatthews

My understanding is that text mode ignores any newline characters encountered during hashing. Or maybe it normalizes them to a particular scheme? I've never had a need for it, so I'm not sure. I could...possibly...see it being used for code or configuration files that are shared across platforms. Someone must have had a use for it at some point, given that it's part of the existing hash file format, so it would make sense to somehow support it.

Maybe such an option should make it clear what it's doing, with ReadMode (or Transforms?) having possible values like Binary (the default), IgnoreNewlines, ForceLf, ForceCrLf, etc. A reader could implement whatever modes it wants (likely just Binary) and throw an error on those it doesn't. I could also see custom format-specific modes that allow for ignoring a file's embedded metadata during hashing (e.g. being able to verify an audio file after editing its tags) or even, say, treating a CSV file as if it had a particular quoting scheme...
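
Purely illustrative, assuming one possible meaning for a ForceLf transform versus the default binary mode (exactly what text mode should do is the open question above), a short Python sketch:

import hashlib

def md5_binary(data: bytes) -> str:
    # Default: hash the raw bytes exactly as stored.
    return hashlib.md5(data).hexdigest()

def md5_force_lf(data: bytes) -> str:
    # Hypothetical "ForceLf" transform: normalize CRLF/CR to LF before hashing.
    normalized = data.replace(b"\r\n", b"\n").replace(b"\r", b"\n")
    return hashlib.md5(normalized).hexdigest()

windows_text = b"line one\r\nline two\r\n"
unix_text = b"line one\nline two\n"

print(md5_binary(windows_text) == md5_binary(unix_text))        # False: raw bytes differ
print(md5_force_lf(windows_text) == md5_force_lf(unix_text))    # True: newlines normalized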

I quoted the values in my sample CSV because otherwise there'd be no way to determine (without some extra logic or finding a delimiter that's an illegal character on all filesystems) where the path ends (with filename.ext,Length being a strange-but-valid file name) and the next value begins.

@a-raccoon
Author

a-raccoon commented Nov 6, 2019

As far as the delimiter goes: as long as only one parameter, the filename and path, could potentially contain the delimiter, whether space or comma, it is fine for it to be unquoted, provided that parameter is the last one on the line. A lot of protocols use this sort of last-parameter delimiter anticipation, e.g. filehash filesize filepath contains spaces.

I can imagine text-mode hashes being used in FTP file sharing, where .txt, .html, .cpp, etc. files are converted between \r\n and \n depending on whether the ftpd is on Windows or Linux. Probably not as big a deal today, since people prefer exact file-copy transfer modes, and software code tends to be maintained via git / svn versioning systems these days, not hashes.

I was recently informed there's a video / audio frame-hashing capability in the ffmpeg software, which only analyzes the frame data of known media containers; it's codec-agnostic. For example: ffmpeg -i MOVIE.mov -f framemd5 MOVIE.framemd5. This could very reasonably fit into HashCheck's future plans; ffmpeg is integrated into lots of software.
