Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fixity field structure #406

Closed
pwinckles opened this issue Nov 3, 2019 · 11 comments
Closed

Fixity field structure #406

pwinckles opened this issue Nov 3, 2019 · 11 comments
Assignees
Milestone

Comments

@pwinckles
Copy link

Currently, the fixity field within an inventory looks as follows:

  "fixity": {
    "md5": {
      "184f84e28cbe75e050e9c25ea7f2e939": [ "v1/content/foo/bar.xml" ],
      "2673a7b11a70bc7ff960ad8127b4adeb": [ "v2/content/foo/bar.xml" ],
      "c289c8ccd4bab6e385f5afdd89b5bda2": [ "v1/content/image.tiff" ],
      "d41d8cd98f00b204e9800998ecf8427e": [ "v1/content/empty.txt" ]
    },
    "sha1": {
      "66709b068a2faead97113559db78ccd44712cbf2": [ "v1/content/foo/bar.xml" ],
      "a6357c99ecc5752931e133227581e914968f3b9c": [ "v2/content/foo/bar.xml" ],
      "b9c7ccc6154974288132b63c15db8d2750716b49": [ "v1/content/image.tiff" ],
      "da39a3ee5e6b4b0d3255bfef95601890afd80709": [ "v1/content/empty.txt" ]
    }
  }

This structure is quite unwieldy to work with. Something like the following would be much more useful:

  "fixity": {
    "v1/content/foo/bar.xml": {
        "md5": "184f84e28cbe75e050e9c25ea7f2e939",
        "sha1": "66709b068a2faead97113559db78ccd44712cbf2"
    },
    "v2/content/foo/bar.xml": {
        "md5": "2673a7b11a70bc7ff960ad8127b4adeb",
        "sha1": "a6357c99ecc5752931e133227581e914968f3b9c"
    },
    [....]
  }

As it stands, that's the map I have to create any time I need to access fixity values anyway. To me, this seems like an unneeded hassle just so multiple files can be associated to the same digest value. Sorry if this has been discussed before. I couldn't find an old issue.

@ahankinson ahankinson added this to the 1.0 milestone Nov 5, 2019
@ahankinson ahankinson assigned ahankinson and zimeon and unassigned ahankinson Nov 5, 2019
@zimeon
Copy link
Contributor

zimeon commented Nov 5, 2019

I don't recall a discussion of this issue for the fixity block. However, the structure of fixity has been chosen to mirror the structure of state where the direction of the map was discussed in #275 and #277 (and I'm sure there should be another issue but I can't find it)

@pwinckles
Copy link
Author

pwinckles commented Nov 5, 2019

How is it intended to be used? The structure makes sense for manifest. Is fixity intended to be used in the same way? The spec is not clear on this and only references "legacy content migrations." Do these legacy content migrations use information in fixity instead of manifest? If so, I can see how the current structure would be useful. If not, and the fixity section is purely informational, then the structure makes it very difficult to associate digest information to files.

@ahankinson
Copy link
Contributor

ahankinson commented Nov 5, 2019

Two issues with using filepaths as keys that were identified in the issues that @zimeon linked:

  1. Non-ASCII filepaths may (or may not) translate well between JSON and the filesystem.
  2. Files with different paths but the same hash (i.e., files with the same content but different paths) will have duplicate data. The use of the hash as the key was designed to reduce the amount of duplicate data, especially for long hashes.

I think we didn't know enough about the corner cases in filesystem character encoding and JSON encoding in order to be sure that (1) was really a problem, but given our significant but non-exhaustive research into this issue, it was hard to come up with a reason why we shouldn't worry about it, particularly if the pathnames contain, for example, Windows-1251 encoding and we're not sure if they've been properly mapped to Unicode in any processor.

@ahankinson
Copy link
Contributor

Since the fixity block is used to check fixity values, and not as content addressing, why do you need to associate the pathname first? It seems to me that to check fixity, you would first look at what you have in hand, and then see if that matches:

  1. Scan file in directory
  2. Check if the hash matches a known value (i.e., a known key exists in the fixity hash table)
  • 2a. If yes, check that the filepath is in the array
    • 2a (i). If yes, pass
    • 2a (ii). If no, fail
  • 2b. If no, fail
  1. Continue

@ahankinson
Copy link
Contributor

The intention of the fixity block is to provide a means of storing md5 and sha1 values that users may have and wish to keep, but in a way that does not put those algorithms on the critical path for content addressibility given the known problems with using those algorithms. That is what is meant by "legacy content migrations".

@pwinckles
Copy link
Author

pwinckles commented Nov 5, 2019

The method you described for checking fixity doesn't work because if the fixity section is used there is no requirement for every file to have an entry in it. So, if a digest is not in the fixity section you don't know whether or not the file is invalid or just not present.

In my implementation, I am not using the fixity block at all for fixity checking, and try to interact with it as little as possible. I only touch it in the following instances:

  1. If fixity algorithms are specified, I will compute the digests of files that are added and generate the appropriate fixity block.
  2. On describe operations I have to flip the map to associate all of a file's digests to it.
  3. [New] When rewriting an inventory when converting a "mutable HEAD" version into a normal version I have to flip the map to identify the paths that need to be rewritten.

I'm in the middle of processing the issues @zimeon linked to...

@pwinckles
Copy link
Author

As far as I can tell, the only reliable way to use the fixity block to verify fixity is the following:

  1. Invert the fixity map so it's structured as I originally described.
  2. Iterate over each entry in the manifest:
    1. If a contentPath has an entry in the inverted fixity map, compute its digest using the specified algorithm.
    2. Otherwise, ignore

If file paths cannot be trusted, scanning a directory and comparing the digests of the files found to the fixity section wouldn't work as the path representation could be different, as the linked issues suggest, and the only consistent (hopefully) representation is in the manifest.

That does make me wonder though, if a contentPath can have non-ascii characters and might be represented differently on different filesystems, how can it be used to reliably reference the path to a file across filesystems?

@julianmorley
Copy link
Contributor

I'd like to revisit fixity, too. I've had no trouble in using the block as written in spec to do fixity checks (yes, the I had to flip the hash and expand the filepaths, but that's a common enough operation that I just wrote a method to do it), but I'm wondering if something that gets rid of the file paths completely would make more sense, e.g.:

{
  "md5": {
  "cffe55838a878a29da82a0e10b2909b7e46b6f7167ed7f815782465573e98f27": "fccd3f96d461f495a3bef31dc1d28f01",
  "f512eb0a032f562225e848ce88449895f3ec19f3d4836a80df80c77c74557bab": "d2c79c8519af858fac2993c2373b5203"
  },
  "sha1": {
  "f512eb0a032f562225e848ce88449895f3ec19f3d4836a80df80c77c74557bab": "aa9e59cde167454f1f8b1f0eeeb0795e2d2f8c6f"
  }
}

@julianmorley
Copy link
Contributor

Also, a note on implementation. Because the fixity block is not guaranteed or required to contain a digest for every file listed in the manifest, the best way to verify fixity is to start with the contents in the fixity block. I do:

  • Get fixity block.
  • Do what I call an invert-expand-prepend, which flips the hashes with the filenames and expands them to the full local system path.
  • Then it's just a simple matter of generating checksums for the files in this list and comparing them against the stored value.

@ahankinson
Copy link
Contributor

if a contentPath can have non-ascii characters and might be represented differently on different filesystems, how can it be used to reliably reference the path to a file across filesystems?

The 'backup plan' is to check the hash, which shouldn't change across filesystems. If you can find the file hash, then you know the path on that system. The hash is the identity of the file; the filepath is a sort of convenience for human readability.

@neilsjefferies
Copy link
Member

Editors have discussed this and decided to stay with current layout. Some other issues raised have been spun out separately.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

5 participants