PoC(tap): Snapshot merkle tree #1113

mnm678 · 2020-08-18T21:48:13Z

This PR adds the ability to use a snapshot Merkle tree instead of the snapshot.json metadata. More details about the snapshot Merkle design are available in the Notary v2 design proposal: https://docs.google.com/document/d/1w8PFELVxt4p1aMk5oJv0RbDyd5J4OyvwguNSNZ1sJNw/edit# or in the TAP.

This implementation includes creation and verification of snapshot Merkle metadata, as well as an example auditor implementation.

joshuagl · 2020-09-02T15:33:56Z

@mnm678 please could you mark this PR as a draft until it's ready for review?

mnm678 · 2020-09-02T20:36:12Z

@joshuagl I just fixed a typo and rebased, so this should be ready for review now.

joshuagl · 2020-09-08T14:56:40Z

The linked document appears to be an old version; I found this working copy much more informative, particularly this section:

Metadata scalability with Merkle Tree

To optimize the snapshot metadata file size for large registries, registries can use a snapshot Merkle tree to conceptually store version information about all images in a single snapshot file. Of course, this would have scalability problems, but the idea is to not distribute that file to clients but instead provide the same protection in a more scalable manner. First, the client retrieves only a timestamp file, which changes according to some period p (such as every day or week). Second, the snapshot file is itself kept as a Merkle tree, with the timestamp as the root. A new snapshot Merkle tree is generated every time a new timestamp is generated. To prove that there has not been a reversion of the snapshot Merkle tree when downloading an image, the client downloads the prior snapshot Merkle trees and checks that the version numbers did not decrease at any point. To make this scalable as the number of timestamps increases, the client will only download version information signed by the current timestamp file. Thus, rotating this key enables the registry to discard old snapshot Merkle tree data.

This provides the following security, scalability, and privacy properties:
Users get rollback protection for the entire registry because the snapshot metadata in the Merkle tree lists all tags and images.
Users do not have to store the previous snapshot metadata to get rollback protection because they can securely check the previous snapshot Merkle tree. This gives some benefit to new users who would not have a previous snapshot metadata in the original design.
Access control can be handled at the targets metadata level. Each user can access targets metadata only if they know the public key and if they are able to access the metadata as per the access control used to keep the registry metadata private.
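
To make the quoted design concrete, here is a minimal sketch of computing a Merkle root over per-role version information. It is not the code in this PR. The leaf encoding (sorted-key JSON), the use of SHA-256 with one-byte domain prefixes, the rule of duplicating the last node on odd-sized levels, and the names `leaf_hash` and `merkle_root` are all assumptions made for illustration.

```python
import hashlib
import json


def leaf_hash(role_name, version):
    """Hash one leaf: a role name plus its version number (assumed encoding)."""
    encoded = json.dumps({"name": role_name, "version": version}, sort_keys=True)
    return hashlib.sha256(b"\x00" + encoded.encode("utf-8")).digest()


def merkle_root(versions):
    """Return the hex Merkle root over a {role_name: version} mapping."""
    if not versions:
        raise ValueError("cannot build a tree without leaves")
    # Sort by role name so every party derives the same leaf order.
    level = [leaf_hash(name, versions[name]) for name in sorted(versions)]
    while len(level) > 1:
        if len(level) % 2 == 1:
            # Assumed rule: duplicate the last node on odd-sized levels.
            level.append(level[-1])
        level = [
            hashlib.sha256(b"\x01" + level[i] + level[i + 1]).digest()
            for i in range(0, len(level), 2)
        ]
    return level[0].hex()


versions = {"targets": 3, "projects/foo": 7, "projects/bar": 2}
root_before = merkle_root(versions)
versions["projects/foo"] = 8  # any version bump changes the root
root_after = merkle_root(versions)
```

Because the root commits to every leaf, placing only the root in the signed timestamp metadata is what lets the client skip downloading the full snapshot listing.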

joshuagl

Really cool work here @mnm678! Apologies for the delayed review, it took me quite a bit of time to understand the design (and I'm still not 100% sure that I do). IIRC the purpose of this PR was to PoC the concept before working on a TAP? A TAP would be really helpful for fully reviewing this submission.

On that understanding, I've avoided reviewing the code in too much detail, though I have made various comments and recommendations while reading through. The major code structure issue I'd like to see addressed if this were intended for a merge is to reduce the amount of branching/special-casing for handling the Merkle tree related logic. The code will be much easier to follow, and therefore maintain, with the Merkle logic existing mostly in single-purpose methods (e.g. a generate_snapshot_merkle_metadata and a generate_snapshot_metadata, rather than a combined generate_snapshot_metadata which takes a boolean indicating whether to generate Merkle metadata).
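
As a rough illustration of the single-purpose split suggested above, the sketch below shows two generators with no shared boolean flag. Only the two function names come from this review. The parameters and return shapes are simplified placeholders and do not match the real signatures in tuf/repository_lib.py.

```python
def generate_snapshot_metadata(versions, expiration):
    """Build classic snapshot metadata listing every role's version.

    Placeholder signature: the real function takes different arguments.
    """
    return {
        "_type": "snapshot",
        "expires": expiration,
        "meta": {
            name + ".json": {"version": version}
            for name, version in sorted(versions.items())
        },
    }


def generate_snapshot_merkle_metadata(versions):
    """Build one unsigned leaf record per role for the Merkle tree.

    Placeholder shape: each leaf simply carries a name and a version.
    """
    return [
        {"name": name, "version": version}
        for name, version in sorted(versions.items())
    ]


versions = {"targets": 3, "role1": 1}
snapshot = generate_snapshot_metadata(versions, "2030-01-01T00:00:00Z")
leaves = generate_snapshot_merkle_metadata(versions)
```

With this shape the caller decides which generator to invoke, and neither function needs to know about the other's output format.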

From a PoC/TAP authoring/review perspective I'd like to see the following questions addressed:

Do we have a feeling for how the proposed mechanisms perform? It would be good to create something like @lukpueh's scripts that test the performance of delegating to hashed bins to get a feel for the performance of the Merkle tree builder on large repositories like PyPI or Dockerhub
Do you expect the tree building algorithm to be part of the specification? Can clients/other implementations be compatible without that knowledge?
For large repositories this will generate a lot of files, what's the garbage collection story for Merkle tree files? Should we consider formalising guidelines around storing them to make cleanup easier? For example, it might make sense to use a directory per tree with the directory named for the root? We could look at other systems with similar tree structures for inspiration (i.e. Git Objects)
I'd appreciate some time in a TAP being spent explaining why the Merkle files are different from all of the other metadata files we create (they don't use the signable format and are not signed), though perhaps that's obvious to most of the TAP editors?
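
One possible shape for the directory-per-tree idea raised in the garbage collection question is sketched below. Nothing here is specified by TUF or implemented in this PR: the function names `write_tree` and `collect_garbage`, the `-snapshot.json` file suffix, and the layout of one directory per root hash are assumptions for illustration, and role names containing path separators are not handled.

```python
import json
import shutil
from pathlib import Path


def write_tree(merkle_directory, root_hash, leaf_files):
    """Write every file of one tree under a directory named for its root hash."""
    tree_dir = Path(merkle_directory) / root_hash
    tree_dir.mkdir(parents=True, exist_ok=True)
    for name, contents in leaf_files.items():
        # Assumed naming: <role>-snapshot.json, one file per leaf.
        (tree_dir / (name + "-snapshot.json")).write_text(json.dumps(contents))
    return tree_dir


def collect_garbage(merkle_directory, current_root_hash):
    """Delete every tree directory except the one for the current root."""
    removed = []
    for entry in Path(merkle_directory).iterdir():
        if entry.is_dir() and entry.name != current_root_hash:
            shutil.rmtree(entry)
            removed.append(entry.name)
    return sorted(removed)
```

Keeping each tree in its own directory means cleanup is a single directory removal per stale root, rather than a scan for individual leaf files.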

joshuagl · 2020-09-08T15:09:49Z

tuf/client/updater.py

+      metadata_role:
+        The name of the metadata role. This should not include a file extension.
+    <Exceptions>
+      tuf.exceptions.RepositoryError:


Note: I'm not sure this is the right exception, but which exceptions to use where is somewhat of an open question...

Do you have one you think would fit better?

No. Sorry, I wasn't very clear. I meant this more as an FYI for later and had intended to link to #1131

joshuagl · 2020-09-08T15:11:45Z

tuf/client/updater.py

+    merkle_root = self.metadata['current']['timestamp']['merkle_root']
+
+    # Download Merkle path
+    self._update_metadata(metadata_role + '-snapshot', 1000, snapshot_merkle=True)


Should the length be a constant? How did we come up with 1000?

I moved this constant to tuf.settings to avoid the magic number here, but the choice of 1000 could probably use more discussion.

joshuagl · 2020-09-09T14:28:03Z

tuf/repository_lib.py

+  # this path to the client for verification
+  return root, leaves
+
+def write_merkle_paths(root, leaves, storage_backend, merkle_directory):


Suggested change

-def write_merkle_paths(root, leaves, storage_backend, merkle_directory):
+def _write_merkle_paths(root, leaves, storage_backend, merkle_directory):

Internal functions should be named with a leading underscore.

joshuagl · 2020-09-09T14:29:19Z

tuf/client/updater.py

+      # If merkle root is set, do not update snapshot metadata. Instead,
+      # download the relevant merkle path when downloading a target.


Suggested change

-      # If merkle root is set, do not update snapshot metadata. Instead,
-      # download the relevant merkle path when downloading a target.
+      # If merkle root is set, do not update snapshot metadata. Instead,
+      # we will download the relevant merkle path later when downloading
+      # a target.

joshuagl · 2020-09-09T14:38:57Z

tuf/client/updater.py

-    updated_metadata_object = metadata_signable['signed']
+    if snapshot_merkle:
+      # Snaphot merkle files are not signed
+      updated_metadata_object=metadata_signable


Suggested change

-      updated_metadata_object=metadata_signable
+      updated_metadata_object = metadata_signable

updated_metadata_object might be a signable, or not. Having the same variable hold different types is often not a good sign. See my comment above about making a separate method for updating Merkle files.

joshuagl · 2020-09-11T10:59:09Z

tuf/repository_lib.py

+def _print_merkle_tree(node, level):
+  """
+  Recursive function used by print_merkle_tree
+  """
+  print('--'* level + node.hash())
+  if not node.isLeaf():
+    _print_merkle_tree(node.left(), level + 1)
+    _print_merkle_tree(node.right(), level + 1)
+  else:
+    print('--' * (level+1) + node.name())
+
+
+
+def print_merkle_tree(root):
+  """
+  Helper function to print merkle tree contents for demos and verification
+  of the Merkle tree contents
+  """
+  print('')
+  _print_merkle_tree(root, 0)
+
+
+
+


Do we want/need to merge these demo/debug helpers?

These might be useful for saving the full tree for auditing purposes. I need to think more about the exact process for merkle tree auditing, but I'll include something in the TAP and we can decide then if these are needed.


joshuagl · 2020-09-11T11:17:43Z

tuf/formats.py

+  leaf_contents = SCHEMA.OneOf([VERSIONINFO_SCHEMA,
+                              METADATA_FILEINFO_SCHEMA]),


It's not clear to me why the leaf_contents can be one of two types? Aren't VERSIONINFO_SCHEMA objects already conformant to METADATA_FILEINFO_SCHEMA, because in the latter the hashes and length fields are optional?

This is the same definition used in the FILEINFODICT_SCHEMA used by snapshot metadata. I can remove the VERSIONINFO_SCHEMA from both places if it is redundant.

Maybe @lukpueh knows why they are both listed for snapshot?

joshuagl · 2020-09-11T12:43:39Z

tuf/client/updater.py

+    merkle_root = self.metadata['current']['timestamp']['merkle_root']
+
+    # Download Merkle path
+    self._update_metadata(metadata_role + '-snapshot', 1000, snapshot_merkle=True)


As above, rather than overload _update_metadata and special-case that function for snapshot Merkle files, it might be a bit cleaner to have a separate method for updating snapshot Merkle files?

joshuagl · 2020-09-11T12:47:15Z

tuf/repository_lib.py

+  def right(self):
+    return self._right
+
+  def isLeaf(self):


That's not a very Pythonic function name. Consider is_leaf, or just a boolean attribute leaf. This should probably be a property of Node?
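
A minimal sketch of the property-based alternative follows. This `Node` class is illustrative only and is not the class from the PR: the attribute names `digest`, `left`, `right`, and `name` are assumptions, and the digests used in the example are placeholder strings.

```python
class Node:
    """Illustrative Merkle tree node; leaves have no children."""

    def __init__(self, digest, left=None, right=None, name=None):
        self.digest = digest
        self.left = left
        self.right = right
        self.name = name

    @property
    def is_leaf(self):
        """True when the node has no children."""
        return self.left is None and self.right is None


leaf_a = Node("aa", name="targets")
leaf_b = Node("bb", name="role1")
parent = Node("cc", left=leaf_a, right=leaf_b)
```

Callers then write `node.is_leaf` without parentheses, which reads like the attribute it conceptually is.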

mnm678 · 2020-09-11T16:05:24Z

Thanks @joshuagl!

IIRC the purpose of this PR was to PoC the concept before working on a TAP? A TAP would be really helpful for fully reviewing this submission.

Yes, the goal is to turn this feature into a TAP before finalizing anything. I made some initial design decisions in this PR, but we can revisit them as needed.

Do we have a feeling for how the proposed mechanisms perform? It would be good to create something like @lukpueh's scripts that [test the performance of delegating to hashed bins](https://gist.github.com/lukpueh/724bd1d7b477f201a9f199b037d85747) to get a feel for the performance of the Merkle tree builder on large repositories like PyPI or Dockerhub

I did some experiments locally and the performance seemed pretty good, but I agree that formalizing these in test cases is a good next step. This feature is only really needed for large repositories, so we have to make sure it scales well.

Do you expect the tree building algorithm to be part of the specification? Can clients/other implementations be compatible without that knowledge?

We can discuss this in the TAP, but IMO we should include the tree building algorithm in the POUF. Implementations will need to know the algorithm in order to ensure that the hashes match, but it's not fundamental for achieving the security properties.
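
A small illustration of why implementations must agree on the tree building algorithm: the two combining rules below are both plausible choices, yet they yield different parent hashes for the same pair of leaves. Both rules and the leaf byte strings are hypothetical and are not taken from this PR or from any specification.

```python
import hashlib


def combine_plain(left, right):
    """Parent hash as SHA-256 over the concatenated child digests."""
    return hashlib.sha256(left + right).hexdigest()


def combine_prefixed(left, right):
    """Parent hash with a one-byte domain-separation prefix."""
    return hashlib.sha256(b"\x01" + left + right).hexdigest()


left = hashlib.sha256(b"targets:3").digest()
right = hashlib.sha256(b"role1:1").digest()

root_plain = combine_plain(left, right)
root_prefixed = combine_prefixed(left, right)
```

A client using one rule would reject a root produced with the other, even though both trees commit to identical leaf data.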

For large repositories this will generate a lot of files, what's the garbage collection story for Merkle tree files? Should we consider formalising guidelines around storing them to make cleanup easier? For example, it might make sense to use a directory per tree with the directory named for the root? We could look at other systems with similar tree structures for inspiration (i.e. [Git Objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects))

The merkle tree files do not need to be accessed once new files are generated, so it makes sense to store/delete these together. I don't think garbage collection is currently addressed in the specification, but we can include a description of this either there or in the TAP.

I'd appreciate some time in a TAP being spent explaining why the Merkle files are different from all of the other metadata files we create (they don't use the signable format and are not signed), though perhaps that's obvious to most of the TAP editors?

Based on my initial understanding, they do not have to be signed because all the data in the Merkle files is validated using the Merkle root (which is signed by timestamp). I will certainly include a more detailed explanation of this in the TAP, and we can discuss whether a signature is actually needed.

joshuagl · 2020-09-14T08:56:59Z

Do you expect the tree building algorithm to be part of the specification? Can clients/other implementations be compatible without that knowledge?
We can discuss this in the TAP, but IMO we should include the tree building algorithm in the POUF. Implementations will need to know the algorithm in order to ensure that the hashes match, but it's not fundamental for achieving the security properties.

That sounds reasonable to me.

For large repositories this will generate a lot of files, what's the garbage collection story for Merkle tree files? Should we consider formalising guidelines around storing them to make cleanup easier? For example, it might make sense to use a directory per tree with the directory named for the root? We could look at other systems with similar tree structures for inspiration (i.e. [Git Objects](https://git-scm.com/book/en/v2/Git-Internals-Git-Objects))
The merkle tree files do not need to be accessed once new files are generated, so it makes sense to store/delete these together. I don't think garbage collection is currently addressed in the specification, but we can include a description of this either there or in the TAP.

Yeah. It's not addressed in the specification at present, and unlikely to be added there as we move away from encoding notions of files in the specification (theupdateframework/specification#103). It would be good to address this in the TAP for the purposes of the reference implementation and adding something to the proposed secondary literature (theupdateframework/specification#91).

I'd appreciate some time in a TAP being spent explaining why the Merkle files are different from all of the other metadata files we create (they don't use the signable format and are not signed), though perhaps that's obvious to most of the TAP editors?
Based on my initial understanding, they do not have to be signed because all the data in the Merkle files is validated using the Merkle root (which is signed by timestamp). I will certainly include a more detailed explanation of this in the TAP, and we can discuss whether a signature is actually needed.

Of course, timestamp is signed and the nodes in the Merkle tree are protected by cryptographic hashes that are verified (contents match hash) as the files are downloaded. I'd certainly appreciate more explanation of this in the TAP, thank you.

mnm678 · 2020-09-17T19:38:31Z

I added an initial TAP for this feature at theupdateframework/taps#125.

joshuagl · 2020-11-26T16:26:36Z

I've converted this PoC to a draft PR, hope you don't mind

Generate a snapshot merkle tree when writing snapshot metadata and use the root hash in timestamp. This is a work in progress commit. Signed-off-by: marinamoore <mnm678@gmail.com>