This repository has been archived by the owner on Sep 27, 2022. It is now read-only.

On detecting deleted files across versions #3

Open
marcolarosa opened this issue Jan 13, 2021 · 2 comments

Comments

@marcolarosa
Contributor

Conversation moved from OCFL/spec#525

In OCFL/spec#522 I talked about a way we might use S3 as a backend. One point I made is that versioning would require pulling the entire object down from S3, which would be terrible for very large objects (TB-sized objects, though even a few GB would make updates slow).

In his reply @pwinckles stated that the most recent inventory is likely the only thing needed.

This ticket is about thinking through how to detect that a file has been deleted in the next version without needing the whole object (which I can't see being possible, hence why I'm asking for help!).

Consider the following:

  v1                                     v2
  |- File A - hash X                     |- File A - hash X

No change; do not create new version.

  v1                                     v2
  |- File A - hash X                     |- File A - hash X
                                         |- File B - hash Y

New file; create new version referencing File A -> v1 and File B -> v2

  v1                                     v2
  |- File A - hash X                     |- File A - hash Z
  |- File B - hash Y                     |- File B - hash Y

File changed (File A); create new version referencing File B -> v1, File A -> v2

Our library works by digesting the path (walking the object tree and producing pairs of files + hashes) and then comparing the new tree to the existing tree. In all of the cases above we get the expected behaviour. However, if the new version did not contain the whole dataset, then changing one file would result in every other file being removed from the next version.

  v1                                     v2
  |- File A - hash X                     |- File A - hash Z
  |- File B - hash Y                     
    
File changed (File A); create new version referencing File A -> v2, but File B ends up removed from the new version.
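
A minimal sketch of the tree comparison described above, assuming each version is reduced to a mapping of logical path to digest. The function name `diff_trees` and the data shapes are hypothetical and are not taken from the library; the point is only that a path missing from the incoming tree is reported as removed, whether it was deleted or simply not supplied.

```python
def diff_trees(previous, incoming):
    """Compare two {logical_path: digest} mappings (hypothetical shape)."""
    added = sorted(p for p in incoming if p not in previous)
    removed = sorted(p for p in previous if p not in incoming)
    changed = sorted(
        p for p in incoming if p in previous and incoming[p] != previous[p]
    )
    return {"added": added, "changed": changed, "removed": removed}


v1 = {"File A": "X", "File B": "Y"}
full_v2 = {"File A": "Z", "File B": "Y"}   # whole object available
partial_v2 = {"File A": "Z"}               # only the changed file supplied

print(diff_trees(v1, full_v2))
# {'added': [], 'changed': ['File A'], 'removed': []}
print(diff_trees(v1, partial_v2))
# {'added': [], 'changed': ['File A'], 'removed': ['File B']}
```

With the full tree, only File A is flagged. With the partial tree, File B is flagged as removed even though nobody asked for it to be deleted.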

So, if I've thought this through correctly, comparing against the latest inventory rather than a full digest means we will pick up file changes and file additions, but we won't be able to remove a file from one version to the next: an omitted file cannot be told apart from a deleted one, so we would always need to include everything that is referenced in the latest version.
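
To make that limitation concrete, here is a rough sketch of the "compare against the latest inventory" approach, assuming the latest version state is simplified to a mapping of logical path to digest (a real OCFL inventory maps digests to lists of paths). The name `merge_into_state` is made up for illustration. Everything in the latest state is carried forward and the supplied files are layered on top, so additions and changes are detected but an omission can never mean "delete".

```python
def merge_into_state(latest_state, supplied):
    """Carry forward the latest state and overlay only the supplied files.

    Both arguments are {logical_path: digest} mappings (simplified,
    hypothetical shape). Returns the next state and whether it differs
    from the latest one, i.e. whether a new version is needed.
    """
    next_state = dict(latest_state)
    next_state.update(supplied)
    return next_state, next_state != latest_state


latest = {"File A": "X", "File B": "Y"}

# Change File A only: File B is carried forward rather than dropped.
state, new_version = merge_into_state(latest, {"File A": "Z"})
print(state, new_version)
# {'File A': 'Z', 'File B': 'Y'} True

# Supplying nothing for File B cannot express "delete File B".
state, new_version = merge_into_state(latest, {"File A": "X"})
print(state, new_version)
# {'File A': 'X', 'File B': 'Y'} False
```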

Is there another way to detect file deletions across versions without needing all of the object data up to that point?

@marcolarosa
Contributor Author

From discussion with @ptsefton.

In the current mode the library just takes the new version and adds it to an existing OCFL object. In the existing use cases this is fine, as a source (e.g. Omeka, a filesystem, or something else) pushes data to an OCFL repo. That is, the current state is available somewhere else and we manipulate it there before updating the OCFL object.

However, in the PARADISEC world we're looking at using the OCFL repo as the primary source of truth. In that case, the library needs to know how to perform operations when updating an object, so that we don't need to rehydrate a new version from the latest version each time.

For example, rather than getting a copy of the current state and then manipulating the data, we want to perform only the minimal operations on the new version and have the library handle the change sensibly.

  • add a new file to an object - add to new version and merge with existing object changing only inventory files
  • change an existing file - add a new version of the file and merge with existing object rewriting relevant inventories
  • move a file - operate only on the inventories to document the move
  • rename a file - operate only on the inventories to document the rename
  • delete a file - operate only on the inventories

These operations would still create new versions of the object. It's just that the operations would be handled in a more sophisticated way, so that we wouldn't need all of the current state (data) in order to work out the diff and decide whether to version or not.
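
One way the operations listed above could look, as a sketch only: the version state is again simplified to a mapping of logical path to digest, and `apply_operations` with its tuple-based operation format is a hypothetical interface, not something the library provides. Only add and change need new content; move, rename and delete touch nothing but the state that would be written to the inventory.

```python
def apply_operations(state, operations):
    """Apply explicit operations to a {logical_path: digest} state.

    Returns (next_state, uploads), where uploads lists the logical paths
    whose content must be stored in the new version. Existing content is
    never read. The operation tuples are a hypothetical format.
    """
    next_state = dict(state)
    uploads = []
    for op in operations:
        kind = op[0]
        if kind in ("add", "change"):
            _, path, digest = op
            if kind == "add" and path in next_state:
                raise ValueError(f"cannot add, path exists: {path}")
            if kind == "change" and path not in next_state:
                raise KeyError(f"cannot change, no such path: {path}")
            if next_state.get(path) != digest:
                next_state[path] = digest
                uploads.append(path)
        elif kind in ("move", "rename"):
            _, source, target = op
            if target in next_state:
                raise ValueError(f"target already exists: {target}")
            # Same digest under a new logical path; no content is copied.
            next_state[target] = next_state.pop(source)
        elif kind == "delete":
            _, path = op
            del next_state[path]
        else:
            raise ValueError(f"unknown operation: {kind}")
    return next_state, uploads


v1 = {"File A": "X", "File B": "Y"}
v2, uploads = apply_operations(v1, [
    ("change", "File A", "Z"),
    ("add", "File C", "W"),
    ("move", "File B", "archive/File B"),
])
v3, uploads_v3 = apply_operations(v2, [("delete", "File C")])
```

Here `v2` holds File A with digest Z, File C with digest W, and File B's unchanged digest under `archive/File B`, with only File A and File C needing an upload. `v3` drops File C with no upload at all, which is the deletion case that comparing trees could not express.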

@marcolarosa
Contributor Author

@ptsefton I've looked through the UTS OCFL library commits and branches and can't see anything like what you mentioned re: identifying operations like delete. Can you please link here what you were telling me about?
