This repository has been archived by the owner on Sep 27, 2022. It is now read-only.
In OCFL/spec#522 I talked about a way we might use S3 as a backend. One point I made is that versioning would require pulling the entire object down from S3, which would be terrible for very large objects (TB-sized, though even a few GB would make updates slow).
In his reply, @pwinckles stated that the most recent inventory is likely the only thing needed.
This ticket is about thinking through how to detect that a file has been deleted in the next version without needing the whole object (which I can't see being possible, hence why I'm asking for help!).
Consider the following:
```
v1                     v2
|- File A - hash X     |- File A - hash X
```

No change; do not create a new version.
```
v1                     v2
|- File A - hash X     |- File A - hash X
                       |- File B - hash Y
```

New file; create a new version referencing File A -> v1 and File B -> v2.
```
v1                     v2
|- File A - hash X     |- File A - hash Z
|- File B - hash Y     |- File B - hash Y
```

File changed (File A); create a new version referencing File B -> v1 and File A -> v2.
Our library works by digesting the path (walking the object tree and producing pairs of files and hashes) and then comparing the new tree to the existing tree. In all of the cases above we get the expected behaviour. However, if we don't have the whole dataset available in the new version, then changing a file results in all of the other data being removed from the next version:
```
v1                     v2
|- File A - hash X     |- File A - hash Z
|- File B - hash Y
```

File changed (File A); create a new version referencing File A -> v2, but File B ends up removed from the new version.
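The tree-diff behaviour described above can be sketched as follows. This is an illustrative sketch, not the library's real code; `diff_trees` and all names are hypothetical. It shows why a correct result with the full dataset turns into false deletions with a partial one:

```python
def diff_trees(existing: dict, new: dict) -> dict:
    """Compare two {path: hash} maps and classify every path."""
    added = sorted(p for p in new if p not in existing)
    changed = sorted(p for p in new if p in existing and new[p] != existing[p])
    # Any path missing from the new tree is treated as deleted.
    removed = sorted(p for p in existing if p not in new)
    return {"added": added, "changed": changed, "removed": removed}

v1 = {"File A": "X", "File B": "Y"}

# Full dataset available: File A changed, File B untouched -- correct result.
print(diff_trees(v1, {"File A": "Z", "File B": "Y"}))
# {'added': [], 'changed': ['File A'], 'removed': []}

# Partial dataset (only the changed file was supplied): File B is wrongly
# classified as removed, which is the failure mode shown in the last scenario.
print(diff_trees(v1, {"File A": "Z"}))
# {'added': [], 'changed': ['File A'], 'removed': ['File B']}
```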
So, if I've thought this through correctly, comparing against the latest inventory rather than a full digest means we will pick up file changes and file additions, but we won't be able to remove a file from one version to the next, as we would always need to include everything that is referenced in the latest version.
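The inventory-only comparison just described amounts to a merge. In this hypothetical sketch (not the library's API), the incoming partial update is laid over the state from the latest inventory, so adds and changes are captured but a deletion can never be expressed:

```python
def next_state(latest_inventory_state: dict, incoming: dict) -> dict:
    """Build the next version's state from the latest inventory alone.

    `incoming` holds only the files supplied for this update. Everything
    in the latest state is carried forward, so additions and changes are
    applied but nothing can ever be dropped.
    """
    merged = dict(latest_inventory_state)  # carry everything forward
    merged.update(incoming)                # apply adds and changes
    return merged

state = next_state({"File A": "X", "File B": "Y"},
                   {"File A": "Z", "File C": "W"})
# File B survives even though it was not supplied; there is no way to
# signal "File B was deleted" through this comparison alone.
print(state)  # {'File A': 'Z', 'File B': 'Y', 'File C': 'W'}
```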
Is there another way to detect file deletions across versions without needing all of the object data up to that point?
In the current mode the library just takes the new version and adds it to an existing OCFL object. In the existing use cases this is fine, as a source (e.g. Omeka, a filesystem, or something else) pushes data to an OCFL repo. That is, the current state is available somewhere else and we manipulate it there before updating the OCFL object.
However, in the PARADISEC world we're looking at using the OCFL repo as the primary source of truth. So, in that case, the library needs to know how to perform operations when updating an object, so that we don't need to rehydrate a new version from the latest version each time.
For example rather than getting a copy of the current state and then manipulating the data, what we want is to just perform the minimal operations on the new version and have the library handle the change sensibly.
- add a new file to an object: add it to the new version and merge with the existing object, changing only inventory files
- change an existing file: add a new version of the file and merge with the existing object, rewriting the relevant inventories
- move a file: operate only on the inventories to document the move
- rename a file: operate only on the inventories to document the rename
- delete a file: operate only on the inventories
These operations would still create new versions of the object. It's just that the operations would be handled in a more sophisticated way, so that we wouldn't need all of the current state (data) in order to work out the diff and decide whether to create a new version or not.
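The operation-based update proposed above might look something like this. It is a hypothetical interface, not the library's actual API: each operation mutates only the in-memory state map that would be written to the next inventory, so no existing content files need to be read or rehydrated.

```python
class VersionBuilder:
    """Accumulate explicit operations against the latest inventory state.

    Only the state map (path -> digest) is touched. Content files would be
    uploaded separately for add/update; move, rename, and delete are
    inventory-only and earlier versions keep their content untouched.
    """

    def __init__(self, latest_state: dict):
        self.state = dict(latest_state)

    def add(self, path: str, digest: str):
        self.state[path] = digest          # new file in the new version

    def update(self, path: str, digest: str):
        self.state[path] = digest          # new content, same logical path

    def move(self, src: str, dst: str):
        self.state[dst] = self.state.pop(src)  # inventory-only

    def rename(self, src: str, dst: str):
        self.move(src, dst)                # a rename is a move within the object

    def delete(self, path: str):
        del self.state[path]               # explicit: no diff inference needed

b = VersionBuilder({"File A": "X", "File B": "Y"})
b.update("File A", "Z")
b.delete("File B")
print(b.state)  # {'File A': 'Z'}
```

Because the caller states its intent explicitly, the "is this file deleted or just not supplied?" ambiguity from the diff-based approach never arises.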
@ptsefton I've looked through the UTS OCFL library commits and branches and can't see anything like what you mentioned re: identifying operations like delete. Can you please link here what you were telling me about?
Conversation moved from OCFL/spec#525