Should we say anything about consistency or otherwise of version metadata between version inventories? #421

Closed
zimeon opened this issue Feb 14, 2020 · 22 comments · Fixed by #425

@zimeon
Contributor

zimeon commented Feb 14, 2020

Imagine an object with v1 and v2. In each of v1/inventory.json and v2/inventory.json there is a block "versions": { "v1": { "created": ..., "message": ..., "state": ...., "user": ... } }.

I feel that it is implied that the state must give a consistent set of files for v1 in each case (though the values may be different if different digest algorithms are used for different versions). However, do the values of created, message and user need to (MUST) be consistent? Or is changing them an allowed but discouraged (SHOULD) way to fix metadata in a system that might have immutable versions? There is currently no comment in https://ocfl.io/draft/spec/#version-inventory

@neilsjefferies
Member

SHOULD works for me. But definitely generate warning.

@zimeon
Contributor Author

zimeon commented Feb 18, 2020

Consensus from editors' call was that we should go with SHOULD

@rosy1280 rosy1280 added this to the 1.0 milestone Feb 24, 2020
@zimeon
Contributor Author

zimeon commented Feb 24, 2020

In a comment on the PR #425, @ahankinson wrote:

What legitimate reasons are there for updating the created, message, and user values of previous versions?

My take on this (and hence motivation for only SHOULD be consistent) is that we should start from the idea that we want to support version immutability -- once a version, including its inventory, has been written it might not be possible or desirable to change it. With this starting point I can imagine wanting to correct created, message, or user values in the inventory of a new version that were somehow incorrect in a previous version. Perhaps a message was missed off, perhaps the created datetime was wrong in some systematic way, perhaps the user had not been linked to an address, ...

@ahankinson
Contributor

ahankinson commented Feb 24, 2020

I'm really sorry -- I must not have understood the discussion on the editors call on this, so I apologize if I'm re-hashing this...

I think of the versions section of the inventory as being as immutable as the version directories themselves. Since the inventory file (inventory.json) itself should not change in previous versions (b/c it's part of the content, and the content is immutable) that means that we can potentially have the situation where two or more different blocks are given for the same version -- one in the first version, and changed ones in subsequent versions, which breaks immutability. (We would also have to allow for a change of the inventory.json.sha512 sidecar files to match the new hash in all previous versions as well).

@awoods
Member

awoods commented Feb 24, 2020

I think the scenario is the following, using the "minimal OCFL Object" example:

  • The inventory.json in the v1 directory is as shown in the example
  • However, in the hypothetical v2 directory, the inventory.json looks like:
{
 ...
  "versions": {
    "v1": {
      "created": "2018-10-02T12:00:00Z",
      "message": "One file",
      "state": {
        "7545b8...f67": [ "file.txt" ]
      },
      "user": {
        "address": "bob@example.org",
        "name": "Bob"
      }
    },
    "v2": {
...some updates...
    }
  }
}

The inventory.json file in the v1 directory remains unchanged; however, the content in the "v1" block of the inventory.json file in the v2 directory has a different "address" and "name".
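The scenario above can be detected mechanically. The following is a minimal sketch, assuming both inventories have already been parsed from JSON into dictionaries; the function name `metadata_differences`, the trimmed inventories, and the "Alice" values for the original v1 block are hypothetical and chosen only for illustration.

```python
# Fields treated as provenance metadata in this discussion.
METADATA_KEYS = ("created", "message", "user")


def metadata_differences(older_inventory, newer_inventory):
    """Return (version, key, old_value, new_value) for each metadata mismatch."""
    differences = []
    newer_versions = newer_inventory["versions"]
    for version, old_block in older_inventory["versions"].items():
        new_block = newer_versions.get(version)
        if new_block is None:
            continue  # a missing version block is a separate problem, not handled here
        for key in METADATA_KEYS:
            old_value = old_block.get(key)
            new_value = new_block.get(key)
            if old_value != new_value:
                differences.append((version, key, old_value, new_value))
    return differences


# Hypothetical, heavily trimmed inventories: only the compared fields are present.
v1_inventory = {
    "versions": {
        "v1": {
            "created": "2018-10-02T12:00:00Z",
            "message": "One file",
            "user": {"address": "alice@example.org", "name": "Alice"},
        }
    }
}
v2_inventory = {
    "versions": {
        "v1": {
            "created": "2018-10-02T12:00:00Z",
            "message": "One file",
            "user": {"address": "bob@example.org", "name": "Bob"},
        },
        "v2": {
            "created": "2018-10-03T12:00:00Z",
            "message": "Some updates",
            "user": {"address": "bob@example.org", "name": "Bob"},
        },
    }
}

differences = metadata_differences(v1_inventory, v2_inventory)
for version, key, old_value, new_value in differences:
    print(f"warning: {version} {key!r} changed from {old_value!r} to {new_value!r}")
```

Whether such a mismatch is reported as a warning (SHOULD) or an error (MUST) is exactly the question under discussion; the sketch only finds the differences.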

At a minimum, I agree with the sentiment of this issue that we SHOULD not allow such changes... potentially, MUST not allow them.

@zimeon
Contributor Author

zimeon commented Feb 25, 2020

Note that the state part of the versions block cannot be immutable in the case that the v2 inventory uses a different digestAlgorithm than the v1 inventory -- a real use case to migrate to a new digest algorithm (and actually a rather cool case where one might not change the content of an object at all, just write a new version with inventory.json).
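One possible way to check that two inventories describe the same files for a version, even when their digests differ, is to resolve each state entry through that inventory's own manifest and compare logical paths against content paths. This is a rough sketch rather than a prescribed validation method: the function names are hypothetical, the digest strings are placeholders rather than real digests, and it assumes state keys match manifest keys exactly (no case normalisation).

```python
def logical_to_content(inventory, version):
    """Map each logical path in a version's state to its set of content paths."""
    manifest = inventory["manifest"]
    state = inventory["versions"][version]["state"]
    mapping = {}
    for digest, logical_paths in state.items():
        content_paths = frozenset(manifest[digest])
        for logical_path in logical_paths:
            mapping[logical_path] = content_paths
    return mapping


def state_is_consistent(older_inventory, newer_inventory, version):
    """True if both inventories give the same logical files for the version.

    A newer manifest may list extra content paths for the same digest, so the
    older content paths only need to be a subset of the newer ones.
    """
    old_map = logical_to_content(older_inventory, version)
    new_map = logical_to_content(newer_inventory, version)
    if old_map.keys() != new_map.keys():
        return False
    return all(old_map[path] <= new_map[path] for path in old_map)


# Hypothetical trimmed inventories; digest values are placeholders.
v1_inventory = {
    "digestAlgorithm": "sha256",
    "manifest": {"sha256-placeholder": ["v1/content/file.txt"]},
    "versions": {"v1": {"state": {"sha256-placeholder": ["file.txt"]}}},
}
v2_inventory = {
    "digestAlgorithm": "sha512",
    "manifest": {"sha512-placeholder": ["v1/content/file.txt"]},
    "versions": {
        "v1": {"state": {"sha512-placeholder": ["file.txt"]}},
        "v2": {"state": {"sha512-placeholder": ["file.txt"]}},
    },
}

consistent = state_is_consistent(v1_inventory, v2_inventory, "v1")
print("v1 state consistent across inventories:", consistent)
```

Here v2 changes no content at all and only re-expresses the same state under a different digest algorithm, which is the migration case described above.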

Also, note that this does not create any issues around inventory.json.sha512 sidecar files -- they remain fixed for all previous versions because the inventories themselves are unchanged. We are only talking about the description of v1 in the v2 inventory and similar. (As in @awoods's description above)

@neilsjefferies
Member

I think you mean "does not create any issues"! My previous comment stands - we have examples where it is useful even if it is to be discouraged in general.

@zimeon
Contributor Author

zimeon commented Feb 25, 2020

Oh yes, I have added the "does not" into my #421 (comment)

@ahankinson
Contributor

ahankinson commented Feb 25, 2020

So does that mean that for the v2 version we have to assume that the creator, date, and message can change?

If so, how do we tell which one is 'correct' in terms of data provenance? If v1 claims that "Person A" created it, but then someone later decided that they wanted to expunge "Person A" from the record, and so wrote v2 to put their own name in as the creator of v1 -- which person has ultimate responsibility for v1? The original person, or the person that later versions claim?

(Sorry about the mixup with sha512 files -- thank you for correcting me)

@julianmorley
Contributor

Throwing my hat into the ring on too little sleep ... but my take is that the entire contents of the version block are immutable, and subsequent higher-version inventory files should not have metadata in those blocks that differs from the metadata in the older inventory files.

My reasoning is provenance: we have no way of knowing, or tracking, why the address value for version 1 in v1/inventory.json differs from that in version 1 in v2/inventory.json. If the address was wrongly entered in the original v1, note it in a message block of a subsequent version; don't mess with the history.

@rosy1280
Contributor

rosy1280 commented Feb 26, 2020

i think it's a SHOULD, i see valid reasons for changing it in later inventories.

however! if folks do this, can we say they SHOULD (or must if we're being adamant) indicate in the message that the change has occurred and why? maybe an unrealistic ask but someone who knew what they were doing would want to do that.

@julianmorley
Contributor

Assuming that we do allow this (note: I'm still a 'no'), what are allowable changes?
I'm guessing we're OK with different message, user name and user address values, but not OK with created and definitely not OK with State changes?

@ahankinson
Contributor

With this starting point I can imagine wanting to correct created, message, or user values in the inventory of a new version that were somehow incorrect in a previous version. Perhaps a message was missed off, perhaps the created datetime was wrong in some systematic way, perhaps the user had not been linked to an address, ...

The base assumption we have to work with is that the files belong to the implementer, and that if they want to rewrite history, there's not a whole lot the spec can do about it. I think we all recognize that what happens between observation points (aka "object in motion") is out of our control. So if an institution wanted to change the message, datetime, or user, they are always technically capable of doing this, and nothing in the spec can prevent this. They can even rewrite it going back to previous 'immutable' history states, and a validator would be none the wiser.

Given that "your files are yours to do what you like with them" is a fundamental freedom of the implementer and one which we should not presume to infringe, I don't really see a need to be liberal in what we accept within the confines of the spec. The implementer can always rewrite history; I see it as our duty to point out places and reasons why they shouldn't, and to use spec language to reinforce that point.

Changing the metadata for the same version from one version to another seems like an area where we can be quite strict about our expectations about the nature of versions from one point to another. I think we have to assume that technological problems (a client produces bad JSON, or wrong datetimes) should be caught by a validator at point of ingest, and that such problems should be fundamentally corrected in all versions throughout the object's history. If a message was missed off, that's actually a useful historical artefact. Otherwise, the temptation would be to supply a message, post hoc, that may or may not match the message that the original creator intended.

@awoods
Member

awoods commented Feb 26, 2020

@ahankinson : agreed.
I support MUST language for ensuring that vX blocks must be consistent across inventory.json files.

@neilsjefferies
Member

People change name and address so expecting them to be immutable AND be useful information doesn't really work.

@neilsjefferies
Member

...personally I think name and address should be replaced by an OCFL object reference to a person object which can be versioned properly in its own right.

@rosy1280
Contributor

yes i did thumbs down the above.

@zimeon
Contributor Author

zimeon commented Feb 27, 2020

I agree that going full OCFL-flavor linked-data per #421 (comment) would be a step too far

@neilsjefferies
Member

neilsjefferies commented Feb 28, 2020

@rosy1280 @zimeon Sorry, should have smiley'ed that comment! But allowing the address to be an object reference or something like an ORCID might be sensible. Then name/address/message can be immutable IMHO. I do think @zimeon's example of replacing the digests is a valid thing we need to support in some reasonably elegant way though.

@ahankinson
Contributor

ahankinson commented Feb 28, 2020

Would this point instead to a need to capture digestAlgorithm on state, rather than on the top-level?

@zimeon
Contributor Author

zimeon commented Feb 28, 2020

I don't think so. The key idea is that the digestAlgorithm applies to the manifest, which is linked to state in each and every version. Also, if we migrate an object to a new and better digest algorithm, we do want the digests used for state in old versions to be updated as well.

@ahankinson
Contributor

ahankinson commented Mar 3, 2020

I can definitely see the use case for wanting to change to newer digest algorithms in subsequent inventories. I don't really think we should allow changes to the 'provenance' mechanisms, though (user, message, created).

I think the difference is that the former can always be re-computed -- an old manifest can always get back the sha-512 of the file if we've switched to blake2b for example, and we would already track that change so it's just a matter of saying f(hashalg, file) = hash and you should be able to recover a backwards-compatible inventory from the current version. On the other hand it wouldn't really be possible to trace WHY an institution made a change from Person A in Version 1 to Person B.
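The f(hashalg, file) = hash idea can be sketched with Python's standard `hashlib`. This is an illustration only, assuming the content files are still on disk under the object root; `file_digest`, `rekey_manifest`, and the temporary object layout are hypothetical names and data, and `"blake2b"` here is hashlib's default 512-bit BLAKE2b.

```python
import hashlib
import os
import tempfile


def file_digest(algorithm, path, chunk_size=1024 * 1024):
    """Compute the hex digest of a file with the named hashlib algorithm."""
    hasher = hashlib.new(algorithm)
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def rekey_manifest(manifest, object_root, algorithm):
    """Rebuild a manifest keyed by digests from a different algorithm."""
    rekeyed = {}
    for content_paths in manifest.values():
        for content_path in content_paths:
            digest = file_digest(algorithm, os.path.join(object_root, content_path))
            rekeyed.setdefault(digest, []).append(content_path)
    return rekeyed


# Hypothetical object with a single content file, built in a temporary directory.
with tempfile.TemporaryDirectory() as object_root:
    content_dir = os.path.join(object_root, "v1", "content")
    os.makedirs(content_dir)
    file_path = os.path.join(content_dir, "file.txt")
    with open(file_path, "wb") as handle:
        handle.write(b"hello\n")

    # A manifest as it might look after migrating to blake2b ...
    new_manifest = {file_digest("blake2b", file_path): ["v1/content/file.txt"]}
    # ... from which the sha512-keyed manifest can be recomputed.
    recovered_manifest = rekey_manifest(new_manifest, object_root, "sha512")

print(recovered_manifest)
```

Nothing comparable exists for user, message, or created: there is no function that recovers the original value from the changed one, which is the asymmetry described above.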


6 participants