Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add quirks section to embargo redesign doc #1802

Merged
merged 4 commits into from
Feb 14, 2024
Merged

Add quirks section to embargo redesign doc #1802

merged 4 commits into from
Feb 14, 2024

Conversation

waxlamp
Copy link
Member

@waxlamp waxlamp commented Jan 2, 2024

This PR extends the embargo redesign with some odd edge/corner cases that complicate the semantics of embargo. We need to have some discussion in order to resolve the questions of how the feature should behave in this situations.

@waxlamp waxlamp self-assigned this Jan 2, 2024
Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some initial thoughts -- I think we might want just to formalize some rules/assumptions over embargoed content and dandisets and points of interaction with the user where UNEMBARGOing might happen (explicit request for known asset(s) for a given set of assets, or during upload of a new blob/zarr)


**Scenario.** Roughly the opposite of the last scenario: someone creates an *embargoed* Dandiset 3, uploading asset B to it. They then create an *open* Dandiset 4, also uploading B to that.

**Quirk.** Dandiset 4 now contains a hidden asset, which contradicts the openness of Dandiset 4 and seems to compromise the secrecy of Dandiset 3.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

after re-reading original design above I noted that although it describes technical aspect of the solution on S3 level, it does not cover at all annotation of embargoed "stuff" within archive DB, and might be worth clarifying on what exactly (dandiset? asset? blob? zarr? all of that) and how (metadata? extra DB column? both) is annotated to be either open or embargoed? I guess intention is to just keep it as is now (EmbargoedAssetBlob, EmbargoedZarrArchive + metadata annotation of embargoed dandiset), but feels like a missing detail here which instructs implementation logic and potentially these use-cases / quirks.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I sort of agree with you, but I see it a little bit differently: I don't want to formalize what it means for "assets to be embargoed", because "embargo" is a very specific application-level concept that models NIH-authorized embargo for a very specific situation. In that sense, how we are keeping such embargoed Dandisets' contents hidden from public view is an implementation detail, and I think it would actually be good to use a different term than "embargoed" to describe the state of such assets.

Where we agree (I believe) is the need to formalize what "embargo" means (in my opinion, it is simply the state of a Dandiset which includes its contents and existence being hidden from unauthorized users).

As for implementation: yes, I think we are planning to keep the current models in place, only changing them if (1) the changes improve the code and (2) the requirements/behavior of the system are not changed. (This is essentially what it means to be an "implementation detail".)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ref: #1787 a somewhat related since details of implementation (either to be a part of this design doc or not) and e.g. interaction with metadata remain not formalized in a design doc, or is it?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it is simply the state of a Dandiset which includes its contents and existence being hidden from unauthorized user

i would strike "and existence". it does not matter from an embargo perspective whether dandiset 2024 exists or not. simply knowing that no content about it is exposed is sufficient.

embargo/restricted - only certain authorized individuals/entities have access to that resource (whatever the resource may be).

perhaps a few additional things to note:

  • asset ids and blobs are decoupled
  • blob ids and blob checksums are decoupled and currently fall under deduplication principles rather than embargo principles.

thus from an implementation perspective:

  • a user should be notified if a blob being uploaded to an embargoed dandiset already exists in public view (could easily be possible for a host of reasons). i believe the CLI doesn't allow 0 length files. i'm not sure the API has any restriction.
  • one needs to consider as @yarikoptic wrote as to what happens when these two worlds collide. same blob uploaded to two different embargoed dandisets (also completely possible). current deduplication rules would not allow a different id.

both embargo (being discussed here) and restricted (to be implemented soon) are resource specific concepts. as such i would really bring rebac into the picture. the resources in consideration are dandisets, dandiset metadata, assets, and blobs. whether we allow access control on all of these elements is a policy decision, but implementation wise we should consider them as separate resources. dandisets can contain mixtures of access assets for whatever reason and several users have asked for asset level granularity for access for different use cases. a lot of the considerations in quirks would be covered under rebac, which turns things into policy decisions for access.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you expand the acronym "rebac"? (I have seen it stand for different things and I want to disambiguate what you are referring to.)

dandisets can contain mixtures of access assets for whatever reason and several users have asked for asset level granularity for access for different use cases.

I was really hoping to avoid this type of generality, especially without hard requirements on how this would work. My preference is to work from well-defined concepts such as "NIH embargo" or "restricted access" (the latter is not super well defined but it is at least a domain-level concept) in order to deliver those chunks of features to users.

I understand that users have asked for the fully general idea of mixing access levels of assets, but if we don't understand the actual software requirements here (inlcuding some of the use cases those users are trying to implement), this is a perfect recipe for half-baked implementations plus subsequent rework that we have been desperately trying to avoid.

I am happy to discuss this direction further, but in the current design I would like to pin down how embargo should work in the face of these "quirks" (and some shadow of future capabilities we want to develop). I think we are well on the road to accomplishing this, but I wanted to draw this boundary so that we don't get bogged down in trying to understand three things at once (i.e., embargo, restricted, general asset access).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

i'm fine with not getting bogged down, but if the entire thing will require a complete overhaul again when we implement restricted/general asset access, i would rather have that be discussed in this context.

This is a fair position to take. FWIW, my intuition is that with the ability to hide/open assets already existing, we should be able to build out embargo, and "restricted assets", and the more general mixed-access Dandiset concepts.

For the latter two (which don't exist yet) we still have room to understand requirements and connect them to the existing capabilities of the system--I simply want to make sure we develop that full understanding before anything else (which includes discussions now about those requirements, etc.).

Copy link
Member

@yarikoptic yarikoptic Jan 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

BTW -- isn't "EMBARGOED" is just a "RESTRICTED" with a dandiset-level rule (conditions) for when "RESTRICTED" could be automatically be converted to "OPEN"?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

isn't "EMBARGOED" is just a "RESTRICTED" with a dandiset-level rule

yup, just policies are different as to how access is granted. embargoed currently restricted to owners. for restricted, we will need both owners and viewers who have access. hence the consideration of rebac (resource based) + rbac (role based).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think model could be simplified if EMBARGOED would just become RESTRICTED with the policy for when it to become OPEN, and otherwise to have the same permission models as RESTRICTED, i.e. to have owners and viewers, where the viewers have no authority to change it, e.g. to make it OPEN through the . I think it might be one of the possible ways collaborators on EMBARGOED dataset could leak it: through that Explicit "make content of a dandiset" entry point I was talking about. Sure thing they would still be able to make it OPEN through leaking via uploading them as OPEN to another OPEN dataset (unless we forbid such an action as @waxlamp excercised in this remediation, but I think it would be potentially causing other issues (pretty much a "race condition" on some possibly trivial content as e.g. a json file with just [] or alike)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really do understand the pull to "simplify" the model so that embargo (which we really need to fix), "restricted", and a more general "mixed access" all appear to become simple extensions thereof, but doing so will actually make it harder to deliver the functionality as envisioned, unless we write down requirements for each and then validate that the implementation plan will support all requirements.

I do feel like we're on the verge of figuring out the broad semantics of "restriction" that will support everything, but I am really feeling the exposure to the risk of getting it wrong (and thus triggering another round of discussions and then rework to fix a broken attempted implementation). I'm going to schedule a meeting so we can discuss this synchronously and try to eliminate those risks.

Finally, this brings up a question about when to remove the `embargoed` tag from the object in the S3 bucket: we would insert logic to detect whether the embargoed asset is included in any other embargoed Dandisets, and only trigger the tag removal if there are none. This brings up one major question:

- **Do these situations nullify the effect/need to keep data hidden in the bucket at all?**
The answer to this is probably “no”, but some discussion may help to illuminate answers to the other questions and quirks above.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

those are great user stories! Thank you @waxlamp . I think we should just generalize them into some concise "design decision" which would inform on what is to happen in any of them as a result/remediation. Echoing the first one I would say it could be

  • Anyone with access to EMBARGOED content can make it OPEN

    • I think this would be the crucial statement which would offload responsibility to the owner (or someone with access) to embargoed data. May be a suboptimal example, but it is similar to "leaking the data" -- as soon as someone leaks private data, it becomes known publicly. The same here -- if someone with access to EMBARGOED data decides to upload it publicly, they can.
    • We indeed would want interface(s) to make such transitions from EMBARGOED to OPEN to be explicit. I see two "entry points"
      • Explicit "make content of a dandiset (or prefix within it? e.g. a participant)" OPEN -- that goes to "Object Tag Removal Workflow".
      • Make content OPEN via upload.
    • If we do not assume that checksums are private data -- we would need to verify that the person indeed has access to EMBARGOED content to make it OPEN upon upload. We would need a (missing ATM) workflow here which instead of minting a new blob would just make EMBARGOED to become OPEN upon verification of upload digest.
  • EMBARGOED dandiset can point to OPEN content

  • EMBARGOED content can be made OPEN by any authorized user

  • Only a dandiset without assets pointing to EMBARGOED content can be announced to be OPEN

    • so, explicit action: UNEMBARGOING of dandiset consists of unembargoing any still embargoed content first
    • UNEMBARGOING some content might make one or more dandisets available to transition to OPEN without unembargoing more conent
  • Upload to EMBARGOED dandiset makes new to the archive content EMBARGOED

  • Upload to OPEN dandiset makes content (new or already known to archive) OPEN

    • that is where we might want to indeed add some explicit option to allow for unembargoing of embargoed content, since otherwise people might get a little pissed after someone from their group would "leak" the embargoed data.

But in general it boils down to be able to inform user on the effects of potential unembargo -- would we be able to say how many dandisets would be affected? how many would become OPEN (as in content)?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We indeed would want interface(s) to make such transitions from EMBARGOED to OPEN to be explicit. I see two "entry points"

  • Explicit "make content of a dandiset (or prefix within it? e.g. a participant)" OPEN -- that goes to "Object Tag Removal Workflow".
  • Make content OPEN via upload.

I think the second option is the right one. It is simpler, and it leaves the idea of "embargoed Dandiset" whole (indeed, the other way to make assets in an embargoed Dandiset OPEN is, quoting from your response, so, explicit action: UNEMBARGOING of dandiset consists of unembargoing any still embargoed content first, which also considers "embargo" to be a state on a whole Dandiset).

Everything else you wrote, I believe I understand and agree with. I will rework the PR text to reflect what you've said here and we can do a review round to make sure we agree on how it all should work.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I might be misunderstanding you saying "the right one" here. I do not think of these two options as mutually exclusive alternatives - rather they seems to be the two ways we should allow for making content OPEN. Are you suggesting it ('make content OPEN via upload') to be the only way to make content OPEN?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Are you suggesting it ('make content OPEN via upload') to be the only way to make content OPEN?

This is indeed what I was suggesting. My reasons for this:

  • simplifies the mental model for the user--in particular, the user has less to worry about in terms of managing what is open and what is hidden
  • keeps the concept of embargo simpler: a Dandiset is either embargoed or not, and upon performing the "unembargo" action, all contained assets become open

Much of this comes from trying to control complexity, even if in the future we are going to move to that hairier model where any kind of assets can be present in any Dandiset (and the simplicity of a binary embargo state becomes fuzzier). So I may need to compromise here, but the cost is much greater complexity in what we're attempting to build (I just need to be open about this tradeoff).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think that would work since

  • it would pretty much bring back the original motivation for redesign - problem of Inefficiency , with all issues just offloaded to the uploader instead of our automations which now need to move data between buckets
  • as a result, would make "UNEMBARGO"ing a very time consuming and unpleasant experience thus demotivating users from making their data OPEN.

I think I could argue also that the first reason you stated would not really be fulfilled, and the 2nd one remains pertinent to the proposed first "entry point" to unembaroing content - done directly on the archive:

  • Explicit "make content of a dandiset (or prefix within it? e.g. a participant)" OPEN -- that goes to "Object Tag Removal Workflow".

So may be we are not fully understanding each other here.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes I think we must be miscommunicating a bit. Among other things, I'm not sure how what I proposed causes any performance problems. What I'm suggesting is something like this:

  • uploading an asset to an open dandiset would cause it to become unhidden in the bucket
  • clicking an "unembargo" button would cause all of an embargoed dandiset's (hidden) assets to become unhidden (and the status of the dandiset becomes open)

Both of these "entry points" would work with the redesigned embargo concept (i.e., no performance issues). Allowing users to unhide assets one by one would also be compatible with this way of doing things, now that I think about it (it is equivalent to uploading that asset somewhere else), but I guess I don't see why any user would do so--but perhaps there is a reason (maybe some of the example use cases Satra posted in a different thread).

In any case, maybe we can discuss this face to face.

@waxlamp
Copy link
Member Author

waxlamp commented Jan 24, 2024

@satra I've added some text to reflect the concepts we decided on in our meeting. Please take a look and approve if you like what you see.

Copy link
Member

@satra satra left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

left some comments, but after reading everything, you can ignore most of them. lgtm.

@waxlamp waxlamp merged commit cfa54d1 into master Feb 14, 2024
7 of 8 checks passed
@waxlamp waxlamp deleted the embargo-quirks branch February 14, 2024 01:07
@dandibot
Copy link
Member

🚀 PR was released in v0.3.76 🚀

@dandibot dandibot added the released This issue/pull request has been released. label Feb 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
released This issue/pull request has been released.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants