Add embargo re-design doc #1772

Merged (1 commit) on Dec 11, 2023
118 changes: 118 additions & 0 deletions doc/design/embargo-redesign.md
# Embargo Redesign

Author: Jacob Nesbitt

The current embargo infrastructure is both inefficient and error-prone, as we store embargoed data in one bucket and regular data in another. To unembargo data, it must be copied from one bucket to the other, which costs both money and time. Moreover, when it comes to the intersection of Zarrs and embargo (“Zarrbargo”), this approach is a non-starter. Therefore, a new approach is required.

## Problems with the Existing Approach

### Inefficiency

Embargoed data is currently uploaded to a bucket separate from the main sponsored bucket. Unembargoing involves copying data from that bucket into the sponsored bucket. Performing this copy with individually managed copy-object commands has shown major performance problems, both because errors must be monitored and because of the sheer number of objects that need to be copied.

### Error-proneness

As Dandisets grow in size, comprising more and more assets, the probability rises that the unembargo process, which copies objects from bucket to bucket one at a time, will fail. Such failures are recoverable, but recovery requires further engineering effort to make the process self-healing. This adds complexity and thus a continued risk of failed unembargoes.

### Zarrbargo non-starterness

The drawbacks described in the previous two sections are all compounded by Zarr archives, which so far have involved data scales one to two orders of magnitude larger than non-Zarr data. The upload time for the largest Zarr we have processed was on the order of one month; copying such a Zarr to the sponsored bucket would take additional time of the same order of magnitude.

Zarr archives are by nature made of many small files; large Zarr archives may encompass 100,000 files or more, raising the probability of a failure during unembargo.

## In-Place Object Tagging

With a bucket policy that denies public access to any object carrying an `embargoed` tag, an object can be restricted from public access simply by adding that tag to it.

```json
{
    "Effect": "Deny",
    "Principal": "*",
    "Action": "s3:*",
    "Resource": "arn:aws:s3:::dandiarchive/*",
    "Condition": {
        "StringEquals": {
            "s3:ExistingObjectTag/embargoed": "true"
        },
        "StringNotEquals": {
            "aws:PrincipalAccount": "769362853226"
        }
    }
}
```

With this bucket policy in place, all data in the bucket remains public by default. However, any object that carries the `embargoed=true` tag is restricted from public access. Authorized users (dandiset owners, etc.) who wish to access such an object can obtain a pre-signed URL from the API, as sketched below.
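For illustration, here is a minimal boto3 sketch (the bucket and key names are hypothetical) of how the API can hand an authorized user a time-limited URL for an embargoed object. Because the URL is signed with the archive's own AWS credentials, the `aws:PrincipalAccount` exception in the policy above lets the request through:

```python
import boto3

s3 = boto3.client("s3")

# Hypothetical names; the real values come from the asset blob record.
BUCKET = "dandiarchive"
KEY = "blobs/abc/123/example.nwb"


def embargoed_object_url(bucket: str, key: str, expires_in: int = 3600) -> str:
    """Return a time-limited URL for an embargoed object.

    The bucket policy only denies requests made from outside the archive's
    AWS account, so a URL pre-signed with the archive's own credentials
    still works for objects tagged ``embargoed=true``.
    """
    return s3.generate_presigned_url(
        "get_object",
        Params={"Bucket": bucket, "Key": key},
        ExpiresIn=expires_in,
    )


url = embargoed_object_url(BUCKET, KEY)
```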

## Change to Upload Procedure

This new approach requires a change to the upload procedure, so that uploads to embargoed dandisets are tagged with the `embargoed` S3 tag. This can be achieved by including the `embargoed=true` tag as part of the pre-signed put-object URL issued to the user when uploading an embargoed file, such that if the **client** does not include that tag, the upload will fail.
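As a rough sketch (boto3; the bucket, key, and file names are hypothetical), the server can sign the `Tagging` parameter into the pre-signed PUT URL; the client then has to send the matching `x-amz-tagging` header, otherwise the signature check fails and the upload is rejected:

```python
import boto3
import requests  # used only for the example client-side upload

s3 = boto3.client("s3")

# Hypothetical names; the real key is derived from the upload record.
BUCKET = "dandiarchive"
KEY = "blobs/abc/123/example.nwb"

# Server side: sign the PUT URL with the embargoed tag baked in.
upload_url = s3.generate_presigned_url(
    "put_object",
    Params={"Bucket": BUCKET, "Key": KEY, "Tagging": "embargoed=true"},
    ExpiresIn=3600,
)

# Client side: the x-amz-tagging header must match the signed value,
# otherwise S3 rejects the request with a signature mismatch.
with open("example.nwb", "rb") as f:
    response = requests.put(
        upload_url,
        data=f,
        headers={"x-amz-tagging": "embargoed=true"},
    )
response.raise_for_status()
```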

A diagram of the upload procedure is shown below:

```mermaid
sequenceDiagram
    autonumber
    participant S3
    participant Client as dandi-cli
    participant Server

    Client ->> Server: Request pre-signed S3 upload URL for embargoed dandiset
    Server ->> Client: Pre-signed URL with embargoed tag included
    Client ->> S3: Upload file with embargoed tag
    Client ->> Server: Finalize embargoed file upload
    Server ->> S3: Server verifies access to embargoed file and mints new asset blob
    rect rgb(235, 64, 52)
        Client -->> S3: Unauthorized access is denied
    end
    Client ->> Server: Request pre-signed URL for embargoed file access
    Server ->> Client: If user is permitted, a pre-signed URL is returned
    rect rgb(179, 209, 95)
        Client ->> S3: Embargoed file is successfully accessed
    end
```

## Change to Un-Embargo Procedure

Once the time comes to *un-embargo* those files, all that is required is to remove the `embargoed` tag from all of the objects. This can be achieved with an [S3 Batch Operations Job](https://docs.aws.amazon.com/AmazonS3/latest/userguide/batch-ops-create-job.html), which takes the list of files to operate on (all files belonging to the dandiset) and the desired action (delete/replace tags).

The benefit of this approach is that once the files are uploaded, no further movement is required to change the embargo state, eliminating the storage, egress, and time costs associated with unembargoing from a second bucket. Using S3 Batch Operations to perform the untagging also means we can rely on AWS’s own error reporting mechanisms, while retrying any failed operations requires only minimal engineering effort within the Archive codebase.
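A hedged sketch of creating such a job with boto3's `s3control` client; the account ID, role ARN, manifest location, and report prefix below are placeholders, not the archive's real values:

```python
import boto3

s3control = boto3.client("s3control")

ACCOUNT_ID = "123456789012"  # placeholder AWS account ID
ROLE_ARN = "arn:aws:iam::123456789012:role/batch-tagging"  # placeholder
MANIFEST_BUCKET_ARN = "arn:aws:s3:::dandiarchive-manifests"  # placeholder
MANIFEST_KEY = "unembargo/000123/manifest.csv"  # placeholder CSV of Bucket,Key rows
MANIFEST_ETAG = "REPLACE_WITH_MANIFEST_ETAG"  # ETag of the uploaded manifest object

response = s3control.create_job(
    AccountId=ACCOUNT_ID,
    ConfirmationRequired=False,
    # S3DeleteObjectTagging removes the tag set of each listed object,
    # clearing the ``embargoed`` tag.
    Operation={"S3DeleteObjectTagging": {}},
    Report={
        "Bucket": MANIFEST_BUCKET_ARN,
        "Format": "Report_CSV_20180820",
        "Enabled": True,
        "Prefix": "unembargo/000123/report",
        "ReportScope": "FailedTasksOnly",
    },
    Manifest={
        "Spec": {
            "Format": "S3BatchOperations_CSV_20180820",
            "Fields": ["Bucket", "Key"],
        },
        "Location": {
            "ObjectArn": f"{MANIFEST_BUCKET_ARN}/{MANIFEST_KEY}",
            "ETag": MANIFEST_ETAG,
        },
    },
    Priority=10,
    RoleArn=ROLE_ARN,
)
job_id = response["JobId"]  # persisted on the dandiset model (step 1 below)
```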

### Object Tag Removal Workflow

1. [Create the job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3control/client/create_job.html) from a celery task, storing the resulting Job ID on the dandiset model
2. Use a recurring celery cron task to check every dandiset with a status of “unembargoing” and a non-null Job ID field, using [describe_job](https://boto3.amazonaws.com/v1/documentation/api/latest/reference/services/s3control/client/describe_job.html) to see whether its batch job has finished (sketched after this list)
3. Once a completed S3 Batch job is found, the manifest is downloaded from S3 to ensure that there were no failures
4. If there are no failures, the Job ID is set to null in the DB model, and the embargo status, metadata, etc. are updated to reflect that the dandiset is now `OPEN`.
5. Otherwise, an exception is raised and attended to by the developers.
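A minimal sketch of steps 2-5 (boto3 `describe_job`; the `Dandiset` import path, field names, and status values are assumptions about the Archive's Django models, and for brevity the failure check inspects the job's progress summary rather than downloading the report):

```python
import boto3

from dandiapi.api.models import Dandiset  # hypothetical import path

s3control = boto3.client("s3control")
ACCOUNT_ID = "123456789012"  # placeholder AWS account ID


def check_unembargo_jobs():
    """Recurring celery task: poll in-flight S3 Batch Operations jobs."""
    in_flight = Dandiset.objects.filter(
        embargo_status="UNEMBARGOING",  # assumed status value
        unembargo_job_id__isnull=False,  # assumed field name
    )
    for dandiset in in_flight:
        job = s3control.describe_job(
            AccountId=ACCOUNT_ID, JobId=dandiset.unembargo_job_id
        )["Job"]
        if job["Status"] != "Complete":
            continue  # still running; check again on the next cron tick

        failed = job["ProgressSummary"]["NumberOfTasksFailed"]
        if failed:
            # Step 5: surface the failure to developers instead of
            # silently marking the dandiset as open.
            raise RuntimeError(
                f"Un-embargo job {dandiset.unembargo_job_id} had {failed} failed tasks"
            )

        # Step 4: clear the job ID and mark the dandiset as open.
        dandiset.unembargo_job_id = None
        dandiset.embargo_status = "OPEN"
        dandiset.save()
```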

A diagram of the un-embargo procedure (pertaining to just the objects) is shown below:

```mermaid
sequenceDiagram
    autonumber
    participant Client
    participant Server
    participant Worker
    participant S3

    Client ->> Server: Un-embargo dandiset
    Server ->> Worker: Dispatch un-embargo task
    Worker ->> S3: List of all dandiset objects is aggregated into a manifest
    Worker ->> S3: S3 Batch Operations job is created
    S3 ->> Worker: Job ID is returned
    Worker ->> Server: Job ID is stored in the database
    S3 ->> S3: Tags on all objects in the supplied manifest are removed
    Note over Worker,S3: After some time, a cron job is run <br> which checks the status of the S3 job
    Worker ->> Server: Job ID is retrieved
    Worker ->> S3: Job status retrieved, worker observes that <br> the job has finished and was successful
    Worker ->> Server: Job ID is cleared, dandiset embargo status is set to OPEN

    rect rgb(179, 209, 95)
        Client ->> S3: Data is now publicly accessible
    end
```

## Experimental Results

- Deleting tags of 10,000 objects took ~18 seconds
- Deleting tags of 100,000 objects took ~2 minutes