Design doc - Publish Dandisets that contain Zarr archives #1833

Closed
wants to merge 27 commits into from
e71ee11
Update gitignore
kabilar Jan 26, 2024
c0c9d66
Add requirements doc
kabilar Jan 26, 2024
dd40a75
Update requirements doc
kabilar Jan 26, 2024
2987b81
Update requirements doc
kabilar Jan 26, 2024
aa684fa
Update current implementation
kabilar Jan 29, 2024
fc15230
Update requirements
kabilar Jan 29, 2024
b7d88ad
Fix text
kabilar Jan 29, 2024
95bb260
Add solutions section
kabilar Jan 31, 2024
9acf245
Update title
kabilar Jan 31, 2024
02b5dbc
Add technical specifications
kabilar Jan 31, 2024
acb1367
Add use case
kabilar Jan 31, 2024
7dcae93
Add use case 3
kabilar Jan 31, 2024
cfe20a2
Add TODO
kabilar Jan 31, 2024
881513e
Remove link
kabilar Jan 31, 2024
a138173
Update doc/design/zarr-publish-1.md
kabilar Feb 13, 2024
6b84ed0
Update doc/design/zarr-publish-1.md
kabilar Feb 13, 2024
2faa5b8
Revert "Update gitignore"
kabilar Feb 15, 2024
b062af1
Revert "Revert "Update gitignore""
kabilar Feb 15, 2024
22d014c
Merge branch 'zarr-doc' of https://github.com/kabilar/linc-archive in…
kabilar Feb 15, 2024
31af71d
Revert "Update gitignore"
kabilar Feb 15, 2024
57eaa62
Update doc/design/zarr-publish-1.md
kabilar Feb 15, 2024
41d9449
Reorder steps
kabilar Feb 16, 2024
f2be2dd
Merge branch 'zarr-doc' of https://github.com/kabilar/linc-archive in…
kabilar Feb 16, 2024
50163a2
Update potential solutions section
kabilar Feb 22, 2024
75977f6
Update requirements
kabilar Feb 22, 2024
585756c
Add details to requirement 2
kabilar Feb 22, 2024
58ad723
Update introduction section
kabilar Feb 22, 2024
62 changes: 62 additions & 0 deletions doc/design/zarr-publish-1.md
@@ -0,0 +1,62 @@
# Publishing Dandisets that contain Zarr archives

This document describes the current implementation of publishing Dandisets with Zarr archives, desired use cases, and the associated requirements.
Note that once the requirements for `Use case 1` are implemented, `Use cases 2-3` will also become possible.

## Current implementation

When a non-Zarr asset blob is updated, a new copy of that file is uploaded to the S3 bucket. Zarr archives are too large for multiple copies to be created, so a Zarr archive is uploaded once and then updated in place. As a consequence of this design, a Zarr archive would have to become immutable once its Dandiset is published, so that the published Dandiset remains immutable. Currently, a Dandiset cannot be published if it contains a Zarr asset. For more details, see the [zarr-support-3 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/zarr-support-3.md).

## Use case 1

Publish a Dandiset containing one or more Zarr archives, and subsequently update those Zarr archives.

The publishing procedure would follow the description found in the [publish-1 design doc](https://github.com/dandi/dandi-archive/blob/master/doc/design/publish-1.md). A modified publishing procedure that includes Zarr archives is summarized below.

1. User uploads a new Dandiset that includes one or more Zarr archives.
1. User publishes the Dandiset and thereby creates a new immutable version of the Dandiset.
1. User uploads updated Zarr archives to the `Draft` version of the Dandiset.
1. User repeats steps 2 and 3.
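
The workflow above can be sketched with a small in-memory model. This is only an illustration under stated assumptions: the `Dandiset` class and its `upload_zarr` and `publish` methods are hypothetical names, not part of the DANDI API, and each Zarr archive is reduced to a mapping from chunk path to a made-up content digest.

```python
from dataclasses import dataclass, field
from types import MappingProxyType


@dataclass
class Dandiset:
    """Hypothetical model: one mutable draft plus a list of frozen versions."""

    draft: dict = field(default_factory=dict)      # zarr name -> {chunk path: digest}
    published: list = field(default_factory=list)  # read-only snapshots

    def upload_zarr(self, name, chunks):
        # Steps 1 and 3: uploading only ever modifies the draft version.
        self.draft.setdefault(name, {}).update(chunks)

    def publish(self):
        # Step 2: copy the draft so later uploads cannot alter this version.
        snapshot = MappingProxyType(
            {name: MappingProxyType(dict(chunks)) for name, chunks in self.draft.items()}
        )
        self.published.append(snapshot)
        return len(self.published)  # hypothetical version number, starting at 1


ds = Dandiset()
ds.upload_zarr("image.zarr", {"0/0": "aaa", "0/1": "bbb"})
v1 = ds.publish()
ds.upload_zarr("image.zarr", {"0/1": "ccc"})  # update one chunk in the draft
v2 = ds.publish()
```

In this sketch the first published snapshot still maps `0/1` to its original digest after the draft is updated, which is the immutability property that publishing has to preserve. A real implementation would have to achieve the same effect without copying chunk data.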

## Use case 2

Upload a Zarr archive to an embargoed Dandiset.

## Use case 3

Reuse a Zarr archive in more than one Dandiset.

Allow for a Zarr archive that is uploaded as part of an original Dandiset to be packaged in a new Dandiset without duplicating the Zarr archive. The new Dandiset could be created by potentially different authors and could contain additional raw and/or analyzed data. This feature has been previously implemented for other asset types with [add_asset_to_dandiset.py](https://gist.github.com/satra/29404d965226e4c99fb48e7502953503#file-add_asset_to_dandiset-py). Further details of this feature request have been previously documented in #1792.

## Requirements (Target date: April 30, 2024)

1. Publish Dandisets that contain Zarr archives.
2. If the same Zarr archive is uploaded to multiple Dandisets, then the Zarr archive should not be re-uploaded. This requirement would mirror the behavior of non-Zarr asset blobs.
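
One way to think about requirement 2 is to key each Zarr archive on a digest of its contents, so that a second Dandiset adds a reference rather than a second copy. The sketch below is a minimal illustration under stated assumptions: `zarr_digest` and `ZarrStore` are hypothetical names, and the digest shown is a stand-in, not the checksum that DANDI computes for Zarr archives.

```python
import hashlib


def zarr_digest(chunks):
    """Digest over sorted (path, chunk digest) pairs; a stand-in checksum."""
    h = hashlib.sha256()
    for path in sorted(chunks):
        h.update(f"{path}\0{chunks[path]}\n".encode())
    return h.hexdigest()


class ZarrStore:
    """Hypothetical registry that stores each distinct archive once."""

    def __init__(self):
        self.archives = {}    # digest -> chunks (stands in for objects in S3)
        self.references = {}  # digest -> set of Dandiset identifiers
        self.uploads = 0      # number of times chunk data was actually stored

    def add(self, dandiset_id, chunks):
        digest = zarr_digest(chunks)
        if digest not in self.archives:
            # Only unseen content triggers an upload.
            self.archives[digest] = dict(chunks)
            self.uploads += 1
        self.references.setdefault(digest, set()).add(dandiset_id)
        return digest


store = ZarrStore()
chunks = {"0/0": "aaa", "0/1": "bbb"}
first = store.add("000001", chunks)
second = store.add("000002", chunks)  # same content, different Dandiset
```

Both calls return the same digest and only the first one stores data. The reference set also records which Dandisets still depend on an archive, which matters when deciding whether it can be deleted.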

## Implementation details

1. Design a lightweight object to version Zarr archives.
    1. For a candidate implementation see https://github.com/dandi/zarr-manifests/.
2. Minimize storage costs in the design.

## Potential solutions

1. Earthmover's [Arraylake](https://earthmover.io/blog/arraylake-beta-launch)
    1. Notes
        1. Edits of the Zarr archive must happen through the Arraylake Python API, and thus the `dandi-cli` should be updated.
    2. Questions
        1. Egress costs?
        2. Formal testing of Python API and infrastructure to ensure data integrity?

2. Create manifest file with paths and version IDs for each chunk for a specific version of the Zarr archive.
    1. Candidate implementation - https://github.com/dandi/zarr-manifests/
    2. Steps
        1. Initiate S3 bucket versioning
    3. Questions
        1. Store the manifest file in a database instead of S3 for improved performance?
    4. Constraints
        1. If the Zarr archive must be re-chunked then the user would need to upload the entire Zarr archive.
        2. Garbage collection would need to be updated.

3. Implement a Django backend for Zarr
    1. Stores data in a Postgres database that references the Zarr chunks in S3.
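
The manifest approach in solution 2 can be sketched as follows. With S3 bucket versioning enabled, each overwrite of a chunk yields a new version ID, so a manifest that records one version ID per chunk path pins a specific state of the Zarr archive. Everything here is an assumption made for illustration: the function names and manifest layout are hypothetical and are not taken from the `dandi/zarr-manifests` repository, and the version IDs are invented strings rather than values returned by S3.

```python
import json


def build_manifest(zarr_id, listing):
    """Record the current version ID of every chunk.

    `listing` is an iterable of (path, version_id) pairs, as might be
    derived from a listing of a versioned bucket.
    """
    entries = {path: version_id for path, version_id in listing}
    return {"zarr_id": zarr_id, "entries": dict(sorted(entries.items()))}


def changed_paths(old, new):
    """Paths that were added, removed, or rewritten between two manifests."""
    paths = set(old["entries"]) | set(new["entries"])
    return sorted(p for p in paths if old["entries"].get(p) != new["entries"].get(p))


def collectable(all_versions, manifests):
    """Object versions that no manifest references and that could be deleted."""
    referenced = {(p, v) for m in manifests for p, v in m["entries"].items()}
    return sorted(set(all_versions) - referenced)


v1 = build_manifest("zarr-1", [("0/0", "ver-a"), ("0/1", "ver-b")])
v2 = build_manifest("zarr-1", [("0/0", "ver-a"), ("0/1", "ver-c")])  # chunk 0/1 rewritten
all_versions = [("0/0", "ver-a"), ("0/1", "ver-b"), ("0/1", "ver-c"), ("0/1", "ver-x")]

print(json.dumps(v2, indent=2))
print(changed_paths(v1, v2))
print(collectable(all_versions, [v1, v2]))
```

Two points from the list above show up in the sketch. Unchanged chunks share a version ID across manifests, so a new version costs storage only for rewritten chunks; a re-chunked archive would change every path and share nothing. Garbage collection can no longer delete superseded chunk versions outright, because a published manifest may still reference them.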