Zarr access concerns - esp. for embargoed or from browser #1745

Open
magland opened this issue Nov 10, 2023 · 9 comments
Labels
zarr Issues with Zarr hosting/processing/etc.

Comments

@magland (Contributor) commented Nov 10, 2023

I have some concerns about access to DANDI Zarr assets from the browser and for embargoed dandisets. I think this could likely be solved by creating a kerchunk index for each Zarr asset, though I'm not certain. If so, I'd like to suggest making it a high priority to integrate kerchunk into the DANDI upload process.

Edit: Rather than kerchunk, I propose a different solution; see later comments.

I'll explain based on my current understanding (which may be limited) of remote access to Zarr directories.

@alejoe91 showed me this nice example of reading from a Zarr archive in a public AIND bucket:

import zarr

# reading via the s3:// protocol requires the s3fs package to be installed;
# anon=True requests unsigned (anonymous) access, which works because this bucket is public
remote_zarr_location = "s3://aind-open-data/ecephys_625749_2022-08-03_15-15-06_nwb_2023-05-16_16-34-55/ecephys_625749_2022-08-03_15-15-06_nwb/ecephys_625749_2022-08-03_15-15-06_experiment1_recording1.nwb.zarr/"
zarr_root = zarr.open(remote_zarr_location, storage_options=dict(anon=True))
print(zarr_root.attrs.keys())
for k in zarr_root.keys():
    print(k)
# dict_keys(['.specloc', 'namespace', 'neurodata_type', 'nwb_version', 'object_id'])
# acquisition
# analysis
# ...

And it also works when reading this DANDI Zarr example prepared by @CodyCBakerPhD:

import zarr

# same pattern, against the DANDI staging bucket (also public)
remote_zarr_location = 's3://dandi-api-staging-dandisets/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c/'
zarr_root = zarr.open(remote_zarr_location, storage_options=dict(anon=True))
print(zarr_root.attrs.keys())
for k in zarr_root.keys():
    print(k)
# dict_keys(['.specloc', 'namespace', 'neurodata_type', 'nwb_version', 'object_id'])
# acquisition
# analysis
# ...

However, if I give zarr the plain HTTP URL, it can only see the top-level attributes in the Zarr tree:

import zarr

# no storage_options here, so zarr falls back to an HTTP store, which can fetch
# individual keys (like .zattrs) but cannot list the contents of a "directory"
remote_zarr_location = 'https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c/'
zarr_root = zarr.open(remote_zarr_location)
print(zarr_root.attrs.keys())
for k in zarr_root.keys():
    print(k)
# dict_keys(['.specloc', 'namespace', 'neurodata_type', 'nwb_version', 'object_id'])
# NO KEYS WERE FOUND

(Side note: if I use the DANDI API URL it doesn't work at all: https://api-staging.dandiarchive.org/api/assets/a617e96e-72cd-4bb8-ab20-b3d6bdc8ecd1/download/)

This highlights the fact that you cannot use plain HTTP fetch requests to read the tree structure of a Zarr directory in an S3 bucket, because there is no way to get a directory listing (unless the admin enables public bucket listing, which is strongly discouraged). Instead you need to use the S3 API, which requires AWS credentials (unless the bucket is public).
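
To make the distinction concrete, here is a minimal sketch (assuming the s3fs package, which zarr uses under the hood for s3:// URLs). The directory listing comes from the S3 ListObjectsV2 API, which has no equivalent over plain HTTP GETs against object URLs:

import s3fs

# anonymous (unsigned) access; only works because this bucket is public
fs = s3fs.S3FileSystem(anon=True)

# this issues an S3 ListObjectsV2 request to enumerate keys under the prefix;
# a browser doing plain fetch requests has no way to do this
for key in fs.ls("dandi-api-staging-dandisets/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c"):
    print(key)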

So this creates two problems:

  • People won't be able to traverse embargoed Zarr assets at all (even from Python), because the bucket is not public and you won't want to give out AWS credentials (though maybe you have a plan for this?).
  • It is very difficult to read a Zarr asset (even a public one) from a browser (e.g., neurosift), because you need the S3 API rather than simple HTTP fetch requests. (It's possible I haven't explored this enough, but that's my impression so far.)

As I mentioned, a possible solution is to use kerchunk to create a JSON index for every Zarr asset on DANDI. I don't know whether this will satisfy all the requirements, but it would be great to start trying at this early stage.

Another related concern is that the use of advanced compression codecs in Zarr assets might make them impossible to read directly from a browser.
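
(For reference, here is a sketch of how one could audit which compressors a public Zarr asset uses, reusing the s3:// remote_zarr_location from the example above and recursing manually over groups with the zarr-python API:)

import zarr

def show_compressors(group, prefix=""):
    # print the compressor configured on every array in the hierarchy
    for name, arr in group.arrays():
        print(prefix + name, arr.compressor)
    for name, sub in group.groups():
        show_compressors(sub, prefix + name + "/")

root = zarr.open(remote_zarr_location, storage_options=dict(anon=True))
show_compressors(root)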

See also flatironinstitute/neurosift#70

@magland (Contributor, Author) commented Nov 10, 2023

So I misunderstood something about kerchunk: it doesn't create an index of a Zarr. Rather, it creates a Zarr-like index of an HDF5 file.

So I don't think kerchunk is the solution. What's needed is something simpler: an index file of the recursive subdirectory structure of the Zarr directory, as sketched below.
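
Something like this hypothetical script, run at upload time, could produce the kind of index I have in mind (the file names here are just for illustration):

import json
import os

def build_zarr_index(zarr_dir):
    # collect the relative path of every file in the Zarr directory tree
    paths = []
    for root, _dirs, files in os.walk(zarr_dir):
        for name in files:
            paths.append(os.path.relpath(os.path.join(root, name), zarr_dir))
    return sorted(paths)

# a browser could then fetch this single JSON file instead of listing the bucket
with open("zarr_index.json", "w") as f:
    json.dump(build_zarr_index("example.nwb.zarr"), f)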

@magland (Contributor, Author) commented Nov 10, 2023

I think I found the solution: consolidate_metadata() / open_consolidated(), which creates/uses a .zmetadata file storing all the Zarr metadata in one place. This dramatically reduces the number of remote reads and makes it possible to traverse a Zarr even in a private bucket, or from the browser.

https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.consolidate_metadata

I think if we could generate this .zmetadata for all the Zarr assets, it would solve these issues.
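
For illustration, a minimal sketch of producing and consuming consolidated metadata with zarr-python (the local store path is hypothetical):

import zarr

# writer side (e.g., before upload): collects all .zgroup/.zarray/.zattrs
# documents into a single .zmetadata file at the root of the store
zarr.consolidate_metadata("example.nwb.zarr")

# reader side: one metadata fetch, then the whole tree is traversable
root = zarr.open_consolidated("example.nwb.zarr")
print(root.tree())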

@satra (Member) commented Nov 10, 2023

we do recommend generating consolidated metadata for the microscopy data. however, even without consolidated metadata one should be able to traverse a zarr tree as long as .zattrs and .zgroup exist. we should ensure that zarr is validated as a container when uploaded.

@magland (Contributor, Author) commented Nov 10, 2023

> we do recommend generating consolidated metadata for the microscopy data. however, even without consolidated metadata one should be able to traverse a zarr tree as long as .zattrs and .zgroup exist. we should ensure that zarr is validated as a container when uploaded.

I don't believe you can traverse the tree from just .zattrs and .zgroup, because they don't contain information about the subgroups or subarrays. For example, here is a .zgroup/.zattrs pair that doesn't contain that information:

https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c/.zgroup
https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c/.zattrs
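
For reference, in the Zarr v2 format a .zgroup typically contains nothing more than:

{
    "zarr_format": 2
}

and .zattrs holds only user attributes, so neither file enumerates child groups or arrays.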

@satra (Member) commented Nov 10, 2023

@magland - sorry i misread the original post. you were able to read things with s3. indeed for http we used to have our own api endpoint. i can't seem to find this any more, so checking with @AlmightyYakob.

@satra (Member) commented Nov 10, 2023

also @magland, we currently don't support embargoed zarr files. there is some refactoring that's going to happen with embargo. post that we may enable zarrbargo as we call it.

@magland (Contributor, Author) commented Nov 10, 2023

Makes sense @satra. I just want to put in my request to have consolidated .zmetadata in the root folder of the Zarr archives so that they will work with neurosift and other browser-based tools.
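
To illustrate why this helps browser tools, here is a sketch of a client reading consolidated metadata with a single plain HTTP request (the URL is a hypothetical placeholder; the key layout follows the consolidated-metadata format):

import json
from urllib.request import urlopen

# one HTTP GET replaces all of the S3 listing calls
url = "https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/<zarr-id>/.zmetadata"
consolidated = json.load(urlopen(url))

# "metadata" maps keys like "acquisition/.zgroup" and
# "acquisition/ElectricalSeries/.zarray" to their JSON documents
for key in consolidated["metadata"]:
    print(key)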

@CodyCBakerPhD commented

At least for NWB Zarr files, I've requested that we always call that automatically on file creation (hdmf-dev/hdmf-zarr#139). I can add an inspector check for it as well if you'd like.

@magland (Contributor, Author) commented Nov 10, 2023

@yarikoptic added the zarr label Jan 31, 2024