Zarr access concerns - esp. for embargoed or from browser #1745

Open
magland opened this issue Nov 10, 2023 · 9 comments
Labels
zarr Issues with Zarr hosting/processing/etc.

Comments

@magland (Contributor) commented Nov 10, 2023

I have some concerns about access to DANDI Zarr assets from the browser and for embargoed dandisets. I think this could likely be solved by creating a kerchunk index for each Zarr asset, though I'm not certain. If so, I'd like to suggest making it a high priority to integrate kerchunk into the DANDI upload process.

Edit: Rather than kerchunk, I propose a different solution; see later comments.

I'll explain based on my current understanding (which may be limited) of remote access to Zarr directories.

@alejoe91 showed me this nice example of reading from a Zarr archive in a public AIND bucket:

import zarr

# reading via the s3:// protocol requires the s3fs package to be installed;
# anon=True requests unsigned (anonymous) access, which works because this bucket is public
remote_zarr_location = "s3://aind-open-data/ecephys_625749_2022-08-03_15-15-06_nwb_2023-05-16_16-34-55/ecephys_625749_2022-08-03_15-15-06_nwb/ecephys_625749_2022-08-03_15-15-06_experiment1_recording1.nwb.zarr/"
zarr_root = zarr.open(remote_zarr_location, storage_options=dict(anon=True))
print(zarr_root.attrs.keys())
for k in zarr_root.keys():
    print(k)
# dict_keys(['.specloc', 'namespace', 'neurodata_type', 'nwb_version', 'object_id'])
# acquisition
# analysis
# ...

And it also works when reading this DANDI Zarr example prepared by @CodyCBakerPhD:

import zarr

# same pattern, against the DANDI staging bucket (also public)
remote_zarr_location = 's3://dandi-api-staging-dandisets/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c/'
zarr_root = zarr.open(remote_zarr_location, storage_options=dict(anon=True))
print(zarr_root.attrs.keys())
for k in zarr_root.keys():
    print(k)
# dict_keys(['.specloc', 'namespace', 'neurodata_type', 'nwb_version', 'object_id'])
# acquisition
# analysis
# ...

However, if I give zarr the plain HTTP URL, it can only see the top-level attributes in the Zarr tree:

import zarr

# no storage_options here, so zarr falls back to an HTTP store, which can fetch
# individual keys (like .zattrs) but cannot list the contents of a "directory"
remote_zarr_location = 'https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c/'
zarr_root = zarr.open(remote_zarr_location)
print(zarr_root.attrs.keys())
for k in zarr_root.keys():
    print(k)
# dict_keys(['.specloc', 'namespace', 'neurodata_type', 'nwb_version', 'object_id'])
# NO KEYS WERE FOUND

(Side note: if I use the DANDI API URL it doesn't work at all: https://api-staging.dandiarchive.org/api/assets/a617e96e-72cd-4bb8-ab20-b3d6bdc8ecd1/download/)

This highlights the fact that you cannot use plain HTTP fetch requests to read the tree structure of a Zarr directory in an S3 bucket, because there is no way to get a directory listing (unless the admin enables public bucket listing, which is strongly discouraged). Instead you need to use the S3 API, which requires AWS credentials (unless the bucket is public).
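
To make the distinction concrete, here is a minimal sketch (assuming the s3fs package, which zarr uses under the hood for s3:// URLs). The directory listing comes from the S3 ListObjectsV2 API, which has no equivalent over plain HTTP GETs against object URLs:

import s3fs

# anonymous (unsigned) access; only works because this bucket is public
fs = s3fs.S3FileSystem(anon=True)

# this issues an S3 ListObjectsV2 request to enumerate keys under the prefix;
# a browser doing plain fetch requests has no way to do this
for key in fs.ls("dandi-api-staging-dandisets/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c"):
    print(key)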

So this creates two problems:

  • People won't be able to traverse embargoed Zarr assets at all (even from Python), because the bucket is not public and you won't want to give out AWS credentials (though maybe you have a plan for this?).
  • It is very difficult to read a Zarr asset (even a public one) from a browser (e.g., neurosift), because you need the S3 API rather than simple HTTP fetch requests. (It's possible I haven't explored this enough, but that's my impression so far.)

As I mentioned, a possible solution is to use kerchunk to create a JSON index for every Zarr asset on DANDI. I don't know whether this will satisfy all the requirements, but it would be great to start trying at this early stage.

Another related concern is that the use of advanced compression codecs in Zarr assets might make them impossible to read directly from a browser.
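
(For reference, here is a sketch of how one could audit which compressors a public Zarr asset uses, reusing the s3:// remote_zarr_location from the example above and recursing manually over groups with the zarr-python API:)

import zarr

def show_compressors(group, prefix=""):
    # print the compressor configured on every array in the hierarchy
    for name, arr in group.arrays():
        print(prefix + name, arr.compressor)
    for name, sub in group.groups():
        show_compressors(sub, prefix + name + "/")

root = zarr.open(remote_zarr_location, storage_options=dict(anon=True))
show_compressors(root)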

See also flatironinstitute/neurosift#70

@magland (Contributor, Author) commented Nov 10, 2023

So I misunderstood something about kerchunk: it doesn't create an index of a Zarr. Rather, it creates a Zarr-like index of an HDF5 file.

So I don't think kerchunk is the solution. What's needed is something simpler: an index file of the recursive subdirectory structure of the Zarr directory, as sketched below.
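
Something like this hypothetical script, run at upload time, could produce the kind of index I have in mind (the file names here are just for illustration):

import json
import os

def build_zarr_index(zarr_dir):
    # collect the relative path of every file in the Zarr directory tree
    paths = []
    for root, _dirs, files in os.walk(zarr_dir):
        for name in files:
            paths.append(os.path.relpath(os.path.join(root, name), zarr_dir))
    return sorted(paths)

# a browser could then fetch this single JSON file instead of listing the bucket
with open("zarr_index.json", "w") as f:
    json.dump(build_zarr_index("example.nwb.zarr"), f)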

@magland (Contributor, Author) commented Nov 10, 2023

I think I found the solution: consolidate_metadata() / open_consolidated(), which creates/uses a .zmetadata file storing all the Zarr metadata in one place. This dramatically reduces the number of remote reads and makes it possible to traverse a Zarr even in a private bucket, or from the browser.

https://zarr.readthedocs.io/en/stable/api/convenience.html#zarr.convenience.consolidate_metadata

I think if we could generate this .zmetadata for all the Zarr assets, it would solve these issues.
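
For illustration, a minimal sketch of producing and consuming consolidated metadata with zarr-python (the local store path is hypothetical):

import zarr

# writer side (e.g., before upload): collects all .zgroup/.zarray/.zattrs
# documents into a single .zmetadata file at the root of the store
zarr.consolidate_metadata("example.nwb.zarr")

# reader side: one metadata fetch, then the whole tree is traversable
root = zarr.open_consolidated("example.nwb.zarr")
print(root.tree())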

@satra (Member) commented Nov 10, 2023

we do recommend generating consolidated metadata for the microscopy data. however, even without consolidated metadata one should be able to traverse a zarr tree as long as .zattrs and .zgroup exist. we should ensure that zarr is validated as a container when uploaded.

@magland (Contributor, Author) commented Nov 10, 2023

> we do recommend generating consolidated metadata for the microscopy data. however, even without consolidated metadata one should be able to traverse a zarr tree as long as .zattrs and .zgroup exist. we should ensure that zarr is validated as a container when uploaded.

I don't believe you can traverse the tree from just .zattrs and .zgroup, because they don't contain information about the subgroups or subarrays. For example, here is a .zgroup/.zattrs pair that doesn't contain that information:

https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c/.zgroup
https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/fe45a10f-3aa4-4549-84b0-8389955beb0c/.zattrs
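
For reference, in the Zarr v2 format a .zgroup typically contains nothing more than:

{
    "zarr_format": 2
}

and .zattrs holds only user attributes, so neither file enumerates child groups or arrays.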

@satra (Member) commented Nov 10, 2023

@magland - sorry i misread the original post. you were able to read things with s3. indeed for http we used to have our own api endpoint. i can't seem to find this any more, so checking with @AlmightyYakob.

@satra (Member) commented Nov 10, 2023

also @magland, we currently don't support embargoed zarr files. there is some refactoring that's going to happen with embargo. post that we may enable zarrbargo as we call it.

@magland (Contributor, Author) commented Nov 10, 2023

Makes sense @satra. I just want to put in my request to have consolidated .zmetadata in the root folder of the Zarr archives so that they will work with neurosift and other browser-based tools.
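
To illustrate why this helps browser tools, here is a sketch of a client reading consolidated metadata with a single plain HTTP request (the URL is a hypothetical placeholder; the key layout follows the consolidated-metadata format):

import json
from urllib.request import urlopen

# one HTTP GET replaces all of the S3 listing calls
url = "https://dandi-api-staging-dandisets.s3.amazonaws.com/zarr/<zarr-id>/.zmetadata"
consolidated = json.load(urlopen(url))

# "metadata" maps keys like "acquisition/.zgroup" and
# "acquisition/ElectricalSeries/.zarray" to their JSON documents
for key in consolidated["metadata"]:
    print(key)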

@CodyCBakerPhD commented

At least for NWB Zarr files, I've requested that we always call that automatically on file creation (hdmf-dev/hdmf-zarr#139). I can add an inspector check for it as well if you'd like.

@magland (Contributor, Author) commented Nov 10, 2023

@yarikoptic added the zarr label Jan 31, 2024