Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Provide zarrs/ top level folder with support for experimental zarr manifests #43

Closed
yarikoptic opened this issue Jan 31, 2024 · 11 comments · Fixed by #52
Closed

Provide zarrs/ top level folder with support for experimental zarr manifests #43

yarikoptic opened this issue Jan 31, 2024 · 11 comments · Fixed by #52
Assignees
Labels
enhancement New feature or request therefor hier:zarrs Relating to serving the /zarrs/ hierarchy

Comments

@yarikoptic
Copy link
Member

https://github.com/dandi/zarr-manifests has my very crude helper scripts and 2 zarr manifests samples (if desired - can share more -- they are on typhon being generated). ATM we do not have multiple versions per zarr, only 1, but that should be ok for a proof of concept. The point is to show/test that we can efficiently access those zarrs

  • top path to all the urls for those manifests (I wonder now if it should become part of the manifest header??) is https://dandiarchive.s3.amazonaws.com/zarr/{zarr_id}
  • we will redirect to URLs with versionIds in them as provided by manifest.
@jwodder jwodder added the enhancement New feature or request therefor label Jan 31, 2024
@jwodder
Copy link
Member

jwodder commented Jan 31, 2024

@yarikoptic To be clear, you want dandidav to gain a /zarrs/{zarr_id} endpoint that serves the same view as /dandisets/{id}/version/{spec}/{zarr_path}, but with entries looked up via the manifests in that repository rather than via S3, and with individual entries redirecting to S3 URLs with versionIds, correct?

For the record, I don't like that the manifests declare the entry fields via a fields key, and I suspect the current schema will be hard to work with in Rust. I'd rather see the "entries" fields structured like this:

{
    "entries": {
        ".zattrs": {
             "type": "entry",  // or "file" or "blob" or whatever
             "versionId": "...",
             "lastModified": "...",
             "size": 123456789,
             "etag": "..."
        },
        "0": {
            "type": "folder",
            "children": {
                ".zarray": {
                    ...
                }
             }
        }
    }
}

@yarikoptic
Copy link
Member Author

eventually -- yes, but later, whenever archive is ready and produces manifests on S3. For now -- it is more of a "demo feature" so we could demonstrate versioning etc. Re path -- since we already have thousands of zarrs, listing /zarrs/ would be slow. That is why all those intermediate levels. I think we should just retain it for now reflecting exact same hierarchy as /zarrs/{uuid:0:3}/{uuid:3:3}/{uuid}/{checksum}.json.

re alternative schema of the manifest -- I am afraid it would result in significant growth of size and load/parsing time of such a manifest. I guess we could have compressed them e.g. with efficient xz to facilitate speedy transfer but then it would also be just slower to parse. That is why I made it into such a "compressed" form to facilitate transfer and parsing/load. IMHO it should be easy if not trivial (in Rust or Python) to cast records into any other alternative records/schema at run/load time, but thus avoiding more expensive parsing etc. FWIW, if would make it easier, we could "hardcode" the fields - I was just experimenting in trying to make it flexible.

@jwodder
Copy link
Member

jwodder commented Feb 1, 2024

@yarikoptic

  • Since unauthenticated requests to the GitHub API are limited to 60 requests/hour, dandidav will likely have to be provided with a GitHub access token on startup so that we get a larger rate limit. (Caching won't help if zarr-manifests-v2-sorted ever grows to the point that it contains more than 60 things that a user would want to see in an hour.)

  • Do you foresee zarr-manifests-v2-sorted or any of its subdirectories having more than a thousand entries? This affects which GitHub API endpoint I can use to list directory entries.

  • Do you think it's worth caching any of the GitHub API responses and/or the manifest files? On the one hand, the client would be saved from having to repeatedly request & parse a 30 MB file. On the other hand, that 30 MB file would live in the server's memory instead.

@yarikoptic
Copy link
Member Author

I shared on github just for initial convenience... and yes -- we will get many more after collecting them is done. At the end I expect us to share it on S3.

To completely avoid github and its limits, re-sharing it now under https://datasets.datalad.org/?dir=/dandi/zarr-manifests so you can just "browse" https://datasets.datalad.org/dandi/zarr-manifests/zarr-manifests-v2-sorted/ . I will also now sort more (around 2k ready) of manifests and push them shortly there too.
Note that it would be the same server as for experimental http://dandi.centerforopenneuroscience.org/ so transfer speeds there should be "really good".

  • Do you think it's worth caching any of the GitHub API responses and/or the manifest files? On the one hand, the client would be saved from having to repeatedly request & parse a 30 MB file. On the other hand, that 30 MB file would live in the server's memory instead.

I guess it is for us to decide but I do think that it would be useful to have some cache, e.g. lru_cache(10) or so so we do have caching for the most recent requests -- hopefully it would not get thrushed much whenever we get over 10 parallel visitors. We might want to look into some dynamic tuneups and implement caching based on timeout etc.

@yarikoptic
Copy link
Member Author

FWIW -- I have pushed a few thousands now to https://datasets.datalad.org/?dir=/dandi/zarr-manifests/zarr-manifests-v2-sorted

@jwodder
Copy link
Member

jwodder commented Feb 2, 2024

@yarikoptic

  • To be clear: You want paths of the form /zarrs/{uuid:0:3}/{uuid:3:3}/{uuid}/{checksum}.json on dandidav to be collections, with paths like /zarrs/{uuid:0:3}/{uuid:3:3}/{uuid}/{checksum}.json/0/0/.zarray pointing to entries under them? Having a .json extension in a directory filename is a bit odd.

  • I don't suppose there's a way to get listings for directories like /zarrs/{uuid:0:3}/{uuid:3:3}/{uuid}/ as JSON instead of HTML?

@yarikoptic
Copy link
Member Author

  • let's strip .json: /zarrs/{uuid:0:3}/{uuid:3:3}/{uuid}/{checksum}/0/0/.zarray . I was not even planing to expose original .json but I do not see why not -- people could get manifest then (from datasets.datalad.org) and compare to the underlying structure (files served from S3) -- would be great
  • interesting idea... looking into it...

@yarikoptic
Copy link
Member Author

done:

❯ curl --silent https://datasets.datalad.org/dandi/zarr-manifests/ | jq .
{
  "path": "/",
  "files": [
    ".gitattributes",
    "README.md",
    "check_zarr_paths.sh",
    "convert_schema_1to2.py",
    "get_them_all.sh",
    "list_bucket_prefix_versionids.py",
    "sort-v2.sh",
    "myapp.wsgi"
  ],
  "directories": [
    ".git",
    "zarr-manifests-v2-sorted"
  ]
}

@yarikoptic
Copy link
Member Author

FWIW, here is the wsgi script producing that json https://github.com/dandi/zarr-manifests/blob/master/myapp.wsgi happen you want to change format or see security concern to address etc.

@jwodder
Copy link
Member

jwodder commented Feb 6, 2024

@yarikoptic Trying to access one of your Zarr manifests at, e.g., https://datasets.datalad.org/dandi/zarr-manifests/zarr-manifests-v2-sorted/741/632/741632e4-18ac-437e-945c-a318c3d46483/c74b10dca1bbcce66db179c1ca8f76da-72142--112269290369.json, currently returns 404. I believe the problem is that the WSGI script doesn't distinguish between paths to directories and paths to manifest files.

Also, for the record, https://datasets.datalad.org/?dir=/dandi/zarr-manifests currently shows "ERROR: Could not find metadata for current dataset."

@jwodder
Copy link
Member

jwodder commented Feb 6, 2024

@yarikoptic I think dandi/zarr-manifests#1 should fix the 404 issue.

yarikoptic added a commit that referenced this issue Feb 6, 2024
Serve Zarrs view via manifests at `/zarrs/`
yarikoptic added a commit to dandi/zarr-manifests that referenced this issue Feb 11, 2024
    [Tue Feb 06 13:57:09.972838 2024] [wsgi:error] [pid 3001752:tid 140299577837248] [client 71.198.184.218:56874] Traceback (most recent call last):, referer: dandi/dandidav#43
    [Tue Feb 06 13:57:09.973156 2024] [wsgi:error] [pid 3001752:tid 140299577837248] [client 71.198.184.218:56874]   File "/srv/datasets.datalad.org/www/dandi/zarr-manifests/myapp.wsgi", line 35, in <lambda>, referer: dandi/dandidav#43
    [Tue Feb 06 13:57:09.973186 2024] [wsgi:error] [pid 3001752:tid 140299577837248] [client 71.198.184.218:56874]     return iter(lambda: fp.read(65535), b""), referer: dandi/dandidav#43
    [Tue Feb 06 13:57:09.973224 2024] [wsgi:error] [pid 3001752:tid 140299577837248] [client 71.198.184.218:56874]                         ^^^^^^^^^^^^^^, referer: dandi/dandidav#43
    [Tue Feb 06 13:57:09.973295 2024] [wsgi:error] [pid 3001752:tid 140299577837248] [client 71.198.184.218:56874] ValueError: read of closed file, referer: dandi/dandidav#43
@jwodder jwodder added the hier:zarrs Relating to serving the /zarrs/ hierarchy label Feb 19, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request therefor hier:zarrs Relating to serving the /zarrs/ hierarchy
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants