Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add endpoint for querying a folder or asset path in a Dandiset #1837

Open
jwodder opened this issue Jan 26, 2024 · 14 comments
Open

Add endpoint for querying a folder or asset path in a Dandiset #1837

jwodder opened this issue Jan 26, 2024 · 14 comments
Labels
enhancement New feature or request

Comments

@jwodder
Copy link
Member

jwodder commented Jan 26, 2024

Implementing https://github.com/dandi/dandi-webdav and some other things would be much easier if it were possible to make a single API request that provided a path and received in return either the asset at that path or an indication that the path is a folder, possibly along with the folder's children. Note that the current /dandisets/000108/versions/draft/assets/paths/ endpoint is insufficient for this, as it does not return asset properties or metadata, and even if it did, setting path_prefix to an asset path returns an empty list, so one cannot make a single request that returns either an asset or the contents of a folder at a path.

Possible endpoint: GET /dandiset/{dandiset_id}/versions/{version_id}/assets/atpath/?path={path}[&metadata=<0|1>]

  • If path is omitted from the request, return a 4xx error.

  • If path is not an asset path and there are no assets under {path}/, return 404.

  • If path points to an asset, the following is returned:

    {
        "type": "asset",
        "asset": {
             // Asset properties (ID, size, created, etc.) as would be returned while paginating over `/dandisets/{dandiset_id}/versions/{version_id}/assets/`
             // If `metadata=1` was included in the query, "asset" also contains a "metadata" field containing the asset metadata
        }
    }
  • If there are assets under {path}/ (i.e., if path is a folder), the following or similar is returned:

    {
        "type": "folder",
        "folder": {
             "aggregate_files": INT,
             "aggregate_size": INT
        }
    }
  • Ideally, I'd like it if including a children=1 query parameter would, if path turns out to be a folder, cause the response to include a list of all direct children of path, both assets (with properties, and including metadata if metadata=1) and paths of subfolders. However, this would need to be paginated, and having a response that (a) is sometimes paginated and (b) only paginates a subset of the data returned seems awkward.

A parallel endpoint could also be implemented for Zarr entries.

@satra
Copy link
Member

satra commented Jan 26, 2024

@jwodder - in the current paths api the absence of asset i think indicates it's a folder. doesn't that not satisfy your requirements except for the metadata return parameter?

current return of `paths` api at the top level of 000108
{
  "count": 7,
  "next": null,
  "previous": null,
  "results": [
    {
      "path": "dataset_description.json",
      "version": 500,
      "aggregate_files": 1,
      "aggregate_size": 71,
      "asset": {
        "asset_id": "2847011b-f9fb-4933-a6b5-1641e0c1886b",
        "url": "https://dandiarchive.s3.amazonaws.com/blobs/c07/71a/c0771a4f-3483-47e7-821e-b28ac8df46a5"
      }
    },
    {
      "path": "samples.tsv",
      "version": 500,
      "aggregate_files": 1,
      "aggregate_size": 572,
      "asset": {
        "asset_id": "a8a5fed9-02bf-4d54-b457-12427a4d3e70",
        "url": "https://dandiarchive.s3.amazonaws.com/blobs/369/c1b/369c1b33-7311-402a-a236-373b5af3a800"
      }
    },
    {
      "path": "sub-mEhm",
      "version": 500,
      "aggregate_files": 168,
      "aggregate_size": 3102758281057,
      "asset": null
    },
    {
      "path": "sub-MITU01",
      "version": 500,
      "aggregate_files": 6375,
      "aggregate_size": 290203968395869,
      "asset": null
    },
    {
      "path": "sub-MITU01h3",
      "version": 500,
      "aggregate_files": 475,
      "aggregate_size": 52424119215488,
      "asset": null
    },
    {
      "path": "sub-SChmi53",
      "version": 500,
      "aggregate_files": 574,
      "aggregate_size": 12364425736517,
      "asset": null
    },
    {
      "path": "sub-U01hm15x",
      "version": 500,
      "aggregate_files": 490,
      "aggregate_size": 16635711419780,
      "asset": null
    }
  ]
}

@jwodder
Copy link
Member Author

jwodder commented Jan 26, 2024

@satra

  • That doesn't return asset properties.
  • If I want to check whether, say, foo/bar is an asset (and get its properties) or a folder (and get its children), there is no single paths request that gives me what I want:
    • If I set path_prefix to foo/bar:
      • If foo/bar is an asset, I get nothing — BAD.
      • If foo/bar is a folder, I get all of its children — GOOD.
    • If I set path_prefix to foo:
      • If foo/bar is an asset, I get its asset ID, along with a bunch of other irrelevant stuff that I may have to paginate through — SUBOPTIMAL.
      • If foo/bar is a folder, I get confirmation that it's a folder, but I don't get its children, and there's still a bunch of other irrelevant stuff to filter out — BAD.

@satra
Copy link
Member

satra commented Jan 26, 2024

If I set path_prefix to foo/bar:
If foo/bar is an asset, I get nothing — BAD.

this could be fixed i think. the current api assumes one is using it for navigation and hence this situation never comes up. on that front can you describe your use case in webdav that would be different from the tree navigator in the web app?

If I set path_prefix to foo:
If foo/bar is a folder, I get confirmation that it's a folder, but I don't get its children, and there's still a bunch of other irrelevant stuff to filter out — BAD.

i'm not sure why i would return children of anything that is a folder, unless we parameterize that with a depth field. again something we could add to the api. however, it would be good to know what the webdav use case for it is. an intent behind that api was to provide a limited view so as to not overwhelm the server and to reflect what the ui looks like.

i think the specific use case of webdav that's different from the online file browser would be good to know. if not, it would be good to examine whether the filebrowser could be improved.

@jwodder
Copy link
Member Author

jwodder commented Jan 26, 2024

@satra In order to make dandi-webdav more efficient (cf. dandi/dandi-webdav#5), it needs to be able to take a path and do the following operations on it:

  • For a GET request or a PROPFIND request with Depth: 1, dandi-webdav needs to be able to efficiently fetch whatever's at the path as follows:
    • If the path points to a blob asset, the blob's S3 download URL (for GET) or the asset's properties & metadata (for PROPFIND) must be retrieved.
    • If the path points to an asset folder, all assets (including their properties & metadata) and names of subfolders in the folder must be retrieved.
    • If the path points to a Zarr asset, then the Zarr's properties & metadata and all top-level entries (including their S3 properties) and subfolders must be retrieved.
    • If the path is a Zarr asset path concatenated with a Zarr entry path, then the Zarr asset's S3 URL must be retrieved, followed by a query to S3 (because the Archive currently doesn't have an equivalent of …/assets/paths/ for Zarr entries) to retrieve whatever's at the entry path:
      • If the entry path points to an entry, its S3 download URL (for GET) or S3 properties (for PROPFIND) must be retrieved.
      • If the entry path points to a folder, all entries (including their S3 properties) and names of subfolders in the folder must be retrieved.
  • For a PROPFIND request with Depth: 0, same as above, except that folder entries do not need to be retrieved.

@jwodder jwodder added the enhancement New feature or request label Jan 28, 2024
@satra
Copy link
Member

satra commented Jan 30, 2024

let's consolidate this with the web file browser. @mvandenburgh - do you have thoughts on the issues raised here.

If the path points to a Zarr asset, then the Zarr's properties and all top-level entries (including their S3 properties) and subfolders must be retrieved.

not sure why this is a requirement for webdavfs. i would suggest any zarr reading part be left to a zarr library and a zarr object is treated the same way as any other binary file.

@jwodder
Copy link
Member Author

jwodder commented Jan 30, 2024

@satra It's a requirement because that's what @yarikoptic told me to do. You can't treat a Zarr as a binary file, as it isn't a file in the first place.

@waxlamp
Copy link
Member

waxlamp commented Jan 31, 2024

let's consolidate this with the web file browser

I'd prefer we avoid this design approach. The web file browser and this webdav server have fundamentally different requirements and therefore should not be served by one consolidated API. And I'd rather not add to the public API for a single client's needs (unless and until that API shows promise for general usefulness).

What I'd like to try is creating some private API tailored to the dandidav's specific needs (recognizing that dandidav, like the web client and the CLI, is an official interface to the DANDI datastore and therefore has special privileges). Later on, we can factor out anything that is suitable for the public API.

I realize this may sound like an odd approach, but it has advantages:

  • it can immediately deliver the customized performance value that dandidav is seeking
  • it means we don't need to agonize over the design of this new API--it can instead be whatever John needs
  • it leaves open the future possibility of creating a public version (if needed)

@yarikoptic
Copy link
Member

what is "private API" @waxlamp in the context of DANDI?

@yarikoptic
Copy link
Member

If the path points to a Zarr asset, then the Zarr's properties and all top-level entries (including their S3 properties) and subfolders must be retrieved.

not sure why this is a requirement for webdavfs. i would suggest any zarr reading part be left to a zarr library and a zarr object is treated the same way as any other binary file.

@satra It's a requirement because that's what @yarikoptic told me to do. You can't treat a Zarr as a binary file, as it isn't a file in the first place.

the goal here for having zarr as a browsable folder is multifold

  • navigate entire dandiarchive "space" as if it was just a filesystem to browse - to review files, to FUSE mount it and access any file or "formatted folder" like zarr is
  • to provide access to zarrs as if they are just hosted on the web (thus not just a "binary thing" for zarr).

Just try it at https://dandi.centerforopenneuroscience.org/dandisets/000108/draft/sub-MITU01h3/ses-20211130h19m15s39/micr/ and around there ;-)

@yarikoptic
Copy link
Member

yarikoptic commented Feb 1, 2024

I think we first need to acknowledge that current API related to paths is quite odd in

  1. (already known and thus acknowledged) swapped meaning of path vs path_prefix: Fix naming of path and path_prefix parameter names #1562
  2. (already known and thus acknowledged. edit: actually fixed at API level) /paths should 404 if there are not assets with path_prefix/ #546
  3. its behavior of returning a record with count 0 and empty results when you hit an asset:
{
  "count": 0,
  "next": null,
  "previous": null,
  "results": []
}

it is odd because it is an implicit knowledge to give meaning to such a result. It at least does return 404 when pointing to non-existing path (phewph).
4. implicit type of a record - no clear indication on what result is (folder? asset? and further type of asset - file? zarr folder?). Implicit logic: "if there is an "asset" -- it is a 'file'; if no "asset" - it is a 'folder'":

    {
      "path": "sub-RAT123",
      "version": 419,
      "aggregate_files": 1,
      "aggregate_size": 18792,
      "asset": null
    },
    {
      "path": "test1234",
      "version": 419,
      "aggregate_files": 1,
      "aggregate_size": 9,
      "asset": {
        "asset_id": "28155d24-9342-48bc-a3a7-b69191e86e13",
        "url": "https://dandiarchive.s3.amazonaws.com/blobs/031/7cf/0317cf5a-4047-4e19-aae1-4f7b7434d2d7"
      }
    },

From this point of acknowledging that it sucks quite a bit, it sounds like we really need to strive to make it better, and then benefit from an improvement.

What I see possible to be done to make it better without breaking anything:

  1. add to result records date_modified so could be used by file browser
  2. add to result records some kind of result_type and make it as useful as we can: e.g. Folder, AssetBlob, AssetZarr - to remove ambiguity of 4. above

And our file browser could immediately benefit and expose Date Modified column and allow sorting etc. Nobody complained yet but I bet it would be UX improvement.

And now we can think about possible changes which we can expose "conditionally" based on some option to the request without breaking API (we already did that for metadata of /assets/ end point and only benefited IMHO from that).

E.g. we could add include_hit (or alike) which would default to False and thus retaining the current behavior; but for True it would

  • add a record for the exact path (well -- path_prefix here due to oddity 1.), so instead of that empty record with 0 we would get count: 1 with a record
    {
      "path": null,
      "version": 419,
      "aggregate_files": 1,
      "aggregate_size": 9,
      "date_modified": "2023-03-16T11:50:08.973069-04:00",
      "result_type": "AssetBlob",
      "asset": {
        "asset_id": "28155d24-9342-48bc-a3a7-b69191e86e13",
        "url": "https://dandiarchive.s3.amazonaws.com/blobs/031/7cf/0317cf5a-4047-4e19-aae1-4f7b7434d2d7"
      }

and if path points to the folder -- it would be that folder's record as when listed from its parent, but with result_type "Folder". It will be pretty much the "." you see in ls -la and thus could even be included in the listing of the folder we have now. Again -- our file browser could benefit from that and present that information to the user.

@jwodder do you think this would suffice for our needs in webdav implementation?

@jwodder
Copy link
Member Author

jwodder commented Feb 1, 2024

@yarikoptic This would improve things, but WebDAV would still have to make further requests to get asset properties & metadata (both for an asset at a path and for assets under a folder). Note that I already described what information WebDAV needs per path in #1837 (comment).

@yarikoptic
Copy link
Member

I see:

the asset's properties & metadata (for PROPFIND) must be retrieved

could you @jwodder list explicitly specific metadata fields and properties which are of interest here? I guess adding metadata flag to return actual metadata record as we have done for /assets would be overkill here.

@jwodder
Copy link
Member Author

jwodder commented Feb 1, 2024

@yarikoptic

@yarikoptic
Copy link
Member

Let's "side discuss" at #1851 where it might be easier "per line"

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

4 participants