Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new API design doc #447

Merged
merged 16 commits into from
Apr 25, 2022
Merged

Add new API design doc #447

merged 16 commits into from
Apr 25, 2022

Conversation

jwodder
Copy link
Member

@jwodder jwodder commented Mar 3, 2021

prior discussion in #249

@jwodder jwodder added the internal Changes only affect the internal API label Mar 3, 2021
@jwodder jwodder mentioned this pull request Mar 3, 2021
@codecov
Copy link

codecov bot commented Mar 3, 2021

Codecov Report

Merging #447 (2b32d56) into master (eae1568) will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #447   +/-   ##
=======================================
  Coverage   83.38%   83.38%           
=======================================
  Files          59       59           
  Lines        5272     5272           
=======================================
  Hits         4396     4396           
  Misses        876      876           
Flag Coverage Δ
unittests 83.38% <ø> (ø)

Flags with carried forward coverage won't be shown. Click here to find out more.


Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update eae1568...2b32d56. Read the comment docs.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some initial comments/notes. One design decision to make is either Remote{Dandiset,Asset} constructor would need to communicate to the server right away or not (and have properties/etc populated only whenever queried for the first time).

* `created: datetime`
* `modified: datetime`
* `version: Optional[Version]`
* `get_all_assets() -> Iterator[Asset]`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
* `get_all_assets() -> Iterator[Asset]`
* `get_assets() -> Iterator[Asset]`

so it easier to use with auto-completion to discover all methods which could be used to get assets. I think it would be aligning better with the notion of dict having .keys() and not .all_keys() etc.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I prefer the get_all_* nomenclature because otherwise get_asset() and get_assets() are just a typo apart.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but we would never have get_asset() (and if we get -- it would require some argument), so I do not see a typo be a silent problem.

* `get_metadata() -> DandiMeta`
* `Asset`:
* `path: str`
* `sha256: str` — Should this only be on `RemoteAsset`?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in principle we can make it a lazy property for a local Asset.

The same for size/created/modified below - the point is either a LocalAsset would require to have a physical file to exist for the path, which I guess it would, so we can indeed provide such interface

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noted.

* These classes (except `Version`?) will all contain a private `_client` attribute referring back to the associated `DandiAPIClient` instance.
* Changes to `DandiAPIClient` methods:
* Add a `get_dandiset(dandiset_id: Union[str, RemoteDandiset], version_id=None) -> RemoteDandiset` method
* **To discuss:** Should the default `version` be "draft" or the most recent version? Either way, there should be another method that explicitly gets the other option.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since draft might disappear (e.g. we "archive" a dandiset, preventing any changes and thus removing "draft"), I think we should just stay consistent with what API's /dandisets/{dandiset__pk}/ returns in most_recent_version -- might be a released one or a draft IIRC. So the default here could stay None. But I wonder if we should complicate with allowing for special handling of "latest_release"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated.

* Changes to `DandiAPIClient` methods:
* Add a `get_dandiset(dandiset_id: Union[str, RemoteDandiset], version_id=None) -> RemoteDandiset` method
* **To discuss:** Should the default `version` be "draft" or the most recent version? Either way, there should be another method that explicitly gets the other option.
* **To discuss:** Should this take a `lazy=False` parameter that, if true, constructs a `RemoteDandiset` containing only an identifier, without making any requests at the time of creation? This would allow calling a `RemoteDandiset`'s methods without having to make a request for the Dandiset's data beforehand. If & when any non-identifiers attributes of a lazy `RemoteDandiset` are requested, a request is made at that point and the results cached.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if RemoteDandiset constructor does not demand talking to the server, shouldn't it be "lazy" to start with and have properties lazily populated? (or may be only if no version_id is given)

* **To discuss:** Should the default `version` be "draft" or the most recent version? Either way, there should be another method that explicitly gets the other option.
* **To discuss:** Should this take a `lazy=False` parameter that, if true, constructs a `RemoteDandiset` containing only an identifier, without making any requests at the time of creation? This would allow calling a `RemoteDandiset`'s methods without having to make a request for the Dandiset's data beforehand. If & when any non-identifiers attributes of a lazy `RemoteDandiset` are requested, a request is made at that point and the results cached.
* Add a `get_all_dandisets() -> Iterator[RemoteDandiset]` method
* **To discuss:** What should these objects' `Version`s be?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think boils down to the same - are RemoteDandisets lazy at the constructor - do they require communication to the server or not...

doc/design/python-api-1.md Outdated Show resolved Hide resolved
doc/design/python-api-1.md Outdated Show resolved Hide resolved
* `modified: datetime`
* `version: Optional[Version]`
* `get_all_assets() -> Iterator[Asset]`
* `get_assets_under_path(path: str) -> Iterator[Asset]` — Returns all assets whose paths are either under the given folder or equal the given filepath
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

UX sugarings we should consider add on top

@assets -> get_assets()
@metadata -> get_metadata()
__iter__() -> .get_assets()

? they might be tricky with subclassing though since IIRC we cannot just bind the property to get _assets in super class, since then subclasses .assets would us it instead of overloaded... May be properties should be lazy (cached value) invocations over the methods? That would give them a better purpose

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not clear on what exactly you're proposing here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smth like

@property
def assets(self):
     if self._assets is None:
            self._assets = self.get_assets()
     return self._assets

def __iter__(self):
     for asset in self.assets:
          yield asset

but we can think about all the sugaring later

@jwodder
Copy link
Member Author

jwodder commented Mar 5, 2021

@yarikoptic

One design decision to make is either Remote{Dandiset,Asset} constructor would need to communicate to the server right away or not (and have properties/etc populated only whenever queried for the first time).

The API resource instances would be constructed from data returned from an API call made by whatever method constructs the resources. If there are any fields left out of the API response that require another call to fetch, those can be populated lazily/on-demand rather than when the object is constructed.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

some more comments

doc/design/python-api-1.md Show resolved Hide resolved
* `get_versions() -> Iterator[Version]`
* `for_version(version_id: Union[str, Version]) -> RemoteDandiset`
* When `version_id == "draft"`, this is the same as `for_draft_version()`
* `for_draft_version() -> DraftDandiset`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

see above on making it some get*_version

* `delete() -> None`
* `get_users() -> Iterator[User]`
* `set_users(users: List[User]) -> None`
* Should this method also accept strings giving the usernames directly?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would not bother interfacing users calls for now at all at this point

doc/design/python-api-1.md Outdated Show resolved Hide resolved
doc/design/python-api-1.md Outdated Show resolved Hide resolved
doc/design/python-api-1.md Outdated Show resolved Hide resolved
doc/design/python-api-1.md Outdated Show resolved Hide resolved
* `asset_count: int`
* `size: int`

* Attributes & methods of the `RemoteAsset` class (a subclass of `Asset`):
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

but overall -- is it just a helper to expose common fields from an asset metadata?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's for exposing the "meta-metadata" associated with an asset that's returned when paginating through asset lists and was previously returned when querying an individual assets. Are we planning on getting rid of those fields from the API?

* `size: int`
* `created: datetime`
* `modified: datetime`
* `dandiset_id: str`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

wouldn't include it here now that we have assets more "globally defined" AFAIK

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Until the asset API endpoints are modified to not require the dandiset & version IDs as path components, the RemoteAsset class needs this information.

* `created: datetime`
* `modified: datetime`
* `dandiset_id: str`
* `version_id: str`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

like dandiset version? also would not bother... may be if we are to establish relationship to a specific dandiset, could be via .dandiset: RemoteDandiset

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, Dandiset version. Same response as above.

@jwodder jwodder force-pushed the design-doc branch 2 times, most recently from 72c4673 to b2decde Compare March 11, 2021 13:46
@jwodder
Copy link
Member Author

jwodder commented Mar 12, 2021

@yarikoptic I had an idea about the return values from the blocking upload/download functions. Since some transfers may not produce assets due to errors or the existing setting, I think it may be a good idea for the functions to return structures with the following fields:

  • UploadResult:
    • local_asset: LocalAsset — the asset that was uploaded; will have to be something else if we want to support an upload_file(filepath, ...) method
    • status: TransferStatus — an enum with SUCCESS, ERROR, and SKIPPED values
    • message: str — for ERROR, this is the error message; for SKIPPED, this is a reason why the asset was skipped; for SUCCESS, it doesn't matter
    • remote_asset: Optional[RemoteAsset] — The asset created by the upload; only set for SUCCESS
  • DownloadResult:
    • remote_asset: RemoteAsset — the asset that was downloaded
    • status: TransferStatus — an enum with SUCCESS, ERROR, and SKIPPED values
    • message: str — for ERROR, this is the error message; for SKIPPED, this is a reason why the asset was skipped; for SUCCESS, it doesn't matter
    • (Some representation of the resulting downloaded file? Returning LocalAsset would be awkward for the cases where we're not downloading to a local Dandiset.)

What do you think?

@yarikoptic
Copy link
Member

@yarikoptic I had an idea about the return values from the blocking upload/download functions.

if they are blocking, so more typical Python user-facing interfaces, why not just to raise an exception? e.g. IncompleteResultsError which would have attributes such as .results containing the results which were successfully obtained and may be .failed for e.g. Assets which failed to upload or download. That would make it easier to write high level Python code, without requiring to analyze all those records/status'es just to e.g. download something and not miss an errored out operation.

@jwodder
Copy link
Member Author

jwodder commented Mar 12, 2021

@yarikoptic Assuming you're suggesting that the exception only be raised after all threads have ended, that could work, too, but even when there are no errors, I feel there should be some way to indicate which assets were skipped and which weren't. It doesn't have to be done by returning a list of Result objects; a single Result object with list attributes would work as well.

How about this:

  • On success, upload functions return a structure with an assets: List[RemoteAsset] attribute for successful uploads plus a skipped: List[Tuple[LocalAsset, str]] attribute pairing skipped local attributes with the reasons why they were skipped.
  • If an error occurs in an upload function, an exception is raised after all threads have completed with the same assets and skipped attributes as above plus an errored: List[Tuple[LocalAsset, str]] attribute pairing failed assets with error messages.
  • Download functions return the same, except LocalAsset is replaced with RemoteAsset, and RemoteAsset is replaced with ... something I still haven't decided on.

@jwodder jwodder mentioned this pull request Apr 13, 2021
@yarikoptic yarikoptic mentioned this pull request Apr 16, 2021
@jwodder
Copy link
Member Author

jwodder commented Jun 2, 2021

@yarikoptic Do we want the classes to be "lazy" (not making any API requests until actual data is requested) or not (and thus fetching data upon instance creation)?

@yarikoptic
Copy link
Member

re lazy: I feel that we might benefit from "lazy" in particular for large dandisets, but for "remote" dandisets we should at least verify that they do exist, so I guess basic metadata record should be fetched upon class instantiation for those.

Copy link
Member

@yarikoptic yarikoptic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some comments, primarily on freshly added **Notes** and there might be some elderly comments whenever I started the review awhile back but never "Finished"

* `modified: datetime`
* `name: str`
* `asset_count: int`
* `size: int`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not sure if such metadata (computed stats and even name) should be part of the Version: even though API returns it now, I think eventually it would migrate to dandiset metadata, which has assetsSummary prepared for that. That would leave Version with information pertinent to just describe version, i.e. its version identifier and when created (well - modified makes sense ATM only for the draft, and otherwise I guess should be identical to created)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

related: fresh (this comment was done awhile back but not "posted") #568

* `size: int`
* `created: datetime`
* `modified: datetime`
* **Problem:** The API reports `created` and `modified` dates when paginating through a version's assets, but the data is absent when getting an asset's metadata directly from the asset-specific endpoint (Note that, although there are two timestamps for modification times in asset metadata, neither of them match the `modified` time in the pagination results.)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah -- those created and modified AFAIK are for the specific row in dandiset_versions_assets -- so specific to that dandiset version asset, whenever metadata record modified, should be the one on when metadata record itself was modified (should be before those created modified above). If our Asset here is not dandiset/version specific (I think it is not), then I guess we could have only the dateModified from metadata. For our needs (upload/download refresh mode) I think it is the one which is needed. So let's just drop created (even if for just now).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

created removed.

* Attributes of the `User` class:
* `username: str`
* `first_name: str`
* `last_name: str`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

may be don't bother separating and just keep it in "sync" with API here which just has name.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The separate fields used to exist in a previous version of the API; updated.

* `first_name: str`
* `last_name: str`
* `admin: bool`
* **Note:** We should ask the dandi-api developers to make the `GET/PUT /dandisets/{dandiset__pk}/users/` endpoints return full user structures, not just ones with only `username` fields, so that instances of this class can be properly populated
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

filed dandi/dandi-archive#319 . For now we can proceed without exposing .admin and then add it whenever API provides it

* `skipped: List[Tuple[LocalAsset, str]]` — list of skipped assets, paired with the reasons why they were skipped
* For blocking upload methods that upload multiple assets, if an error occurs while uploading, an exception is raised after all threads have completed; this exception has the same `assets` and `skipped` attributes as above plus an `errored: List[Tuple[LocalAsset, str]]` attribute pairing errored assets with error messages

* **To discuss:** Should methods that accept `DandisetMeta` and `(Bare|Remote)AssetMeta` instances also accept raw `dict`s? If so, should any validation be done on these `dict`s?
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think for starters we can just add a dedicated @classmethod from_dict(cls, d: dict, validate:bool=True) -> cls (possible?) and require such conversion explicitly. If we keep running into needing that too often -- we would rethink

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pydantic already provides parse_obj() (validating) and construct() (non-validating) methods for that.

@@ -0,0 +1,189 @@
Designs for an improved Python API
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

note, in #712 I renamed doc/ to docs/ so if we are to merge this one, might want to rename

@yarikoptic
Copy link
Member

how much this design doc reflects current implementation? I just wonder what should we do about this PR -- tune up&merge, merge, or just close?

@jwodder
Copy link
Member Author

jwodder commented Nov 9, 2021

@yarikoptic I'd say somewhere between a third and a half of the doc has been implemented. I would prefer breaking all the tasks in this doc up into multiple issues (possibly organized in a GitHub project). I don't know why you'd ever merge this PR, as once it's implemented the docstrings should be the canonical location for the features' documentation. If you really want a nice place to put a write up of the desired API, I think the repo wiki is a better place.

@yarikoptic
Copy link
Member

I don't know myself why we haven't ever merged it , so will do that now -- shouldn't hurt.

@yarikoptic yarikoptic merged commit 185bbfc into master Apr 25, 2022
@yarikoptic yarikoptic deleted the design-doc branch April 25, 2022 16:41
@github-actions
Copy link

🚀 PR was released in 0.39.3 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
internal Changes only affect the internal API released
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants