-
Notifications
You must be signed in to change notification settings - Fork 28
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add new API design doc #447
Conversation
Codecov Report
@@ Coverage Diff @@
## master #447 +/- ##
=======================================
Coverage 83.38% 83.38%
=======================================
Files 59 59
Lines 5272 5272
=======================================
Hits 4396 4396
Misses 876 876
Flags with carried forward coverage won't be shown. Click here to find out more. Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some initial comments/notes. One design decision to make is either Remote{Dandiset,Asset}
constructor would need to communicate to the server right away or not (and have properties/etc populated only whenever queried for the first time).
doc/design/python-api-1.md
Outdated
* `created: datetime` | ||
* `modified: datetime` | ||
* `version: Optional[Version]` | ||
* `get_all_assets() -> Iterator[Asset]` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
* `get_all_assets() -> Iterator[Asset]` | |
* `get_assets() -> Iterator[Asset]` |
so it easier to use with auto-completion to discover all methods which could be used to get assets. I think it would be aligning better with the notion of dict
having .keys()
and not .all_keys()
etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I prefer the get_all_*
nomenclature because otherwise get_asset()
and get_assets()
are just a typo apart.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but we would never have get_asset()
(and if we get -- it would require some argument), so I do not see a typo be a silent problem.
doc/design/python-api-1.md
Outdated
* `get_metadata() -> DandiMeta` | ||
* `Asset`: | ||
* `path: str` | ||
* `sha256: str` — Should this only be on `RemoteAsset`? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in principle we can make it a lazy property for a local Asset.
The same for size/created/modified below - the point is either a LocalAsset would require to have a physical file to exist for the path, which I guess it would, so we can indeed provide such interface
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Noted.
doc/design/python-api-1.md
Outdated
* These classes (except `Version`?) will all contain a private `_client` attribute referring back to the associated `DandiAPIClient` instance. | ||
* Changes to `DandiAPIClient` methods: | ||
* Add a `get_dandiset(dandiset_id: Union[str, RemoteDandiset], version_id=None) -> RemoteDandiset` method | ||
* **To discuss:** Should the default `version` be "draft" or the most recent version? Either way, there should be another method that explicitly gets the other option. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since draft
might disappear (e.g. we "archive" a dandiset, preventing any changes and thus removing "draft"), I think we should just stay consistent with what API's /dandisets/{dandiset__pk}/
returns in most_recent_version
-- might be a released one or a draft IIRC. So the default here could stay None
. But I wonder if we should complicate with allowing for special handling of "latest_release"
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Updated.
* Changes to `DandiAPIClient` methods: | ||
* Add a `get_dandiset(dandiset_id: Union[str, RemoteDandiset], version_id=None) -> RemoteDandiset` method | ||
* **To discuss:** Should the default `version` be "draft" or the most recent version? Either way, there should be another method that explicitly gets the other option. | ||
* **To discuss:** Should this take a `lazy=False` parameter that, if true, constructs a `RemoteDandiset` containing only an identifier, without making any requests at the time of creation? This would allow calling a `RemoteDandiset`'s methods without having to make a request for the Dandiset's data beforehand. If & when any non-identifiers attributes of a lazy `RemoteDandiset` are requested, a request is made at that point and the results cached. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if RemoteDandiset
constructor does not demand talking to the server, shouldn't it be "lazy" to start with and have properties lazily populated? (or may be only if no version_id is given)
doc/design/python-api-1.md
Outdated
* **To discuss:** Should the default `version` be "draft" or the most recent version? Either way, there should be another method that explicitly gets the other option. | ||
* **To discuss:** Should this take a `lazy=False` parameter that, if true, constructs a `RemoteDandiset` containing only an identifier, without making any requests at the time of creation? This would allow calling a `RemoteDandiset`'s methods without having to make a request for the Dandiset's data beforehand. If & when any non-identifiers attributes of a lazy `RemoteDandiset` are requested, a request is made at that point and the results cached. | ||
* Add a `get_all_dandisets() -> Iterator[RemoteDandiset]` method | ||
* **To discuss:** What should these objects' `Version`s be? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think boils down to the same - are RemoteDandisets
lazy at the constructor - do they require communication to the server or not...
* `modified: datetime` | ||
* `version: Optional[Version]` | ||
* `get_all_assets() -> Iterator[Asset]` | ||
* `get_assets_under_path(path: str) -> Iterator[Asset]` — Returns all assets whose paths are either under the given folder or equal the given filepath |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
UX sugarings we should consider add on top
@assets -> get_assets()
@metadata -> get_metadata()
__iter__() -> .get_assets()
? they might be tricky with subclassing though since IIRC we cannot just bind the property to get _assets in super class, since then subclasses .assets would us it instead of overloaded... May be properties should be lazy (cached value) invocations over the methods? That would give them a better purpose
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not clear on what exactly you're proposing here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
smth like
@property
def assets(self):
if self._assets is None:
self._assets = self.get_assets()
return self._assets
def __iter__(self):
for asset in self.assets:
yield asset
but we can think about all the sugaring later
The API resource instances would be constructed from data returned from an API call made by whatever method constructs the resources. If there are any fields left out of the API response that require another call to fetch, those can be populated lazily/on-demand rather than when the object is constructed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
some more comments
* `get_versions() -> Iterator[Version]` | ||
* `for_version(version_id: Union[str, Version]) -> RemoteDandiset` | ||
* When `version_id == "draft"`, this is the same as `for_draft_version()` | ||
* `for_draft_version() -> DraftDandiset` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
see above on making it some get*_version
* `delete() -> None` | ||
* `get_users() -> Iterator[User]` | ||
* `set_users(users: List[User]) -> None` | ||
* Should this method also accept strings giving the usernames directly? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would not bother interfacing users calls for now at all at this point
* `asset_count: int` | ||
* `size: int` | ||
|
||
* Attributes & methods of the `RemoteAsset` class (a subclass of `Asset`): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
but overall -- is it just a helper to expose common fields from an asset metadata
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's for exposing the "meta-metadata" associated with an asset that's returned when paginating through asset lists and was previously returned when querying an individual assets. Are we planning on getting rid of those fields from the API?
* `size: int` | ||
* `created: datetime` | ||
* `modified: datetime` | ||
* `dandiset_id: str` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wouldn't include it here now that we have assets more "globally defined" AFAIK
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Until the asset API endpoints are modified to not require the dandiset & version IDs as path components, the RemoteAsset
class needs this information.
* `created: datetime` | ||
* `modified: datetime` | ||
* `dandiset_id: str` | ||
* `version_id: str` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
like dandiset version? also would not bother... may be if we are to establish relationship to a specific dandiset, could be via .dandiset: RemoteDandiset
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, Dandiset version. Same response as above.
72c4673
to
b2decde
Compare
@yarikoptic I had an idea about the return values from the blocking upload/download functions. Since some transfers may not produce assets due to errors or the
What do you think? |
if they are blocking, so more typical Python user-facing interfaces, why not just to raise an exception? e.g. |
@yarikoptic Assuming you're suggesting that the exception only be raised after all threads have ended, that could work, too, but even when there are no errors, I feel there should be some way to indicate which assets were skipped and which weren't. It doesn't have to be done by returning a list of Result objects; a single Result object with list attributes would work as well. How about this:
|
@yarikoptic Do we want the classes to be "lazy" (not making any API requests until actual data is requested) or not (and thus fetching data upon instance creation)? |
re lazy: I feel that we might benefit from "lazy" in particular for large dandisets, but for "remote" dandisets we should at least verify that they do exist, so I guess basic metadata record should be fetched upon class instantiation for those. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left some comments, primarily on freshly added **Notes**
and there might be some elderly comments whenever I started the review awhile back but never "Finished"
* `modified: datetime` | ||
* `name: str` | ||
* `asset_count: int` | ||
* `size: int` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not sure if such metadata (computed stats and even name
) should be part of the Version
: even though API returns it now, I think eventually it would migrate to dandiset metadata, which has assetsSummary
prepared for that. That would leave Version
with information pertinent to just describe version, i.e. its version identifier and when created (well - modified
makes sense ATM only for the draft, and otherwise I guess should be identical to created
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
related: fresh (this comment was done awhile back but not "posted") #568
* `size: int` | ||
* `created: datetime` | ||
* `modified: datetime` | ||
* **Problem:** The API reports `created` and `modified` dates when paginating through a version's assets, but the data is absent when getting an asset's metadata directly from the asset-specific endpoint (Note that, although there are two timestamps for modification times in asset metadata, neither of them match the `modified` time in the pagination results.) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yeah -- those created
and modified
AFAIK are for the specific row in dandiset_versions_assets -- so specific to that dandiset version asset, whenever metadata record modified, should be the one on when metadata record itself was modified (should be before those created
modified
above). If our Asset
here is not dandiset/version specific (I think it is not), then I guess we could have only the dateModified from metadata. For our needs (upload/download refresh
mode) I think it is the one which is needed. So let's just drop created
(even if for just now).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
created
removed.
doc/design/python-api-1.md
Outdated
* Attributes of the `User` class: | ||
* `username: str` | ||
* `first_name: str` | ||
* `last_name: str` |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
may be don't bother separating and just keep it in "sync" with API here which just has name
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The separate fields used to exist in a previous version of the API; updated.
* `first_name: str` | ||
* `last_name: str` | ||
* `admin: bool` | ||
* **Note:** We should ask the dandi-api developers to make the `GET/PUT /dandisets/{dandiset__pk}/users/` endpoints return full user structures, not just ones with only `username` fields, so that instances of this class can be properly populated |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
filed dandi/dandi-archive#319 . For now we can proceed without exposing .admin
and then add it whenever API provides it
doc/design/python-api-1.md
Outdated
* `skipped: List[Tuple[LocalAsset, str]]` — list of skipped assets, paired with the reasons why they were skipped | ||
* For blocking upload methods that upload multiple assets, if an error occurs while uploading, an exception is raised after all threads have completed; this exception has the same `assets` and `skipped` attributes as above plus an `errored: List[Tuple[LocalAsset, str]]` attribute pairing errored assets with error messages | ||
|
||
* **To discuss:** Should methods that accept `DandisetMeta` and `(Bare|Remote)AssetMeta` instances also accept raw `dict`s? If so, should any validation be done on these `dict`s? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for starters we can just add a dedicated @classmethod
from_dict(cls, d: dict, validate:bool=True) -> cls (possible?)
and require such conversion explicitly. If we keep running into needing that too often -- we would rethink
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pydantic already provides parse_obj()
(validating) and construct()
(non-validating) methods for that.
@@ -0,0 +1,189 @@ | |||
Designs for an improved Python API |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
note, in #712 I renamed doc/
to docs/
so if we are to merge this one, might want to rename
how much this design doc reflects current implementation? I just wonder what should we do about this PR -- tune up&merge, merge, or just close? |
@yarikoptic I'd say somewhere between a third and a half of the doc has been implemented. I would prefer breaking all the tasks in this doc up into multiple issues (possibly organized in a GitHub project). I don't know why you'd ever merge this PR, as once it's implemented the docstrings should be the canonical location for the features' documentation. If you really want a nice place to put a write up of the desired API, I think the repo wiki is a better place. |
I don't know myself why we haven't ever merged it , so will do that now -- shouldn't hurt. |
🚀 PR was released in |
prior discussion in #249