Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dandiset.get_metadata not working for some dandisets #1205

Open
bendichter opened this issue Feb 6, 2023 · 9 comments
Open

dandiset.get_metadata not working for some dandisets #1205

bendichter opened this issue Feb 6, 2023 · 9 comments
Assignees

Comments

@bendichter
Copy link
Member

bendichter commented Feb 6, 2023

from dandi.dandiapi import DandiAPIClient

client = DandiAPIClient()
dandisets = list(client.get_dandisets())

# this works
dandisets[-3].get_metadata()

# this does not work
for dandiset in dandisets:
    metadata = dandiset.get_metadata()
---------------------------------------------------------------------------
ValidationError                           Traceback (most recent call last)
Cell In[13], line 12
     10 # this does not work
     11 for dandiset in tqdm(dandisets):
---> 12     metadata = dandiset.get_metadata()

File /opt/conda/lib/python3.10/site-packages/dandi/dandiapi.py:925, in RemoteDandiset.get_metadata(self)
    920 def get_metadata(self) -> models.Dandiset:
    921     """
    922     Fetch the metadata for this version of the Dandiset as a
    923     `dandischema.models.Dandiset` instance
    924     """
--> 925     return models.Dandiset.parse_obj(self.get_raw_metadata())

File /opt/conda/lib/python3.10/site-packages/pydantic/main.py:527, in pydantic.main.BaseModel.parse_obj()

File /opt/conda/lib/python3.10/site-packages/pydantic/main.py:342, in pydantic.main.BaseModel.__init__()

ValidationError: 1 validation error for Dandiset
license
  ensure this value has at least 1 items (type=value_error.list.min_items; limit_value=1)
@bendichter bendichter changed the title client.get_metadata not working for all dandisets dandiset.get_metadata not working for all dandisets Feb 6, 2023
@yarikoptic
Copy link
Member

ah - right, now I remembered it, thank you @bendichter and I searched its previous instantiation in #1181 which is a twin of #1189 ;) So the issue here is pretty much on what is our "guarantee" here while returning metadata -- is it valid according to the schema or not. We have closed two prior instances without discussing, but may be let's discuss here. What would you Ben prefer it to behave like? May be if we somehow inform user that if they are interested in possibly invalid metadata, they could use get_raw_metadata?

@satra
Copy link
Member

satra commented Feb 7, 2023

for the specific issue, we can add a keyword arg to get_metadata, such as validate=False, which could internally do: Dandiset.unvalidated(**dandiset.get_raw_metadata()) if there is a ValidationError

of course a user can do that too.

the bigger question is around what should it return. in this particular case license became a required property, now set on creation, but this draft dandiset doesn't have one. i think we should first clean, upgrade, publish dandisets before trying to deal with this particular edge case or others. there are many cases where metadata are invalid since data submitters don't always fix them.

@bendichter
Copy link
Member Author

@satra that would make the query code more convoluted than it already is because we'd have to handle both cases. Also, if I understand correctly it would prevent users from editing metadata using the DandiAPIClient. My preferred solution would be:

  1. add a is_valid_draft flag or something similar on the dandiset and asset objects (I suppose this is similar to try/except on the user side, but I try to avoid that pattern)
  2. add some stink to the dandiset landing page to encourage users to fix metadata. Note that this is draft validation, not publication validation, so it is not covered by our existing validation messages that only cover publication validation.
  3. identify and try to fix invalid metadata for old dandisets. I can reach out to dandiset owners and try to get them to fix it. In this case, I'm pretty confident they'd be happy with CC-BY, it's just a matter of getting them to change it.

@satra
Copy link
Member

satra commented Feb 7, 2023

would prevent users from editing metadata using the DandiAPIClient

i don't see why this would be the case. it's still a Dandiset model object and can be locally validated/updated and pushed.

for 1, there is already a field that reflects validity of a dandiset: <dandiset>.version.status
3 is not just license, but other errors as well. all of these are documented in both the DLP (in the validation section), but also in info api endpoint.

@jwodder - is the dandiset version info api endpoint exposed somewhere in the library, so one could list all the validation errors locally ?

i just checked 2, and no published dandiset is invalid under the current schema.

from pydantic import ValidationError
for idx, dandiset in enumerate(dandisets):
    if dandiset.version.identifier == 'draft':
        continue
    try:
        metadata = dandiset.get_metadata()
    except ValidationError as e:
        print(idx, e)

@jwodder
Copy link
Member

jwodder commented Feb 7, 2023

@satra

is the dandiset version info api endpoint exposed somewhere in the library, so one could list all the validation errors locally ?

That endpoint is exposed via Dandiset.get_version(), but the structure returned does not store the validation errors.

@bendichter
Copy link
Member Author

bendichter commented Feb 7, 2023

@satra I'd like to homogenize the dandisets to facilitate reasonably terse and simple query code. Ideally users would not have to worry about cases where metadata does not match our schema. Your code would not allow the search to include dandisets that fail validation. It is true that many of these are empty or test datasets of little value, but there are also a substantial number that are high-value contributions including from the Allen Institute and others. I'd prefer to repair these if possible- at least the large ones. Also, if I understand your code correctly you are skipping all non-published dandisets. That's fine as a user option but I would not expect everyone to want to do that.

I went through all the validation errors for existing dandisets. Many are empty dandisets. Most validation errors are a missing license. Some are contributor validation error. I would really prefer if these contributor validation errors were caught in the metadata editor form. Let's use these to try to patch the UI to prevent users from creating invalid metadata.

18, 24, 30, 31, 32, 33, 38, 42, 46, 47, 63, 71, 72, 106, 112, 113, 114, 116, 120, 124, 131, 132 (and probably more, I started skipping them): missing a license but is also empty. 32 does not even have an owner. I'm not sure how that happened. It's not too hard to skip these with something like:

if dandiset.version.size == 0:
    continue;

invalid metadata
21, 22: important Allen datasets that are missing licenses. Owned by Wayne Wakeman who is no longer with Allen, Josh Siegle is listed as a contact author but not a dandiset owner.
28: dandiset by Matthias Hennig missing contributors and license. I have sent a request to update.
update: Matthias requested that we delete this dandiset
36: Allen dandiset missing a license. Owned by Jerome. I sent a request for him to fix it.
50: Allen dandiset missing a license. Several owners.
51, 52: Owned by Lee Kamentsky. Missing a license and contributors. Described as a testing datasets.
58: Owned by Lee Kamentsky. Missing contributor.
60: Missing license.
66: Missing a license. @satra you are an owner of this one. Can you add a license?
68: Testing dataset by Thomas Braun missing a license.
105: Owned by Lee Kamentsky. Validation error appears to be improper URL. See relevant issue here.
107: missing contributors and non-supported license
144: missing a license.
149: contributor validation error. See relevant issue here
235, 236, 237, 238, 297: contributor validation error. Looks like a simple name formatting issue.
338: validation error due to unusual Disorder metadata entry. Relevant issue here
362: title is too long. See relevant issue here

@bendichter bendichter changed the title dandiset.get_metadata not working for all dandisets dandiset.get_metadata not working for some dandisets Feb 7, 2023
@satra
Copy link
Member

satra commented Feb 7, 2023

my code was simply to show that published dandisets don't have validation errors, nothing beyond that.

regarding the issues. it's going to be two fold:

  1. defining and implementing the stale dandiset policy should remove many of those but this will require both policy creation and implementation. feel free to start a design doc in this archive on this.
  2. for dandisets that seem important to get the authors/team to fix things. we can offer to help fix some of the metadata issues for now.

for the general issue, we will not necessarily have valid draft dandisets, for whatever reason. hence i don't think any code can assume it to be valid at that stage. we can keep improving all kinds of interfaces, but we have not imposed a zero tolerance model for all aspects of validity. we have only done so for triggering publication.

also these are the metadata issues, once we trigger layout validation, several of these dandisets will be invalid as well. we need to add that to the validation services. (@yarikoptic - this was done outside of dandi schema and also related to web-based validation, rather than local validation).

@bendichter
Copy link
Member Author

my code was simply to show that published dandisets don't have validation errors, nothing beyond that.

ah ok I see.

for the general issue, we will not necessarily have valid draft dandisets, for whatever reason. hence i don't think any code can assume it to be valid at that stage. we can keep improving all kinds of interfaces, but we have not imposed a zero tolerance model for all aspects of validity. we have only done so for triggering publication.

It makes sense to me why we would allow for improper metadata for old datasets that were created before certain rules were in place. It seems to me the best approach there is to work with these groups to update the metadata, which ideally would be a simple task. For new dandisets, I don't see why we would allow the creation of metadata that does not follow our schema. Why aren't we properly validating this data in the metadata editor form? In my opinion, every way a user is able to use the web UI to create improper metadata is a bug in the metadata editor form validation. I also think this is true for the API, but I feel a bit less strongly about that. In the current state, there is not enough enforcement on the metadata to facilitate simple structured queries, particularly if you are requiring metadata to fit the schema perfectly to even build the metadata object I am meant to be querying. I can run queries on the raw_metadata hierarchical dictionary, which is what I have been doing, but this requires logic all over the place for the random non-compliant metadata dictionaries.

@yarikoptic
Copy link
Member

NB I think the discussion is great but have potential to derail. Hence for

(@yarikoptic - this was done outside of dandi schema and also related to web-based validation, rather than local validation).

filed dandi/dandi-schema#157 so we better continue on that aspect there.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants