Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BF+RF validation #1209

Merged
merged 11 commits into from
Feb 9, 2023
Merged

BF+RF validation #1209

merged 11 commits into from
Feb 9, 2023

Conversation

yarikoptic
Copy link
Member

Might still be incomplete since I left in place the odd dichotomy of results in validation depending on either schema_version is provided or not, as emphasized in the tests at https://github.com/dandi/dandi-cli/blob/HEAD/dandi/tests/test_files.py#L313 . @jwodder @TheChymera do you remember a reason why we did it this way?

meanwhile I made LocalAsset to always validate against schema even if a schema_version is not provided and that might be going against some logic which is pointed to above. But I think we should simplify/harmonize it here one way or another: we might want to always validate against everything possible and then results just filtered, or would need to pass some filtering option (e.g. listing validation id prefixes thus to enable only some) to select which validators to use. So if no objections/concerns would be stated, I would proceed to remove dichotomy in NWBAsset validation and always include super's validation (thus validation against schema).

Then I fixed up validation of zarrs in that their metadata records should be populated with zarr specific dummy checksum (previously tests would not even get to that case I believe).

…if no schema version provided

Otherwise (if no schema provided) -- we were completely skipping
validation, and that is IMHO is not right.

Such change already breakes some tests, e.g.

    dandi/tests/test_files.py::test_validate_deep_zarr
    >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>> PDB set_trace >>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>>
    > /home/yoh/proj/dandi/dandi-cli-master/dandi/files/bases.py(183)get_validation_errors()
    -> if devel_debug:
    (Pdb) l
    178  	        try:
    179  	            asset = self.get_metadata(digest=DUMMY_DIGEST)
    180  	            BareAsset(**asset.dict())
    181  	        except ValidationError as e:
    182  	            import pdb; pdb.set_trace()
    183  ->	            if devel_debug:
    184  	                raise
    185  	            # TODO: how do we get **all** errors from validation - there must be a way
    186  	            return [
    187  	                ValidationResult(
    188  	                    origin=ValidationOrigin(
    (Pdb) p e
    ValidationError(model='BareAsset', errors=[{'loc': ('digest',), 'msg': 'A zarr asset must have a zarr checksum.', 'type': 'value_error'}])
we have logic quite duplicated in a number of places and similar
exception handling was already RFed in DandisetMetadataFile but was
left "old" in the LocalAsset .
…per usage, fixed type annotation

Also getting proper `msg` field, not `message` from the dict although
allowing for both since why not.
We can easily have too deep and some other schema based errors at the same
time.
… Path among types

Note: for some reason our type testing was not triggerring error but pycharm
did highlight this issue for me.  @jwodder might know more.
@yarikoptic yarikoptic added minor Increment the minor version when merged cmd-validate labels Feb 8, 2023
@codecov
Copy link

codecov bot commented Feb 8, 2023

Codecov Report

Base: 89.19% // Head: 89.24% // Increases project coverage by +0.04% 🎉

Coverage data is based on head (9890f00) compared to base (4ccbe12).
Patch coverage: 59.45% of modified lines in pull request are covered.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #1209      +/-   ##
==========================================
+ Coverage   89.19%   89.24%   +0.04%     
==========================================
  Files          76       76              
  Lines        9487     9492       +5     
==========================================
+ Hits         8462     8471       +9     
+ Misses       1025     1021       -4     
Flag Coverage Δ
unittests 89.24% <59.45%> (+0.04%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
dandi/files/bases.py 79.07% <52.00%> (+0.35%) ⬆️
dandi/files/zarr.py 87.95% <66.66%> (-0.22%) ⬇️
dandi/misctypes.py 62.96% <100.00%> (+0.34%) ⬆️
dandi/pynwb_utils.py 82.68% <100.00%> (ø)
dandi/metadata.py 87.79% <0.00%> (+0.23%) ⬆️
dandi/dandiapi.py 87.67% <0.00%> (+0.44%) ⬆️

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

☔ View full report at Codecov.
📢 Do you have feedback about the report comment? Let us know in this issue.

@@ -743,12 +739,15 @@ def _get_nwb_inspector_version():


def _pydantic_errors_to_validation_results(
errors: Any[list[dict], Exception],
errors: list[dict | Exception] | ValidationError,
file_path: str,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since the only use of file_path in this function converts it to a Path, it would be better to make this argument a Path and remove the str() call when passing it.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

indeed! thanks, done in b6f8180

@@ -488,6 +476,12 @@ def get_validation_errors(
schema_version: Optional[str] = None,
devel_debug: bool = False,
) -> list[ValidationResult]:
"""Validate NWB asset

If `schema_version` was provided, we only validate basic metadata,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
If `schema_version` was provided, we only validate basic metadata,
If ``schema_version`` was provided, we only validate basic metadata,

One backtick for things that can be linked to (like classes and functions), two backticks for other code.

@yarikoptic
Copy link
Member Author

Thanks @jwodder for the review. Before I embark on further RF journey -- do you remember/have an idea why we had that different behavior depending on schema_version I mentioned/changed here for the LocalAsset?

@@ -180,6 +182,9 @@ def get_metadata(
) -> BareAsset:
metadata = get_default_metadata(self.filepath, digest=digest)
metadata.encodingFormat = ZARR_MIME_TYPE
# TODO: .size obtained via get_default_metadata would be the one
# from os.stat and thus not reflective of actual size of the zarr
# folder.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, for Zarrs, the size is extracted from the digest:

dandi-cli/dandi/metadata.py

Lines 993 to 998 in 4920831

if digest is not None and digest.algorithm is models.DigestType.dandi_zarr_checksum:
m = re.fullmatch(
r"(?P<hash>[0-9a-f]{32})-(?P<files>[0-9]+)--(?P<size>[0-9]+)", digest.value
)
if m:
metadata["contentSize"] = int(m["size"])

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, missed that, thanks! I will remove the TODO but might RF to move this zarr specific code into zarr specific place ;)

@jwodder
Copy link
Member

jwodder commented Feb 8, 2023

@yarikoptic I think the goal of schema_version was to use the old, "flat" metadata if no schema_version was set, and to use the dandischema classes if it was set. It was added during the transition to the dandischema classes.

@yarikoptic
Copy link
Member Author

Thanks! that aligns with my observation on behavior of metadata extraction on dandiset to use _check_required_fields... I think I will then continue RFing in the follow up PR to make it always work with dandischema classes, and will merge this one meanwhile unless you would recommend to grow this single PR @jwodder .

Copy link
Contributor

@TheChymera TheChymera left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looked through the diff, nothing I can see wrong with it. However, the schema specification dichotomy wasn't my design decision, and I'm unsure why it's done this way, so with respect to that me not seeing anything wrong might be of limited value.

One thing which I do see, is a bit of extra steps for string/Path object conversion. We have a bunch of gotchas like that where some things take strings and some others Path objects. Probably not worth dealing with in this PR, just putting it out there. Maybe paths should always be strings, as they always are in GNU utils, and the objects should be created only if we ever need the advanced features like .relative_to() which can't be easily done with os.path.

@jwodder
Copy link
Member

jwodder commented Feb 9, 2023

@TheChymera

Maybe paths should always be strings

I disagree. Paths should be Paths, in order to make it clear that they're paths and in order to take advantage of the simpler & more powerful pathlib API.

@yarikoptic
Copy link
Member Author

I tend to agree with @jwodder that Path is more versatile and easier to use construct, and indeed we can then benefit from type annotation. I wish even if there was convention to encode "RelativePath" and ensuring that proper conversion is done where needed etc.

@yarikoptic
Copy link
Member Author

with above, let's proceed and I hope to push more work here soon.

@yarikoptic yarikoptic merged commit 662f8ab into master Feb 9, 2023
@yarikoptic yarikoptic deleted the bfrf-validation branch February 9, 2023 18:39
@github-actions
Copy link

🚀 PR was released in 0.49.0 🚀

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cmd-validate minor Increment the minor version when merged released
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants