Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

validate: RF to collect/output more informative structured records #943

Closed
yarikoptic opened this issue Mar 21, 2022 · 3 comments · Fixed by #1104
Closed

validate: RF to collect/output more informative structured records #943

yarikoptic opened this issue Mar 21, 2022 · 3 comments · Fixed by #1104
Assignees

Comments

@yarikoptic
Copy link
Member

yarikoptic commented Mar 21, 2022

Currently validate is just collecting a list of string error representations. As @CodyCBakerPhD mentions in https://github.com/dandi/dandi-cli/pull/941/files#r830671373 , nwbinspector provides more informative result records with contextual information about the error etc.

Similarly for BIDS validation being added in #896 error messages are also serialized/simplified, although stock bids-validator has quite an extensive registry of warnings/errors: google spreadsheet

Looking forward we should establish a ValidationResult data type/class/record which would be expressive to cover validation results from various validators. I foresee it to also reflect the "source" of the validation result, so overall should have (more to be filled out later on; yet to look at what nwbinspector provides):

  • origin: what produced this validation result
    • name: str , {"pynwb.load", "pynwb.validate", "nwbinspector", "dandischema", "dandi-bids", "ome-zarr"}
    • version: str -- version of the tool producing the result, e.g. "0.3.1"
  • severity: {'ERROR', 'WARNING', 'HINT'} -- for now just reflecting
  • id: str (or id: list?) -- identifier of that particular type of result. For BIDS could be its Key or numeric ID. Providing a unique string ID like BIDS does quite nice since gives immediate summary in its name
  • scope: {'file', 'dandiset'} -- either it is
  • path: str - path to file or folder/dataset (e.g. BIDS of dandiset) which triggered the error
  • asset_paths: Optional[list[str]] - defaults to [.path] if not defined. Otherwise - could point to an actual file (or collection of files) which are affected by a warning/error in the .path. E.g. {"path": "task-broken_bold.json", "asset_paths": ["sub-01/func/sub-01_task-broken_bold.json", "sub-02/func/sub-02_task-broken_bold.json"]}.
  • dataset_path: Optional[str] -- path to the dataset based on validation of which message is based (e.g. could be a nested BIDS dataset, or not present for pynwb validations since they do not care)
  • message: str - it would be the "Reason" of BIDS or a string interpolation of nwbinspector record?

Use

  • dandi validate should gain --severity=HINT,WARNING,ERROR switch to provide minimal level for what to include in the report
  • dandi upload should use validate with severity='ERROR' thus only allowing for uploads without errors but possibly which would trigger WARNINGs and any other lower level problems

edits:

@effigies
Copy link

effigies commented Apr 1, 2022

  • @effigies also expressed some utopian idea -- if schema-based errors could point to specific BIDS schema file which instructed that particular structure, that would be useful; but to me it sounds like tricky to implement to make checks aware of original location of definitions in schema

@yarikoptic To be clear, my utopian idea was that we could tell where errors in the BIDS dataset arose. For example:

task-rest_bold.json: {"EchoTime": 30}
sub-01/func/sub-01_task-rest_bold.json: {"RepetitionTime": 2000}

When I look at the logical metadata now, I see {"EchoTime": 30, "RepetitionTime": 2000}, but if we tagged each entry like as {"EchoTime": (30, "task-rest_bold.json"), "RepetitionTime": (2000, "sub-01/func/sub-01_task-rest_bold.json")}, we could produce errors like:

IMPLAUSIBLE_ECHO_TIME: 30 seconds is an unlikely echo time (defined in file: task-rest_bold.json)
IMPLAUSIBLE_REPETITION_TIME: 2000 seconds is an unlikely repetition time (defined in file: sub-01/func/sub-01_task-rest_bold.json)

@yarikoptic
Copy link
Member Author

yarikoptic commented Apr 29, 2022

@effigies thanks for bringing it up. Please correct me or just elaborate:

  • in most of the cases it would correspond to the file (data or sidecar) which is "validated", and only in some cases it might correspond to some other file where value was actually defined and picked up through some available mechanism (e.g. "inheritance", "sidecar", another "part" of a data file which all need to be considered/validated together)

  • nwbinspector record already has smth like that: https://github.com/NeurodataWithoutBorders/nwbinspector/blob/dev/nwbinspector/register_checks.py#L64 has file_path (the nwb file in question/being validated) and location (relative path to the inspected object, which might be external to that .nwb file)

  • altogether I feel that we might benefit from a notion of an asset [1] as a collection of files, and indeed track where data or metadata comes from for a more precise location pointing. For now added asset_path such as

  • asset_paths: Optional[list[str]] - defaults to [.path] if not defined. Otherwise - could point to an actual file (or collection of files) which are affected by a warning/error in the .path. E.g. {"path": "task-broken_bold.json", "asset_paths": ["sub-01/func/sub-01_task-broken_bold.json", "sub-02/func/sub-02_task-broken_bold.json"]} .

WDYT @effigies ?

[1] may be some other name to not overlap since our asset was envisioned to be such a collection but ended up being at large 1-to-1 mapping from file or a directory (.zarr/) into an asset

edit 1: added dataset_path as well to disambiguate a possible "reference" for the validation. But may be it should somehow be "groupped" into origin part of the record?

@yarikoptic
Copy link
Member Author

Some incomplete thinking about `path`s (absolute/relative/etc)
  • recalled a possible convention for ./, / and "pure relative" paths in BIDS datasets
  • absolute paths (e.g. datalad result records are returned with abs paths)
    • pros: would make it non-ambiguous
    • cons: any permanent serialization would be fragile (broken if dandiset is moved etc)
  • could be relative to the top of the dandiset
    • it is ok as long as we do not allow for nesting of dandisets, but I guess we might arrive to those later
      ...

yarikoptic added a commit that referenced this issue Aug 26, 2022
* origin/master:
  Removed deprecated parameter
  Depending on bidsschematools
  Removed debug option
  Using upstream schema reference root specification
  Dropped reference log
  Let's see how this goes
  Add test
  get_content_url(): If a HEAD fails, return the failing URL
  Update CHANGELOG.md [skip ci]
  DOC: minor - boost copyright years in the docs
yarikoptic added a commit that referenced this issue Oct 28, 2022
Structured validation results
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants