validate: RF to collect/output more informative structured records #943

yarikoptic · 2022-03-21T19:45:58Z

Currently validate is just collecting a list of string error representations. As @CodyCBakerPhD mentions in https://github.com/dandi/dandi-cli/pull/941/files#r830671373 , nwbinspector provides more informative result records with contextual information about the error etc.

Similarly for BIDS validation being added in #896 error messages are also serialized/simplified, although stock bids-validator has quite an extensive registry of warnings/errors: google spreadsheet

Looking forward we should establish a ValidationResult data type/class/record which would be expressive to cover validation results from various validators. I foresee it to also reflect the "source" of the validation result, so overall should have (more to be filled out later on; yet to look at what nwbinspector provides):

origin: what produced this validation result
- name: str , {"pynwb.load", "pynwb.validate", "nwbinspector", "dandischema", "dandi-bids", "ome-zarr"}
- version: str -- version of the tool producing the result, e.g. "0.3.1"
severity: {'ERROR', 'WARNING', 'HINT'} -- for now just reflecting
id: str (or id: list?) -- identifier of that particular type of result. For BIDS could be its Key or numeric ID. Providing a unique string ID like BIDS does quite nice since gives immediate summary in its name
scope: {'file', 'dandiset'} -- either it is
path: str - path to file or folder/dataset (e.g. BIDS of dandiset) which triggered the error
asset_paths: Optional[list[str]] - defaults to [.path] if not defined. Otherwise - could point to an actual file (or collection of files) which are affected by a warning/error in the .path. E.g. {"path": "task-broken_bold.json", "asset_paths": ["sub-01/func/sub-01_task-broken_bold.json", "sub-02/func/sub-02_task-broken_bold.json"]}.
dataset_path: Optional[str] -- path to the dataset based on validation of which message is based (e.g. could be a nested BIDS dataset, or not present for pynwb validations since they do not care)
message: str - it would be the "Reason" of BIDS or a string interpolation of nwbinspector record?

Use

dandi validate should gain --severity=HINT,WARNING,ERROR switch to provide minimal level for what to include in the report
dandi upload should use validate with severity='ERROR' thus only allowing for uploads without errors but possibly which would trigger WARNINGs and any other lower level problems

edits:

here is what @nellh has for data structures to encapsulate in schema-based bids-validator: https://github.com/bids-standard/bids-validator/pull/1413/files#diff-3e29891088bc511ec972c5b36dcb579d59db3552fa69b3c882e065931d790664R6
@effigies also expressed some utopian idea -- if schema-based errors could point to specific BIDS schema file which instructed that particular structure, that would be useful; but to me it sounds like tricky to implement to make checks aware of original location of definitions in schema
nwbinspector's @dataclass describing InspectorMessage: https://github.com/NeurodataWithoutBorders/nwbinspector/blob/dev/nwbinspector/register_checks.py#L38

The text was updated successfully, but these errors were encountered:

effigies · 2022-04-01T13:36:59Z

@effigies also expressed some utopian idea -- if schema-based errors could point to specific BIDS schema file which instructed that particular structure, that would be useful; but to me it sounds like tricky to implement to make checks aware of original location of definitions in schema

@yarikoptic To be clear, my utopian idea was that we could tell where errors in the BIDS dataset arose. For example:

task-rest_bold.json: {"EchoTime": 30}
sub-01/func/sub-01_task-rest_bold.json: {"RepetitionTime": 2000}

When I look at the logical metadata now, I see {"EchoTime": 30, "RepetitionTime": 2000}, but if we tagged each entry like as {"EchoTime": (30, "task-rest_bold.json"), "RepetitionTime": (2000, "sub-01/func/sub-01_task-rest_bold.json")}, we could produce errors like:

IMPLAUSIBLE_ECHO_TIME: 30 seconds is an unlikely echo time (defined in file: task-rest_bold.json)
IMPLAUSIBLE_REPETITION_TIME: 2000 seconds is an unlikely repetition time (defined in file: sub-01/func/sub-01_task-rest_bold.json)

yarikoptic · 2022-04-29T19:43:57Z

@effigies thanks for bringing it up. Please correct me or just elaborate:

in most of the cases it would correspond to the file (data or sidecar) which is "validated", and only in some cases it might correspond to some other file where value was actually defined and picked up through some available mechanism (e.g. "inheritance", "sidecar", another "part" of a data file which all need to be considered/validated together)
nwbinspector record already has smth like that: https://github.com/NeurodataWithoutBorders/nwbinspector/blob/dev/nwbinspector/register_checks.py#L64 has file_path (the nwb file in question/being validated) and location (relative path to the inspected object, which might be external to that .nwb file)
altogether I feel that we might benefit from a notion of an asset [1] as a collection of files, and indeed track where data or metadata comes from for a more precise location pointing. For now added asset_path such as
asset_paths: Optional[list[str]] - defaults to [.path] if not defined. Otherwise - could point to an actual file (or collection of files) which are affected by a warning/error in the .path. E.g. {"path": "task-broken_bold.json", "asset_paths": ["sub-01/func/sub-01_task-broken_bold.json", "sub-02/func/sub-02_task-broken_bold.json"]} .

WDYT @effigies ?

[1] may be some other name to not overlap since our asset was envisioned to be such a collection but ended up being at large 1-to-1 mapping from file or a directory (.zarr/) into an asset

edit 1: added dataset_path as well to disambiguate a possible "reference" for the validation. But may be it should somehow be "groupped" into origin part of the record?

yarikoptic · 2022-04-29T20:02:03Z

Some incomplete thinking about `path`s (absolute/relative/etc)

recalled a possible convention for ./, / and "pure relative" paths in BIDS datasets
absolute paths (e.g. datalad result records are returned with abs paths)
- pros: would make it non-ambiguous
- cons: any permanent serialization would be fragile (broken if dandiset is moved etc)
could be relative to the top of the dandiset
- it is ok as long as we do not allow for nesting of dandisets, but I guess we might arrive to those later
  ...

* origin/master: Removed deprecated parameter Depending on bidsschematools Removed debug option Using upstream schema reference root specification Dropped reference log Let's see how this goes Add test get_content_url(): If a HEAD fails, return the failing URL Update CHANGELOG.md [skip ci] DOC: minor - boost copyright years in the docs

Structured validation results

yarikoptic added the cmd-validate label Mar 21, 2022

This was referenced Mar 21, 2022

Integrate NWBInspector with DANDI validation #941

Merged

[Discussion]: Add Error Codes? NeurodataWithoutBorders/nwbinspector#34

Closed

improve validation process dandi/dandi-schema#127

Open

yarikoptic mentioned this issue Mar 28, 2022

Adding BIDS support #896

Merged

10 tasks

yarikoptic mentioned this issue Jun 23, 2022

Initial validate support, with --strict option ome/ome-zarr-py#142

Closed

yarikoptic assigned jwodder Jul 13, 2022

yarikoptic mentioned this issue Jul 22, 2022

RF "files.py" validation/metadata-loading to support BIDS #1044

Closed

jwodder mentioned this issue Aug 18, 2022

Structured validation results #1104

Merged

TheChymera added a commit that referenced this issue Sep 9, 2022

Merge branch 'gh-943' of github.com:dandi/dandi-cli into gh-943

df7f978

TheChymera added a commit that referenced this issue Sep 21, 2022

Merge branch 'gh-943' of github.com:dandi/dandi-cli into gh-943

a748e56

TheChymera added a commit that referenced this issue Oct 5, 2022

Merge branch 'gh-943' of github.com:dandi/dandi-cli into gh-943

e874fcb

yarikoptic closed this as completed in #1104 Oct 28, 2022

yarikoptic added a commit that referenced this issue Oct 28, 2022

Merge pull request #1104 from dandi/gh-943

48a222f

Structured validation results

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

validate: RF to collect/output more informative structured records #943

validate: RF to collect/output more informative structured records #943

yarikoptic commented Mar 21, 2022 •

edited

Loading

effigies commented Apr 1, 2022

yarikoptic commented Apr 29, 2022 •

edited

Loading

yarikoptic commented Apr 29, 2022

validate: RF to collect/output more informative structured records #943

validate: RF to collect/output more informative structured records #943

Comments

yarikoptic commented Mar 21, 2022 • edited Loading

Use

effigies commented Apr 1, 2022

yarikoptic commented Apr 29, 2022 • edited Loading

yarikoptic commented Apr 29, 2022

yarikoptic commented Mar 21, 2022 •

edited

Loading

yarikoptic commented Apr 29, 2022 •

edited

Loading