[ENH] Reorganize derivatives #488

effigies · 2020-05-30T17:40:45Z

Following comments from @robertoostenveld (#265 (comment)), and a short thread involving @Lestropie and @Remi-Gau (#265 (comment)), I have tried to reorganize the first two sections of derivatives into a more logical grouping.

I have:

Moved "Storage of derived datasets" and "Non-compliant datasets" into the "Common Principles" section of the main spec, under "Source vs. raw vs. derived data", and changed the latter to "Non-compliant derivatives". I added an example of a derivatives dataset using code/ and sourcedata/ to encapsulate its own provenance, as a partial fulfillment of @Remi-Gau's request in [ENH] BEP 003: Common Derivatives #265 (comment).
Moved "Derived dataset and pipeline description" as a subsection under dataset_description.json. This was motivated partially because it is too detail-oriented for the derivatives introduction, but not really a "common data type". Given the move of (1) into the main spec, this felt less out-of-place.
Moved "Common file-level metadata fields" under "Common data types" and changed it to "Common data types and metadata".
Separated the "Spatial reference" discussion from "Common file-level metadata fields", because it's starting to get into a discussion of the space entity. Giving it some extra space should reduce the confusion and recognizes that it's not quite comparable to the other fields. The rewritten text is intended to address @robertoostenveld's concerns in [ENH] BEP 003: Common Derivatives #265 (comment).

Let me know what y'all think. I really appreciate you taking some weekend time to look over this.

Remi-Gau · 2020-05-30T17:59:30Z

src/02-common-principles.md

+
+    Example of a derivative dataset including the raw dataset as source:
+
+    ```Plain


This example looks good to me. I am sure generations of new users will appreciate. :-)

Remi-Gau

AFAICT LGTM

effigies · 2020-05-31T20:58:21Z

@Lestropie @robertoostenveld I would appreciate your review before merging this.

robertoostenveld

Overall I think that it is a good improvement! I have reviewed the changes to individual files, but don’t have the overview yet. For that it will help to merge this and render as HTML on RTD.

robertoostenveld · 2020-06-01T14:45:24Z

src/03-modality-agnostic-files.md

+```
+
+If a derived dataset is stored as a subfolder of the raw dataset, then the `Name` field
+of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name.


in the 02-common-principles you use "fmriprep-v1.4.1" as the directory name, whereas here in the JSON the name is capitalized as "FMRIPREP". I would expect the requirement for a substring to be that the case matches.

robertoostenveld · 2020-06-01T14:50:21Z

src/05-derivatives/02-common-data-types.md

+
+## Common file level metadata fields
+
+Each derivative file SHOULD be described by a JSON file provided as a sidecar or


"Each derivative file SHOULD be described by a JSON file"; this creates recursion since JSON files should then also be described by a JSON file. I propose to write "Each derivative data file ..."

robertoostenveld · 2020-06-01T14:58:40Z

src/05-derivatives/02-common-data-types.md

+| **Key name** | **Description**                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                            |
+| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Description  | RECOMMENDED. Free-form natural language description of the nature of the file.                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+| Sources      | OPTIONAL. A list of paths relative to dataset root pointing to the file(s) that were directly used in the creation of this derivative. For example, if a derivative A is used in the creation of another derivative B, which is in turn used to generate C in a chain of A->B->C, C should only list B in `Sources`, and B should only list A in `Sources`. However, in case both X and Y are directly used in the creation of Z, then Z should list X and Y in `Sources`, regardless of whether X was used to generate Y. |


A list of paths ... pointing to the file(s). It is unclear to me what is meant by "pointing". Do the paths include the file name? Then it would not be pointing but rather something like this:

"A list of files with the path specified relative to the dataset root; these files were directly used in the creation of this derivative data file."

If pointing is meant to as "giving a direction, but not putting your finger directly on it" then I would read this as

"A list of directories specified relative to the dataset root; these directories contain the files that were directly used in the creation of this derivative data file."

robertoostenveld · 2020-06-01T15:01:25Z

src/05-derivatives/02-common-data-types.md

+| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
+| Description  | RECOMMENDED. Free-form natural language description of the nature of the file.                                                                                                                                                                                                                                                                                                                                                                                                                                             |
+| Sources      | OPTIONAL. A list of paths relative to dataset root pointing to the file(s) that were directly used in the creation of this derivative. For example, if a derivative A is used in the creation of another derivative B, which is in turn used to generate C in a chain of A->B->C, C should only list B in `Sources`, and B should only list A in `Sources`. However, in case both X and Y are directly used in the creation of Z, then Z should list X and Y in `Sources`, regardless of whether X was used to generate Y. |
+| RawSources   | OPTIONAL. A list of paths relative to dataset root pointing to the BIDS-Raw file(s) that were used in the creation of this derivative. When the derivative filename does not define a `space` keyword, the first entry of `RawSources` MUST be defined and it will define the `scanner` coordinate system that applies.                                                                                                                                                                                                    |


"... the first entry of RawSources MUST be defined ..." I suggest changing this "defined" to "specified".

Part of my confusion remains, due to the OPTIONAL which then later turns into MUST. Here is an attempt to rewrite:

REQUIRED or OPTIONAL: If the derived file does not contain space, it is required and the first file in the list defines the coordinate system; if the derived file does contain space, it is optional.

Another concern is that it is unclear to me how RawSources would be used with a derived EEG data file (which are also covered by this common specification), or a derived behav-only data file (also part of the BIDS specification). The raw source EEG data file that it would point to does not define a scanner coordinate system. I believe that this description can only be applied to derived data files (and corresponding source files) that define some spatial coordinate system. As such I do see how this can apply to MEG (e.g. following Maxfilter).

The validator should not flag derived EEG or behavioural datasets without RawSources and without space keyword as invalid.

Talking with @oesteban I think this use of RawSources to define a reference has been made obsolete by SpatialReference. I will remove this language, and adjust the language in the example.

robertoostenveld · 2020-06-01T15:18:25Z

src/05-derivatives/02-common-data-types.md

+### Examples
+
+Preprocessed `bold` NIfTI file in the original coordinate space of the
+original run. Please mind that in this case `RawSources` key is REQUIRED


I propose to clarify this as "Please mind that in this case, since the space keyword is not part of the file name, the RawSources key is REQUIRED (and also..."

effigies · 2020-06-01T17:45:58Z

@robertoostenveld It's rendered in both the build_docs_artifact and the docs/readthedocs.org:bids-specification checks.

Since you're conceptually okay with the reorg, I will address the easy things here, merge, and make comments on the main PR to bring up questions that need further discussion (if any).

…ifying wording

Lestropie · 2020-06-02T01:23:25Z

src/02-common-principles.md

+1.  As a standalone dataset independent of the source (raw or derived) BIDS
+    dataset.
+    This way of specifying derivatives is particularly useful when the source
+    dataset is provided with read-only access, for publishing derivatives as
+    independent bodies of work, or for describing derivatives that were created
+    from more than one source dataset.
+    The `sourcedata/` subdirectory MAY be used to include the source dataset(s)
+    that were used to generate the derivatives.
+    Likewise, any code used to generate the derivatives from the source data
+    MAY be included in the `code/` subdirectory.
+
+    Example of a derivative dataset including the raw dataset as source:


Given the differential definitions of "source" vs. "raw" datasets in "common principles", it might be better to avoid use of the word "source" in the context of "data from which the derivatives were generated". That or "source" in reference to pre-BIDS-raw data would need to be changed.

There's also some potential ambiguity here in terms of hierarchical directory structure. It's permissible for derivative data to be stored in a sub-directory of the raw dataset, but it's also possible for the raw data to be stored in a sub-directory of the derivative dataset, and none of the directory names are prescribed; so it's a bit of a free-for-all, with the only restriction being storage of raw and derived data at the same hierarchical level. This would mean that for automatic parsing, unambiguously determining what's raw and what's derived would require running the entire validator on each candidate directory. Also, if the raw dataset is provided read-only, writing a read-write version of it along with the derivative data could be seen to be counteracting the intent of that read-only provision. If the raw data are to be in some way distributed alongside the derivative data, doing so according to the suggestion provided in "common principles":

my_dataset/ sourcedata/ ... rawdata/ dataset_description.json participants.tsv sub-01/ sub-02/ ... derivatives/ ...

may be more faithful to such.

Actually looking at it again, this is in (edit: the old version of) "Common principles":

These data MUST be kept in separate sourcedata and derivatives folders

Permitting BIDS-raw data to be stored as a sub-directory within a BIDS-derivatives directory could arguably be seen as contradictory to this requirement, since the derivatives directory is no longer "separate" to the (duplicate of the) raw data but a superset thereof.

So personally I'd probably err on not permitting raw data as sub-directories of derived data.

(Sorry I missed the timeline on this; I'll try to get through it nevertheless and you can use or not use as you wish)

Potentially reinforcing this: consider #492. When reading data from the raw dataset, a parser will traverse up the filesystem hierarchy looking for JSON files that should be read in order. Unless the parser explicitly detects the root of the BIDS-raw directory as being such, and therefore ceasing traversal of the filesystem at that point, it will continue up and (potentially, depending on the design of the parser) include any JSON file at the root of the derivatives dataset in agglomerating sidecar information for a file in the raw dataset.

Lestropie · 2020-06-02T01:31:01Z

src/02-common-principles.md

+Case 2.
+In both cases, every derivatives dataset is considered a BIDS dataset and must
+include a `dataset_description.json` file at the root level (see
+[Dataset description][dataset-description].


Suggested change

[Dataset description][dataset-description].

[Dataset description][dataset-description]).

effigies added 3 commits May 30, 2020 13:25

Reorganize derivatives

ad6b0bf

STY: Clean up links, formatting, outdated text

88ad581

Fix link

3100608

Remi-Gau reviewed May 30, 2020

View reviewed changes

Remi-Gau approved these changes May 30, 2020

View reviewed changes

Update TOC

83105f5

effigies mentioned this pull request May 31, 2020

[ENH] BEP 003: Common Derivatives #265

Merged

5 tasks

robertoostenveld reviewed Jun 1, 2020

View reviewed changes

effigies added 2 commits June 1, 2020 14:11

Make dataset description example consistent with folder name example

9b4a185

Remove RawSources overlap with SpatialReference, some additional clar…

1abaea9

…ifying wording

effigies merged commit ded6875 into bids-standard:common-derivatives Jun 1, 2020

effigies deleted the rf/reorganize_derivatives branch June 1, 2020 18:12

Lestropie reviewed Jun 2, 2020

View reviewed changes

Lestropie mentioned this pull request Jun 2, 2020

Inheritance principle: unsetting of key-value pairs #492

Closed

nicholst changed the title ~~Reorganize derivatives~~ [ENH] Reorganize derivatives Jun 13, 2020

nicholst mentioned this pull request Jun 13, 2020

REL: v1.4.0 #496

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[ENH] Reorganize derivatives #488

[ENH] Reorganize derivatives #488

effigies commented May 30, 2020

Remi-Gau May 30, 2020

Remi-Gau left a comment

effigies commented May 31, 2020

robertoostenveld left a comment

robertoostenveld Jun 1, 2020

robertoostenveld Jun 1, 2020

robertoostenveld Jun 1, 2020

robertoostenveld Jun 1, 2020

robertoostenveld Jun 1, 2020

effigies Jun 1, 2020

robertoostenveld Jun 1, 2020

effigies commented Jun 1, 2020

Lestropie Jun 2, 2020 •

edited

Loading

Lestropie Jun 2, 2020

Lestropie Jun 2, 2020


		Example of a derivative dataset including the raw dataset as source:

		```Plain


		## Common file level metadata fields

		Each derivative file SHOULD be described by a JSON file provided as a sidecar or

	[Dataset description][dataset-description].
	[Dataset description][dataset-description]).

[ENH] Reorganize derivatives #488

[ENH] Reorganize derivatives #488

Conversation

effigies commented May 30, 2020

Choose a reason for hiding this comment

Remi-Gau left a comment

Choose a reason for hiding this comment

effigies commented May 31, 2020

robertoostenveld left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

effigies commented Jun 1, 2020

Lestropie Jun 2, 2020 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Lestropie Jun 2, 2020 •

edited

Loading