Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Reorganize derivatives #488

Merged

Conversation

effigies
Copy link
Collaborator

Following comments from @robertoostenveld (#265 (comment)), and a short thread involving @Lestropie and @Remi-Gau (#265 (comment)), I have tried to reorganize the first two sections of derivatives into a more logical grouping.

I have:

  1. Moved "Storage of derived datasets" and "Non-compliant datasets" into the "Common Principles" section of the main spec, under "Source vs. raw vs. derived data", and changed the latter to "Non-compliant derivatives". I added an example of a derivatives dataset using code/ and sourcedata/ to encapsulate its own provenance, as a partial fulfillment of @Remi-Gau's request in [ENH] BEP 003: Common Derivatives #265 (comment).
  2. Moved "Derived dataset and pipeline description" as a subsection under dataset_description.json. This was motivated partially because it is too detail-oriented for the derivatives introduction, but not really a "common data type". Given the move of (1) into the main spec, this felt less out-of-place.
  3. Moved "Common file-level metadata fields" under "Common data types" and changed it to "Common data types and metadata".
  4. Separated the "Spatial reference" discussion from "Common file-level metadata fields", because it's starting to get into a discussion of the space entity. Giving it some extra space should reduce the confusion and recognizes that it's not quite comparable to the other fields. The rewritten text is intended to address @robertoostenveld's concerns in [ENH] BEP 003: Common Derivatives #265 (comment).

Let me know what y'all think. I really appreciate you taking some weekend time to look over this.


Example of a derivative dataset including the raw dataset as source:

```Plain
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example looks good to me. I am sure generations of new users will appreciate. :-)

Copy link
Collaborator

@Remi-Gau Remi-Gau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAICT LGTM

@effigies
Copy link
Collaborator Author

@Lestropie @robertoostenveld I would appreciate your review before merging this.

@effigies effigies mentioned this pull request May 31, 2020
5 tasks
Copy link
Collaborator

@robertoostenveld robertoostenveld left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall I think that it is a good improvement! I have reviewed the changes to individual files, but don’t have the overview yet. For that it will help to merge this and render as HTML on RTD.

```

If a derived dataset is stored as a subfolder of the raw dataset, then the `Name` field
of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

in the 02-common-principles you use "fmriprep-v1.4.1" as the directory name, whereas here in the JSON the name is capitalized as "FMRIPREP". I would expect the requirement for a substring to be that the case matches.


## Common file level metadata fields

Each derivative file SHOULD be described by a JSON file provided as a sidecar or
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"Each derivative file SHOULD be described by a JSON file"; this creates recursion since JSON files should then also be described by a JSON file. I propose to write "Each derivative data file ..."

| **Key name** | **Description** |
| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Description | RECOMMENDED. Free-form natural language description of the nature of the file. |
| Sources | OPTIONAL. A list of paths relative to dataset root pointing to the file(s) that were directly used in the creation of this derivative. For example, if a derivative A is used in the creation of another derivative B, which is in turn used to generate C in a chain of A->B->C, C should only list B in `Sources`, and B should only list A in `Sources`. However, in case both X and Y are directly used in the creation of Z, then Z should list X and Y in `Sources`, regardless of whether X was used to generate Y. |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A list of paths ... pointing to the file(s). It is unclear to me what is meant by "pointing". Do the paths include the file name? Then it would not be pointing but rather something like this:

"A list of files with the path specified relative to the dataset root; these files were directly used in the creation of this derivative data file."

If pointing is meant to as "giving a direction, but not putting your finger directly on it" then I would read this as

"A list of directories specified relative to the dataset root; these directories contain the files that were directly used in the creation of this derivative data file."

| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| Description | RECOMMENDED. Free-form natural language description of the nature of the file. |
| Sources | OPTIONAL. A list of paths relative to dataset root pointing to the file(s) that were directly used in the creation of this derivative. For example, if a derivative A is used in the creation of another derivative B, which is in turn used to generate C in a chain of A->B->C, C should only list B in `Sources`, and B should only list A in `Sources`. However, in case both X and Y are directly used in the creation of Z, then Z should list X and Y in `Sources`, regardless of whether X was used to generate Y. |
| RawSources | OPTIONAL. A list of paths relative to dataset root pointing to the BIDS-Raw file(s) that were used in the creation of this derivative. When the derivative filename does not define a `space` keyword, the first entry of `RawSources` MUST be defined and it will define the `scanner` coordinate system that applies. |
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"... the first entry of RawSources MUST be defined ..." I suggest changing this "defined" to "specified".

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Part of my confusion remains, due to the OPTIONAL which then later turns into MUST. Here is an attempt to rewrite:

REQUIRED or OPTIONAL: If the derived file does not contain space, it is required and the first file in the list defines the coordinate system; if the derived file does contain space, it is optional.

Another concern is that it is unclear to me how RawSources would be used with a derived EEG data file (which are also covered by this common specification), or a derived behav-only data file (also part of the BIDS specification). The raw source EEG data file that it would point to does not define a scanner coordinate system. I believe that this description can only be applied to derived data files (and corresponding source files) that define some spatial coordinate system. As such I do see how this can apply to MEG (e.g. following Maxfilter).

The validator should not flag derived EEG or behavioural datasets without RawSources and without space keyword as invalid.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Talking with @oesteban I think this use of RawSources to define a reference has been made obsolete by SpatialReference. I will remove this language, and adjust the language in the example.

### Examples

Preprocessed `bold` NIfTI file in the original coordinate space of the
original run. Please mind that in this case `RawSources` key is REQUIRED
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I propose to clarify this as "Please mind that in this case, since the space keyword is not part of the file name, the RawSources key is REQUIRED (and also..."

@effigies
Copy link
Collaborator Author

effigies commented Jun 1, 2020

@robertoostenveld It's rendered in both the build_docs_artifact and the docs/readthedocs.org:bids-specification checks.

Since you're conceptually okay with the reorg, I will address the easy things here, merge, and make comments on the main PR to bring up questions that need further discussion (if any).

@effigies effigies merged commit ded6875 into bids-standard:common-derivatives Jun 1, 2020
@effigies effigies deleted the rf/reorganize_derivatives branch June 1, 2020 18:12
Comment on lines +228 to +239
1. As a standalone dataset independent of the source (raw or derived) BIDS
dataset.
This way of specifying derivatives is particularly useful when the source
dataset is provided with read-only access, for publishing derivatives as
independent bodies of work, or for describing derivatives that were created
from more than one source dataset.
The `sourcedata/` subdirectory MAY be used to include the source dataset(s)
that were used to generate the derivatives.
Likewise, any code used to generate the derivatives from the source data
MAY be included in the `code/` subdirectory.

Example of a derivative dataset including the raw dataset as source:
Copy link
Collaborator

@Lestropie Lestropie Jun 2, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the differential definitions of "source" vs. "raw" datasets in "common principles", it might be better to avoid use of the word "source" in the context of "data from which the derivatives were generated". That or "source" in reference to pre-BIDS-raw data would need to be changed.

There's also some potential ambiguity here in terms of hierarchical directory structure. It's permissible for derivative data to be stored in a sub-directory of the raw dataset, but it's also possible for the raw data to be stored in a sub-directory of the derivative dataset, and none of the directory names are prescribed; so it's a bit of a free-for-all, with the only restriction being storage of raw and derived data at the same hierarchical level. This would mean that for automatic parsing, unambiguously determining what's raw and what's derived would require running the entire validator on each candidate directory. Also, if the raw dataset is provided read-only, writing a read-write version of it along with the derivative data could be seen to be counteracting the intent of that read-only provision. If the raw data are to be in some way distributed alongside the derivative data, doing so according to the suggestion provided in "common principles":

my_dataset/
  sourcedata/
    ...
  rawdata/
    dataset_description.json
    participants.tsv
    sub-01/
    sub-02/
    ...
  derivatives/
    ...

may be more faithful to such.

Actually looking at it again, this is in (edit: the old version of) "Common principles":

These data MUST be kept in separate sourcedata and derivatives folders

Permitting BIDS-raw data to be stored as a sub-directory within a BIDS-derivatives directory could arguably be seen as contradictory to this requirement, since the derivatives directory is no longer "separate" to the (duplicate of the) raw data but a superset thereof.

So personally I'd probably err on not permitting raw data as sub-directories of derived data.

(Sorry I missed the timeline on this; I'll try to get through it nevertheless and you can use or not use as you wish)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Potentially reinforcing this: consider #492. When reading data from the raw dataset, a parser will traverse up the filesystem hierarchy looking for JSON files that should be read in order. Unless the parser explicitly detects the root of the BIDS-raw directory as being such, and therefore ceasing traversal of the filesystem at that point, it will continue up and (potentially, depending on the design of the parser) include any JSON file at the root of the derivatives dataset in agglomerating sidecar information for a file in the raw dataset.

Case 2.
In both cases, every derivatives dataset is considered a BIDS dataset and must
include a `dataset_description.json` file at the root level (see
[Dataset description][dataset-description].
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
[Dataset description][dataset-description].
[Dataset description][dataset-description]).

@nicholst nicholst changed the title Reorganize derivatives [ENH] Reorganize derivatives Jun 13, 2020
@nicholst nicholst mentioned this pull request Jun 13, 2020
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants