-
Notifications
You must be signed in to change notification settings - Fork 159
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[ENH] Reorganize derivatives #488
[ENH] Reorganize derivatives #488
Conversation
|
||
Example of a derivative dataset including the raw dataset as source: | ||
|
||
```Plain |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This example looks good to me. I am sure generations of new users will appreciate. :-)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
AFAICT LGTM
@Lestropie @robertoostenveld I would appreciate your review before merging this. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall I think that it is a good improvement! I have reviewed the changes to individual files, but don’t have the overview yet. For that it will help to merge this and render as HTML on RTD.
``` | ||
|
||
If a derived dataset is stored as a subfolder of the raw dataset, then the `Name` field | ||
of the first `GeneratedBy` object MUST be a substring of the derived dataset folder name. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
in the 02-common-principles you use "fmriprep-v1.4.1" as the directory name, whereas here in the JSON the name is capitalized as "FMRIPREP". I would expect the requirement for a substring to be that the case matches.
|
||
## Common file level metadata fields | ||
|
||
Each derivative file SHOULD be described by a JSON file provided as a sidecar or |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"Each derivative file SHOULD be described by a JSON file"; this creates recursion since JSON files should then also be described by a JSON file. I propose to write "Each derivative data file ..."
| **Key name** | **Description** | | ||
| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| Description | RECOMMENDED. Free-form natural language description of the nature of the file. | | ||
| Sources | OPTIONAL. A list of paths relative to dataset root pointing to the file(s) that were directly used in the creation of this derivative. For example, if a derivative A is used in the creation of another derivative B, which is in turn used to generate C in a chain of A->B->C, C should only list B in `Sources`, and B should only list A in `Sources`. However, in case both X and Y are directly used in the creation of Z, then Z should list X and Y in `Sources`, regardless of whether X was used to generate Y. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A list of paths ... pointing to the file(s). It is unclear to me what is meant by "pointing". Do the paths include the file name? Then it would not be pointing but rather something like this:
"A list of files with the path specified relative to the dataset root; these files were directly used in the creation of this derivative data file."
If pointing is meant to as "giving a direction, but not putting your finger directly on it" then I would read this as
"A list of directories specified relative to the dataset root; these directories contain the files that were directly used in the creation of this derivative data file."
| ------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | | ||
| Description | RECOMMENDED. Free-form natural language description of the nature of the file. | | ||
| Sources | OPTIONAL. A list of paths relative to dataset root pointing to the file(s) that were directly used in the creation of this derivative. For example, if a derivative A is used in the creation of another derivative B, which is in turn used to generate C in a chain of A->B->C, C should only list B in `Sources`, and B should only list A in `Sources`. However, in case both X and Y are directly used in the creation of Z, then Z should list X and Y in `Sources`, regardless of whether X was used to generate Y. | | ||
| RawSources | OPTIONAL. A list of paths relative to dataset root pointing to the BIDS-Raw file(s) that were used in the creation of this derivative. When the derivative filename does not define a `space` keyword, the first entry of `RawSources` MUST be defined and it will define the `scanner` coordinate system that applies. | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"... the first entry of RawSources
MUST be defined ..." I suggest changing this "defined" to "specified".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Part of my confusion remains, due to the OPTIONAL
which then later turns into MUST
. Here is an attempt to rewrite:
REQUIRED or OPTIONAL: If the derived file does not contain
space
, it is required and the first file in the list defines the coordinate system; if the derived file does containspace
, it is optional.
Another concern is that it is unclear to me how RawSources would be used with a derived EEG data file (which are also covered by this common specification), or a derived behav-only data file (also part of the BIDS specification). The raw source EEG data file that it would point to does not define a scanner
coordinate system. I believe that this description can only be applied to derived data files (and corresponding source files) that define some spatial coordinate system. As such I do see how this can apply to MEG (e.g. following Maxfilter).
The validator should not flag derived EEG or behavioural datasets without RawSources
and without space
keyword as invalid.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking with @oesteban I think this use of RawSources
to define a reference has been made obsolete by SpatialReference
. I will remove this language, and adjust the language in the example.
### Examples | ||
|
||
Preprocessed `bold` NIfTI file in the original coordinate space of the | ||
original run. Please mind that in this case `RawSources` key is REQUIRED |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I propose to clarify this as "Please mind that in this case, since the space
keyword is not part of the file name, the RawSources
key is REQUIRED (and also..."
@robertoostenveld It's rendered in both the Since you're conceptually okay with the reorg, I will address the easy things here, merge, and make comments on the main PR to bring up questions that need further discussion (if any). |
1. As a standalone dataset independent of the source (raw or derived) BIDS | ||
dataset. | ||
This way of specifying derivatives is particularly useful when the source | ||
dataset is provided with read-only access, for publishing derivatives as | ||
independent bodies of work, or for describing derivatives that were created | ||
from more than one source dataset. | ||
The `sourcedata/` subdirectory MAY be used to include the source dataset(s) | ||
that were used to generate the derivatives. | ||
Likewise, any code used to generate the derivatives from the source data | ||
MAY be included in the `code/` subdirectory. | ||
|
||
Example of a derivative dataset including the raw dataset as source: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the differential definitions of "source" vs. "raw" datasets in "common principles", it might be better to avoid use of the word "source" in the context of "data from which the derivatives were generated". That or "source" in reference to pre-BIDS-raw data would need to be changed.
There's also some potential ambiguity here in terms of hierarchical directory structure. It's permissible for derivative data to be stored in a sub-directory of the raw dataset, but it's also possible for the raw data to be stored in a sub-directory of the derivative dataset, and none of the directory names are prescribed; so it's a bit of a free-for-all, with the only restriction being storage of raw and derived data at the same hierarchical level. This would mean that for automatic parsing, unambiguously determining what's raw and what's derived would require running the entire validator on each candidate directory. Also, if the raw dataset is provided read-only, writing a read-write version of it along with the derivative data could be seen to be counteracting the intent of that read-only provision. If the raw data are to be in some way distributed alongside the derivative data, doing so according to the suggestion provided in "common principles":
my_dataset/
sourcedata/
...
rawdata/
dataset_description.json
participants.tsv
sub-01/
sub-02/
...
derivatives/
...
may be more faithful to such.
Actually looking at it again, this is in (edit: the old version of) "Common principles":
These data MUST be kept in separate sourcedata and derivatives folders
Permitting BIDS-raw data to be stored as a sub-directory within a BIDS-derivatives directory could arguably be seen as contradictory to this requirement, since the derivatives directory is no longer "separate" to the (duplicate of the) raw data but a superset thereof.
So personally I'd probably err on not permitting raw data as sub-directories of derived data.
(Sorry I missed the timeline on this; I'll try to get through it nevertheless and you can use or not use as you wish)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Potentially reinforcing this: consider #492. When reading data from the raw dataset, a parser will traverse up the filesystem hierarchy looking for JSON files that should be read in order. Unless the parser explicitly detects the root of the BIDS-raw directory as being such, and therefore ceasing traversal of the filesystem at that point, it will continue up and (potentially, depending on the design of the parser) include any JSON file at the root of the derivatives dataset in agglomerating sidecar information for a file in the raw dataset.
Case 2. | ||
In both cases, every derivatives dataset is considered a BIDS dataset and must | ||
include a `dataset_description.json` file at the root level (see | ||
[Dataset description][dataset-description]. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
[Dataset description][dataset-description]. | |
[Dataset description][dataset-description]). |
Following comments from @robertoostenveld (#265 (comment)), and a short thread involving @Lestropie and @Remi-Gau (#265 (comment)), I have tried to reorganize the first two sections of derivatives into a more logical grouping.
I have:
code/
andsourcedata/
to encapsulate its own provenance, as a partial fulfillment of @Remi-Gau's request in [ENH] BEP 003: Common Derivatives #265 (comment).dataset_description.json
. This was motivated partially because it is too detail-oriented for the derivatives introduction, but not really a "common data type". Given the move of (1) into the main spec, this felt less out-of-place.space
entity. Giving it some extra space should reduce the confusion and recognizes that it's not quite comparable to the other fields. The rewritten text is intended to address @robertoostenveld's concerns in [ENH] BEP 003: Common Derivatives #265 (comment).Let me know what y'all think. I really appreciate you taking some weekend time to look over this.