Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Major update to pointing to files within, outside of, or remote to a current BIDS dataset #820

Closed
wants to merge 44 commits into from

Conversation

sappelhoff
Copy link
Member

@sappelhoff sappelhoff commented Jun 8, 2021

This PR implements a suggestion by @effigies to solve several long standing issues with how we refer to files within, outside of, and remote to a current BIDS dataset. (see: https://github.com/bids-standard/bids-specification/projects/6)

These fields are affected:

  • IntendedFor
  • SpatialReference
  • AssociatedEmptyRoom
  • B0FieldIdentifier
  • B0FieldSource
  • BasedOn
  • RawSources
  • Sources

Specifically, @effigies suggests (and I agree) to use a URI scheme like bids:<dataset-name>:/absolute/path/within/dataset to refer to data.

<dataset-name>s are declared as name-to-location mappings in a new DatasetLinks object in dataset_description.json.

The whole proposal is fleshed out in the changes of this PR in the "Common Principles" section.

I also tried to make all other adjustments (examples, metadata field descriptions, etc.) so that you can have an impression of what will be touched by these changes. I may have missed something, please help me find everything.

This is my interpretation of @effigies proposal, so please correct me where I went wrong, Chris. And all other reviewers: Please help us to make this a good revision to consistently refer to files in BIDS!

This is intended to close several issues:

closes #790
closes #471
closes #718
#757 (or at least, pave the way for it to be closed)

@sappelhoff sappelhoff added enhancement New feature or request opinions wanted Please read and offer your opinion on this matter consistency Spec is (potentially) inconsistent labels Jun 8, 2021
@sappelhoff sappelhoff requested a review from effigies June 8, 2021 12:01
@sappelhoff sappelhoff added this to the 1.6.1 milestone Jun 8, 2021
Copy link
Member

@VisLab VisLab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 388: Throughout the BIDS specification there are metadata...
not is metadata

Copy link
Member

@VisLab VisLab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Line 392 Suggest rewording of

"failed to consistently describe relative to what these pointers should be specified,
and the exact syntax to be used.

To:
"failed to consistently describe what these pointers should be relative to.

Line 394: suggest rewording of:

"Yet even with perfectly well and consistently defined "relative paths",

to:

"Yet even with consistently defined "relative paths",

Copy link
Member

@VisLab VisLab left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Having gone through the entire proposal, I find it well-thought out. However, when I started reading, I became confused on line 407:

bids:<dataset-name>:/absolute/path/within/dataset

because I didn't understand what <dataset-name> could be if I wanted to specify a file within the dataset. It became clear later that this was the "local" bids::/.

Could we put a brief sentence here explaining that for local paths <dataset-name> will be empty and other cases will be explained in the examples?

Copy link
Member Author

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robertoostenveld I believe I have addressed all your open points (thanks for the review) - could you please verify and either raise new points or approve of the changes?

If you can't do it now, an estimated date would also be helpful :-)

Copy link
Collaborator

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here are some fiddly comments. I will write up an additional post with my more general perspective.

and the BIDS URIs for the raw dataset (excluding `/derivatives`) MUST be resolved with respect to `/dataset_description.json`.

However if a `derivatives/` *folder* is nested under a raw dataset and only the raw dataset has a `dataset_description.json` file,
that `derivatives/` *folder* is not considered a BIDS dataset (see [Storage of Derivated Datasets](#storage-of-derived-datasets)),
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
that `derivatives/` *folder* is not considered a BIDS dataset (see [Storage of Derivated Datasets](#storage-of-derived-datasets)),
that `derivatives/` *folder* is not considered a BIDS dataset (see [Storage of Derived Datasets](#storage-of-derived-datasets)),

It may just as well be a (sub-)directory containing one or more arbitrary files.

In the case where a derivatives dataset is nested under a raw dataset and both have a `dataset_description.json` file,
the BIDS URIs within the nested derivatives dataset MUST be resolved with respect to `/derivatives/dataset_description.json`,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Generally we don't treat <bids>/derivatives/ itself as a derivative dataset but a container for derivative datasets.

Suggested change
the BIDS URIs within the nested derivatives dataset MUST be resolved with respect to `/derivatives/dataset_description.json`,
the BIDS URIs within the nested derivatives dataset MUST be resolved with respect to `/derivatives/<derivative>/dataset_description.json`,

Comment on lines +434 to +437
However if a `derivatives/` *folder* is nested under a raw dataset and only the raw dataset has a `dataset_description.json` file,
that `derivatives/` *folder* is not considered a BIDS dataset (see [Storage of Derivated Datasets](#storage-of-derived-datasets)),
and all BIDS URIs MUST be resolved with respect to the `/dataset_description.json` file.
The latter situation is then comparable to referring to data in a `sourcedata/` or `code/` folder.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The contents of sourcedata/, code/ and unspecified derivatives are not specified, so it does not make sense to me to provide an interpretation for BIDS URIs found within them. We can't stop people from using them, and a tool from doing its best to find an appropriate BIDS root to resolve relative to, but this feels out-of-scope.

However sometimes it may be convenient to refer to files that are outside of a given dataset but on the same host.
BIDS URIs allow for specifying such locations,
but such specifications are by definition not portable in that the BIDS URIs break when the host changes.
When sharing a BIDS dataset, the dataset curator MUST make sure that all BIDS URIs are portable.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What does this mean for the validator? Should the validator reject relative paths, have a relative-path rejecting mode that something like OpenNeuro can use, or something else?

@@ -34,7 +34,7 @@ sub-01/func/sub-01_task-rest_desc-preproc_bold.json

```JSON
{
"RawSources": ["sub-01/func/sub-01_task-rest_bold.nii.gz"]
"RawSources": ["bids::/sub-01/func/sub-01_task-rest_bold.nii.gz"]
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

RawSources are in a source dataset. We probably want something like:

Suggested change
"RawSources": ["bids::/sub-01/func/sub-01_task-rest_bold.nii.gz"]
"RawSources": ["bids:src:/sub-01/func/sub-01_task-rest_bold.nii.gz"]

Using forward-slash separated paths is
[DEPRECATED](/02-common-principles.html#definitions).
The value of this field MUST be shared only by the images meant to be used
as inputs for the estimation of a
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a unique string, not a URI. It's just an identifier so that you can do things like:

fieldmap = layout.get(B0FieldIdentifier="myidentifier", extension=".nii.gz")

MUST be a
[BIDS URI][/02-common-principles.html#bids-uri-pointing-to-files-within-and-outside-of-bids-datasets],
using forward-slash separated paths is
[DEPRECATED](/02-common-principles.html#definitions).
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, identifier, not a URI.

applies_to = layout.get(B0FieldSource="myidentifier", extension=".nii.gz")

@effigies
Copy link
Collaborator

Apologies for not getting back to this sooner, as I see significant discussion has happened. I would like to push back on a particular concept here, but I recognize that this may have been settled to others' satisfaction. If that's the case, please use a thumbs-down (:-1:) on my post to indicate that you don't want to reconsider this. Please use a thumbs-up (:+1:) to indicate that you agree with my proposal, and a confused (😕) to indicate that you would like to re-discuss. I will leave it to @sappelhoff to decide whether this can be summarily dismissed, accepted, or re-litigated.


In the context of BIDS URIs, a dataset with a given <dataset-name> does not have to be a BIDS dataset, or a neuroimaging dataset. It may just as well be a (sub-)directory containing one or more arbitrary files.

I would prefer bids:<dataset-name> to always refer to a BIDS dataset, but the path within the dataset could be in a non-standard location. For example, supposing I have non-standard derivatives, I could say bids::/derivatives/special_deriv/path/to/file, rather than bids:special_deriv:/path/to/file.

In practice, I don't see an obvious way to enforce this distinction for subdirectories short of requiring that the target be validated, but I think we should not encourage this.

We present a few use cases, where I would recommend alternative practices:

Refer to a file outside of a dataset but on the same host

[...]
To link for example to an MRI file that is stored in a
local directory that is NOT nested in the raw BIDS directory:

bids:canonical:/single_subj_T1.nii

{
    "DatasetLinks": {
        "canonical": "file://../../common/matlab/spm12/canonical"
    }
}

I would instead encourage users to use a regular URI such as https://github.com/spm/spm12/raw/r7771/canonical/single_subj_T1.nii. Alternately, to ensure access, one could copy matlab/spm12/canonical/single_subj_T1.nii into sourcedata/ and refer to it as bids::/sourcedata/spm12/canonical/single_subj_T1.nii

Refer to a remote file

[...]
bids:my-location:/Conte69_Atlas/Conte69.R.midthickness.32k_fs_LR.surf.gii

{
    "DatasetLinks": {
        "my-location": "https://github.com/mgxd/brainplot/raw/master/brainplot"
    }
}

I would treat this the same, actually. I would either use a remote URI (https://github.com/mgxd/brainplot/raw/master/brainplot/Conte69_Atlas/Conte69.R.midthickness.32k_fs_LR.surf.gii) or encourage people to copy it into sourcedata/.


To put a fine point on it, I see BIDS URIs as solving two problems:

  1. We have a variety of inconsistent ways of referencing files within and between BIDS datasets.
  2. Related datasets (typically a source dataset <-> derivatives relation) might be organized in multiple ways, as described in Source vs. raw vs. derived data.

Dataset links seemed like an elegant solution as it allowed us to unify both within and between, while simultaneously making wholesale reorganization tractable via editing a single dictionary at the root of the dataset.

Using a URI schema allows us to further unify with external resources because now we can say "Any reference to a file is a URI", which is easily validated. The indirection is less clearly needed for unrelated datasets or external resources, and so I would rather not introduce it. Additionally, to me bids: says "Here is a file in a BIDS dataset", not "This is a reference coming from a BIDS dataset". So if we are going to support non-BIDS datasets with this mechanism, I would recommend a different schema than bids:.

As a final, minor point, I also worry that diluting the meaning of bids:<dataset> to potentially mean non-BIDS datasets will complicate adding in authorities that could respond by returning specific files, such as bids://openneuro.org/ds000001:/sub-01/anat/sub-01_T1w.nii.gz. I don't know that it would actually be a problem, however.

Copy link
Member Author

@sappelhoff sappelhoff left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks a lot for the thorough look, Chris. I'll deal with this once I am back from vacation. All others who want to help: Please read the post and "vote" as requested :-)

@satra
Copy link
Collaborator

satra commented Aug 23, 2021

I would prefer bids:<dataset-name> to always refer to a BIDS dataset,

i think the proposal really comes down to this statement from @effigies above. but, the two examples right after point to non-bids-dataset references (spm, conte69). i concur with the above sentiment that bids: only allows for a BIDS dataset. perhaps one should not overload this to point to other things, and simply use persistent identifiers for those things to prevent link rot. yes, individuals can always copy those resources into a bids dataset and use the canonical form, but then i would want the provenance (checksum, source repo/location, etc.,.) to be tracked.

i thought the intent of this PR was to refer to files in bids datasets rather than general resources, which then gets into provenance, parameters of operations, etc.,. if the latter is to be considered, how would this relate to space related links and potential extensions being proposed in bep 14 and the work in templateflow? perhaps a way to consider is to focus on bids datasets to start with and see which use cases are being left out?

@VisLab
Copy link
Member

VisLab commented Aug 24, 2021

Having read through all of the comments and thinking about it, I basically agree with comments of @effigies. Is it correct that this proposal is:

  1. bids:<dataset-name> always refers to a BIDS dataset.
  2. Anything that doesn't start with bids: has to be directly downloadable from a standard URI.

I do have a question though..... what is a BIDS dataset? I realized that I am not completely clear on what the minimum requirements are for a directory to be considered "BIDS". Is there a strictly enforced rule about what sub-directories must/are allowed to be under the top level directory? What is the minimum requirement for files in the root?

@effigies
Copy link
Collaborator

To commenters/reviewers on this PR, I have opened #918 as an alternative that addresses my concerns. In the significant scale-back of scope, it's possible that I took out too much clarifying text that was added previously. I would appreciate your review over there.

Pinging especially @nicholst @robertoostenveld @satra @VisLab, who previously reviewed.

@nicholst
Copy link
Collaborator

nicholst commented Nov 9, 2021

This a quite extensive PR... is it possible to get rendered? (Sorry if I should know the URL magic!)

@effigies
Copy link
Collaborator

effigies commented Nov 9, 2021

@nicholst I would look at #918, which is intended to replace this one and is much more compact. It's rendered here: https://bids-specification--918.org.readthedocs.build/en/918/

For reference, this one is rendered here: https://bids-specification--820.org.readthedocs.build/en/820/

You can find these links as the readthedocs.org check (bottom line of this screenshot):

all_checks_passed

@nicholst
Copy link
Collaborator

Thanks @effigies! For the directions to the rendered version, and to #918. Indeed, the only issues I caught on this PR (text that desribes the evolution of treatment of URIs, which doesn't fit in an 'archival' version of the standard) have been addressed in #918.

@sappelhoff
Copy link
Member Author

closing this, as #918 has completely taken over and all discussion seems resolved.

@sappelhoff sappelhoff closed this Feb 1, 2022
@sappelhoff sappelhoff deleted the enh/point_to_data branch February 1, 2022 18:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consistency Spec is (potentially) inconsistent enhancement New feature or request opinions wanted Please read and offer your opinion on this matter schema Issues related to the YAML schema representation of the specification. Patch version release.
Projects
None yet
8 participants