Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specification of relative or absolute path in IntendedFor and SpatialReference is not consistent #471

Closed
robertoostenveld opened this issue May 14, 2020 · 16 comments · Fixed by #918
Labels
consistency Spec is (potentially) inconsistent

Comments

@robertoostenveld
Copy link
Collaborator

While reading the common derivatives draft as the rendered version of #265, I noticed that for SpatialReference it uses an absolute path:

    "SpatialReference": "/sub-01/anat/sub-01_desc-combined_T1w.nii.gz"

whereas in the existing MRI examples there is

   "IntendedFor": "func/sub-01_task-motor_bold.nii.gz"

and in the EEG examples there is again

  "IntendedFor":"/sub-01/ses-01/anat/sub-01_T1w.nii",

This made me realize that the specification is not clear w.r.t. references to other files. E.g. in the SpatialReference case (which is in the derivative folder): would that point to a file in the derivative itself, or could it also point to a file at a higher level (e.g. the raw level)? In this case I know it points within the derivative dataset since it contains the desc keyword (which is not defined for raw). But had it been /sub-01/anat/sub-01_T1w.nii.gz, then I would have said that it must have been pointing to a file in the raw dataset. The specification of a relative path, and the specification of what specifically it is relative to, might avoid confusion.

@robertoostenveld
Copy link
Collaborator Author

Reading a bit further, I see another SpatialReference that points towards a .gii file in the derivatives dataset, but also a RawSources that (with precisely the same path structure) points to a file in the higher level raw dataset.

@effigies
Copy link
Collaborator

Thanks for bringing this up, @robertoostenveld. As another data point, stim_file in Task events is a relative path (not starting with /), but relative to the /stimuli directory.

I can see why RawSources prefers to its paths to relative to the raw dataset instead of the current dataset, because the datasets may not have constant locations relative to one another. If my dataset_description.json says:

{
    ...
    "SourceDatasets": [
        {
            "DOI": "10.18112/openneuro.ds000114.v1.0.1"
        }
    }
}

Then I have enough information to find the files. Any more specific path I might give will be very fragile if I organize my derivatives anywhere besides in the derivatives/ subdirectory of the sources.

I would suggest that we try to normalize on "path, not beginning with /, relative to the root of the pertinent dataset". In the case of SpatialReference and Sources, relative paths are resolved to the root of the current dataset. In the case of RawSources, relative paths are resolved to the root of the source dataset.

@yarikoptic
Copy link
Collaborator

yarikoptic commented Sep 28, 2020

uff, just ran into it -- it is indeed quite messy ATM in how paths are "prescribed" and there is no consistency across or even within (IntendedFor of mri vs eeg/meg) fields! I think we should look into harmonizing it across fields etc.

I would suggest that we try to normalize on "path, not beginning with /, relative to the root of the pertinent dataset".

would conflict with current use of IntendedFor -- should not start with / and is relative to the subject directory.

What about, if starts with

  • / - top of the dataset (already used in the IntendedFor of eeg)
  • ./ - directory of the sidecar file with that field
  • otherwise - subject directory

and follows POSIX convention (/ -- separator, .. -- up -- could come handy to point to the files of different datatype within the same session, without explicitly stating the session)?

FWIW, as of now (v1.4.0-373-g2f0f61b)

05-derivatives/02-common-data-types.md:    "SpatialReference": "sub-01/anat/sub-01_desc-combined_T1w.nii.gz"

so it doesn't follow any of the conventions ;-)

PS I wonder if we should come up with a new label such as inconsistency or harmonization to allocate to such issues?

@sappelhoff
Copy link
Member

PS I wonder if we should come up with a new label such as inconsistency or harmonization to allocate to such issues?

👍

@effigies effigies added the consistency Spec is (potentially) inconsistent label Sep 29, 2020
@sappelhoff
Copy link
Member

What about, if starts with
/ - top of the dataset (already used in the IntendedFor of eeg)
./ - directory of the sidecar file with that field
otherwise - subject directory

I don't get what you mean by

otherwise - subject directory

other than that, sounds good.

But isn't Robert's original point also about referring (for example) from a derivatives dataset to a source (raw) dataset - even when that source dataset is not even on disk? Say I have an isolated derivatives-only dataset in which I place some annotations.tsv that is IntendedFor a raw file in the source dataset on OpenNeuro.

What to use for that? Something like SourceDatasets/<index>... would be intuitive IMHO, with index being the index into the array of defined SourceDatasets from the derivatives dataset_description.json

@robertoostenveld
Copy link
Collaborator Author

robertoostenveld commented Jan 25, 2021

One problem is referring within one dataset, where I would for now say that the raw bids dataset is one, and a derived dataset (that might be placed inside the raw dataset) would be another.

Another challenge is how to refer to files in other datasets, where it gets confusing as to what defines "another" dataset is. For example in

├── rawbids
│   ├── sourcedata
│   └── derivative1
├── derivative2
└── derivative3

I am not sure whether people would consider derivative1 a different dataset from rawbids, but derivative2 and 3 (*) would certainly be different ones. And once they are different datasets, you can move them separately (on the same filesystem, or between servers) and put them together in different ways (or using datalad with subtrees or subvolumes).

*) this is where I would put them, so that I can protect the rawbids directory by setting it to read-only. This also helps with my/our archiving and sharing policy. I would personally also put sourcedata next to rawbids, and not inside it.

@effigies
Copy link
Collaborator

effigies commented Apr 7, 2021

Somewhere I have a half-written comment that I intended to post months ago, but I think I've hit the point where it seems more efficient to start over than to find that tab.


There are three cases where we want to refer to other files:

  1. Files within the current dataset
  2. Files in another dataset
  3. Files addressable by some URI

URIs are trivially permissible, and take the form scheme:[//authority]path[?query][#fragment]. We could unify these use cases by defining a BIDS-dataset URI in the following form:

[bids:<ds-name>:]/absolute/path/within/dataset

I would require all paths to be relative to the root of some dataset, except when backwards compatibility requires us to permit that use case.

Elaboration of proposed URI follows:

Within-dataset references

To refer to a file within the dataset, I would have two options:

  • /sub-01/fmap/sub-01_dir-PA_epi.nii.gz
  • bids:local:/sub-01/fmap/sub-01_dir-PA_epi.nii.gz

This would match @yarikoptic's notion that starting with a / refers to dataset root. It probably makes sense to declare a reserved name like local to refer to the current dataset.

Between-dataset references

  • bids:deriv1:/sub-01/anat/sub-01_desc-preproc_T1w.nii.gz

Now I need some way to find deriv1. I would suggest a field in dataset_description.json:

{
  ...
  "DatasetLinks": {
    "deriv1": "file://derivatives/derivative1", # Within dataset
    "deriv2": "file://../derivative2", # Local to a collection of datasets intended to be bundled together
    "deriv3": "file:///path/to/derivative3", # Absolute on host
    "deriv3": "https://mydatasethost.com/look/at/this/dataset.git" # Remote
  }
}

Now there's no guarantee that a link persists; if I'm a piece of software creating a derivative of a dataset that doesn't declare a DOI that I can reference, the best I can do is refer to it by the absolute path on the computer I'm running on. But a dataset curator can update dataset_description.json with improved links and it only has to be done in one place.


Does this feel like something like the right direction?

@sappelhoff
Copy link
Member

Does this feel like something like the right direction?

absolutely, thanks a lot for taking the time and writing this up!

Why would you reserve local? Just for convenience? Why not use the BIDS dataset's name from dataset_description.json?

It probably makes sense to declare a reserved name like local to refer to the current data

+1

Now I need some way to find deriv1. I would suggest a field in dataset_description.json

+1

cc @adam2392

@robertoostenveld
Copy link
Collaborator Author

I like the direction, but think that this might confuses the "scheme" with the "authority". BIDS is just a format description, not an actual implementation of a storage system. Hence the ds-name in

[bids:<ds-name>:]/absolute/path/within/dataset

is not unique, since BIDS does not specify a unique namespace. E.g., we could both have a dataset with the same identifier ("study1") on our respective local computers.

If it would be openneuro:, then it could be made unique by the accession number. But whether openneuro should be the scheme or the authority, I am not sure.

I think that

doi:10.34973/37n0-yc51/sub-A2002/anat/sub-A2002_T1w.nii

would be uniquely and persistently resolvable, as would

https://data.donders.ru.nl/collections/di/dccn/DSC_3011020.09_236/sub-A2002/anat/sub-A2002_T1w.nii

where the doi has been replaced by the corresponding (but not persistent) http web address. (Note that you cannot download from there directly, so not an ideal example).

Idem, we could conceive something like

openneuro:ds003548/sub-01/anat/sub-01_T1w.nii.gz

that would resolve to

https://github.com/OpenNeuroDatasets/ds003548/sub-01/anat/sub-01_T1w.nii.gz

or some corresponding datalad or S3 address specification.

@yarikoptic
Copy link
Collaborator

Minor note: / is allowed in doi suffix, so in doi:10.34973/37n0-yc51/sub-A2002/anat/sub-A2002_T1w.nii you can't really tell where doi stops and path begins.

@effigies
Copy link
Collaborator

effigies commented Apr 13, 2021

What I'm proposing is:

  • scheme = bids
  • path = [<ds-name>:]<ds-relative-path>

If we want to add an authority, then it would look something like bids://openneuro.org/ds000001:/sub-01/anat/sub-01_T1w.nii.gz, which would break down to {"scheme": "bids", "authority": "openneuro.org", "path": "/ds000001:/sub-01/anat/sub-01_T1w.nii.gz"}, and path would further break down into {"ds-name": "/ds000001", "ds-relative-path": "/sub-01/anat/sub-01_T1w.nii.gz"}

I'm not sure I would support adding authorities, though, because links between datasets are often quite dense, i.e., if I'm referencing an external dataset in one place, I'm probably referencing it in another. The canonical example in my mind is a derivative where each derivative data file may point back to one or more source files. If for some reason I need to change the dataset location, the operation would be large and potentially error-prone. Having a small registry that is updated seems to fit the use case better, IMO.

BIDS does not specify a unique namespace. E.g., we could both have a dataset with the same identifier ("study1") on our respective local computers.

The namespaces would be defined within the dataset via the DatasetLinks object (or whatever we call it).

I think that

doi:10.34973/37n0-yc51/sub-A2002/anat/sub-A2002_T1w.nii

would be uniquely and persistently resolvable

I do not believe DOIs allow extending the path as you describe, at least in general. DOI is a doi:<path> -> URI registry.

@yarikoptic
Copy link
Collaborator

doi:10.34973/37n0-yc51/sub-A2002/anat/sub-A2002_T1w.nii

would be uniquely and persistently resolvable
I do not believe DOIs allow extending the path as you describe, at least in general. DOI is a doi:<path> -> URI registry.

could be made into ad-hoc doi:<DOI>//<PATH>, i.e. using // as a separator but even then no guarantee that some doi would not use (I think it is not disallowed) //.
The only IMHO robust way would be to (consistently) decouple reference to the dataset and path within dataset, i.e. have something like

{
  "dataset": "doi:10.34973/37n0-yc51",
  "path": "sub-A2002/anat/sub-A2002_T1w.nii"
}

@robertoostenveld
Copy link
Collaborator Author

DOIs indeed do not specify how one would extend the path if a DOI were to refer to a filesystem. Also, DOIs don't have to refer to a filesystem but could refer to a PDF and then you might want to extent it with page numbers, or line numbers instead. Assuming the DOI points to a filesystem, we still have the issue that @yarikoptic mentions, which is that in my example the / could be part of the authority, but also of the path.

DNS hostnames also do not allow the specification of a an identifier on the server. Nevertheless, https://github.com/bids-standard/bids-specification/blob/master/README.md still works, because it breaks down into {scheme: <protocol>, authority: <DNS-resolvable IP address>, path: <location on the server>}.

So my observation is more that URIs in general consist of a triplet. Your proposal consists of a pair, like file://<path>. But in the path you are then using <ds-name> which I think requires some authority to resolve, like DNS does for hostnames.

@effigies
Copy link
Collaborator

@yarikoptic

The only IMHO robust way would be to (consistently) decouple reference to the dataset and path within dataset, i.e. have something like

{
  "dataset": "doi:10.34973/37n0-yc51",
  "path": "sub-A2002/anat/sub-A2002_T1w.nii"
}

I agree that decoupling is the best approach, and I would implement what you have as bids:mydataset:/sub-A2002/anat/sub-A2002_T1w.nii, along with a link in dataset_description.json that defines where mydataset can be found:

"DatasetLinks": {
  ...
  "mydataset": "doi:10.34973/37n0-yc51",
  ...
}

@robertoostenveld

So my observation is more that URIs in general consist of a triplet. Your proposal consists of a pair, like file://<path>. But in the path you are then using <ds-name> which I think requires some authority to resolve, like DNS does for hostnames.

The authority is dataset_description.json local to the dataset that contains the reference.


Apologies if the above points were already clear. I may not be understanding your objections.

@sappelhoff
Copy link
Member

More and more issues are raised connected to this problem, and I expect this to become worse in the future - so the earlier we solve it (properly), the better.

I think Chris' proposal is in a good state to:

  1. make a PR
  2. send out the PR to the bids mailing list, and potentially neurostars to gather feedback
  3. iterate and converge on a solution

@effigies do you have time to prepare an initial PR in the next days? Alternatively, I can give it a go and you can review whether I accurately represented your proposal. Let me know.

@effigies
Copy link
Collaborator

@sappelhoff Please feel free to take a pass at it. My attention is pretty divided at the moment.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
consistency Spec is (potentially) inconsistent
Projects
None yet
4 participants