Specification of relative or absolute path in IntendedFor and SpatialReference is not consistent #471

robertoostenveld · 2020-05-14T19:23:33Z

While reading the common derivatives draft as the rendered version of #265, I noticed that for SpatialReference it uses an absolute path:

    "SpatialReference": "/sub-01/anat/sub-01_desc-combined_T1w.nii.gz"

whereas in the existing MRI examples there is

   "IntendedFor": "func/sub-01_task-motor_bold.nii.gz"

and in the EEG examples there is again

  "IntendedFor":"/sub-01/ses-01/anat/sub-01_T1w.nii",

This made me realize that the specification is not clear w.r.t. references to other files. E.g. in the SpatialReference case (which is in the derivative folder): would that point to a file in the derivative itself, or could it also point to a file at a higher level (e.g. the raw level)? In this case I know it points within the derivative dataset since it contains the desc keyword (which is not defined for raw). But had it been /sub-01/anat/sub-01_T1w.nii.gz, then I would have said that it must have been pointing to a file in the raw dataset. The specification of a relative path, and the specification of what specifically it is relative to, might avoid confusion.

The text was updated successfully, but these errors were encountered:

robertoostenveld · 2020-05-14T19:25:49Z

Reading a bit further, I see another SpatialReference that points towards a .gii file in the derivatives dataset, but also a RawSources that (with precisely the same path structure) points to a file in the higher level raw dataset.

effigies · 2020-05-20T20:59:47Z

Thanks for bringing this up, @robertoostenveld. As another data point, stim_file in Task events is a relative path (not starting with /), but relative to the /stimuli directory.

I can see why RawSources prefers to its paths to relative to the raw dataset instead of the current dataset, because the datasets may not have constant locations relative to one another. If my dataset_description.json says:

{
    ...
    "SourceDatasets": [
        {
            "DOI": "10.18112/openneuro.ds000114.v1.0.1"
        }
    }
}

Then I have enough information to find the files. Any more specific path I might give will be very fragile if I organize my derivatives anywhere besides in the derivatives/ subdirectory of the sources.

I would suggest that we try to normalize on "path, not beginning with /, relative to the root of the pertinent dataset". In the case of SpatialReference and Sources, relative paths are resolved to the root of the current dataset. In the case of RawSources, relative paths are resolved to the root of the source dataset.

yarikoptic · 2020-09-28T20:18:32Z

uff, just ran into it -- it is indeed quite messy ATM in how paths are "prescribed" and there is no consistency across or even within (IntendedFor of mri vs eeg/meg) fields! I think we should look into harmonizing it across fields etc.

I would suggest that we try to normalize on "path, not beginning with /, relative to the root of the pertinent dataset".

would conflict with current use of IntendedFor -- should not start with / and is relative to the subject directory.

What about, if starts with

/ - top of the dataset (already used in the IntendedFor of eeg)
./ - directory of the sidecar file with that field
otherwise - subject directory

and follows POSIX convention (/ -- separator, .. -- up -- could come handy to point to the files of different datatype within the same session, without explicitly stating the session)?

FWIW, as of now (v1.4.0-373-g2f0f61b)

05-derivatives/02-common-data-types.md:    "SpatialReference": "sub-01/anat/sub-01_desc-combined_T1w.nii.gz"

so it doesn't follow any of the conventions ;-)

PS I wonder if we should come up with a new label such as inconsistency or harmonization to allocate to such issues?

sappelhoff · 2020-09-29T07:27:30Z

PS I wonder if we should come up with a new label such as inconsistency or harmonization to allocate to such issues?

👍

sappelhoff · 2021-01-25T14:23:23Z

What about, if starts with
/ - top of the dataset (already used in the IntendedFor of eeg)
./ - directory of the sidecar file with that field
otherwise - subject directory

I don't get what you mean by

otherwise - subject directory

other than that, sounds good.

But isn't Robert's original point also about referring (for example) from a derivatives dataset to a source (raw) dataset - even when that source dataset is not even on disk? Say I have an isolated derivatives-only dataset in which I place some annotations.tsv that is IntendedFor a raw file in the source dataset on OpenNeuro.

What to use for that? Something like SourceDatasets/<index>... would be intuitive IMHO, with index being the index into the array of defined SourceDatasets from the derivatives dataset_description.json

robertoostenveld · 2021-01-25T15:09:08Z

One problem is referring within one dataset, where I would for now say that the raw bids dataset is one, and a derived dataset (that might be placed inside the raw dataset) would be another.

Another challenge is how to refer to files in other datasets, where it gets confusing as to what defines "another" dataset is. For example in

├── rawbids
│   ├── sourcedata
│   └── derivative1
├── derivative2
└── derivative3

I am not sure whether people would consider derivative1 a different dataset from rawbids, but derivative2 and 3 (*) would certainly be different ones. And once they are different datasets, you can move them separately (on the same filesystem, or between servers) and put them together in different ways (or using datalad with subtrees or subvolumes).

*) this is where I would put them, so that I can protect the rawbids directory by setting it to read-only. This also helps with my/our archiving and sharing policy. I would personally also put sourcedata next to rawbids, and not inside it.

effigies · 2021-04-07T20:36:55Z

Somewhere I have a half-written comment that I intended to post months ago, but I think I've hit the point where it seems more efficient to start over than to find that tab.

There are three cases where we want to refer to other files:

Files within the current dataset
Files in another dataset
Files addressable by some URI

URIs are trivially permissible, and take the form scheme:[//authority]path[?query][#fragment]. We could unify these use cases by defining a BIDS-dataset URI in the following form:

[bids:<ds-name>:]/absolute/path/within/dataset

I would require all paths to be relative to the root of some dataset, except when backwards compatibility requires us to permit that use case.

Elaboration of proposed URI follows:

Within-dataset references

To refer to a file within the dataset, I would have two options:

/sub-01/fmap/sub-01_dir-PA_epi.nii.gz
bids:local:/sub-01/fmap/sub-01_dir-PA_epi.nii.gz

This would match @yarikoptic's notion that starting with a / refers to dataset root. It probably makes sense to declare a reserved name like local to refer to the current dataset.

Between-dataset references

bids:deriv1:/sub-01/anat/sub-01_desc-preproc_T1w.nii.gz

Now I need some way to find deriv1. I would suggest a field in dataset_description.json:

{
  ...
  "DatasetLinks": {
    "deriv1": "file://derivatives/derivative1", # Within dataset
    "deriv2": "file://../derivative2", # Local to a collection of datasets intended to be bundled together
    "deriv3": "file:///path/to/derivative3", # Absolute on host
    "deriv3": "https://mydatasethost.com/look/at/this/dataset.git" # Remote
  }
}

Now there's no guarantee that a link persists; if I'm a piece of software creating a derivative of a dataset that doesn't declare a DOI that I can reference, the best I can do is refer to it by the absolute path on the computer I'm running on. But a dataset curator can update dataset_description.json with improved links and it only has to be done in one place.

Does this feel like something like the right direction?

sappelhoff · 2021-04-08T07:30:59Z

Does this feel like something like the right direction?

absolutely, thanks a lot for taking the time and writing this up!

Why would you reserve local? Just for convenience? Why not use the BIDS dataset's name from dataset_description.json?

It probably makes sense to declare a reserved name like local to refer to the current data

+1

Now I need some way to find deriv1. I would suggest a field in dataset_description.json

+1

cc @adam2392

robertoostenveld · 2021-04-08T11:14:10Z

I like the direction, but think that this might confuses the "scheme" with the "authority". BIDS is just a format description, not an actual implementation of a storage system. Hence the ds-name in

[bids:<ds-name>:]/absolute/path/within/dataset

is not unique, since BIDS does not specify a unique namespace. E.g., we could both have a dataset with the same identifier ("study1") on our respective local computers.

If it would be openneuro:, then it could be made unique by the accession number. But whether openneuro should be the scheme or the authority, I am not sure.

I think that

doi:10.34973/37n0-yc51/sub-A2002/anat/sub-A2002_T1w.nii

would be uniquely and persistently resolvable, as would

https://data.donders.ru.nl/collections/di/dccn/DSC_3011020.09_236/sub-A2002/anat/sub-A2002_T1w.nii

where the doi has been replaced by the corresponding (but not persistent) http web address. (Note that you cannot download from there directly, so not an ideal example).

Idem, we could conceive something like

openneuro:ds003548/sub-01/anat/sub-01_T1w.nii.gz

that would resolve to

https://github.com/OpenNeuroDatasets/ds003548/sub-01/anat/sub-01_T1w.nii.gz

or some corresponding datalad or S3 address specification.

yarikoptic · 2021-04-08T11:24:16Z

Minor note: / is allowed in doi suffix, so in doi:10.34973/37n0-yc51/sub-A2002/anat/sub-A2002_T1w.nii you can't really tell where doi stops and path begins.

effigies · 2021-04-13T14:28:49Z

What I'm proposing is:

scheme = bids
path = [<ds-name>:]<ds-relative-path>

If we want to add an authority, then it would look something like bids://openneuro.org/ds000001:/sub-01/anat/sub-01_T1w.nii.gz, which would break down to {"scheme": "bids", "authority": "openneuro.org", "path": "/ds000001:/sub-01/anat/sub-01_T1w.nii.gz"}, and path would further break down into {"ds-name": "/ds000001", "ds-relative-path": "/sub-01/anat/sub-01_T1w.nii.gz"}

I'm not sure I would support adding authorities, though, because links between datasets are often quite dense, i.e., if I'm referencing an external dataset in one place, I'm probably referencing it in another. The canonical example in my mind is a derivative where each derivative data file may point back to one or more source files. If for some reason I need to change the dataset location, the operation would be large and potentially error-prone. Having a small registry that is updated seems to fit the use case better, IMO.

BIDS does not specify a unique namespace. E.g., we could both have a dataset with the same identifier ("study1") on our respective local computers.

The namespaces would be defined within the dataset via the DatasetLinks object (or whatever we call it).

I think that
doi:10.34973/37n0-yc51/sub-A2002/anat/sub-A2002_T1w.nii
would be uniquely and persistently resolvable

I do not believe DOIs allow extending the path as you describe, at least in general. DOI is a doi:<path> -> URI registry.

yarikoptic · 2021-04-13T14:50:11Z

doi:10.34973/37n0-yc51/sub-A2002/anat/sub-A2002_T1w.nii
would be uniquely and persistently resolvable
I do not believe DOIs allow extending the path as you describe, at least in general. DOI is a doi:<path> -> URI registry.

could be made into ad-hoc doi:<DOI>//<PATH>, i.e. using // as a separator but even then no guarantee that some doi would not use (I think it is not disallowed) //.
The only IMHO robust way would be to (consistently) decouple reference to the dataset and path within dataset, i.e. have something like

{
  "dataset": "doi:10.34973/37n0-yc51",
  "path": "sub-A2002/anat/sub-A2002_T1w.nii"
}

robertoostenveld · 2021-04-13T14:55:26Z

DOIs indeed do not specify how one would extend the path if a DOI were to refer to a filesystem. Also, DOIs don't have to refer to a filesystem but could refer to a PDF and then you might want to extent it with page numbers, or line numbers instead. Assuming the DOI points to a filesystem, we still have the issue that @yarikoptic mentions, which is that in my example the / could be part of the authority, but also of the path.

DNS hostnames also do not allow the specification of a an identifier on the server. Nevertheless, https://github.com/bids-standard/bids-specification/blob/master/README.md still works, because it breaks down into {scheme: <protocol>, authority: <DNS-resolvable IP address>, path: <location on the server>}.

So my observation is more that URIs in general consist of a triplet. Your proposal consists of a pair, like file://<path>. But in the path you are then using <ds-name> which I think requires some authority to resolve, like DNS does for hostnames.

effigies · 2021-04-13T16:37:22Z

@yarikoptic

The only IMHO robust way would be to (consistently) decouple reference to the dataset and path within dataset, i.e. have something like
{
  "dataset": "doi:10.34973/37n0-yc51",
  "path": "sub-A2002/anat/sub-A2002_T1w.nii"
}

I agree that decoupling is the best approach, and I would implement what you have as bids:mydataset:/sub-A2002/anat/sub-A2002_T1w.nii, along with a link in dataset_description.json that defines where mydataset can be found:

"DatasetLinks": {
  ...
  "mydataset": "doi:10.34973/37n0-yc51",
  ...
}

@robertoostenveld

So my observation is more that URIs in general consist of a triplet. Your proposal consists of a pair, like file://<path>. But in the path you are then using <ds-name> which I think requires some authority to resolve, like DNS does for hostnames.

The authority is dataset_description.json local to the dataset that contains the reference.

Apologies if the above points were already clear. I may not be understanding your objections.

sappelhoff · 2021-05-21T08:41:12Z

More and more issues are raised connected to this problem, and I expect this to become worse in the future - so the earlier we solve it (properly), the better.

I think Chris' proposal is in a good state to:

make a PR
send out the PR to the bids mailing list, and potentially neurostars to gather feedback
iterate and converge on a solution

@effigies do you have time to prepare an initial PR in the next days? Alternatively, I can give it a go and you can review whether I accurately represented your proposal. Let me know.

effigies · 2021-05-21T20:08:46Z

@sappelhoff Please feel free to take a pass at it. My attention is pretty divided at the moment.

robertoostenveld mentioned this issue May 14, 2020

[ENH] BEP 003: Common Derivatives #265

Merged

5 tasks

effigies added the consistency Spec is (potentially) inconsistent label Sep 29, 2020

This was referenced Jan 28, 2021

WIP: Overhaul CoordinateSystem details for MRI, MEG, EEG, iEEG #700

Closed

IntendedFor fields needs to be harmonized #718

Closed

sappelhoff mentioned this issue Apr 13, 2021

[WIP; ENH] Storage of iEEG electrodes in subject-specific FreeSurfer space #757

Open

sappelhoff mentioned this issue May 2, 2021

Specification of AssociatedEmptyRoom #790

Closed

sappelhoff mentioned this issue Jun 8, 2021

[ENH] Major update to pointing to files within, outside of, or remote to a current BIDS dataset #820

Closed

sappelhoff mentioned this issue Nov 8, 2021

[ENH] Add BIDS URIs and deprecate relative paths, RawSources and (possibly unused) BasedOn #918

Merged

sappelhoff mentioned this issue Apr 25, 2022

Incoherent IntendedFor definitions #1076

Closed

yarikoptic mentioned this issue Apr 29, 2022

validate: RF to collect/output more informative structured records dandi/dandi-cli#943

Closed

sappelhoff closed this as completed in #918 Jul 23, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Specification of relative or absolute path in IntendedFor and SpatialReference is not consistent #471

Specification of relative or absolute path in IntendedFor and SpatialReference is not consistent #471

robertoostenveld commented May 14, 2020

robertoostenveld commented May 14, 2020

effigies commented May 20, 2020

yarikoptic commented Sep 28, 2020 •

edited

Loading

sappelhoff commented Sep 29, 2020

sappelhoff commented Jan 25, 2021

robertoostenveld commented Jan 25, 2021 •

edited

Loading

effigies commented Apr 7, 2021

sappelhoff commented Apr 8, 2021

robertoostenveld commented Apr 8, 2021

yarikoptic commented Apr 8, 2021

effigies commented Apr 13, 2021 •

edited

Loading

yarikoptic commented Apr 13, 2021

robertoostenveld commented Apr 13, 2021

effigies commented Apr 13, 2021

sappelhoff commented May 21, 2021

effigies commented May 21, 2021

Specification of relative or absolute path in IntendedFor and SpatialReference is not consistent #471

Specification of relative or absolute path in IntendedFor and SpatialReference is not consistent #471

Comments

robertoostenveld commented May 14, 2020

robertoostenveld commented May 14, 2020

effigies commented May 20, 2020

yarikoptic commented Sep 28, 2020 • edited Loading

sappelhoff commented Sep 29, 2020

sappelhoff commented Jan 25, 2021

robertoostenveld commented Jan 25, 2021 • edited Loading

effigies commented Apr 7, 2021

Within-dataset references

Between-dataset references

sappelhoff commented Apr 8, 2021

robertoostenveld commented Apr 8, 2021

yarikoptic commented Apr 8, 2021

effigies commented Apr 13, 2021 • edited Loading

yarikoptic commented Apr 13, 2021

robertoostenveld commented Apr 13, 2021

effigies commented Apr 13, 2021

sappelhoff commented May 21, 2021

effigies commented May 21, 2021

yarikoptic commented Sep 28, 2020 •

edited

Loading

robertoostenveld commented Jan 25, 2021 •

edited

Loading

effigies commented Apr 13, 2021 •

edited

Loading