Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[ENH] Clarify that data files must be uniquely identified by entities/suffix #1508

Merged
merged 3 commits into from
Jun 15, 2023

Conversation

sappelhoff
Copy link
Member

related to #1487

src/common-principles.md Outdated Show resolved Hide resolved
Copy link
Collaborator

@DimitriPapadopoulos DimitriPapadopoulos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Trying to think out loud of corner cases where duplicates could be viewed as useful: A project might rely on two pieces of software, one reading TIFF files only, the other one reading JPEG files only. It might seem useful to keep both formats of the same image in the dataset. The purpose would not only be to "cache" both formats to avoid re-generating, but also "archiving" both formats just in case the conversion software is lost.

But then, just use different names, on the plus side you'd know which image is the original the other image has been converted from:

  • sub-01_ses-01_sample-A_original.tif
  • sub-01_ses-01_sample-A_converted.jpg

I really cannot find a useful case, on the contrary duplicate file names just bring confusion.

@effigies
Copy link
Collaborator

effigies commented Jun 2, 2023

I feel like this doesn't quite hit the mark for me, and I agree with @DimitriPapadopoulos that the focus on identical data seems too narrow. Maybe it's worth introducing terms:

  1. Data files
  2. Format-specified metadata files (shared entities, shared suffix, different extension)
  3. BIDS sidecar metadata files (shared suffix, subset of entities, different extension)

Data files MUST be uniquely identified by entities+suffix. Only metadata files (format-specified or sidecar) files may share entities and suffix with a data file.

We can then go on to say that this implies that data files MUST NOT have the same entities and suffix, and using shared entities and suffix to indicate the same data encoded in multiple formats is not supported.

I don't know if it's worth bringing in the additional concept of:

  1. Associated data files (subset of entities, different suffix)

The same rules apply to associated data as data, and the photo is a good example here.

@Remi-Gau
Copy link
Collaborator

Remi-Gau commented Jun 2, 2023

Maybe it's worth introducing terms:

1. Data files

2. Format-specified metadata files (shared entities, shared suffix, different extension)

3. BIDS sidecar metadata files (shared suffix, subset of entities, different extension)

were you thinking of adding those in the definitions?

@effigies
Copy link
Collaborator

effigies commented Jun 2, 2023

I think we need some way of distinguishing these concepts to explain what overlaps are permitted and what aren't, but I'm not committed to these terms or necessarily making them "terms" in the sense of getting schema entries (though maybe). If we can write one or two paragraphs that make it clear, then cool.

But I guess first: Do you all agree with this breakdown?

@Remi-Gau
Copy link
Collaborator

Remi-Gau commented Jun 2, 2023

I agree with this breakdown.

I suspect examples will help.

I would suggest having the definitions in the spec first but give ourselves before we move them into the schema (in another pr)

@sappelhoff
Copy link
Member Author

But I guess first: Do you all agree with this breakdown?

I do, except for this special case of a "Data file" that I can think of: The EEGLAB file format has .set files with an optional .fdt file. With the optional .fdt file, the .set file is a "Format-specified metadata file", whereas without the .fdt file, the .set file is actually the "Data file".

But that's just something to keep in mind if we were to make lists of extensions and classify them under Chris' three categories ... which I don't see any need for. (same argument also goes for tsv files that are sometimes data files, sometimes metdata files, actually)

I don't know if it's worth bringing in the additional concept of:

Associated data files (subset of entities, different suffix)

I think that'd make sense. These files are not really data files in their own right (one wouldn't have a BIDS dataset with just such files), so it'd be good to distinguish them.

@effigies
Copy link
Collaborator

effigies commented Jun 6, 2023

How about this:

Uniqueness of data files

Data files MUST be uniquely identified by BIDS path components (entities, datatype, suffix).
If multiple extensions are permissible (for example, .nii and .nii.gz), there MUST only be one such file with the same entities, datatype and suffix.
This limitation does not apply to metadata files, such as JSON sidecar files or format-specific metadata files.

Note that duplicating files to make the same data available in multiple formats is not permitted.
For example, if the files sub-01_ses-01_sample-A_photo.jpg and sub-01_ses-01_sample-A_photo.tif
contain a representation of the same data, then the dataset MUST NOT contain both images.
If the files contain different images, other entities MUST be used to distinguish the two.

@sappelhoff
Copy link
Member Author

sounds great to me ✅ feel free to push a commit --- else I'll push a commit with you as a co-author tomorrow @effigies

@codecov
Copy link

codecov bot commented Jun 6, 2023

Codecov Report

Patch and project coverage have no change.

Comparison is base (22d7ab3) 87.80% compared to head (a677bfd) 87.80%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #1508   +/-   ##
=======================================
  Coverage   87.80%   87.80%           
=======================================
  Files          14       14           
  Lines        1287     1287           
=======================================
  Hits         1130     1130           
  Misses        157      157           

☔ View full report in Codecov by Sentry.
📢 Do you have feedback about the report comment? Let us know in this issue.

Copy link
Collaborator

@effigies effigies left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve of my own text...

Copy link
Collaborator

@Remi-Gau Remi-Gau left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I approve of @effigies approving his own text

@effigies effigies changed the title [MISC] clarify duplications with different extensions are not permitted [ENH] Clarify that data files must be uniquely identified by entities/suffix Jun 7, 2023
@effigies effigies merged commit 6af9537 into bids-standard:master Jun 15, 2023
@sappelhoff sappelhoff deleted the dupes branch June 15, 2023 19:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants