Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[FIX] Clarify case collision intolerance as a file naming principle #858

Merged
merged 4 commits into from
Sep 28, 2021

Conversation

yarikoptic
Copy link
Collaborator

@yarikoptic yarikoptic commented Aug 20, 2021

Closes bids-standard/legacy-validator#857

An alternative could be to establish a new section, like "Case sensitivity" in the same document, which would allow to generalize the rule to cover the entire file path, and then just refer to it from <label>. But was not sure if it would not just get lost among already a good number of sections there, and where to place it. Sounds like a foundational aspect to me and should be close on top but "what is not as important" there? ;)

  • Whenever this PR is merged in some form or shape, an issue should be filed in bids-validator to have validation for this implemented.

edit 1: is there a term to describe such desired behavior? it is neither "case sensitive" nor "case insensitive" . "case collision intolerant"?

@sappelhoff
Copy link
Member

An alternative could be to establish a new section, like "Case sensitivity" in the same document, which would allow to generalize the rule to cover the entire file path, and then just refer to it from .

That sounds better.

But was not sure if it would not just get lost among already a good number of sections there, and where to place it. Sounds like a foundational aspect to me and should be close on top but "what is not as important" there? ;)

Few people read the spec linearly, most people will read this potential new section after hitting an error in the validator (once we add support there), so I think a "great placing" of the section is less important than having it as a section (because it allows us, as you say, to more easily generalize the rule)

edit 1: is there a term to describe such desired behavior? it is neither "case sensitive" nor "case insensitive" . "case collision intolerant"?

I like case collision intolerant

@effigies
Copy link
Collaborator

I think we're asking for trouble by making this a MUST NOT-level change. I would suggest softening to SHOULD NOT and having the validator warn.

Users can promote some options to promote warnings to errors (for example, fMRIPrep considers missing T1w images an error and OpenNeuro considers missing author lists an error), and some flags appear as check boxes on the web interface. I would support doing that for this to raise awareness of the potential problem, without going so far as to automatically invalidate legacy datasets.

@yarikoptic
Copy link
Collaborator Author

re @effigies

I think we're asking for trouble by making this a MUST NOT-level change. I would suggest softening to SHOULD NOT and having the validator warn.

warnings often get ignored.

I checked existing public openneuro datasets -- seems to be no hits. Those which are yet to be released, better have it fixed first (and likelihood of encountering such cases is small)
(git)smaug:/mnt/datasets/datalad/crawl/openneuro[master]
$> for ds in ds*; do files=$(/usr/bin/find $ds/*); nf=$(echo $files | wc -l); fl=$(echo $files | tr 'A-Z' 'a-z' | sort | uniq -c); nfl=$(echo $fl | wc -l); echo -n "$ds $nf $nfl "; [ "$nf" == "$nfl" ] && echo "ok" || echo "PROBLEM";  done

resulted in none with PROBLEM

A sample run on HBN one mentioned

(git)smaug:/mnt/datasets/datalad/crawl/openneuro[master]
$> for ds in ../../crawl-misc/HBN_BIDS ; do files=$(/usr/bin/find $ds/*); nf=$(echo $files | wc -l); fl=$(echo $files | tr 'A-Z' 'a-z' | sort | uniq -c); nfl=$(echo $fl | wc -l); echo -n "$ds $nf $nfl "; [ "$nf" == "$nfl" ] && echo "ok" || { echo "PROBLEM"; echo $fl | grep -v '^ *1 '; }; done
../../crawl-misc/HBN_BIDS 81321 81308 PROBLEM
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/anat
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/anat/sub-ndarvn280jtn_ses-hbnsitecbic_acq-vnavnorm_t1w.json
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/anat/sub-ndarvn280jtn_ses-hbnsitecbic_acq-vnavnorm_t1w.nii.gz
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/fmap
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/fmap/sub-ndarvn280jtn_ses-hbnsitecbic_acq-fmri_dir-ap_epi.json
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/fmap/sub-ndarvn280jtn_ses-hbnsitecbic_acq-fmri_dir-ap_epi.nii.gz
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/fmap/sub-ndarvn280jtn_ses-hbnsitecbic_acq-fmri_dir-pa_epi.json
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/fmap/sub-ndarvn280jtn_ses-hbnsitecbic_acq-fmri_dir-pa_epi.nii.gz
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/func
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/func/sub-ndarvn280jtn_ses-hbnsitecbic_task-rest_run-1_bold.json
      2 ../../crawl-misc/hbn_bids/sub-ndarvn280jtn/ses-hbnsitecbic/func/sub-ndarvn280jtn_ses-hbnsitecbic_task-rest_run-1_bold.nii.gz

Situation is IMHO of the same caliber of enforcement as "<label> - an alphanumeric value...", and it is not tool specific (as given examples on needing or not anatomical or some optional metadata) but rather might be effecting users of major OS/file systems before they even get to the tool to analyze it. Due to such collisions, and tools just issuing a warning if/whenever such collisions detected, users might proceed to analyzes without noticing that some files might have been overwritten with a copy from an alternative casing. That could effect reproducibility etc -- e.g. if someone originally ran analysis on Linux and then someone tries to reproduce using analysis script, they might not even mention that on their OS/system apparently there was some warning issued some time and results "differ". So I would vote for making it strong (MUST NOT) instead of just a SHOULD NOT.

@yarikoptic
Copy link
Collaborator Author

Thanks everyone for the feedback! So I am to

  • craft a separate session
  • will term it as "case collision intolerant"

@effigies
Copy link
Collaborator

As I said, OpenNeuro could promote the warning to an error and prevent this from arising in our data. The question is whether the standard should break backwards compatibility in this way.

@yarikoptic
Copy link
Collaborator Author

As I said, OpenNeuro could promote the warning to an error and prevent this from arising in our data. The question is whether the standard should break backwards compatibility in this way.

without this "fix" to the standard, such a BIDS dataset might not be compatible with major OSes. So, as to me, it is not "breaking backward compatibility" (as if a problematic dataset could be used "correctly" before) but rather a "bugfix" (as @sappelhoff noted in #857 (comment) noted) to the standard to avoid the issue of a BIDS-compliant dataset incompatibility with the OSes/filesystems.

@sappelhoff
Copy link
Member

without this "fix" to the standard, such a BIDS dataset might not be compatible with major OSes.

That was the argument that convinced me to consider this change a bugfix ... that and the fact that we - apparently - wouldn't break a lot of datasets if any. I would be interested in further opinions though.

@Remi-Gau
Copy link
Collaborator

I am usually pretty conservatives for these things but I would be tempted to bite the bullet on this one and go for a MUST NOT on this one even if it means breaking some datasets hidden somewhere.

@effigies
Copy link
Collaborator

I don't disagree with the arguments for breaking compatibility, but so far it seems like we have a small group approving it, mostly from a developer perspective.

For me to sign off, I'd at least want steering group input or a ping to the discussion list.

@Remi-Gau
Copy link
Collaborator

I don't disagree with the arguments for breaking compatibility, but so far it seems like we have a small group approving it, mostly from a developer perspective.

For me to sign off, I'd at least want steering group input or a ping to the discussion list.

Good point. That's more than fair.

@yarikoptic would you mind hailing the google group on this issue ?

@yarikoptic
Copy link
Collaborator Author

roger that, I will RF this PR into adding a short section, and then hail the list

@yarikoptic
Copy link
Collaborator Author

PR updated, RFC is posted: https://groups.google.com/g/bids-discussion/c/ixm2aFgAJy0

@sappelhoff
Copy link
Member

The RFC (so far) has yielded two comments that I interpret to be in favor of this (breaking) change. I think an input from @bids-standard/steering would be nice to have here.

@poldrack
Copy link

This change sounds very reasonable to me.

@effigies
Copy link
Collaborator

Thinking about this now, this is consistent with the previous decision that FA (fractional anisotropy) and fa (flip angle) cannot both exist as suffixes. Would it make sense to make a more general rule like:

File name components, including labels and suffixes, are case-sensitive,
but collisions MUST be avoided when case is ignored.
For example, a dataset cannot contain both `sub-S1` and `sub-s1`.
Additionally, because the suffix `eeg` is defined, then the suffix `EEG` may not be added to the standard.

It's mixing rules for dataset curators and BEP leads a little bit.

@yarikoptic
Copy link
Collaborator Author

Sounds great to me -- feel welcome @effigies to make it into a recommendation or just a follow up commit if placement should be different (section)

@guiomar
Copy link
Collaborator

guiomar commented Sep 13, 2021

It also sounds good to me. Understanding it as a bug fix and indeed preventing future errors. It's nice there isn't any known public dataset affected. And that the same applies for both labels and suffixes.

@effigies
Copy link
Collaborator

@yarikoptic Pushed a commit directly to your branch. Please have a look.

@effigies effigies changed the title BF: inform that no to <label> can be identical if casing is ignored FIX: Clarify case collision intolerance as a file naming principle Sep 13, 2021
@guiomar
Copy link
Collaborator

guiomar commented Sep 13, 2021

Thanks!!

@yarikoptic
Copy link
Collaborator Author

Thank you @effigies , looks good for me, but I guess I shouldn't approve my own PR ;)

@effigies
Copy link
Collaborator

Well, between the two of us, we can count as one.

Copy link
Collaborator

@guiomar guiomar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me, thank you!

@yarikoptic
Copy link
Collaborator Author

filed https://github.com/bids-standard/bids-validator/issues/1330 , and got surprised that current master is a whooping v1.6.0-203-g811e17e2 - so lots of changes/development since the last release, without any bugfix or minor release.

@robertoostenveld
Copy link
Collaborator

It reminds me of an earlier discussion on units, where we identified that Greek characters used in units (ohm) and prefixes (micro) are ambiguous. See https://bids-specification.readthedocs.io/en/stable/99-appendices/05-units.html. An important raison d'être for the specification is to disambiguate and to make choices where individual researchers themselves might go for opposite choices. In this case "a" and "A" are ambiguous under some conditions.

Hence I also approve of this "breaking change" as a bug fix to the specification.

@sappelhoff sappelhoff changed the title FIX: Clarify case collision intolerance as a file naming principle [FIX] Clarify case collision intolerance as a file naming principle Sep 14, 2021
@sappelhoff
Copy link
Member

Given the general agreement and the postings on the mailing list to make a wider audience, I feel like we can merge this PR in 3 days, thus observing our 5 days rule. It'll then be part of the next release unless anything changes in the meantime.

@sappelhoff
Copy link
Member

Thanks for identifying this issue and pushing for a solution @yarikoptic and all reviewers.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Validate text encoding for JSON data
7 participants