[FIX] clarify which file to list in scans.tsv for file formats with multiple files #1178

sappelhoff · 2022-08-02T08:03:13Z

came up in:

[BUG] parsing scans.tsv is currently too restrictive for BrainVision files mne-tools/mne-bids#1033

I am adding a general recommendation that if file formats consist of several files, that file which would also be passed to I/O software should be listed in scans.tsv.

tsalo · 2022-08-15T13:26:57Z

src/03-modality-agnostic-files.md

+Recordings with multiple parts MUST be documented with one row per file.
+It is RECOMMENDED that if the multiple parts of a recording stem solely from the file format,
+that file is listed, which would also be passed to software functions for
+reading (for example `.vhdr` for BrainVision and `.set` for EEGLAB).


I think the wording needs to be adjusted. In the previous sentence, file formats with multiple files are treated the same as file collections, in which each file MUST be documented separately in the scans.tsv. However, in the next sentence, it is RECOMMENDED to not do that for these file formats specifically.

I think that you need to distinguish between file collections and multi-file file formats from the beginning, so that it's clear that multi-file format files shouldn't be listed separately in the scans.tsv file.

I actually wonder whether it's not better to be consistent and also document multi-file file formats with one row per file then. That is, BrainVision recordings would have 3 rows per recording in scans.tsv, and EEGLAB recordings that make use of their optional .fdt file get 2 rows per recording in scans.tsv.

It's less pretty than just listing the header file there (.vhdr for BrainVision, .set for EEGLAB), but not sure if it's worth introducing a special rule then 🤔 As far as I know we don't have other multi-file file formats (excluding those that are directories, .. here the whole directory gets listed)

What do you think @hoechenberger ?

My proposal would be to allow filename to be an array of strings (pity that it's scans.tsv and not scans.json!!), this would ensure we still have only one row per scan. Still – which file should we try to load? And what about split files?

-> Considering this, I'm leaning toward listing only the VHDR file for BV and SET file for EEGLAB.

But there shouldn't be a RECOMMENDED, it should be a MUST, and we shouldn't just propose, but enforce which file to list. I don't like all this "well yeah umm you could but you don't have do, do what you like" stuff in the standard, it makes software implementations such an unnecessary pain to develop!

Still – which file should we try to load?

That is less of a problem than it seems because all data formats and their extensions are documented in BIDS and it's easy to include a logic like:

"if VMRK found, this is BrainVision, so look for VHDR, if it doesn't exist, raise error"

And what about split files?

Here we actually recommend that people write ALL split files, see:

https://bids-specification.readthedocs.io/en/latest/99-appendices/14-glossary.html#split-entities

implemented in [FIX] split MEG files should be listed separately in scans.tsv #735

(after mne-bids discussion in scans.tsv for split files mne-tools/mne-bids#692)

So that further makes me think that listing all neural recordings in scans.tsv makes sense ... also for the two exceptions we have: EEGLAB (set+fdt) and BrainVision (vhdr, vmrk, eeg) files 🤔

But there shouldn't be a RECOMMENDED, it should be a MUST, and we shouldn't just propose, but enforce which file to list.

right, but this is the age-old discussion about backward compatibility and the balance we are treading ... I think things will get better in the long run with the move to the BIDS schema, which will eventually allow validation of BIDS datasets against particular BIDS schema versions (something we cannot currently do with bids-validator) --> and I think that may make us all more comfortable with breaking changes and including rules that are more strict.

(as you know I do in principle agree with you that we need more MUST and less SHOULD)

TL;DR: So I think I'll revamp this PR so that all BrainVision files and all EEGLAB files SHOULD be listed in scans.tsv

TL;DR: So I think I'll revamp this PR so that all BrainVision files and all EEGLAB files SHOULD be listed in scans.tsv

I really don't think this is a good idea. It effectively creates duplicates for those scans. If you really want to list all files, we should find a way to squeeze ONE recording into ONE row, and not spread it out across several rows just because several files belong to this recording.

It effectively creates duplicates for those scans.

not really duplicates (the extension is different) ... one could argue that it shows that several files belong to that recording. After all, we document that each file belonging to a recording gets an entry in the table.

Also, we are already doing this for split files, and for qMRI file collections

I think it'd be the consistent thing to do ... and I think it's kind of implicit in the present rules already: just not obviously documented (which I'd like to fix).

not really duplicates (the extension is different)

Yes but that is exactly my point :) the purpose of scans.tsv is, and I state the spec:

The purpose of this file is to describe timing and other properties of each imaging acquisition sequence (each run file) within one session. Each neural recording file SHOULD be described by exactly one row.

So the intention was, originally, to have precisely one row per acquisition run, from where you could extract all the relevant metadata for that specific run, such as recording time and yes, also the name of file used to store the data.

Later on we extended the standard, and I believe I contributed or at least +1'd this change as well:

Some recordings consist of multiple parts, that span several files, for example through echo-, part-, or split- entities. Such recordings MUST be documented with one row per file.

And I believe this was a mistake. As users are free to add additional columns to the file, we end up with duplicated metadata in cases of split, ... files.

If the focus and purpose of this file really has shifted that much since its initial inception, it should be renamed to files.tsv, or an entirely new file should be introduced -- instead of squeezing more and more new stuff into a thing that was never designed to store that kind of info.

I know there's basically no chance to make this happen, but I wanted to voice that I'm very unhappy with the direction some things are developing. 😓

.... but since things are a total mess already anyway ... yeah, it's probably more consistent to add those additional files than just omitting them.

I know there's basically no chance to make this happen, but I wanted to voice that I'm very unhappy with the direction some things are developing. sweat

knowing your opinions to some extent I don't think this statement does the current developments justice ... the present issue is more "cleaning up" / "documenting" rather than developing anyhow. But maybe we should discuss that in person some other time :-)

sappelhoff

@robertoostenveld how do you write entries to scans.tsv in data2bids in fieldtrip?

In this PR I want to clarify that each of the three files should be written (eeg, vhdr, vmrk), because that is consistent with what we do for FIF "split files" (write all of them), or entity-linked file collections (write all of them).

I think this is already implicit in the rules, but not obvious and I'd like to improve that with this PR.

codecov · 2022-08-25T13:47:16Z

Codecov Report

Merging #1178 (6298081) into master (7523fc3) will not change coverage.
The diff coverage is n/a.

@@           Coverage Diff           @@
##           master    #1178   +/-   ##
=======================================
  Coverage   88.33%   88.33%           
=======================================
  Files           6        6           
  Lines        1020     1020           
=======================================
  Hits          901      901           
  Misses        119      119

Help us with your feedback. Take ten seconds to tell us how you rate us. Have a feature suggestion? Share it here.

src/04-modality-specific-files/04-intracranial-electroencephalography.md

robertoostenveld · 2022-08-25T15:06:37Z

@robertoostenveld how do you write entries to scans.tsv in data2bids in fieldtrip?

I write one row per single "recording" (aka run), also if that "recording" happens to be represented over multiple files. For BrainVision I use point to the .vhdr file. For EEGLAB (which can comprise a .set plus a .fdt file) I point to the .set file. For CTF I point to the .ds directory, not to any of the files. Also for MEF3 iEEG data I would point to the directory, not to the files.

In fieldtrip we refer to such a conceptual recording construct as the "dataset" and we use https://github.com/fieldtrip/fieldtrip/blob/master/private/dataset2files.m to map between the dataset (the construct) and the actual files in the filesystem.

But I honestly don't think that all 3 files (for BrainVision), 2 files (for EEGLAB) or ~15 files (for CTF) or ~10-100 files (for MEF3) should be listed in the scans file. Just consider: if you were to use BIDS with old ANALYZE data, would you also specify both the .hdr and the .img file? Is there any processing software (i.e. machine readable) or human that would want to have the full list in scans.tsv? MNE Python also only expects the .vhdr filename and not the triplet, right?

[edit: only after typing this reply, I read the whole thread. I now realize that I repeated some things that were already mentioned earlier. Still, I am in favor of one row per run in the scans.tsv, and to use a string (not a list, or a tuple, or a object, or a dict) consistent with the strings that we now specify in the different analysis software packages. E.g. in BrainVision analyzer you click only on the vhdr file to open the recording, not on the triplet.]

sappelhoff · 2022-08-25T15:48:36Z

I write one row per single "recording" (aka run), also if that "recording" happens to be represented over multiple files

How do you proceed for split FIF files that use the split entity? Because unfortunately (?) we have at some point decided that all split files should we written to scans.tsv (see: #1178 (comment))

What you do in data2bids and what we do in mne-bids corresponds to the first commit in this PR (dba9dd8), which I have meanwhile changed based on the fact that:

the existing text reads like each file should be listed
that's what we do for split files

But perhaps me drawing on entity-linked file collections as another argument to write each file in scans.tsv was misguided, as these are in truth separate (but related) recordings -- unlike the EEGLAB, BrainVision, and directory based (mef3, ctf, ...) file formats (is that correct @tsalo? your initial comment triggered this whole train of thought).

I am happy to reconsider again, but it bothers me that for split files we write each file 😕

https://github.com/sccn/bids-matlab-tools does not seem to write (the optional) scans.tsv files at all 🤔 can you confirm that @arnodelorme?

tsalo · 2022-08-25T16:02:17Z

But perhaps me drawing on entity-linked file collections as another argument to write each file in scans.tsv was misguided, as these are in truth separate (but related) recordings -- unlike the EEGLAB, BrainVision, and directory based (mef3, ctf, ...) file formats (is that correct @tsalo? your initial comment triggered this whole train of thought).

Technically some entity-linked file collections come from the same recording (e.g., complex-valued data broken into magnitude and phase parts), while other collections can actually come from distinct scans (e.g., a lot of the file collections used to calculate quantitative MRI maps). However, I suppose it could be said that they all contain the same type of data (i.e., use the same extensions), but different actual data housed within the files. That seems like a meaningful difference from what you're talking about with multi-file formats.

robertoostenveld · 2022-08-26T08:20:23Z

How do you proceed for split FIF files that use the split entity? Because unfortunately (?) we have at some point decided that all split files should we written to scans.tsv

I have not encountered/considered (i.e., thought about) that yet in the context of FT data2bids. Let me read up on that...

robertoostenveld · 2022-08-26T08:29:38Z

Following @tsalo in #1178 (comment), the split FIF files contain the same type of data, at consecutive time windows that all are part of the same run (same acq params, same task, no deliberate subject movement in between). That is different from the eeg/vhdr/vmrk triplet which contain data/header/markers, all from the same experimental time window.

Imagine that we would represent BOLD data as a series of 3D nii files (like in SPM), rather than one 4D nii file. Would you in that case represent that in a single row of the scans.tsv, or in different rows? I feel inclined to accept those ending up on different rows (as they are explicitly split over time), just like split FIF files. The analysis software in that case needs to be smart enough to sort them and glue the data together in memory, but - albeit less practical - could also analyze one at a time (like 2x ft_preprocessing followed by ft_appenddata followed by the rest of the analysis).

robertoostenveld · 2022-08-26T08:32:57Z

In https://bids-specification.readthedocs.io/en/latest/99-appendices/14-glossary.html#split-entities it is written "Each of these files has an internal pointer to the next file.". I am no fif expert. Is that also the case if the split files are written by the neuromag software, e.g. during acquisition or after maxfilter? Or is that only for MNE-Python written files?

edit: FieldTrip does not make use of this internal pointer in the fif file, it expects the user to specify the series and glue them together, either by specifying cfg.dataset = {fif1, fif2, etc.} during ft_preprocessing, or by preprocessing each separately and then explicitly pasting them together with ft_appenddata.

…many rows as files

Update src/04-modality-specific-files/04-intracranial-electroencephalography.md

sappelhoff · 2022-08-26T08:43:58Z

"Each of these files has an internal pointer to the next file.". I am no fif expert. Is that also the case if the split files are written by the neuromag software, e.g. during acquisition or after maxfilter? Or is that only for MNE-Python written files?

I think that's how the files are written by the neuromag software. Is that correct @larsoner?

sappelhoff · 2022-08-26T08:45:00Z

Thanks for the discussion so far @hoechenberger @robertoostenveld @tsalo I have considered your feedback and made a new attempt to document the situation. I'd be happy for another round of reviews.

src/03-modality-agnostic-files.md

robertoostenveld · 2022-08-26T09:11:13Z

I looked at the revised text, it now looks good to me.

Co-authored-by: Robert Oostenveld <r.oostenveld@gmail.com>

hoechenberger · 2022-08-26T09:26:22Z

@robertoostenveld

"Each of these files has an internal pointer to the next file.". I am no fif expert. Is that also the case if the split files are written by the neuromag software, e.g. during acquisition or after maxfilter?

To my knowledge, yes, this is the case. It sometimes leads to problems when users manually rename the files after retrieving them from the acquisition device / computer, as the pointer(s) still contain(s) the old filenames.

robertoostenveld · 2022-08-26T09:39:37Z

@hoechenberger would you be able to tell me which fif tag it is, and possibly point to the mne-matlab code that deals with it (if any)? As far as I am aware, the mne-matlab reader does not deal with a tag like this and requires the user (and the FieldTrip code) to deal with the files one by one.

edit: I searched myself, from open.py I guess it is FIFF_REF_FILE_NAME, which is not used when reading fif files with the MATLAB implementation https://github.com/mne-tools/mne-matlab/search?q=ref_file_name, nor with the CPP implementation https://github.com/mne-tools/mne-cpp.

hoechenberger · 2022-08-26T09:55:09Z

@larsoner may be able to help; he's much more competent than I on that front :)

hoechenberger · 2022-08-26T10:09:15Z

@robertoostenveld

edit: I searched myself, from open.py I guess it is FIFF_REF_FILE_NAME, which is not used when reading fif files with the MATLAB implementation https://github.com/mne-tools/mne-matlab/search?q=ref_file_name, nor with the CPP implementation https://github.com/mne-tools/mne-cpp.

I believe it's actually FIFFV_ROLE_NEXT_FILE, which is being defined here for MNE-MATLAB
https://github.com/mne-tools/mne-matlab/blob/8fb7d04fbf1f2c15e40d14582aab6719486178f7/matlab/fiff_define_constants.m#L277-L278

but never used.

But it's used in MNE-CPP when writing split files:
https://github.com/mne-tools/mne-cpp/blob/fe015fdd225de7c080a1fad4749debcc54eb97f7/applications/mne_scan/plugins/writetofile/writetofile.cpp#L443

robertoostenveld · 2022-08-26T10:59:45Z

Thanks. I now see that teh files can be stitched together using FIFF_REF_FILE_NAME which together with FIFF_REF_ROLE (which can have the value FIFFV_ROLE_NEXT_FILE and FIFFV_ROLE_PREV_FILE) allows to construct the linked-list.
I already saw that it is used in writing with CPP, but it is not used in reading. So reading the data with mne-cpp requires the user looping over the filenames, just as with mne-matlab. That speaks in favor of having all split fif files represented in the scans.tsv, as now further clarified in this PR. Any software that does automatically read the sequence (like mne-python) will then have to take care in not reading the same split fif file multiple times.

sappelhoff · 2022-08-26T13:40:07Z

@hoechenberger @robertoostenveld if you are both happy with the present changes, please don't forget to approve the PR for formality reasons :)

robertoostenveld

I am fine with the suggested changes

effigies · 2022-08-26T16:06:01Z

Do we have a validator pr to go with the new requirements?

sappelhoff · 2022-08-26T23:05:07Z

Do we have a validator pr to go with the new requirements?

not yet, but we could add a warning for example when a BrainVision file is detected and anything BUT .vhdr is listed in scans.tsv (same for EEGLAB). 🤔

As for the file collections and split fif files, the checks are appropriately in place, if I remember correctly.

hoechenberger · 2022-08-27T05:24:10Z

Thanks @sappelhoff!!

sappelhoff added this to the 1.8.0 milestone Aug 2, 2022

sappelhoff mentioned this pull request Aug 2, 2022

[BUG] parsing scans.tsv is currently too restrictive for BrainVision files mne-tools/mne-bids#1033

Closed

sappelhoff requested review from Remi-Gau and tsalo August 15, 2022 09:52

tsalo reviewed Aug 15, 2022

View reviewed changes

sappelhoff requested review from erdalkaraca and ezemikulan as code owners August 25, 2022 13:43

sappelhoff force-pushed the clarify/scans/formats branch from 8d84248 to 454fc7b Compare August 25, 2022 13:44

sappelhoff requested a review from robertoostenveld August 25, 2022 13:44

sappelhoff commented Aug 25, 2022

View reviewed changes

src/04-modality-specific-files/04-intracranial-electroencephalography.md Outdated Show resolved Hide resolved

sappelhoff added 4 commits August 26, 2022 10:39

[FIX] clarify which file to list in scans

16facb6

revert on initial idea: multi-file formats should be described by as …

7a1ae14

…many rows as files

update brainvision link

a10bcd8

Update src/04-modality-specific-files/04-intracranial-electroencephalography.md

BV->vhdr, EEGLAB->set, CTF->ds, FIF->all files

57914fa

sappelhoff force-pushed the clarify/scans/formats branch from 67bf58b to 57914fa Compare August 26, 2022 08:40

fix link

c372665

sappelhoff commented Aug 26, 2022

View reviewed changes

src/03-modality-agnostic-files.md Outdated Show resolved Hide resolved

robertoostenveld reviewed Aug 26, 2022

View reviewed changes

src/03-modality-agnostic-files.md Outdated Show resolved Hide resolved

fix typos

6298081

Co-authored-by: Robert Oostenveld <r.oostenveld@gmail.com>

robertoostenveld requested review from robertoostenveld and hoechenberger August 26, 2022 15:01

robertoostenveld approved these changes Aug 26, 2022

View reviewed changes

hoechenberger approved these changes Aug 26, 2022

View reviewed changes

hoechenberger merged commit 594ef95 into bids-standard:master Aug 27, 2022

bids-maintenance added a commit that referenced this pull request Aug 27, 2022

[DOC] Auto-generate changelog entry for PR #1178

b3afa2a

sappelhoff deleted the clarify/scans/formats branch March 16, 2023 09:29

sappelhoff mentioned this pull request Mar 16, 2023

Clarification regarding scans.tsv filename and MEG CTF output directory. #517

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[FIX] clarify which file to list in scans.tsv for file formats with multiple files #1178

[FIX] clarify which file to list in scans.tsv for file formats with multiple files #1178

sappelhoff commented Aug 2, 2022

tsalo Aug 15, 2022

sappelhoff Aug 24, 2022

hoechenberger Aug 24, 2022 •

edited

Loading

sappelhoff Aug 25, 2022

hoechenberger Aug 25, 2022 •

edited

Loading

sappelhoff Aug 25, 2022

hoechenberger Aug 25, 2022 •

edited

Loading

hoechenberger Aug 25, 2022 •

edited

Loading

sappelhoff Aug 25, 2022

sappelhoff left a comment

codecov bot commented Aug 25, 2022 •

edited

Loading

robertoostenveld commented Aug 25, 2022 •

edited

Loading

sappelhoff commented Aug 25, 2022

tsalo commented Aug 25, 2022

robertoostenveld commented Aug 26, 2022

robertoostenveld commented Aug 26, 2022 •

edited

Loading

robertoostenveld commented Aug 26, 2022 •

edited

Loading

sappelhoff commented Aug 26, 2022

sappelhoff commented Aug 26, 2022

robertoostenveld commented Aug 26, 2022

hoechenberger commented Aug 26, 2022 •

edited

Loading

robertoostenveld commented Aug 26, 2022 •

edited

Loading

hoechenberger commented Aug 26, 2022

hoechenberger commented Aug 26, 2022

robertoostenveld commented Aug 26, 2022

sappelhoff commented Aug 26, 2022

robertoostenveld left a comment

effigies commented Aug 26, 2022

sappelhoff commented Aug 26, 2022

hoechenberger commented Aug 27, 2022

[FIX] clarify which file to list in scans.tsv for file formats with multiple files #1178

[FIX] clarify which file to list in scans.tsv for file formats with multiple files #1178

Conversation

sappelhoff commented Aug 2, 2022

tsalo Aug 15, 2022

Choose a reason for hiding this comment

sappelhoff Aug 24, 2022

Choose a reason for hiding this comment

hoechenberger Aug 24, 2022 • edited Loading

Choose a reason for hiding this comment

sappelhoff Aug 25, 2022

Choose a reason for hiding this comment

hoechenberger Aug 25, 2022 • edited Loading

Choose a reason for hiding this comment

sappelhoff Aug 25, 2022

Choose a reason for hiding this comment

hoechenberger Aug 25, 2022 • edited Loading

Choose a reason for hiding this comment

hoechenberger Aug 25, 2022 • edited Loading

Choose a reason for hiding this comment

sappelhoff Aug 25, 2022

Choose a reason for hiding this comment

sappelhoff left a comment

Choose a reason for hiding this comment

codecov bot commented Aug 25, 2022 • edited Loading

Codecov Report

robertoostenveld commented Aug 25, 2022 • edited Loading

sappelhoff commented Aug 25, 2022

tsalo commented Aug 25, 2022

robertoostenveld commented Aug 26, 2022

robertoostenveld commented Aug 26, 2022 • edited Loading

robertoostenveld commented Aug 26, 2022 • edited Loading

sappelhoff commented Aug 26, 2022

sappelhoff commented Aug 26, 2022

robertoostenveld commented Aug 26, 2022

hoechenberger commented Aug 26, 2022 • edited Loading

robertoostenveld commented Aug 26, 2022 • edited Loading

hoechenberger commented Aug 26, 2022

hoechenberger commented Aug 26, 2022

robertoostenveld commented Aug 26, 2022

sappelhoff commented Aug 26, 2022

robertoostenveld left a comment

Choose a reason for hiding this comment

effigies commented Aug 26, 2022

sappelhoff commented Aug 26, 2022

hoechenberger commented Aug 27, 2022

hoechenberger Aug 24, 2022 •

edited

Loading

hoechenberger Aug 25, 2022 •

edited

Loading

hoechenberger Aug 25, 2022 •

edited

Loading

hoechenberger Aug 25, 2022 •

edited

Loading

codecov bot commented Aug 25, 2022 •

edited

Loading

robertoostenveld commented Aug 25, 2022 •

edited

Loading

robertoostenveld commented Aug 26, 2022 •

edited

Loading

robertoostenveld commented Aug 26, 2022 •

edited

Loading

hoechenberger commented Aug 26, 2022 •

edited

Loading

robertoostenveld commented Aug 26, 2022 •

edited

Loading