Commit

Update 3. Data Organisation.md
marbarrantescepas authored May 16, 2024
1 parent dd8ea3f commit 5dca127
Showing 1 changed file with 22 additions and 21 deletions.
43 changes: 22 additions & 21 deletions tabs/3. Data Organisation.md
@@ -66,68 +66,69 @@ Before creating your own DMP, we recommend consulting with your institution to s
It's recommended to follow the FAIR principles in data for improving **F**indability, **A**ccessibility, **I**nteroperability, and **R**eusability. If you plan it from the beginning, it is easier to make data [FAIR](https://the-turing-way.netlify.app/reproducible-research/rdm/rdm-fair). Making data FAIR is not the same as making it open.

{: .important }
> Keep in mind that data should be as open as possible and as closed as necessary.
> **Keep in mind that data should be as open as possible and as closed as necessary.**
Accessible means that there is a procedure in place to access the data. Such a procedure makes it easier to share data or methods within your own group or department and to re-use existing pipelines without spending much effort on finding data.



## Data Storage and Organisation

Storing your data correctly is important to prevent data loss, which happens more often than we would like. To avoid data loss, it’s recommended to pick a suitable storage system and back up your data frequently. Your institution will usually provide information on how to store your data. Check which storage systems your institution uses and how you can back up data correctly. You can consider cloud storage if your data protection requirements allow it, or you can encrypt your data before storing it.
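
As a minimal sketch of what a verified backup can look like, the Python snippet below copies a file and compares SHA-256 checksums of the original and the copy. The function names and file paths are hypothetical and only serve as an illustration; your institution's backup system may already perform this kind of check for you.

```python
import hashlib
import shutil
import tempfile
from pathlib import Path


def sha256sum(path: Path) -> str:
    """Return the SHA-256 hex digest of a file, read in 1 MiB chunks."""
    digest = hashlib.sha256()
    with path.open("rb") as handle:
        for chunk in iter(lambda: handle.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()


def backup_file(source: Path, backup_dir: Path) -> Path:
    """Copy `source` into `backup_dir` and verify the copy by checksum."""
    backup_dir.mkdir(parents=True, exist_ok=True)
    target = backup_dir / source.name
    shutil.copy2(source, target)  # copy2 also preserves timestamps
    if sha256sum(source) != sha256sum(target):
        raise IOError(f"Backup of {source} does not match the original")
    return target


# Demonstration in a temporary directory (hypothetical file name)
with tempfile.TemporaryDirectory() as tmp:
    original = Path(tmp) / "scan_notes.txt"
    original.write_text("example content")
    copy = backup_file(original, Path(tmp) / "backup")
    backup_ok = sha256sum(original) == sha256sum(copy)
    print(backup_ok)
```

Comparing checksums catches silent corruption during the copy, which a simple check that the file exists would miss.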

At the same time, organising data in a meaningful and FAIR way may be challenging, especially at the beginning of new projects where we are not fully sure of all the intermediate files. You should use a clear folder structure to ensure you can find your files.
In the following paragraphs, we divide our recommendations according to the kind of neuroscience data you mainly work with: brain imaging (MRI, PET, EEG, MEG), cognitive, behavioural, cellular, histological, or molecular data.

In the following paragraphs, we divide our recommendations according to the kind of **neuroscience data** you mainly work with: brain imaging (MRI, PET, EEG, MEG), cognitive, behavioural, cellular, histological, or molecular data.

### Brain Imaging Data Structure

The Brain Imaging Data Structure (BIDS) is a community-driven consensus on how to organise and share data obtained in neuroimaging experiments (everybody can become part of the community). Without such a consensus, time is wasted on rearranging data or rewriting scripts. More information about the guidelines and specifications can be found at https://bids.neuroimaging.io/ or https://bids-specification.readthedocs.io/en/stable/index.html
The Brain Imaging Data Structure (BIDS) is a community-driven consensus on how to organise and share data obtained in neuroimaging experiments (everybody can become part of the [community](https://bids.neuroimaging.io/get_involved.html)). Without such a consensus, time is wasted on rearranging data or rewriting scripts. More information about the guidelines and [specifications](https://bids-specification.readthedocs.io/en/stable/index.html) can be found at the [BIDS webpage](https://bids.neuroimaging.io/).

#### BIDS project folder structure
#### **BIDS project folder structure**

Briefly, each project has a main folder containing sourcedata, rawdata and derivatives folders, where different types of data are stored. Sourcedata holds data before any kind of conversion, reconstruction and/or harmonisation; rawdata contains the data converted to NIfTI and JSON format; and derivatives contains all the files derived from your analysis.
Briefly, each project has a main folder containing *sourcedata*, *rawdata* and *derivatives* folders, where different types of data are stored. *Sourcedata* holds data before any kind of conversion, reconstruction and/or harmonisation; *rawdata* contains the data converted to NIfTI and JSON format; and *derivatives* contains all the files derived from your analysis.

Coming soon
{: .label .label-yellow }

Inside sourcedata and rawdata, there should be a folder for each subject of the study named sub-SUBID, where SUBID is the code or identifier of that particular participant, and a tsv file describing the dataset. Inside the subject folder, there should be a subfolder per session in the case of longitudinal data. We recommend adding a session folder even in cross-sectional studies. You never know if the study will become longitudinal later on!
Inside the session folder (if it exists) or the subject folder, there should be a subfolder for each modality, e.g. anat, func, dwi, etc., containing anatomical, functional, diffusion or other types of data respectively. See more information about the specifications of different modalities.
Inside *sourcedata* and *rawdata*, there should be a folder for each subject of the study named **sub-SUBID**, where SUBID is the code or identifier of that particular participant, and a **tsv file** describing the dataset. Inside the subject folder, there should be a subfolder per session in the case of longitudinal data. We recommend adding a session folder even in cross-sectional studies. You never know if the study will become longitudinal later on!
Inside the session folder (if it exists) or the subject folder, there should be a subfolder for each modality, e.g. anat, func, dwi, etc., containing anatomical, functional, diffusion or other types of data respectively. See more information about the [specifications of different modalities](https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files.html).
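
The folder layout described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the subject and session labels are made up, the helper name `make_bids_skeleton` is hypothetical, and the tsv file is assumed to be the `participants.tsv` file with a `participant_id` column.

```python
import tempfile
from pathlib import Path


def make_bids_skeleton(project, subjects, sessions, modalities):
    """Create the top-level folders and rawdata/sub-*/ses-*/<modality> subfolders."""
    for top in ("sourcedata", "rawdata", "derivatives"):
        (project / top).mkdir(parents=True, exist_ok=True)

    created = []
    for sub in subjects:
        for ses in sessions:
            for modality in modalities:
                folder = project / "rawdata" / f"sub-{sub}" / f"ses-{ses}" / modality
                folder.mkdir(parents=True, exist_ok=True)
                created.append(folder.relative_to(project).as_posix())

    # Table listing the participants of the dataset (one row per subject)
    rows = ["participant_id"] + [f"sub-{sub}" for sub in subjects]
    (project / "rawdata" / "participants.tsv").write_text("\n".join(rows) + "\n")
    return sorted(created)


# Demonstration in a temporary directory with made-up labels
with tempfile.TemporaryDirectory() as tmp:
    layout = make_bids_skeleton(
        Path(tmp) / "my_project",
        subjects=["001", "002"],
        sessions=["01"],
        modalities=["anat", "func", "dwi"],
    )
    for folder in layout:
        print(folder)
```

A session folder is created even though this example has a single session, in line with the recommendation above.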

#### BIDS file naming structure
#### **BIDS file naming structure**

Finally, the files MUST be named in a certain way to be machine-readable. There are three main types of data (or extensions): .json files containing metadata, .tsv files containing tables of metadata, and raw data images (with .jpg or .nii.gz). All file names follow a similar structure made of keys, the value corresponding to each key, a suffix and, finally, the extension. Keys are always paired with values. Some key-value pairs are mandatory, for instance the subject name (the key is sub- and the value is the corresponding SUBID), while others are recommended but not mandatory. Suffixes are mandatory and indicate the kind of data. For a given suffix, some entities are required. More information about the specifications for each kind of data can be found here.
Finally, the files MUST be named in a certain way to be machine-readable. There are three main types of data (or extensions): **.json** files containing metadata, **.tsv** files containing tables of metadata, and raw data images (with **.jpg** or **.nii.gz**). All file names follow a similar structure made of **keys**, the **value** corresponding to each key, a **suffix** and, finally, the **extension**. Keys are always paired with values. Some key-value pairs are mandatory, for instance the subject name (the key is sub- and the value is the corresponding SUBID), while others are recommended but not mandatory. Suffixes are mandatory and indicate the kind of data. For a given suffix, some entities are required. More information about the specifications for each kind of data can be found [here](https://bids-specification.readthedocs.io/en/stable/modality-agnostic-files.html).
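
To make the key-value, suffix and extension pattern concrete, here is a small Python sketch. The helper names `bids_filename` and `parse_bids_filename` are hypothetical, the labels are made up, and the sketch only checks for the `sub` key; it does not replace the BIDS validator. It also assumes the caller passes the key-value pairs in the order required by the specification.

```python
def bids_filename(entities: dict, suffix: str, extension: str) -> str:
    """Join key-value pairs, a suffix and an extension into a BIDS-style name."""
    if "sub" not in entities:
        raise ValueError("the 'sub' key is mandatory")
    pairs = [f"{key}-{value}" for key, value in entities.items()]
    return "_".join(pairs + [suffix]) + extension


def parse_bids_filename(name: str) -> dict:
    """Split a BIDS-style name back into key-value pairs, suffix and extension."""
    stem, _, extension = name.partition(".")
    *pairs, suffix = stem.split("_")
    entities = dict(pair.split("-", 1) for pair in pairs)
    return {"entities": entities, "suffix": suffix, "extension": "." + extension}


name = bids_filename({"sub": "001", "ses": "01"}, suffix="T1w", extension=".nii.gz")
parsed = parse_bids_filename(name)
print(name)    # sub-001_ses-01_T1w.nii.gz
print(parsed)
```

Because every part of the name has a fixed position and separator, a script can recover the subject, session and data type from the file name alone, which is what makes the names machine-readable.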

Coming soon
{: .label .label-yellow }

If at this point you are lost and don’t know where to start, we recommend you check the BIDS starter kit and consult experienced people in the field or the BIDS community.
{: .warning }
> If at this point you are lost and don’t know where to start, we recommend you check the BIDS starter kit and consult experienced people in the field or the BIDS community.
#### **BIDS converter and validation tools**

#### BIDS converter and validation tools
There are some tools that organise DICOM data directly to BIDS format. From our personal experience, we recommend **BIDScoin** and **dcm2bids**. However, there are more options that can be used for converting your source data into BIDS format. Find more about these tools at https://bids.neuroimaging.io/benefits.html#software-currently-supporting-bids.

There are some tools that organise DICOM data directly to BIDS format. From our personal experience, we recommend BIDScoin and dcm2bids. However, there are more options that can be used for converting your source data into BIDS format. Find more about these tools at https://bids.neuroimaging.io/benefits.html#software-currently-supporting-bids.
Furthermore, there is a validator available to check if your data is correctly organised, called BIDS validator. Although this tool is useful, it can sometimes be discouraging because it is not up to date with the latest recommendations. If you are struggling with the validator, please check your data with someone more experienced or contact the BIDS community directly.
Furthermore, there is a validator available to check if your data is correctly organised, called [BIDS validator](https://github.com/bids-standard/bids-validator). Although this tool is useful, it can sometimes be discouraging because it is not up to date with the latest recommendations. If you are struggling with the validator, please check your data with someone more experienced or contact the BIDS community directly.

#### BIDS derivatives
#### **BIDS derivatives**
Coming soon
{: .label .label-yellow }

#### BIDS citations
#### **BIDS citations**
Be nice and don’t forget to cite the relevant BIDS publications in your study if you are using BIDS!

### Clinical data management
For clinical, demographic, and behavioural data, accurate and meticulous data management is essential to create a high-quality database for statistical analysis. Procedures to ensure high-quality standards include database design, data entry, data annotation, data validation, discrepancy management, and database locking. A review article highlights the processes and recommended tools for clinical data management (Krishnankutty et al., 2012). Common software for Electronic Data Capture (EDC) includes REDCap, Castor, Greenlight Guru Clinical, Medidata Rave, and Clinion.

Spreadsheets and documents are widely used for various purposes including collecting, storing, manipulating, analyzing, and documenting research data. However, it's important to exercise caution as improper use of them can lead to significant errors in workflows. Our recommendation is to follow the Turing Way.
Spreadsheets and documents are widely used for various purposes including collecting, storing, manipulating, analyzing, and documenting research data. However, it's important to exercise caution as improper use of them can lead to significant errors in workflows. Our recommendation is to follow the [Turing Way](https://the-turing-way.netlify.app/reproducible-research/rdm/rdm-storage).

### BIDS in Microscopic Data
### **BIDS in Microscopic Data**
Coming soon
{: .label .label-yellow }

### Big Genetic Datasets
### **Big Genetic Datasets**
Coming soon
{: .label .label-yellow }

### Work in progress
### **Work in progress**
For some data types, there is no current standard yet. Efforts are ongoing in various fields to develop standardized approaches for data storage and sharing. While there's still progress to be made, we recommend reaching out to colleagues engaged in similar work or data. Collaborating and sharing experiences can provide valuable insights into effective data storage practices.

## Sharing your own data