[ENH] BEP031 - New columns to participants.tsv file #816

Merged
merged 13 commits into from
Oct 13, 2021
Changes from 1 commit
29 changes: 21 additions & 8 deletions src/03-modality-agnostic-files.md
@@ -162,16 +162,20 @@ participants.json

The purpose of this RECOMMENDED file is to describe properties of participants
-such as age, sex, handedness.
+such as age, sex, handedness, species, strain, strain_rrid, diagnosis.
If this file exists, it MUST contain the column `participant_id`,
which MUST consist of `sub-<label>` values identifying one row for each participant,
followed by a list of optional columns describing participants.
Each participant MUST be described by one and only one row.
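
The two MUST rules above (a `participant_id` column with `sub-<label>` values, and exactly one row per participant) can be sketched as a small check. This is only an illustration, not the BIDS validator's logic; the function name `check_participant_ids` and the assumption that labels are purely alphanumeric are ours.

```python
import csv
import io
import re

# Assumption: a label is one or more alphanumeric characters.
SUB_LABEL = re.compile(r"^sub-[0-9A-Za-z]+$")


def check_participant_ids(tsv_text):
    """Return a list of problems found in the participant_id column."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    if "participant_id" not in (reader.fieldnames or []):
        return ["missing required column: participant_id"]
    problems = []
    seen = set()
    # Data rows start on line 2 because line 1 is the header.
    for line_no, row in enumerate(reader, start=2):
        pid = row["participant_id"] or ""
        if not SUB_LABEL.match(pid):
            problems.append(f"line {line_no}: {pid!r} is not of the form sub-<label>")
        if pid in seen:
            problems.append(f"line {line_no}: duplicate row for {pid}")
        seen.add(pid)
    return problems
```

An empty list means the file passes these two checks; it says nothing about the optional columns.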

-Commonly used *optional* columns in `participant.tsv` files are `age`, `sex`,
-and `handedness`. We RECOMMEND to make use of these columns, and
-in case that you do use them, we RECOMMEND to use the following values
-for them:
+When different from `homo sapiens`, `participants.tsv` SHOULD include a `species`
Collaborator

Shouldn't this be MUST, given that all of BIDS assumes humans at this point?

Suggested change:
-When different from `homo sapiens`, `participants.tsv` SHOULD include a `species`
+When different from `homo sapiens`, `participants.tsv` MUST include a `species`

Collaborator

I don't know how we would validate a MUST, here. Also, there are rodent datasets that may not have this column in this form at this point, so we would be breaking backwards compatibility if we could validate. What about:

The RECOMMENDED `species` column MUST be a binomial species name from the
[NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi).
For backwards compatibility, if `species` is absent, the participant is assumed to be
`homo sapiens`.

Also, REQUIRE-ing a species name from NCBI Taxonomy feels like it's going to be difficult to validate, as we will need to either query the database or maintain a list of accepted names, updating the validator as new use cases arise... Is there a validation plan?

Collaborator Author

Thank you @effigies for the suggestion, I think assuming homo sapiens if the column is omitted is a strong incentive without breaking backward compatibility. I would be in favor of that.

However, I had not thought about validation. Querying the database seems like the best option, since it avoids maintaining an up-to-date list in the validator, but it may be difficult to implement. Are there similar requirements elsewhere in the spec? Would the alternative of “SHOULD” or “strongly RECOMMENDED” be advisable?

Also, thinking about it, I should add examples other than `homo sapiens`, such as `mus musculus` and `rattus norvegicus`, to the description.

Collaborator

@effigies - regarding validation, i will raise the same issue here as the other post. i'm not sure we actually validate values for example for sex or anything that could have levels or enumerations.

> I don't know how we would validate a MUST

while one could at least detect presence, i agree that keeping with the current perspective of the participants.tsv being a recommended file, we can keep things recommended instead of required.

species does get a little complicated, especially for animals, as you start going into species + genotype notions. here is our generic participant at a timepoint model in dandi: https://github.com/dandi/dandischema/blob/master/dandischema/models.py#L642 (technically all of those properties could come into play, with some being more important for animal studies).

Collaborator Author

@effigies, I modified the species description as you suggested and added examples in 8b41137.
For validation purposes, I kept both the column and the taxonomy as RECOMMENDED and not REQUIRED.
Let me know what you think.

+column, and the value MUST be the string of the binomial species name from
+[NCBI Taxonomy](https://www.ncbi.nlm.nih.gov/Taxonomy/Browser/wwwtax.cgi).
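
For illustration only, the backwards-compatible reading proposed in the thread (an absent `species` column means `homo sapiens`) might look like the sketch below. It is not part of this PR or of the validator. The function name `species_by_participant` is hypothetical, the regular expression is a loose shape check rather than an NCBI Taxonomy lookup, and mapping `n/a` to `None` is our assumption.

```python
import csv
import io
import re

DEFAULT_SPECIES = "homo sapiens"
# Loose shape check only: two alphabetic words separated by one space.
# It does NOT confirm that the name exists in the NCBI Taxonomy.
BINOMIAL = re.compile(r"^[A-Za-z]+ [a-z]+$")


def species_by_participant(tsv_text):
    """Map participant_id to species, defaulting when the column is absent."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    has_species = "species" in (reader.fieldnames or [])
    result = {}
    for row in reader:
        if not has_species:
            value = DEFAULT_SPECIES  # column absent: assume human data
        elif row["species"] == "n/a":
            value = None  # assumption: an explicit n/a means "unknown"
        elif BINOMIAL.match(row["species"] or ""):
            value = row["species"]
        else:
            raise ValueError(f"not a binomial name: {row['species']!r}")
        result[row["participant_id"]] = value
    return result
```

Checking names against the taxonomy itself would need either a database query or a maintained list, which is the validation cost raised in the comments above.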

+Commonly used *optional* columns in `participants.tsv` files are `age`, `sex`,
+`handedness`, `strain`, `strain_rrid` and `diagnosis`. We RECOMMEND to make use
Member

Could `group` be added to this list, as it's used in the example below?

Collaborator Author

Good question. I'm not sure we should, because `group` is also used below, in the participants.json example, to illustrate how to describe an additional column that is not one of the optional ones. From my understanding, `group` may have different meanings depending on the study, and it would be hard to define it with a general example.

+of these columns, and in case that you do use them, we RECOMMEND to use the
+following values for them:

- `age`: numeric value in years (float or integer value)

@@ -197,6 +201,15 @@ for them:
- for "ambidextrous", use one of these values: `ambidextrous`, `a`, `A`,
`AMBIDEXTROUS`, `Ambidextrous`

+- `strain`: string value indicating the strain of the species
Collaborator

Examples for each of these would be useful.

Collaborator Author

Just to clarify, do you mean an example directly in the description, like above for handedness with ambidextrous, a, A, etc., or an example of participants.tsv for an animal?

I did not change the example of participants.tsv below, which is an example for a human. I thought having complete examples for both human and animal would maybe be too much.

Collaborator

Collaborator Author

Examples added in b451671.


+- `strain_rrid`: research resource identifier ([RRID](https://scicrunch.org/resources/Organisms/search))
+of the strain of the species

+- `diagnosis`: string value describing the diagnosis of the participant.
Collaborator

i don't know if this has to be a string value. in many datasets on openneuro diagnosis/dx is present and can be an enumerated type. also, this is one place, where one can have multiple designations depending on the study. we should allow for some notion of that, or simply remove diagnosis from this file.

Collaborator Author

The aim of this PR is to add new columns to describe animal properties. In that context, I agree that diagnosis may be out of scope.

For context, I added the columns following this discussion #779 (comment), #779 (comment) and #779 (comment) because we also introduced pathology in samples.tsv.

Collaborator Author

Following your suggestion, I removed the diagnosis column in 7146144 as being out of scope for this PR (animal properties).

+The diagnosis MAY instead be specified in [Sessions files](06-longitudinal-and-multi-site-studies.md#sessions-file)
+in case it changes over time.

Throughout BIDS you can indicate missing values with `n/a` (for "not
available").
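
As a rough illustration of the `n/a` convention combined with the animal columns introduced here, a reader could treat `n/a` cells as missing and parse `age` as a number. The helper name `read_participants` and the sample rows are hypothetical; in particular, the strain and RRID values are illustrative and should be checked against the RRID portal before reuse.

```python
import csv
import io


def read_participants(tsv_text):
    """Parse participants.tsv text; `n/a` cells become None, `age` becomes float."""
    rows = []
    for row in csv.DictReader(io.StringIO(tsv_text), delimiter="\t"):
        clean = {key: (None if value == "n/a" else value) for key, value in row.items()}
        if clean.get("age") is not None:
            clean["age"] = float(clean["age"])
        rows.append(clean)
    return rows


# Hypothetical animal dataset; values are for illustration only.
EXAMPLE = (
    "participant_id\tspecies\tstrain\tstrain_rrid\tage\n"
    "sub-01\tmus musculus\tC57BL/6J\tRRID:IMSR_JAX:000664\t0.2\n"
    "sub-02\tmus musculus\tC57BL/6J\tRRID:IMSR_JAX:000664\tn/a\n"
)

rows = read_participants(EXAMPLE)
```

Here `rows[1]["age"]` is `None` rather than a placeholder number, so a missing age cannot be mistaken for a measured one.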

@@ -213,9 +226,9 @@ It is RECOMMENDED to accompany each `participants.tsv` file with a sidecar
`participants.json` file to describe the TSV column names and properties of their values (see also
the [section on tabular files](02-common-principles.md#tabular-files)).
Such sidecar files are needed to interpret the data, especially so when
-optional columns are defined beyond `age`, `sex`, and `handedness`, such as
-`group` in this example, or when a different age unit is needed
-(for example, gestational weeks).
+optional columns are defined beyond `age`, `sex`, `handedness`, `species`, `strain`,
+`strain_rrid` and `diagnosis`, such as `group` in this example, or when a different
+age unit is needed (for example, gestational weeks).
If no `units` is provided for age, it will be assumed to be in years relative
to date of birth.
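
A minimal sketch of such a sidecar, and of the "years by default" rule for age, is shown below. The `Description`, `Units` and `Levels` keys follow the tabular-file sidecar convention as we understand it; the descriptions, the `weeks` unit string, the `group` levels and the helper `age_units` are all illustrative assumptions rather than text from the spec.

```python
import json

# Hypothetical participants.json content for an animal dataset.
sidecar = {
    "age": {"Description": "gestational age of the participant", "Units": "weeks"},
    "species": {"Description": "binomial species name from the NCBI Taxonomy"},
    "strain": {"Description": "name of the strain of the species"},
    "strain_rrid": {"Description": "research resource identifier (RRID) of the strain"},
    "group": {
        "Description": "experimental group the participant belonged to",
        "Levels": {"control": "untreated animals", "treated": "animals that received the intervention"},
    },
}


def age_units(sidecar_dict):
    """Return the declared age unit, falling back to years when none is given."""
    return sidecar_dict.get("age", {}).get("Units", "years")


sidecar_text = json.dumps(sidecar, indent=2)
```

With this sidecar, `age_units(sidecar)` returns `"weeks"`, while a dataset without an `age` entry falls back to `"years"`.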
