Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Proposed Analysis: RNA Expression-based Prediction of Sex #84

Closed
cgreene opened this issue Aug 28, 2019 · 20 comments
Closed

Proposed Analysis: RNA Expression-based Prediction of Sex #84

cgreene opened this issue Aug 28, 2019 · 20 comments
Labels
in progress Someone is working on this issue, but feel free to propose an alternative approach! proposed analysis transcriptomic Related to or requires transcriptomic data

Comments

@cgreene
Copy link
Collaborator

cgreene commented Aug 28, 2019

Scientific goals

In #73 it's noted that the reported sex of the participants in the study didn't align with information from germline sequencing in 11 cases. It could be interesting to understand how accurately this can be called from gene expression data. For some studies, gene expression data is all that is available. If it can also be accurately called directly from gene expression data, these studies will be better positioned to evaluate their metadata.

Proposed methods

  • Construct an elastic-net logistic regression classifier using gene expression values as features and reported sex as the labels. I suggest elastic net because the signal is expected to be relatively linear, most genes are expected to play little role, but a few sets of genes are likely to be predictive and highly correlated, and it makes sense to spread weights across them to produce a robust predictor.
  • Evaluate using cross-validation with reported labels across the full set.
  • Evaluate using cross-validation with germline-based labels across the full set
  • Evaluation prediction accuracy using germline sequencing within each histology.

Required input data

For classifier construction:

  • The reported sex in the PBTA histologies file.
  • Gene expression estimates from kallisto, RSEM, or both.

For evaluation:

  • The germline-based sequencing calls.
  • Histologies, so that performance can be broken out by histology as well.

Proposed timeline

I am proposing this analysis but don't have time to do it, so I would leave this as an estimate for someone who decides to take this on.

Relevant literature

There are a number of reports of this being readily discoverable, even with unsupervised methods. These are two from our group where we noticed this, but there are also others from other groups:

@bill-amadio
Copy link
Contributor

I would like to give this a try. pbta-gene-expression-kallisto.rds looks like a good first source of gene expression estimates (1028 samples of 200,000+ transcript abundance values). pbta-histologies.tsv has an entry for each of these samples. Model construction will be with glmnet and glmnetUtils packages which allow tuning, through cross validation, of the regularization weight (lambda) and elastic net weight (alpha) simultaneously.

@jaclyn-taroni
Copy link
Member

Welcome @bill-amadio - thanks for getting this analysis started! We can use this issue to discuss strategy and address any questions around the analysis you have.

@jaclyn-taroni jaclyn-taroni added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Aug 30, 2019
@cgreene
Copy link
Collaborator Author

cgreene commented Aug 31, 2019

@bill-amadio - very exciting! This plan sounds good - in the interests of compute time you may find that filtering to the transcripts that vary the most in an unsupervised way (something like the median absolute deviation) may make things much faster to run with little to no loss in accuracy. I'm looking forward to hearing how this goes!

@bill-amadio
Copy link
Contributor

Thanks @jaclyn-taroni and @cgreene - Quick question: are the calls from germline sequencing 100% accurate? If not, is there some estimate of the uncertainty in the process?

@cgreene
Copy link
Collaborator Author

cgreene commented Sep 1, 2019

I don't know that we could say they are 100% accurate. We don't have that level of information, though they should be more closely aligned than the reported gender. The way I'm guessing that is by looking at the powerpoint linked in this comment from @jharenza #73 (comment). In slide 4 it looks like there are relatively few samples in the regions between the two distributions, and these are expected to be due to sample contamination.

@bill-amadio
Copy link
Contributor

UnderstandKallistoRDS.pdf

Hi, Jackie and Casey,

I've attached a first result, a predictive model for reported_gender using the top 25% mean absolute deviation transcripts out of the kallisto V3 file. I split the data set 70/30 train/test, accuracy on the holdout was 96.7%. The elastic net alpha on this partition was 0.34. Other partitions have yielded alpha = 1. With alpha = 0.34, we have about 200 non-zero coefficients.

This doc still has a lot of my development diagnostics, but I wanted to get something to you as soon as possible. I used paths specific to my computer, and more wrangling code needs to be written. I had only one patient with more than one record in the histology file, so I dealt with the dupe manually. I think the final version should handle multiple dupes automatically.

I should be able to try the same code using the germline_sex-estimates next week.

Hope you both are well.

@jharenza
Copy link
Collaborator

jharenza commented Sep 7, 2019

I don't know that we could say they are 100% accurate. We don't have that level of information, though they should be more closely aligned than the reported gender. The way I'm guessing that is by looking at the powerpoint linked in this comment from @jharenza #73 (comment). In slide 4 it looks like there are relatively few samples in the regions between the two distributions, and these are expected to be due to sample contamination.

That's right, @bill-amadio - there was only one sample which landed between the two distributions and we had removed it from our dataset due to a previous QC using NGSCheckmate on its tumor/normal WGS data. We are presuming that germline sample was contaminated, hence why the fail on NGScheckmate and the mid-value for the sex prediction.

Great to have you working on the RNA side of this!

@jharenza
Copy link
Collaborator

jharenza commented Sep 7, 2019

UnderstandKallistoRDS.pdf

Hi, Jackie and Casey,

I've attached a first result, a predictive model for reported_gender using the top 25% mean absolute deviation transcripts out of the kallisto V3 file. I split the data set 70/30 train/test, accuracy on the holdout was 96.7%. The elastic net alpha on this partition was 0.34. Other partitions have yielded alpha = 1. With alpha = 0.34, we have about 200 non-zero coefficients.

This doc still has a lot of my development diagnostics, but I wanted to get something to you as soon as possible. I used paths specific to my computer, and more wrangling code needs to be written. I had only one patient with more than one record in the histology file, so I dealt with the dupe manually. I think the final version should handle multiple dupes automatically.

I should be able to try the same code using the germline_sex-estimates next week.

Hope you both are well.

Thanks for working on this! With regard to the duplicate sample, I noticed this and created an issue and this should be fixed in V4 of the data release, which we hope to have online by the end of next week.

It would be great if you wanted to start a pull request with your analysis code, link to this issue, and that way we can make direct comments within the PR :).

@bill-amadio
Copy link
Contributor

OK, @jharenza. Will do.

@bill-amadio
Copy link
Contributor

Apologies. This is my first multi-participant Git/GitHub project. I am not sure I have the symlinks and the pull request correct. I've reached out to a colleague here at Rider for some hand-holding. Will issue the pull request asap.

@jaclyn-taroni
Copy link
Member

@bill-amadio No worries at all -- please let us know if there's anything we can do to help!

@bill-amadio
Copy link
Contributor

@jaclyn-taroni thank you so much. I would prefer to come to you to setup and launch this first pull request. I can get to you Monday or Wednesday of next week. Is there a time on either of those days that works for you?

@jaclyn-taroni
Copy link
Member

Hi @bill-amadio - I will send you an email to coordinate.

@bill-amadio bill-amadio mentioned this issue Oct 18, 2019
2 tasks
@jaclyn-taroni jaclyn-taroni added the transcriptomic Related to or requires transcriptomic data label Oct 26, 2019
@jaclyn-taroni jaclyn-taroni mentioned this issue Nov 20, 2019
2 tasks
@jaclyn-taroni
Copy link
Member

At the moment, this module fails in CI intermittently. See #391. This may indicate an underlying issue with the reproducibility of the analysis but I'm not sure if it's because of how it is run in CI or a more general issue. It warrants further investigation

@sjspielman
Copy link
Member

sjspielman commented Jan 2, 2020

Probably not the cause of CI failure, but I do have difficulty running the analysis locally. For example, the required library e1071 is never actually loaded for 03-evaluate_model.R Running w/in CI or Docker environment has likely been circumventing this bug, but the code needs to be explicit about what libraries are loaded and used.
EDIT: This was identified running the script while getting angsty for the Docker env to spin up.

@jaclyn-taroni
Copy link
Member

The e1071 problem likely stems from the way caret is installed (ref)

The package “suggests” field includes 30 packages. caret loads packages as needed and assumes that they are installed. If a modeling package is missing, there is a prompt to install it.

Install caret using

install.packages("caret", dependencies = c("Depends", "Suggests"))

to ensure that all the needed packages are installed.

and loaded, as you point out @sjspielman. I've run into this issue before.

@bill-amadio
Copy link
Contributor

bill-amadio commented Jan 2, 2020 via email

@bill-amadio
Copy link
Contributor

bill-amadio commented Jan 2, 2020 via email

@jaclyn-taroni
Copy link
Member

Now that this module has a README as of #404, I believe this issue can be closed and if we decide to make changes, we can file a new Update an analysis issue. Thanks for all your hard work @bill-amadio, congrats!

@bill-amadio
Copy link
Contributor

@jaclyn-taroni It was my pleasure. Thank you for the opportunity and for all your help.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
in progress Someone is working on this issue, but feel free to propose an alternative approach! proposed analysis transcriptomic Related to or requires transcriptomic data
Projects
None yet
Development

No branches or pull requests

5 participants