Proposed Analysis: RNA Expression-based Prediction of Sex #84

cgreene · 2019-08-28T13:53:35Z

Scientific goals

In #73 it's noted that the reported sex of the participants in the study didn't align with information from germline sequencing in 11 cases. It could be interesting to understand how accurately sex can be called from gene expression data. For some studies, gene expression data is all that is available. If sex can be called accurately from gene expression data alone, these studies will be better positioned to evaluate their metadata.

Proposed methods

Construct an elastic-net logistic regression classifier using gene expression values as features and reported sex as the labels. I suggest elastic net because the signal is expected to be relatively linear, most genes are expected to play little role, but a few sets of genes are likely to be predictive and highly correlated, and it makes sense to spread weights across them to produce a robust predictor.
Evaluate using cross-validation with reported labels across the full set.
Evaluate using cross-validation with germline-based labels across the full set.
Evaluate prediction accuracy using germline sequencing within each histology.
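
For reference, the elastic-net logistic regression objective can be written as below. This uses the parameterization documented for glmnet, which is an assumption at this point in the thread since the proposal itself does not name a package. Here $y_i \in \{0, 1\}$ is the sex label for sample $i$, $x_i$ is its vector of expression values, $\lambda \ge 0$ is the overall regularization weight, and $\alpha \in [0, 1]$ mixes the ridge ($\alpha = 0$) and lasso ($\alpha = 1$) penalties.

```latex
\min_{\beta_0,\,\beta}\;
  -\frac{1}{N}\sum_{i=1}^{N}
    \Big[\, y_i\,(\beta_0 + x_i^{\top}\beta)
          - \log\!\big(1 + e^{\beta_0 + x_i^{\top}\beta}\big) \Big]
  \;+\; \lambda\Big[\, \tfrac{1-\alpha}{2}\,\lVert\beta\rVert_2^2
                      + \alpha\,\lVert\beta\rVert_1 \Big]
```

The ridge term is what lets weight spread across groups of correlated genes, while the lasso term drives the coefficients of uninformative genes to zero.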

Required input data

For classifier construction:

The reported sex in the PBTA histologies file.
Gene expression estimates from kallisto, RSEM, or both.

For evaluation:

The germline-based sequencing calls.
Histologies, so that performance can be broken out by histology as well.

Proposed timeline

I am proposing this analysis but don't have time to do it, so I would leave this as an estimate for someone who decides to take this on.

Relevant literature

There are a number of reports of this being readily discoverable, even with unsupervised methods. These are two from our group where we noticed this, but there are also others from other groups:

Unsupervised VAE -> feature aligned with sex: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5728678/
Multiple unsupervised methods -> feature aligned with sex: https://www.biorxiv.org/content/10.1101/573782v1

bill-amadio · 2019-08-30T21:25:28Z

I would like to give this a try. pbta-gene-expression-kallisto.rds looks like a good first source of gene expression estimates (1028 samples of 200,000+ transcript abundance values). pbta-histologies.tsv has an entry for each of these samples. Model construction will be with glmnet and glmnetUtils packages which allow tuning, through cross validation, of the regularization weight (lambda) and elastic net weight (alpha) simultaneously.

jaclyn-taroni · 2019-08-30T21:48:09Z

Welcome @bill-amadio - thanks for getting this analysis started! We can use this issue to discuss strategy and address any questions around the analysis you have.

cgreene · 2019-08-31T18:12:42Z

@bill-amadio - very exciting! This plan sounds good - in the interests of compute time you may find that filtering to the transcripts that vary the most in an unsupervised way (something like the median absolute deviation) may make things much faster to run with little to no loss in accuracy. I'm looking forward to hearing how this goes!
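
A minimal sketch of the unsupervised filter suggested above, written in Python purely for illustration (the analysis in this thread is in R). The function names, the toy data, and the `keep_fraction` default are all hypothetical; no cutoff was specified in the suggestion.

```python
from statistics import median


def median_absolute_deviation(values):
    """Median of the absolute deviations from the median."""
    center = median(values)
    return median(abs(v - center) for v in values)


def top_variable_features(expression, keep_fraction=0.25):
    """Return the names of the features with the largest MAD.

    expression: dict mapping feature name -> list of per-sample values.
    The sex labels are never consulted, so the filter is unsupervised.
    """
    scores = {name: median_absolute_deviation(vals)
              for name, vals in expression.items()}
    n_keep = max(1, int(len(scores) * keep_fraction))
    # Sort by descending MAD; break ties by name so the result is stable.
    ranked = sorted(scores, key=lambda name: (-scores[name], name))
    return ranked[:n_keep]


# Toy, made-up abundance values for four hypothetical transcripts.
toy = {
    "tx_a": [0.0, 0.1, 0.0, 0.1],
    "tx_b": [1.0, 9.0, 1.5, 8.5],
    "tx_c": [5.0, 5.0, 5.0, 5.0],
    "tx_d": [2.0, 4.0, 2.0, 4.0],
}
selected = top_variable_features(toy, keep_fraction=0.5)
```

Because the ranking uses only the expression values, it can be computed once on the full matrix without leaking label information into the classifier.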

bill-amadio · 2019-09-01T14:41:55Z

Thanks @jaclyn-taroni and @cgreene - Quick question: are the calls from germline sequencing 100% accurate? If not, is there some estimate of the uncertainty in the process?

cgreene · 2019-09-01T14:49:20Z

I don't know that we could say they are 100% accurate. We don't have that level of information, though they should be more closely aligned than the reported gender. The way I'm guessing that is by looking at the powerpoint linked in this comment from @jharenza #73 (comment). In slide 4 it looks like there are relatively few samples in the regions between the two distributions, and these are expected to be due to sample contamination.

bill-amadio · 2019-09-07T17:49:54Z

UnderstandKallistoRDS.pdf

Hi, Jackie and Casey,

I've attached a first result, a predictive model for reported_gender using the top 25% mean absolute deviation transcripts out of the kallisto V3 file. I split the data set 70/30 train/test, accuracy on the holdout was 96.7%. The elastic net alpha on this partition was 0.34. Other partitions have yielded alpha = 1. With alpha = 0.34, we have about 200 non-zero coefficients.
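
A rough sketch of a seeded 70/30 split, in Python for illustration only (the attached analysis is in R and may split differently). Stratifying by label and the fixed `seed` are assumptions added here, not something stated in the comment; the function name and toy sample IDs are hypothetical. Fixing the seed makes a single partition reproducible, which matters when the tuned alpha changes from one partition to the next.

```python
import random
from collections import defaultdict


def stratified_split(sample_ids, labels, train_fraction=0.7, seed=0):
    """Split sample IDs into train/test, keeping label proportions similar."""
    rng = random.Random(seed)  # local generator so the split is repeatable
    by_label = defaultdict(list)
    for sample_id, label in zip(sample_ids, labels):
        by_label[label].append(sample_id)

    train, test = [], []
    for label in sorted(by_label):  # fixed order keeps results deterministic
        ids = list(by_label[label])
        rng.shuffle(ids)
        cut = round(len(ids) * train_fraction)
        train.extend(ids[:cut])
        test.extend(ids[cut:])
    return train, test


# Toy, made-up sample IDs and labels.
ids = [f"s{i}" for i in range(20)]
labs = ["Female"] * 10 + ["Male"] * 10
train_ids, test_ids = stratified_split(ids, labs)
```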

This doc still has a lot of my development diagnostics, but I wanted to get something to you as soon as possible. I used paths specific to my computer, and more wrangling code needs to be written. I had only one patient with more than one record in the histology file, so I dealt with the dupe manually. I think the final version should handle multiple dupes automatically.

I should be able to try the same code using the germline_sex-estimates next week.

Hope you both are well.

jharenza · 2019-09-07T19:37:04Z

> I don't know that we could say they are 100% accurate. We don't have that level of information, though they should be more closely aligned than the reported gender. The way I'm guessing that is by looking at the powerpoint linked in this comment from @jharenza #73 (comment). In slide 4 it looks like there are relatively few samples in the regions between the two distributions, and these are expected to be due to sample contamination.

That's right, @bill-amadio - there was only one sample which landed between the two distributions and we had removed it from our dataset due to a previous QC using NGSCheckmate on its tumor/normal WGS data. We are presuming that germline sample was contaminated, hence why the fail on NGScheckmate and the mid-value for the sex prediction.

Great to have you working on the RNA side of this!

jharenza · 2019-09-07T19:39:42Z

> UnderstandKallistoRDS.pdf
>
> Hi, Jackie and Casey,
>
> I've attached a first result, a predictive model for reported_gender using the top 25% mean absolute deviation transcripts out of the kallisto V3 file. I split the data set 70/30 train/test, accuracy on the holdout was 96.7%. The elastic net alpha on this partition was 0.34. Other partitions have yielded alpha = 1. With alpha = 0.34, we have about 200 non-zero coefficients.
>
> This doc still has a lot of my development diagnostics, but I wanted to get something to you as soon as possible. I used paths specific to my computer, and more wrangling code needs to be written. I had only one patient with more than one record in the histology file, so I dealt with the dupe manually. I think the final version should handle multiple dupes automatically.
>
> I should be able to try the same code using the germline_sex-estimates next week.
>
> Hope you both are well.

Thanks for working on this! With regard to the duplicate sample, I noticed this and created an issue and this should be fixed in V4 of the data release, which we hope to have online by the end of next week.

It would be great if you wanted to start a pull request with your analysis code, link to this issue, and that way we can make direct comments within the PR :).

bill-amadio · 2019-09-07T21:02:12Z

OK, @jharenza. Will do.

bill-amadio · 2019-09-11T19:09:32Z

Apologies. This is my first multi-participant Git/GitHub project. I am not sure I have the symlinks and the pull request correct. I've reached out to a colleague here at Rider for some hand-holding. Will issue the pull request asap.

jaclyn-taroni · 2019-09-11T19:12:16Z

@bill-amadio No worries at all -- please let us know if there's anything we can do to help!

bill-amadio · 2019-09-12T18:18:12Z

@jaclyn-taroni thank you so much. I would prefer to come to you to setup and launch this first pull request. I can get to you Monday or Wednesday of next week. Is there a time on either of those days that works for you?

jaclyn-taroni · 2019-09-12T18:32:09Z

Hi @bill-amadio - I will send you an email to coordinate.

jaclyn-taroni · 2020-01-02T14:43:27Z

At the moment, this module fails in CI intermittently. See #391. This may indicate an underlying issue with the reproducibility of the analysis, but I'm not sure if it's because of how it is run in CI or a more general issue. It warrants further investigation.

sjspielman · 2020-01-02T16:31:23Z

Probably not the cause of CI failure, but I do have difficulty running the analysis locally. For example, the required library e1071 is never actually loaded for 03-evaluate_model.R. Running within the CI or Docker environment has likely been circumventing this bug, but the code needs to be explicit about which libraries are loaded and used.
EDIT: This was identified running the script while getting angsty for the Docker env to spin up.

jaclyn-taroni · 2020-01-02T16:35:46Z

The e1071 problem likely stems from the way caret is installed (ref: https://cran.r-project.org/web/packages/caret/vignettes/caret.html):

> The package “suggests” field includes 30 packages. caret loads packages as needed and assumes that they are installed. If a modeling package is missing, there is a prompt to install it.
>
> Install caret using `install.packages("caret", dependencies = c("Depends", "Suggests"))` to ensure that all the needed packages are installed.

and loaded, as you point out @sjspielman. I've run into this issue before.

bill-amadio · 2020-01-02T21:18:41Z

No chance the source of the error is in CI. I had this error occur for the first time this week while looking at how predictive accuracy depends on the number of training transcripts. At low levels of training transcripts, all predictions are Male (the positive class for this dataset is Female), and the call to caret::twoClassSummary() fails. Calls to caret::confusionMatrix() work fine. I've put all the twoClassSummary calls and saves of the resulting files inside try() functions. If a particular call fails, then there is no twoClassSummary file for that level of training transcripts in the results folder. This does not cause a problem further downstream with the results notebook. The twoClassSummary reports only three values: ROC, sensitivity and specificity. Sensitivity and specificity are also in the Confusion Matrix output. Can we consider dropping the twoClassSummary from the 03-evaluate_model script? Or substituting a manual calculation and presentation if a test before calling twoClassSummary reveals zero predictions for one of the classes?
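
One way the manual calculation proposed above might look, sketched in Python for illustration only (the module itself uses R and caret). The function name, the returned keys, and the toy labels are hypothetical. The sketch reports sensitivity and specificity from raw counts and flags the case where every prediction falls in a single class, so downstream code can decide how to present that level of training transcripts.

```python
def sensitivity_specificity(truth, predicted, positive="Female"):
    """Compute sensitivity and specificity by hand from paired labels.

    Returns None for a metric whose denominator is zero, and sets
    'single_class_predictions' when only one class was ever predicted.
    """
    tp = fn = tn = fp = 0
    for actual, guess in zip(truth, predicted):
        if actual == positive:
            if guess == positive:
                tp += 1
            else:
                fn += 1
        else:
            if guess == positive:
                fp += 1
            else:
                tn += 1

    return {
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
        "single_class_predictions": len(set(predicted)) < 2,
    }


# Toy example mirroring the failure case: every prediction is Male.
truth = ["Female", "Female", "Male", "Male", "Male"]
all_male = ["Male"] * 5
summary = sensitivity_specificity(truth, all_male)
```

In the all-Male case the sketch still returns a sensitivity of 0 and a specificity of 1, matching what the confusion matrix would show, and the flag records that no Female predictions were made.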

bill-amadio · 2020-01-02T21:25:28Z

I got the e1071 message from caret on my local machine. I may have been doing something that did not make it into the current release. Current scripts contain library(caret), but not library(e1071).

jaclyn-taroni · 2020-02-02T21:46:17Z

Now that this module has a README as of #404, I believe this issue can be closed. If we decide to make changes, we can file a new "Update an analysis" issue. Thanks for all your hard work @bill-amadio, congrats!

bill-amadio · 2020-02-03T13:44:36Z

@jaclyn-taroni It was my pleasure. Thank you for the opportunity and for all your help.

cgreene added the proposed analysis label Aug 28, 2019

cgreene mentioned this issue Aug 28, 2019

Proposed Analysis: Sex prediction on PBTA cohort #73

Closed

jaclyn-taroni added the in progress label Aug 30, 2019

jaclyn-taroni mentioned this issue Oct 9, 2019

Sex prediction from RNASeq #123

Merged

2 tasks

bill-amadio mentioned this issue Oct 18, 2019

01 clean split data #160

Merged

2 tasks

jaclyn-taroni added the transcriptomic label Oct 26, 2019

jaclyn-taroni mentioned this issue Nov 20, 2019

03 evaluate model #276

Merged

2 tasks

This was referenced Dec 3, 2019

CI: Representative reported_gender in stranded subset files #308

Merged

01 02 03 script #290

Merged

bill-amadio mentioned this issue Jan 5, 2020

04 present results #404

Merged

jaclyn-taroni closed this as completed Feb 2, 2020
