-
Notifications
You must be signed in to change notification settings - Fork 67
Proposed Analysis: RNA Expression-based Prediction of Sex #84
Comments
I would like to give this a try. pbta-gene-expression-kallisto.rds looks like a good first source of gene expression estimates (1028 samples of 200,000+ transcript abundance values). pbta-histologies.tsv has an entry for each of these samples. Model construction will be with glmnet and glmnetUtils packages which allow tuning, through cross validation, of the regularization weight (lambda) and elastic net weight (alpha) simultaneously. |
Welcome @bill-amadio - thanks for getting this analysis started! We can use this issue to discuss strategy and address any questions around the analysis you have. |
@bill-amadio - very exciting! This plan sounds good - in the interests of compute time you may find that filtering to the transcripts that vary the most in an unsupervised way (something like the median absolute deviation) may make things much faster to run with little to no loss in accuracy. I'm looking forward to hearing how this goes! |
Thanks @jaclyn-taroni and @cgreene - Quick question: are the calls from germline sequencing 100% accurate? If not, is there some estimate of the uncertainty in the process? |
I don't know that we could say they are 100% accurate. We don't have that level of information, though they should be more closely aligned than the reported gender. The way I'm guessing that is by looking at the powerpoint linked in this comment from @jharenza #73 (comment). In slide 4 it looks like there are relatively few samples in the regions between the two distributions, and these are expected to be due to sample contamination. |
Hi, Jackie and Casey, I've attached a first result, a predictive model for reported_gender using the top 25% mean absolute deviation transcripts out of the kallisto V3 file. I split the data set 70/30 train/test, accuracy on the holdout was 96.7%. The elastic net alpha on this partition was 0.34. Other partitions have yielded alpha = 1. With alpha = 0.34, we have about 200 non-zero coefficients. This doc still has a lot of my development diagnostics, but I wanted to get something to you as soon as possible. I used paths specific to my computer, and more wrangling code needs to be written. I had only one patient with more than one record in the histology file, so I dealt with the dupe manually. I think the final version should handle multiple dupes automatically. I should be able to try the same code using the germline_sex-estimates next week. Hope you both are well. |
That's right, @bill-amadio - there was only one sample which landed between the two distributions and we had removed it from our dataset due to a previous QC using NGSCheckmate on its tumor/normal WGS data. We are presuming that germline sample was contaminated, hence why the fail on NGScheckmate and the mid-value for the sex prediction. Great to have you working on the RNA side of this! |
Thanks for working on this! With regard to the duplicate sample, I noticed this and created an issue and this should be fixed in V4 of the data release, which we hope to have online by the end of next week. It would be great if you wanted to start a pull request with your analysis code, link to this issue, and that way we can make direct comments within the PR :). |
OK, @jharenza. Will do. |
Apologies. This is my first multi-participant Git/GitHub project. I am not sure I have the symlinks and the pull request correct. I've reached out to a colleague here at Rider for some hand-holding. Will issue the pull request asap. |
@bill-amadio No worries at all -- please let us know if there's anything we can do to help! |
@jaclyn-taroni thank you so much. I would prefer to come to you to setup and launch this first pull request. I can get to you Monday or Wednesday of next week. Is there a time on either of those days that works for you? |
Hi @bill-amadio - I will send you an email to coordinate. |
At the moment, this module fails in CI intermittently. See #391. This may indicate an underlying issue with the reproducibility of the analysis but I'm not sure if it's because of how it is run in CI or a more general issue. It warrants further investigation |
Probably not the cause of CI failure, but I do have difficulty running the analysis locally. For example, the required library |
The
and loaded, as you point out @sjspielman. I've run into this issue before. |
No chance the source of the error is in CI.
I had this error occur for the first time this week while looking at how
predictive accuracy depends on the number of training transcripts. At low
levels of training transcripts, all predictions are Male (the positive
class for this dataset is Female), and the call to caret::twoClassSummary()
fails. Calls to caret::confusionMatrix() work fine.
I've put all the twoClassSummary calls and saves of the resulting files
inside try() functions. If a particular call fails, then there is no
twoClassSummary file for that level of training transcripts in the results
folder. This does not cause a problem further downstream with the results
notebook.
The twoClassSummary reports only three values: ROC, sensitivity and
specificity. Sensitivity and specificity are also in the Confusion Matrix
output. Can we consider dropping the twoClassSummary from the
03-evaluate_model script? Or substituting a manual calculation and
presentation if a test before calling twoClassSummary reveals zero
predictions for one of the classes?
…On Thu, Jan 2, 2020 at 11:35 AM Jaclyn Taroni ***@***.***> wrote:
The e1071 problem likely stems from the way caret is installed (ref
<https://cran.r-project.org/web/packages/caret/vignettes/caret.html>)
The package “suggests” field includes 30 packages. caret loads packages as
needed and assumes that they are installed. If a modeling package is
missing, there is a prompt to install it.
Install caret using
install.packages("caret", dependencies = c("Depends", "Suggests"))
to ensure that all the needed packages are installed.
and loaded, as you point out @sjspielman <https://github.com/sjspielman>.
I've run into this issue before.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#84?email_source=notifications&email_token=ACYCOYYW2SRG7AHN6SYVFUTQ3YJWHA5CNFSM4IRHVV2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEH6YHRA#issuecomment-570262468>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACYCOYYBCSWC4EOVVQNYKTTQ3YJWHANCNFSM4IRHVV2A>
.
|
I got the e1071 message from caret on my local machine. I may have been
doing something that did not make it into the current release. Current
scripts contain library(caret), but not library(e1071).
…On Thu, Jan 2, 2020 at 11:35 AM Jaclyn Taroni ***@***.***> wrote:
The e1071 problem likely stems from the way caret is installed (ref
<https://cran.r-project.org/web/packages/caret/vignettes/caret.html>)
The package “suggests” field includes 30 packages. caret loads packages as
needed and assumes that they are installed. If a modeling package is
missing, there is a prompt to install it.
Install caret using
install.packages("caret", dependencies = c("Depends", "Suggests"))
to ensure that all the needed packages are installed.
and loaded, as you point out @sjspielman <https://github.com/sjspielman>.
I've run into this issue before.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#84?email_source=notifications&email_token=ACYCOYYW2SRG7AHN6SYVFUTQ3YJWHA5CNFSM4IRHVV2KYY3PNVWWK3TUL52HS4DFVREXG43VMVBW63LNMVXHJKTDN5WW2ZLOORPWSZGOEH6YHRA#issuecomment-570262468>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/ACYCOYYBCSWC4EOVVQNYKTTQ3YJWHANCNFSM4IRHVV2A>
.
|
Now that this module has a README as of #404, I believe this issue can be closed and if we decide to make changes, we can file a new |
@jaclyn-taroni It was my pleasure. Thank you for the opportunity and for all your help. |
Scientific goals
In #73 it's noted that the reported sex of the participants in the study didn't align with information from germline sequencing in 11 cases. It could be interesting to understand how accurately this can be called from gene expression data. For some studies, gene expression data is all that is available. If it can also be accurately called directly from gene expression data, these studies will be better positioned to evaluate their metadata.
Proposed methods
Required input data
For classifier construction:
For evaluation:
Proposed timeline
I am proposing this analysis but don't have time to do it, so I would leave this as an estimate for someone who decides to take this on.
Relevant literature
There are a number of reports of this being readily discoverable, even with unsupervised methods. These are two from our group where we noticed this, but there are also others from other groups:
The text was updated successfully, but these errors were encountered: