Skip to content
This repository has been archived by the owner on Jun 21, 2023. It is now read-only.

Proposed Analysis: Comparative RNA-Seq analysis #229

Open
GeoffLyle opened this issue Nov 5, 2019 · 16 comments
Open

Proposed Analysis: Comparative RNA-Seq analysis #229

GeoffLyle opened this issue Nov 5, 2019 · 16 comments
Labels
in progress Someone is working on this issue, but feel free to propose an alternative approach! proposed analysis transcriptomic Related to or requires transcriptomic data

Comments

@GeoffLyle
Copy link

Scientific goals

  • Create correlation matrices for polyA samples and ribodeplete (stranded) samples using gene expression data.
  • Generate gene outlier thresholds for ribodeplete samples.
  • List outlier genes for each ribodeplete sample.
  • Discover trends in outlier genes by tumor subtype.
  • Investigate possible targeted therapeutics based on outlier expression profile.

Proposed methods

  • The correlations will be calculated using pairwise Spearman correlation of the RNA-Seq gene expression profiles.
  • The gene expression profile data will be filtered using a developed method described in Vaske et al.
    Note: This process is the precursor to creating a TumorMap (https://tumormap.ucsc.edu/). While programmatic creation of a TumorMap is not currently open access, we can add instructions for users on how to manually generate a TumorMap on the website.
  • Outlier thresholds will be calculated according to the Tukey method.
  • Discovery of outlier genes and possible targeted therapeutics will be based on the Treehouse CARE method. https://github.com/UCSC-Treehouse/CARE
    • The CARE method will be used to find outliers using the entire population as a cohort as well as subsamples from the population to disentangle sample specific outliers from tumor subtype outliers.
  • Application of novel transfer learning method trained on CCLE response data to Open-PBTA samples will predict response to therapeutics.

Required input data

pbta-gene-expression-kallisto.polya.rds
pbta-gene-expression-kallisto.stranded.rds

Is it possible to get RSEM TPM values for this data?
That would make it easier for us to adapt existing algorithms to work with the Open-PBTA data.

Proposed timeline

8 weeks

Relevant literature

Method published in [Vaske et al. Jama Open Network. 2019.] (https://jamanetwork.com/journals/jamanetworkopen/article-abstract/2753519)

@jharenza
Copy link
Collaborator

jharenza commented Nov 5, 2019

hi @GeoffLyle! Thanks for proposing this analysis! We currently have RSEM FPKM values available in our data release: pbta-gene-expression-rsem-fpkm.polya.rds and pbta-gene-expression-rsem-fpkm.stranded.rds. Since FPKM and TPM from RSEM are highly correlated, will this do, or does your analysis require TPM? If the latter, I can file an issue to add the TPM data during our next sprint, which starts Nov. 13 and the data could be expected sometime around the end of next week.

@GeoffLyle
Copy link
Author

@jharenza I believe our analysis could work with the FPKM or Kallisto TPM values. However, our current workflow is set up tot use RSEM TPM values and our learning model was trained on TPM data. Since our team is much more comfortable working with RSEM TPM, it would be great if that data could be provided. By the end of next week works perfectly well for our timeline. Thanks!

@jharenza
Copy link
Collaborator

jharenza commented Nov 5, 2019

OK, will add to our sprint, thanks!

@GeoffLyle
Copy link
Author

Hi @jharenza, I talked with our bioinformatician and she informed me that we are interested in both Transcript Level and Gene Level RSEM TPM. Would it possible to generate both of these files during your sprint?

@jharenza
Copy link
Collaborator

jharenza commented Nov 7, 2019

@GeoffLyle will do both!

@jaclyn-taroni
Copy link
Member

Hi @GeoffLyle I'm looking forward to these analyses. Thank you for proposing them and thanks @jharenza for your team's help with the RSEM TPM.

I wanted to ask about how you were planning to split up the analyses for submission. Would each bullet point under scientific goals be it's own pull request or will some steps be grouped together?

I also wanted to note that we are planning to flesh out the molecular subtype information (#19) which may be of scientific interest here. If so, the column molecular_subtype in the pbta-histologies.tsv that is included in the data download would be the one that gets filled in with more information and is therefore the one you may want use during development.

@GeoffLyle
Copy link
Author

@jaclyn-taroni Our submission of analyses will follow the scientific goals bullet points. Our first analysis will definitely be Create correlation matrices for polyA samples and ribodeplete (stranded) samples using gene expression data. The Generate gene outlier thresholds for ribodeplete samples. and List outlier genes for each ribodeplete sample. may be two separate analyses or combined into one pull request depending on the speed and ease of engineering our algorithm to work with the data. The Discover trends in outlier genes by tumor subtype. and Investigate possible targeted therapeutics based on outlier expression profile. are broad goals we are still determining how to implement. Typically this analysis is done on a N-of-1 patient sample so I will have to explore the data to determine what trends there are and how best to present them. Also we have some modeling software our graduate students have been working on that we would like to adapt to the Open-PBTA data and submit as an analysis.

In short, the first 3 scientific goals will be submitted as 2 or 3 pull requests. The last two scientific goals may be submitted in 2+ pull requests.

Thanks for the heads up on the pbta-histologies.tsv update. We planned to look into differences by disease and histology. It would be interesting to see if there is a signal when grouping by molecular subtype.

@jaclyn-taroni jaclyn-taroni added the in progress Someone is working on this issue, but feel free to propose an alternative approach! label Nov 8, 2019
@jaclyn-taroni
Copy link
Member

Sounds good @GeoffLyle, thank you! If you have any questions that pop up as you're implementing the correlation matrices step, please let us know.

@jharenza
Copy link
Collaborator

jharenza commented Dec 2, 2019

Hi @GeoffLyle - hopefully you have been following, but the TPM data you requested became available with release V10 (#273). Excited to see your PRs come through!

@GeoffLyle
Copy link
Author

@jharenza Thank you! Saw the V10 release. I'll let the team know about the V11 release. Should be a good test to ensure everything still runs correctly.

@jharenza
Copy link
Collaborator

jharenza commented Dec 2, 2019

@GeoffLyle ahh spoke too soon and corrected it. @jaclyn-taroni is running V11 tests currently, then we will merge once everything works properly :).

@jaclyn-taroni
Copy link
Member

Hi @GeoffLyle and @e-t-k, I wanted to check in. Do you have an idea of when you expect to file a pull request for the outlier thresholds and outlier gene lists? Also, if you are working with the release-v12-20191217 data, I wanted to make you aware that there are stranded poly-A samples in the stranded dataset files, which are mostly comprised of ribodepleted samples (see #374) though this very well may have been apparent from your results. Those samples will come out in the v13 release, which is planned for next week (#373).

@e-t-k
Copy link
Contributor

e-t-k commented Jan 15, 2020 via email

@jaclyn-taroni jaclyn-taroni added the transcriptomic Related to or requires transcriptomic data label Jan 18, 2020
@jaclyn-taroni
Copy link
Member

Hi @e-t-k,

Thanks for your reply!

I've been working on some other priorities recently, so current estimate for this is early Feb, but I could look into making it sooner if that's behind why you're asking.

We are looking to wind down the analysis phase of the project and prepare the first draft of the OpenPBTA manuscript relatively soon.

If you would like this analysis to be included in the first version of the manuscript, I would recommend filing a pull request as soon as its possible. There is the option of including this analysis for revisions. (For the Deep Review paper that @cgreene previously organized, they continued to add things during revisions beyond what reviewers asked for.)

There also may be more lag time between when a pull request is initially filed and when it is reviewed as we move our focus to writing the manuscript.

Whatever you decide works for us, but I wanted to be transparent about what the next few months of the project will look like! We'd love to include this in the first version if the timing works out.

And thank you for the heads up regarding the stranded PolyA samples; your exclusion of them in the next release works just fine for us.

release-v13-20200116 went out yesterday (#444), so if you're up to date with the AlexsLemonade master branch you'll be able to snag that by rerunning the download bash script.

@GeoffLyle also noticed there are some cell line samples in the data set that we'll want to exclude as well, which we were expecting to just implement as a filtering step.

Ah yes, sounds good. If you're looking to limit your analysis to tumor samples, the pattern that we've used throughout the project is to filter the pbta-histologies.tsv file to rows where sample_type is Tumor and composition is Solid Tissue. (Here's an example in R:

.) If you further filter such that all rows are RNA-Seq in experimental_strategy, the Kids_First_Biospecimen_ID will contain all the assay identifiers you're looking for. You may have gotten to the bottom of this already, but the presence of cell lines and different identifiers tripped me up when I first started working with the OpenPBTA data ☺️

e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 5, 2020
utils.py - write_rds accidentally modified its input dataframe; now makes
a copy before changing it. oops.
e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 5, 2020
adds candidate final draft for step 2 - generate outlier thresholds and outliers.
also adds it to continuous integration
e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 5, 2020
TODO: check in a markdown viewer for typoes
e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 5, 2020
@e-t-k e-t-k mentioned this issue Feb 5, 2020
5 tasks
jaclyn-taroni pushed a commit that referenced this issue Feb 7, 2020
* [#229] comparative rnaseq - utils bugfix

utils.py - write_rds accidentally modified its input dataframe; now makes
a copy before changing it. oops.

* [#229] comparative rna-seq

adds candidate final draft for step 2 - generate outlier thresholds and outliers.
also adds it to continuous integration

* [#229] comparative-rnaseq-analysis: add README.md

TODO: check in a markdown viewer for typoes

* [#229] comparative-rnaseq-analysis: README cleanup

* [#513] comparative rnasq--address reviewer comments

Addresses @jashapiro's comments in #513 .
In particular:

Step 01: save normalized_expression to file which I forgot to do earlier.

Step 02:
- ingests normalized expression file from step 1 instead of original expression file
- parameters updated to reflect this
- normalize_expression() no longer needed

- file path handling all moved to main()

- Change output file format - each cell now contains a 4-character string in a fixed order which encodes the analysis results.
- update comments at top to reflect this

- Output file is now a .tsv.gz instead of .rds and is included in this pull.

- no longer saves thresholds file as it's not ingested anyhwere.
e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 19, 2020
e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 19, 2020
e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 20, 2020
Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and
crash if they didn't.
Also call out any samples that didn't have MEND QC files (these will be treated as
fails and not included in the dataset).
e-t-k added a commit to UCSC-Treehouse/OpenPBTA-analysis that referenced this issue Feb 20, 2020
…ts tsv

Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files and therefore outlier results file changed between script reruns. To improve reproducibility, pin column order (ie samples order) to the order of the original .Rds expression matrix.

Includes new results file which has data identical to previous, but columns in correct order.
jaclyn-taroni pushed a commit that referenced this issue Feb 20, 2020
…s); convert scratch files to .feather (#544)

* update README (TODO: proofread markdown)

* implement pull 3: sample filters and general cleanup

- add tsv and feather IO functions to utils

- add sample filters to step 01: check MEND qc status, and use pbta-histologies to filter to only tumor samples
- For dropped genes, note if dropped by Expression filter or Variance filter
- all internal .rds files converted to .feather
- all path handling in step 01 moved to main()

* [#323] update CI file, dockerfile, results file

for comparative-rnaseq-analysis:
- update command lines in CI file; add dependency to dockerfile;
add new version of outlier results tsv.gz file with sample filters applied.

* [#229] tidy up README markdown.

* [#229] analyses/README update comparative-rna-seq

* [#229] check MEND QC files to confirm parsed ok

Per jashapiro's comments, double-check that MEND QC files parsed into PASS or FAIL and
crash if they didn't.
Also call out any samples that didn't have MEND QC files (these will be treated as
fails and not included in the dataset).

* [#229] maintain consistent column order in outlier results tsv

Due to filtering columns on a set() in step 01 filter_samples, the column order of intermediate .feather files and therefore outlier results file changed between script reruns. To improve reproducibility, pin column order (ie samples order) to the order of the original .Rds expression matrix.

Includes new results file which has data identical to previous, but columns in correct order.
@GeoffLyle
Copy link
Author

@jaclyn-taroni We want to add our cohort finding script to this analysis. As it uses the correlation matrix that is already created for finding the gene outlier status of samples, I believe it would fit into this analysis rather than being a separate proposed analysis.

The idea is to calculate the Spearman distance between a focus sample and all other samples. If the Spearman distance is above a certain threshold (typically above the 95th percentile of all Spearman distances), those samples are added to a cohort called "First-Degree Neighbors". A second cohort called "First and Second-Degree Neighbors" contains the first degree neighbors as well as those samples first degree neighbors. A third cohort is "Diseases of top 6 samples", which is a cohort containing all samples with matching diseases as the top 6 samples with the highest Spearman score.

This will involve adapting a currently used script as well as harmonizing the diseases found in the "disease_type_new" column from pbta_histologies.tsv. We believe this level of analysis will not cause any issues with the CI.

@jaclyn-taroni
Copy link
Member

Hi @GeoffLyle, adding what you've described to analyses/comparative-RNASeq-analysis/ makes sense to me!

We believe this level of analysis will not cause any issues with the CI.

To confirm, you believe that it will run with 8GB RAM and in a reasonable amount of time in CI (10 minutes or less without some kind of output), is that correct?

I wanted to point you to how to obtain the files we use for CI in case you'd like to do some further testing: https://github.com/AlexsLemonade/OpenPBTA-analysis#working-with-the-subset-files-used-in-ci-locally

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
in progress Someone is working on this issue, but feel free to propose an alternative approach! proposed analysis transcriptomic Related to or requires transcriptomic data
Projects
None yet
Development

No branches or pull requests

4 participants