Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

RUVseq RNA Batch Correction #247

Merged
merged 139 commits into from
Jan 12, 2023
Merged

RUVseq RNA Batch Correction #247

merged 139 commits into from
Jan 12, 2023

Conversation

aadamk
Copy link

@aadamk aadamk commented Aug 24, 2022

Purpose/implementation Section

What scientific question is your analysis addressing?

This analysis addresses the issue of apparent batch effects by RNA-library in transcriptome data, correcting specifically for the factors that result in differences across polyA, ribo-deplete, and RNA exome libraries. This analysis should more effectively control for Type I and Type II errors when running differential gene expression analysis by DESeq2.

What was your approach?

This model assumes that differences in housekeeping genes from the HRT atlas v1.0 across DESeq2 comparison groups will specifically be a result of technical, rather than biological, variation, whether in a tumor-only or tumor-normal context. Therefore, RUVg uses HRT atlas genes as a negative control gene set for modeling factors of unwanted variation.

To select the model with the optimal sensitivity/specificity balance, this module requires that users specify a positive control gene set, simply a list of hugo gene names in an rds file containing genes that are expected to differ across comparison groups. A few use cases are specified in the run scripts, namely:

  1. TARGET Neuroblastoma: MYC-N amplified versus WT as specified in molecular_subtype from v10 (now deprecated in v11), using KIM_MYCN_AMPLIFICATION_TARGETS_[UP,DN] as the positive control gene sets from C2 canonical pathways from MSigDb.
  2. Cancer groups 'High-grade glioma/astrocytoma' and 'Diffuse midline glioma', using cancer stemness genes from https://doi.org/10.1186/s12915-022-01324-0, Additional file 4: Table S2.. Under Merged_differential_expression tab, genes whose adj p-value across K27M vs WT day5 < 0.05.
  3. 'High-grade glioma/astrocytoma' and 'Diffuse midline glioma' versus GTEx normals, specifying Reactome Cell Cycle pathway genes as positive control genes (non-redundant gene list from Reactome pathway IDs: R-HSA-69278, R-HSA-69620, and R-HSA-1640170). Note: it is recommended to drop the first two factors of variation (drop = 2 parameter in RUVg call) to ensure that biological variation across tumor versus normal is maintained (it is expected that biological variation, rather than RNA_library will be the main source of variation).

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Performs DESeq2 analysis using RUVg-based batch correction.

Key steps:

  1. 01-ruvseq-deseq.R will run tumor-only standard differential expression analysis (no RUVg correction, by molecular_subtype) from DESeq2, to compare to RUVg using k=1-5 factors of variation in a batch-corrected DESeq2 module.
  2. 02-ruvseq-deseq-tn.R will run tumor-normal standard differential expression analysis (no RUVg correction, by molecular_subtype) from DESeq2, to compare to RUVg using k=1-5 factors of variation in a batch-corrected DESeq2 module.
  3. 03-ruvseq-summarization.R will select the batch correction method with the optimal balance of sensitivity and specificity, returning normalized counts .RDS file and show the UMAP clustering of the groups of interest.

Is there anything that you want to discuss further?

A neuroblastoma tumor-normal use case may be valuable to test whether certain cancer genes/biomarkers of interest show the expected patterns across tumor versus normal (e.g. a cohort-level box plot).

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

  1. Plots: UMAP plots showing clustering of samples before and after batch correction under plots directory. P-value histograms showing the impact on the preponderance of significant calls before and after batch correction are also included.
  2. Output: normalized counts of the optimal model are provided in the output/[dataset]/normalized_counts folder
  3. Output: differential gene expression tabular data are provided for batch corrected and non batch-corrected analyses. Further, RUVseq output are provided for each k should the user desire to compare to optimal output.

What is your summary of the results?

  1. From Neuroblastoma comparison, non-batch corrected DGE analysis resulted in a significant nominal p-value for MYCN expression, but not a significant adjusted p-value. Batch correction using 3 factors of variation (the k chosen by this algorithm), resulted in an adjusted p-value of 0.01 for MYCN, which would be expected in amplified versus WT tumors.
  2. From the HGG/DMG comparison, 3, rather than 2, batches were present in the data (RNA exome, poly-A, ribo-deplete), and the UMAP plot suggests that those batch effects are effectively corrected for at k=2.
  3. From HGG/DMG versus normal, the UMAP plot suggests that tumor and normal remain separated even after batch effect correction, though brought in closer proximity (suggesting that biological variation is preserved, but technical variation may be in part corrected for).

Note: files in code/archive subdirectory can be ignored but are useful to re-run matched sample analysis

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

@aadamk
Copy link
Author

aadamk commented Nov 28, 2022

@aadamk it looks like the latest error is a file subset error - can you check this one out?

Unsure why this is giving an error. This previously ran successfully in the docker environment and is currently running successfully locally as well.

When checking the sample content of the rnaseq rsem expression file from my v11 data download, specifically the few bs_ids that were listed as not found in the CI error, I did indeed find those ids as columns in the counts file.
I removed the dplyr::select clause as certain dplyr functions were giving me issues in the past. I instead switched to base R style subsetting, by finding the indices of the columns matching the bs_ids of interest and subsetting the expression dataframe using brackets, but it still appears that certain bs_ids are not being found given the nature of the error downstream (the subsetted histology ids do not intersect perfectly with the ids in the counts dataframe after subsetting).

Unsure how to proceed - it seems like the data that are downloaded by CI do not match the data that downloaded locally to my machine.

@ewafula
Copy link

ewafula commented Nov 28, 2022

@aadamk it looks like the latest error is a file subset error - can you check this one out?

Unsure why this is giving an error. This previously ran successfully in the docker environment and is currently running successfully locally as well.

When checking the sample content of the rnaseq rsem expression file from my v11 data download, specifically the few bs_ids that were listed as not found in the CI error, I did indeed find those ids as columns in the counts file. I removed the dplyr::select clause as certain dplyr functions were giving me issues in the past. I instead switched to base R style subsetting, by finding the indices of the columns matching the bs_ids of interest and subsetting the expression dataframe using brackets, but it still appears that certain bs_ids are not being found given the nature of the error downstream (the subsetted histology ids do not intersect perfectly with the ids in the counts dataframe after subsetting).

Unsure how to proceed - it seems like the data that are downloaded by CI do not match the data that downloaded locally to my machine.

@aadamk, the test data used by the CI is a subset of the v11 expression matrices and can be downloaded using the following script:

We might need to create a specific subset for the module if the current CI testing matrix is not suitable. The module that creates this CI testing subset data sets is here:

Cc @jharenza

@ewafula
Copy link

ewafula commented Nov 28, 2022

@aadamk, here is the ticket for developing the corresponding CAVATICA app by the D3b bix-dev team. You can expound on specific details if any.

@aadamk
Copy link
Author

aadamk commented Nov 28, 2022

We might need to create a specific subset for the module if the current CI testing matrix is not suitable. The module that creates this CI testing subset data sets is here:

Thank you for the clarification @ewafula
Yes, I think that would be helpful to allow this module to pass CI. I can create a ticket for this.

@ewafula
Copy link

ewafula commented Nov 30, 2022

@aadamk,

@jharenza suggested the following two options to resolve the issue of the CI failing because of missing bs_ids in the CI testing counts subset matrix:

  1. Only run the tumor-normal script when with CI testing. It will require including an environmental variable OPENPBTA_TESTING in the run_ruvseq.sh bash script to conditionally run the tumor-normal batch correction script and not ignore the tumor-only batch correction script.
  2. Enrich the CI testing counts data matrix with a few bs_ids that will enable the batch correction module to run successfully without failing GA checks.

To me, it seems both options, including the tumor-normal, require enriching the CI testing counts matrix dataset with as follows:

  1. include a few bs_ids with contrasting molecular subtypes in the CI subset counts matrix for the target_nbl and hgg_dmg tumor-only batch correction commands in the bash script (run_ruvseq.sh)
  2. include corresponding cancer_group/gtex_subgroup bs_ids in the CI subset counts matrix for the HGG-DMG_TN tumor-normal batch correction command in the bash script (run_ruvseq.sh)

Here is an example of a list of bs_ids included in the CI testing dataset to enrich for TP53 and NF1 mutations determined by this script in create-subset-files module.

Since the batch correction analysis is upstream, the histologies.tsv utilized in the module needs to be replaced with histologies-base.tsv. It might require temporarily creating the subtype group (i.e., MYCN status via pathology_free_text_diagnosis) and changing to use broad_histoiogy or short_histology and not cancer group in the scripts

@aadamk, would you mind adding more detail, specifically for task 1 in ticket #460 with requirements on how to select bs_ids that would enrich the CI testing counts matrix for the batch correction module? It will be helpful when updating the create-subset-files module.

@jharenza
Copy link
Member

jharenza commented Jan 7, 2023

Hi @aadamk! This passed with the latest updates @devbyaccident did over in #295 and I have now merged #297. Just a few more requests before I approve:

  1. Can you bring the dev version of the dockerfile into your PR (there are some spacing changes which are changing the dockerfile here). Not a huge deal since the docker image will build anyway, but better to keep any dockerfile changes out of this since it is not needed.
  2. Can you update this module in the analysis README, as well as links/pertinent info in the older modules so all relevant batch correction modules are up to date? Eg rna-seq-protocol-ruvseq and rna-seq-protocol-dge have broken links, the latter says in progress. Related, do we need to keep those modules around or does your new module take care of everything from those previous modules?

So close! :)

@aadamk
Copy link
Author

aadamk commented Jan 10, 2023

Hi @aadamk! This passed with the latest updates @devbyaccident did over in #295 and I have now merged #297. Just a few more requests before I approve:

  1. Can you bring the dev version of the dockerfile into your PR (there are some spacing changes which are changing the dockerfile here). Not a huge deal since the docker image will build anyway, but better to keep any dockerfile changes out of this since it is not needed.
  2. Can you update this module in the analysis README, as well as links/pertinent info in the older modules so all relevant batch correction modules are up to date? Eg rna-seq-protocol-ruvseq and rna-seq-protocol-dge have broken links, the latter says in progress. Related, do we need to keep those modules around or does your new module take care of everything from those previous modules?

So close! :)

Hi @jharenza
These are now addressed
For 1, the dockerfile contents from dev have been copied into the branch dockerfile.
For 2, I have updated the readme accordingly. rna-seq-protocol-ruvseq and rna-seq-protocol-dge can be removed or archived as these were pilot analyses rather than generalized DGE utilities.

@jharenza
Copy link
Member

Hi @aadamk! This passed with the latest updates @devbyaccident did over in #295 and I have now merged #297. Just a few more requests before I approve:

  1. Can you bring the dev version of the dockerfile into your PR (there are some spacing changes which are changing the dockerfile here). Not a huge deal since the docker image will build anyway, but better to keep any dockerfile changes out of this since it is not needed.
  2. Can you update this module in the analysis README, as well as links/pertinent info in the older modules so all relevant batch correction modules are up to date? Eg rna-seq-protocol-ruvseq and rna-seq-protocol-dge have broken links, the latter says in progress. Related, do we need to keep those modules around or does your new module take care of everything from those previous modules?

So close! :)

Hi @jharenza These are now addressed For 1, the dockerfile contents from dev have been copied into the branch dockerfile. For 2, I have updated the readme accordingly. rna-seq-protocol-ruvseq and rna-seq-protocol-dge can be removed or archived as these were pilot analyses rather than generalized DGE utilities.

thanks @aadamk - made a ticket for removal of those

@aadamk
Copy link
Author

aadamk commented Jan 11, 2023

The corrected count matrices and DGE results for HGG-DMG and NBL genes of interest look as expected plus there are very few major changes with other PMTL genes. The results are suitable for DGE.

Hi @afarrel - thank you for this review. can you approve so that we can merge this in?

@jharenza
Copy link
Member

We can merge this next - we only need one approving review. Cc @ewafula

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants