RUVseq RNA Batch Correction #247

aadamk · 2022-08-24T18:57:58Z

Purpose/implementation Section

What scientific question is your analysis addressing?

This analysis addresses the issue of apparent batch effects by RNA-library in transcriptome data, correcting specifically for the factors that result in differences across polyA, ribo-deplete, and RNA exome libraries. This analysis should more effectively control for Type I and Type II errors when running differential gene expression analysis by DESeq2.

What was your approach?

This model assumes that differences in housekeeping genes from the HRT atlas v1.0 across DESeq2 comparison groups will specifically be a result of technical, rather than biological, variation, whether in a tumor-only or tumor-normal context. Therefore, RUVg uses HRT atlas genes as a negative control gene set for modeling factors of unwanted variation.

To select the model with the optimal sensitivity/specificity balance, this module requires that users specify a positive control gene set, simply a list of hugo gene names in an rds file containing genes that are expected to differ across comparison groups. A few use cases are specified in the run scripts, namely:

TARGET Neuroblastoma: MYC-N amplified versus WT as specified in molecular_subtype from v10 (now deprecated in v11), using KIM_MYCN_AMPLIFICATION_TARGETS_[UP,DN] as the positive control gene sets from C2 canonical pathways from MSigDb.
Cancer groups 'High-grade glioma/astrocytoma' and 'Diffuse midline glioma', using cancer stemness genes from https://doi.org/10.1186/s12915-022-01324-0, Additional file 4: Table S2.. Under Merged_differential_expression tab, genes whose adj p-value across K27M vs WT day5 < 0.05.
'High-grade glioma/astrocytoma' and 'Diffuse midline glioma' versus GTEx normals, specifying Reactome Cell Cycle pathway genes as positive control genes (non-redundant gene list from Reactome pathway IDs: R-HSA-69278, R-HSA-69620, and R-HSA-1640170). Note: it is recommended to drop the first two factors of variation (drop = 2 parameter in RUVg call) to ensure that biological variation across tumor versus normal is maintained (it is expected that biological variation, rather than RNA_library will be the main source of variation).

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Performs DESeq2 analysis using RUVg-based batch correction.

Key steps:

01-ruvseq-deseq.R will run tumor-only standard differential expression analysis (no RUVg correction, by molecular_subtype) from DESeq2, to compare to RUVg using k=1-5 factors of variation in a batch-corrected DESeq2 module.
02-ruvseq-deseq-tn.R will run tumor-normal standard differential expression analysis (no RUVg correction, by molecular_subtype) from DESeq2, to compare to RUVg using k=1-5 factors of variation in a batch-corrected DESeq2 module.
03-ruvseq-summarization.R will select the batch correction method with the optimal balance of sensitivity and specificity, returning normalized counts .RDS file and show the UMAP clustering of the groups of interest.

Is there anything that you want to discuss further?

A neuroblastoma tumor-normal use case may be valuable to test whether certain cancer genes/biomarkers of interest show the expected patterns across tumor versus normal (e.g. a cohort-level box plot).

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes

Results

What types of results are included (e.g., table, figure)?

Plots: UMAP plots showing clustering of samples before and after batch correction under plots directory. P-value histograms showing the impact on the preponderance of significant calls before and after batch correction are also included.
Output: normalized counts of the optimal model are provided in the output/[dataset]/normalized_counts folder
Output: differential gene expression tabular data are provided for batch corrected and non batch-corrected analyses. Further, RUVseq output are provided for each k should the user desire to compare to optimal output.

What is your summary of the results?

From Neuroblastoma comparison, non-batch corrected DGE analysis resulted in a significant nominal p-value for MYCN expression, but not a significant adjusted p-value. Batch correction using 3 factors of variation (the k chosen by this algorithm), resulted in an adjusted p-value of 0.01 for MYCN, which would be expected in amplified versus WT tumors.
From the HGG/DMG comparison, 3, rather than 2, batches were present in the data (RNA exome, poly-A, ribo-deplete), and the UMAP plot suggests that those batch effects are effectively corrected for at k=2.
From HGG/DMG versus normal, the UMAP plot suggests that tumor and normal remain separated even after batch effect correction, though brought in closer proximity (suggesting that biological variation is preserved, but technical variation may be in part corrected for).

Note: files in code/archive subdirectory can be ignored but are useful to re-run matched sample analysis

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

…s/OpenPedCan-analysis into rnaseq-batch-correct

aadamk · 2022-11-28T19:11:43Z

@aadamk it looks like the latest error is a file subset error - can you check this one out?

Unsure why this is giving an error. This previously ran successfully in the docker environment and is currently running successfully locally as well.

When checking the sample content of the rnaseq rsem expression file from my v11 data download, specifically the few bs_ids that were listed as not found in the CI error, I did indeed find those ids as columns in the counts file.
I removed the dplyr::select clause as certain dplyr functions were giving me issues in the past. I instead switched to base R style subsetting, by finding the indices of the columns matching the bs_ids of interest and subsetting the expression dataframe using brackets, but it still appears that certain bs_ids are not being found given the nature of the error downstream (the subsetted histology ids do not intersect perfectly with the ids in the counts dataframe after subsetting).

Unsure how to proceed - it seems like the data that are downloaded by CI do not match the data that downloaded locally to my machine.

ewafula · 2022-11-28T20:26:59Z

@aadamk it looks like the latest error is a file subset error - can you check this one out?

Unsure why this is giving an error. This previously ran successfully in the docker environment and is currently running successfully locally as well.

When checking the sample content of the rnaseq rsem expression file from my v11 data download, specifically the few bs_ids that were listed as not found in the CI error, I did indeed find those ids as columns in the counts file. I removed the dplyr::select clause as certain dplyr functions were giving me issues in the past. I instead switched to base R style subsetting, by finding the indices of the columns matching the bs_ids of interest and subsetting the expression dataframe using brackets, but it still appears that certain bs_ids are not being found given the nature of the error downstream (the subsetted histology ids do not intersect perfectly with the ids in the counts dataframe after subsetting).

Unsure how to proceed - it seems like the data that are downloaded by CI do not match the data that downloaded locally to my machine.

@aadamk, the test data used by the CI is a subset of the v11 expression matrices and can be downloaded using the following script:

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/dev/scripts/download-testing-files.sh

We might need to create a specific subset for the module if the current CI testing matrix is not suitable. The module that creates this CI testing subset data sets is here:

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/create-subset-files

Cc @jharenza

ewafula · 2022-11-28T20:39:27Z

@aadamk, here is the ticket for developing the corresponding CAVATICA app by the D3b bix-dev team. You can expound on specific details if any.

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/blob/dev/scripts/download-testing-files.sh

aadamk · 2022-11-28T22:21:44Z

We might need to create a specific subset for the module if the current CI testing matrix is not suitable. The module that creates this CI testing subset data sets is here:

https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/analyses/create-subset-files

Thank you for the clarification @ewafula
Yes, I think that would be helpful to allow this module to pass CI. I can create a ticket for this.

ewafula · 2022-11-30T21:28:30Z

@aadamk,

@jharenza suggested the following two options to resolve the issue of the CI failing because of missing bs_ids in the CI testing counts subset matrix:

Only run the tumor-normal script when with CI testing. It will require including an environmental variable OPENPBTA_TESTING in the run_ruvseq.sh bash script to conditionally run the tumor-normal batch correction script and not ignore the tumor-only batch correction script.
Enrich the CI testing counts data matrix with a few bs_ids that will enable the batch correction module to run successfully without failing GA checks.

To me, it seems both options, including the tumor-normal, require enriching the CI testing counts matrix dataset with as follows:

include a few bs_ids with contrasting molecular subtypes in the CI subset counts matrix for the target_nbl and hgg_dmg tumor-only batch correction commands in the bash script (run_ruvseq.sh)
include corresponding cancer_group/gtex_subgroup bs_ids in the CI subset counts matrix for the HGG-DMG_TN tumor-normal batch correction command in the bash script (run_ruvseq.sh)

Here is an example of a list of bs_ids included in the CI testing dataset to enrich for TP53 and NF1 mutations determined by this script in create-subset-files module.

Since the batch correction analysis is upstream, the histologies.tsv utilized in the module needs to be replaced with histologies-base.tsv. It might require temporarily creating the subtype group (i.e., MYCN status via pathology_free_text_diagnosis) and changing to use broad_histoiogy or short_histology and not cancer group in the scripts

@aadamk, would you mind adding more detail, specifically for task 1 in ticket #460 with requirements on how to select bs_ids that would enrich the CI testing counts matrix for the batch correction module? It will be helpful when updating the create-subset-files module.

jharenza · 2023-01-07T20:09:38Z

Hi @aadamk! This passed with the latest updates @devbyaccident did over in #295 and I have now merged #297. Just a few more requests before I approve:

Can you bring the dev version of the dockerfile into your PR (there are some spacing changes which are changing the dockerfile here). Not a huge deal since the docker image will build anyway, but better to keep any dockerfile changes out of this since it is not needed.
Can you update this module in the analysis README, as well as links/pertinent info in the older modules so all relevant batch correction modules are up to date? Eg rna-seq-protocol-ruvseq and rna-seq-protocol-dge have broken links, the latter says in progress. Related, do we need to keep those modules around or does your new module take care of everything from those previous modules?

So close! :)

aadamk · 2023-01-10T16:46:38Z

Hi @aadamk! This passed with the latest updates @devbyaccident did over in #295 and I have now merged #297. Just a few more requests before I approve:

Can you bring the dev version of the dockerfile into your PR (there are some spacing changes which are changing the dockerfile here). Not a huge deal since the docker image will build anyway, but better to keep any dockerfile changes out of this since it is not needed.

Can you update this module in the analysis README, as well as links/pertinent info in the older modules so all relevant batch correction modules are up to date? Eg rna-seq-protocol-ruvseq and rna-seq-protocol-dge have broken links, the latter says in progress. Related, do we need to keep those modules around or does your new module take care of everything from those previous modules?

So close! :)

Hi @jharenza
These are now addressed
For 1, the dockerfile contents from dev have been copied into the branch dockerfile.
For 2, I have updated the readme accordingly. rna-seq-protocol-ruvseq and rna-seq-protocol-dge can be removed or archived as these were pilot analyses rather than generalized DGE utilities.

jharenza · 2023-01-10T18:15:44Z

Hi @aadamk! This passed with the latest updates @devbyaccident did over in #295 and I have now merged #297. Just a few more requests before I approve:

Can you bring the dev version of the dockerfile into your PR (there are some spacing changes which are changing the dockerfile here). Not a huge deal since the docker image will build anyway, but better to keep any dockerfile changes out of this since it is not needed.

Can you update this module in the analysis README, as well as links/pertinent info in the older modules so all relevant batch correction modules are up to date? Eg rna-seq-protocol-ruvseq and rna-seq-protocol-dge have broken links, the latter says in progress. Related, do we need to keep those modules around or does your new module take care of everything from those previous modules?

So close! :)

Hi @jharenza These are now addressed For 1, the dockerfile contents from dev have been copied into the branch dockerfile. For 2, I have updated the readme accordingly. rna-seq-protocol-ruvseq and rna-seq-protocol-dge can be removed or archived as these were pilot analyses rather than generalized DGE utilities.

thanks @aadamk - made a ticket for removal of those

aadamk · 2023-01-11T13:34:38Z

The corrected count matrices and DGE results for HGG-DMG and NBL genes of interest look as expected plus there are very few major changes with other PMTL genes. The results are suitable for DGE.

Hi @afarrel - thank you for this review. can you approve so that we can merge this in?

jharenza · 2023-01-12T12:20:25Z

We can merge this next - we only need one approving review. Cc @ewafula

komalsrathi added 30 commits May 18, 2021 17:41

add rnaseq batch correction module

6140e6f

add more deets to the README

dbf0f4a

align documentation

7fcabd2

align documentation

5d1bc13

use Housekeeping Transcript Atlas

d9220b2

add uwot to Dockerfile

ce603a4

uncomment code

5084699

replace code

1781e53

Merge branch 'rnaseq-batch-correct' of github.com:PediatricOpenTarget…

938e875

…s/OpenPedCan-analysis into rnaseq-batch-correct

tmp

05427ee

update

7d0ab60

add README

5827946

rearrange output

63a27d7

fix README

fc21c8b

add library

ef2307f

update with diff expr hk genes

095d8b0

update README

30cccbf

add boxplots before/after UQ norm

44f1495

add nl

904b307

add combat

7f76fbd

add combat outputs

bbbc229

update README with ComBat details

316a517

add combat and combatseq

bbd849c

add ks test of uniformity

e558a68

add ks test

91ab068

update README

84ad924

use full transcriptome for p-value histogram

1ae1751

pvalue histogram of full transcriptome + controls

f2cbae6

fix design

773b1be

fix coef

080393d

update column subsetting

a53b68e

revert column selection to dplyr

4e629cc

aadamk and others added 5 commits December 20, 2022 11:15

minor fix

f64b3e4

Update RUVg_optimal_k3_normalized_counts.rds

6eb0f6b

Merge branch 'dev' into rnaseq-batch-correct

7434541

Merge branch 'dev' into rnaseq-batch-correct

dac3f4e

Merge branch 'dev' into rnaseq-batch-correct

0ac947b

ewafula and others added 5 commits January 10, 2023 07:49

Merge branch 'dev' into rnaseq-batch-correct

beb55ea

Update Dockerfile

2aff67e

Update Dockerfile

5c811a0

Update Dockerfile

3393398

Update README.md

c59cdfc

jharenza mentioned this pull request Jan 10, 2023

Updated analysis: Remove pilot batch correction modules d3b-center/ticket-tracker-OPC#483

Closed

jharenza approved these changes Jan 10, 2023

View reviewed changes

Merge branch 'dev' into rnaseq-batch-correct

92c132a

jharenza added merge next and removed don't merge labels Jan 12, 2023

ewafula merged commit 3858e9a into dev Jan 12, 2023

jharenza mentioned this pull request Feb 13, 2023

Proposed Analysis: RUVSeq evaluation of matched samples d3b-center/ticket-tracker-OPC#279

Closed

jharenza deleted the rnaseq-batch-correct branch February 19, 2023 02:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

RUVseq RNA Batch Correction #247

RUVseq RNA Batch Correction #247

aadamk commented Aug 24, 2022 •

edited

Loading

aadamk commented Nov 28, 2022 •

edited

Loading

ewafula commented Nov 28, 2022 •

edited

Loading

ewafula commented Nov 28, 2022

aadamk commented Nov 28, 2022

ewafula commented Nov 30, 2022 •

edited

Loading

jharenza commented Jan 7, 2023

aadamk commented Jan 10, 2023

jharenza commented Jan 10, 2023

aadamk commented Jan 11, 2023

jharenza commented Jan 12, 2023

RUVseq RNA Batch Correction #247

RUVseq RNA Batch Correction #247

Conversation

aadamk commented Aug 24, 2022 • edited Loading

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

aadamk commented Nov 28, 2022 • edited Loading

ewafula commented Nov 28, 2022 • edited Loading

ewafula commented Nov 28, 2022

aadamk commented Nov 28, 2022

ewafula commented Nov 30, 2022 • edited Loading

jharenza commented Jan 7, 2023

aadamk commented Jan 10, 2023

jharenza commented Jan 10, 2023

aadamk commented Jan 11, 2023

jharenza commented Jan 12, 2023

aadamk commented Aug 24, 2022 •

edited

Loading

aadamk commented Nov 28, 2022 •

edited

Loading

ewafula commented Nov 28, 2022 •

edited

Loading

ewafula commented Nov 30, 2022 •

edited

Loading