RUVseq RNA Batch Correction #247
…s/OpenPedCan-analysis into rnaseq-batch-correct
I'm unsure why this is giving an error. This previously ran successfully in the Docker environment and is currently running successfully locally as well. When I checked the rnaseq rsem expression file from my v11 data download for the few bs_ids listed as not found in the CI error, I did find those ids as columns in the counts file. I'm unsure how to proceed: it seems that the data downloaded by CI do not match the data downloaded locally to my machine.
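One way to confirm this kind of mismatch is to compare the bs_ids the module needs against the column header of each copy of the counts matrix. The sketch below is a minimal illustration only: it assumes the matrix has been exported to a tab-separated file whose first line holds the sample IDs, and the function name and the example IDs are hypothetical.

```python
def missing_sample_ids(header_line, required_ids, sep="\t"):
    """Return the required sample IDs that are absent from a matrix header line.

    header_line  : first line of a delimited counts matrix (assumed export format)
    required_ids : bs_ids the analysis expects to find as columns
    """
    columns = set(header_line.rstrip("\n").split(sep))
    # Preserve the caller's order so the report is easy to compare with a CI log.
    return [sample_id for sample_id in required_ids if sample_id not in columns]


# Hypothetical headers for the local download and the CI subset.
local_header = "gene_id\tBS_AAA\tBS_BBB\tBS_CCC\n"
ci_header = "gene_id\tBS_AAA\n"
needed = ["BS_AAA", "BS_BBB", "BS_CCC"]

print(missing_sample_ids(local_header, needed))
print(missing_sample_ids(ci_header, needed))
```

Running the same check against both files shows whether the IDs are missing only from the CI subset.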
@aadamk, the test data used by the CI is a subset of the v11 expression matrices and can be downloaded using the following script: We might need to create a specific subset for the module if the current CI testing matrix is not suitable. The module that creates these CI testing subset data sets is here: Cc @jharenza
@aadamk, here is the ticket for developing the corresponding CAVATICA app by the D3b bix-dev team. You can expound on specific details if any.
Thank you for the clarification @ewafula
@jharenza suggested the following two options to resolve the issue of the CI failing because of missing bs_ids in the CI testing counts subset matrix:
To me, it seems both options, including the tumor-normal, require enriching the CI testing counts matrix dataset as follows:
Here is an example of a list of bs_ids included in the CI testing dataset to enrich for TP53 and NF1 mutations determined by this script in Since the batch correction analysis is upstream, the @aadamk, would you mind adding more detail, specifically for task 1 in ticket #460 with requirements on how to select bs_ids that would enrich the CI testing counts matrix for the batch correction module? It will be helpful when updating the create-subset-files module. |
Hi @aadamk! This passed with the latest updates @devbyaccident did over in #295 and I have now merged #297. Just a few more requests before I approve:
So close! :)
Hi @jharenza
Hi @afarrel - thank you for this review. Can you approve so that we can merge this in?
We can merge this next - we only need one approving review. Cc @ewafula
Purpose/implementation Section
What scientific question is your analysis addressing?
This analysis addresses the issue of apparent batch effects by RNA-library in transcriptome data, correcting specifically for the factors that result in differences across polyA, ribo-deplete, and RNA exome libraries. This analysis should more effectively control for Type I and Type II errors when running differential gene expression analysis by DESeq2.
What was your approach?
This model assumes that differences in housekeeping genes from the HRT atlas v1.0 across DESeq2 comparison groups will specifically be a result of technical, rather than biological, variation, whether in a tumor-only or tumor-normal context. Therefore, RUVg uses HRT atlas genes as a negative control gene set for modeling factors of unwanted variation.
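The idea behind using negative control genes can be illustrated with a small standalone sketch. This is not the module's code, which calls RUVg from the RUVSeq R package. It is a pure-Python approximation of the core computation for a single factor (k = 1), assuming log-scale counts arranged as samples by genes. The function names and the toy data are hypothetical, and the `drop` parameter is not modelled.

```python
import math


def center_columns(matrix):
    """Subtract each gene's (column's) mean across samples."""
    n = len(matrix)
    means = [sum(row[j] for row in matrix) / n for j in range(len(matrix[0]))]
    return [[row[j] - means[j] for j in range(len(row))] for row in matrix]


def leading_factor(controls, iters=200):
    """Leading left singular vector of the centered control-gene matrix,
    estimated by power iteration (assumes the controls are not all constant)."""
    n, p = len(controls), len(controls[0])
    w = [float(i + 1) for i in range(n)]  # arbitrary non-zero starting vector
    for _ in range(iters):
        v = [sum(controls[i][j] * w[i] for i in range(n)) for j in range(p)]
        w = [sum(controls[i][j] * v[j] for j in range(p)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        w = [x / norm for x in w]
    return w


def remove_one_factor(log_counts, control_idx):
    """Estimate one factor of unwanted variation from the control genes and
    regress it out of every gene. Returns (factor, corrected matrix)."""
    centered = center_columns(log_counts)
    controls = [[row[j] for j in control_idx] for row in centered]
    w = leading_factor(controls)
    n, p = len(log_counts), len(log_counts[0])
    # w has unit norm, so the least-squares coefficient per gene is a dot product.
    alpha = [sum(w[i] * log_counts[i][j] for i in range(n)) for j in range(p)]
    corrected = [[log_counts[i][j] - w[i] * alpha[j] for j in range(p)]
                 for i in range(n)]
    return w, corrected


# Toy data: columns are [control 1, control 2, gene of interest].
# Samples 0-1 come from one library type, samples 2-3 from another (+1 shift);
# samples 1 and 3 carry a biological increase of 2 in the gene of interest.
toy = [
    [5.0, 6.0, 3.0],
    [5.0, 6.0, 5.0],
    [6.0, 7.0, 4.0],
    [6.0, 7.0, 6.0],
]
factor, corrected = remove_one_factor(toy, control_idx=[0, 1])
for row in corrected:
    print([round(x, 3) for x in row])
```

In this toy example the library shift is removed from all three genes, while the difference between groups in the gene of interest is retained, because the factor is estimated only from genes assumed to carry no biological signal.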
To select the model with the optimal sensitivity/specificity balance, this module requires that users specify a positive control gene set: a list of HUGO gene names, stored in an rds file, for genes that are expected to differ across comparison groups. A few use cases are specified in the run scripts, namely:
KIM_MYCN_AMPLIFICATION_TARGETS_[UP,DN] as the positive control gene sets from C2 canonical pathways from MSigDb.
Under Merged_differential_expression tab, genes whose adj p-value across K27M vs WT day5 < 0.05.
(drop = 2 parameter in RUVg call) to ensure that biological variation across tumor versus normal is maintained (it is expected that biological variation, rather than RNA_library, will be the main source of variation).
What GitHub issue does your pull request address?
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Performs DESeq2 analysis using RUVg-based batch correction.
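The choice between the uncorrected model and the RUVg models described above can be pictured as scoring each model's significant calls against the control gene sets. The sketch below is an assumption-laden illustration, not the module's implementation: it treats the housekeeping genes as the negatives for specificity, uses Youden's J (sensitivity + specificity - 1) as the balance criterion, and all names and gene labels are hypothetical.

```python
def sensitivity(significant, positive_controls):
    """Fraction of positive control genes that were called significant."""
    positives = set(positive_controls)
    return len(positives & set(significant)) / len(positives)


def specificity(significant, negative_controls):
    """Fraction of negative control genes that were NOT called significant."""
    negatives = set(negative_controls)
    return 1 - len(negatives & set(significant)) / len(negatives)


def pick_model(calls, positive_controls, negative_controls):
    """Return (label, score) for the model with the highest Youden's J.

    calls maps a model label to the genes it called significant.
    Ties keep the model listed first.
    """
    best_label, best_score = None, float("-inf")
    for label, significant in calls.items():
        score = (sensitivity(significant, positive_controls)
                 + specificity(significant, negative_controls) - 1)
        if score > best_score:
            best_label, best_score = label, score
    return best_label, best_score


# Hypothetical significant-gene calls for three candidate models.
calls = {
    "no_ruvg": {"POS1", "NEG1", "NEG2"},
    "ruvg_k1": {"POS1", "POS2", "NEG1"},
    "ruvg_k2": set(),
}
print(pick_model(calls, ["POS1", "POS2"], ["NEG1", "NEG2"]))
```

Other criteria, for example weighting specificity more heavily, would slot into `pick_model` without changing the rest of the sketch.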
Key steps:
01-ruvseq-deseq.R will run tumor-only standard differential expression analysis (no RUVg correction, by molecular_subtype) from DESeq2, to compare to RUVg using k=1-5 factors of variation in a batch-corrected DESeq2 module.
02-ruvseq-deseq-tn.R will run tumor-normal standard differential expression analysis (no RUVg correction, by molecular_subtype) from DESeq2, to compare to RUVg using k=1-5 factors of variation in a batch-corrected DESeq2 module.
03-ruvseq-summarization.R will select the batch correction method with the optimal balance of sensitivity and specificity, return a normalized counts .RDS file, and show the UMAP clustering of the groups of interest.
Is there anything that you want to discuss further?
A neuroblastoma tumor-normal use case may be valuable to test whether certain cancer genes/biomarkers of interest show the expected patterns across tumor versus normal (e.g. a cohort-level box plot).
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes
Results
What types of results are included (e.g., table, figure)?
plots directory. P-value histograms showing the impact on the preponderance of significant calls before and after batch correction are also included.
output/[dataset]/normalized_counts folder
What is your summary of the results?
Note: files in the code/archive subdirectory can be ignored but are useful to re-run the matched sample analysis.
Reproducibility Checklist
Documentation Checklist
README and it is up to date.
analyses/README.md and the entry is up to date.