Identify stably expressed housekeeping genes for batch correction #18
The rna-seq-identify-bc-sehgs analysis module identifies stably expressed housekeeping genes for batch correction. It was created to address issue d3b-center/ticket-tracker-OPC#27.
Add rna-seq-stably-expressed-housekeeping-genes analysis module to analyses/README.md.
Change rna-seq-stably-expressed-housekeeping-genes analysis module URL from https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/master/... to https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/... because OpenPedCan-analysis uses dev as its default branch.
We may want to approach this analysis differently once our goals are complete from the discussion here.
The comparison of technical replicates you performed previously (matched samples differing only by polya vs stranded enrichment) may provide the necessary generalized negative control gene set for future DGE comparisons involving RNA_library as batch (and the complement of this gene set may be appropriate for correcting technical variations not related to RNA_library). However, this will be more definitively answered with an empirical evaluation of p-values of negative/positive control genes at multiple k's and RLE plots of negative control genes in the linked PR above + your suggestion of evaluating goodness of fit and the association of unwanted variation with biological factors.
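For readers less familiar with the RLE (relative log expression) diagnostic mentioned above, here is a minimal sketch in plain Python rather than the module's R code. It assumes a log2 transform with a pseudocount of 1, and the function names and toy counts are hypothetical. For each gene, the median log expression across samples is subtracted from each sample's value; per-sample medians that sit far from zero hint at remaining unwanted variation.

```python
import math
from statistics import median


def relative_log_expression(counts, pseudocount=1.0):
    """Compute RLE values per gene.

    counts: dict mapping gene -> list of counts, one per sample (same order).
    The log2 transform and the pseudocount are assumptions of this sketch.
    """
    rle = {}
    for gene, values in counts.items():
        logs = [math.log2(v + pseudocount) for v in values]
        gene_median = median(logs)
        # Deviation of each sample from the gene's median log expression.
        rle[gene] = [x - gene_median for x in logs]
    return rle


def per_sample_medians(rle):
    """Median RLE per sample, the centre line of each box in an RLE plot."""
    n_samples = len(next(iter(rle.values())))
    return [median(vals[j] for vals in rle.values()) for j in range(n_samples)]


# Hypothetical negative control genes measured in three samples.
toy_counts = {
    "gene_1": [7, 15, 31],
    "gene_2": [3, 7, 15],
    "gene_3": [1, 3, 7],
}
toy_rle = relative_log_expression(toy_counts)
toy_medians = per_sample_medians(toy_rle)
```

In this toy input every gene doubles from one sample to the next, so the per-sample medians drift away from zero, which is the pattern an RLE plot of negative control genes would flag before correction.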
Assuming a positive result from the above, my thought would be to instead focus this analysis as a T/N transcriptome-wide benchmarking comparison with and without RUVg (in place of the current goal of identifying a stably expressed gene set).
Thoughts?
@runjin326 @jharenza - Merging #93 without syncing with the dev branch introduced no new diff to this PR, which was synced from v4. Following are command logs.
@komalsrathi - Above is the latest summary by @aadamk on this PR. I will close this PR soon, according to @aadamk's summary. In addition, the issue that is related to this PR, d3b-center/ticket-tracker-OPC#27, has also been closed.
Purpose/implementation Section
What scientific question is your analysis addressing?
Identify stably expressed housekeeping genes in cancer and normal RNA-seq samples for batch correction.
What was your approach?
[…] gtex_target_tcga-gene-counts-rsem-expected_count-collapsed.rds column names with the mapping files shared by @komalsrathi at v4 release ticket-tracker-OPC#22 (comment).
sample_barcodes are mapped to multiple sample_ids, and some of their expected count sums are different in gtex_target_tcga-gene-counts-rsem-expected_count-collapsed.rds. Keep one of the duplicated samples that have the same RSEM expected count.
[…] gene-counts-rsem-expected_count-collapsed.rds column names as both sample_barcode and sample_id for consistency.
[…] sample_barcodes with the Kids_First_Biospecimen_IDs in histologies.tsv.
What GitHub issue does your pull request address?
d3b-center/ticket-tracker-OPC#27
Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.
Which areas should receive a particularly close look?
Is the following NBL vs GTEx edgeR DGE analysis code as expected?
https://github.com/logstar/OpenPedCan-analysis/blob/5dc8da7e0c0c3bce5d0723639380cb7c317985c4/analyses/rna-seq-stably-expressed-housekeeping-genes/02-identify-stably-expressed-housekeeping-genes.R#L26-L81 (the link does not show code preview)
Is there anything that you want to discuss further?
01-prepare-data.R prepares data for 02-identify-stably-expressed-housekeeping-genes.R to run the DGE analysis, in order to reduce memory usage. 01-prepare-data.R uses about 15GB of memory, and the used memory cannot be released by gc(). Restarting the R session frees all used memory. 02-identify-stably-expressed-housekeeping-genes.R also uses about 15GB of memory.
Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?
Yes.
Results
What types of results are included (e.g., table, figure)?
Tables and figures.
What is your summary of the results?
results/tmm_normalized/NBL_vs_GTEX_dge_lrt_res.csv: NBL vs GTEx DGE result table.
plots/tmm_normalized: NBL vs GTEx DGE plots.
Reproducibility Checklist
Question: Is continuous integration (CI) set up for PediatricOpenTargets/OpenPedCan-analysis? The .circleci/config.yml still has the AlexsLemonade/OpenPBTA-analysis CI data URL: OPENPBTA_URL=https://open-pbta.s3.amazonaws.com/data OPENPBTA_RELEASE=testing
Documentation Checklist
This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.