Identify stably expressed housekeeping genes for batch correction #18

logstar · 2021-06-08T19:41:14Z

Purpose/implementation Section

What scientific question is your analysis addressing?

Identify of stably expressed housekeeping genes in cancer and normal RNA-seq samples for batch correction.

What was your approach?

Merge PBTA, GTEx, TARGET and TCGA RSEM expected read counts.
1. Matched gtex_target_tcga-gene-counts-rsem-expected_count-collapsed.rds column names with the mapping files shared by @komalsrathi at v4 release ticket-tracker-OPC#22 (comment).
2. Removed duplicated samples. Some TCGA sample_barcodes are mapped to multiple sample_ids, and some of their expected count sums are different in gtex_target_tcga-gene-counts-rsem-expected_count-collapsed.rds. Keep one of the duplicated samples that have the same RESM expected count.
3. Use PBTA gene-counts-rsem-expected_count-collapsed.rds column names as both sample_barcode and sample_id for consitency.
4. Find the histologies of the samples by matching their sample_barcodes with the Kids_First_Biospecimen_IDs in histologies.tsv.
Run differential gene expression (DGE) analysis using edgeR TMM normalization and likelihood ratio test on housekeeping genes.

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#27

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Whether the following NBL vs GTEx edgeR DGE analysis code is expected?

https://github.com/logstar/OpenPedCan-analysis/blob/5dc8da7e0c0c3bce5d0723639380cb7c317985c4/analyses/rna-seq-stably-expressed-housekeeping-genes/02-identify-stably-expressed-housekeeping-genes.R#L26-L81 (the link does not show code preview)

nbl_sids <- sh_rmna_m_kfbid_sid_htl_df[
    sh_rmna_m_kfbid_sid_htl_df$short_histology == 'NBL',
    'sample_id']


gtex_sids <- m_kfbid_sid_htl_df[
    substr(m_kfbid_sid_htl_df$sample_barcode, 1, 5) == 'GTEX-',
    'sample_id']


counts <- cbind(
    m_cnt_df[, nbl_sids],
    m_cnt_df[, gtex_sids]
)
stopifnot(identical(colnames(counts),
                    unique(colnames(counts))))


# defein group for DGE testing
group <- factor(c(rep("NBL", length(nbl_sids)),
                  rep("GTEX", length(gtex_sids))),
                levels = c('GTEX', 'NBL'))


# Make design matrix 
design <- model.matrix(~group)


### build normalized DGEList
counts_object <- DGEList(counts = counts, group = group)


counts_object <- counts_object[
    filterByExpr(counts_object), , keep.lib.sizes=FALSE]


counts_object <- calcNormFactors(counts_object, method = "TMM")
norm_cnt_mat <- cpm(counts_object)


# read and select housekeeping genes
hkgenes <- read.csv(file.path('input', 'Housekeeping_GenesHuman.csv'),
                    sep = ";", stringsAsFactors = FALSE)
hkgenes <- unique(hkgenes$Gene.name)
hkgenes <- intersect(rownames(counts_object), hkgenes)
counts_object <- counts_object[hkgenes, ]


plot_nm_str <- 'TMM'


counts_object <- estimateDisp(
    counts_object, design=design, prior.df=NULL, trend.method='locfit',
    tagwise=TRUE)


png(file.path(plot_outdir, 'estimated_dispersions.png'))
plotBCV(counts_object)
n_null_dev <- dev.off()




#------------ Run LRT ----------------------------------------------------------
print(paste0('Run differential gene expression LRT on',
             ' NBL vs GTEX RNA-seq...'))


glm_fit <- glmFit(counts_object, design)
lrt <- glmLRT(glm_fit)

Is there anything that you want to discuss further?

The 01-prepare-data.R prepares data for 02-identify-stably-expressed-housekeeping-genes.R to run DGE analysis, in order to reduce the memory usage. 01-prepare-data.R uses about 15GB of memory, and the used memory cannot be released by gc(). Restarting R session to free all used memory. 02-identify-stably-expressed-housekeeping-genes.R also uses about 15GB memory.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

Tables and figures.

What is your summary of the results?

results/tmm_normalized/NBL_vs_GTEX_dge_lrt_res.csv: NBL vs GTEx DGE result table.
plots/tmm_normalized: NBL vs GTEx DGE plots.

Reproducibility Checklist

The dependencies required to run the code in this pull request have been added to the project Dockerfile.
This analysis has been added to continuous integration.

Question: Is the continuous integration (CI) set up for PediatricOpenTargets/OpenPedCan-analysis? The .circleci/config.yml still has AlexsLemonade/OpenPBTA-analysis CI data URL, OPENPBTA_URL=https://open-pbta.s3.amazonaws.com/data OPENPBTA_RELEASE=testing

Documentation Checklist

This analysis module has a README and it is up to date.
This analysis is recorded in the table in analyses/README.md and the entry is up to date.
The analytical code is documented and contains comments.

The rna-seq-identify-bc-sehgs analysis module identifies stably expressed housekeeping genes for batch correction, which is created to address issue d3b-center/ticket-tracker-OPC#27 .

Add rna-seq-stably-expressed-housekeeping-genes analysis module to analyses/README.md.

Change rna-seq-stably-expressed-housekeeping-genes analysis module URL from https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/master/... to https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/... OpenPedCan-analysis uses dev as default branch.

aadamk

We may want to approach this analysis differently once our goals are complete from the discussion here.

The comparison of technical replicates you performed previously (matched samples differing only by polya vs stranded enrichment) may provide the necessary generalized negative control gene set for future DGE comparisons involving RNA_library as batch (and the complement of this gene set may be appropriate for correcting technical variations not related to RNA_library). However, this will be more definitively answered with an empirical evaluation of p-values of negative/positive control genes at multiple k's and RLE plots of negative control genes in the linked PR above + your suggestion of evaluating goodness of fit and the association of unwanted variation with biological factors.

Assuming a positive result from the above, my thought would be to instead focus this analysis as a T/N transcriptome-wide benchmarking comparison with and without RUVg (in place of the current goal of identifying a stably expressed gene set).

Thoughts?

logstar · 2021-08-31T20:30:09Z

I will merge with the latest dev branch and push to this PR, in order to check whether merging #93 changes the diffs of previously opened PRs, as the diffs in #100 are affected.

Use this PR for testing, because this PR may be superseded by #74.

logstar · 2021-08-31T20:39:13Z

@runjin326 @jharenza - Merging #93 without syncing with the dev branch introduced no new diff to this PR, which was synced from v4.

Following are command logs.

git:(dev) $ git remote -vv
origin	https://github.com/logstar/OpenPedCan-analysis.git (fetch)
origin	https://github.com/logstar/OpenPedCan-analysis.git (push)
upstream	https://github.com/PediatricOpenTargets/OpenPedCan-analysis.git (fetch)
upstream	https://github.com/PediatricOpenTargets/OpenPedCan-analysis.git (push)

git:(dev) $ git fetch upstream
remote: Enumerating objects: 1058, done.
remote: Counting objects: 100% (540/540), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 1058 (delta 524), reused 531 (delta 519), pack-reused 518
Receiving objects: 100% (1058/1058), 257.69 MiB | 10.55 MiB/s, done.
Resolving deltas: 100% (661/661), completed with 172 local objects.
From https://github.com/PediatricOpenTargets/OpenPedCan-analysis
 * [new branch]        as-refactor-subtyping                        -> upstream/as-refactor-subtyping
 * [new branch]        as-small-refactor                            -> upstream/as-small-refactor
 * [new branch]        cansavvy/dockerfile                          -> upstream/cansavvy/dockerfile
 * [new branch]        cbethell/recode-oncoprints                   -> upstream/cbethell/recode-oncoprints
 * [new branch]        cbethell/use-harmonized-dx-interaction-plots -> upstream/cbethell/use-harmonized-dx-interaction-plots
 * [new branch]        cnv-Manta-oveerlap-fix                       -> upstream/cnv-Manta-oveerlap-fix
 * [new branch]        dev                                          -> upstream/dev
 * [new branch]        expression-plots                             -> upstream/expression-plots
 * [new branch]        focal-cn-preparation-v7-p2                   -> upstream/focal-cn-preparation-v7-p2
 * [new branch]        fusion-freq                                  -> upstream/fusion-freq
 * [new branch]        fusion-freq-doc                              -> upstream/fusion-freq-doc
 * [new branch]        fusion-freq-readme                           -> upstream/fusion-freq-readme
 * [new branch]        fusion-frequency-update                      -> upstream/fusion-frequency-update
 * [new branch]        fusion_filtering_cg_normal_match             -> upstream/fusion_filtering_cg_normal_match
 * [new branch]        gtex-rename                                  -> upstream/gtex-rename
 * [new branch]        histology_color_update                       -> upstream/histology_color_update
 * [new branch]        ind-samples                                  -> upstream/ind-samples
 * [new branch]        master                                       -> upstream/master
 * [new branch]        molecular-subtype-mb                         -> upstream/molecular-subtype-mb
 * [new branch]        pedcbio-template                             -> upstream/pedcbio-template
 * [new branch]        pr/821                                       -> upstream/pr/821
 * [new branch]        revert-75-pedcbio-template                   -> upstream/revert-75-pedcbio-template
 * [new branch]        rnaseq-batch-correct                         -> upstream/rnaseq-batch-correct
 * [new branch]        tp53_nf1_update                              -> upstream/tp53_nf1_update
 * [new branch]        tp53_nf1_update_p2                           -> upstream/tp53_nf1_update_p2
 * [new branch]        update-bucket                                -> upstream/update-bucket
 * [new branch]        update-readme                                -> upstream/update-readme
 * [new branch]        v7-release                                   -> upstream/v7-release

git:(dev) $ git merge upstream/dev
Updating 61e23154..d3c8b489
Fast-forward
 analyses/README.md                                                                                                        |   1 +
 analyses/pedot-table-column-display-order-name/.flake8                                                                    |   5 +++
 analyses/pedot-table-column-display-order-name/01-generate-pedot-column-display-order-name-xlsx.py                        |  50 +++++++++++++++++++++
 analyses/pedot-table-column-display-order-name/README.md                                                                  |  99 ++++++++++++++++++++++++++++++++++++++++++
 analyses/pedot-table-column-display-order-name/docs/pedot_table_display_column_order_name_coordination_Aug5_2021_v1.2.png | Bin 0 -> 316111 bytes
 analyses/pedot-table-column-display-order-name/results/pedot-table-column-display-order-name.xlsx                         | Bin 0 -> 69849 bytes
 analyses/pedot-table-column-display-order-name/run-pedot-table-column-display-order-name.sh                               |  23 ++++++++++
 analyses/pedot-table-column-display-order-name/utils/__init__.py                                                          |   2 +
 analyses/pedot-table-column-display-order-name/utils/tsv.py                                                               | 228 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 analyses/pedot-table-column-display-order-name/utils/xlsx.py                                                              |  41 ++++++++++++++++++
 download-data.sh                                                                                                          |   2 +-
 11 files changed, 450 insertions(+), 1 deletion(-)
 create mode 100644 analyses/pedot-table-column-display-order-name/.flake8
 create mode 100644 analyses/pedot-table-column-display-order-name/01-generate-pedot-column-display-order-name-xlsx.py
 create mode 100644 analyses/pedot-table-column-display-order-name/README.md
 create mode 100644 analyses/pedot-table-column-display-order-name/docs/pedot_table_display_column_order_name_coordination_Aug5_2021_v1.2.png
 create mode 100644 analyses/pedot-table-column-display-order-name/results/pedot-table-column-display-order-name.xlsx
 create mode 100644 analyses/pedot-table-column-display-order-name/run-pedot-table-column-display-order-name.sh
 create mode 100644 analyses/pedot-table-column-display-order-name/utils/__init__.py
 create mode 100644 analyses/pedot-table-column-display-order-name/utils/tsv.py
 create mode 100644 analyses/pedot-table-column-display-order-name/utils/xlsx.py

git:(dev) $ git push
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/logstar/OpenPedCan-analysis.git
   61e23154..d3c8b489  dev -> dev

# checkout this PR's branch
git:(dev) $ git co -t origin/rna-seq-identify-bc-sehgs 
Updating files: 100% (384/384), done.
Branch 'rna-seq-identify-bc-sehgs' set up to track remote branch 'rna-seq-identify-bc-sehgs' from 'origin'.
Switched to a new branch 'rna-seq-identify-bc-sehgs'

git:(rna-seq-identify-bc-sehgs) $ git merge dev
Performing inexact rename detection: 100% (6025/6025), done.
Removing analyses/molecular-subtyping-MB/results/MB_batchcorrected_molecular_subtype.tsv
Removing analyses/independent-samples/results/independent-specimens.rnaseq.primary-plus-stranded.tsv
Removing analyses/independent-samples/results/independent-specimens.rnaseq.primary-plus-polya.tsv
Removing analyses/independent-samples/independent_rna_samples.R
Removing analyses/independent-samples/independent-samples.R
Removing analyses/fusion_filtering/results/pbta-fusion-recurrently-fused-genes-bysample.tsv
Removing analyses/fusion_filtering/results/pbta-fusion-recurrently-fused-genes-byhistology.tsv
Removing analyses/fusion_filtering/results/pbta-fusion-recurrent-fusion-bysample.tsv
Removing analyses/fusion_filtering/results/pbta-fusion-recurrent-fusion-byhistology.tsv
Removing analyses/fusion_filtering/results/pbta-fusion-putative-oncogenic.tsv
Removing analyses/fusion_filtering/06-recurrent-fusions-per-histology.R
Removing analyses/collapse-rnaseq/results/kfnbl-gene-expression-rsem-tpm-collapsed_table.stranded.rds
Removing analyses/collapse-rnaseq/results/kfnbl-gene-expression-rsem-fpkm-collapsed_table.stranded.rds
Removing analyses/collapse-rnaseq/results/kfnbl-gene-counts-rsem-expected_count-collapsed_table.stranded.rds
Removing analyses/collapse-rnaseq/02-analyze-drops-kfnbl-tpm.nb.html
Removing analyses/collapse-rnaseq/02-analyze-drops-kfnbl-fpkm.nb.html
Removing analyses/collapse-rnaseq/02-analyze-drops-kfnbl-expected_count.nb.html
Auto-merging analyses/README.md
CONFLICT (content): Merge conflict in analyses/README.md
Automatic merge failed; fix conflicts and then commit the result.

... resolved conflicts

git:(rna-seq-identify-bc-sehgs*) $ git add README.md

git:(rna-seq-identify-bc-sehgs*) $ git ci
[rna-seq-identify-bc-sehgs 9d6caf45] Merge branch 'dev' into rna-seq-identify-bc-sehgs

git:(rna-seq-identify-bc-sehgs) $ git push
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 640 bytes | 640.00 KiB/s, done.
Total 4 (delta 3), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
To https://github.com/logstar/OpenPedCan-analysis.git
   038f94a5..9d6caf45  rna-seq-identify-bc-sehgs -> rna-seq-identify-bc-sehgs

logstar · 2022-01-31T21:21:25Z

We may want to approach this analysis differently once our goals are complete from the discussion here.

The comparison of technical replicates you performed previously (matched samples differing only by polya vs stranded enrichment) may provide the necessary generalized negative control gene set for future DGE comparisons involving RNA_library as batch (and the complement of this gene set may be appropriate for correcting technical variations not related to RNA_library). However, this will be more definitively answered with an empirical evaluation of p-values of negative/positive control genes at multiple k's and RLE plots of negative control genes in the linked PR above + your suggestion of evaluating goodness of fit and the association of unwanted variation with biological factors.

Assuming a positive result from the above, my thought would be to instead focus this analysis as a T/N transcriptome-wide benchmarking comparison with and without RUVg (in place of the current goal of identifying a stably expressed gene set).

Thoughts?

@komalsrathi - Above is the latest summary by @aadamk on this PR.

I will close this PR soon, according to @aadamk's summary. In addition, the issue that is related to this PR, d3b-center/ticket-tracker-OPC#27, has also been closed.

cc @taylordm @chinwallaa

logstar added 3 commits June 8, 2021 14:43

Add rna-seq-identify-bc-sehgs analysis module

7709eca

The rna-seq-identify-bc-sehgs analysis module identifies stably expressed housekeeping genes for batch correction, which is created to address issue d3b-center/ticket-tracker-OPC#27 .

Update analyses/README.md

ba59d9a

Add rna-seq-stably-expressed-housekeeping-genes analysis module to analyses/README.md.

Clean up comments

5dc8da7

logstar requested review from jharenza and aadamk June 8, 2021 19:41

jharenza requested a review from kgaonkar6 June 11, 2021 17:47

jharenza added the low priority label Jul 16, 2021

logstar mentioned this pull request Aug 3, 2021

Evaluate RUVseq using differentially expressed housekeeping genes as RUVg negative control genes #74

Closed

4 tasks

aadamk reviewed Aug 13, 2021

View reviewed changes

Merge branch 'dev' into rna-seq-identify-bc-sehgs

9d6caf4

logstar closed this Jan 31, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Identify stably expressed housekeeping genes for batch correction #18

Identify stably expressed housekeeping genes for batch correction #18

logstar commented Jun 8, 2021

aadamk left a comment

logstar commented Aug 31, 2021

logstar commented Aug 31, 2021 •

edited

Loading

logstar commented Jan 31, 2022

Identify stably expressed housekeeping genes for batch correction #18

Identify stably expressed housekeeping genes for batch correction #18

Conversation

logstar commented Jun 8, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

What was your approach?

What GitHub issue does your pull request address?

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Is there anything that you want to discuss further?

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Results

What types of results are included (e.g., table, figure)?

What is your summary of the results?

Reproducibility Checklist

Documentation Checklist

aadamk left a comment

Choose a reason for hiding this comment

logstar commented Aug 31, 2021

logstar commented Aug 31, 2021 • edited Loading

logstar commented Jan 31, 2022

logstar commented Aug 31, 2021 •

edited

Loading