Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Identify stably expressed housekeeping genes for batch correction #18

Closed
wants to merge 5 commits into from

Conversation

logstar
Copy link

@logstar logstar commented Jun 8, 2021

Purpose/implementation Section

What scientific question is your analysis addressing?

Identify of stably expressed housekeeping genes in cancer and normal RNA-seq samples for batch correction.

What was your approach?

  1. Merge PBTA, GTEx, TARGET and TCGA RSEM expected read counts.
    1. Matched gtex_target_tcga-gene-counts-rsem-expected_count-collapsed.rds column names with the mapping files shared by @komalsrathi at v4 release ticket-tracker-OPC#22 (comment).
    2. Removed duplicated samples. Some TCGA sample_barcodes are mapped to multiple sample_ids, and some of their expected count sums are different in gtex_target_tcga-gene-counts-rsem-expected_count-collapsed.rds. Keep one of the duplicated samples that have the same RESM expected count.
    3. Use PBTA gene-counts-rsem-expected_count-collapsed.rds column names as both sample_barcode and sample_id for consitency.
    4. Find the histologies of the samples by matching their sample_barcodes with the Kids_First_Biospecimen_IDs in histologies.tsv.
  2. Run differential gene expression (DGE) analysis using edgeR TMM normalization and likelihood ratio test on housekeeping genes.

What GitHub issue does your pull request address?

d3b-center/ticket-tracker-OPC#27

Directions for reviewers. Tell potential reviewers what kind of feedback you are soliciting.

Which areas should receive a particularly close look?

Whether the following NBL vs GTEx edgeR DGE analysis code is expected?

https://github.com/logstar/OpenPedCan-analysis/blob/5dc8da7e0c0c3bce5d0723639380cb7c317985c4/analyses/rna-seq-stably-expressed-housekeeping-genes/02-identify-stably-expressed-housekeeping-genes.R#L26-L81 (the link does not show code preview)

nbl_sids <- sh_rmna_m_kfbid_sid_htl_df[
    sh_rmna_m_kfbid_sid_htl_df$short_histology == 'NBL',
    'sample_id']


gtex_sids <- m_kfbid_sid_htl_df[
    substr(m_kfbid_sid_htl_df$sample_barcode, 1, 5) == 'GTEX-',
    'sample_id']


counts <- cbind(
    m_cnt_df[, nbl_sids],
    m_cnt_df[, gtex_sids]
)
stopifnot(identical(colnames(counts),
                    unique(colnames(counts))))


# defein group for DGE testing
group <- factor(c(rep("NBL", length(nbl_sids)),
                  rep("GTEX", length(gtex_sids))),
                levels = c('GTEX', 'NBL'))


# Make design matrix 
design <- model.matrix(~group)


### build normalized DGEList
counts_object <- DGEList(counts = counts, group = group)


counts_object <- counts_object[
    filterByExpr(counts_object), , keep.lib.sizes=FALSE]


counts_object <- calcNormFactors(counts_object, method = "TMM")
norm_cnt_mat <- cpm(counts_object)


# read and select housekeeping genes
hkgenes <- read.csv(file.path('input', 'Housekeeping_GenesHuman.csv'),
                    sep = ";", stringsAsFactors = FALSE)
hkgenes <- unique(hkgenes$Gene.name)
hkgenes <- intersect(rownames(counts_object), hkgenes)
counts_object <- counts_object[hkgenes, ]


plot_nm_str <- 'TMM'


counts_object <- estimateDisp(
    counts_object, design=design, prior.df=NULL, trend.method='locfit',
    tagwise=TRUE)


png(file.path(plot_outdir, 'estimated_dispersions.png'))
plotBCV(counts_object)
n_null_dev <- dev.off()




#------------ Run LRT ----------------------------------------------------------
print(paste0('Run differential gene expression LRT on',
             ' NBL vs GTEX RNA-seq...'))


glm_fit <- glmFit(counts_object, design)
lrt <- glmLRT(glm_fit)

Is there anything that you want to discuss further?

The 01-prepare-data.R prepares data for 02-identify-stably-expressed-housekeeping-genes.R to run DGE analysis, in order to reduce the memory usage. 01-prepare-data.R uses about 15GB of memory, and the used memory cannot be released by gc(). Restarting R session to free all used memory. 02-identify-stably-expressed-housekeeping-genes.R also uses about 15GB memory.

Is the analysis in a mature enough form that the resulting figure(s) and/or table(s) are ready for review?

Yes.

Results

What types of results are included (e.g., table, figure)?

Tables and figures.

What is your summary of the results?

  • results/tmm_normalized/NBL_vs_GTEX_dge_lrt_res.csv: NBL vs GTEx DGE result table.
  • plots/tmm_normalized: NBL vs GTEx DGE plots.

Reproducibility Checklist

  • The dependencies required to run the code in this pull request have been added to the project Dockerfile.
  • This analysis has been added to continuous integration.

Question: Is the continuous integration (CI) set up for PediatricOpenTargets/OpenPedCan-analysis? The .circleci/config.yml still has AlexsLemonade/OpenPBTA-analysis CI data URL, OPENPBTA_URL=https://open-pbta.s3.amazonaws.com/data OPENPBTA_RELEASE=testing

Documentation Checklist

  • This analysis module has a README and it is up to date.
  • This analysis is recorded in the table in analyses/README.md and the entry is up to date.
  • The analytical code is documented and contains comments.

The rna-seq-identify-bc-sehgs analysis module identifies stably
expressed housekeeping genes for batch correction, which is created to
address issue
d3b-center/ticket-tracker-OPC#27 .
Add rna-seq-stably-expressed-housekeeping-genes analysis module to
analyses/README.md.
@logstar logstar requested review from jharenza and aadamk June 8, 2021 19:41
@jharenza jharenza requested a review from kgaonkar6 June 11, 2021 17:47
Change rna-seq-stably-expressed-housekeeping-genes analysis module URL from
https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/master/...
to
https://github.com/PediatricOpenTargets/OpenPedCan-analysis/tree/dev/...

OpenPedCan-analysis uses dev as default branch.
Copy link

@aadamk aadamk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We may want to approach this analysis differently once our goals are complete from the discussion here.

The comparison of technical replicates you performed previously (matched samples differing only by polya vs stranded enrichment) may provide the necessary generalized negative control gene set for future DGE comparisons involving RNA_library as batch (and the complement of this gene set may be appropriate for correcting technical variations not related to RNA_library). However, this will be more definitively answered with an empirical evaluation of p-values of negative/positive control genes at multiple k's and RLE plots of negative control genes in the linked PR above + your suggestion of evaluating goodness of fit and the association of unwanted variation with biological factors.

Assuming a positive result from the above, my thought would be to instead focus this analysis as a T/N transcriptome-wide benchmarking comparison with and without RUVg (in place of the current goal of identifying a stably expressed gene set).

Thoughts?

@logstar
Copy link
Author

logstar commented Aug 31, 2021

I will merge with the latest dev branch and push to this PR, in order to check whether merging #93 changes the diffs of previously opened PRs, as the diffs in #100 are affected.

Use this PR for testing, because this PR may be superseded by #74.

@logstar
Copy link
Author

logstar commented Aug 31, 2021

@runjin326 @jharenza - Merging #93 without syncing with the dev branch introduced no new diff to this PR, which was synced from v4.

Following are command logs.

git:(dev) $ git remote -vv
origin	https://github.com/logstar/OpenPedCan-analysis.git (fetch)
origin	https://github.com/logstar/OpenPedCan-analysis.git (push)
upstream	https://github.com/PediatricOpenTargets/OpenPedCan-analysis.git (fetch)
upstream	https://github.com/PediatricOpenTargets/OpenPedCan-analysis.git (push)

git:(dev) $ git fetch upstream
remote: Enumerating objects: 1058, done.
remote: Counting objects: 100% (540/540), done.
remote: Compressing objects: 100% (21/21), done.
remote: Total 1058 (delta 524), reused 531 (delta 519), pack-reused 518
Receiving objects: 100% (1058/1058), 257.69 MiB | 10.55 MiB/s, done.
Resolving deltas: 100% (661/661), completed with 172 local objects.
From https://github.com/PediatricOpenTargets/OpenPedCan-analysis
 * [new branch]        as-refactor-subtyping                        -> upstream/as-refactor-subtyping
 * [new branch]        as-small-refactor                            -> upstream/as-small-refactor
 * [new branch]        cansavvy/dockerfile                          -> upstream/cansavvy/dockerfile
 * [new branch]        cbethell/recode-oncoprints                   -> upstream/cbethell/recode-oncoprints
 * [new branch]        cbethell/use-harmonized-dx-interaction-plots -> upstream/cbethell/use-harmonized-dx-interaction-plots
 * [new branch]        cnv-Manta-oveerlap-fix                       -> upstream/cnv-Manta-oveerlap-fix
 * [new branch]        dev                                          -> upstream/dev
 * [new branch]        expression-plots                             -> upstream/expression-plots
 * [new branch]        focal-cn-preparation-v7-p2                   -> upstream/focal-cn-preparation-v7-p2
 * [new branch]        fusion-freq                                  -> upstream/fusion-freq
 * [new branch]        fusion-freq-doc                              -> upstream/fusion-freq-doc
 * [new branch]        fusion-freq-readme                           -> upstream/fusion-freq-readme
 * [new branch]        fusion-frequency-update                      -> upstream/fusion-frequency-update
 * [new branch]        fusion_filtering_cg_normal_match             -> upstream/fusion_filtering_cg_normal_match
 * [new branch]        gtex-rename                                  -> upstream/gtex-rename
 * [new branch]        histology_color_update                       -> upstream/histology_color_update
 * [new branch]        ind-samples                                  -> upstream/ind-samples
 * [new branch]        master                                       -> upstream/master
 * [new branch]        molecular-subtype-mb                         -> upstream/molecular-subtype-mb
 * [new branch]        pedcbio-template                             -> upstream/pedcbio-template
 * [new branch]        pr/821                                       -> upstream/pr/821
 * [new branch]        revert-75-pedcbio-template                   -> upstream/revert-75-pedcbio-template
 * [new branch]        rnaseq-batch-correct                         -> upstream/rnaseq-batch-correct
 * [new branch]        tp53_nf1_update                              -> upstream/tp53_nf1_update
 * [new branch]        tp53_nf1_update_p2                           -> upstream/tp53_nf1_update_p2
 * [new branch]        update-bucket                                -> upstream/update-bucket
 * [new branch]        update-readme                                -> upstream/update-readme
 * [new branch]        v7-release                                   -> upstream/v7-release

git:(dev) $ git merge upstream/dev
Updating 61e23154..d3c8b489
Fast-forward
 analyses/README.md                                                                                                        |   1 +
 analyses/pedot-table-column-display-order-name/.flake8                                                                    |   5 +++
 analyses/pedot-table-column-display-order-name/01-generate-pedot-column-display-order-name-xlsx.py                        |  50 +++++++++++++++++++++
 analyses/pedot-table-column-display-order-name/README.md                                                                  |  99 ++++++++++++++++++++++++++++++++++++++++++
 analyses/pedot-table-column-display-order-name/docs/pedot_table_display_column_order_name_coordination_Aug5_2021_v1.2.png | Bin 0 -> 316111 bytes
 analyses/pedot-table-column-display-order-name/results/pedot-table-column-display-order-name.xlsx                         | Bin 0 -> 69849 bytes
 analyses/pedot-table-column-display-order-name/run-pedot-table-column-display-order-name.sh                               |  23 ++++++++++
 analyses/pedot-table-column-display-order-name/utils/__init__.py                                                          |   2 +
 analyses/pedot-table-column-display-order-name/utils/tsv.py                                                               | 228 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 analyses/pedot-table-column-display-order-name/utils/xlsx.py                                                              |  41 ++++++++++++++++++
 download-data.sh                                                                                                          |   2 +-
 11 files changed, 450 insertions(+), 1 deletion(-)
 create mode 100644 analyses/pedot-table-column-display-order-name/.flake8
 create mode 100644 analyses/pedot-table-column-display-order-name/01-generate-pedot-column-display-order-name-xlsx.py
 create mode 100644 analyses/pedot-table-column-display-order-name/README.md
 create mode 100644 analyses/pedot-table-column-display-order-name/docs/pedot_table_display_column_order_name_coordination_Aug5_2021_v1.2.png
 create mode 100644 analyses/pedot-table-column-display-order-name/results/pedot-table-column-display-order-name.xlsx
 create mode 100644 analyses/pedot-table-column-display-order-name/run-pedot-table-column-display-order-name.sh
 create mode 100644 analyses/pedot-table-column-display-order-name/utils/__init__.py
 create mode 100644 analyses/pedot-table-column-display-order-name/utils/tsv.py
 create mode 100644 analyses/pedot-table-column-display-order-name/utils/xlsx.py

git:(dev) $ git push
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
To https://github.com/logstar/OpenPedCan-analysis.git
   61e23154..d3c8b489  dev -> dev

# checkout this PR's branch
git:(dev) $ git co -t origin/rna-seq-identify-bc-sehgs 
Updating files: 100% (384/384), done.
Branch 'rna-seq-identify-bc-sehgs' set up to track remote branch 'rna-seq-identify-bc-sehgs' from 'origin'.
Switched to a new branch 'rna-seq-identify-bc-sehgs'

git:(rna-seq-identify-bc-sehgs) $ git merge dev
Performing inexact rename detection: 100% (6025/6025), done.
Removing analyses/molecular-subtyping-MB/results/MB_batchcorrected_molecular_subtype.tsv
Removing analyses/independent-samples/results/independent-specimens.rnaseq.primary-plus-stranded.tsv
Removing analyses/independent-samples/results/independent-specimens.rnaseq.primary-plus-polya.tsv
Removing analyses/independent-samples/independent_rna_samples.R
Removing analyses/independent-samples/independent-samples.R
Removing analyses/fusion_filtering/results/pbta-fusion-recurrently-fused-genes-bysample.tsv
Removing analyses/fusion_filtering/results/pbta-fusion-recurrently-fused-genes-byhistology.tsv
Removing analyses/fusion_filtering/results/pbta-fusion-recurrent-fusion-bysample.tsv
Removing analyses/fusion_filtering/results/pbta-fusion-recurrent-fusion-byhistology.tsv
Removing analyses/fusion_filtering/results/pbta-fusion-putative-oncogenic.tsv
Removing analyses/fusion_filtering/06-recurrent-fusions-per-histology.R
Removing analyses/collapse-rnaseq/results/kfnbl-gene-expression-rsem-tpm-collapsed_table.stranded.rds
Removing analyses/collapse-rnaseq/results/kfnbl-gene-expression-rsem-fpkm-collapsed_table.stranded.rds
Removing analyses/collapse-rnaseq/results/kfnbl-gene-counts-rsem-expected_count-collapsed_table.stranded.rds
Removing analyses/collapse-rnaseq/02-analyze-drops-kfnbl-tpm.nb.html
Removing analyses/collapse-rnaseq/02-analyze-drops-kfnbl-fpkm.nb.html
Removing analyses/collapse-rnaseq/02-analyze-drops-kfnbl-expected_count.nb.html
Auto-merging analyses/README.md
CONFLICT (content): Merge conflict in analyses/README.md
Automatic merge failed; fix conflicts and then commit the result.

... resolved conflicts

git:(rna-seq-identify-bc-sehgs*) $ git add README.md

git:(rna-seq-identify-bc-sehgs*) $ git ci
[rna-seq-identify-bc-sehgs 9d6caf45] Merge branch 'dev' into rna-seq-identify-bc-sehgs

git:(rna-seq-identify-bc-sehgs) $ git push
Enumerating objects: 10, done.
Counting objects: 100% (10/10), done.
Delta compression using up to 8 threads
Compressing objects: 100% (4/4), done.
Writing objects: 100% (4/4), 640 bytes | 640.00 KiB/s, done.
Total 4 (delta 3), reused 0 (delta 0), pack-reused 0
remote: Resolving deltas: 100% (3/3), completed with 3 local objects.
To https://github.com/logstar/OpenPedCan-analysis.git
   038f94a5..9d6caf45  rna-seq-identify-bc-sehgs -> rna-seq-identify-bc-sehgs

@logstar
Copy link
Author

logstar commented Jan 31, 2022

We may want to approach this analysis differently once our goals are complete from the discussion here.

The comparison of technical replicates you performed previously (matched samples differing only by polya vs stranded enrichment) may provide the necessary generalized negative control gene set for future DGE comparisons involving RNA_library as batch (and the complement of this gene set may be appropriate for correcting technical variations not related to RNA_library). However, this will be more definitively answered with an empirical evaluation of p-values of negative/positive control genes at multiple k's and RLE plots of negative control genes in the linked PR above + your suggestion of evaluating goodness of fit and the association of unwanted variation with biological factors.

Assuming a positive result from the above, my thought would be to instead focus this analysis as a T/N transcriptome-wide benchmarking comparison with and without RUVg (in place of the current goal of identifying a stably expressed gene set).

Thoughts?

@komalsrathi - Above is the latest summary by @aadamk on this PR.

I will close this PR soon, according to @aadamk's summary. In addition, the issue that is related to this PR, d3b-center/ticket-tracker-OPC#27, has also been closed.

cc @taylordm @chinwallaa

@logstar logstar closed this Jan 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants