Parallelize DESeq2 #79

Closed
drpatelh opened this issue Feb 7, 2020 · 7 comments
Labels
enhancement New feature or request

Comments

@drpatelh
Member

drpatelh commented Feb 7, 2020

It should be possible to add another parameter to the differential accessibility script specifying the number of cores in order to parallelize DESeq2:
http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#using-parallelization

Probably also worth adding an option just to skip this step e.g. --skip_differential_analysis

drpatelh added the "enhancement" and "good first issue" labels on Feb 7, 2020
@drpatelh
Member Author

drpatelh commented Feb 15, 2020

As suggested by @mikelove on Twitter it's worth looking into using limma for the CQN normalised data for speedup. See F1000 paper.

The implementation should be quite trivial. It's just a case of figuring out how! Contributions/thoughts welcome.

@mikelove

Here's another pointer:

https://github.com/kauralasoo/macrophage-gxe-study/blob/5f8c7ce999da89fce5017af3e2cdd39106e68126/ATAC/munge/processPeakCounts.R

Also CC @kauralasoo whose paper that was (the data generation and processing).

@kauralasoo

Thanks @mikelove for cc'ing me. Yes, we ran into the same issue: DESeq2 was a bit too slow when testing for differential accessibility of 300,000 features across 64 samples. In the paper, I decided to use limma voom for differential accessibility analysis, but did not benchmark it against cqn normalisation + lmFit.

Here is the limma voom code: https://github.com/kauralasoo/macrophage-gxe-study/blob/master/ATAC/DA/clusterPeaks.R

I used cqn normalisation for chromatin accessibility QTL analysis, where we tested up to 5000 genetic variants around each feature, since this can be done efficiently with fast linear model implementations such as MatrixEQTL or QTLtools.

On our dataset cqn seemed to work better than log(TPM), but I vaguely remember other people having some issues with it on other datasets, so I would not dare to recommend it as the default without testing it on a few datasets.

cqn requires feature GC content as a covariate. I calculated this based on the reference genome and peak coordinates using bedtools nuc:
bedtools nuc -fi ../../../annotations/GRCh38/dna/Homo_sapiens.GRCh38.dna.primary_assembly.fa -bed ATAC_consensus_peaks.gff3 > ATAC_consensus_peaks.nuc_content.txt
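
For readers unfamiliar with the covariate, here is a minimal Python sketch of a per-feature GC fraction. The peak names and sequences are made up for illustration, and the choice to ignore ambiguous bases such as N is an assumption of this sketch; bedtools nuc may handle them differently, so this is not a drop-in replacement for the command above.

```python
def gc_fraction(sequence: str) -> float:
    """Fraction of G/C bases among unambiguous A/C/G/T bases.

    Assumption of this sketch: ambiguous bases (e.g. N) are ignored.
    """
    seq = sequence.upper()  # treat soft-masked (lowercase) bases like any other
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    total = gc + at
    return gc / total if total else 0.0


# Hypothetical peak sequences, for illustration only.
peaks = {
    "peak_1": "GGCCATAT",
    "peak_2": "ATATATNN",
    "peak_3": "gcgcgcgc",
}

# One GC value per feature, the shape of covariate the comment above describes.
gc_by_peak = {name: gc_fraction(seq) for name, seq in peaks.items()}

for name, gc in gc_by_peak.items():
    print(f"{name}\t{gc:.2f}")
```

In practice the sequences would come from the reference FASTA at the consensus peak coordinates rather than from a hard-coded dictionary.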

@drpatelh
Member Author

drpatelh commented Feb 18, 2020

Thanks @kauralasoo ! cc'ing @macroscian here who is our in-house stats guru 😎

The pipeline currently generates raw read counts (featureCounts) and then processes them with DESeq2 based on a user-specified design. Ideally, the user would have to run the pipeline at the experiment level because the DESeq2 model is fitted once across all samples in the design. Also, it doesn't make sense to create a consensus set of intervals across samples you don't want to compare together.

Unless I've missed something, I'm hoping that the simplest solution would be to implement an independent script that starts from raw read counts and uses limma voom to generate the differential intervals instead. I can then add that into the pipeline with an optional flag (e.g. --limma_diff_analysis) so that it can be used if required. Furthermore, I can set up the directory structure in a way where you can get both outputs by using the above parameter along with -resume.

@mikelove I think we need to parallelise DESeq2 anyway. I am setting up some tests to add this into the pipeline, but I was wondering whether you had an idea of what sort of speed-up can be attained? In the dev version of the pipeline that process is currently labelled to use 6 CPUs:

atacseq/main.nf

Line 1316 in 34ed69b

label 'process_medium'

as defined here:

atacseq/conf/base.config

Lines 28 to 32 in 34ed69b

withLabel:process_medium {
    cpus = { check_max( 6 * task.attempt, 'cpus' ) }
    memory = { check_max( 42.GB * task.attempt, 'memory' ) }
    time = { check_max( 8.h * task.attempt, 'time' ) }
}

I realise this may depend on the input data, but is there an upper limit beyond which additional cores give minimal gain?

drpatelh removed the "good first issue" label on Feb 18, 2020
@mikelove

There are a number of threads on the Bioc site about the speedup attainable with BPPARAM. The gist is: it works fine on my end (where I usually use 4-8 workers and attain some fractional gain relative to nworkers due to overhead, maybe like 50% of the optimal speedup), and usually what happens "in the wild" is that people end up requesting dozens of cores across different nodes, which gets bogged down in memory transfer, and end up with performance worse than if they had just used parallel=FALSE.

@drpatelh
Member Author

I'm seeing a significant speed-up in the differential accessibility analysis if I parallelise and allocate 6 cores to DESeq2. A previous run of the nf-core/chipseq pipeline on an in-house dataset with 907450 consensus intervals across 60 samples failed to complete because it exceeded our maximum wall-time limit of 72 hours: DESeq2 model building took ~15 hours, and extracting the results for each possible pairwise comparison took ~20 minutes each. With the updated implementation in #84, model building now takes ~3 hours and extracting pairwise results takes ~2 minutes. This really is quite a big difference and resolves the original subject of this issue, which was to parallelise DESeq2.
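
To put the reported timings in perspective, here is a rough back-of-the-envelope sketch in Python. It uses only the approximate figures quoted in this comment; treating the ~2 minutes as a per-comparison time and applying the 72-hour limit to this single step are assumptions of the sketch, not something stated in the thread.

```python
CORES = 6              # cores allocated to DESeq2 in the comment above
WALL_TIME_HOURS = 72   # wall-time limit quoted above


def speedup(before: float, after: float) -> float:
    """Ratio of old runtime to new runtime (same units for both)."""
    return before / after


def comparisons_within_budget(model_hours: float, minutes_per_comparison: float) -> int:
    """Pairwise result extractions that fit after model building.

    Assumption: the whole wall-time budget is available to this one step.
    """
    remaining_minutes = (WALL_TIME_HOURS - model_hours) * 60
    return int(remaining_minutes // minutes_per_comparison)


model_speedup = speedup(15, 3)       # model building, hours
pairwise_speedup = speedup(20, 2)    # per pairwise comparison, minutes
model_efficiency = model_speedup / CORES

budget_before = comparisons_within_budget(15, 20)
budget_after = comparisons_within_budget(3, 2)

print(f"model building speed-up: {model_speedup:.1f}x "
      f"({model_efficiency:.0%} of ideal on {CORES} cores)")
print(f"pairwise extraction speed-up: {pairwise_speedup:.1f}x")
print(f"comparisons fitting in {WALL_TIME_HOURS} h: {budget_before} before, {budget_after} after")
```

The pairwise figure comes out above the core count, which suggests the quoted timings are rounded or that other changes in #84 also contributed, so these ratios are best read as rough indications.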

drpatelh changed the title from "Parallelize DESeq2" to "Implement limma for differential analysis on large datasets" on Feb 24, 2020
drpatelh changed the title from "Implement limma for differential analysis on large datasets" to "Parallelize DESeq2" on Feb 24, 2020
@drpatelh
Member Author

Going to close this issue and create a new one for the addition of limma.


3 participants