Parallelize DESeq2 #79
Comments
As suggested by @mikelove on Twitter, it's worth looking into using limma on the CQN-normalised data for a speed-up. See the F1000 paper. The implementation should be quite trivial; it's just a case of figuring out how! Contributions/thoughts welcome.
Here's another pointer: Also CC @kauralasoo, whose paper that was (the data generation and processing).
Thanks @mikelove for cc'ing me. Yes, we ran into the same issue: DESeq2 was a bit too slow when testing for differential accessibility of 300,000 features across 64 samples. In the paper, I decided to use limma voom for the differential accessibility analysis, but I did not benchmark it against cqn normalisation + lmFit. Here is the limma voom code: https://github.com/kauralasoo/macrophage-gxe-study/blob/master/ATAC/DA/clusterPeaks.R

I used cqn normalisation for the chromatin accessibility QTL analysis, where we tested up to 5000 genetic variants around each feature, as this can be done efficiently with fast linear model implementations such as MatrixEQTL or QTLtools. On our dataset cqn seemed to work better than log(TPM), but I vaguely remember other people having some issues with it on other datasets, so I would not dare to recommend it as the default without testing it on a few datasets.

Cqn requires feature GC content as a covariate. I calculated this based on the reference genome and peak coordinates using bedtools nuc:
Thanks @kauralasoo! cc'ing @macroscian here, who is our in-house stats guru 😎

The pipeline currently generates raw read counts (featureCounts) and then processes them with DESeq2 based on a user-specified design. Ideally, the user would run the pipeline at the experiment level, because the DESeq2 model is fitted once across all samples in the design. Also, it doesn't make sense to create a consensus set of intervals across samples you don't want to compare. Unless I've missed something, I'm hoping the simplest solution is an independent script that starts from the raw read counts and uses limma voom to generate the differential intervals instead. I can then add that to the pipeline with an optional flag (e.g.

@mikelove I think we need to parallelise DESeq2 anyway. I am setting up some tests to add this to the pipeline, but I was wondering whether you had an idea of what sort of speed-up can be attained? In the Line 1316 in 34ed69b
as defined here: Lines 28 to 32 in 34ed69b
I realise this may depend on the input data, but is there an upper limit beyond which additional cores give minimal gain?
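For reference, the kind of independent limma voom script described above might look roughly like the following. This is only a minimal sketch, not the pipeline's or the paper's actual code: the simulated count matrix, the sample names, and the two-group `group` factor are all placeholders standing in for the featureCounts output and the user-specified design.

```r
library(edgeR)  # DGEList(), calcNormFactors()
library(limma)  # voom(), lmFit(), eBayes(), topTable()

set.seed(1)

# Placeholder data: 1000 intervals x 6 samples of simulated counts.
# In practice this would be the raw featureCounts matrix.
counts <- matrix(
  rnbinom(6000, mu = 50, size = 10),
  nrow = 1000,
  dimnames = list(paste0("interval_", 1:1000), paste0("sample_", 1:6))
)

# Hypothetical two-group design (3 control vs 3 treated samples).
group <- factor(rep(c("control", "treated"), each = 3))
design <- model.matrix(~ group)

dge <- DGEList(counts = counts)
dge <- calcNormFactors(dge)      # TMM normalisation factors
v <- voom(dge, design)           # log2-CPM values with precision weights
fit <- eBayes(lmFit(v, design))  # per-interval linear model, moderated statistics

# Full results table for the treated-vs-control coefficient.
top <- topTable(fit, coef = "grouptreated", number = Inf)
```

Because limma fits all intervals with linear models in one pass, this route avoids the per-interval iterative fitting that makes DESeq2 slow on very large interval sets.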
There are a number of threads on the Bioc site about the speedup attainable with BPPARAM. The gist is: it works fine on my end (I usually use 4-8 workers and attain only a fraction of the ideal gain for that number of workers because of overhead, maybe 50% of the optimal speedup). What usually happens "in the wild" is that people request dozens of cores across different nodes, which gets bogged down in memory transfer, and they end up with performance worse than if they had just used
I'm seeing a significant speed-up in the differential accessibility analysis if I parallelise and allocate 6 cores to DESeq2. A previous run of the
Going to close this issue and create a new one for the addition of limma.
It should be possible to add another parameter to the differential accessibility script specifying the number of cores in order to parallelize DESeq2: http://bioconductor.org/packages/devel/bioc/vignettes/DESeq2/inst/doc/DESeq2.html#using-parallelization
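As a rough illustration of what that could look like inside the script, here is a minimal sketch using the `parallel` and `BPPARAM` arguments of `DESeq()` together with BiocParallel. The `cores` variable is hypothetical (it stands in for whatever the new script parameter ends up being called), and the simulated dataset is only a placeholder for the real count matrix.

```r
library(DESeq2)
library(BiocParallel)

# Hypothetical value; in the script this would come from the new parameter.
cores <- 4

# Placeholder data: simulated counts instead of the featureCounts matrix.
dds <- makeExampleDESeqDataSet(n = 2000, m = 8)

if (cores > 1) {
  # MulticoreParam forks workers on a single machine
  # (forking is not supported on Windows).
  dds <- DESeq(dds, parallel = TRUE, BPPARAM = MulticoreParam(workers = cores))
} else {
  # Fall back to the default serial fit.
  dds <- DESeq(dds)
}

res <- results(dds)
```

Keeping all workers on one node sidesteps the cross-node memory transfer that tends to erase the gain when many cores are requested.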
Probably also worth adding an option just to skip this step e.g.
--skip_differential_analysis