Optimal parallelisation: -p N / --multicore N with bowtie2 #96

Closed

sklages opened this issue Mar 15, 2017 · 13 comments

Comments

@sklages

sklages commented Mar 15, 2017

Hi,

I am playing around with bismark using a few large human WGBS datasets.

Setup:

  • a few 80-core servers with 1 TB RAM
  • WGBS human datasets, Illumina PE125
  • Bismark Version: v0.17.1_dev
  • bowtie2 2.2.9

I am wondering which is the "optimal" combination of -p N / --multicore N.

No matter which value for -p I use, bowtie2 always runs on a single core.
Using --multicore N spawns N (bowtie2) processes, just like it should.

When I use --multicore 8 -p 4 I would expect 8 bismark/bowtie2 processes, each running on 4 cores (32 cores in total). But instead it runs 8 bowtie2 processes, each on a single core.
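For reference, a sketch of the kind of invocation I mean (the genome and FastQ paths below are placeholders, not my actual files):

```bash
# Hypothetical call combining both options:
#   --multicore 8  -> 8 parallel Bismark/Bowtie2 instances
#   -p 4           -> 4 Bowtie2 threads per instance (so 32 cores expected in total)
bismark --genome /path/to/GRCh38 --multicore 8 -p 4 \
        -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o wgbs_out/
```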

So obviously there is something wrong with my understanding of -p N / --multicore N.

What about using --multicore 70 together with -p 1? I could imagine that I/O would become a problem with all the splitting/joining of the data.

Any ideas?

best,
Sven

@avilella

avilella commented Mar 15, 2017 via email

@sklages
Author

sklages commented Mar 15, 2017

In my case, no. With --multicore 8 I get 8 processes on 8 CPUs, each at 100%, i.e. a single thread/core per process.

--multicore 8 -p 1 and --multicore 8 -p 40 look the same to me: 8 processes, each using one CPU (100%).

@FelixKrueger
Owner

As a general rule -p should only govern how fast Bowtie2 finds hits in an individual thread, whereas --multi N should be more or less equivalent to running N instances of Bismark in parallel. We personally only use the default for -p (which is 1) because there have been reports that increasing it doesn't lead to a linear speed increase, and using -p > 4 can actually start creating too much overhead which may then result in the mapping speed going down somewhat again.

I just ran a quick test over here with -p 3:

[screenshot: top output for -p 3]

You can see that this spawns 2 Bowtie2 threads, each using 300% CPU (you might have to wait until all instances have been spawned and are running fully).

Here is a test with -p 3 --multi 3:

[screenshot: top output for -p 3 --multi 3]

This spawns 6 Bowtie2 threads ((OT+OB) * 3), each using 300% (from -p 3), so I'm afraid it seems to work as intended over here. Note that in this mode Bismark is already using around 21 cores (plus a few extra ones for gzip and samtools streams) and ~100GB of RAM, so using --multi 70 would almost certainly not be a very good idea :)

I would personally either go with what @avilella said and ignore -p altogether, or if you feel you want to use -p then stay under the limit of 4. --multi should be a near linear speed increase so this is what I would use. As an approximation --multi 10 should maybe cut down the time to ~15% of a single instance, and use around 30 cores at 100% (+ gzip/samtools streams but at low CPU usage) and 120GB of RAM for the human genome.
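In practice that boils down to something along these lines (paths are placeholders; -p is simply left at its default of 1):

```bash
# Sketch of the recommended setup: scale with --multicore, leave -p alone.
# For --multicore 10 on a human genome, expect roughly 30 busy cores and ~120 GB of RAM
# (plus gzip/samtools streams at low CPU usage).
bismark --genome /path/to/GRCh38 --multicore 10 \
        -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o wgbs_out/
```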

I hope this clears things up a little?

@sklages
Author

sklages commented Mar 15, 2017

Thanks for the info. That's how I understood the docs. I am just a bit confused that bowtie2 always spawns a single thread (100% CPU load), regardless of which value I choose for -p.
This is what --multicore 8 -p 4 looks like here:
[screenshot: top output for --multicore 8 -p 4]

I will then play around with --multicore as you recommended and leave -p at its default.
best,
Sven

@ewels
Contributor

ewels commented Mar 15, 2017

@FelixKrueger - a suggestion / request.. I personally find the term --multicore kind of confusing, as the name sounds like it sets the number of cores to use. But that's not what it's doing. Could it be renamed --parallel or something instead? Also, could a table be added to the Bismark docs showing the typical CPU and Memory usage (Human WGBS) for values 1-10? This would make it much easier to know how to use this feature 😁

@FelixKrueger
Owner

FelixKrueger commented Mar 15, 2017

@ewels - I have added/changed the name of --multicore to --parallel so that you don't have to be confused anymore, and updated the --help text as well (c8954fd). --multicore continues to work as before, so this change shouldn't break anything. I can work on such a table when I've got my next free afternoon :)

@FelixKrueger
Owner

@sklages - The screenshots above seem to indicate that both --parallel and -p are working as intended, so it might either be something related to your system or, probably more likely, a matter of how your top is presenting the threads. It appears that top is presenting threads in a hierarchical manner, so maybe something is 'being swallowed' somewhere? If you take a small test data set, is there any difference in time at all if you run -p as 1, 2, 3 or 4? Ultimately I don't think this is anything I could fix; it would rather be a Bowtie2 issue...
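Something along these lines would do as a quick check (file names are placeholders):

```bash
# Compare wall-clock time for -p 1..4 on a small test data set
for P in 1 2 3 4; do
    /usr/bin/time -v bismark --genome /path/to/GRCh38 -p "$P" \
        -1 test_R1.fastq.gz -2 test_R2.fastq.gz -o p${P}_test/
done
```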

@avilella

avilella commented Mar 15, 2017 via email

@FelixKrueger
Owner

The calculations in your spreadsheet look fine to me!

@sklages
Author

sklages commented Mar 15, 2017

So, if I got it right: roughly 10 GB per hg38 Bismark instance, running at least 2 (directional) or 4 (non-directional) bowtie2 threads, plus some decompression (gzip) at the beginning and samtools threads.
That is something I can live with 👍 .

@FelixKrueger
Owner

Indeed, and then the Bismark thread itself, which orchestrates all the threads and does the methylation calling etc., will add 1 core (at 100%) and 1 copy of the reference sequence in memory (so roughly 12 GB for a human genome in total). Glad this helped reduce the confusion.
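Putting the numbers from this thread together, a rough back-of-the-envelope estimate for --parallel N on a human genome (assuming directional libraries and the default -p) would be:

```bash
# Rough resource estimate based on the figures above (assumption: directional, default -p)
N=10                   # value passed to --parallel / --multicore
CORES=$(( N * 3 ))     # per instance: 2 Bowtie2 threads (OT + OB) + 1 Bismark thread
RAM_GB=$(( N * 12 ))   # per instance: ~10 GB for the alignment threads + ~2 GB reference copy
echo "--parallel ${N}: ~${CORES} cores at 100%, ~${RAM_GB} GB RAM (plus gzip/samtools streams)"
```

For --parallel 10 this comes out at roughly 30 cores and 120 GB, in line with the figures quoted above.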

@sklages
Author

sklages commented Mar 15, 2017

Yes, thanks, this helped a lot 👍 .

sklages closed this as completed Mar 15, 2017
@ewels
Contributor

ewels commented Mar 15, 2017

Brilliant, thanks for renaming it @FelixKrueger! 😁 The table would still be nice too 😉

roryk added a commit to bcbio/bcbio-nextgen that referenced this issue Sep 14, 2020
This implements the suggestions for speeding up bismark here:

FelixKrueger/Bismark#96

They suggest leaving -p (the number of threads for bowtie) at the default, and spinning up more instances of bismark with --parallel instead. This should be a more efficient use of resources.