Optimal parallelisation: -p N / --multicore N with bowtie2 #96

Closed

sklages opened this issue Mar 15, 2017 · 13 comments

Comments

@sklages

sklages commented Mar 15, 2017

Hi,

I am playing around with bismark using a few large human WGBS datasets.

Setup:

  • a few 80-core servers with 1 TB RAM
  • WGBS human datasets, Illumina PE125
  • Bismark Version: v0.17.1_dev
  • bowtie2 2.2.9

I am wondering which is the "optimal" combination of -p N / --multicore N.

No matter which value for -p I use, bowtie2 always runs on a single core.
Using --multicore N spawns N (bowtie2) processes, just like it should.

When I use --multicore 8 -p 4 I would expect 8 bismark/bowtie2 processes, each running on 4 cores (32 cores in total). But instead it runs 8 bowtie2 processes, each on a single core.
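For reference, a sketch of the kind of invocation I mean (the genome and FastQ paths below are placeholders, not my actual files):

```bash
# Hypothetical call combining both options:
#   --multicore 8  -> 8 parallel Bismark/Bowtie2 instances
#   -p 4           -> 4 Bowtie2 threads per instance (so 32 cores expected in total)
bismark --genome /path/to/GRCh38 --multicore 8 -p 4 \
        -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o wgbs_out/
```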

So obviously there is something wrong with my understanding of -p N / --multicore N.

What about using --multicore 70 together with -p 1? I could imagine that I/O would become a problem with all the splitting/joining of the data.

Any ideas?

best,
Sven

@avilella

avilella commented Mar 15, 2017 via email

@sklages
Author

sklages commented Mar 15, 2017

In my case, no. With --multicore 8 I get 8 processes on 8 CPUs, each at 100%, i.e. a single thread/core per process.

--multicore 8 -p 1 and --multicore 8 -p 40 look the same to me: 8 processes, each using one CPU (100%).

@FelixKrueger
Owner

As a general rule -p should only govern how fast Bowtie2 finds hits in an individual thread, whereas --multi N should be more or less equivalent to running N instances of Bismark in parallel. We personally only use the default for -p (which is 1) because there have been reports that increasing it doesn't lead to a linear speed increase, and using -p > 4 can actually start creating too much overhead which may then result in the mapping speed going down somewhat again.

I just ran a quick test over here with -p 3:

[screenshot: top output for -p 3]

You can see that this spawns 2 Bowtie2 threads, each using 300% CPU (you might have to wait until all instances have been spawned and are running fully).

Here is a test with -p 3 --multi 3:

[screenshot: top output for -p 3 --multi 3]

This spawns 6 Bowtie2 threads ((OT+OB) * 3), each using 300% (from -p 3), so I'm afraid it seems to work as intended over here. Note that in this mode Bismark is already using around 21 cores (plus a few extra ones for gzip and samtools streams) and ~100GB of RAM, so using --multi 70 would almost certainly not be a very good idea :)

I would personally either go with what @avilella said and ignore -p altogether, or if you feel you want to use -p then stay under the limit of 4. --multi should be a near linear speed increase so this is what I would use. As an approximation --multi 10 should maybe cut down the time to ~15% of a single instance, and use around 30 cores at 100% (+ gzip/samtools streams but at low CPU usage) and 120GB of RAM for the human genome.
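In practice that boils down to something along these lines (paths are placeholders; -p is simply left at its default of 1):

```bash
# Sketch of the recommended setup: scale with --multicore, leave -p alone.
# For --multicore 10 on a human genome, expect roughly 30 busy cores and ~120 GB of RAM
# (plus gzip/samtools streams at low CPU usage).
bismark --genome /path/to/GRCh38 --multicore 10 \
        -1 sample_R1.fastq.gz -2 sample_R2.fastq.gz -o wgbs_out/
```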

I hope this clears things up a little?

@sklages
Author

sklages commented Mar 15, 2017

Thanks for the info. That's how I understood the docs. I am just a bit confused that bowtie2 always spawns a single thread (100% CPU load), regardless of which value I choose for -p.
This is what --multicore 8 -p 4 looks like here:
[screenshot: top output for --multicore 8 -p 4]

I will then play around with --multicore as you recommended and leave -p at its default.
best,
Sven

@ewels
Contributor

ewels commented Mar 15, 2017

@FelixKrueger - a suggestion / request.. I personally find the term --multicore kind of confusing, as the name sounds like it sets the number of cores to use. But that's not what it's doing. Could it be renamed --parallel or something instead? Also, could a table be added to the Bismark docs showing the typical CPU and Memory usage (Human WGBS) for values 1-10? This would make it much easier to know how to use this feature 😁

@FelixKrueger
Owner

FelixKrueger commented Mar 15, 2017

@ewels - I have added/changed the name of --multicore to --parallel so that you don't have to be confused anymore, and updated the --help text as well (c8954fd). --multicore continues to work as before, so this change shouldn't break anything. I can work on such a table when I've got my next free afternoon :)

@FelixKrueger
Owner

@sklages - The screenshots above seem to indicate that both --parallel and -p are working as intended, so it might either be something related to your system or, probably more likely, a matter of how your top is presenting the threads. It appears that top is presenting threads in a hierarchical manner, so maybe something is 'being swallowed' somewhere? If you take a small test data set, is there any difference in time at all if you run -p as 1, 2, 3 or 4? Ultimately I don't think this is anything I could fix; it would rather be a Bowtie2 issue...
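Something along these lines would do as a quick check (file names are placeholders):

```bash
# Compare wall-clock time for -p 1..4 on a small test data set
for P in 1 2 3 4; do
    /usr/bin/time -v bismark --genome /path/to/GRCh38 -p "$P" \
        -1 test_R1.fastq.gz -2 test_R2.fastq.gz -o p${P}_test/
done
```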

@avilella

avilella commented Mar 15, 2017 via email

@FelixKrueger
Owner

The calculations in your spreadsheet look fine to me!

@sklages
Author

sklages commented Mar 15, 2017

So, if I got it right: roughly 10 GB per hg38 Bismark instance, running at least 2 (directional) or 4 (non-directional) bowtie2 threads, plus some decompression (gzip) at the beginning and samtools threads.
That is something I can live with 👍 .

@FelixKrueger
Owner

Indeed, and then the Bismark thread itself, which orchestrates all the threads and does the methylation calling etc., will add 1 core (at 100%) and 1 copy of the reference sequence in memory (so roughly 12 GB for a human genome in total). Glad this helped reduce the confusion.
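Putting the numbers from this thread together, a rough back-of-the-envelope estimate for --parallel N on a human genome (assuming directional libraries and the default -p) would be:

```bash
# Rough resource estimate based on the figures above (assumption: directional, default -p)
N=10                   # value passed to --parallel / --multicore
CORES=$(( N * 3 ))     # per instance: 2 Bowtie2 threads (OT + OB) + 1 Bismark thread
RAM_GB=$(( N * 12 ))   # per instance: ~10 GB for the alignment threads + ~2 GB reference copy
echo "--parallel ${N}: ~${CORES} cores at 100%, ~${RAM_GB} GB RAM (plus gzip/samtools streams)"
```

For --parallel 10 this comes out at roughly 30 cores and 120 GB, in line with the figures quoted above.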

@sklages
Author

sklages commented Mar 15, 2017

Yes, thanks, this helped a lot 👍 .

sklages closed this as completed Mar 15, 2017
@ewels
Contributor

ewels commented Mar 15, 2017

Brilliant, thanks for renaming it @FelixKrueger! 😁 The table would still be nice too 😉

roryk added a commit to bcbio/bcbio-nextgen that referenced this issue Sep 14, 2020
This implements the suggestions for speeding up bismark here:

FelixKrueger/Bismark#96

They suggest leaving -p (the number of threads for bowtie) at the default, and spinning up more instances of bismark with --parallel instead. This should be a more efficient use of resources.