-p/--threads option always uses 1 more thread than is specified. #465
Comments
Hello, as of bowtie2 v2.5.0 we moved to an async model (a dedicated thread) for processing I/O. The …
@ch4rr0 The extra CPU core of usage was introduced in commit 6f6458c (see line 1344 in 4cf8d52).
I benchmarked before and after that commit with `-p 12`: before, it used 12.2 cores; afterwards, it uses 13.1. I believe we should revert that commit.
For completeness, I had to revert 3 commits to avoid conflicts when testing:
I'm not sure whether there are any other side effects, but CPU utilization was definitely drastically reduced.
Hello Igor, we put this change in place after a user reported poor thread scaling, which was caused by contention on the condition variable. Why does the n+1 CPU utilization come as a surprise? As I recall, it was part of the async reads pull request. moodycamel is a concurrent, lock-free queue and should not spawn any additional threads on its own.
The async reads were added as a way to increase performance with minimal overhead. The number of CPU cores in a node is limited, so any CPU cycles wasted on polling are cycles that cannot be used for actual alignment work.
If condition variables really were a problem at large thread counts (can you point me to the related issue, please?) … @ch4rr0 I am willing to do the coding work if you agree with the direction.
This is the issue: #437. I am grateful that you are willing to do the work on this. If it's not too much to ask, could you also provide some performance numbers for future reference?
As a starting point, here are my benchmark numbers on the kind of problems relevant to my group (QIITA). The CPU is an AMD EPYC 7302 16-Core Processor. I am using the Wolka WolR1 DB (~40 GB of memory) and the … The command used for a single 16-core run is:
(aligning the same file 4 times, so I can compare with aligning them in parallel below) and for the 4x4-core runs I use:
With the current master, I get:
If I revert the moodycamel commits, I get:
Two takeaways:
@sfiligoi Many thanks for taking the time to work on this. This would give me a throughput boost of 20-30%.
When I run bowtie2 and specify the number of threads to use (e.g. `-p 1`, `-p 4`), CPU usage is always 1 core more than the number of threads specified. For example, if I specify `-p 1`:
`bowtie2 --no-unal -p 1 -x 'ref-genome-idx' -1 reads-R1_001.fastq.gz -2 reads-R2_001.fastq.gz`
the CPU usage of the process (via `htop`) is ~200%, indicating that 2 threads are in use, not 1. Likewise, for any other number of threads, it always uses n+1 CPU cores.
For `-p 4`:
`bowtie2 --no-unal -p 4 -x 'ref-genome-idx' -1 reads-R1_001.fastq.gz -2 reads-R2_001.fastq.gz`
CPU usage (via `htop`) is between 400% and 500% because I have other jobs running; on an otherwise idle system it would be ~500%, indicating that 5 CPU cores are in use.
This is effectively cutting my sample throughput in half, because when `bowtie2` is run with 1 thread specified, every job uses 2 threads instead. Is this a bug or a feature? Am I missing something?
Tested on:
bowtie2 --version
/opt/homebrew/bin/../Cellar/bowtie2/2.5.2/bin/bowtie2-align-s version 2.5.2
64-bit
Built on Ventura-arm64.local
Sat Oct 14 18:03:18 UTC 2023
Compiler: InstalledDir: /Applications/Xcode.app/Contents/Developer/Toolchains/XcodeDefault.xctoolchain/usr/bin
Options: -O3 -funroll-loops -g3 -std=c++11 -fopenmp-simd -DNO_SPINLOCK -DWITH_QUEUELOCK=1
Sizeof {int, long, long long, void*, size_t, off_t}: {4, 8, 8, 8, 8, 8}