suboptimal thread utilisation by mpileup #237
Comments
Hi, please submit the PR as it is, I'll review it on the weekend. Output of samtools is lz4-compressed in order to reduce memory consumption, mostly for per-base pileup output. As it's one of the fastest compression algorithms (~1 GB/s), I don't expect it to have a large influence on performance.
…to continue

We noticed that in general mpileup shows suboptimal thread utilisation (issue biod#237). Ordering of output results is currently the responsibility of worker threads, so many of them are waiting on the condition variable of `ChunkDispatcher`. This patch closes biod#237 by:

* queueing results in a min heap,
* writing from the queue to file via a separate thread,
* allowing worker threads to process subsequent chunks immediately,
* at the cost of a moderate increase in memory usage for short queues.

We currently see that the output thread cannot keep up with the workers, leading to large increases in memory and runtime due to queueing, even though processing time is decreased and thread utilisation is full. Using a lower thread count would of course alleviate this, but the cause is not completely clear. This could be a separate issue.
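One conceivable way to keep memory bounded when the output thread falls behind is to cap the number of completed-but-unwritten chunks and block workers once the cap is reached. The sketch below is *not* part of the submitted patch and uses hypothetical names (`PendingChunkLimiter`, `maxPending`); it only illustrates the backpressure idea in D:

```d
// Hypothetical sketch, not part of the sambamba patch.
import core.sync.condition : Condition;
import core.sync.mutex : Mutex;

/// Caps the number of chunks that have been computed but not yet written.
/// Workers call acquire() before handing a chunk to the output queue and
/// the output thread calls release() after writing one to disk.
class PendingChunkLimiter {
    private Mutex mtx;
    private Condition freed;
    private size_t pending;
    private immutable size_t maxPending;

    this(size_t maxPending) {
        this.maxPending = maxPending;
        mtx = new Mutex();
        freed = new Condition(mtx);
    }

    /// Blocks while maxPending chunks are already queued for output.
    void acquire() {
        synchronized (mtx) {
            while (pending >= maxPending)
                freed.wait();
            ++pending;
        }
    }

    /// Frees a slot and wakes one blocked worker.
    void release() {
        synchronized (mtx) {
            --pending;
            freed.notify();
        }
    }
}
```

The same effect could be had with a `core.sync.semaphore.Semaphore` initialised to the cap; the explicit mutex/condition form is shown only to make the blocking behaviour visible.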
Submitted code is still undergoing a full run to verify that the writer thread is properly joined etc., so that part is untested. I don't understand the … During normal processing, I noticed the buffer sizes in …
I've just noticed that we pipe the output to bgzip, which is using about 50% CPU whilst sambamba uses 50%, with occasional spikes from both. Maybe bgzip is the limiting factor (I think it uses a 64 KB block size, which is also the pipe buffer size?).
I didn't have any experience with … The buffer size is OK; increasing it might only hurt because CPU cache is limited.
I've confirmed that the output thread joins correctly and that removing bgzip from the pipeline results in fast output and low memory usage. Looks like the next step is investigating bgzip performance, trying pbgzip or avoiding it entirely.
We find that mpileup with `-t 12` often only has 3-6 subprocesses running, with a roughly equivalent load average (~4). I believe that I understand what is happening and have a draft PR ready which fixes it. However, for larger thread counts this change results in completed chunks queueing to be output, and thus suboptimal memory use and performance of the file output stage, so further investigation is required there.

Currently the responsibility for outputting results in the correct order lies partially with the worker itself, blocking further progress of that worker. Out-of-order chunk completions prevent workers from being utilised until the lowest-numbered chunk is completed. Idle threads are shown waiting on the condition variable signalled from `tryDump`.

I have refactored to put responsibility for chunk result output into `ChunkDispatcher`: maintain a binary heap of the chunk results and have a separate output thread for writing them. Workers can then proceed with new chunks as soon as they are finished with their current chunks. Load average goes up to around 12 and total time for processing (samtools) drops by over half.

I expected this to come at a small memory cost, on the order of `chunk size * thread count`. However, the output itself is now much slower than the processing. This seems to be in line with the sambamba philosophy of shifting the problem to I/O optimisation. However, I see some pathological behaviour: output of the queued chunks after processing completes takes much longer than expected.

Profiling the output thread with `perf` during `samtools` processing shows that most of the time is taken by LZ4 decompression. Is it correct that the output of samtools is lz4-compressed only to then be decompressed to file?

Profiling the output thread in the final stage, when no further `samtools` processing is being done (only writing queued results to disk), shows that the kernel functions `change_protection_range` and `vm_normal_page` are hot (90% the former). It's not clear to me what is going on here. It does not seem to be I/O or CPU bound (see my question here).

Perhaps I should submit the PR in its current state and have someone review it for concurrency correctness, lack of unnecessary object copies, and style? I have never written D before.
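For concreteness, here is a minimal, self-contained D sketch of the pattern described above: workers push results keyed by chunk number onto a min-heap, and a dedicated output thread pops and writes them strictly in order, so out-of-order completions no longer block the workers themselves. This is not the sambamba implementation; `ChunkResult`, `OrderedWriter`, and writing to stdout are simplifying assumptions.

```d
import core.sync.condition : Condition;
import core.sync.mutex : Mutex;
import core.thread : Thread;
import std.container : Array, BinaryHeap;
import std.stdio : stdout;

struct ChunkResult {
    size_t id;    // sequence number assigned when the chunk was dispatched
    char[] data;  // rendered pileup text for this chunk
}

/// Min-heap of finished chunks plus a dedicated thread that writes them
/// out in chunk-id order. Assumes ids are dense: 0, 1, 2, ...
class OrderedWriter {
    private alias Heap = BinaryHeap!(Array!ChunkResult, "a.id > b.id");
    private Mutex mtx;
    private Condition cond;
    private Heap heap;
    private size_t nextId;
    private bool finished;

    this() {
        mtx = new Mutex();
        cond = new Condition(mtx);
        heap = Heap(Array!ChunkResult());
    }

    /// Called by worker threads; returns immediately so the worker can
    /// start on its next chunk instead of waiting for in-order output.
    void push(ChunkResult r) {
        synchronized (mtx) {
            heap.insert(r);
            cond.notify();
        }
    }

    /// Called once all workers are done, so run() can drain and exit.
    void finish() {
        synchronized (mtx) {
            finished = true;
            cond.notify();
        }
    }

    /// Body of the output thread.
    void run() {
        while (true) {
            ChunkResult r;
            synchronized (mtx) {
                // Sleep until the next chunk in sequence is available,
                // or until finish() was called and everything is drained.
                while (!(finished && heap.empty)
                        && (heap.empty || heap.front.id != nextId))
                    cond.wait();
                if (heap.empty)
                    return;
                r = heap.front;
                heap.removeFront();
                ++nextId;
            }
            stdout.rawWrite(r.data);  // write outside the lock
        }
    }
}

void main() {
    auto writer = new OrderedWriter();
    auto outputThread = new Thread(&writer.run);
    outputThread.start();

    // Toy stand-in for workers finishing chunks out of order.
    writer.push(ChunkResult(1, "second\n".dup));
    writer.push(ChunkResult(0, "first\n".dup));
    writer.finish();
    outputThread.join();
}
```

Note that the heap here is unbounded, which mirrors the memory growth reported above when the output thread cannot keep up; combining it with a cap, such as the limiter sketched earlier in the thread, would bound the queue.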