
Queue completed pileup chunk results, allowing out-of-sequence workers to continue #238

Merged: 2 commits into biod:master, Aug 3, 2016

Conversation

sambrightman (Collaborator)

We noticed that in general mpileup shows suboptimal thread utilisation
(issue #237). Ordering of output results is currently the responsibility
of worker threads, so many of them end up waiting on the condition variable
of ChunkDispatcher.

This patch closes #237 by:

  • Queueing results in a min heap.
  • Writing from the queue to file via a separate thread.
  • Allowing worker threads to process subsequent chunks immediately.
  • Moderately increasing memory usage for short queues.

We currently see that the output thread cannot keep up with the workers,
leading to large increases in memory and runtime due to queueing, even
though processing time is decreased and thread utilisation is full. Using
a lower thread count would of course alleviate this, but the cause is not
completely clear. This could be a separate issue.
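For orientation, here is a condensed sketch of the queue-plus-writer design described above. The Result/ResultQueue aliases and the result_queue_ declaration mirror the diff excerpt quoted just below; the mutex, condition variable, curr_num_ bookkeeping and writer loop are illustrative assumptions, not the exact sambamba code.

import std.container.array : Array;
import std.container.binaryheap : BinaryHeap;
import std.typecons : Tuple;
import core.sync.mutex : Mutex;
import core.sync.condition : Condition;

alias Tuple!(size_t, "num", char[], "data") Result;
alias Array!(Result) ResultQueue;

class OrderedResultQueue {
    private BinaryHeap!(ResultQueue, "a > b") result_queue_; // min heap: smallest chunk number on top
    private Mutex queue_mutex_;
    private Condition queue_condition_;
    private size_t curr_num_;   // number of the next chunk due to be written
    private bool finished_;

    this() {
        queue_mutex_ = new Mutex();
        queue_condition_ = new Condition(queue_mutex_);
        result_queue_ = BinaryHeap!(ResultQueue, "a > b")(ResultQueue());
    }

    // Worker threads hand off a finished chunk and immediately return to processing.
    void queueResult(size_t num, char[] data) {
        synchronized (queue_mutex_) {
            result_queue_.insert(Result(num, data));
            queue_condition_.notify();
        }
    }

    // Called once after the last chunk has been queued.
    void finish() {
        synchronized (queue_mutex_) {
            finished_ = true;
            queue_condition_.notify();
        }
    }

    // Runs on the dedicated output thread: pop chunks in order and write them.
    void dumpResults(void delegate(const(char)[]) sink) {
        while (true) {
            Result next;
            synchronized (queue_mutex_) {
                while (result_queue_.empty || result_queue_.front.num != curr_num_) {
                    if (finished_ && result_queue_.empty)
                        return;
                    queue_condition_.wait();
                }
                next = result_queue_.front;
                result_queue_.removeFront();
                ++curr_num_;
            }
            sink(next.data); // write outside the lock so workers are never blocked on I/O
        }
    }
}

Keeping the actual write outside the lock is what lets the workers run at full utilisation; it is also why, as discussed further down, a slow output can let the queue grow without bound.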


alias Tuple!(size_t, "num", char[], "data") Result;
alias Array!(Result) ResultQueue;
private BinaryHeap!(ResultQueue, "a > b") result_queue_;
Contributor

"a.num > b.num" would be less confusing.

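For reference, the suggested spelling only changes the comparison predicate; since num is the first field of the Tuple, the ordering stays effectively the same, but the intent becomes explicit. A hypothetical one-line rewrite, not the committed code:

// Min heap ordered explicitly by chunk number rather than by whole-tuple comparison.
private BinaryHeap!(ResultQueue, "a.num > b.num") result_queue_;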
@lomereiter (Contributor)

Thanks, looks good.

What I worry about is relying on output to be fast enough. An alternative approach could be:

  • keep track of allocated memory (sum of data.length) in queueResult and dumpResults
  • make queueResult return bool, refusing to queue if the amount of used memory exceeds a threshold
  • introduce another condition variable dump_condition on which worker threads would wait to reattempt queuing
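A rough sketch of how that alternative could look, reusing the shape of the sketch earlier in this thread (queued_bytes_, max_queued_bytes_ and chunkWritten are illustrative assumptions; only dump_condition comes from the proposal itself):

import std.container.array : Array;
import std.container.binaryheap : BinaryHeap;
import std.typecons : Tuple;
import core.sync.mutex : Mutex;
import core.sync.condition : Condition;

alias Tuple!(size_t, "num", char[], "data") Result;
alias Array!(Result) ResultQueue;

class MemoryCappedQueue {
    private BinaryHeap!(ResultQueue, "a > b") result_queue_;
    private Mutex queue_mutex_;
    private Condition queue_condition_; // output thread waits here for new chunks
    private Condition dump_condition_;  // refused workers wait here until memory is freed
    private size_t queued_bytes_;       // sum of data.length currently held in the queue
    private size_t max_queued_bytes_;
    private size_t curr_num_;           // next chunk number due for output

    this(size_t max_queued_bytes) {
        queue_mutex_ = new Mutex();
        queue_condition_ = new Condition(queue_mutex_);
        dump_condition_ = new Condition(queue_mutex_);
        result_queue_ = BinaryHeap!(ResultQueue, "a > b")(ResultQueue());
        max_queued_bytes_ = max_queued_bytes;
    }

    // Refuses the chunk when the memory budget is exhausted, unless it is the chunk
    // the output thread is waiting for (num == curr_num_); without that exception the
    // queue could never drain, as a later review comment in this thread points out.
    bool queueResult(size_t num, char[] data) {
        synchronized (queue_mutex_) {
            if (queued_bytes_ + data.length > max_queued_bytes_ && num != curr_num_)
                return false; // caller waits on dump_condition_ and retries
            queued_bytes_ += data.length;
            result_queue_.insert(Result(num, data));
            queue_condition_.notify();
            return true;
        }
    }

    // Output-thread bookkeeping after a chunk has been written out.
    void chunkWritten(size_t bytes) {
        synchronized (queue_mutex_) {
            queued_bytes_ -= bytes;
            ++curr_num_;
            dump_condition_.notifyAll(); // let refused workers reattempt queueing
        }
    }
}

The merged change ends up bounding the queue by length rather than by bytes (see the diff further down), which the author suggests achieves a similar effect because it scales with the selected buffer size and thread count.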

@sambrightman (Collaborator, Author)

Yes. Although this appears to be a more scalable setup, and writing immediately to disk doesn't seem to be a problem, in practice anyone piping into e.g. bgzip straight afterwards could see problems (and indeed, in my tests a slow output pipeline appears to make the issue #237 behaviour worse). There is a trivial workaround, though: reduce the thread count to the effective worker thread count from before (e.g. we had 12 threads with only 4 or 5 active, so reduce to 5). Still, your idea of checking the queue size may be a good way to prevent this hurting people by accident - perhaps with a message indicating that this likely means the output needs optimising?

@lomereiter (Contributor)

From an interactive-use point of view, the workaround is trivial indeed. However, the tool (though not pileup yet) is often used in automated setups such as bcbio-nextgen or SpeedSeq, where some inputs behave better than others, and tweaking the setup to work reliably is non-trivial unless some guarantees are provided by the underlying tools. Keeping memory usage under control, even approximately, helps a lot.

@sambrightman (Collaborator, Author)

Style issues are addressed. I implemented the memory usage check slightly differently: it checks indirectly via the queue length, which is simpler and scales somewhat automatically with the selected buffer size and thread count.

Existing users with slow output will see little difference in performance but will get log messages indicating that their output is too slow.

@@ -424,8 +428,12 @@ class ChunkDispatcher(ChunkRange) {

void queueResult(size_t num, char[] data) {
synchronized(queue_mutex_) {
while(result_queue_.length > max_queue_length_) {
@lomereiter (Contributor) commented Aug 3, 2016

I gave it a test run, and it got stuck here. The condition should also have a clause && num != curr_num_; otherwise the chunk that the output thread is waiting for can itself be refused, and the queue never drains.

Allowing worker threads to offload completed chunks to an output thread
raises the possibility that a slow output (a slow device, piping into slow
bgzip, etc.) can grow the queue indefinitely. If this continues for a large
input, memory usage will grow extremely large.

It is possible to eliminate this effect by reducing the number of worker
threads or improving the output speed (e.g. pbgzip), but it is better to
log and prevent excessive queueing than to allow it to continue
indefinitely.

This change caps the number of queued chunks at 2 * the number of threads.
@sambrightman (Collaborator, Author)

Good catch. I ran a test with a hard-coded max length of 1 and didn't think I saw anything wrong, but looking back at the logs it seems it did happen once.

Fixed as you suggest, and changed the inequality to >= for readability.
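Putting the two review points together, the wait loop in queueResult presumably ends up along these lines. This is a reconstruction from the diff hunk and the comments above, not a quote of the merged code; the log text and helper names are illustrative, and the 2 * threads cap comes from the commit message.

import std.container.array : Array;
import std.container.binaryheap : BinaryHeap;
import std.typecons : Tuple;
import core.sync.mutex : Mutex;
import core.sync.condition : Condition;
import std.stdio : stderr;

alias Tuple!(size_t, "num", char[], "data") Result;
alias Array!(Result) ResultQueue;

class LengthCappedQueue {
    private BinaryHeap!(ResultQueue, "a > b") result_queue_;
    private Mutex queue_mutex_;
    private Condition dump_condition_;  // workers wait here for the queue to shrink
    private size_t max_queue_length_;   // 2 * thread count, per the commit message
    private size_t curr_num_;           // next chunk number due for output

    this(size_t threads) {
        queue_mutex_ = new Mutex();
        dump_condition_ = new Condition(queue_mutex_);
        result_queue_ = BinaryHeap!(ResultQueue, "a > b")(ResultQueue());
        max_queue_length_ = 2 * threads;
    }

    void queueResult(size_t num, char[] data) {
        synchronized (queue_mutex_) {
            if (result_queue_.length >= max_queue_length_ && num != curr_num_)
                stderr.writeln("output cannot keep up with workers"); // illustrative log text
            // Block while the queue is full, but never block the chunk the output
            // thread is waiting for (num == curr_num_); otherwise the queue could
            // never drain and the pipeline would deadlock.
            while (result_queue_.length >= max_queue_length_ && num != curr_num_) {
                dump_condition_.wait();
            }
            result_queue_.insert(Result(num, data));
            // (notification of the output thread is omitted here; see the earlier sketch)
        }
    }

    // Output-thread bookkeeping after a chunk has been written.
    void resultWritten() {
        synchronized (queue_mutex_) {
            ++curr_num_;
            dump_condition_.notifyAll();
        }
    }
}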

@lomereiter merged commit 5819865 into biod:master on Aug 3, 2016
@lomereiter (Contributor)

Ok, things look good now. Thanks for the contribution!

@sambrightman (Collaborator, Author)

sambrightman commented Aug 3, 2016

Word of warning: whilst initially testing, I was verifying that every run completed with the same pileup result, i.e. that each pbgzip run produced the same pileup after decompression (sha256sum and/or cmp). At some point this stopped working, possibly due to the memory-limiting change. I'm going to do another run now without that modification.

I was mistaken above; this doesn't seem relevant to this PR. I do get a different pileup on every sambamba mpileup run - even without this PR - but I will file a separate issue if that turns out to be a bug.

@sambrightman (Collaborator, Author)

Is there a criterion for getting a release for such a change?

@sambrightman deleted the mpileupthreading branch on November 18, 2016.
Successfully merging this pull request may close: suboptimal thread utilisation by mpileup (#237)

2 participants