
Queue completed pileup chunk results, allowing out-of-sequence workers to continue #238

Merged: 2 commits into biod:master, Aug 3, 2016

Conversation

sambrightman (Collaborator)

We noticed that in general mpileup shows suboptimal thread utilisation
(issue #237). Ordering of output results is currently the responsibility
of worker threads, so many of them end up waiting on the condition variable
of ChunkDispatcher.

This patch closes #237 by:

  • Queueing results in a min heap.
  • Writing from the queue to file via a separate thread.
  • Allowing worker threads to process subsequent chunks immediately.
  • Moderately increasing memory usage for short queues.

We currently see that the output thread cannot keep up with the workers,
leading to large increases in memory and runtime due to queueing, even
though processing time is decreased and thread utilisation is full. Using
a lower thread count would of course alleviate this, but the cause is not
completely clear. This could be a separate issue.
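For orientation, here is a condensed sketch of the queue-plus-writer design described above. The Result/ResultQueue aliases and the result_queue_ declaration mirror the diff excerpt quoted just below; the mutex, condition variable, curr_num_ bookkeeping and writer loop are illustrative assumptions, not the exact sambamba code.

import std.container.array : Array;
import std.container.binaryheap : BinaryHeap;
import std.typecons : Tuple;
import core.sync.mutex : Mutex;
import core.sync.condition : Condition;

alias Tuple!(size_t, "num", char[], "data") Result;
alias Array!(Result) ResultQueue;

class OrderedResultQueue {
    private BinaryHeap!(ResultQueue, "a > b") result_queue_; // min heap: smallest chunk number on top
    private Mutex queue_mutex_;
    private Condition queue_condition_;
    private size_t curr_num_;   // number of the next chunk due to be written
    private bool finished_;

    this() {
        queue_mutex_ = new Mutex();
        queue_condition_ = new Condition(queue_mutex_);
        result_queue_ = BinaryHeap!(ResultQueue, "a > b")(ResultQueue());
    }

    // Worker threads hand off a finished chunk and immediately return to processing.
    void queueResult(size_t num, char[] data) {
        synchronized (queue_mutex_) {
            result_queue_.insert(Result(num, data));
            queue_condition_.notify();
        }
    }

    // Called once after the last chunk has been queued.
    void finish() {
        synchronized (queue_mutex_) {
            finished_ = true;
            queue_condition_.notify();
        }
    }

    // Runs on the dedicated output thread: pop chunks in order and write them.
    void dumpResults(void delegate(const(char)[]) sink) {
        while (true) {
            Result next;
            synchronized (queue_mutex_) {
                while (result_queue_.empty || result_queue_.front.num != curr_num_) {
                    if (finished_ && result_queue_.empty)
                        return;
                    queue_condition_.wait();
                }
                next = result_queue_.front;
                result_queue_.removeFront();
                ++curr_num_;
            }
            sink(next.data); // write outside the lock so workers are never blocked on I/O
        }
    }
}

Keeping the actual write outside the lock is what lets the workers run at full utilisation; it is also why, as discussed further down, a slow output can let the queue grow without bound.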


alias Tuple!(size_t, "num", char[], "data") Result;
alias Array!(Result) ResultQueue;
private BinaryHeap!(ResultQueue, "a > b") result_queue_;
Contributor

"a.num > b.num" would be less confusing.

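For reference, the suggested spelling only changes the comparison predicate; since num is the first field of the Tuple, the ordering stays effectively the same, but the intent becomes explicit. A hypothetical one-line rewrite, not the committed code:

// Min heap ordered explicitly by chunk number rather than by whole-tuple comparison.
private BinaryHeap!(ResultQueue, "a.num > b.num") result_queue_;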
@lomereiter (Contributor)

Thanks, looks good.

What I worry about is relying on output to be fast enough. An alternative approach could be:

  • keep track of allocated memory (sum of data.length) in queueResult and dumpResults
  • make queueResult return bool, refusing to queue if the amount of used memory exceeds a threshold
  • introduce another condition variable dump_condition on which worker threads would wait to reattempt queuing
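A rough sketch of how that alternative could look, reusing the shape of the sketch earlier in this thread (queued_bytes_, max_queued_bytes_ and chunkWritten are illustrative assumptions; only dump_condition comes from the proposal itself):

import std.container.array : Array;
import std.container.binaryheap : BinaryHeap;
import std.typecons : Tuple;
import core.sync.mutex : Mutex;
import core.sync.condition : Condition;

alias Tuple!(size_t, "num", char[], "data") Result;
alias Array!(Result) ResultQueue;

class MemoryCappedQueue {
    private BinaryHeap!(ResultQueue, "a > b") result_queue_;
    private Mutex queue_mutex_;
    private Condition queue_condition_; // output thread waits here for new chunks
    private Condition dump_condition_;  // refused workers wait here until memory is freed
    private size_t queued_bytes_;       // sum of data.length currently held in the queue
    private size_t max_queued_bytes_;
    private size_t curr_num_;           // next chunk number due for output

    this(size_t max_queued_bytes) {
        queue_mutex_ = new Mutex();
        queue_condition_ = new Condition(queue_mutex_);
        dump_condition_ = new Condition(queue_mutex_);
        result_queue_ = BinaryHeap!(ResultQueue, "a > b")(ResultQueue());
        max_queued_bytes_ = max_queued_bytes;
    }

    // Refuses the chunk when the memory budget is exhausted, unless it is the chunk
    // the output thread is waiting for (num == curr_num_); without that exception the
    // queue could never drain, as a later review comment in this thread points out.
    bool queueResult(size_t num, char[] data) {
        synchronized (queue_mutex_) {
            if (queued_bytes_ + data.length > max_queued_bytes_ && num != curr_num_)
                return false; // caller waits on dump_condition_ and retries
            queued_bytes_ += data.length;
            result_queue_.insert(Result(num, data));
            queue_condition_.notify();
            return true;
        }
    }

    // Output-thread bookkeeping after a chunk has been written out.
    void chunkWritten(size_t bytes) {
        synchronized (queue_mutex_) {
            queued_bytes_ -= bytes;
            ++curr_num_;
            dump_condition_.notifyAll(); // let refused workers reattempt queueing
        }
    }
}

The merged change ends up bounding the queue by length rather than by bytes (see the diff further down), which the author suggests achieves a similar effect because it scales with the selected buffer size and thread count.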

@sambrightman (Collaborator, Author)

Yes. Although this appears to be a more scalable setup, and writing immediately to disk doesn't seem to be a problem, in practice anyone piping into e.g. bgzip straight afterwards could see problems (and indeed, in my tests a slow output pipeline appears to make the issue #237 behaviour worse). There is a trivial workaround, though: reduce the thread count to the effective worker thread count from before (e.g. we had 12 threads with only 4 or 5 active, so reduce to 5). Still, your idea of checking the queue size may be a good way to prevent this hurting people by accident - perhaps with a message indicating that this likely means the output needs optimising?

@lomereiter (Contributor)

From an interactive-use point of view, the workaround is trivial indeed. However, the tool (though not pileup yet) is often used in automated setups such as bcbio-nextgen or SpeedSeq, where some inputs behave better than others, and tweaking the setup to work reliably is non-trivial unless some guarantees are provided by the underlying tools. Keeping memory usage under control, even approximately, helps a lot.

@sambrightman (Collaborator, Author)

Style issues are addressed. I implemented the memory usage check slightly differently: it checks indirectly via the queue length, which is simpler and scales somewhat automatically with the selected buffer size and thread count.

Existing users with slow output will see little difference in performance but will get log messages indicating that their output is too slow.

@@ -424,8 +428,12 @@ class ChunkDispatcher(ChunkRange) {

void queueResult(size_t num, char[] data) {
synchronized(queue_mutex_) {
while(result_queue_.length > max_queue_length_) {
@lomereiter (Contributor) commented Aug 3, 2016

I gave it a test run, and it got stuck here. The condition should also have a clause && num != curr_num_; otherwise the chunk that the output thread is waiting for can itself be refused, and the queue never drains.

Allowing worker threads to offload completed chunks to an output thread
raises the possibility that a slow output (a slow device, piping into slow
bgzip, etc.) can grow the queue indefinitely. If this continues for a large
input, memory usage will grow extremely large.

It is possible to eliminate this effect by reducing the number of worker
threads or improving the output speed (e.g. pbgzip), but it is better to
log and prevent excessive queueing than to allow it to continue
indefinitely.

This change caps the number of queued chunks at 2 * the number of threads.
@sambrightman (Collaborator, Author)

Good catch. I ran a test with a hard-coded max length of 1 and didn't think I saw anything wrong, but looking back at the logs it seems it did happen once.

Fixed as you suggest, and changed the inequality to >= for readability.
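Putting the two review points together, the wait loop in queueResult presumably ends up along these lines. This is a reconstruction from the diff hunk and the comments above, not a quote of the merged code; the log text and helper names are illustrative, and the 2 * threads cap comes from the commit message.

import std.container.array : Array;
import std.container.binaryheap : BinaryHeap;
import std.typecons : Tuple;
import core.sync.mutex : Mutex;
import core.sync.condition : Condition;
import std.stdio : stderr;

alias Tuple!(size_t, "num", char[], "data") Result;
alias Array!(Result) ResultQueue;

class LengthCappedQueue {
    private BinaryHeap!(ResultQueue, "a > b") result_queue_;
    private Mutex queue_mutex_;
    private Condition dump_condition_;  // workers wait here for the queue to shrink
    private size_t max_queue_length_;   // 2 * thread count, per the commit message
    private size_t curr_num_;           // next chunk number due for output

    this(size_t threads) {
        queue_mutex_ = new Mutex();
        dump_condition_ = new Condition(queue_mutex_);
        result_queue_ = BinaryHeap!(ResultQueue, "a > b")(ResultQueue());
        max_queue_length_ = 2 * threads;
    }

    void queueResult(size_t num, char[] data) {
        synchronized (queue_mutex_) {
            if (result_queue_.length >= max_queue_length_ && num != curr_num_)
                stderr.writeln("output cannot keep up with workers"); // illustrative log text
            // Block while the queue is full, but never block the chunk the output
            // thread is waiting for (num == curr_num_); otherwise the queue could
            // never drain and the pipeline would deadlock.
            while (result_queue_.length >= max_queue_length_ && num != curr_num_) {
                dump_condition_.wait();
            }
            result_queue_.insert(Result(num, data));
            // (notification of the output thread is omitted here; see the earlier sketch)
        }
    }

    // Output-thread bookkeeping after a chunk has been written.
    void resultWritten() {
        synchronized (queue_mutex_) {
            ++curr_num_;
            dump_condition_.notifyAll();
        }
    }
}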

@lomereiter merged commit 5819865 into biod:master on Aug 3, 2016
@lomereiter (Contributor)

Ok, things look good now. Thanks for the contribution!

@sambrightman (Collaborator, Author)

sambrightman commented Aug 3, 2016

Word of warning: whilst initially testing, I was verifying that every run completed with the same pileup result, i.e. that each pbgzip run produced the same pileup after decompression (sha256sum and/or cmp). At some point this stopped working, possibly due to the memory-limiting change. I'm going to do another run now without that modification.

I was mistaken above; this doesn't seem relevant to this PR. I do get a different pileup on every sambamba mpileup run - even without this PR - but I will file a separate issue if that turns out to be a bug.

@sambrightman (Collaborator, Author)

Is there a criterion for getting a release for such a change?

@sambrightman deleted the mpileupthreading branch on November 18, 2016.
Successfully merging this pull request may close: suboptimal thread utilisation by mpileup (#237)

2 participants