suboptimal thread utilisation by mpileup #237

Closed
sambrightman opened this issue Jul 29, 2016 · 5 comments · Fixed by #238

Comments

@sambrightman
Collaborator

We find that mpileup with -t 12 often has only 3-6 subprocesses running, with a roughly equivalent load average of ~4. I believe I understand what is happening and have a draft PR ready which fixes it. However, for larger thread counts the change causes completed chunks to queue up waiting to be output, giving suboptimal memory use and performance in the file output stage, so further investigation is required there.

Currently, responsibility for outputting results in the correct order lies partly with the worker itself, which blocks further progress of that worker. Out-of-order chunk completions therefore prevent workers from being utilised until the lowest-numbered chunk is completed; idle threads can be seen waiting on the condition variable signalled from tryDump.

I have refactored to put responsibility for chunk output into ChunkDispatcher: maintain a binary heap of the chunk results and have a separate output thread write them out. Workers can then proceed with new chunks as soon as they finish their current ones. Load average goes up to around 12 and total processing time (samtools) drops by more than half.
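
To make the hand-off concrete, here is a minimal D sketch of the idea. The names (OrderedWriter, put, finish, run) are made up for illustration, not the actual ChunkDispatcher API, and an associative array keyed by chunk index stands in for the PR's binary heap: workers enqueue a finished chunk and return to work immediately, while one dedicated thread writes chunks strictly in index order.

```d
// Minimal sketch, not the real ChunkDispatcher code: workers hand finished
// chunks to a shared structure and continue; a dedicated writer thread emits
// chunks in order. An associative array keyed by chunk index replaces the
// binary heap used in the PR, purely to keep the example short.
import core.sync.mutex : Mutex;
import core.sync.condition : Condition;
import std.stdio : File;

final class OrderedWriter {
    private ubyte[][size_t] pending; // chunk index -> serialized output
    private size_t nextIndex;        // next chunk index to write
    private bool finished;
    private Mutex mtx;
    private Condition cond;

    this() {
        mtx = new Mutex();
        cond = new Condition(mtx);
    }

    // Called by a worker when its chunk is done: queue the result and move on,
    // without blocking on output order.
    void put(size_t index, ubyte[] data) {
        synchronized (mtx) {
            pending[index] = data;
            cond.notifyAll();
        }
    }

    // Called once, after all workers have queued their results.
    void finish() {
        synchronized (mtx) {
            finished = true;
            cond.notifyAll();
        }
    }

    // Body of the dedicated output thread: write chunks in index order.
    void run(File sink) {
        while (true) {
            ubyte[] data;
            synchronized (mtx) {
                while (nextIndex !in pending) {
                    if (finished) return; // nothing more will arrive
                    cond.wait();
                }
                data = pending[nextIndex];
                pending.remove(nextIndex);
                ++nextIndex;
            }
            sink.rawWrite(data); // I/O happens outside the lock
        }
    }
}
```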

I expected this to come at a small memory cost, on the order of chunk size × thread count. However, the output itself is now much slower than the processing. That seems in line with the sambamba philosophy of shifting the problem to I/O optimisation, but I see some pathological behaviour: outputting the queued chunks after processing completes takes much longer than expected.

Profiling the output thread with perf during samtools processing shows that most of the time is spent in LZ4 decompression. Is it correct that the output of samtools is LZ4-compressed only to be decompressed again when written to file?

Profiling the output thread in the final stage, when no further samtools processing is being done (only writing queued results to disk), shows that the kernel functions change_protection_range and vm_normal_page are hot (90% the former). It's not clear to me what is going on here; it does not seem to be I/O or CPU bound (see my question here).

Perhaps I should submit the PR in its current state and have someone review it for concurrency correctness, unnecessary object copies, and style? I have never written D before.

@lomereiter
Contributor

Hi, please submit the PR as it is; I'll review it on the weekend.

The output of samtools is LZ4-compressed in order to reduce memory consumption, mostly for per-base pileup output. As it's one of the fastest compression algorithms (~1GB/s), I don't expect it to have a large influence on performance.
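
For reference, the round trip in question amounts to something like the following against the raw liblz4 C API (a sketch only; sambamba's lz4.d wrapper is not shown, and the helper names here are made up): a finished chunk's text output is compressed into memory so it sits cheaply in the queue, then decompressed again just before being written out.

```d
// Sketch of the compress-in-memory / decompress-at-output round trip using the
// liblz4 C API directly (assumes liblz4 is available at link time).
extern (C) {
    int LZ4_compressBound(int inputSize);
    int LZ4_compress_default(const(char)* src, char* dst, int srcSize, int dstCapacity);
    int LZ4_decompress_safe(const(char)* src, char* dst, int compressedSize, int dstCapacity);
}

// Compress a finished chunk so it can wait in the output queue cheaply.
ubyte[] compressChunk(const(ubyte)[] raw) {
    auto dst = new ubyte[](LZ4_compressBound(cast(int) raw.length));
    int n = LZ4_compress_default(cast(const(char)*) raw.ptr, cast(char*) dst.ptr,
                                 cast(int) raw.length, cast(int) dst.length);
    assert(n > 0, "LZ4 compression failed");
    return dst[0 .. n];
}

// Decompress it again when the writer thread is ready to emit it.
ubyte[] decompressChunk(const(ubyte)[] compressed, size_t originalSize) {
    auto dst = new ubyte[](originalSize);
    int n = LZ4_decompress_safe(cast(const(char)*) compressed.ptr, cast(char*) dst.ptr,
                                cast(int) compressed.length, cast(int) dst.length);
    assert(n >= 0 && cast(size_t) n == originalSize, "LZ4 decompression failed");
    return dst;
}
```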

sambrightman added a commit to sambrightman/sambamba that referenced this issue Jul 29, 2016
to continue

We noticed that in general mpileup shows suboptimal thread utilisation
(issue biod#237). Ordering of output results is currently the responsibility
of worker threads, so many of them are waiting on the condition variable
of ChunkDispatcher.

This patch closes biod#237 by:

* Queueing results in a min heap.
* Writing from queue to file via a separate thread.
* Allowing worker threads to process subsequent chunks immediately.
* Moderately increasing memory usage for short queues.

We currently see that the output thread cannot keep up with the workers,
leading to large increases in memory and runtime due to queueing, even
though processing time is decreased and thread utilisation is
full. Using a lower thread count would of course alleviate this but the
cause is not completely clear. This could be a separate issue.
@sambrightman
Collaborator Author

sambrightman commented Jul 29, 2016

The submitted code is still undergoing a full run to verify that the writer thread is properly joined etc., so that part is untested.

I don't understand the perf profile that I asked the Stack Exchange question about, but it could be partly due to the very high memory usage at that final stage - something related to kernel memory management (we are on kernel 3.10.0). So potentially this would never happen if the queue didn't grow so much in the first place.

During normal processing, iotop shows ~8MB/s written (~5MB/s read). We have SSDs in RAID 10. The top of the perf report is:

44.88%  sambamba  sambamba            [.] LZ4_decompress_generic.constprop.3
11.09%  sambamba  sambamba            [.] XXH32_update
 5.91%  sambamba  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
 5.88%  sambamba  libc-2.17.so        [.] __memcpy_ssse3_back
 2.25%  sambamba  [kernel.kallsyms]   [k] default_send_IPI_mask_sequence_phys

I noticed the buffer sizes in lz4.d are only 256KB and the output buffer is the same size as the input buffer - could there be some overhead in not decompressing one long stream? Contexts are repeatedly created and destroyed too.
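
For concreteness, decompressing the blocks as one long stream with a single reused context would look roughly like this against the plain liblz4 C API (a sketch only, not the existing lz4.d code; it also assumes the producer compressed the blocks with the matching LZ4 streaming API, since independently compressed blocks gain nothing from it):

```d
// Hedged illustration: one LZ4_streamDecode_t reused across all blocks, with
// two alternating 256KB output buffers so the previous decoded block stays
// resident for cross-block back-references.
extern (C) {
    struct LZ4_streamDecode_t; // opaque handle
    LZ4_streamDecode_t* LZ4_createStreamDecode();
    int LZ4_freeStreamDecode(LZ4_streamDecode_t* stream);
    int LZ4_decompress_safe_continue(LZ4_streamDecode_t* stream,
                                     const(char)* src, char* dst,
                                     int srcSize, int dstCapacity);
}

enum blockSize = 256 * 1024;

void decodeAsStream(const(ubyte)[][] blocks, void delegate(const(ubyte)[]) sink) {
    auto stream = LZ4_createStreamDecode(); // one context for the whole stream
    scope (exit) LZ4_freeStreamDecode(stream);

    auto bufs = [new char[](blockSize), new char[](blockSize)];
    size_t which = 0;

    foreach (block; blocks) {
        char* dst = bufs[which].ptr;
        int n = LZ4_decompress_safe_continue(stream,
                cast(const(char)*) block.ptr, dst,
                cast(int) block.length, blockSize);
        assert(n >= 0, "corrupt LZ4 block");
        sink(cast(const(ubyte)[]) dst[0 .. n]);
        which ^= 1; // alternate buffers, keep the last decoded block live
    }
}
```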

@sambrightman
Collaborator Author

I've just noticed that we pipe the output to bgzip, which is using about 50% CPU while sambamba uses 50%, with occasional spikes from both. Maybe bgzip is the limiting factor (I think it uses a 64KB block size, which is also the pipe buffer size?).

@lomereiter
Contributor

lomereiter commented Jul 29, 2016

I don't have any experience with perf; my preferred approach is google-perftools + kcachegrind.

The buffer size is OK; increasing it might only hurt because CPU cache is limited.
I'm surprised to see LZ4_decompress_generic.constprop.3 at the top, because the LZ4 source code indicates it should be inlined. Perhaps try recompiling with -fno-ipa-cp-clone and see if it's any better?

@sambrightman
Collaborator Author

I've confirmed that the output thread joins correctly and that removing bgzip from the pipeline results in fast output and low memory usage.

Looks like the next step is investigating bgzip performance, trying pbgzip or avoiding it entirely.

sambrightman added a commit to sambrightman/sambamba that referenced this issue Aug 2, 2016
pjotrp pushed a commit to pjotrp/sambamba that referenced this issue Dec 13, 2016