suboptimal thread utilisation by mpileup #237

Closed
sambrightman opened this issue Jul 29, 2016 · 5 comments · Fixed by #238

Comments

@sambrightman
Collaborator

We find that mpileup with -t 12 often has only 3-6 subprocesses running, with a roughly equivalent load average of ~4. I believe I understand what is happening and have a draft PR ready which fixes it. However, for larger thread counts the change causes completed chunks to queue up waiting to be output, giving suboptimal memory use and performance in the file output stage, so further investigation is required there.

Currently, responsibility for outputting results in the correct order lies partly with the worker itself, which blocks further progress of that worker. Out-of-order chunk completions therefore prevent workers from being utilised until the lowest-numbered chunk is completed; idle threads can be seen waiting on the condition variable signalled from tryDump.

I have refactored to put responsibility for chunk output into ChunkDispatcher: maintain a binary heap of the chunk results and have a separate output thread write them out. Workers can then proceed with new chunks as soon as they finish their current ones. Load average goes up to around 12 and total processing time (samtools) drops by more than half.
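
To make the hand-off concrete, here is a minimal D sketch of the idea. The names (OrderedWriter, put, finish, run) are made up for illustration, not the actual ChunkDispatcher API, and an associative array keyed by chunk index stands in for the PR's binary heap: workers enqueue a finished chunk and return to work immediately, while one dedicated thread writes chunks strictly in index order.

```d
// Minimal sketch, not the real ChunkDispatcher code: workers hand finished
// chunks to a shared structure and continue; a dedicated writer thread emits
// chunks in order. An associative array keyed by chunk index replaces the
// binary heap used in the PR, purely to keep the example short.
import core.sync.mutex : Mutex;
import core.sync.condition : Condition;
import std.stdio : File;

final class OrderedWriter {
    private ubyte[][size_t] pending; // chunk index -> serialized output
    private size_t nextIndex;        // next chunk index to write
    private bool finished;
    private Mutex mtx;
    private Condition cond;

    this() {
        mtx = new Mutex();
        cond = new Condition(mtx);
    }

    // Called by a worker when its chunk is done: queue the result and move on,
    // without blocking on output order.
    void put(size_t index, ubyte[] data) {
        synchronized (mtx) {
            pending[index] = data;
            cond.notifyAll();
        }
    }

    // Called once, after all workers have queued their results.
    void finish() {
        synchronized (mtx) {
            finished = true;
            cond.notifyAll();
        }
    }

    // Body of the dedicated output thread: write chunks in index order.
    void run(File sink) {
        while (true) {
            ubyte[] data;
            synchronized (mtx) {
                while (nextIndex !in pending) {
                    if (finished) return; // nothing more will arrive
                    cond.wait();
                }
                data = pending[nextIndex];
                pending.remove(nextIndex);
                ++nextIndex;
            }
            sink.rawWrite(data); // I/O happens outside the lock
        }
    }
}
```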

I expected this to come at a small memory cost, on the order of chunk size × thread count. However, the output itself is now much slower than the processing. That seems in line with the sambamba philosophy of shifting the problem to I/O optimisation, but I see some pathological behaviour: outputting the queued chunks after processing completes takes much longer than expected.

Profiling the output thread with perf during samtools processing shows that most of the time is spent in LZ4 decompression. Is it correct that the output of samtools is LZ4-compressed only to be decompressed again when written to file?

Profiling the output thread in the final stage, when no further samtools processing is being done (only writing queued results to disk), shows that the kernel functions change_protection_range and vm_normal_page are hot (90% the former). It's not clear to me what is going on here; it does not seem to be I/O or CPU bound (see my question here).

Perhaps I should submit the PR in its current state and have someone review it for concurrency correctness, unnecessary object copies, and style? I have never written D before.

@lomereiter
Contributor

Hi, please submit the PR as it is; I'll review it on the weekend.

The output of samtools is LZ4-compressed in order to reduce memory consumption, mostly for per-base pileup output. As it's one of the fastest compression algorithms (~1GB/s), I don't expect it to have a large influence on performance.
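
For reference, the round trip in question amounts to something like the following against the raw liblz4 C API (a sketch only; sambamba's lz4.d wrapper is not shown, and the helper names here are made up): a finished chunk's text output is compressed into memory so it sits cheaply in the queue, then decompressed again just before being written out.

```d
// Sketch of the compress-in-memory / decompress-at-output round trip using the
// liblz4 C API directly (assumes liblz4 is available at link time).
extern (C) {
    int LZ4_compressBound(int inputSize);
    int LZ4_compress_default(const(char)* src, char* dst, int srcSize, int dstCapacity);
    int LZ4_decompress_safe(const(char)* src, char* dst, int compressedSize, int dstCapacity);
}

// Compress a finished chunk so it can wait in the output queue cheaply.
ubyte[] compressChunk(const(ubyte)[] raw) {
    auto dst = new ubyte[](LZ4_compressBound(cast(int) raw.length));
    int n = LZ4_compress_default(cast(const(char)*) raw.ptr, cast(char*) dst.ptr,
                                 cast(int) raw.length, cast(int) dst.length);
    assert(n > 0, "LZ4 compression failed");
    return dst[0 .. n];
}

// Decompress it again when the writer thread is ready to emit it.
ubyte[] decompressChunk(const(ubyte)[] compressed, size_t originalSize) {
    auto dst = new ubyte[](originalSize);
    int n = LZ4_decompress_safe(cast(const(char)*) compressed.ptr, cast(char*) dst.ptr,
                                cast(int) compressed.length, cast(int) dst.length);
    assert(n >= 0 && cast(size_t) n == originalSize, "LZ4 decompression failed");
    return dst;
}
```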

sambrightman added a commit to sambrightman/sambamba that referenced this issue Jul 29, 2016
to continue

We noticed that in general mpileup shows suboptimal thread utilisation
(issue biod#237). Ordering of output results is currently the responsibility
of worker threads, so many of them are waiting on the condition variable
of ChunkDispatcher.

This patch closes biod#237 by:

* Queueing results in a min heap.
* Writing from queue to file via a separate thread.
* Allowing worker threads to process subsequent chunks immediately.
* Moderately increasing memory usage for short queues.

We currently see that the output thread cannot keep up with the workers,
leading to large increases in memory and runtime due to queueing, even
though processing time is decreased and thread utilisation is
full. Using a lower thread count would of course alleviate this but the
cause is not completely clear. This could be a separate issue.
@sambrightman
Collaborator Author

sambrightman commented Jul 29, 2016

The submitted code is still undergoing a full run to verify that the writer thread is properly joined etc., so that part is untested.

I don't understand the perf profile that I asked the Stack Exchange question about, but it could be partly due to the very high memory usage at that final stage - something related to kernel memory management (we are on kernel 3.10.0). So potentially this would never happen if the queue didn't grow so much in the first place.

During normal processing, iotop shows ~8MB/s written (~5MB/s read). We have SSDs in RAID 10. The top of the perf report is:

44.88%  sambamba  sambamba            [.] LZ4_decompress_generic.constprop.3
11.09%  sambamba  sambamba            [.] XXH32_update
 5.91%  sambamba  [kernel.kallsyms]   [k] copy_user_enhanced_fast_string
 5.88%  sambamba  libc-2.17.so        [.] __memcpy_ssse3_back
 2.25%  sambamba  [kernel.kallsyms]   [k] default_send_IPI_mask_sequence_phys

I noticed the buffer sizes in lz4.d are only 256KB and the output buffer is the same size as the input buffer - could there be some overhead in not decompressing one long stream? Contexts are repeatedly created and destroyed too.
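
For concreteness, decompressing the blocks as one long stream with a single reused context would look roughly like this against the plain liblz4 C API (a sketch only, not the existing lz4.d code; it also assumes the producer compressed the blocks with the matching LZ4 streaming API, since independently compressed blocks gain nothing from it):

```d
// Hedged illustration: one LZ4_streamDecode_t reused across all blocks, with
// two alternating 256KB output buffers so the previous decoded block stays
// resident for cross-block back-references.
extern (C) {
    struct LZ4_streamDecode_t; // opaque handle
    LZ4_streamDecode_t* LZ4_createStreamDecode();
    int LZ4_freeStreamDecode(LZ4_streamDecode_t* stream);
    int LZ4_decompress_safe_continue(LZ4_streamDecode_t* stream,
                                     const(char)* src, char* dst,
                                     int srcSize, int dstCapacity);
}

enum blockSize = 256 * 1024;

void decodeAsStream(const(ubyte)[][] blocks, void delegate(const(ubyte)[]) sink) {
    auto stream = LZ4_createStreamDecode(); // one context for the whole stream
    scope (exit) LZ4_freeStreamDecode(stream);

    auto bufs = [new char[](blockSize), new char[](blockSize)];
    size_t which = 0;

    foreach (block; blocks) {
        char* dst = bufs[which].ptr;
        int n = LZ4_decompress_safe_continue(stream,
                cast(const(char)*) block.ptr, dst,
                cast(int) block.length, blockSize);
        assert(n >= 0, "corrupt LZ4 block");
        sink(cast(const(ubyte)[]) dst[0 .. n]);
        which ^= 1; // alternate buffers, keep the last decoded block live
    }
}
```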

@sambrightman
Collaborator Author

I've just noticed that we pipe the output to bgzip, which is using about 50% CPU while sambamba uses 50%, with occasional spikes from both. Maybe bgzip is the limiting factor (I think it uses a 64KB block size, which is also the pipe buffer size?).

@lomereiter
Contributor

lomereiter commented Jul 29, 2016

I don't have any experience with perf; my preferred approach is google-perftools + kcachegrind.

The buffer size is OK; increasing it might only hurt because CPU cache is limited.
I'm surprised to see LZ4_decompress_generic.constprop.3 at the top, because the LZ4 source code indicates it should be inlined. Perhaps try recompiling with -fno-ipa-cp-clone and see if it's any better?

@sambrightman
Collaborator Author

I've confirmed that the output thread joins correctly and that removing bgzip from the pipeline results in fast output and low memory usage.

Looks like the next step is investigating bgzip performance, trying pbgzip or avoiding it entirely.

sambrightman added a commit to sambrightman/sambamba that referenced this issue Aug 2, 2016
pjotrp pushed a commit to pjotrp/sambamba that referenced this issue Dec 13, 2016