Correction for hanging dump thread - #242 #247

Merged: 1 commit into biod:master on Sep 21, 2016

Conversation

@nosepy (Contributor) commented on Sep 21, 2016

On specific BAM sets, depending on load and thread scheduling, the output thread could hang on an empty queue after all chunks had been dumped successfully.

The correction uses the total number of chunks as the termination condition for the dump thread, instead of the queue content and an indication of running workers. To avoid a hanging queuing thread in case of exceptions, and to prevent unnecessary delay of program termination, the join calls for the workers and the dump thread were moved from the exit scope to the success branch at the end of the main function.
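For illustration, here is a minimal sketch of that termination scheme; the names (ChunkQueue, setTotal, chunkDumped) are hypothetical and the actual sambamba code differs:

```d
import core.sync.mutex : Mutex;

// Sketch only: the dump thread stops once the number of dumped chunks reaches
// the known total, rather than whenever it happens to observe an empty queue.
class ChunkQueue {
    private Mutex  mutex_;
    private size_t totalChunks_;   // set once the last chunk has been produced
    private size_t dumpedChunks_;  // advanced by the dump thread
    private bool   totalKnown_;

    this() { mutex_ = new Mutex(); }

    // Worker side: publish the total number of chunks.
    void setTotal(size_t total) {
        synchronized (mutex_) { totalChunks_ = total; totalKnown_ = true; }
    }

    // Dump-thread side: record one successfully dumped chunk.
    void chunkDumped() {
        synchronized (mutex_) { ++dumpedChunks_; }
    }

    // Termination condition for the dump thread.
    bool dumpFinished() {
        synchronized (mutex_) {
            return totalKnown_ && dumpedChunks_ == totalChunks_;
        }
    }
}
```

With a condition like this, an empty queue alone never ends the dump thread; it keeps draining until the chunk count says all work is done.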

@lomereiter merged commit 50102d4 into biod:master on Sep 21, 2016
@lomereiter (Contributor)

Many thanks for your debugging effort!

@sambrightman (Collaborator)

This still looks potentially race-y to me:

  • what stops the worker from grabbing the final chunk and increasing total_num_ just before dumpFinished is called again?
  • can the empty condition trigger early if the BAM reading is slower than the pileup? I think probably not, but I remember reading some code somewhere that suggested this could happen.

@nosepy (Contributor, Author) commented on Sep 28, 2016

Synchronization is not necessary for the access to curr_num in dumpFinished, because it is written only by the queue thread (under the queue lock) and parallel read/write access cannot occur within a single queue thread. The read access in the worker, in queueResult, happens within a queue lock.

Yes, there could be parallel access to total_num_ when the worker sets it while the queuing thread is checking it, which in rare cases could lead to reading an inconsistent value. I assumed that a 4-byte access to an integer is always atomic for both read and write, but it seems this is not always guaranteed, depending on the hardware architecture and the alignment of the integer. So some kind of synchronization or atomic access is needed. The write in the worker already happens within the mutex_ lock. The read access in dumpFinished could also be done under a mutex_ lock, but a CAS (compare-and-swap) based atomic read/write (cas in core/atomic.d) would likely be better because it reduces the coupling between the worker and the queue thread. @lomereiter What is your opinion?
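For illustration, a hedged sketch of such a cas-based alternative, using a stand-in module-level shared counter in place of the real total_num_ member (not the actual sambamba code):

```d
import core.atomic : atomicLoad, cas;

shared int totalNum = -1;   // -1 means "total not yet known"

// Worker side: publish the total exactly once. The single cas either installs
// the value or fails, so the queue thread can never read a torn or intermediate value.
void publishTotal(int total) {
    cas(&totalNum, -1, total);
}

// Queue-thread side (e.g. in dumpFinished): an atomic read instead of taking mutex_.
bool totalReached(int dumpedSoFar) {
    immutable total = atomicLoad(totalNum);
    return total >= 0 && dumpedSoFar == total;
}
```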

Related to the second topic:
A chunk-empty condition occurring before the BAM has been read completely would lead to premature termination of some or all worker threads. If that can happen, it could result in incomplete pileup processing and would definitely be a bug in Sambamba, independent of any queuing handling. (The entire file must be processed regardless of the parallelization approach and the relative speeds of the parallel units.) So if the chunk-empty condition can occur before the entire BAM has been partitioned into chunks, I think another issue should be opened.

@lomereiter (Contributor)

@nosepy I also wouldn't worry about 32-bit integer reads/writes on any of the common architectures. Variables are aligned anyway; unaligned access can happen only with manual heap allocation (and GC.malloc returns aligned blocks). Using atomicLoad/atomicStore would perhaps convey the intention better, but in practice it shouldn't make any difference.

I don't understand what it means for the empty condition to be triggered 'early', since it is triggered on every chunk arrival, and exit is determined by nextChunk returning null.
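As a hedged illustration of that exit rule (hypothetical names; nextChunk here stands for whatever call hands the dump thread its next chunk):

```d
import std.stdio : writeln;

struct Chunk { int id; }

// The loop ends only when nextChunk reports completion by returning null;
// observing an empty queue on its own never terminates the dump thread.
void dumpLoop(Chunk* delegate() nextChunk) {
    while (true) {
        auto chunk = nextChunk();   // in the real code this blocks until a chunk is ready
        if (chunk is null)
            break;                  // all chunks have been produced and dumped
        writeln("dumping chunk ", chunk.id);
    }
}

void main() {
    Chunk[3] chunks = [Chunk(0), Chunk(1), Chunk(2)];
    size_t next;
    dumpLoop(() => next < chunks.length ? &chunks[next++] : null);
}
```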

All in all this approach looks correct to me.

@sambrightman (Collaborator)

  1. Off-by-one error in my thinking on this one; agreed, it looks fine.
  2. I had a recollection that the ChunkRange was being filled in parallel by reader threads and could be .empty mid-run. You are right that if this happens, you have other problems anyway.
