Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sambamba - missing supervision of thread error conditions / progress #228

Closed
nosepy opened this issue Jun 27, 2016 · 4 comments
Closed

Sambamba - missing supervision of thread error conditions / progress #228

nosepy opened this issue Jun 27, 2016 · 4 comments

Comments

@nosepy
Copy link
Contributor

nosepy commented Jun 27, 2016

Hi,
in connection with issue #214, which is corrected already, I see an other problem with sambamba. There one thread was hanging because of a wrong version of l4z. And the parallel processing did not terminate because of missing supervision of the parallel threads for error conditions. The correction for #214 corrected the immediate problem but did not improve the general sambamba behavior in case of problems of a single or multiple threads.

I assume that sambamba should supervise the threads for error conditions and progress and should terminate all threads immediately in case of an error condition / missing progress in a single thread as this will never lead to a successful sambama run if one of the parts crashes. Or more generally speaking I think appropriate error handling for parallelization means: a parallelized version of a functionality should recognize if one of its parts is not working properly and should terminate the entire functionality immediately in this situation. This makes error situations immediately visible and avoids unnecessary waste of processing resources.

Thank you for your time.

Cheers,

Johannes

@pjotrp
Copy link
Member

pjotrp commented Jun 27, 2016

Hi Johannes,

Sambamba has two strategies for parallel processing. For most functionality it uses a D threadpool which behaves correctly when threads fail. It is not supervised, but will behave well on error. See the information on Exceptions in http://dlang.org/phobos/std_parallelism.html#.TaskPool

For mpileup, however, we use a different strategy by running samtool processes. Because it is a simple strategy these processes are not supervised and can misbehave. We had not anticipated this feature to be so successful! If you want to run mpileup correctly it is probably best to use samtools directly and put in your own supervision system (right @lomereiter?)

Say hi to Sepp, Oswaldo and Ulrich from me :).

@lomereiter
Copy link
Contributor

Pjotr summarized it well, I'll just add my 2 cents.

The sad reality is, I don't have time resources for considering and handling all possible error conditions. I'm also not an end user of the tool anymore, so I rely on others' bug reports and try to add regression tests where it makes sense. Mpileup in particular is more of a fun experiment, I'm almost surprised that it works. (Multithreading brings quite a bit of headache already, adding multiple processes to the mix complicates the matter even more.)

@nosepy
Copy link
Contributor Author

nosepy commented Jun 28, 2016

Thank you for the quick response. I understand your concerns related to the necessary effort and the diversity of error situations.

I thought of something like this: a single supervision process starts the parallelization process which starts its working threads for parallelized processing. The parallelization process keeps track of which threads are active in a form that can be accessed from the supervision process, e.g. through a shared memory segment. If the shell command executed in the worker returns with non-success return code the active thread is marked as failed as indication for the supervisor. For supervision of hanging threads I think it could be sufficient for the supervising process to periodically check the accumulated thread cpu time of active threads as an indication that the thread is still alive.

The main influence to the parallelization process is maintaining the list of active thread PIDs and marking nonsuccessful threads based on the shell return code. The supervising process periodically checks for erronous thread entries signalled from the parallelization thread, checks the thread progress based on thread cpu time and terminates everything in case of error or a hanging thread.

Through this approach it should be possible to keep the supervising process simple and the influence to the parallelization process limited and the supervision independent of the used parallelization model.

@pjotrp
Copy link
Member

pjotrp commented Jun 28, 2016

Good idea. Feel free to add it!

@pjotrp pjotrp closed this as completed Feb 24, 2017
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants