-
Notifications
You must be signed in to change notification settings - Fork 107
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sambamba - missing supervision of thread error conditions / progress #228
Comments
Hi Johannes, Sambamba has two strategies for parallel processing. For most functionality it uses a D threadpool which behaves correctly when threads fail. It is not supervised, but will behave well on error. See the information on Exceptions in http://dlang.org/phobos/std_parallelism.html#.TaskPool For mpileup, however, we use a different strategy by running samtool processes. Because it is a simple strategy these processes are not supervised and can misbehave. We had not anticipated this feature to be so successful! If you want to run mpileup correctly it is probably best to use samtools directly and put in your own supervision system (right @lomereiter?) Say hi to Sepp, Oswaldo and Ulrich from me :). |
Pjotr summarized it well, I'll just add my 2 cents. The sad reality is, I don't have time resources for considering and handling all possible error conditions. I'm also not an end user of the tool anymore, so I rely on others' bug reports and try to add regression tests where it makes sense. Mpileup in particular is more of a fun experiment, I'm almost surprised that it works. (Multithreading brings quite a bit of headache already, adding multiple processes to the mix complicates the matter even more.) |
Thank you for the quick response. I understand your concerns related to the necessary effort and the diversity of error situations. I thought of something like this: a single supervision process starts the parallelization process which starts its working threads for parallelized processing. The parallelization process keeps track of which threads are active in a form that can be accessed from the supervision process, e.g. through a shared memory segment. If the shell command executed in the worker returns with non-success return code the active thread is marked as failed as indication for the supervisor. For supervision of hanging threads I think it could be sufficient for the supervising process to periodically check the accumulated thread cpu time of active threads as an indication that the thread is still alive. The main influence to the parallelization process is maintaining the list of active thread PIDs and marking nonsuccessful threads based on the shell return code. The supervising process periodically checks for erronous thread entries signalled from the parallelization thread, checks the thread progress based on thread cpu time and terminates everything in case of error or a hanging thread. Through this approach it should be possible to keep the supervising process simple and the influence to the parallelization process limited and the supervision independent of the used parallelization model. |
Good idea. Feel free to add it! |
Hi,
in connection with issue #214, which is corrected already, I see an other problem with sambamba. There one thread was hanging because of a wrong version of l4z. And the parallel processing did not terminate because of missing supervision of the parallel threads for error conditions. The correction for #214 corrected the immediate problem but did not improve the general sambamba behavior in case of problems of a single or multiple threads.
I assume that sambamba should supervise the threads for error conditions and progress and should terminate all threads immediately in case of an error condition / missing progress in a single thread as this will never lead to a successful sambama run if one of the parts crashes. Or more generally speaking I think appropriate error handling for parallelization means: a parallelized version of a functionality should recognize if one of its parts is not working properly and should terminate the entire functionality immediately in this situation. This makes error situations immediately visible and avoids unnecessary waste of processing resources.
Thank you for your time.
Cheers,
Johannes
The text was updated successfully, but these errors were encountered: