
Parallelization is sometimes slow #19

Open
TimothyOlsson opened this issue Aug 23, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@TimothyOlsson
Contributor

For very large data sets, each parallel_3 process is, quite frankly, slow.

Adding placeholders and sorting the values is what makes the overhead calculations slow. A lot of work has already been done by Matthew, the creator of Quandenser, to make it faster.

Changing the loop so that the values are stored in .dat files, as was done previously, might solve this issue.
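A minimal sketch of that idea in Python (illustrative only; the actual Quandenser code is C++, and the file layout and names here are hypothetical): each value is appended to a binary .dat file as it is produced and streamed back later, instead of holding the whole list in memory for sorting.

```python
import os
import struct
import tempfile

def write_values_dat(values, path):
    # Append each value to a binary .dat file instead of keeping
    # the whole list in memory.
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("<d", v))

def read_values_dat(path):
    # Stream the values back one at a time (8 bytes per double).
    with open(path, "rb") as f:
        while chunk := f.read(8):
            yield struct.unpack("<d", chunk)[0]

path = os.path.join(tempfile.gettempdir(), "features.dat")
write_values_dat([3.0, 1.0, 2.0], path)
print(list(read_values_dat(path)))  # [3.0, 1.0, 2.0]
```

The point is that memory use stays constant in the number of values; any external sorting of the .dat file would then be a separate, disk-backed step.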

@TimothyOlsson TimothyOlsson added the enhancement New feature or request label Aug 23, 2019
@MatthewThe
Contributor

This sounds like it shouldn't be too hard, I will look into this sometime next week.

@TimothyOlsson
Contributor Author

TimothyOlsson commented Aug 23, 2019

> This sounds like it shouldn't be too hard, I will look into this sometime next week.

Thank you! The work you have already done has been extremely helpful. The overhead calculations have been significantly improved after some testing, and the "stability issues" I had with the pipeline (crashes like #15) are pretty much gone, from what I have seen.

Edit: addFeatureLinks could also possibly be a bottleneck

@MatthewThe
Contributor

I have added intermediate writing of features between alignments to your Quandenser branch:
statisticalbiotechnology/quandenser@df79378

This skips all the matching of the previous alignments. I've only tested it on the iPRG2016 set, where it resulted in a limited speed up, but I imagine it should give you a considerable speed-up on large sets. Please give it a try! :)
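The caching strategy in the linked commit might look roughly like this sketch (hypothetical names; the real implementation lives in Quandenser's C++ code): each pairwise alignment's features are written to disk once computed, so a later run can load them instead of redoing the matching.

```python
import os
import pickle
import tempfile

def align(run_a, run_b):
    # Placeholder for the expensive pairwise alignment/matching step.
    return {"pair": (run_a, run_b)}

def align_cached(run_a, run_b, cache_dir):
    # Write each alignment's result to disk; on a rerun, load the
    # cached file instead of recomputing the matching.
    cache = os.path.join(cache_dir, f"{run_a}_{run_b}.features.pkl")
    if os.path.exists(cache):
        with open(cache, "rb") as f:
            return pickle.load(f)
    result = align(run_a, run_b)
    with open(cache, "wb") as f:
        pickle.dump(result, f)
    return result

with tempfile.TemporaryDirectory() as d:
    first = align_cached("run1", "run2", d)   # computed and written
    second = align_cached("run1", "run2", d)  # loaded from disk
    print(first == second)  # True
```

This is why the speed-up grows with the data set: the larger the set, the more previously completed alignments can be skipped.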

@TimothyOlsson
Contributor Author

> I have added intermediate writing of features between alignments to your Quandenser branch:
> statisticalbiotechnology/quandenser@df79378
>
> This skips all the matching of the previous alignments. I've only tested it on the iPRG2016 set, where it resulted in a limited speed up, but I imagine it should give you a considerable speed-up on large sets. Please give it a try! :)

I have been testing it and it really seems to speed up the second step. I tested on an old data set of cyanobacteria (10 files) and got 45 minutes instead of 1 h 21 m. I am going to test it further, but the results so far seem very promising. This will be very useful, thank you!

TimothyOlsson added a commit that referenced this issue Sep 13, 2019
### Changes in building the image:

**Minor changes:**

* [command_wrapper.py](https://github.com/statisticalbiotechnology/quandenser-pipeline/blob/master/dependencies/command_wrapper.py) added to the image. The purpose is to fix some common errors found in the pipeline.

### Changes in the shell script:

**Minor changes:**

* Added messages when the user enables some options

* The Singularity version installed by the shell script is now pinned to v3.2.1, since the latest version, v3.4.0, has some restrictions on what you can interact with on the host computer. This includes the "running jobs" tab and the button which can kill the processes. This means you would have to kill all processes manually if you have v3.4.0 installed.

### Changes in the image:

**Major changes:**

* Major improvements to parallel processing. See enhancement #19. A positive side effect of this change is that parallel processing is much more memory-efficient, which means you can run larger data sets with the parallel option!

**Minor changes:**

* This is a "minor" change, but it fixes 2 very important and "pipeline breaking" bugs, namely issue #3 and issue #23. MSconvert, quandenser_parallel_1 and quandenser_parallel_3 will now be executed through a "wrapper" script, which checks for the known errors that cause these issues. The script is found [here](https://github.com/statisticalbiotechnology/quandenser-pipeline/blob/master/dependencies/command_wrapper.py). This should also prevent zombie processes from happening.
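As a rough illustration of what such a wrapper can do (a hedged sketch, not the actual command_wrapper.py; the error strings and function name are made up): run the command, scan its output for known transient errors, and retry once before giving up.

```python
import subprocess
import sys

# Hypothetical error signatures; the real command_wrapper.py checks
# for the specific msconvert/quandenser failure modes.
KNOWN_ERRORS = ("FATAL: could not mount image", "Segmentation fault")

def run_wrapped(cmd, max_retries=1):
    # Run the command; on a known error or non-zero exit, retry once
    # (the "allow one crash" behaviour) before giving up.
    for attempt in range(max_retries + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        output = proc.stdout + proc.stderr
        if proc.returncode == 0 and not any(e in output for e in KNOWN_ERRORS):
            return proc.returncode
        msg = "retrying" if attempt < max_retries else "giving up"
        print(f"attempt {attempt + 1} failed, {msg}", file=sys.stderr)
    return proc.returncode

code = run_wrapped([sys.executable, "-c", "print('ok')"])
print(code)  # 0
```

Wrapping the command this way also means the parent always reaps the child process, which is consistent with the note above about preventing zombie processes.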

* You can now choose in tab 3 whether to publish the files (i.e. copy them from the work directory to the output directory). This is a useful option to have if you have limited storage space on the computer you are working on and don't want some of the "intermediate" files. Note that you still have to remove the work directory manually when you are done.

* The pipeline now allows one crash of msconvert, quandenser_parallel_1 and quandenser_parallel_3, due to the known errors and command_wrapper. In most cases, this also solves issue #22, where singularity fails to mount the image when running many parallel processes. It is rare, but if the pipeline fails to mount the image twice in a row for the same process, the pipeline will still crash. For computers affected by that issue, this fix reduces the failure rate from 100% to far less.