
Parallelization is sometimes slow #19

Open
TimothyOlsson opened this issue Aug 23, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@TimothyOlsson
Contributor

For very large data sets, each parallel_3 process is, quite frankly, slow.

Adding placeholders and sorting the values is what makes the overhead calculations slow. A lot of work has already been done by Matthew, the creator of Quandenser, to make it faster.

Changing the loop so that the values are stored in .dat files, as was done previously, might solve this issue.
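A minimal sketch of that idea in Python (illustrative only; the actual Quandenser code is C++, and the file layout and names here are hypothetical): each value is appended to a binary .dat file as it is produced and streamed back later, instead of holding the whole list in memory for sorting.

```python
import os
import struct
import tempfile

def write_values_dat(values, path):
    # Append each value to a binary .dat file instead of keeping
    # the whole list in memory.
    with open(path, "wb") as f:
        for v in values:
            f.write(struct.pack("<d", v))

def read_values_dat(path):
    # Stream the values back one at a time (8 bytes per double).
    with open(path, "rb") as f:
        while chunk := f.read(8):
            yield struct.unpack("<d", chunk)[0]

path = os.path.join(tempfile.gettempdir(), "features.dat")
write_values_dat([3.0, 1.0, 2.0], path)
print(list(read_values_dat(path)))  # [3.0, 1.0, 2.0]
```

The point is that memory use stays constant in the number of values; any external sorting of the .dat file would then be a separate, disk-backed step.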

@TimothyOlsson TimothyOlsson added the enhancement New feature or request label Aug 23, 2019
@MatthewThe
Contributor

This sounds like it shouldn't be too hard, I will look into this sometime next week.

@TimothyOlsson
Contributor Author

TimothyOlsson commented Aug 23, 2019

> This sounds like it shouldn't be too hard, I will look into this sometime next week.

Thank you! The work you have already done has been extremely helpful. The overhead calculations have been significantly improved after some testing, and the "stability issues" I had with the pipeline (crashes like #15) are pretty much gone, from what I have seen.

Edit: addFeatureLinks could also possibly be a bottleneck

@MatthewThe
Contributor

I have added intermediate writing of features between alignments to your Quandenser branch:
statisticalbiotechnology/quandenser@df79378

This skips all the matching of the previous alignments. I've only tested it on the iPRG2016 set, where it resulted in a limited speed up, but I imagine it should give you a considerable speed-up on large sets. Please give it a try! :)
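The caching strategy in the linked commit might look roughly like this sketch (hypothetical names; the real implementation lives in Quandenser's C++ code): each pairwise alignment's features are written to disk once computed, so a later run can load them instead of redoing the matching.

```python
import os
import pickle
import tempfile

def align(run_a, run_b):
    # Placeholder for the expensive pairwise alignment/matching step.
    return {"pair": (run_a, run_b)}

def align_cached(run_a, run_b, cache_dir):
    # Write each alignment's result to disk; on a rerun, load the
    # cached file instead of recomputing the matching.
    cache = os.path.join(cache_dir, f"{run_a}_{run_b}.features.pkl")
    if os.path.exists(cache):
        with open(cache, "rb") as f:
            return pickle.load(f)
    result = align(run_a, run_b)
    with open(cache, "wb") as f:
        pickle.dump(result, f)
    return result

with tempfile.TemporaryDirectory() as d:
    first = align_cached("run1", "run2", d)   # computed and written
    second = align_cached("run1", "run2", d)  # loaded from disk
    print(first == second)  # True
```

This is why the speed-up grows with the data set: the larger the set, the more previously completed alignments can be skipped.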

@TimothyOlsson
Contributor Author

> I have added intermediate writing of features between alignments to your Quandenser branch:
> statisticalbiotechnology/quandenser@df79378
>
> This skips all the matching of the previous alignments. I've only tested it on the iPRG2016 set, where it resulted in a limited speed up, but I imagine it should give you a considerable speed-up on large sets. Please give it a try! :)

I have been testing it and it really seems to speed up the second step. I tested on an old data set of cyanobacteria (10 files) and got 45 minutes instead of 1 h 21 m. I am going to test it further, but the results so far seem very promising. This will be very useful, thank you!

TimothyOlsson added a commit that referenced this issue Sep 13, 2019
### Changes in building the image:

**Minor changes:**

* [command_wrapper.py](https://github.com/statisticalbiotechnology/quandenser-pipeline/blob/master/dependencies/command_wrapper.py) added to the image. The purpose is to fix some common errors found in the pipeline.

### Changes in the shell script:

**Minor changes:**

* Added messages when the user enables some options

* The Singularity version installed by the shell script is now pinned to v3.2.1, since the latest version, v3.4.0, has some restrictions on what you can interact with on the host computer. This includes the "running jobs" tab and the button which can kill the processes. This means you would have to kill all processes manually if you have v3.4.0 installed.

### Changes in the image:

**Major changes:**

* Major improvements to parallel processing. See enhancement #19. A positive side effect of this change is that parallel processing is much more memory-efficient, which means you can run larger data sets with the parallel option!

**Minor changes:**

* This is a "minor" change, but it fixes 2 very important and "pipeline breaking" bugs, namely issue #3 and issue #23. MSconvert, quandenser_parallel_1 and quandenser_parallel_3 will now be executed through a "wrapper" script, which checks for the known errors that cause these issues. The script is found [here](https://github.com/statisticalbiotechnology/quandenser-pipeline/blob/master/dependencies/command_wrapper.py). This should also prevent zombie processes from happening.
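As a rough illustration of what such a wrapper can do (a hedged sketch, not the actual command_wrapper.py; the error strings and function name are made up): run the command, scan its output for known transient errors, and retry once before giving up.

```python
import subprocess
import sys

# Hypothetical error signatures; the real command_wrapper.py checks
# for the specific msconvert/quandenser failure modes.
KNOWN_ERRORS = ("FATAL: could not mount image", "Segmentation fault")

def run_wrapped(cmd, max_retries=1):
    # Run the command; on a known error or non-zero exit, retry once
    # (the "allow one crash" behaviour) before giving up.
    for attempt in range(max_retries + 1):
        proc = subprocess.run(cmd, capture_output=True, text=True)
        output = proc.stdout + proc.stderr
        if proc.returncode == 0 and not any(e in output for e in KNOWN_ERRORS):
            return proc.returncode
        msg = "retrying" if attempt < max_retries else "giving up"
        print(f"attempt {attempt + 1} failed, {msg}", file=sys.stderr)
    return proc.returncode

code = run_wrapped([sys.executable, "-c", "print('ok')"])
print(code)  # 0
```

Wrapping the command this way also means the parent always reaps the child process, which is consistent with the note above about preventing zombie processes.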

* You can now choose in tab 3 whether to publish the files (i.e. copy them from the work directory to the output directory). This is a useful option to have if you have limited storage space on the computer you are working on and don't want some of the "intermediate" files. Note that you still have to remove the work directory manually when you are done.

* The pipeline now allows one crash of msconvert, quandenser_parallel_1 and quandenser_parallel_3, due to the known errors and command_wrapper. In most cases, this also solves issue #22, where singularity fails to mount the image when running many parallel processes. It is rare, but if the pipeline fails to mount the image twice in a row for the same process, the pipeline will still crash. For computers affected by that issue, this fix reduces the failure rate from 100% to far less.