Sun Grid Engine: SGEWorker #472
Conversation
Codecov Report
@@ Coverage Diff @@
## master #472 +/- ##
==========================================
- Coverage 83.04% 78.42% -4.63%
==========================================
Files 20 20
Lines 4046 4305 +259
Branches 1118 1182 +64
==========================================
+ Hits 3360 3376 +16
- Misses 491 733 +242
- Partials 195 196 +1
@chasejohnson3 - thank you for the PR, do you think it's possible to create a container that could be used for testing the new worker?
@chasejohnson3 - btw, any chance you're joining the OHBM hackathon next week?
@djarecka Thanks for taking a look at this PR. I'm afraid I'm running low on time to contribute to pydra, so I don't think I will be able to get to the worker or OHBM as much as I would like to.
Thanks @chasejohnson3! We could try to help with this PR. Perhaps I can try to find someone who has some experience with SGE.
Thanks for keeping me in the loop @djarecka!
@djarecka What would be involved in creating a container to test the SGEWorker? Is there an example (i.e., one for SLURM) I could base it on? I am just trying to gauge how much effort that would be.
Force-pushed from 8368af8 to 94f505a
@chasejohnson3 Could this be rebased and the conflict resolved?
I did not initially merge this PR because we do not have an SGEWorker container to test it with. @djarecka can the SGEWorker be merged without having a test container yet?
@chasejohnson3 - I think it would be good to have tests, but we could add this as an experimental worker. Hopefully someone will find time soon to work on the testing. However, please merge master and clean up a bit (you have some commented-out print statements, etc.). Also, please confirm that you tested this manually on your system.
Force-pushed from a2f8476 to 5bb5915
Force-pushed from 5bb5915 to 9ae43c7
@djarecka I rebased this branch onto master and removed the comments/print statements. I successfully tested the SGEWorker manually on our system. Let me know if anything else needs to be done before merging!
@chasejohnson3 - Thank you! I've merged it, but could I ask you to open an issue to add tests for it (and perhaps some information on what will be needed for them)?
Summary
This PR addresses a feature requested and discussed in #247.
This PR adds Sun Grid Engine compatibility to the list of available pydra workers. The implementation of the SGEWorker loosely mirrors the SlurmWorker, but deviates to accommodate instances where many jobs are submitted at one time. To reduce strain on SGE for large inputs, job arrays are used, and instead of polling for job completion with qstat/qacct, completion is detected by the existence of the _result.pklz file for a given task. These two changes decrease the number of SGE calls made (qsub, qstat, and qacct), which overload and slow the entire SGE system when many are issued at the same time.
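The result-file polling described above might look roughly like the following minimal sketch. This is an illustration of the idea, not the PR's actual code; the function name, poll interval, and async shape are assumptions.

```python
import asyncio
from pathlib import Path

async def wait_for_result(task_cache_dir: Path, poll_interval: float = 5.0) -> Path:
    """Wait for the _result.pklz file that pydra writes when a task finishes."""
    result_file = task_cache_dir / "_result.pklz"
    while not result_file.exists():
        # Checking the shared filesystem avoids a qstat/qacct round-trip per
        # task, which is what overloads SGE when many jobs run at once.
        await asyncio.sleep(poll_interval)
    return result_file
```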
To further avoid overloading SGE, the SGEWorker parameter max_threads limits the number of slots requested of the SGE system at any one time. By default a task requests 1 slot, but the number of slots used for a task can be set via the input_spec field sgeThreads. Indirect host job submission, enabled by the SGEWorker parameter indirect_submit_host, allows a pydra workflow to run on a more powerful SGE "compute" node while qsub calls are made from SGE "submit" nodes.
Note: the cache_dir for workflows using the SGEWorker MUST be in a directory shared between all nodes of the SGE cluster. This is also true for the SGEWorker tests, where tempdir must be set to a shared location.
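As a usage illustration only: the parameter names below are taken from this PR's description, but the exact Submitter/worker signatures in the merged code may differ, and the cache path and host name are hypothetical.

```python
from pathlib import Path
import pydra

@pydra.mark.task
def add_one(x: int) -> int:
    return x + 1

task = add_one(x=1)
# The cache_dir must live on a filesystem shared by all SGE nodes (see note above).
task.cache_dir = Path("/shared/scratch/pydra-cache")  # hypothetical path

with pydra.Submitter(
    plugin="sge",                       # select the SGEWorker
    max_threads=100,                    # cap on SGE slots requested at once
    indirect_submit_host="login-node",  # hypothetical node issuing the qsub calls
) as sub:
    sub(task)
```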
Still to do:
Sometimes lock files are created for a task but the task makes no progress and "hangs" without completing.
I believe the best resolution to this issue is to find empty directories held by lock files, remove them, and rerun their tasks (a speculative sketch follows below).
This seems like a pydra-wide bug (see #216), but the SGEWorker may need specific handling for when it comes across empty but locked directories.
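A speculative sketch of that cleanup. The layout assumed here (a <checksum>.lock file alongside a <checksum> task directory) and the function name are assumptions for illustration, not pydra API.

```python
import shutil
from pathlib import Path

def clear_stale_locks(cache_dir: Path) -> list[Path]:
    """Remove task directories that are locked but never produced _result.pklz."""
    cleared = []
    for lock in cache_dir.glob("*.lock"):
        task_dir = lock.with_suffix("")  # directory the lock file guards (assumed layout)
        if task_dir.is_dir() and not (task_dir / "_result.pklz").exists():
            shutil.rmtree(task_dir)  # drop the stale, empty task directory
            lock.unlink()            # release the orphaned lock
            cleared.append(task_dir)
    return cleared  # rerunning the workflow would re-execute these tasks
```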
Checklist
(we are using black: you can pip install pre-commit, run pre-commit install in the pydra directory, and black will be run automatically with each commit)