Sun Grid Engine: SGEWorker #472

chasejohnson3 · 2021-05-31T22:34:44Z

Acknowledgment

I acknowledge that this contribution will be available under the Apache 2 license.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to change)

Summary

This PR addresses a feature requested and discussed in #247

Array jobs
_result.pklz polling
max_threads limit
Set threads used per task
Indirect host job submission

This PR adds Sun Grid Engine compatibility to the list of pydra workers available. The implementation of the SGEWorker
slightly mirrored the SlurmWorker, but deviated to accommodate instances where many jobs would be submitted
at one time. To alleviate strain on SGE for large inputs, job arrays are utilized, and instead of polling for job completion
with qstat/qacct, job completion is polled by the existence of the _result.pklz file for a given task. These two changes
decrease the number of SGE calls are made (qsub, qstat, and qacct) which overload and slow the entire SGE system when
many are called at the same time.
To further avoid overloading SGE, an SGEWorker parameter max_threads is used to limit the number of slots are requested
of the SGE system at any one time. By default, a task requests 1 slot, but the number of slots used for a task can be set by the input_spec field sgeThreads. Indirect host job submission using SGEWorker parameter indirect_submit_host allows a pydra workflow to be run on a more-powerful SGE "compute" node while qsub calls are made from SGE "submit" nodes.

Note: The cache_dir for workflows using SGEWorker MUST be in a directory shared between all nodes of an SGE cluster.
This is also true for the SGEWorkflow tests, where tempdir must be set to a "Shared" memory location.

Still to do:
Sometimes lock files are created for a task but the task makes not progress and "hangs" without completing.
I believe the best resolution to this issue is to find empty directories held by lock files, remove them, and rerun their tasks.
This seems like a pydra-wide bug (see #216 ), but there may be specific implementations when the SGEWorker specifically
comes across empty but locked directories.

Checklist

All tests passing
I have added tests to cover my changes
I have updated documentation (if necessary)
My code follows the code style of this project
(we are using black: you can pip install pre-commit,
run pre-commit install in the pydra directory
and black will be run automatically with each commit)

codecov · 2021-05-31T22:38:27Z

Codecov Report

Merging #472 (426ed84) into master (841796c) will decrease coverage by 4.62%.
The diff coverage is 7.25%.

❗ Current head 426ed84 differs from pull request most recent head 9ae43c7. Consider uploading reports for the commit 9ae43c7 to get more accurate results

@@            Coverage Diff             @@
##           master     #472      +/-   ##
==========================================
- Coverage   83.04%   78.42%   -4.63%     
==========================================
  Files          20       20              
  Lines        4046     4305     +259     
  Branches     1118     1182      +64     
==========================================
+ Hits         3360     3376      +16     
- Misses        491      733     +242     
- Partials      195      196       +1

Flag	Coverage Δ
unittests	`78.35% <7.25%> (-4.63%)`	⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files	Coverage Δ
pydra/engine/workers.py	`17.83% <6.94%> (-13.95%)`	⬇️
pydra/engine/submitter.py	`83.33% <33.33%> (-1.13%)`	⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 841796c...9ae43c7. Read the comment docs.

djarecka · 2021-06-12T15:17:47Z

@chasejohnson3 - thank you for the PR, do you think it's possible to create a container that could be used for testing the new worker?

djarecka · 2021-06-12T15:18:57Z

@chasejohnson3 - btw. any chances you're joining the OHBM hackathon next week?

chasejohnson3 · 2021-06-15T22:35:44Z

@djarecka Thanks for taking a look at this PR. I'm afraid I am running low on time available to contribute to pydra, so I don't think I will be able to get to the worker or OHBM as much as I would like to.

djarecka · 2021-06-15T22:43:19Z

thanks @chasejohnson3 ! We could try to help with this PR. Perhaps, I can try to find someone who have some experience with SGE.
Even if you won't be officially at OHBM, you could always find us on nipype channel on mattermost (https://mattermost.brainhack.org/brainhack/channels/nipype)

chasejohnson3 · 2021-06-15T23:25:56Z

Thanks for keeping me in the loop @djarecka!

chasejohnson3 · 2021-06-24T14:41:02Z

@djarecka What would be involved in creating a container to test the SGEWorker? Is there an example (ie one for SLURM) I could base it on? I am just trying to gauge how much of an effort that would be.

hjmjohnson · 2021-10-15T15:13:26Z

@chasejohnson3 Could this be rebased and the conflict resolved?

chasejohnson3 · 2021-10-15T15:52:37Z

@chasejohnson3 Could this be rebased and the conflict resolved?

I did not initially merge this PR because we do not have a SGEWorker container to test it. @djarecka can the SGEWorker be merged without having a test container yet?

djarecka · 2021-10-17T00:24:36Z

@chasejohnson3 - I think it would be good to have tests, but we could add this as an experimental worker. Hopefully, someone will find some time soon to work on the testing.

However, please merge master and clean a bit (you have some commented print statements, etc.). Also, please confirm that you tested this manually on your system

chasejohnson3 · 2021-10-17T23:58:51Z

@djarecka I rebased this branch with master and removed the comments/print statements. I did successfully test the SGEWorker manually on our system. Let me know if anything else needs to be done before merging!

djarecka · 2021-10-20T02:28:31Z

@chasejohnson3 - Thank you! I've merged it, but could I ask you to open an issue to add tests for it (and perhaps some information what will be needed for it)

chasejohnson3 mentioned this pull request Jun 1, 2021

SunGridEngine compatibility needed #247

Closed

chasejohnson3 changed the title ~~Job arrays~~ Sun Grid Engine: SGEWorker Jun 1, 2021

chasejohnson3 force-pushed the job_arrays branch from 8368af8 to 94f505a Compare June 30, 2021 00:43

chasejohnson3 force-pushed the job_arrays branch 4 times, most recently from a2f8476 to 5bb5915 Compare October 17, 2021 23:29

Created SGEWorker to use Pydra with Sun Grid Engine

9ae43c7

chasejohnson3 force-pushed the job_arrays branch from 5bb5915 to 9ae43c7 Compare October 17, 2021 23:52

djarecka merged commit 25eb129 into nipype:master Oct 20, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Sun Grid Engine: SGEWorker #472

Sun Grid Engine: SGEWorker #472

chasejohnson3 commented May 31, 2021

codecov bot commented May 31, 2021 •

edited

Loading

djarecka commented Jun 12, 2021

djarecka commented Jun 12, 2021

chasejohnson3 commented Jun 15, 2021 •

edited

Loading

djarecka commented Jun 15, 2021

chasejohnson3 commented Jun 15, 2021

chasejohnson3 commented Jun 24, 2021 •

edited

Loading

hjmjohnson commented Oct 15, 2021

chasejohnson3 commented Oct 15, 2021

djarecka commented Oct 17, 2021

chasejohnson3 commented Oct 17, 2021

djarecka commented Oct 20, 2021 •

edited

Loading

Sun Grid Engine: SGEWorker #472

Sun Grid Engine: SGEWorker #472

Conversation

chasejohnson3 commented May 31, 2021

Acknowledgment

Types of changes

Summary

Checklist

codecov bot commented May 31, 2021 • edited Loading

Codecov Report

djarecka commented Jun 12, 2021

djarecka commented Jun 12, 2021

chasejohnson3 commented Jun 15, 2021 • edited Loading

djarecka commented Jun 15, 2021

chasejohnson3 commented Jun 15, 2021

chasejohnson3 commented Jun 24, 2021 • edited Loading

hjmjohnson commented Oct 15, 2021

chasejohnson3 commented Oct 15, 2021

djarecka commented Oct 17, 2021

chasejohnson3 commented Oct 17, 2021

djarecka commented Oct 20, 2021 • edited Loading

codecov bot commented May 31, 2021 •

edited

Loading

chasejohnson3 commented Jun 15, 2021 •

edited

Loading

chasejohnson3 commented Jun 24, 2021 •

edited

Loading

djarecka commented Oct 20, 2021 •

edited

Loading