Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sun Grid Engine: SGEWorker #472

Merged
merged 1 commit into from
Oct 20, 2021
Merged

Conversation

chasejohnson3
Copy link
Contributor

Acknowledgment

  • I acknowledge that this contribution will be available under the Apache 2 license.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)

Summary

This PR addresses a feature requested and discussed in #247

  • Array jobs
  • _result.pklz polling
  • max_threads limit
  • Set threads used per task
  • Indirect host job submission

This PR adds Sun Grid Engine compatibility to the list of pydra workers available. The implementation of the SGEWorker
slightly mirrored the SlurmWorker, but deviated to accommodate instances where many jobs would be submitted
at one time. To alleviate strain on SGE for large inputs, job arrays are utilized, and instead of polling for job completion
with qstat/qacct, job completion is polled by the existence of the _result.pklz file for a given task. These two changes
decrease the number of SGE calls are made (qsub, qstat, and qacct) which overload and slow the entire SGE system when
many are called at the same time.
To further avoid overloading SGE, an SGEWorker parameter max_threads is used to limit the number of slots are requested
of the SGE system at any one time. By default, a task requests 1 slot, but the number of slots used for a task can be set by the input_spec field sgeThreads. Indirect host job submission using SGEWorker parameter indirect_submit_host allows a pydra workflow to be run on a more-powerful SGE "compute" node while qsub calls are made from SGE "submit" nodes.

Note: The cache_dir for workflows using SGEWorker MUST be in a directory shared between all nodes of an SGE cluster.
This is also true for the SGEWorkflow tests, where tempdir must be set to a "Shared" memory location.

Still to do:
Sometimes lock files are created for a task but the task makes not progress and "hangs" without completing.
I believe the best resolution to this issue is to find empty directories held by lock files, remove them, and rerun their tasks.
This seems like a pydra-wide bug (see #216 ), but there may be specific implementations when the SGEWorker specifically
comes across empty but locked directories.

Checklist

  • All tests passing
  • I have added tests to cover my changes
  • I have updated documentation (if necessary)
  • My code follows the code style of this project
    (we are using black: you can pip install pre-commit,
    run pre-commit install in the pydra directory
    and black will be run automatically with each commit)

@codecov
Copy link

codecov bot commented May 31, 2021

Codecov Report

Merging #472 (426ed84) into master (841796c) will decrease coverage by 4.62%.
The diff coverage is 7.25%.

❗ Current head 426ed84 differs from pull request most recent head 9ae43c7. Consider uploading reports for the commit 9ae43c7 to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master     #472      +/-   ##
==========================================
- Coverage   83.04%   78.42%   -4.63%     
==========================================
  Files          20       20              
  Lines        4046     4305     +259     
  Branches     1118     1182      +64     
==========================================
+ Hits         3360     3376      +16     
- Misses        491      733     +242     
- Partials      195      196       +1     
Flag Coverage Δ
unittests 78.35% <7.25%> (-4.63%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
pydra/engine/workers.py 17.83% <6.94%> (-13.95%) ⬇️
pydra/engine/submitter.py 83.33% <33.33%> (-1.13%) ⬇️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 841796c...9ae43c7. Read the comment docs.

@chasejohnson3 chasejohnson3 changed the title Job arrays Sun Grid Engine: SGEWorker Jun 1, 2021
@djarecka
Copy link
Collaborator

@chasejohnson3 - thank you for the PR, do you think it's possible to create a container that could be used for testing the new worker?

@djarecka
Copy link
Collaborator

@chasejohnson3 - btw. any chances you're joining the OHBM hackathon next week?

@chasejohnson3
Copy link
Contributor Author

chasejohnson3 commented Jun 15, 2021

@djarecka Thanks for taking a look at this PR. I'm afraid I am running low on time available to contribute to pydra, so I don't think I will be able to get to the worker or OHBM as much as I would like to.

@djarecka
Copy link
Collaborator

thanks @chasejohnson3 ! We could try to help with this PR. Perhaps, I can try to find someone who have some experience with SGE.
Even if you won't be officially at OHBM, you could always find us on nipype channel on mattermost (https://mattermost.brainhack.org/brainhack/channels/nipype)

@chasejohnson3
Copy link
Contributor Author

Thanks for keeping me in the loop @djarecka!

@chasejohnson3
Copy link
Contributor Author

chasejohnson3 commented Jun 24, 2021

@djarecka What would be involved in creating a container to test the SGEWorker? Is there an example (ie one for SLURM) I could base it on? I am just trying to gauge how much of an effort that would be.

@hjmjohnson
Copy link

@chasejohnson3 Could this be rebased and the conflict resolved?

@chasejohnson3
Copy link
Contributor Author

@chasejohnson3 Could this be rebased and the conflict resolved?

I did not initially merge this PR because we do not have a SGEWorker container to test it. @djarecka can the SGEWorker be merged without having a test container yet?

@djarecka
Copy link
Collaborator

@chasejohnson3 - I think it would be good to have tests, but we could add this as an experimental worker. Hopefully, someone will find some time soon to work on the testing.

However, please merge master and clean a bit (you have some commented print statements, etc.). Also, please confirm that you tested this manually on your system

@chasejohnson3 chasejohnson3 force-pushed the job_arrays branch 4 times, most recently from a2f8476 to 5bb5915 Compare October 17, 2021 23:29
@chasejohnson3
Copy link
Contributor Author

@djarecka I rebased this branch with master and removed the comments/print statements. I did successfully test the SGEWorker manually on our system. Let me know if anything else needs to be done before merging!

@djarecka djarecka merged commit 25eb129 into nipype:master Oct 20, 2021
@djarecka
Copy link
Collaborator

djarecka commented Oct 20, 2021

@chasejohnson3 - Thank you! I've merged it, but could I ask you to open an issue to add tests for it (and perhaps some information what will be needed for it)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants