`CalcJob`: move job resource validation to the `Scheduler` class #4192
Codecov Report
@@ Coverage Diff @@
## develop #4192 +/- ##
===========================================
+ Coverage 78.96% 79.03% +0.08%
===========================================
Files 467 467
Lines 34511 34492 -19
===========================================
+ Hits 27248 27259 +11
+ Misses 7263 7233 -30
N.B.: this PR assumes that there are no schedulers out there in the wild that subclass their own Second important point: the |
@giovannipizzi it would probably be good to ship this with 1.3.0 given that we have had multiple users face this problem. I have added a second commit on top of the original one, which contains just some cleaning of the scheduler base class and job resource implementations. Especially the |
Minor change in the docs, and a question
tot_num_mpiprocs = resources.get('tot_num_mpiprocs', None)
if num_mpiprocs_per_machine is None and tot_num_mpiprocs is None and default_mpiprocs_per_machine is not None:
    # Only set the default value if tot_num_mpiprocs is not provided. Otherwise, it means that the user provided
    # both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `tot_num_mpiprocs`.
- # both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `tot_num_mpiprocs`.
+ # both `num_machines` and `tot_num_mpiprocs`, and we have to ignore the default value of `num_mpiprocs_per_machine`.
def_cpus_machine = computer.get_default_mpiprocs_per_machine()
if def_cpus_machine is not None:
    resources['default_mpiprocs_per_machine'] = def_cpus_machine
default_mpiprocs_per_machine = computer.get_default_mpiprocs_per_machine()
I think you are right and this will fail. Even if the user does not explicitly specify `num_mpiprocs_per_machine` and `tot_num_mpiprocs`, if the computer has `default_mpiprocs_per_machine` defined (which I think is currently almost always the case, or at least `verdi computer setup` suggests setting it), then `num_mpiprocs_per_machine` will be added to the `resources`, and since I changed `ParEnvJobResource.validate_resources` to raise when it gets unsupported keywords, this will fail.

I am not sure where and how to fix it. An easy hack would be to simply ignore this keyword in the `ParEnvJobResource`, but that is not the correct solution. If specifying a number of machines for a `ParEnvJobResource`-based scheduler never makes sense, maybe computers that are configured with such a scheduler should not even define a `num_mpiprocs_per_machine`.
This is necessary because resource validation is scheduler dependent. Until now, the `CalcJob` defined a validator on the entire input namespace that validated the `metadata.options.resources` input. Specifically, it demanded that the `num_machines` keyword be defined. That is indeed a requirement for the `NodeNumberJobResource`, used for schedulers like SLURM, but the field does not even make sense for schedulers like LSF and SGE, which use the `ParEnvJobResource` type of job resources.

The solution is to delegate the validation of the resources to the scheduler, which in turn delegates it to the `JobResource` class that it uses. The validation used to happen on construction of the job resource instance, but is now factored out into the `validate_resources` classmethod. This allows it to be called without having to construct an instance, which is more efficient.

Finally, the signature of the `CalcJob` input validators is changed: they no longer raise an `InputValidationError` but instead return an error message, as the interface of `Port.validator` requires. The port itself will detect if an error message is returned and raise a `ValueError` with all the relevant error messages.
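To make the delegation chain described above concrete, here is a minimal, self-contained sketch. It does not reproduce the actual aiida-core code: the class bodies, the `_job_resource_class` attribute, the `SlurmLikeScheduler` and `SgeLikeScheduler` names, the exact validation rules, and the `validate_resources_port` helper are all assumptions made for illustration. Only the overall shape (scheduler delegates to its job resource class via a `validate_resources` classmethod, and the port-level validator returns a message instead of raising) follows the description.

```python
class JobResource:
    """Toy base class (hypothetical, not the aiida-core implementation)."""

    @classmethod
    def validate_resources(cls, **kwargs):
        raise NotImplementedError


class NodeNumberJobResource(JobResource):
    """Resources for schedulers that think in terms of machines (e.g. SLURM)."""

    @classmethod
    def validate_resources(cls, **kwargs):
        num_machines = kwargs.get('num_machines')
        if not isinstance(num_machines, int) or num_machines < 1:
            raise ValueError('`num_machines` must be a positive integer')


class ParEnvJobResource(JobResource):
    """Resources for schedulers that use a parallel environment (e.g. SGE)."""

    _valid_keys = ('parallel_env', 'tot_num_mpiprocs')

    @classmethod
    def validate_resources(cls, **kwargs):
        unsupported = sorted(set(kwargs) - set(cls._valid_keys))
        if unsupported:
            raise ValueError(f'unsupported keywords: {", ".join(unsupported)}')
        if not isinstance(kwargs.get('parallel_env'), str):
            raise ValueError('`parallel_env` must be a string')
        tot_num_mpiprocs = kwargs.get('tot_num_mpiprocs')
        if not isinstance(tot_num_mpiprocs, int) or tot_num_mpiprocs < 1:
            raise ValueError('`tot_num_mpiprocs` must be a positive integer')


class Scheduler:
    """Toy scheduler that only knows which job resource class it uses."""

    _job_resource_class = JobResource  # assumed attribute name

    @classmethod
    def validate_resources(cls, **resources):
        # No job resource instance is constructed; only the classmethod is called.
        cls._job_resource_class.validate_resources(**resources)


class SlurmLikeScheduler(Scheduler):
    _job_resource_class = NodeNumberJobResource


class SgeLikeScheduler(Scheduler):
    _job_resource_class = ParEnvJobResource


def validate_resources_port(resources, scheduler_class):
    """Port-style validator: return an error message, or None if valid."""
    try:
        scheduler_class.validate_resources(**resources)
    except ValueError as exception:
        return f'input `metadata.options.resources` is not valid: {exception}'
    return None
```

With this shape, `{'num_machines': 1}` passes for the SLURM-like scheduler, while the same dictionary yields an error message for the SGE-like one, which is the scheduler dependence the old namespace validator could not express.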
The `Scheduler.get_valid_schedulers` method is deprecated, as the `aiida.plugins` module is the designated place to inquire about installed plugins. Furthermore, the `Scheduler` class is now properly marked as an abstract class, which allows removing the explicit `raise NotImplementedError` from methods that are abstract and have to be implemented by subclasses. Finally, there is some minor cleaning up of the code and docstrings.

The implementation of the `validate_resources` method of the `JobResource` subclasses has been improved, and significantly simplified in the case of `NodeNumberJobResource`, making the logic a lot more readable. Tests are added for the `ParEnvJobResource` job resource class, which was completely untested.
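As a rough illustration of what marking the base class abstract buys, the sketch below uses Python's standard `abc` module. The method name `get_submit_command` and the two subclasses are hypothetical and chosen only for the example; the real `Scheduler` interface is not reproduced here.

```python
import abc


class Scheduler(abc.ABC):
    """Toy abstract scheduler (hypothetical interface)."""

    @abc.abstractmethod
    def get_submit_command(self, script_name):
        """Return the shell command that submits ``script_name``."""
        # No explicit `raise NotImplementedError` needed: a subclass that
        # does not override this method cannot be instantiated at all.


class IncompleteScheduler(Scheduler):
    """Forgets to implement the abstract method."""


class DirectLikeScheduler(Scheduler):
    """Implements the abstract method, so it can be instantiated."""

    def get_submit_command(self, script_name):
        return f'bash {script_name}'
```

Instantiating `IncompleteScheduler` raises a `TypeError` immediately, so a missing implementation surfaces when the plugin is constructed rather than when the method is first called.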
@giovannipizzi I have added a new commit that I think should address the problem you mentioned. I added a test that launches a calculation job using a computer with the SGE scheduler, and I confirmed that it failed as I explained in my response to your question. The fix in the last commit addresses the problem. I am not sure if there are still other failure points hiding out there. @atztogo and @zhubonan if you have the time to give this branch a spin with a real calculation on an LSF or SGE cluster, that would be great!
The `CalcJob.presubmit` method was always adding the resource keyword `num_mpiprocs_per_machine` if it was not already (implicitly) specified and the computer defined a default. However, this keyword is not valid for all schedulers; for example, the SGE and LSF schedulers will raise if it is passed. Here we add the `Scheduler.preprocess_resources` method, which takes the type of job resources into account when deciding whether `num_mpiprocs_per_machine` should be defined.
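A minimal sketch of the idea, under stated assumptions: the condition for applying the default mirrors the snippet quoted in the review above, but the `accepts_default_mpiprocs_per_machine` hook, the `_job_resource_class` attribute, the scheduler subclass names, and the choice to return a copy of the dictionary are all illustrative and may differ from the actual implementation.

```python
class JobResource:
    """Toy base class (hypothetical, not the aiida-core implementation)."""

    @classmethod
    def accepts_default_mpiprocs_per_machine(cls):
        # By default a job resource type does not understand this keyword.
        return False


class NodeNumberJobResource(JobResource):
    @classmethod
    def accepts_default_mpiprocs_per_machine(cls):
        return True


class ParEnvJobResource(JobResource):
    """Parallel-environment resources: no notion of machines."""


class Scheduler:
    _job_resource_class = JobResource  # assumed attribute name

    @classmethod
    def preprocess_resources(cls, resources, default_mpiprocs_per_machine=None):
        """Return a copy of ``resources`` with the computer default applied if valid."""
        resources = dict(resources)
        num_mpiprocs_per_machine = resources.get('num_mpiprocs_per_machine')
        tot_num_mpiprocs = resources.get('tot_num_mpiprocs')

        # Only set the default if the job resource type supports the keyword
        # and the user specified neither the per-machine nor the total count.
        if (
            cls._job_resource_class.accepts_default_mpiprocs_per_machine()
            and num_mpiprocs_per_machine is None
            and tot_num_mpiprocs is None
            and default_mpiprocs_per_machine is not None
        ):
            resources['num_mpiprocs_per_machine'] = default_mpiprocs_per_machine

        return resources


class SlurmLikeScheduler(Scheduler):
    _job_resource_class = NodeNumberJobResource


class SgeLikeScheduler(Scheduler):
    _job_resource_class = ParEnvJobResource
```

Here the SLURM-like scheduler picks up the computer default when only `num_machines` is given, while the SGE-like scheduler never receives `num_mpiprocs_per_machine`, so a validator that rejects unsupported keywords no longer trips over it.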
The minimal aiida-core version has to be fixed due to the recently introduced change in the aiida-core package that moves the resources validation to the scheduler plugin. See aiidateam/aiida-core#4192
* Fix minimal aiida-core version to 1.3.0

  The minimal aiida-core version has to be fixed due to the recently introduced change in the aiida-core package that moves the resources validation to the scheduler plugin. See aiidateam/aiida-core#4192

* Fix python version to 3.7.7 in tests

  PyYAML causes issues only for tests running on Python 3.7. A suggested fix is to pin Python 3.7 to the 3.7.7 release, since tests start to fail only for 3.7.8. See: actions/runner-images#1202
Fixes #3887