-
Notifications
You must be signed in to change notification settings - Fork 200
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
AEP: CalcJob
live monitoring
#5659
AEP: CalcJob
live monitoring
#5659
Conversation
2e87140
to
f20bb13
Compare
f20bb13
to
958b565
Compare
958b565
to
bbd7b83
Compare
@giovannipizzi @ramirezfranciscof let me know if you would like to review this still. I would like to get this merged soon for the v2.1 releaes |
Sorry @sphuber , busy weeks. Now I need to prepare for my GM on friday, would it be ok if I take a look on friday afternoon? |
Hey, I'm still working on testing this, but one issue I found is that this seems to be using a type operation that is only supported from python 3.10 on (source). When running on 3.9 I get the following (in 3.10 it works fine):
|
Thanks for the report @ramirezfranciscof . It's interesting that it is failing because the CI runs on Python 3.8 and the tests pass just fine. Can you share the script that you are running? Note that if you use this notation in older Python versions, you need to explicitly enable it with |
Yeap, you are right. It is not exactly on the test script but on the plugin implementation, as I was copying the examples in the AEP and didn't realize they were using this python 10 feature. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hey @sphuber apologies for the delay, but I can confirm that the feature seems to be compatible with what we are trying to do with the Aurora project. Here are my comments on the code.
One general issue: whenever a calcjob finishes correctly, the status gets stuck in Waiting for transport task: retrieve
. Also the kill command leaves a None
status rather than the Killed through 'verdi process kill'
:
PK Created Process label Process State Process status
---- --------- ----------------------- ---------------- ------------------------------------
161 5D ago datanode_preparation ⏹ Finished [0]
165 5D ago BatteryCyclerExperiment ☠ Killed Killed through `verdi process kill`
168 5D ago CalcjobMonitor ⏹ Finished [0]
176 5D ago datanode_preparation ⏹ Finished [0]
180 5D ago BatteryCyclerExperiment ☠ Killed None
183 5D ago CalcjobMonitor ⏹ Finished [0] Waiting for transport task: retrieve
186 5D ago datanode_preparation ⏹ Finished [0]
190 5D ago BatteryCyclerExperiment ⏹ Finished [0] Waiting for transport task: retrieve
218 1D ago datanode_preparation ⏹ Finished [0]
223 1D ago BatteryCyclerExperiment ⏹ Finished [0] Waiting for transport task: retrieve
274 23h ago datanode_preparation ⏹ Finished [0]
279 23h ago BatteryCyclerExperiment ⏹ Finished [150]
284 22h ago datanode_preparation ⏹ Finished [0]
289 22h ago BatteryCyclerExperiment ⏹ Finished [150]
334 21h ago datanode_preparation ⏹ Finished [0]
339 21h ago BatteryCyclerExperiment ⏹ Finished [0] Waiting for transport task: retrieve
344 5h ago datanode_preparation ⏹ Finished [0]
349 5h ago BatteryCyclerExperiment ⏹ Finished [150]
354 4h ago datanode_preparation ⏹ Finished [0]
358 4h ago BatteryCyclerExperiment ⏹ Finished [0] Waiting for transport task: retrieve
363 4h ago datanode_preparation ⏹ Finished [0]
367 4h ago BatteryCyclerExperiment ☠ Killed None
370 4h ago CalcjobMonitor ⏹ Finished [0] Waiting for transport task: retrieve
Total results: 23
(Calculation 165 was run with a previous version of aiida)
I think this might be related to moving the set_process_status
method out of the try statement but this is just a guess.
Thanks for testing and reviewing the code @ramirezfranciscof
That's a bug indeed. Will have a look to fix it.
I don't think it should show |
Haha, yes that is true. I guess what I meant is that the |
We could. We could set the process status to the message of the exit code, which does contain the specific monitor error, but the thing is that it is not really consistent as we also don't do this for any other exit code. In my mind, the code being killed by a monitor is just like the parser returning a non-zero exit code. If a user sees We could discuss adding this, but then I think we should add it for all non-zero exit codes. But then we are really duplicating data in the database, since the |
bbd7b83
to
fe8979c
Compare
Ah my bad, I now see what you mean, it is when you actually kill a job it shows literal Edit: this was actually a change I accidentally introduced in 8bb7b34 . I will open a separate PR to reinstate this behavior. |
d4d22f0
to
f9035fe
Compare
@ramirezfranciscof thanks again for the thorough testing. I have added two commits for the bugs that you uncovered. I added them on top so it is easier for you to review the changes. Once you sign off on the changes, I will rebase them into the proper commits. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good for me, issues seem addressed. I can't re-test it right now because my computer is doing some lengthy maintenance, but I think it should be ok to merge this and if I find any outstanding problem in testing later we can just do a different PR to fix it.
Instead of killing the process, it is preferable to have the engine retrieve the files from the job and call the parser, if one was defined. However, this would result in the process to be marked as terminated nominally with the exit status returned by the parser. It would be impossible to see that the job was stopped by a monitor other than from the process report. It is important that one can query for processes that were stopped by a monitor. To do so, a dedicated exit code is defined on the `CalcJob` called `STOPPED_BY_MONITOR` which is set on the node, overriding any exit status returned by the parser. There is no way around this since only one exit status can be set on a node. Nevertheless the result from the parser is logged and so is visible in the process report. If monitors are defined for the `CalcJob` the corresponding node will contains information of the package versions from which the monitors come. The information is added to the `version` attribute which already contains version information of `aiida-core` and the `CalcJob` plugin itself.
f9035fe
to
a69a84b
Compare
The `priority` attribute takes an integer and is zero by default. It allows the user to define the order in which monitors need to be called in case multiple are defined. The ordering is implemented in the `CalcJobMonitors` utility class. This is mostly so that it is easy to unit test. It ensures the monitors are ordered by their priority, going from high to low. In case of identical priorities, the monitors are sorted alphabetically by the keys in the imonitors` input namespace.
The `minimum_poll_interval` is an optional positive integer that can be defined for a monitor. If defined, the engine will ensure that the interval between two successive calls of the monitor will at least be this long. In order to track the last time a monitor was called, the timestamp is added to the `call_timestamp` attribute when it is called by the `CalcJobMonitors.process` method.
This dataclass can be returned by a monitor to communicate to the engine the course of action to take. Returning a str from a monitor remains supported as it is automatically converted to a `CalcJobMonitorResult`. The first attribute that is added is `action` which takes an instance of the `CalcJobMonitorAction` enum. Currently the only value is `KILL` which is therefore also the default and instructs the engine to kill the job and stop monitoring.
By default it is `True`, but if set to `False`, the engine will skip the parsing step. In this case, the `STOPPED_BY_MONITOR` exit code will be set on the node.
By default it is `True`, but if set to `False`, the engine will skip the retrieval step and terminate the process straight away. The exit code that will be set is `CalcJob.exit_codes.STOPPED_BY_MONITOR` and there will not be a `retrieved` output node.
By default it is `True`, but if set to `False`, the engine will not override the exit code returned by the parser with the default exit code `STOPPED_BY_MONITOR` that is set when a job is stopped through a monitor. Naturally, this attribute is ignored when the `parse` and or `retrieve` attribute are set to `False` as in that case the parser is never even called so there is nothing to not override.
So far the `CalcJobMonitorAction`, the enum that can be returned by a calcjob monitor, only supported a single option `KILL`. When the engine receives this instruction, the job will be killed immediately. An alternative use case is where a monitor has performed a check and an optional action and now simply wants to let the job run its course. Often in this case it is important that the monitor itself, and any others that may have been registered, are no longer run for the lifetime of the job. The `CalcJobMonitorAction` now adds the `DISABLE_ALL` option, which when set on the `action` attribute of a `CalcJobMonitorResult`, the engine will disable all monitors for the remainder of the duration of the job.
The default action for a `CalcJobMonitorResult` is the option `CalcJobMonitorAction.KILL` which immediately kills the job. However, sometimes, one wants to simply disable the monitor and continue running the job nominally. The `CalcJobMonitorAction.DISABLE_SELF` option instructs the engine to not call the monitor that returned it again in future monitor evaluations.
a69a84b
to
465d499
Compare
Fixes #1925
Implementation of AEP: Allow CalcJobs to be actively monitored and interrupted