# ✨ NEW: Add ProcessLauncher.process_cache #213
base: master
## Conversation
Codecov Report
```diff
@@            Coverage Diff            @@
##           develop     #213    +/-  ##
===========================================
- Coverage    90.46%   90.45%   -0.00%
===========================================
  Files           22       22
  Lines         2976     2994      +18
===========================================
+ Hits          2692     2708      +16
- Misses         284      286       +2
```
Continue to review full report at Codecov.
Seems like a good move. From my side I'd be happy for this to go in.
Thanks, I'll wait a bit to merge in case @sphuber wants to comment?
Thanks @chrisjsewell. I think it is definitely useful to have this cache, but I have some questions about the implementation. One I added in the code; the second is what happens when a process terminates: shouldn't it be properly removed from the cache? Also, there is a risk of the cache going out of sync if a task gets sent back. At that point the cache should also be updated, because otherwise if the task gets reloaded it would unjustly hit the `DuplicateProcess` exception.
```python
proc: Process = proc_class(*init_args, **init_kwargs)

if proc.pid in self._process_cache and not self._process_cache[proc.pid].has_terminated():
    raise exceptions.DuplicateProcess(f'Process<{proc.pid}> is already running')
```
I don't think we actually use the "launch" part of the communicator, but exclusively the "continue", so I don't think this will matter for our use case. Still, isn't it a bit weird to be able to hit a duplicate process when launching it? When you launch a process you are creating it for the first time, so its pid shouldn't already be running, as it can be when continuing an existing process. I can see how there can still be a clash in generated process ids, but I think that the exception type, or at the very least the message, should be different from the one in `_continue`.
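For illustration only, a minimal sketch of the distinction being suggested here; the message wording is invented, and only `exceptions.DuplicateProcess` and the cache check come from the actual diff:

```python
# in _launch: a PID clash here would mean PID generation itself produced
# a duplicate, so the message could point at that
if proc.pid in self._process_cache and not self._process_cache[proc.pid].has_terminated():
    raise exceptions.DuplicateProcess(
        f'Process<{proc.pid}> already exists: generated PID is not unique'
    )

# in _continue: a clash just means the same process was asked to continue twice
if pid in self._process_cache and not self._process_cache[pid].has_terminated():
    raise exceptions.DuplicateProcess(f'Process<{pid}> is already running')
```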
Yeh, as you mention, we don't use it, so this is only for completeness. I would say though, whether you continue or launch, if there are two processes with the same PID they are duplicates, so I disagree that the message should be any different.

(If you launch twice with the same PID, this is no different to continuing twice.)
Not quite sure I agree. I agree that the result of having a duplicate PID is the same; however, the origin would be very different, and I think that is important to reflect. When you launch, a new PID is created and so it should be unique. If that is not the case, then the ID-generating algorithm is fundamentally broken, which is completely different from the case in `_continue`, where one can simply have requested to continue the same process twice.
Just to have the last word lol:

> however, the origin would be very different and I think that is important to reflect

but then you could see the origin in the traceback

> When you launch a new PID is created

this would not be the case if you specifically set the pid in `init_args` or `init_kwargs`
nope, that's the point of using a …
Note I don't think there is any way to always know exactly when a process should be removed from the cache, given that …
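One way to sidestep ever needing an explicit removal, sketched below under the assumption that the "(weak) reference" mentioned in the PR description is held in a `weakref.WeakValueDictionary` (the PR's actual container may differ): entries then vanish on their own once nothing else references the process.

```python
import weakref

class ProcessLauncher:
    def __init__(self):
        # weak values: a cache entry disappears automatically once the
        # process object is garbage collected, so terminated processes
        # never need to be evicted by hand
        self._process_cache = weakref.WeakValueDictionary()

    def _cache_process(self, proc):
        # hypothetical helper: store under the pid, mirroring the
        # duplicate check shown in the diff above
        self._process_cache[proc.pid] = proc
```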
Thanks @chrisjsewell, with that addition this is fine for me to be merged. Note that I still don't agree on the same exception behavior being used in `_launch` and `_continue`, as I describe in the comment thread, but it is not critical for now, so let's leave it.

Then to comment on your OP on how to stop running processes once a runner loses its connection: I think we might have to add a `STOPPED` state to the state machine, reached by calling `stop()` on a process, which will then stop all mutations of state. As you said, `kill()` is not what we are trying to do, and so hacking around the undesirable consequences seems the wrong way to go about it.
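A hypothetical sketch of that suggestion; plumpy's `ProcessState` does not actually contain a `STOPPED` member, and the semantics here are only what the comment above describes:

```python
from enum import Enum

class ProcessState(Enum):
    # the six states plumpy defines today
    CREATED = 'created'
    RUNNING = 'running'
    WAITING = 'waiting'
    FINISHED = 'finished'
    EXCEPTED = 'excepted'
    KILLED = 'killed'
    # the proposed addition: entered via stop(), after which no further
    # state mutations would be allowed
    STOPPED = 'stopped'
```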
Actually, hold on with merging. Can I retract my approval? Edit: I am now wondering about the correctness of the …
Nope, too late 😛 What's the issue?
Hmm fair, OK let's game this out: so RMQ loses the connection and now still has a "process continue" task to fulfil, and it tries to give that task to the same daemon worker that was already running it (and has since been reconnected).
So perhaps we want to try implementing …
In fact, in the …
@sphuber Could you please comment on Chris's points so that we can move forward and get this merged?
Note I am now working on this, as @ltalirz alludes to, using the process cache together with a …
Note, it might also give it to another daemon worker if more than one is active.
This is already happening in …
Ideally, if a worker receives a task that it is already running, it would simply continue running the process and acknowledge the task once the process is done. This way there is no loss of work already done since the last checkpoint of the process. However, I am not sure if this is technically possible, or whether it might come with some potential for bugs in edge cases. Maybe the safest is indeed to stop the corresponding process and then start running from the last checkpoint. Is there a reason why the task has to be rejected and re-queued, and the runner cannot simply continue to run it after having stopped it first and removed it from the cache?
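As a sketch of that stop-and-resume alternative — every helper name here is invented, and plumpy's real `_continue` looks different:

```python
# hypothetical handling inside the continue-task callback: rather than
# rejecting the task, stop the in-flight process, drop it from the cache,
# and resume from its last checkpoint
existing = self._process_cache.get(pid)
if existing is not None and not existing.has_terminated():
    existing.stop()                     # the hypothetical stop() discussed above
    self._process_cache.pop(pid, None)  # keep the cache in sync

proc = self._load_from_checkpoint(pid)  # invented loader name
self._process_cache[pid] = proc
```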
Can't we simply let the daemon runner except? Circus will then relaunch a new one and we would be sure that all tasks were properly requeued and there won't be possibilities for subtle inconsistencies with the process cache.
It looks like we would be needing this anyway, so that would be great yeah.
Currently, in aiida-core for a daemon runner:

- there is no way to know which processes the runner is currently running (short of trawling `gc.get_objects()`)
- continuing a process that is already running hits a `DuplicateSubscriberIdentifier` exception (see Issues with new asyncio daemon (stress-test) aiida-core#4595)

So in this PR I propose to add a cache to the `ProcessLauncher` that keeps a (weak) reference of the processes it has launched, and raises a `DuplicateProcess` exception when trying to launch/continue a process that is already running.

I envision potentially also using this in two places:

- `verdi daemon status`, …
- with `aio_pika.Connection.add_close_callback` (kiwipy#104), to create a callback that stops all processes currently running when the connection to RMQ is lost. Maybe something like:
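A speculative completion of that "maybe something like": only `add_close_callback` and the process cache come from this thread, while the callback signature and `stop()` are assumptions:

```python
def _on_rmq_connection_closed(self, sender, exc):
    # stop every cached process that is still running; stop() is the
    # hypothetical method discussed in the review comments above
    for pid, proc in list(self._process_cache.items()):
        if not proc.has_terminated():
            proc.stop()

# registration sketch, via aio_pika.Connection.add_close_callback:
# connection.add_close_callback(launcher._on_rmq_connection_closed)
```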