Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prevent hang on exit while omero.client keepalive is active #424

Merged
merged 2 commits into from
Sep 16, 2024

Conversation

DavidStirling
Copy link
Contributor

When using omero.client best practice appears to be to have everything in a try block and explicitly call client.closeSession to shut everything down nicely, however users don't always remember to do this.

If a user has called client.enableKeepAlive(x) then exiting without calling closeSession currently seems to cause Python to hang instead of exiting.

Minimal example:

import omero
client = omero.client(host="my.omero.server")
session = client.createSession(username="user.name", password="my.pwd")
client.enableKeepAlive(5)
print("Session is active: ", client.getSessionId())
# A normal exit behaves similarly, but this is clearer in the console
raise Exception("A forced crash")

At least on my end, executing this script will throw the exception but fail to exit the interpreter.

Having investigated this, I believe that the task thread within the omero.util.Resources class fails to shut down when the main thread exits. The result is that, while the interpreter is waiting for all threads to close so that it can exit, the keepalive thread keeps spinning. It's currently set up to wait for an explicit close signal from the main thread, but this obviously never comes if that thread has stopped. You therefore get a hang.

As a quick fix, making these worker threads into daemon threads ensures that they die if the main thread exits, thus fixing the problem. I'm not sure how acceptable this is as the Resources class seems to be used for other repeating processes, but I couldn't find a scenario where these should also continue even in the absence of the main thread. If that's not the case then you could instead build some sort of "is the main thread alive?" check into the task.

It might actually be worth a broader review of omero-py's cleanup/shutdown routines. It looks like much of that code is rather old and so modern Python could have more effective ways to manage connections.

@joshmoore
Copy link
Member

It might actually be worth a broader review of omero-py's cleanup/shutdown routines. It looks like much of that code is rather old

Definitely! 😄

... and so modern Python could have more effective ways to manage connections.

👍

I'm not sure how acceptable this is as the Resources class seems to be used for other repeating processes, but I couldn't find a scenario where these should also continue even in the absence of the main thread.

Resources was originally built (IIRC) for use in the Processor service. A review of whether or not there's anything that those tasks need time to do might be worth it. Even if that is the case, though, we could certainly have this as a parameter to separate between server and client side.

@chris-allan
Copy link
Member

Having looked at this in a lot more detail I'm pretty confident this is just a bug fix. There is more detailed explanation here:

The summary is that our code that is in question here was written in the Python 2.4 era and sometime around Python 2.6.5 the semantics of how sys.exit is handled changed. This change resulted more or less in the standard patterns being applied to interpreter exit where they were not previously. The standard pattern for cleanup in the threading module is to join all non-daemon threads. Any thread that a Resources class spawns currently is not a daemon thread and must have the atexit handler triggered to be joinable. Consequently, we deadlock.

The only other place I see omero.util.Resources being used besides the client object keep alives is for instances of Servant. These too would currently deadlock if sys.exit were initiated but that doesn't happen as they are shut down explicitly. I'll check more thoroughly but I can't see a downside to daemon task thread being used there especially with the current implementation's reliance on an atexit handler.

@chris-allan
Copy link
Member

For reference, current uses of Servant are:

  • omero.processor.ProcessorI (uses Resources for session management)
  • omero.tables.TablesI (uses Resources for TableI servant, which uses SimpleServant without Resources, management)

This code was originally added to address
https://trac.openmicroscopy.org/ome/ticket/3260 where an exception was
being raised:

    Traceback (most recent call last):
      File "/usr/lib/python2.4/logging/__init__.py", line 737, in emit
        self.stream.write(fs % msg)
    ValueError: I/O operation on closed file

Due to the many changes in interpreter behaviour since Python 2.4,
including those surrounding sys.exit in the ~2.6.5 era [^1], as well as
improvements to the logging module itself this is no longer an issue.

In the referenced Python issue 9501, the related code changes were
originally committed in Subversion r84282, Mercurial changeset
42c70f17c2fede6116108568a75b0eec08b3c73a, Git commit
1ddd51fc71c6a6d255eee8e7e54d1bf7e05c5841.  Found by tracing the
Mercurial changeset via `Misc/svnmap.txt` in the cython repository.

At a future date `omero.util.concurrency.AtExitEvent` and
`omero.util.concurrency.get_event()` should likely be deprecated and
their use discontinued.  They attempt to handle issues that no longer
exist and add complexity and maintenance overhead to our runtime.

See also:
 * https://bugs.python.org/issue6333
 * https://bugs.python.org/issue13807

[^1]: https://stackoverflow.com/questions/3713360/python-2-6-x-theading-signals-atexit-fail-on-some-versions/3765160#3765160
@chris-allan
Copy link
Member

Detailed rationale for the logging changes in 0f8fe1c. Future deprecation, refactoring and cleanup of omero.util.concurrency.AtExitEvent and omero.util.concurrency.get_event() should definitely be done; the code makes Python 2.4 era assumptions.

Copy link
Member

@knabar knabar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Both commits make sense

@will-moore
Copy link
Member

Without this PR, calling client.enableKeepAlive(5) prevents interpreter from exiting with Ctrl-D.
When this is forced with Ctrl-C I get...

(omeroweb2) Williams-MacBook-Pro:zarr-python wmoore$ python
Python 3.9.15 | packaged by conda-forge | (main, Nov 22 2022, 08:50:29) 
[Clang 14.0.6 ] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import omero
>>> client = omero.client(host="localhost")
>>> session = client.createSession(username="user", password="pass")
>>> client.enableKeepAlive(5)
>>> 
^CException ignored in: <module 'threading' from '/Users/wmoore/opt/anaconda3/envs/omeroweb2/lib/python3.9/threading.py'>
Traceback (most recent call last):
  File "/Users/wmoore/opt/anaconda3/envs/omeroweb2/lib/python3.9/threading.py", line 1477, in _shutdown
    lock.acquire()

Updating to this branch fixes that behaviour (clean exit with Ctrl-D).

👍

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I audited the usage of the omero.util.Resources API across the OME code:

  • in omero.clients, BaseClient object has a private __resources field which gets filled with one (or theoretically multiple) Entry objects which will periodically running a keepAlive() call
  • unTables.py starts a omero.tables.TablesI service server-side. This service is a subclass of omero.util.Servant which means it is an omero.util.SimpleServant managing some resources. Each TableI object retrieved via the getTable API is registered as a resource
    -runProcessor.py similarly starts the omero.processor.ProcessorI service server-side which is also subclass of omero.util.Servant. Two types of resources are registered in this service:
    • the session kept alive via the omero.processor.UseSessionHolder object
    • the omero.processor.ProcessI (or subclass) launched by the service
  • in omero-dropbox, MonitorClientI registers the omero.util.ServerContext as a resource to be able to retrieve the active session

No regression in the existing unit and integration tests have been flagged since this PR got included. As shown above, in many scenarios the resources objects are used to maintain an active connection . Client-side, the keepAlive task should definitely be terminated alongside the main thread which created the session and there are no active resources to cleanup.
Server-side, all usages of resources to maintain an active session should likely be terminated if the process itself is restarted.
I'll perform a round of functional testing trying and review the cleanup conditions for the table/processor use cases. Otherwise, I agree this should likely be released as a bug fix.

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ran some additional tests using the Processor service and the following minimal script which is waiting for 2 min and performing an action

#!/usr/bin/env python
# -*- coding: utf-8 -*-

from omero.gateway import BlitzGateway
from omero.rtypes import rstring

import omero.scripts as scripts
import time
if __name__ == "__main__":

    client = scripts.client(
        'Example.py', """This script ...""",
        version="0.1",
    )

    try:
        conn = BlitzGateway(client_obj=client)
        time.sleep(120)
        conn.getObject("Image",12003)
        client.setOutput("Message", rstring("Success"))

    finally:
        client.closeSession()

The Processor service was stopped:

  • using omero admin ice "server stop Processor-0" or kill <process_pid>
  • after the execution of the script or during the execution of the script
  • with OMERO.py 5.19.4 and with OMERO.py built with this PR included

In all scenarios above, the Processor was successfully stopped with the following sequence of logs:

2024-09-04 10:24:17,574 INFO  [                       omero.util.Server] (MainThread) Cleanup
2024-09-04 10:24:17,574 INFO  [                    omero.util.Resources] (Thread-2  ) Halted
2024-09-04 10:24:17,574 INFO  [              omero.processor.ProcessorI] (MainThread) Cleaning up
# Several events about killing session, changing job status...
# The nature of these depending on the sequence of events and whether active processes are running
2024-09-04 10:24:17,579 INFO  [              omero.processor.ProcessorI] (MainThread) Done
2024-09-04 10:24:17,586 INFO  [                       omero.util.Server] (MainThread) Stopped

The omero.util.Resources log statement indicate the Task.run method is returning which is the expectation. The processor was successfully auto-restarted in all cases and new scripts were executable.

The only minor RFE that arose from this testing is that when interrupting the Processor while a script was running, an active OMERO.web session trying to poll the results of the script will throw the generic 500 page with the following stack trace

Traceback (most recent call last):
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/django/core/handlers/exception.py", line 55, in inner
    response = get_response(request)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/django/core/handlers/base.py", line 197, in _get_response
    response = wrapped_callback(request, *callback_args, **callback_kwargs)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omeroweb/decorators.py", line 538, in wrapped
    retval = f(request, *args, **kwargs)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omeroweb/decorators.py", line 597, in wrapper
    context = f(request, *args, **kwargs)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omeroweb/webclient/views.py", line 3864, in activities
    cb = ScriptsCallback(conn.c, proc)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omero/callbacks.py", line 72, in __init__
    process.registerCallback(self.prx)
  File "/opt/omero/OMERO.venv/lib/python3.10/site-packages/omero_Scripts_ice.py", line 1115, in registerCallback
    return _M_omero.grid.Process._op_registerCallback.invoke(self, ((cb, ), _ctx))
Ice.ObjectNotExistException: exception ::Ice::ObjectNotExistException
{
    id = 
    {
        name = D515E601-3969-445A-BB3D-382778797953
        category = 
    }
    facet = 
    operation = registerCallback
}
<WSGIRequest: GET '/webclient/activities/?_=1725446734744'>

and refreshing the window still displays the 500 so I had to log out/log in again. I assume this is possibly something that could be handled more gracefully at the OMERO.web level, handling the underlying server exception.

Unless someone thinks of another testing scenario to evaluate, I would be inclined to get this released as OMERO.py 5.19.5 and start consuming it in various places to ensure it does not cause any unexpected regression

@jburel jburel merged commit 789a097 into ome:master Sep 16, 2024
13 checks passed
@jburel
Copy link
Member

jburel commented Sep 16, 2024

Released in 5.19.5

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants