Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

FilesetWrapper lazy loads Images and UsedFiles #405

Merged
merged 3 commits into from
Apr 24, 2024

Conversation

will-moore
Copy link
Member

@will-moore will-moore commented Mar 27, 2024

Fixes #404.

When we do conn.getObject('Fileset', fileset_id) we don't try to load all Images and Files up front.
Instead, these are lazily loaded as required using the same pattern as for other Wrapper objects.

Test that fileset.copyImages() and fileset.listFiles() behaviour don't change:

E.g. test script should show no change:

python test.py 123

import argparse

from omero.cli import cli_login
from omero.gateway import BlitzGateway

def main(args):
    parser = argparse.ArgumentParser()
    parser.add_argument('fileset', type=int)
    args = parser.parse_args(args)
    fileset_id = args.fileset

    with cli_login() as cli:
        conn = BlitzGateway(client_obj=cli._client)

        fileset = conn.getObject('Fileset', fileset_id)

        for i in fileset.copyImages():
            print("Image", i.id, i.name)

        for f in fileset.listFiles():
            print("File", f.id, f.name)


if __name__ == '__main__':
    import sys
    main(sys.argv[1:])

@will-moore will-moore requested a review from knabar March 27, 2024 17:14
@sbesson
Copy link
Member

sbesson commented Mar 28, 2024

Maybe I am missing something obvious but shouldn't _FilesetWrapper._getQueryString() also be updated to remove the fetch commands in order to achieve proper lazy loading?

@will-moore
Copy link
Member Author

@sbesson - Apologies, that's a copy/paste error on my part. Thanks for catching!
I have had issues in the past with pip install -e for omero-py so I was editing in a different location.
I certainly used the updated query string in testing.
Fixed in b86cd89

@sbesson
Copy link
Member

sbesson commented Mar 28, 2024

Thanks. Two additional questions:

  • arguably, what is proposed is as a breaking change as a consumer relying conn.getObject('Fileset', fileset_id) to return a loaded object would need to modify their code to call copyImage and/or listFiles. Should this consider some form of backwards-compatibility e.g. using the opts parameter to toggle lazy loading?
  • when using lazy loading, should _FilesetWrapper._getQueryString simply use BlitzObjectWrapper._getQueryString which fetches the owner and creation event in addition to the object?

@will-moore
Copy link
Member Author

I'm not sure if this is considered a breaking change, at least not to the BlitzGateway API.
For a consumer to use the FilesetWrapper as in the example above, nothing should change, and it shouldn't make any difference that the loading happens under the hood during the initial getObject() call or subsequent calls (except that if you're only loading images or files (and not both) then the new behaviour should be faster).

If you consider the API to extend to the underlying fileset._obj object behaviour, and you want to do this:

fileset = conn.getObject('Fileset', fileset_id)
fileset._obj.copyUsedFiles()

then this will break. This is kinda edge-case behaviour, so I'm not sure that it justifies a major release, but if so then I think it's worth doing because it's important that the default behaviour of conn.getObject('Fileset', fileset_id) is changed to avoid OOM errors.

Using opts to specify whether we want to load more of the graph up front has been useful for other objects, primarily for use in the JSON API, where we want to pass the underlying omero.model objects to the encoder. In the case where we are using the BlitzGateway API to traverse the graph, lazy loading is the preferred behaviour, particularly in this case.

I'm happy to add opts for load_images and load_files, as long as the default opts are False, but I also feel that this feature isn't really needed yet so I'm OK with leaving it till it is required.

In the meantime I'll remove the _FilesetWrapper._getQueryString as suggested is it's not needed.

Copy link
Member

@sbesson sbesson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tested against a representative IDR plate (1K images, 20K files) - see https://idr.openmicroscopy.org/webclient/?show=plate-10263

>>> from omero.gateway import BlitzGateway
>>> conn=BlitzGateway("public","public",host="idr.openmicroscopy.org",secure=True)
>>> conn.connect()
True
>>> fileset = conn.getObject("Fileset",6311594)
>>> fileset._obj._usedFilesLoaded
False
>>> fileset._obj._imagesLoaded
False
>>> print(len(fileset.copyImages()))
1152
>>> fileset._obj._imagesLoaded
True
>>> fileset._obj._usedFilesLoaded
False
>>> print(len(fileset.listFiles()))
23045
>>> fileset._obj._usedFilesLoaded
True

In terms of response time, getObject("Fileset", 6311594) returned instantly while fileset.listFiles() took several seconds.

Following up on the discussion in #405, internal fields prefixed with _ should not be considered as part of the public API inline. In that sense, I agree there is no change in behavior as accessing the images and files associated with a fileset should happen using the public copyImages() and listFiles APIs as described in https://omero.readthedocs.io/en/stable/developers/Python.html#filesets-added-in-omero-5-0. The only difference introduced by this PR is that these API calls will no longer return instantly due to the fetching and their response time will depend on the number of underlying objects. I'll leave others to comment but it might be worth mentioning this aspect in the docstring and/or the reference documentation.

Overall, I agree that the current behavior is dangerous especially when loading filesets with large numbers of images and/or plates and lazy loading addresses this issue.

@jburel jburel merged commit eea0b87 into ome:master Apr 24, 2024
9 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

getObject("Fileset") leads to OutOfMemoryError
3 participants