Support exporting dataset #83

Closed · wants to merge 6 commits

Conversation

@dominikl (Member)

Attempt to add 'export dataset' functionality: export all images of a dataset into a directory named after the dataset ID. The images are exported in parallel using dask.
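Roughly, the approach looks like this (a simplified sketch rather than the exact PR code; export_image, export_dataset and the dask wiring are illustrative):

# Sketch only -- helper names are made up for illustration.
import dask
from omero.gateway import BlitzGateway

def export_image(client, image_id, target_dir):
    # Fetch all planes of one image and write them into <target_dir>/<image_id>.zarr.
    conn = BlitzGateway(client_obj=client)
    image = conn.getObject("Image", image_id)
    pixels = image.getPrimaryPixels()
    zct_list = [(z, c, t)
                for t in range(image.getSizeT())
                for c in range(image.getSizeC())
                for z in range(image.getSizeZ())]
    for plane in pixels.getPlanes(zct_list):
        ...  # write the numpy plane into the zarr array (omitted)

def export_dataset(client, dataset, target_dir):
    # One delayed task per image; dask.compute() runs them in parallel.
    tasks = [dask.delayed(export_image)(client, img.getId(), target_dir)
             for img in dataset.listChildren()]
    dask.compute(*tasks)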
But something's not quite right. After running for a while you'll get an out of Java heap space error on the server side:

...
File "/home/dlindner/miniconda3/lib/python3.9/site-packages/omero_api_RawPixelsStore_ice.py", line 1199, in getPlane
    return _M_omero.api.RawPixelsStore._op_getPlane.invoke(self, ((z, c, t), _ctx))
omero.InternalException: exception ::omero::InternalException
{
    serverStackTrace = ome.conditions.InternalException:  Wrapped Exception: (java.lang.OutOfMemoryError):
Java heap space
        at loci.formats.tiff.TiffParser.getSamples(TiffParser.java:1030)

Is that an error on the client side (session not closed, etc.) or a server side issue? Any ideas @sbesson @joshmoore ?

@joshmoore (Member)

I don't see a missed close() or similar, so perhaps it's just the total number of calls to

planes = pixels.getPlanes(zct_list)
?

@will-moore (Member)

@dominikl That's great. I wonder if you could add .zattrs into the top-level Dataset dir, to list the paths to the images.

"collection": {
  "images": {
      "image1": {}, "image2": {}, "image3": {}, "image4": {}, "image5": {},
  },

I think that represents the summary of the discussion at ome/ngff#31

We probably want to consider some different naming of Dataset and Images (not just ID.zarr)?
That's not part of the spec, but if you're browsing the "collection" in another client, you probably want to know the Image names at the top, instead of having to open the .zattrs of each Image to get the names.

I guess we could do:

"collection": {
"images": {
"123.zarr": {"name": "image1.tiff"}, "456.zarr": {"name": "image2.tiff"},
},

but that's less nice. Using IDs as paths on the file-system is also not so human readable.
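For illustration, that second variant could be written into the top-level group's .zattrs roughly like this (a sketch; the path and group names are placeholders):

# Sketch: write the proposed "collection" metadata into the top-level .zattrs.
import zarr

root = zarr.open_group("dataset_123.zarr", mode="a")  # hypothetical top-level Dataset group
root.attrs["collection"] = {
    "images": {
        "123.zarr": {"name": "image1.tiff"},
        "456.zarr": {"name": "image2.tiff"},
    }
}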

I wonder if we want to convert e.g. "image1.tiff" into "image1" since it's not really a Tiff anymore!?

@dominikl (Member, Author)

Yes, I guess in the end the dataset has to be a "zarr" itself with the appropriate metadata, instead of just a directory. The PR is more of a draft that tries to export the images in parallel, which unfortunately doesn't work very well yet and needs some more debugging. I also agree that names are nicer than meaningless IDs, but they can lead to conflicts.

@sbesson (Member) commented Sep 17, 2021

Even without adding .zattrs while that discussion is ongoing, I would say structuring the top-level directory as a Zarr group should already allow the different images to be grouped in agreement with the Zarr specification.
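For example, something along these lines (a sketch; the paths and IDs are placeholders):

# Sketch: the Dataset directory itself becomes a Zarr group, with one subgroup per image.
import zarr

dataset_group = zarr.open_group("123", mode="a")  # dataset ID as directory name, as in this PR
for image_id in (456, 789):                       # placeholder image IDs
    dataset_group.create_group(str(image_id))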

Incidentally, this raises the question of whether ome_zarr info should be updated to detect and report on this type of hierarchy. Another use case would be the Zarr groups created by bioformats2raw.

@dominikl (Member, Author)

Tested locally; it doesn't work very well either. After a while (even with only 3 threads) the export dies with:

ERROR:omero.gateway:Failed to getPlane() or getTile() from rawPixelsStore
Traceback (most recent call last):
  File "/Users/dom/miniconda3/envs/zarr/lib/python3.9/site-packages/omero/gateway/__init__.py", line 7466, in getTiles
    rawPlane = rawPixelsStore.getPlane(z, c, t)
  File "/Users/dom/miniconda3/envs/zarr/lib/python3.9/site-packages/omero/gateway/__init__.py", line 4796, in __call__
    return self.handle_exception(e, *args, **kwargs)
  File "/Users/dom/miniconda3/envs/zarr/lib/python3.9/site-packages/omero/gateway/__init__.py", line 4793, in __call__
    return self.f(*args, **kwargs)
  File "/Users/dom/miniconda3/envs/zarr/lib/python3.9/site-packages/omero_api_RawPixelsStore_ice.py", line 1199, in getPlane
    return _M_omero.api.RawPixelsStore._op_getPlane.invoke(self, ((z, c, t), _ctx))
Ice.UnknownLocalException: exception ::Ice::UnknownLocalException
{
    unknown = ../../include/Ice/BasicStream.h:307: Ice::EncapsulationException:
protocol error: illegal encapsulation
  0 IceUtil::Exception::Exception(char const*, int) in /opt/ice-3.6.5/bin/../lib64/libIceUtil.so.36
  1 Ice::LocalException::LocalException(char const*, int) in /opt/ice-3.6.5/bin/../lib64/libIce.so.36
...

Does that mean the RawPixelsStore simply can't be called in parallel? Or is the issue the parallel omero.cli.cli_login() calls used to create sessions, and if so, is there an alternative?

@joshmoore (Member)

Does that mean the RawPixelsStore simply can't be called in parallel?

A single RawPixelsStore should block multiple access, i.e. it should be thread-safe but not concurrent. Can you find the underlying exception that was thrown?

@dominikl (Member, Author)

It was actually an out of heap space error too. I ran it with just one thread and got it too:

2021-09-22 11:54:35,192 ERROR [            ome.services.throttling.Task] (l.Server-0) Failed to invoke: ome.services.throttling.Callback@4c7969ed (omero.api._AMD_RawPixelsStore_getPlane@47a90f73 )
java.lang.reflect.InvocationTargetException: null
	at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[na:na]
...
Caused by: java.lang.OutOfMemoryError: Java heap space

Is there a memory leak somewhere on the server side when repeatedly calling RawPixelsStore.getPlane()?

@joshmoore (Member)

The only route I know of is if the pixel store doesn't get closed and more are created.

@dominikl (Member, Author)

I think the problem is that I create an omero.gateway.BlitzGateway(client_obj=c.get_client()) for each image but don't close it (so the pixel store probably isn't closed either). However, I can't close it, as that also kills the session. Is there a way to close the BlitzGateway without closing the session, @will-moore?

@will-moore (Member)

The add_image() uses BlitzGateway's getPlanes() -> getTiles(), and this should close the raw pixels store after the last plane is requested: https://github.com/ome/omero-py/blob/e99bcbbdfdbfffdff9beaa954b5bb2143f23effa/src/omero/gateway/__init__.py#L7553

You could also try conn.close(False), which should close all the various proxy services but not kill the session: https://github.com/ome/omero-py/blob/e99bcbbdfdbfffdff9beaa954b5bb2143f23effa/src/omero/gateway/__init__.py#L1965

@dominikl (Member, Author)

Thanks. I tried conn.close(False), which seemed to work, but after a while it also closes the session. I have found a way now that works, but the RawPixelsStore has to be closed explicitly. There seems to be a leak somewhere in getPlanes()/getTiles(), but I couldn't spot it; the methods look fine to me.
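Roughly this pattern (a sketch, not the exact code; it assumes an open BlitzGateway conn, a pixels_id and z/c/t indices):

# Sketch: manage the RawPixelsStore explicitly and always close it.
rps = conn.createRawPixelsStore()
try:
    rps.setPixelsId(pixels_id, True)   # True = bypassOriginalFile
    plane = rps.getPlane(z, c, t)      # ...repeat for all planes and write them out
finally:
    rps.close()                        # frees the stateful service on the server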

@will-moore (Member)

I wonder if some images aren't getting all their planes downloaded, so their pixels stores don't get closed.
It might be interesting to see whether conn.c.getStatefulServices() gives you anything when you're closing the pixels store, and to log e.g. the image name to check that each image got downloaded OK.
Also see https://github.com/ome/omero-py/blob/master/src/omero/gateway/__init__.py#L1650, where conn._assert_unregistered() does this.
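Something like this (a sketch, assuming an open BlitzGateway conn):

# Sketch: log any stateful services still registered on the session.
for proxy in conn.c.getStatefulServices():
    print("still open:", proxy)
conn._assert_unregistered()  # the gateway helper linked above, which performs this check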

NB: We discussed an alternative export strategy (when you don't have access to the binary repo):

  • download all the original files locally, using the omero download CLI command (e.g. into a temp dir)
  • then run bioformats2raw on this dir (see the sketch at the end of this comment)

This means that the server is doing much less work and maybe more parallelisation is possible.
I could start looking at that approach if it sounds like a good idea (and if you don't want to take it on)?
Potential downside: it's possible to download more data than you want, e.g. for MIF data where some images are in a different Dataset, you'd still download them.
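A rough sketch of that pipeline (it assumes the omero download CLI plugin and a bioformats2raw binary are on the PATH; the exact invocations may need adjusting):

# Sketch only: download original files per image, then convert the directory.
import subprocess
import tempfile

image_ids = [123, 456]  # placeholder image IDs

with tempfile.TemporaryDirectory() as tmp_dir:
    # download all the original files locally (into the temp dir)
    for image_id in image_ids:
        subprocess.run(["omero", "download", f"Image:{image_id}", tmp_dir], check=True)
    # then run bioformats2raw on this dir
    subprocess.run(["bioformats2raw", tmp_dir, "dataset_123.zarr"], check=True)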

@dominikl (Member, Author)

👍 Would be good to have different strategies.
So far there would be three different ways to do it:

  • This PR: Retrieve the pixels data via the API and write the zarr. Might be the fastest way for smallish images with just a few planes, and a limited amount of parallelisation.
  • Your suggestion: Download the original files and run bioformats2raw. That should definitely be faster for larger and/or many-plane images. And like you said, it doesn't need access to the managed repo either.
  • @sbesson 's suggestion: Parallelise the current bioformats2raw export with direct access to the managed repository. As the bioformats2raw call has quite some overhead (e.g. a single HPA image export is slower using bioformats2raw than getting the pixels via the API), this is probably the fastest way. But for smallish images it needs a certain minimum amount of parallelisation to outweigh the bioformats2raw overhead.

@will-moore (Member)

So, with a bit more testing, I realise that my approach of blindly using the Image name for the zarr image group causes issues with various characters, such as / (obviously) but also, I think, [ and ].
It needs a thorough test of all special characters to see what needs replacing or escaping. Happy to look into this...?
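Something along these lines might be needed (a sketch; the character set is illustrative, not a complete list):

# Sketch: replace characters that are problematic in file/group names.
import re

def sanitise_name(name: str) -> str:
    # "/" is illegal in paths; "[", "]" and a few others have caused trouble too.
    return re.sub(r'[\\/\[\]:*?"<>|]', "_", name)

print(sanitise_name("plate [well A1]/field 1.tiff"))  # -> plate _well A1__field 1.tiff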

@dominikl (Member, Author) commented Oct 4, 2021

Ah true, you'd have to escape all POSIX and Windows special characters, which can be tricky. How about using UUIDs as directory names (instead of the arbitrary OMERO IDs) and putting the names into the metadata? That means one always has to look up the metadata to make sense of the directory structure, but it saves the hassle of handling all sorts of special characters.
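For example (a sketch; the layout and names are hypothetical):

# Sketch: a UUID as the on-disk group name, with the human-readable name in the metadata.
import uuid
import zarr

root = zarr.open_group("dataset_123.zarr", mode="a")  # hypothetical top-level group
image_group = root.create_group(uuid.uuid4().hex)
image_group.attrs["name"] = "image1.tiff"             # original OMERO image name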

@dominikl (Member, Author) commented Oct 5, 2021

Superseded by #88

@dominikl closed this Oct 5, 2021
@will-moore mentioned this pull request Oct 5, 2021