Fix serialization/deserialization of weakrefs #61

tlambert03 · 2020-12-23T00:39:12Z

fixes #60 and pickling of OME objects with weakrefs by adding __getstate__ and __setstate__ methods that de-reference re-reference the weakrefs.

I added tests, and most of them are passing, but a handful of them with xml annotations are still failing with a very weird message:

_pickle.PicklingError: Can't pickle <class 'xml.etree.ElementTree.Element'>: it's not the same object as xml.etree.ElementTree.Element

furthermore, I don't get that error when I do the test manually in ipython. @jmuhlich, would love to hear if you have any thoughts on what might be happening there.

jmuhlich · 2020-12-23T02:44:54Z

Oh man I waged epic battle with that xml.etree.ElementTree.Element identity issue myself. I discovered that due to how xml.etree.ElementTree is implemented in C, every time you import classes from it in a different module of your own, you get distinct class objects. Instances of one such distinct class are not considered instances of any other instance. This means that when you create e.g. an Element object in one module, then pass it to a function in another module which calls isinstance(elt, Element), it will return false. The way I dealt with it was to ensure our code only imports xml.etree.ElementTree in one module, then all our other modules import it from there. This isn't an issue in the actual module code since all XML handling is in one module, but it does impact the tests:

ome-types/testing/test_model.py

Lines 11 to 12 in 74103b1

    
           # Import ElementTree from one central module to avoid problems passing Elements around, 
        
           from ome_types.schema import ElementTree  # isort: skip

The PicklingError is almost certainly a manifestation of this issue. I'm not sure if there's a way to influence how pickle obtains the class object to use when unpickling, and if there isn't one we'll need to re-serialize XMLAnnotation elements as XML strings for pickling. I agree that's not great, but at least it's not the entire XML document. XMLAnnotations seem pretty rare in the wild from what I've seen anyway.

jmuhlich · 2020-12-23T03:56:19Z

Even once we fix the ElementTree identity issue, there will still be an issue with the equality testing in test_serialization. Element objects don't define a custom __eq__ method so even if you parse the same XML string twice, the resulting Element objects will not compare equal. We might need to implement a custom __eq__ method on XMLAnnotation elements that re-serializes the Element and compares that.

jmuhlich · 2020-12-23T06:23:23Z

I think any direct pickling of Element objects is doomed. There seems to always be some sequence of imports of our code and xml.etree that leads to the PicklingError, and in the face of user code doing its own imports there's nothing we can do. It looks we'll need to use the XML string representation here for pickling, and due to Element's lack of a custom __eq__ we'll probably want to use the XML string for equality comparison as well (in a canonicalized form). Here are the two approaches I see right away:

1. On-demand XML serialization in XMLAnnotation.__getstate__ and __eq__, and XML re-parsing in __setstate__.
This preserves our current XMLAnnotation model where .value is an Element. Performance of OME-XML parsing and generation is unchanged. Performance of code that interrogates the XML structure of XMLAnnotations also remains unchanged. Pickling/unpickling and __eq__ will incur XML serialization/parsing overhead every time.

2. Change the XMLAnnotation model so value holds an ahead-of-time canonicalized XML-serialized string.
This will make pickling and equality just trivially work without any custom __getstate__ or __eq__ implementations since the XML data is all in a plain string, and they will be fast. We can still provide a @property that builds an Element on demand at the cost of the XML re-parsing. The performance of OME-XML parsing in the presence of XMLAnnotations will be affected, since each XMLAnnotation will need to be re-serialized on the way in rather than just taking a direct reference to the already parsed Element object. OME-XML generation will be slower too because we'll need to re-parse on the way out. Accessing the Element will also be slowed down by the on-demand re-parse.

I'm not sure how much real-world performance is actually on the table here. It depends on 1) How many XMLAnnotations are present in a given workload of images and how complex their XML is, and 2) How much pickling/unpickling is done. Maybe the clarity of the code is more important... (My assumptions are that pickling/unpickling would occur only once per image in typical use cases, like sending images to worker processes, and model equality testing will be very rare. I think poking around in the XMLAnnotation Element content will be rare too, but I'm less certain of that.)

Sorry this has gotten pretty long. You guys let me know what you think, and if you can come up with any other solutions.

tlambert03 · 2020-12-23T15:18:48Z

This is awesome, thanks for all the sleuthing @jmuhlich! I guess I favor on-demand serialization in option number 1: putting the performance penalty (however small/large) on the pickling and equality checking. Feels like that's the less common operation, and the code would be a bit more sandboxed... You agree?

I added benchmarking in #63 to help tell whether any of this really matters...

tlambert03 · 2020-12-23T16:08:45Z

updated this PR with overrides for __getstate__, __setstate__ and __eq__ on XMLAnnotation. Seems to be working with all roundtrip serialization tests passing now. @jmuhlich, if you have a moment to take a quick look at the approach before merging, that'd be great.

thanks @JacksonMaxfield for raising this issue. this is definitely a useful thing to be testing.

evamaxfield · 2020-12-23T16:11:40Z

I just want to chime in and say this:

The way I dealt with it was to ensure our code only imports xml.etree.ElementTree in one module, then all our other modules import it from there.

Is WILD.

tlambert03 · 2020-12-23T16:12:58Z

agreed... that's A) some nonsense going on there in etree and B) incredible detective work!

jmuhlich · 2020-12-23T17:44:44Z

I agree it's wild, and it never sat right with me... You encouraged me to create a minimum reproducible example and figure out exactly what's going on! Turns out the culprit isn't xml.etree, it's xmlschema! I think this code is the issue but I'm still digging:
https://github.com/sissaschool/xmlschema/blob/master/xmlschema/etree.py#L24-L59
If we can figure out how to keep the ElementTree imports from getting messed up (maybe we just have to move our ElementTree import after our xmlschema import?) we might be able to avoid messing with XMLAnnotation pickling entirely! The custom __eq__ will be required regardless, though.

tlambert03 · 2020-12-23T17:51:47Z

wow...

jmuhlich · 2020-12-23T23:04:49Z

I have convinced myself there is definitely a bug in that module namespace manipulation code in xmlschema. I submitted a two-line fix at sissaschool/xmlschema#220 . From the pace that the developers there are closing issues and cutting new releases, I think there's a good chance this could make it into a published release quickly.

jmuhlich · 2020-12-23T23:47:53Z

In my local working directory, I removed the new XMLAnnotation __getstate__ and __setstate__ methods and changed __eq__ to use ElementTree.tostring, and that passes on py38 using my xmlschema fix. :) I guess we can move forward with this PR as it stands and then roll back the pickle part once xmlschema releases my fix?

Now that we're down to it, I'm not sure resolving the weakrefs is the best way to handle pickling the References. I'm a little nervous about what happens when fragments of a model are pickled and not the top-level OME object. What about removing the ref_ entirely from the pickled state, and re-running the code in OME.__post_init_post_parse__ in OME.__getstate__ to rebuild the weakrefs from the ID strings? Maybe the logic we have in __post_init_post_parse__ should be moved to a method that can be called more easily from elsewhere. (Calling __post_init_post_parse__ explicitly feels wrong)

jmuhlich · 2020-12-23T23:52:19Z

That proposed change means people could still pickle a fragment of a model, but any refs will be broken on unpickle. I think this would prevent the kinds of misuse of pickled model objects that I'd expect.

tlambert03 · 2020-12-24T01:55:57Z

awesome. thanks for the PR to xmlschema. will be cool if that fixes other things too

What about removing the ref_ entirely from the pickled state, and re-running the code in OME.__post_init_post_parse__ in OME.__getstate__ to rebuild the weakrefs from the ID strings

I'm happy to try that out before we commit to the approach in this PR. will give it a go tomorrow.

tlambert03 · 2020-12-24T20:03:27Z

looks like your fix is already merged at xmlschema. I'm going to go ahead and merge this anyway so @JacksonMaxfield can at least use our master branch, and we can look at bumping our xmlschema requirement and cleaning it up in a later PR... perhaps before cutting a new release.

tlambert03 · 2020-12-24T20:05:03Z

forgot to add... I made the change you suggested in #61 (comment) (removing refs on getstate and rebuilding them in OME.__setstate__). works well, and I like it better

add __getstate__ __setstate__

59c3d03

tlambert03 requested a review from jmuhlich December 23, 2020 00:39

tlambert03 mentioned this pull request Dec 23, 2020

Remove or replace call to weakref / support serdes #60

Closed

tlambert03 mentioned this pull request Dec 23, 2020

patch for xmlschema 1.4.0 #62

Merged

tlambert03 added 2 commits December 23, 2020 10:34

Merge branch 'master' into fix-serialization

2e68165

override pickle and eq methods for XMLAnnotation

13984d6

tlambert03 added 3 commits December 24, 2020 14:55

relink refs only in OME.__setstate__

c131176

remove weakref import

0ab98c2

revert import change

7e707c7

tlambert03 merged commit 8528da0 into master Dec 24, 2020

tlambert03 deleted the fix-serialization branch December 24, 2020 20:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix serialization/deserialization of weakrefs #61

Fix serialization/deserialization of weakrefs #61

tlambert03 commented Dec 23, 2020

jmuhlich commented Dec 23, 2020

jmuhlich commented Dec 23, 2020

jmuhlich commented Dec 23, 2020

tlambert03 commented Dec 23, 2020 •

edited

Loading

tlambert03 commented Dec 23, 2020

evamaxfield commented Dec 23, 2020

tlambert03 commented Dec 23, 2020

jmuhlich commented Dec 23, 2020

tlambert03 commented Dec 23, 2020

jmuhlich commented Dec 23, 2020

jmuhlich commented Dec 23, 2020 •

edited

Loading

jmuhlich commented Dec 23, 2020

tlambert03 commented Dec 24, 2020 •

edited

Loading

tlambert03 commented Dec 24, 2020

tlambert03 commented Dec 24, 2020

Fix serialization/deserialization of weakrefs #61

Fix serialization/deserialization of weakrefs #61

Conversation

tlambert03 commented Dec 23, 2020

jmuhlich commented Dec 23, 2020

jmuhlich commented Dec 23, 2020

jmuhlich commented Dec 23, 2020

tlambert03 commented Dec 23, 2020 • edited Loading

tlambert03 commented Dec 23, 2020

evamaxfield commented Dec 23, 2020

tlambert03 commented Dec 23, 2020

jmuhlich commented Dec 23, 2020

tlambert03 commented Dec 23, 2020

jmuhlich commented Dec 23, 2020

jmuhlich commented Dec 23, 2020 • edited Loading

jmuhlich commented Dec 23, 2020

tlambert03 commented Dec 24, 2020 • edited Loading

tlambert03 commented Dec 24, 2020

tlambert03 commented Dec 24, 2020

tlambert03 commented Dec 23, 2020 •

edited

Loading

jmuhlich commented Dec 23, 2020 •

edited

Loading

tlambert03 commented Dec 24, 2020 •

edited

Loading