
cloudpickle breaks dill deserialization across servers. #217

Open
wmarshall484 opened this issue Mar 24, 2017 · 7 comments

wmarshall484 commented Mar 24, 2017

Following up on this issue, which @mmckerns responded to on Stack Overflow:

http://stackoverflow.com/questions/42960637/python-3-5-dill-pickling-unpickling-on-different-servers-keyerror-classtype/43006034#43006034

In a nutshell, with Python 3.5:

Server A imports cloudpickle, which causes types.ClassType to become defined:

>>> import types
>>> dir(types)
  ['BuiltinFunctionType',
   'BuiltinMethodType',
   'ClassType',
   'CodeType',
   ...
  ]

Server B does not import cloudpickle, so types.ClassType is left undefined:

>>> import types
>>> dir(types)
  ['BuiltinFunctionType',
   'BuiltinMethodType',
   'CodeType',
   ...
  ]

Objects serialized on server A also seem to embed a reference to ClassType. When they are deserialized on server B, we encounter the following error:

Traceback (most recent call last):
 File "/home/streamsadmin/git/streamsx.topology/test/python/topology/deleteme2.py", line 40, in <module>
   a = dill.loads(base64.b64decode(a.encode()))
 File "/home/streamsadmin/anaconda3/lib/python3.5/site-packages/dill/dill.py", line 277, in loads
   return load(file)
 File "/home/streamsadmin/anaconda3/lib/python3.5/site-packages/dill/dill.py", line 266, in load
   obj = pik.load()
 File "/home/streamsadmin/anaconda3/lib/python3.5/site-packages/dill/dill.py", line 524, in _load_type
   return _reverse_typemap[name]
KeyError: 'ClassType'

This is because _reverse_typemap is populated partly from the contents of the types module, which doesn't define ClassType by default on Python 3.
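The mechanics can be sketched without dill or cloudpickle installed: a name-to-type map built from `dir(types)` inherits whatever other imports have attached to the module. (The `FakeClassType` name below is purely illustrative, standing in for ClassType.)

```python
import types

# Simulate a third-party import attaching an alias to the shared
# `types` module (hypothetical name, standing in for ClassType).
types.FakeClassType = type

# A map populated from the contents of `types`, as dill's
# _reverse_typemap partly is, picks up the injected name.
reverse_typemap = {
    name: getattr(types, name)
    for name in dir(types)
    if isinstance(getattr(types, name), type)
}

print("FakeClassType" in reverse_typemap)  # True: the side effect leaked in
```

This is why the error only appears when the serializing environment happened to import a polluting package and the deserializing one did not.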

The workaround on server B is to define ClassType in _reverse_typemap after dill is imported and before an object is first deserialized:

import dill
dill.dill._reverse_typemap['ClassType'] = type

# do deserialization
dill.loads(some_serialized_string)

As a long-term fix, maybe dill could maintain a whitelist of valid Python 3.5 types found in the types module? A whitelist would eliminate this kind of error and prevent pollution/side effects from other modules like cloudpickle.
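The proposed whitelist could work roughly like this (a sketch only; the allowed-name set is an abbreviated illustrative subset, not dill's actual list):

```python
import types

# Freeze the set of names the typemap may contain, so foreign
# attributes injected into `types` by other imports are ignored.
ALLOWED_NAMES = {
    "FunctionType", "LambdaType", "CodeType", "GeneratorType",
    "BuiltinFunctionType", "ModuleType", "TracebackType", "FrameType",
}

reverse_typemap = {
    name: getattr(types, name)
    for name in ALLOWED_NAMES
    if hasattr(types, name)
}

print("ClassType" in reverse_typemap)  # False, even if another import defined it
```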

@mmckerns (Member)

@wmarshall484: Thanks for the detailed follow-up. I've never looked into using dill on one side and cloudpickle on the other, primarily because there are differences in how certain objects are serialized between the two packages, so some objects might not translate. Your case seems to have an easy workaround; others may not. It's worth investigating, however. One other thing to note: pyspark has had a number of discussions and PRs submitted by their devs to natively support dill. As it stands, they still only use cloudpickle, whose development they forked from picloud once picloud went commercial.

This may be another one of the things that dill should be looking at if it wants to support cloudpickle serialization.

@wmarshall484 (Author)

"I've never looked into using dill on one side cloudpickle on the other"

Just to be sure we're on the same page, I'm using dill on both sides. One side just happens to have cloudpickle imported. I understand it's one thing for dill to completely support cloudpickle serialization, but this is more about cloudpickle disrupting dill when they just happen to be in the same environment.

ddebrunner commented Jun 25, 2018

@mmckerns FYI the latest version of dill 0.2.8.2 breaks the work-around @wmarshall484 provided.

    dill.dill._reverse_typemap['ClassType'] = type
AttributeError: module 'dill' has no attribute 'dill'

It worked in 0.2.7.1.


mmckerns commented Jun 26, 2018

There should be a simple fix for the above:

dill._dill._reverse_typemap['ClassType'] = type

Essentially, the module dill.dill moved to dill._dill; that's it.
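A version-agnostic form of the workaround can probe for whichever internal module is present (a sketch; demonstrated here on a stand-in object, since the right spelling depends on the installed dill version):

```python
import types

def register_classtype(dill_pkg):
    """Add the 'ClassType' alias to whichever internal module this
    dill version exposes: `_dill` (0.2.8+) or `dill` (older)."""
    internal = getattr(dill_pkg, "_dill", None) or getattr(dill_pkg, "dill", None)
    internal._reverse_typemap["ClassType"] = type

# Stand-in shaped like dill >= 0.2.8; in real use you would call
# register_classtype(dill) after `import dill`.
fake_dill = types.SimpleNamespace(
    _dill=types.SimpleNamespace(_reverse_typemap={})
)
register_classtype(fake_dill)
print(fake_dill._dill._reverse_typemap["ClassType"])  # <class 'type'>
```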

@pjmattingly

Hi, I have the same issue. I've narrowed it down to something in the spaCy package; more specifically, after importing spaCy, the key "ClassType" appears in types. The recommended approach does not work in my case, but applying this fix to _dill.py seems to address the issue:

def _load_type(name):
    # BUG FIX, applied 18.05.21: backfill the Python 2 alias on demand
    if name == "ClassType":
        _reverse_typemap["ClassType"] = type

    return _reverse_typemap[name]

This thread on Stack Overflow seems to support the conclusion that spaCy (or one of its dependencies) is the cause of the issue:

https://stackoverflow.com/questions/55308122/exception-has-occurred-modulenotfounderror-when-unpickling-objects-using-dill

Specifically:

I ran a pip freeze on both environments which had quite a few differences on important packages (numpy, spacy and others). I didn't try all combination of which packages fixed it but the obvious best practices worked. Thanks!

https://stackoverflow.com/questions/55308122/exception-has-occurred-modulenotfounderror-when-unpickling-objects-using-dill#comment97406107_55308432
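To track down which package in a given environment injects the alias, one can check `types` around each candidate import (a diagnostic sketch; the candidate list below is just an example):

```python
import importlib
import types

def find_polluters(candidates):
    """Return the candidate modules whose import adds 'ClassType' to types."""
    polluters = []
    for name in candidates:
        had_it = hasattr(types, "ClassType")
        try:
            importlib.import_module(name)
        except ImportError:
            continue  # not installed in this environment; skip
        if not had_it and hasattr(types, "ClassType"):
            polluters.append(name)
    return polluters

# Example: check the packages suspected in this thread.
print(find_polluters(["cloudpickle", "spacy"]))
```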

@ankit-ghub

Hi, I am also facing the same issue while trying to build an Apache Beam pipeline with serialization. It works fine until I introduce spaCy.

Issue here:

https://stackoverflow.com/questions/69649645/spacy-breaks-serialization-in-pardo-apache-beam

@ankit-ghub

@pjmattingly
