[SPARK-23159][PYTHON] Update cloudpickle to v0.4.3 #20373

BryanCutler · 2018-01-23T23:56:41Z

What changes were proposed in this pull request?

The version of cloudpickle in PySpark was close to version 0.4.0, with some additional backported fixes and some minor Spark-related additions. This update removes the Spark-related changes and matches cloudpickle v0.4.3.

Changes picked up by updating to 0.4.3 include:

Fix pickling of named tuples (cloudpipe/cloudpickle#113)
Built-in type constructors for PyPy compatibility (here)
Fix memoryview support (cloudpipe/cloudpickle#122)
Improved compatibility with other cloudpickle versions (cloudpipe/cloudpickle#128)
Several cleanups (cloudpipe/cloudpickle#121 and here)
Fix regression on pickling classes from the __main__ module (cloudpipe/cloudpickle#149)
Handle instance methods of builtin types (cloudpipe/cloudpickle#154)
Fix #129: do not silence RuntimeError in dump() (#140; cloudpipe/cloudpickle#153)

How was this patch tested?

Existing pyspark.tests using python 2.7.14, 3.5.2, 3.6.3

BryanCutler · 2018-01-24T00:03:26Z

python/pyspark/cloudpickle.py

+    object.__new__: _get_object_new,
+}
+
+


MAINT: Handle builtin type __new__ attrs: cloudpipe/cloudpickle@f0d2011

BryanCutler · 2018-01-24T00:04:31Z

python/pyspark/cloudpickle.py

-                msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
-            print_exec(sys.stderr)
-            raise pickle.PicklingError(msg)
-


This exception handling is Spark-specific, so it has been moved to CloudPickleSerializer.dumps in serializers.py.

I'm glad this is moved, should make the next update easier.
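For readers following along, here is a minimal sketch of what wrapping the error handling at the serializer level can look like. It uses the standard-library pickle as a stand-in, and the class name SafeSerializer is hypothetical; it is not the actual CloudPickleSerializer code, only the same pattern as the removed lines above.

```python
import pickle
import sys
import traceback


class SafeSerializer:
    """Hypothetical wrapper, not PySpark's actual CloudPickleSerializer."""

    def dumps(self, obj):
        try:
            return pickle.dumps(obj)
        except Exception as e:
            # Same message format as the lines removed from cloudpickle.py.
            msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, e)
            # Print the original traceback so the root cause is not lost.
            traceback.print_exc(file=sys.stderr)
            raise pickle.PicklingError(msg)
```

Keeping this in the caller means the vendored cloudpickle.py can stay identical to upstream.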

BryanCutler · 2018-01-24T00:05:30Z

python/pyspark/cloudpickle.py

-        """Fallback to save_string"""
-        Pickler.save_string(self, str(obj))
+        self.save(obj.tobytes())
+    dispatch[memoryview] = save_memoryview


Some cleanups, fix memoryview support cloudpipe/cloudpickle@f8187e9

So without the new line between these my brain doesn't parse this right on the first read. What do you think of adding a new line here + back to cloud pickle (can be a follow up)?

so you mean change to this

    def save_memoryview(self, obj):
        self.save(obj.tobytes())

    dispatch[memoryview] = save_memoryview

That format is done in a few places and I saw a couple other formatting issues.. so yeah I can submit a PR to do those
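As a rough illustration of the memoryview change, the sketch below gets the same effect with the standard-library Pickler and a dispatch_table entry instead of cloudpickle's dispatch dict. The names MemoryviewPickler, _reduce_memoryview and dumps_with_memoryview are made up for this example.

```python
import copyreg
import io
import pickle


def _reduce_memoryview(view):
    # Rebuild as bytes: the buffer contents survive, the memoryview type does not.
    return bytes, (view.tobytes(),)


class MemoryviewPickler(pickle.Pickler):
    # Per-class dispatch table, so the global copyreg table is left untouched.
    dispatch_table = copyreg.dispatch_table.copy()
    dispatch_table[memoryview] = _reduce_memoryview


def dumps_with_memoryview(obj):
    buf = io.BytesIO()
    MemoryviewPickler(buf).dump(obj)
    return buf.getvalue()
```

Note that the value comes back as bytes after unpickling, which matches the tobytes() approach in the diff.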

BryanCutler · 2018-01-24T00:06:31Z

python/pyspark/cloudpickle.py

+            # If the function we've received is in that cache, we just
+            # serialize it as a lookup into the cache.
+            return self.save_reduce(_BUILTIN_TYPE_CONSTRUCTORS[obj], (), obj=obj)
+


BUG: Hit the builtin type cache for any function cloudpipe/cloudpickle@d84980c

BryanCutler · 2018-01-24T00:07:43Z

python/pyspark/cloudpickle.py

+        # In Python 2, we can't set this attribute after construction.
+        __dict__ = clsdict.pop('__dict__', None)
+        if isinstance(__dict__, property):
+            type_kwargs['__dict__'] = __dict__


BUG: Fix bug pickling namedtuple cloudpipe/cloudpickle@28070bb

BryanCutler · 2018-01-24T00:09:53Z

python/pyspark/cloudpickle.py

+        }
+        if hasattr(func, '__qualname__'):
+            state['qualname'] = func.__qualname__
+        save(state)


Preserve func.__qualname__ when defined cloudpipe/cloudpickle@14b38a3

BryanCutler · 2018-01-24T00:11:27Z

python/pyspark/cloudpickle.py

+        except Exception:
+            if obj.__module__ == "__builtin__" or obj.__module__ == "builtins":
+                if obj in _BUILTIN_TYPE_NAMES:
+                    return self.save_reduce(_builtin_type, (_BUILTIN_TYPE_NAMES[obj],), obj=obj)


Some cleanups, fix memoryview support cloudpipe/cloudpickle@f8187e9
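A small sketch of the name-based lookup used in this hunk, in case it helps review. The names _BUILTIN_TYPE_NAMES and _builtin_type mirror the diff, but how the table is built here is an assumption (the diff does not show it), and reduce_builtin_type is a hypothetical helper.

```python
import types

# Reverse lookup: type object -> attribute name in the `types` module.
# Assumption: the table is built by scanning the module for plain types.
_BUILTIN_TYPE_NAMES = {}
for _name, _value in list(vars(types).items()):
    if type(_value) is type:
        _BUILTIN_TYPE_NAMES[_value] = _name


def _builtin_type(name):
    # Resolved at load time, so the receiving interpreter supplies its own type.
    return getattr(types, name)


def reduce_builtin_type(obj):
    # Pickle-style reduce value: (callable, args) that rebuilds the type by name.
    return _builtin_type, (_BUILTIN_TYPE_NAMES[obj],)
```

Several names in types can alias the same object, so the stored name may differ from the one you expect, but the lookup still returns the same type object.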

BryanCutler · 2018-01-24T00:12:13Z

python/pyspark/cloudpickle.py

@@ -709,12 +702,7 @@ def save_property(self, obj):
    dispatch[property] = save_property

    def save_classmethod(self, obj):
-        try:
-            orig_func = obj.__func__
-        except AttributeError:  # Python 2.6


support for Python 2.6 removed

BryanCutler · 2018-01-24T00:13:30Z

python/pyspark/cloudpickle.py

-    if sys.version_info < (2,7):  # 2.7 supports partial pickling
-        dispatch[partial] = save_partial
-
-


Remove save_reduce() override: It is exactly the same code as in Python 2's Pickler class.
cloudpipe/cloudpickle@2da4c24

BryanCutler · 2018-01-24T00:14:36Z

python/pyspark/cloudpickle.py

+    def inject_addons(self):
+        """Plug in system. Register additional pickling functions if modules already loaded"""
+        pass
+


Further cleanups cloudpipe/cloudpickle@c91aaf1

BryanCutler · 2018-01-24T00:15:20Z

python/pyspark/cloudpickle.py

+        cp.dump(obj)
+        return file.getvalue()
+    finally:
+        file.close()


Close StringIO timely on exception cloudpipe/cloudpickle@ca4661b
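The pattern in this hunk, sketched with the standard-library pickle and io.BytesIO as stand-ins for CloudPickler and StringIO; the function name dumps_closing is hypothetical.

```python
import io
import pickle


def dumps_closing(obj, protocol=2):
    file = io.BytesIO()
    try:
        pickle.Pickler(file, protocol).dump(obj)
        # getvalue() must be called before close(), hence inside the try block.
        return file.getvalue()
    finally:
        # Runs on both success and failure, so the buffer is released promptly.
        file.close()
```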

BryanCutler · 2018-01-24T00:16:16Z

python/pyspark/cloudpickle.py

+def _fill_function(*args):
+    """Fills in the rest of function data into the skeleton function object
+
+    The skeleton itself is create by _make_skel_func().


Restore compatibility with functions pickled with 0.4.0 (#128)
cloudpipe/cloudpickle@7d8c670

Yea, I think we need this.

That's more of an issue if using pickles stored on disk or if nodes in the cluster are on different versions. Is that likely for Spark?

I don't think we made a guarantee on it, but it's always best to stay on the safe side. The others seem to be bug fixes or improvements, but this one could be a regression fix (for support we haven't guaranteed). It's part of v0.4.2 anyway.

(I pointed out cloudpipe/cloudpickle#145 too for the same reason. IIUC, this could be a regression)
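To make the compatibility concern concrete, here is a sketch of how a *args entry point can accept both an older positional payload and a newer (func, state) payload. The helper name normalize_fill_args, the field names, and the arities are illustrative assumptions, not taken from this diff.

```python
def normalize_fill_args(*args):
    """Return (func, state) for either payload layout (hypothetical helper)."""
    if len(args) == 2:
        # Newer layout (assumed): the state travels as a single dict.
        func, state = args
    elif len(args) == 5:
        # Older layout (assumed): fields were passed positionally.
        func = args[0]
        keys = ['globals', 'defaults', 'dict', 'closure_values']
        state = dict(zip(keys, args[1:]))
    else:
        raise ValueError('Unexpected arguments: %r' % (args,))
    return func, state
```

Because pickles record the call arguments, changing a reconstructor's signature breaks payloads written by the previous version unless the old shape is still accepted like this.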

BryanCutler · 2018-01-24T00:16:44Z

python/pyspark/cloudpickle.py

+    if 'module' in state:
+        func.__module__ = state['module']
+    if 'qualname' in state:
+        func.__qualname__ = state['qualname']


Preserve func.__qualname__ when defined cloudpipe/cloudpickle@14b38a3
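A short sketch of the save and restore halves of this change side by side. The helper names capture_name_state and restore_name_state are hypothetical; the guarded attribute access follows the two hunks in this PR.

```python
import types


def capture_name_state(func):
    state = {'module': func.__module__}
    # __qualname__ only exists on Python 3, so it is recorded conditionally.
    if hasattr(func, '__qualname__'):
        state['qualname'] = func.__qualname__
    return state


def restore_name_state(func, state):
    # Each key is optional so that payloads without it still load.
    if 'module' in state:
        func.__module__ = state['module']
    if 'qualname' in state:
        func.__qualname__ = state['qualname']
    return func


class Outer:
    def method(self):
        return 1


# Usage: a bare function rebuilt from the code object, then patched from the state.
skeleton = types.FunctionType(Outer.method.__code__, {}, 'method')
restore_name_state(skeleton, capture_name_state(Outer.method))
```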

BryanCutler · 2018-01-24T00:17:26Z

python/pyspark/cloudpickle.py

-    """
-    from collections import namedtuple
-    return namedtuple(name, fields)
-


This didn't seem necessary anymore after the fix for namedtuples

BryanCutler · 2018-01-24T00:26:38Z

@holdenk @HyukjinKwon it seemed like mostly straightforward fixes/cleanups to match cloudpickle 0.4.2 but you two are way more experienced here than me. Are there any concerns over these updates or additional tests to run?

I did test that namedtuple pickling works with the new fix in cloudpickle, but since the standard pickle still fails we still need the hijack workaround in Spark.

SparkQA · 2018-01-24T00:37:29Z

Test build #86553 has finished for PR 20373 at commit c362df8.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

holdenk · 2018-01-24T02:05:00Z

Thank you for going through and documenting the related changes back to the cloudpickle changes :)

HyukjinKwon · 2018-01-24T02:06:22Z

Whoa, nice effort! Will take a close look within a few days.

holdenk

Thanks for doing this! Not a full review but a quick pass before my talk with minor feedback. I'll look through more later.

holdenk · 2018-01-24T02:06:03Z

python/pyspark/cloudpickle.py

-                msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
-            print_exec(sys.stderr)
-            raise pickle.PicklingError(msg)
-


I'm glad this is moved, should make the next update easier.

holdenk · 2018-01-24T02:07:54Z

python/pyspark/cloudpickle.py

-        """Fallback to save_string"""
-        Pickler.save_string(self, str(obj))
+        self.save(obj.tobytes())
+    dispatch[memoryview] = save_memoryview


So without the new line between these my brain doesn't parse this right on the first read. What do you think of adding a new line here + back to cloud pickle (can be a follow up)?

holdenk · 2018-01-24T02:10:25Z

python/pyspark/cloudpickle.py

-                    raise pickle.PicklingError("Can't pickle %r" % obj)
-                else:
-                    rv = obj.__reduce_ex__(self.proto)
+                rv = obj.__reduce_ex__(self.proto)


Looks reasonable since Py 3.4+ only anyways.

holdenk · 2018-01-26T02:26:12Z

So any reason for the WIP tag?

BryanCutler · 2018-01-26T06:14:08Z

I wasn't sure if the named tuple hijack issue from https://issues.apache.org/jira/browse/SPARK-22674 could be fixed here, but it looks like that would require more outside of the scope of this since the problem is with the standard pickling too, right?

holdenk · 2018-01-26T06:46:16Z

Wait, so we left out cloudpickle#113 even though it's in 0.4.2?

holdenk · 2018-01-26T06:52:02Z

hmm sorry nvm. So not for this time, but maybe next time we could also copy the cloudpickle_test file over as well.

HyukjinKwon · 2018-01-26T08:15:09Z

it looks like that would require more outside of the scope of this since the problem is with the standard pickling too, right?

Yup, I think so.

BryanCutler · 2018-01-26T17:57:03Z

Wait, so we left out cloudpickle#113 even though it's in 0.4.2?

That patch is in here and this exactly matches 0.4.2. I also manually verified that cloudpickle will pickle named tuples with and without the hijack in types.py

BryanCutler · 2018-01-26T17:58:44Z

@holdenk and @HyukjinKwon, is there any further testing you can think of that needs to be done to verify this is ok?

HyukjinKwon · 2018-01-29T04:13:40Z

I took a quick look at the commits, and it seems we should backport cloudpipe/cloudpickle#145 too, as the issue looks like it was introduced by cloudpipe/cloudpickle#113. Let me try to backport it to cloudpickle and hear their opinion, unless I've misunderstood.

BryanCutler · 2018-02-07T18:38:29Z

@HyukjinKwon would it be good to update this PR to match the upcoming 0.4.3 release you are working on? If the code is the same, then I would just update the title/description so it is clear.

HyukjinKwon · 2018-02-08T00:27:06Z

Yup, the code on branch "0.4.x" in cloudpickle is now the same as the current PR. I was thinking of letting you know after 0.4.3. Please give me a few days ... :-).

BryanCutler · 2018-02-08T00:55:56Z

Sounds good! No rush, I'll keep an eye out for the release

HyukjinKwon · 2018-02-13T11:45:26Z

@BryanCutler, I just released 0.4.3 - https://github.com/cloudpipe/cloudpickle/releases/tag/v0.4.3. Would you mind updating the PR and JIRA accordingly?

BryanCutler · 2018-02-13T19:00:50Z

Great, thanks @HyukjinKwon! The 0.4.3 code matches this exactly, so I will just adjust the descriptions.

rgbkrk · 2018-02-13T20:02:34Z

Does the hijacking of the namedtuple still cause problems on Python 3.6?

SparkQA · 2018-02-13T20:13:35Z

Test build #87419 has finished for PR 20373 at commit 2d19f0a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-02-14T00:37:57Z

I think it's fine in cloudpickle, but Spark has the hijacking for regular pickling. I was thinking about the possibility of a deduplicated fix, but that might have to be investigated separately.

Let's hold off on this a bit until the release of 2.3.0, as it's going to go into master anyway (I think). The release seems to have been delayed unexpectedly, and we'd better keep the diff between master and branch-2.3 small for now. I will keep my eyes on this PR anyway.

BryanCutler · 2018-02-14T00:38:09Z

Does the hijacking of the namedtuple still cause problems on Python 3.6?

I'm not too familiar with the history of this, but I ran PySpark tests that cover namedtuples with 3.6.3 and all passed.

holdenk · 2018-02-26T22:41:07Z

So it looks like the 2.3 release is probably going to go out, but Jenkins thinks this can't be merged with master. So let's do a jenkins retest this please and I'll try and take some review cycles this week :)

SparkQA · 2018-02-26T23:19:49Z

Test build #87682 has finished for PR 20373 at commit 2d19f0a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-03-04T04:10:59Z

retest this please

SparkQA · 2018-03-04T04:46:48Z

Test build #87939 has finished for PR 20373 at commit 2d19f0a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-03-05T14:55:14Z

@holdenk, have you had a chance to take a look at this one?

BryanCutler · 2018-03-05T18:02:25Z

Let me merge with master since it has been sitting a while

SparkQA · 2018-03-05T20:41:39Z

Test build #87972 has finished for PR 20373 at commit 7d265f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

HyukjinKwon · 2018-03-05T20:45:39Z

LGTM

HyukjinKwon · 2018-03-08T11:20:36Z

Merged to master.

BryanCutler · 2018-03-08T17:38:08Z

Thanks @HyukjinKwon @holdenk and @rgbkrk !

rgbkrk · 2018-03-08T18:23:27Z

Woohoo!

BryanCutler added 2 commits January 23, 2018 15:25

updated cloudpickle to match 0.4.2

89f13b8

removed unused import

c362df8

BryanCutler changed the title ~~[WIP][SPARK-23159][PYTHON] Update cloudpickle to match 0.4.2~~ [SPARK-23159][PYTHON] Update cloudpickle to match 0.4.2 Jan 26, 2018

HyukjinKwon mentioned this pull request Feb 6, 2018

[BRANCH-0.4.x] BUG: Handle instancemethods of builtin types. cloudpipe/cloudpickle#154

Merged

BryanCutler changed the title ~~[SPARK-23159][PYTHON] Update cloudpickle to v0.4.2 plus useful backport fixes~~ [SPARK-23159][PYTHON] Update cloudpickle to v0.4.3 plus useful backport fixes Feb 13, 2018

Merge remote-tracking branch 'upstream/master' into pyspark-update-cl…

2d19f0a

…oudpickle-42-SPARK-23159

BryanCutler changed the title ~~[SPARK-23159][PYTHON] Update cloudpickle to v0.4.3 plus useful backport fixes~~ [SPARK-23159][PYTHON] Update cloudpickle to v0.4.3 Feb 13, 2018

Merge remote-tracking branch 'upstream/master' into pyspark-update-cl…

7d265f5

…oudpickle-42-SPARK-23159

asfgit closed this in 9bb239c Mar 8, 2018

BryanCutler deleted the pyspark-update-cloudpickle-42-SPARK-23159 branch March 8, 2018 17:40

HyukjinKwon mentioned this pull request Jan 15, 2019

[SPARK-18161] [Python] Update cloudpickle to v0.6.1 #20691

Closed
