[SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib #2378

davies · 2014-09-13T07:27:13Z

Currently, we serialize the data between the JVM and Python manually, case by case, which cannot scale to the many APIs in MLlib.

This patch tries to address the problem by serializing the data with the pickle protocol, using the Pyrolite library to serialize and deserialize in the JVM. The pickle protocol can easily be extended to support custom classes.
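To illustrate the extension point mentioned above, here is a minimal Python-only sketch of how a custom class can tell pickle how to rebuild itself through `__reduce__`. The `DenseVector` class and the `roundtrip` helper below are hypothetical stand-ins written for this example; they are not the classes added by this patch, and the JVM side (a matching constructor registered with Pyrolite) is not shown.

```python
import pickle


class DenseVector(object):
    """Hypothetical stand-in for an MLlib vector (not the class in this PR)."""

    def __init__(self, values):
        self.values = [float(v) for v in values]

    def __reduce__(self):
        # Tell pickle to rebuild the object by calling DenseVector(values).
        return (DenseVector, (self.values,))

    def __eq__(self, other):
        return isinstance(other, DenseVector) and self.values == other.values


def roundtrip(obj, protocol=2):
    """Serialize and deserialize obj with the given pickle protocol."""
    return pickle.loads(pickle.dumps(obj, protocol))
```

Calling `roundtrip(DenseVector([1, 2, 3]))` should give back an equal `DenseVector`. Protocol 2 is used here only as an assumption about what both sides would agree on.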

All the modules are refactored to use this protocol.

Known issues: there will be some performance regression in both CPU and memory, because the serialized data is larger.

SparkQA · 2014-09-13T07:34:20Z

QA tests have started for PR 2378 at commit b30ef35.

This patch merges cleanly.

SparkQA · 2014-09-13T08:41:27Z

QA tests have finished for PR 2378 at commit b30ef35.

This patch passes unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class JavaSparkContext(val sc: SparkContext)
- class Rating(object):
- class JavaStreamingContext(val ssc: StreamingContext) extends Closeable

SparkQA · 2014-09-13T08:59:12Z

QA tests have started for PR 2378 at commit f1544c4.

This patch merges cleanly.

SparkQA · 2014-09-13T09:14:22Z

QA tests have started for PR 2378 at commit aa2287e.

This patch merges cleanly.

SparkQA · 2014-09-13T09:56:22Z

QA tests have finished for PR 2378 at commit f1544c4.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Vector(object):
- class DenseVector(Vector):
- class SparseVector(Vector):
- class Matrix(object):
- class DenseMatrix(Matrix):
- class Rating(object):

SparkQA · 2014-09-13T10:15:46Z

QA tests have finished for PR 2378 at commit aa2287e.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class JavaSparkContext(val sc: SparkContext)
- class TaskCompletionListenerException(errorMessages: Seq[String]) extends Exception
- class Dummy(object):
- class Vector(object):
- class DenseVector(Vector):
- class SparseVector(Vector):
- class Matrix(object):
- class DenseMatrix(Matrix):
- class Rating(object):
- class JavaStreamingContext(val ssc: StreamingContext) extends Closeable

Conflicts: python/pyspark/context.py

SparkQA · 2014-09-13T15:39:13Z

QA tests have started for PR 2378 at commit 8fe166a.

This patch merges cleanly.

SparkQA · 2014-09-13T16:31:23Z

QA tests have finished for PR 2378 at commit 8fe166a.

This patch fails unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class ArrayConstructor extends net.razorvine.pickle.objects.ArrayConstructor
- class Vector(object):
- class DenseVector(Vector):
- class SparseVector(Vector):
- class Matrix(object):
- class DenseMatrix(Matrix):
- class Rating(object):

SparkQA · 2014-09-14T06:54:12Z

QA tests have started for PR 2378 at commit 4d7963e.

This patch does not merge cleanly!

Conflicts: python/pyspark/mllib/_common.py

SparkQA · 2014-09-14T06:59:13Z

QA tests have started for PR 2378 at commit b02e34f.

This patch merges cleanly.

davies · 2014-09-14T07:00:27Z

@mengxr The new approach is almost ready; please take a quick look. I will do some refactoring later.

SparkQA · 2014-09-14T07:38:02Z

QA tests have finished for PR 2378 at commit b02e34f.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

jkbradley · 2014-09-18T00:01:21Z

python/pyspark/mllib/linalg.py

+class DenseMatrix(Matrix):
+    def __init__(self, nRow, nCol, values):
+        assert len(values) == nRow * nCol
+        self.nRow = nRow


Should nRow and nCol not belong to the Matrix class?
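A minimal sketch of what this suggestion could look like, assuming the shape is stored once in the base class and the subclass keeps only its own storage. This is an illustration of the question above, not the code that was eventually merged.

```python
class Matrix(object):
    """Base class that owns the shape shared by all matrix types."""

    def __init__(self, nRow, nCol):
        self.nRow = nRow
        self.nCol = nCol


class DenseMatrix(Matrix):
    """Dense matrix that stores its entries as a flat list of values."""

    def __init__(self, nRow, nCol, values):
        Matrix.__init__(self, nRow, nCol)
        values = list(values)
        # Same check as in the diff: the flat list must fill the matrix.
        assert len(values) == nRow * nCol
        self.values = values
```

With this layout, any other subclass of `Matrix` would get `nRow` and `nCol` without repeating the assignments.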

jkbradley · 2014-09-18T02:15:57Z

@davies This looks like a great PR! I don’t see major issues, though +1 to the remarks about checking for performance regressions. Pending performance testing and my small comments, this looks good to me.

SparkQA · 2014-09-18T21:14:26Z

QA tests have started for PR 2378 at commit bd738ab.

This patch merges cleanly.

davies · 2014-09-18T22:03:34Z

@jkbradley I should have addressed all your comments, or left a reply where I have not yet figured out how to address one. Thanks for reviewing this huge PR.

SparkQA · 2014-09-18T22:04:22Z

QA tests have started for PR 2378 at commit 032cd62.

This patch merges cleanly.

SparkQA · 2014-09-18T22:04:47Z

QA tests have finished for PR 2378 at commit bd738ab.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-18T22:22:51Z

QA tests have started for PR 2378 at commit 032cd62.

This patch merges cleanly.

SparkQA · 2014-09-18T22:54:04Z

QA tests have finished for PR 2378 at commit 032cd62.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-18T23:13:09Z

QA tests have finished for PR 2378 at commit 032cd62.

This patch fails unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2014-09-18T23:29:21Z

QA tests have started for PR 2378 at commit 810f97f.

This patch merges cleanly.

SparkQA · 2014-09-19T00:36:06Z

QA tests have finished for PR 2378 at commit 810f97f.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-09-19T05:24:16Z

test this please

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala

mengxr · 2014-09-19T07:56:47Z

@davies Does PickleSerializer compress data? If not, maybe we should cache the deserialized RDD instead of the one from _.reserialize. They have the same storage. I understand that batch-serialization can help GC. But algorithms like linear methods should only allocate short-lived objects. Is batch-serialization worth the tradeoff?

SparkQA · 2014-09-19T17:20:04Z

QA tests have started for PR 2378 at commit dffbba2.

This patch merges cleanly.

davies · 2014-09-19T17:43:22Z

@mengxr PickleSerializer does not compress data; there is a CompressSerializer that can do it using gzip (level 1). Compression can help when the doubles fall in a small range or repeat, but will do worse with random doubles over a large range.
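The effect described above can be checked with a small standalone sketch. It assumes `zlib` at level 1 as a stand-in for the compressing serializer and plain Python lists of floats as the data; the helper name `compression_ratio` and the three sample datasets are made up for this example.

```python
import pickle
import random
import zlib


def compression_ratio(values, level=1):
    """Return compressed size divided by pickled size for a list of doubles."""
    raw = pickle.dumps(list(values), 2)
    return len(zlib.compress(raw, level)) / float(len(raw))


rng = random.Random(42)  # fixed seed so the comparison is repeatable

repeated = [1.0] * 10000                                    # one repeated value
narrow = [float(rng.randint(0, 9)) for _ in range(10000)]   # small range
wide = [rng.uniform(-1e9, 1e9) for _ in range(10000)]       # large range
```

Comparing `compression_ratio(repeated)`, `compression_ratio(narrow)` and `compression_ratio(wide)` should show the ratio growing toward 1 as the values become more random, which is when compression costs CPU without saving much space.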

BatchedSerializer can help reduce the overhead of the class name. In the JVM, the memory of short-lived objects cannot be reused without GC, so batched serialization will not increase GC pressure if the batch size is not too large (depending on how GC is configured).
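A rough way to see the class-name overhead is to compare pickling objects one at a time with pickling them in batches. The sketch below uses `array.array` from the standard library as the example type and two hypothetical helpers, `individual_size` and `batched_size`; it does not use PySpark's serializers, so treat it as an illustration of the idea only.

```python
import pickle
from array import array


def individual_size(items, protocol=2):
    """Total bytes when every item is pickled on its own."""
    return sum(len(pickle.dumps(x, protocol)) for x in items)


def batched_size(items, batch_size, protocol=2):
    """Total bytes when items are pickled as lists of batch_size items."""
    total = 0
    for i in range(0, len(items), batch_size):
        total += len(pickle.dumps(items[i:i + batch_size], protocol))
    return total


# 1000 small arrays of doubles, standing in for feature vectors.
points = [array('d', [float(i), float(i + 1)]) for i in range(1000)]
```

Within one pickle stream the reference to the class is written once and then reused, so `batched_size(points, 100)` should come out smaller than `individual_size(points)`.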

davies · 2014-09-19T17:51:18Z

@mengxr In this PR I tried to avoid any changes other than serialization; we could change the caching behavior or compression later.

It would be good to have some numbers on the performance regression. I only see a 5% regression in LogisticRegressionWithSGD.train() with a small dataset (locally). (The test was borrowed from staple's PR.)

SparkQA · 2014-09-19T18:35:12Z

QA tests have finished for PR 2378 at commit dffbba2.

This patch passes unit tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2014-09-19T21:59:41Z

@davies LGTM except for a few linear algebra operators and caching, but those are orthogonal to this PR. I'm merging this and we will update the linear algebra ops later.

mengxr · 2014-09-19T22:06:56Z

Merged. Thanks a lot!

davies added 5 commits September 11, 2014 16:43

support unpickle array.array for Python 2.6

60e4e2f

cleanup debugging code

c77c87b

Merge branch 'master' into pickle

3908f5c

enable tests about array

f44f771

use pickle to serialize data for mllib/recommendation

b30ef35

use new protocol in mllib/stat

52d1350

refactor clustering

f1544c4

random

aa2287e

Merge branch 'pickle' into pickle_mllib

8fe166a

Conflicts: python/pyspark/context.py

JoshRosen mentioned this pull request Sep 13, 2014

[SPARK-2951] [PySpark] support unpickle array.array for Python 2.6 #2365

Closed

davies added 6 commits September 13, 2014 21:52

mllib/tree

cccb8b1

mllib/util

d9f691f

mllib/regression

f2a0856

classification

c383544

fix tests

6d26b03

remove muanlly serialization

4d7963e

davies added 2 commits September 13, 2014 23:54

Merge branch 'master' into pickle_mllib

84c721d

Conflicts: python/pyspark/mllib/_common.py

remove _common.py

b02e34f

jkbradley reviewed Sep 18, 2014

address comments

bd738ab

add more type check and conversion for user_product

032cd62

fix equal of matrix

810f97f

Merge branch 'master' of github.com:apache/spark into pickle_mllib

dffbba2

Conflicts: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala

davies force-pushed the pickle_mllib branch from 8e85420 to dffbba2 Compare September 19, 2014 06:38

asfgit closed this in fce5e25 Sep 19, 2014
