Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib #2378

Closed
wants to merge 41 commits into from

Conversation

davies
Copy link
Contributor

@davies davies commented Sep 13, 2014

Currently, we serialize the data between JVM and Python case by case manually, this cannot scale to support so many APIs in MLlib.

This patch will try to address this problem by serialize the data using pickle protocol, using Pyrolite library to serialize/deserialize in JVM. Pickle protocol can be easily extended to support customized class.

All the modules are refactored to use this protocol.

Known issues: There will be some performance regression (both CPU and memory, the serialized data increased)

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have started for PR 2378 at commit b30ef35.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2378 at commit b30ef35.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaSparkContext(val sc: SparkContext)
    • class Rating(object):
    • class JavaStreamingContext(val ssc: StreamingContext) extends Closeable

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have started for PR 2378 at commit f1544c4.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have started for PR 2378 at commit aa2287e.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2378 at commit f1544c4.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Vector(object):
    • class DenseVector(Vector):
    • class SparseVector(Vector):
    • class Matrix(object):
    • class DenseMatrix(Matrix):
    • class Rating(object):

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2378 at commit aa2287e.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class JavaSparkContext(val sc: SparkContext)
    • class TaskCompletionListenerException(errorMessages: Seq[String]) extends Exception
    • class Dummy(object):
    • class Vector(object):
    • class DenseVector(Vector):
    • class SparseVector(Vector):
    • class Matrix(object):
    • class DenseMatrix(Matrix):
    • class Rating(object):
    • class JavaStreamingContext(val ssc: StreamingContext) extends Closeable

Conflicts:
	python/pyspark/context.py
@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have started for PR 2378 at commit 8fe166a.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 13, 2014

QA tests have finished for PR 2378 at commit 8fe166a.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class ArrayConstructor extends net.razorvine.pickle.objects.ArrayConstructor
    • class Vector(object):
    • class DenseVector(Vector):
    • class SparseVector(Vector):
    • class Matrix(object):
    • class DenseMatrix(Matrix):
    • class Rating(object):

@SparkQA
Copy link

SparkQA commented Sep 14, 2014

QA tests have started for PR 2378 at commit 4d7963e.

  • This patch does not merge cleanly!

@SparkQA
Copy link

SparkQA commented Sep 14, 2014

QA tests have started for PR 2378 at commit b02e34f.

  • This patch merges cleanly.

@davies
Copy link
Contributor Author

davies commented Sep 14, 2014

@mengxr The new approach is almost ready, please take a quick look. I will do some refactor later.

@SparkQA
Copy link

SparkQA commented Sep 14, 2014

QA tests have finished for PR 2378 at commit b02e34f.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

class DenseMatrix(Matrix):
def __init__(self, nRow, nCol, values):
assert len(values) == nRow * nCol
self.nRow = nRow
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should nRow and nCol not belong to the Matrix class?

@jkbradley
Copy link
Member

@davies This looks like a great PR! I don’t see major issues, though +1 to the remarks about checking for performance regressions. Pending performance testing and my small comments, this looks good to me.

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have started for PR 2378 at commit bd738ab.

  • This patch merges cleanly.

@davies
Copy link
Contributor Author

davies commented Sep 18, 2014

@jkbradley I should have addressed all your comments, or leave comments if I have not figure out how to do now, thanks for reviewing this huge PR.

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have started for PR 2378 at commit 032cd62.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have finished for PR 2378 at commit bd738ab.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have started for PR 2378 at commit 032cd62.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have finished for PR 2378 at commit 032cd62.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have finished for PR 2378 at commit 032cd62.

  • This patch fails unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Sep 18, 2014

QA tests have started for PR 2378 at commit 810f97f.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Sep 19, 2014

QA tests have finished for PR 2378 at commit 810f97f.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Sep 19, 2014

test this please

Conflicts:
	mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
@mengxr
Copy link
Contributor

mengxr commented Sep 19, 2014

@davies Does PickleSerializer compress data? If not, maybe we should cache the deserialized RDD instead of the one from _.reserialize. They have the same storage. I understand that batch-serialization can help GC. But algorithms like linear methods should only allocate short-lived objects. Is batch-serialization worth the tradeoff?

@SparkQA
Copy link

SparkQA commented Sep 19, 2014

QA tests have started for PR 2378 at commit dffbba2.

  • This patch merges cleanly.

@davies
Copy link
Contributor Author

davies commented Sep 19, 2014

@mengxr PickleSerializer do not compress data, there is CompressSerializer can do it using gzip(level 1). Compression can help for small range of double or repeated values, will be worser with random double in large range.

BatchedSerializer can help to reduce the overhead of name of class. In JVM, the memory of short lived objects can not be reused without GC, so batched-serialization will not increase the gc pressure if the batch size it not too large. (depend on how gc is configured)

@davies
Copy link
Contributor Author

davies commented Sep 19, 2014

@mengxr In this PR, I just tried to avoid other changes except serialization, we could change the cache behavior or compression later.

It's will be good to have some number of about the performance regression, I only see 5% regression in LogisticRegressionWithSGD.train() with small dataset (locally). (the test was borrowed from staple's PR)

@SparkQA
Copy link

SparkQA commented Sep 19, 2014

QA tests have finished for PR 2378 at commit dffbba2.

  • This patch passes unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Sep 19, 2014

@davies LGTM except few linear algebra operators and caching. But those are orthogonal to this PR. I'm merging this and we will update the linear algebra ops later.

@asfgit asfgit closed this in fce5e25 Sep 19, 2014
@mengxr
Copy link
Contributor

mengxr commented Sep 19, 2014

Merged. Thanks a lot!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants