-
Notifications
You must be signed in to change notification settings - Fork 28.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-3491] [MLlib] [PySpark] use pickle to serialize data in MLlib #2378
Conversation
QA tests have started for PR 2378 at commit
|
QA tests have finished for PR 2378 at commit
|
QA tests have started for PR 2378 at commit
|
QA tests have started for PR 2378 at commit
|
QA tests have finished for PR 2378 at commit
|
QA tests have finished for PR 2378 at commit
|
Conflicts: python/pyspark/context.py
QA tests have started for PR 2378 at commit
|
QA tests have finished for PR 2378 at commit
|
QA tests have started for PR 2378 at commit
|
Conflicts: python/pyspark/mllib/_common.py
QA tests have started for PR 2378 at commit
|
@mengxr The new approach is almost ready, please take a quick look. I will do some refactor later. |
QA tests have finished for PR 2378 at commit
|
class DenseMatrix(Matrix): | ||
def __init__(self, nRow, nCol, values): | ||
assert len(values) == nRow * nCol | ||
self.nRow = nRow |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should nRow and nCol not belong to the Matrix class?
@davies This looks like a great PR! I don’t see major issues, though +1 to the remarks about checking for performance regressions. Pending performance testing and my small comments, this looks good to me. |
QA tests have started for PR 2378 at commit
|
@jkbradley I should have addressed all your comments, or leave comments if I have not figure out how to do now, thanks for reviewing this huge PR. |
QA tests have started for PR 2378 at commit
|
QA tests have finished for PR 2378 at commit
|
QA tests have started for PR 2378 at commit
|
QA tests have finished for PR 2378 at commit
|
QA tests have finished for PR 2378 at commit
|
QA tests have started for PR 2378 at commit
|
QA tests have finished for PR 2378 at commit
|
test this please |
Conflicts: mllib/src/main/scala/org/apache/spark/mllib/linalg/Matrices.scala
8e85420
to
dffbba2
Compare
@davies Does |
QA tests have started for PR 2378 at commit
|
@mengxr PickleSerializer do not compress data, there is CompressSerializer can do it using gzip(level 1). Compression can help for small range of double or repeated values, will be worser with random double in large range. BatchedSerializer can help to reduce the overhead of name of class. In JVM, the memory of short lived objects can not be reused without GC, so batched-serialization will not increase the gc pressure if the batch size it not too large. (depend on how gc is configured) |
@mengxr In this PR, I just tried to avoid other changes except serialization, we could change the cache behavior or compression later. It's will be good to have some number of about the performance regression, I only see 5% regression in LogisticRegressionWithSGD.train() with small dataset (locally). (the test was borrowed from staple's PR) |
QA tests have finished for PR 2378 at commit
|
@davies LGTM except few linear algebra operators and caching. But those are orthogonal to this PR. I'm merging this and we will update the linear algebra ops later. |
Merged. Thanks a lot! |
Currently, we serialize the data between JVM and Python case by case manually, this cannot scale to support so many APIs in MLlib.
This patch will try to address this problem by serialize the data using pickle protocol, using Pyrolite library to serialize/deserialize in JVM. Pickle protocol can be easily extended to support customized class.
All the modules are refactored to use this protocol.
Known issues: There will be some performance regression (both CPU and memory, the serialized data increased)