
[SPARK-4531] [MLlib] cache serialized java object #3397

Closed
wants to merge 8 commits

Conversation

davies (Contributor) commented Nov 21, 2014

Pyrolite is pretty slow (compared to the ad-hoc serializer in 1.1), which causes a significant performance regression in 1.2, because we cache the serialized Python objects in the JVM and deserialize them into Java objects at each step.

This PR changes the caching to hold the deserialized JavaRDD instead of the PythonRDD, avoiding the Pyrolite deserialization. It should have similar memory usage as before, but be much faster.
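The effect can be sketched outside Spark with plain pickle standing in for Pyrolite (all names below are illustrative, not from the patch): caching the already-deserialized objects lets every later pass skip the per-element decode that a serialized cache forces.

```python
import pickle

# Hypothetical stand-in for a cached, serialized dataset (Pyrolite analogue).
serialized_rows = [pickle.dumps((i, i * 2.0)) for i in range(1000)]

# Before this patch (sketch): every pass re-deserializes each element.
def pass_over_serialized():
    return sum(v for _, v in (pickle.loads(b) for b in serialized_rows))

# After this patch (sketch): deserialize once, cache the plain objects,
# and run each subsequent pass over the already-deserialized data.
cached_rows = [pickle.loads(b) for b in serialized_rows]

def pass_over_cached():
    return sum(v for _, v in cached_rows)

assert pass_over_serialized() == pass_over_cached()
```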


SparkQA commented Nov 21, 2014

Test build #23708 has started for PR 3397 at commit f1063e1.

  • This patch merges cleanly.


mengxr commented Nov 21, 2014

@davies Could we cache with MEMORY_AND_DISK?

jkbradley (Member) commented:

It might be good to cache for decision tree too since it makes a couple of passes through the original RDD (before it creates the TreePoint RDD).


davies commented Nov 21, 2014

How about we call .cache() at the beginning of the iterations? Right now, we show a warning.


SparkQA commented Nov 21, 2014

Test build #23708 has finished for PR 3397 at commit f1063e1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RandomForestModel(JavaModelWrapper):
    • class RandomForest(object):

AmplabJenkins commented:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23708/


SparkQA commented Nov 21, 2014

Test build #23715 has started for PR 3397 at commit c2bdfc2.

  • This patch merges cleanly.


davies commented Nov 21, 2014

@mengxr @jkbradley I've changed the storage level to MEMORY_AND_DISK_SER and moved the caching into Scala. I also added cache() for decision tree and random forest (they make only three passes; is it needed?)


mengxr commented Nov 21, 2014

@davies Let's use MEMORY_AND_DISK instead for best performance. For decision tree, we still need to cache the input.


davies commented Nov 21, 2014

@mengxr Changed to MEMORY_AND_DISK. But for Rating, it uses MEMORY_AND_DISK_SER.
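The trade-off behind that choice can be sketched in plain Python, with pickle standing in for Spark's serializer (everything below is illustrative): a deserialized cache (MEMORY_AND_DISK) gives cheap reads at higher memory cost, while a serialized cache (MEMORY_AND_DISK_SER) is more compact but pays a decode cost on every read, which can still make sense for small, regular records like Rating.

```python
import pickle

# Illustrative records standing in for Rating(user, product, rating) triples.
ratings = [(u, p, float(u + p)) for u in range(100) for p in range(10)]

# MEMORY_AND_DISK analogue: keep the live objects; reads are free.
cached_deserialized = ratings

# MEMORY_AND_DISK_SER analogue: keep one compact serialized blob;
# every read pays a pickle.loads() call.
cached_serialized = pickle.dumps(ratings)

def read_serialized():
    return pickle.loads(cached_serialized)

# Both caches yield the same data; they differ only in memory vs. CPU cost.
assert read_serialized() == cached_deserialized
```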


SparkQA commented Nov 21, 2014

Test build #23717 has started for PR 3397 at commit dff33e1.

  • This patch merges cleanly.


SparkQA commented Nov 21, 2014

Test build #23715 has finished for PR 3397 at commit c2bdfc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider
    • case class ParquetRelation2(path: String)(@transient val sqlContext: SQLContext)
    • abstract class CatalystScan extends BaseRelation

AmplabJenkins commented:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23715/


SparkQA commented Nov 21, 2014

Test build #23717 has finished for PR 3397 at commit dff33e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

AmplabJenkins commented:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23717/

- // Disable the uncached input warning because 'data' is a deliberately uncached MappedRDD.
- learner.disableUncachedWarning()
- val model = learner.run(data.rdd, initialWeights)
+ val model = learner.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK), initialWeights)
mengxr (Contributor) commented on the diff:
Shall we call unpersist explicitly after training?
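The suggested pattern (persist before training, unpersist explicitly once training is done) can be sketched with a toy cache object in place of a real RDD; everything below is illustrative:

```python
# Toy stand-in for an RDD's persist()/unpersist() lifecycle.
class ToyRDD:
    def __init__(self):
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self  # persist() returns the dataset itself, as in Spark

    def unpersist(self):
        self.persisted = False


def train(rdd):
    # A real trainer would make several passes over the persisted data.
    assert rdd.persisted
    return "model"


data = ToyRDD()
model = train(data.persist())
data.unpersist()  # release the cache explicitly after training

assert model == "model" and not data.persisted
```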


davies commented Nov 21, 2014

@mengxr fixed.


SparkQA commented Nov 21, 2014

Test build #23726 has started for PR 3397 at commit 4b52edd.

  • This patch merges cleanly.

AmplabJenkins commented:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23725/


SparkQA commented Nov 21, 2014

Test build #531 has started for PR 3397 at commit 4b52edd.

  • This patch merges cleanly.


SparkQA commented Nov 21, 2014

Test build #23726 has finished for PR 3397 at commit 4b52edd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

AmplabJenkins commented:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23726/


SparkQA commented Nov 21, 2014

Test build #531 has finished for PR 3397 at commit 4b52edd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Return the Updater from string
*/
def getUpdateFromString(regType: String): Updater = {
Review comment (Member) on the diff:
Update --> Updater
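For illustration only, a Python analogue of such a string-to-updater helper (the real Updater classes live in Spark's Scala code; the names below are made up to show the shape of the lookup):

```python
# Minimal stand-in classes; the real ones are Scala Updater implementations.
class SimpleUpdater: pass
class L1Updater: pass
class SquaredL2Updater: pass

# Hypothetical mapping from a regularization-type name to an updater class.
_UPDATERS = {None: SimpleUpdater, "l1": L1Updater, "l2": SquaredL2Updater}

def get_updater_from_string(reg_type):
    """Return an updater instance for a regularization-type name."""
    try:
        return _UPDATERS[reg_type]()
    except KeyError:
        raise ValueError("Invalid value for 'regType': %r" % (reg_type,))

assert isinstance(get_updater_from_string("l2"), SquaredL2Updater)
```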

jkbradley (Member) commented:

LGTM
@pwendell had questions about whether we should allow the user to specify (in the Python call) whether they want to use caching. CC @mengxr


SparkQA commented Nov 21, 2014

Test build #23729 has started for PR 3397 at commit 7f6e6ce.

  • This patch merges cleanly.


davies commented Nov 21, 2014

@jkbradley I chatted with @pwendell and @mengxr; we agreed that we could add an option for the storage level in the future if users really hit problems.

pwendell (Contributor) commented:

Yep - that sounds good to me.


SparkQA commented Nov 21, 2014

Test build #23729 has finished for PR 3397 at commit 7f6e6ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

AmplabJenkins commented:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23729/


mengxr commented Nov 21, 2014

Merged into master and branch-1.2. Thanks!

andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Nov 21, 2014

Author: Davies Liu <davies@databricks.com>

Closes apache#3397 from davies/cache and squashes the following commits:

7f6e6ce [Davies Liu] Update -> Updater
4b52edd [Davies Liu] using named argument
63b984e [Davies Liu] fix
7da0332 [Davies Liu] add unpersist()
dff33e1 [Davies Liu] address comments
c2bdfc2 [Davies Liu] refactor
d572f00 [Davies Liu] Merge branch 'master' into cache
f1063e1 [Davies Liu] cache serialized java object
davies closed this Nov 22, 2014
asfgit pushed a commit that referenced this pull request Nov 24, 2014

(cherry picked from commit ce95bd8)
Signed-off-by: Xiangrui Meng <meng@databricks.com>