
[SPARK-4531] [MLlib] cache serialized java object #3397

Closed
wants to merge 8 commits

Conversation

davies (Contributor) commented Nov 21, 2014

Pyrolite is pretty slow (compared to the ad-hoc serializer in 1.1), which causes a significant performance regression in 1.2, because we cache the serialized Python objects in the JVM and deserialize them into Java objects at each step.

This PR changes the caching to hold the deserialized JavaRDD instead of the PythonRDD, avoiding the Pyrolite deserialization. It should have similar memory usage as before, but be much faster.
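The effect can be sketched outside Spark with plain pickle standing in for Pyrolite (all names below are illustrative, not from the patch): caching the already-deserialized objects lets every later pass skip the per-element decode that a serialized cache forces.

```python
import pickle

# Hypothetical stand-in for a cached, serialized dataset (Pyrolite analogue).
serialized_rows = [pickle.dumps((i, i * 2.0)) for i in range(1000)]

# Before this patch (sketch): every pass re-deserializes each element.
def pass_over_serialized():
    return sum(v for _, v in (pickle.loads(b) for b in serialized_rows))

# After this patch (sketch): deserialize once, cache the plain objects,
# and run each subsequent pass over the already-deserialized data.
cached_rows = [pickle.loads(b) for b in serialized_rows]

def pass_over_cached():
    return sum(v for _, v in cached_rows)

assert pass_over_serialized() == pass_over_cached()
```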


SparkQA commented Nov 21, 2014

Test build #23708 has started for PR 3397 at commit f1063e1.

  • This patch merges cleanly.


mengxr commented Nov 21, 2014

@davies Could we cache with MEMORY_AND_DISK?

jkbradley (Member) commented:

It might be good to cache for decision tree too since it makes a couple of passes through the original RDD (before it creates the TreePoint RDD).


davies commented Nov 21, 2014

How about we call .cache() at the beginning of the iterations? Right now, we show a warning.


SparkQA commented Nov 21, 2014

Test build #23708 has finished for PR 3397 at commit f1063e1.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class RandomForestModel(JavaModelWrapper):
    • class RandomForest(object):

AmplabJenkins commented:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23708/


SparkQA commented Nov 21, 2014

Test build #23715 has started for PR 3397 at commit c2bdfc2.

  • This patch merges cleanly.


davies commented Nov 21, 2014

@mengxr @jkbradley I've changed the storage level to MEMORY_AND_DISK_SER and moved the caching into Scala. I also added cache() for decision tree and random forest (they make only three passes; is it needed?)


mengxr commented Nov 21, 2014

@davies Let's use MEMORY_AND_DISK instead for best performance. For decision tree, we still need to cache the input.


davies commented Nov 21, 2014

@mengxr Changed to MEMORY_AND_DISK. But for Rating, it uses MEMORY_AND_DISK_SER.
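The trade-off behind that choice can be sketched in plain Python, with pickle standing in for Spark's serializer (everything below is illustrative): a deserialized cache (MEMORY_AND_DISK) gives cheap reads at higher memory cost, while a serialized cache (MEMORY_AND_DISK_SER) is more compact but pays a decode cost on every read, which can still make sense for small, regular records like Rating.

```python
import pickle

# Illustrative records standing in for Rating(user, product, rating) triples.
ratings = [(u, p, float(u + p)) for u in range(100) for p in range(10)]

# MEMORY_AND_DISK analogue: keep the live objects; reads are free.
cached_deserialized = ratings

# MEMORY_AND_DISK_SER analogue: keep one compact serialized blob;
# every read pays a pickle.loads() call.
cached_serialized = pickle.dumps(ratings)

def read_serialized():
    return pickle.loads(cached_serialized)

# Both caches yield the same data; they differ only in memory vs. CPU cost.
assert read_serialized() == cached_deserialized
```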


SparkQA commented Nov 21, 2014

Test build #23717 has started for PR 3397 at commit dff33e1.

  • This patch merges cleanly.


SparkQA commented Nov 21, 2014

Test build #23715 has finished for PR 3397 at commit c2bdfc2.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class DefaultSource extends RelationProvider
    • case class ParquetRelation2(path: String)(@transient val sqlContext: SQLContext)
    • abstract class CatalystScan extends BaseRelation

AmplabJenkins commented:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23715/


SparkQA commented Nov 21, 2014

Test build #23717 has finished for PR 3397 at commit dff33e1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

AmplabJenkins commented:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23717/

- // Disable the uncached input warning because 'data' is a deliberately uncached MappedRDD.
- learner.disableUncachedWarning()
- val model = learner.run(data.rdd, initialWeights)
+ val model = learner.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK), initialWeights)
mengxr (Contributor) commented on the diff:
Shall we call unpersist explicitly after training?
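The suggested pattern (persist before training, unpersist explicitly once training is done) can be sketched with a toy cache object in place of a real RDD; everything below is illustrative:

```python
# Toy stand-in for an RDD's persist()/unpersist() lifecycle.
class ToyRDD:
    def __init__(self):
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self  # persist() returns the dataset itself, as in Spark

    def unpersist(self):
        self.persisted = False


def train(rdd):
    # A real trainer would make several passes over the persisted data.
    assert rdd.persisted
    return "model"


data = ToyRDD()
model = train(data.persist())
data.unpersist()  # release the cache explicitly after training

assert model == "model" and not data.persisted
```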


davies commented Nov 21, 2014

@mengxr fixed.


SparkQA commented Nov 21, 2014

Test build #23726 has started for PR 3397 at commit 4b52edd.

  • This patch merges cleanly.

AmplabJenkins commented:
Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23725/


SparkQA commented Nov 21, 2014

Test build #531 has started for PR 3397 at commit 4b52edd.

  • This patch merges cleanly.


SparkQA commented Nov 21, 2014

Test build #23726 has finished for PR 3397 at commit 4b52edd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

AmplabJenkins commented:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23726/


SparkQA commented Nov 21, 2014

Test build #531 has finished for PR 3397 at commit 4b52edd.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

/**
* Return the Updater from string
*/
def getUpdateFromString(regType: String): Updater = {
Review comment (Member) on the diff:
Update --> Updater
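For illustration only, a Python analogue of such a string-to-updater helper (the real Updater classes live in Spark's Scala code; the names below are made up to show the shape of the lookup):

```python
# Minimal stand-in classes; the real ones are Scala Updater implementations.
class SimpleUpdater: pass
class L1Updater: pass
class SquaredL2Updater: pass

# Hypothetical mapping from a regularization-type name to an updater class.
_UPDATERS = {None: SimpleUpdater, "l1": L1Updater, "l2": SquaredL2Updater}

def get_updater_from_string(reg_type):
    """Return an updater instance for a regularization-type name."""
    try:
        return _UPDATERS[reg_type]()
    except KeyError:
        raise ValueError("Invalid value for 'regType': %r" % (reg_type,))

assert isinstance(get_updater_from_string("l2"), SquaredL2Updater)
```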

jkbradley (Member) commented:

LGTM
@pwendell had questions about whether we should allow the user to specify (in the Python call) whether they want to use caching. CC @mengxr


SparkQA commented Nov 21, 2014

Test build #23729 has started for PR 3397 at commit 7f6e6ce.

  • This patch merges cleanly.


davies commented Nov 21, 2014

@jkbradley I chatted with @pwendell and @mengxr; we agreed that we could add an option for the storage level in the future if users really hit problems.

pwendell (Contributor) commented:

Yep - that sounds good to me.


SparkQA commented Nov 21, 2014

Test build #23729 has finished for PR 3397 at commit 7f6e6ce.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

AmplabJenkins commented:
Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23729/


mengxr commented Nov 21, 2014

Merged into master and branch-1.2. Thanks!

andrewor14 pushed a commit to andrewor14/spark that referenced this pull request Nov 21, 2014

Author: Davies Liu <davies@databricks.com>

Closes apache#3397 from davies/cache and squashes the following commits:

7f6e6ce [Davies Liu] Update -> Updater
4b52edd [Davies Liu] using named argument
63b984e [Davies Liu] fix
7da0332 [Davies Liu] add unpersist()
dff33e1 [Davies Liu] address comments
c2bdfc2 [Davies Liu] refactor
d572f00 [Davies Liu] Merge branch 'master' into cache
f1063e1 [Davies Liu] cache serialized java object
davies closed this Nov 22, 2014
asfgit pushed a commit that referenced this pull request Nov 24, 2014

(cherry picked from commit ce95bd8)
Signed-off-by: Xiangrui Meng <meng@databricks.com>