[SPARK-4531] [MLlib] cache serialized java object #3397
Conversation
Test build #23708 has started for PR 3397 at commit
@davies Could we cache with MEMORY_AND_DISK?
It might be good to cache for the decision tree too, since it makes a couple of passes over the original RDD (before it creates the TreePoint RDD).
How about we call .cache() at the beginning of the iterations? Right now, we show a warning.
Test build #23708 has finished for PR 3397 at commit
Test FAILed.
Test build #23715 has started for PR 3397 at commit
@mengxr @jkbradley I have changed the storage level to MEMORY_AND_DISK_SER and moved them into Scala. Also added cache() for decision tree and random forest (there are only three passes in them; is it needed?).
@davies Let's use MEMORY_AND_DISK instead for best performance. For decision tree, we still need to cache the input.
@mengxr Changed to MEMORY_AND_DISK. But for Rating, it uses MEMORY_AND_DISK_SER.
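The storage-level tradeoff being discussed can be illustrated in plain Python (a toy sketch, not Spark; `pickle` stands in for the JVM serializer): a serialized cache is one compact byte blob, but every read pays a decode, while a deserialized cache keeps live objects readable for free.

```python
import pickle

# Toy illustration (plain Python, not Spark) of the storage-level tradeoff:
# MEMORY_AND_DISK keeps live objects (fast reads, larger footprint), while
# MEMORY_AND_DISK_SER keeps serialized bytes (smaller, but each read decodes).
ratings = [(u, i, 1.0) for u in range(100) for i in range(10)]

live_cache = ratings               # "MEMORY_AND_DISK"-style: ready to use
ser_cache = pickle.dumps(ratings)  # "MEMORY_AND_DISK_SER"-style: compact bytes

# A pass over the serialized cache must decode it first.
decoded = pickle.loads(ser_cache)
assert decoded == live_cache
```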
Test build #23717 has started for PR 3397 at commit
Test build #23715 has finished for PR 3397 at commit
Test PASSed.
Test build #23717 has finished for PR 3397 at commit
Test PASSed.
```diff
-    // Disable the uncached input warning because 'data' is a deliberately uncached MappedRDD.
-    learner.disableUncachedWarning()
-    val model = learner.run(data.rdd, initialWeights)
+    val model = learner.run(data.rdd.persist(StorageLevel.MEMORY_AND_DISK), initialWeights)
```
Shall we call unpersist explicitly after training?
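The persist-train-unpersist pattern being discussed can be sketched in plain Python (a toy stand-in, not Spark; the ToyCache class and its methods are illustrative, mimicking RDD.persist()/unpersist()):

```python
# Toy sketch (plain Python, not Spark): persist the training input before the
# iterative passes, then release it once training is done.
class ToyCache:
    """Stands in for an RDD with persist()/unpersist(); names are illustrative."""
    def __init__(self, data):
        self.data = data
        self.persisted = False

    def persist(self):
        self.persisted = True
        return self

    def unpersist(self):
        self.persisted = False
        return self

def run_training(rdd, iterations=3):
    # Multiple passes over the same (cached) input, as iterative learners do.
    total = 0.0
    for _ in range(iterations):
        total += sum(rdd.data)
    return total

data = ToyCache([1.0, 2.0, 3.0]).persist()
try:
    model = run_training(data)
finally:
    data.unpersist()  # release the cache explicitly after training
```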
@mengxr fixed.
Test build #23726 has started for PR 3397 at commit
Test FAILed.
Test build #531 has started for PR 3397 at commit
Test build #23726 has finished for PR 3397 at commit
Test PASSed.
Test build #531 has finished for PR 3397 at commit
```scala
/**
 * Return the Updater from string
 */
def getUpdateFromString(regType: String): Updater = {
```
Update --> Updater
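For illustration, here is a minimal Python sketch of such a string-to-updater dispatch. The class names echo Spark's Updater hierarchy, but the dispatch table and error message are assumptions for this sketch, not the PR's exact code:

```python
# Illustrative sketch of mapping a regularization-type string to an updater.
# SimpleUpdater / L1Updater / SquaredL2Updater mirror MLlib's class names;
# the dispatch itself is a plain-Python assumption, not Spark code.
class SimpleUpdater: ...
class L1Updater: ...
class SquaredL2Updater: ...

_UPDATERS = {
    "none": SimpleUpdater,
    "l1": L1Updater,
    "l2": SquaredL2Updater,
}

def get_updater_from_string(reg_type):
    try:
        return _UPDATERS[reg_type]()
    except KeyError:
        raise ValueError("Invalid value for 'regType' parameter: %r" % reg_type)
```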
Test build #23729 has started for PR 3397 at commit
@jkbradley Had chatted with @pwendell and @mengxr; we agreed that we could add an option for the storage level in the future if users really hit problems.
Yep - that sounds good to me.
Test build #23729 has finished for PR 3397 at commit
Test PASSed.
Merged into master and branch-1.2. Thanks!
Pyrolite is pretty slow (compared to the ad-hoc serializer in 1.1); it causes a large performance regression in 1.2, because we cache the serialized Python objects in the JVM and deserialize them into Java objects at each step. This PR changes the code to cache the deserialized JavaRDD instead of the PythonRDD, avoiding the Pyrolite deserialization. It should have similar memory usage as before, but be much faster.

Author: Davies Liu <davies@databricks.com>

Closes apache#3397 from davies/cache and squashes the following commits:
7f6e6ce [Davies Liu] Update -> Updater
4b52edd [Davies Liu] using named argument
63b984e [Davies Liu] fix
7da0332 [Davies Liu] add unpersist()
dff33e1 [Davies Liu] address comments
c2bdfc2 [Davies Liu] refactor
d572f00 [Davies Liu] Merge branch 'master' into cache
f1063e1 [Davies Liu] cache serialized java object
(cherry picked from commit ce95bd8) Signed-off-by: Xiangrui Meng <meng@databricks.com>
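The cost model behind this change can be mimicked in plain Python, with pickle standing in for Pyrolite and lists standing in for RDDs (a toy sketch, not Spark code): caching the serialized form pays a full deserialization on every pass, while caching the deserialized form pays it only once.

```python
import pickle

# Toy model (plain Python; pickle stands in for Pyrolite, lists for RDDs).
# Caching the serialized blob forces a full deserialization on every pass;
# caching the deserialized objects, as this PR does, pays that cost once.
records = [(i, float(i)) for i in range(10_000)]
blob = pickle.dumps(records)          # "cached serialized objects" (old behavior)

def sum_with_serialized_cache(passes):
    # Old behavior: deserialize the whole dataset on each pass.
    return [sum(v for _, v in pickle.loads(blob)) for _ in range(passes)]

cached = pickle.loads(blob)           # new behavior: deserialize exactly once
def sum_with_deserialized_cache(passes):
    return [sum(v for _, v in cached) for _ in range(passes)]

# Both strategies compute the same result; only the per-pass cost differs.
assert sum_with_serialized_cache(3) == sum_with_deserialized_cache(3)
```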