
[SPARK-1212, Part II] Support sparse data in MLlib #245

Closed · wants to merge 30 commits

Conversation

@mengxr (Contributor) commented Mar 27, 2014

In PR #117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other Array[Double] usage by Vector in generalized linear models (GLMs) and Naive Bayes. Major changes:

  1. LabeledPoint becomes LabeledPoint(Double, Vector).
  2. Methods that accept RDD[Array[Double]] now accept RDD[Vector]. We cannot support both in an elegant way because of type erasure.
  3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
  4. Add libSVMFile to MLContext.
  5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's NaiveBayesModel).
  6. Gradient computation no longer creates temp vectors.
  7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.

TODO:

  1. Use axpy when possible.
  2. Optimize Naive Bayes.
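
The data model this PR builds on can be illustrated with a minimal plain-Scala sketch (these are simplified stand-ins, not the actual MLlib classes): a `Vector` trait with dense and sparse implementations, and `LabeledPoint(Double, Vector)` pairing a label with its features.

```scala
// Simplified sketch of the dense/sparse vector data model (hypothetical
// stand-ins for the MLlib classes this PR uses).
sealed trait Vector {
  def size: Int
  def apply(i: Int): Double
}

case class DenseVector(values: Array[Double]) extends Vector {
  def size: Int = values.length
  def apply(i: Int): Double = values(i)
}

// Sparse storage: only indices with nonzero values are kept.
case class SparseVector(size: Int, indices: Array[Int], values: Array[Double])
    extends Vector {
  def apply(i: Int): Double = {
    val j = indices.indexOf(i)
    if (j >= 0) values(j) else 0.0
  }
}

case class LabeledPoint(label: Double, features: Vector)

// The same point [1.0, 0.0, 3.0] in dense and sparse form:
val dense  = LabeledPoint(1.0, DenseVector(Array(1.0, 0.0, 3.0)))
val sparse = LabeledPoint(1.0, SparseVector(3, Array(0, 2), Array(1.0, 3.0)))
```

Because `RDD[Array[Double]]` and `RDD[Vector]` erase to the same runtime type, methods cannot overload on both, which is why change 2 above replaces the element type outright.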

@AmplabJenkins: Merged build triggered.
@AmplabJenkins: Merged build started.
@AmplabJenkins: Merged build finished.
@AmplabJenkins: One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13487/

@AmplabJenkins: Build triggered.
@AmplabJenkins: Build started.
@AmplabJenkins: Build finished.
@AmplabJenkins: One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13494/

def compute(data: Vector, label: Double, weights: Vector): (Vector, Double)

/**
* Compute the gradient and loss given the features of a single data point, add the gradient to a provided vector to
Review comment (Contributor): This line exceeds the 100-character limit.

val margin: Double = -1.0 * brzWeights.dot(brzData)
val gradientMultiplier = (1.0 / (1.0 + math.exp(margin))) - label

brzAxpy(gradientMultiplier, brzData, gradientAddTo.toBreeze)
Review comment (Contributor): I think there are too many `toBreeze` calls when using the `Vector` trait. How about using an implicit conversion to eliminate them?

Reply (@mengxr, Author): Breeze uses implicits a lot, and Scala does not look for second-degree (chained) implicit conversions.
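
The hunk above computes the logistic gradient and folds it into an accumulator in place, rather than allocating a temporary vector per point (change 6 in the description). A plain-Scala sketch of the same idea, using dense arrays and hypothetical helper names instead of Breeze:

```scala
// Plain-Scala sketch of the in-place logistic gradient update quoted above
// (dense arrays and hypothetical names stand in for the Breeze calls).
def dot(x: Array[Double], y: Array[Double]): Double = {
  var s = 0.0
  var i = 0
  while (i < x.length) { s += x(i) * y(i); i += 1 }
  s
}

// y += a * x, performed in place: the BLAS "axpy" primitive.
def axpy(a: Double, x: Array[Double], y: Array[Double]): Unit = {
  var i = 0
  while (i < x.length) { y(i) += a * x(i); i += 1 }
}

def addLogisticGradient(data: Array[Double], label: Double,
                        weights: Array[Double],
                        gradientAddTo: Array[Double]): Unit = {
  val margin = -1.0 * dot(weights, data)
  val multiplier = 1.0 / (1.0 + math.exp(margin)) - label
  // Accumulate multiplier * data into the shared gradient buffer
  // without creating a temporary vector.
  axpy(multiplier, data, gradientAddTo)
}
```

For a sparse `data` vector, axpy only touches the stored nonzero indices, which is what makes the in-place formulation pay off on sparse input.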

@AmplabJenkins: Build triggered. One or more automated tests failed.
@AmplabJenkins: Build started. One or more automated tests failed.
@AmplabJenkins: Build finished. One or more automated tests failed.
@AmplabJenkins: One or more automated tests failed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13511/

@mengxr (Contributor, Author) commented Mar 27, 2014

@yinxusen This is WIP. I will let you know when it is ready for review.

@mengxr (Contributor, Author) commented Apr 2, 2014

Jenkins, retest this please.

@AmplabJenkins: Merged build triggered.
@AmplabJenkins: Merged build started.
@AmplabJenkins: Merged build finished. All automated tests passed.
@AmplabJenkins: All automated tests passed. Refer to this link for build results: https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13686/

* Binary label parser, which outputs 1.0 (positive) if the value is greater than 0.5,
* or 0.0 (negative) otherwise.
*/
val binaryLabelParser: String => Double = label => if (label.toDouble > 0.5) 1.0 else 0.0
Review comment (Contributor): Instead of using `String => Double` as the type of these parsers, we should create a `LabelParser` trait and provide some implementations of it, so that it becomes friendly to call from Java.
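
The suggestion could look something like the following sketch (hypothetical names; the follow-up patch may differ in detail): a `LabelParser` trait with binary and multiclass singleton implementations, which a Java caller can pass as an object rather than a Scala function value.

```scala
// Sketch of the suggested LabelParser trait (hypothetical; illustrates the
// review comment rather than the code as eventually merged).
trait LabelParser extends Serializable {
  def parse(labelString: String): Double
}

// Maps values greater than 0.5 to 1.0 (positive), everything else to 0.0,
// mirroring the binaryLabelParser function quoted above.
object BinaryLabelParser extends LabelParser {
  def parse(labelString: String): Double =
    if (labelString.toDouble > 0.5) 1.0 else 0.0
}

// Passes the numeric label through unchanged, for multiclass data.
object MulticlassLabelParser extends LabelParser {
  def parse(labelString: String): Double = labelString.toDouble
}
```

From Java, `BinaryLabelParser` is just an object with a `parse` method, whereas a bare `String => Double` forces Java callers to construct a Scala `Function1`.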

@mateiz (Contributor) commented Apr 2, 2014

@mengxr I made some comments on the libSVM stuff, but I'd also be okay fixing them later if that's more convenient for you, because a lot of other MLlib patches depend on this.

@mateiz (Contributor) commented Apr 2, 2014

I've merged this in now; Xiangrui sent me an IM saying he will handle the libSVM input stuff later.

@mengxr (Contributor, Author) commented Apr 2, 2014

Thanks @mateiz!

@asfgit asfgit closed this in 9c65fa7 Apr 2, 2014
@srowen (Member) commented Apr 4, 2014

Right now, there is still use of jblas in addition to breeze in the code base and in APIs. In theory there are no more API changes before 1.0 now. This seems pretty important to get sorted before the API freeze, though, since it theoretically can't change again before 2.0. Or is this, like a lot of stuff in MLlib, going to be marked as not-stable-yet?

@mengxr (Contributor, Author) commented Apr 4, 2014

@srowen I don't think jblas's DoubleMatrix is exposed in public APIs. But if it is anywhere, yes, we should clean that up before v1.0. We will mark some APIs developer/experimental in the v1.0 release.

@srowen (Member) commented Apr 4, 2014

@mengxr On closer inspection, yes, almost all the uses in an API are actually not public or are in a test method. There are a few places where they turn up, I think, like run() in SVDPlusPlus.scala in graphx, which returns a graph involving DoubleMatrix.

I would wholeheartedly agree with you reserving the right to change all of these APIs before 2.x by marking them experimental, which makes it a non-issue. I am almost certain some of the good stuff coming over the next year will want at least a few API changes.

@mengxr (Contributor, Author) commented Apr 7, 2014

@srowen Thanks for taking a closer look! For graphx interfaces, let's ask @rxin and @jegonzal to see whether they want to hide DoubleMatrix from public interfaces.

asfgit pushed a commit that referenced this pull request Apr 9, 2014
This is a patch to address @mateiz 's comment in #245

MLUtils#loadLibSVMData uses an anonymous function for the label parser. Java users won't like it. So I make a trait for LabelParser and provide two implementations: binary and multiclass.

Author: Xiangrui Meng <meng@databricks.com>

Closes #345 from mengxr/label-parser and squashes the following commits:

ac44409 [Xiangrui Meng] use singleton objects for label parsers
3b1a7c6 [Xiangrui Meng] add tests for label parsers
c2e571c [Xiangrui Meng] rename LabelParser.apply to LabelParser.parse use extends for singleton
11c94e0 [Xiangrui Meng] add return types
7f8eb36 [Xiangrui Meng] change labelParser from annoymous function to trait
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
In PR apache#117, we added dense/sparse vector data model and updated KMeans to support sparse input. This PR is to replace all other `Array[Double]` usage by `Vector` in generalized linear models (GLMs) and Naive Bayes. Major changes:

1. `LabeledPoint` becomes `LabeledPoint(Double, Vector)`.
2. Methods that accept `RDD[Array[Double]]` now accept `RDD[Vector]`. We cannot support both in an elegant way because of type erasure.
3. Mark 'createModel' and 'predictPoint' protected because they are not for end users.
4. Add libSVMFile to MLContext.
5. NaiveBayes can accept arbitrary labels (introducing a breaking change to Python's `NaiveBayesModel`).
6. Gradient computation no longer creates temp vectors.
7. Column normalization and centering are removed from Lasso and Ridge because the operation will densify the data. Simple feature transformation can be done before training.

TODO:
1. ~~Use axpy when possible.~~
2. ~~Optimize Naive Bayes.~~

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#245 from mengxr/vector and squashes the following commits:

eb6e793 [Xiangrui Meng] move libSVMFile to MLUtils and rename to loadLibSVMData
c26c4fc [Xiangrui Meng] update DecisionTree to use RDD[Vector]
11999c7 [Xiangrui Meng] Merge branch 'master' into vector
f7da54b [Xiangrui Meng] add minSplits to libSVMFile
da25e24 [Xiangrui Meng] revert the change to default addIntercept because it might change the behavior of existing code without warning
493f26f [Xiangrui Meng] Merge branch 'master' into vector
7c1bc01 [Xiangrui Meng] add a TODO to NB
b9b7ef7 [Xiangrui Meng] change default value of addIntercept to false
b01df54 [Xiangrui Meng] allow to change or clear threshold in LR and SVM
4addc50 [Xiangrui Meng] merge master
4ca5b1b [Xiangrui Meng] remove normalization from Lasso and update tests
f04fe8a [Xiangrui Meng] remove normalization from RidgeRegression and update tests
d088552 [Xiangrui Meng] use static constructor for MLContext
6f59eed [Xiangrui Meng] update libSVMFile to determine number of features automatically
3432e84 [Xiangrui Meng] update NaiveBayes to support sparse data
0f8759b [Xiangrui Meng] minor updates to NB
b11659c [Xiangrui Meng] style update
78c4671 [Xiangrui Meng] add libSVMFile to MLContext
f0fe616 [Xiangrui Meng] add a test for sparse linear regression
44733e1 [Xiangrui Meng] use in-place gradient computation
e981396 [Xiangrui Meng] use axpy in Updater
db808a1 [Xiangrui Meng] update JavaLR example
befa592 [Xiangrui Meng] passed scala/java tests
75c83a4 [Xiangrui Meng] passed test compile
1859701 [Xiangrui Meng] passed compile
834ada2 [Xiangrui Meng] optimized MLUtils.computeStats update some ml algorithms to use Vector (cont.)
135ab72 [Xiangrui Meng] merge glm
0e57aa4 [Xiangrui Meng] update Lasso and RidgeRegression to parse the weights correctly from GLM mark createModel protected mark predictPoint protected
d7f629f [Xiangrui Meng] fix a bug in GLM when intercept is not used
3f346ba [Xiangrui Meng] update some ml algorithms to use Vector
pdeyhim pushed a commit to pdeyhim/spark-1 that referenced this pull request Jun 25, 2014
This is a patch to address @mateiz 's comment in apache#245

MLUtils#loadLibSVMData uses an anonymous function for the label parser. Java users won't like it. So I make a trait for LabelParser and provide two implementations: binary and multiclass.

Author: Xiangrui Meng <meng@databricks.com>

Closes apache#345 from mengxr/label-parser and squashes the following commits:

ac44409 [Xiangrui Meng] use singleton objects for label parsers
3b1a7c6 [Xiangrui Meng] add tests for label parsers
c2e571c [Xiangrui Meng] rename LabelParser.apply to LabelParser.parse use extends for singleton
11c94e0 [Xiangrui Meng] add return types
7f8eb36 [Xiangrui Meng] change labelParser from annoymous function to trait
davies pushed a commit to davies/spark that referenced this pull request Apr 14, 2015
Add Rd files for sampleByKey() of [SPARKR-163] and sumRDD() of [SPARKR-92]
asfgit pushed a commit that referenced this pull request Apr 17, 2015
This PR pulls in recent changes in SparkR-pkg, including

cartesian, intersection, sampleByKey, subtract, subtractByKey, except, and some API for StructType and StructField.

Author: cafreeman <cfreeman@alteryx.com>
Author: Davies Liu <davies@databricks.com>
Author: Zongheng Yang <zongheng.y@gmail.com>
Author: Shivaram Venkataraman <shivaram.venkataraman@gmail.com>
Author: Shivaram Venkataraman <shivaram@cs.berkeley.edu>
Author: Sun Rui <rui.sun@intel.com>

Closes #5436 from davies/R3 and squashes the following commits:

c2b09be [Davies Liu] SQLTypes -> schema
a5a02f2 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R3
168b7fe [Davies Liu] sort generics
b1fe460 [Davies Liu] fix conflict in README.md
e74c04e [Davies Liu] fix schema.R
4f5ac09 [Davies Liu] Merge branch 'master' of github.com:apache/spark into R5
41f8184 [Davies Liu] rm man
ae78312 [Davies Liu] Merge pull request #237 from sun-rui/SPARKR-154_3
1bdcb63 [Zongheng Yang] Updates to README.md.
5a553e7 [cafreeman] Use object attribute instead of argument
71372d9 [cafreeman] Update docs and examples
8526d2e [cafreeman] Remove `tojson` functions
6ef5f2d [cafreeman] Fix spacing
7741d66 [cafreeman] Rename the SQL DataType function
141efd8 [Shivaram Venkataraman] Merge pull request #245 from hqzizania/upstream
9387402 [Davies Liu] fix style
40199eb [Shivaram Venkataraman] Move except into sorted position
07d0dbc [Sun Rui] [SPARKR-244] Fix test failure after integration of subtract() and subtractByKey() for RDD.
7e8caa3 [Shivaram Venkataraman] Merge pull request #246 from hlin09/fixCombineByKey
ed66c81 [cafreeman] Update `subtract` to work with `generics.R`
f3ba785 [cafreeman] Fixed duplicate export
275deb4 [cafreeman] Update `NAMESPACE` and tests
1a3b63d [cafreeman] new version of `CreateDF`
836c4bf [cafreeman] Update `createDataFrame` and `toDF`
be5d5c1 [cafreeman] refactor schema functions
40338a4 [Zongheng Yang] Merge pull request #244 from sun-rui/SPARKR-154_5
20b97a6 [Zongheng Yang] Merge pull request #234 from hqzizania/assist
ba54e34 [Shivaram Venkataraman] Merge pull request #238 from sun-rui/SPARKR-154_4
c9497a3 [Shivaram Venkataraman] Merge pull request #208 from lythesia/master
b317aa7 [Zongheng Yang] Merge pull request #243 from hqzizania/master
136a07e [Zongheng Yang] Merge pull request #242 from hqzizania/stats
cd66603 [cafreeman] new line at EOF
8b76e81 [Shivaram Venkataraman] Merge pull request #233 from redbaron/fail-early-on-missing-dep
7dd81b7 [cafreeman] Documentation
0e2a94f [cafreeman] Define functions for schema and fields
liancheng pushed a commit to liancheng/spark that referenced this pull request Mar 17, 2017
Changing the autocommit behaviour of the JDBC connection tests has brought a number of issues.
To fix that, went through all the tests and cleaned it up to use autocommit=true everywhere.

Author: Juliusz Sompolski <julek@databricks.com>

Closes apache#245 from juliuszsompolski/SC-5621-fixup.
jamesrgrinter pushed a commit to jamesrgrinter/spark that referenced this pull request Apr 22, 2018
arjunshroff pushed a commit to arjunshroff/spark that referenced this pull request Nov 24, 2020