
[MLlib] SPARK-6348: Enable useFeatureScaling in SVMWithSGD #5055

Closed
wants to merge 7 commits

Conversation

tanyinyan

set useFeatureScaling true in SVMWithSGD, the problem describled in jira (https://issues.apache.org/jira/browse/SPARK-6348)
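For context, the kind of change being discussed is roughly the following (a sketch only, based on the SVMWithSGD source around Spark 1.3; the actual commits in this PR may differ): SVMWithSGD extends GeneralizedLinearAlgorithm, which already carries a useFeatureScaling flag, so forcing scaling amounts to flipping that flag when the algorithm is constructed, as LogisticRegressionWithLBFGS already does.

```scala
// Sketch only -- not the exact patch. SVMWithSGD as it exists upstream,
// with the one proposed addition marked below.
class SVMWithSGD private (
    private var stepSize: Double,
    private var numIterations: Int,
    private var regParam: Double,
    private var miniBatchFraction: Double)
  extends GeneralizedLinearAlgorithm[SVMModel] with Serializable {

  // ... existing gradient/updater/optimizer setup unchanged ...

  // Proposed addition: standardize features before training, mirroring what
  // LogisticRegressionWithLBFGS already does internally.
  this.setFeatureScaling(true)
}
```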

@srowen
Member

srowen commented Mar 17, 2015

ok to test

@srowen
Member

srowen commented Mar 17, 2015

It looks reasonable, but this then forces feature scaling, which sort of changes behavior.
Hm, can this class just be made instantiable so that people can set this as they like? I don't know if there's a good reason that its constructor was made private. @jkbradley

@SparkQA

SparkQA commented Mar 17, 2015

Test build #28716 has finished for PR 5055 at commit 158a766.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@tanyinyan
Author

The test failed. Is there something wrong with the test case? Setting feature scaling to true changes the prediction result. @srowen @AmplabJenkins


File "pyspark/mllib/classification.py", line 232, in main.SVMModel
Failed example:
svm.predict(array([1.0]))
Expected:
1.25...
Got:
1.0107186024067978


@srowen
Member

srowen commented Mar 17, 2015

No, it's almost certainly a result of your change. The prediction of the SVM changes if you force it to scale features. This is why I'm not sure this is the right thing to do; it swaps one hard-coded behavior for the other, and changes behavior. I'd prefer to simply make this selectable for all models.

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28757 has finished for PR 5055 at commit 249d36a.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SVMWithSGD (

Provide an interface in object SVMWithSGD to set useFeatureScaling
@tanyinyan
Author

Agreed. I committed a version that makes the SVMWithSGD class public, but it fails the Spark unit tests and I don't know why. Maybe providing an interface in object SVMWithSGD would also be OK? @srowen
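For illustration, such an overload might look roughly like this; the signature below is an assumption, not the committed code, and it would have to live inside Spark's existing object SVMWithSGD since the constructor and setter are not public at this point.

```scala
// Hypothetical extra train() overload inside object SVMWithSGD -- parameter
// order and naming are assumptions, not the committed change.
def train(
    input: RDD[LabeledPoint],
    numIterations: Int,
    stepSize: Double,
    regParam: Double,
    miniBatchFraction: Double,
    useFeatureScaling: Boolean): SVMModel = {
  new SVMWithSGD(stepSize, numIterations, regParam, miniBatchFraction)
    .setFeatureScaling(useFeatureScaling)
    .run(input)
}
```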

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28780 has finished for PR 5055 at commit ef437cb.

  • This patch fails Scala style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Mar 18, 2015

Test build #28782 has finished for PR 5055 at commit 26558da.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@srowen
Member

srowen commented Mar 18, 2015

Hm, maybe. I am sort of reluctant to add yet another utility method overload to expose this when we could just add it as a setter. @mengxr, did you say you also favored not adding more of these methods in the objects? What do you think about making this constructor non-private, to allow direct access to the class and its setter for feature scaling?

@SparkQA

SparkQA commented Mar 19, 2015

Test build #28850 has finished for PR 5055 at commit 3c622f8.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SVMWithSGD (

make setFeatureScaling public
@SparkQA

SparkQA commented Mar 19, 2015

Test build #28854 has finished for PR 5055 at commit 2dc9cb8.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SVMWithSGD (

@tanyinyan
Author

Yes, I have made this constructor and setter public. @srowen
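For illustration, with the constructor and setter public, caller-side usage would look roughly like this (the parameter values are placeholders, not recommendations):

```scala
import org.apache.spark.mllib.classification.SVMWithSGD
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.rdd.RDD

// Hypothetical usage once the constructor and setFeatureScaling are public.
def trainScaledSvm(training: RDD[LabeledPoint]) = {
  val svm = new SVMWithSGD(/* stepSize */ 1.0, /* numIterations */ 100,
                           /* regParam */ 0.01, /* miniBatchFraction */ 1.0)
  svm.setFeatureScaling(true)
  svm.run(training)
}
```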

@srowen
Member

srowen commented Mar 19, 2015

OK. I want to see if @mengxr is OK with this as it adds a new bit of API, technically. I think we'd want to document the params and perhaps mark this as ::Experimental:: if this is exposed.

Document the params of SVMWithSGD constructor and mark it as ::Experimental::
@SparkQA

SparkQA commented Mar 20, 2015

Test build #28902 has finished for PR 5055 at commit 32c8507.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class SVMWithSGD (

@tanyinyan
Author

I have documented the params of the SVMWithSGD constructor and marked it as ::Experimental::.

@mengxr
Contributor

mengxr commented Mar 20, 2015

Sorry for my late response! I lean toward not exposing this method, to keep the API lightweight. If we feel feature scaling is good, we can enable it by default. For example, LIBLINEAR uses feature scaling but doesn't expose it to users.

@jkbradley
Member

Apologies for my late response too. I feel like what we really need to do is clarify the intended behavior of feature scaling. Currently, feature scaling changes the optimal solution vector for regularized models since it changes each feature's relative amount of regularization. I see 2 options:

  • Keep the current behavior.
    • If we go with this behavior, then we should expose it as an option since it changes the optimal solution vector.
  • Adjust the regularization parameter for each feature such that the optimal solution vector is identical (after rescaling) to the solution for the original problem (before scaling).
    • I believe this is what libsvm (or maybe liblinear) does.
    • If we do feature scaling under the hood (and do not expose it as an option), then we should use this behavior. Otherwise, users will be confused when the optimal solution is not what they would expect.

I strongly vote for the 2nd option: It has the intended benefit of improving optimization behavior, and it is better for users since it gives them the solution they expect.
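For concreteness, here is what option 2 amounts to for plain L2 regularization, assuming each feature $j$ is divided by its standard deviation $\sigma_j$ (a sketch of the equivalence, not a statement about what MLlib currently implements):

$$
\min_w \sum_i L(w^\top x_i, y_i) + \frac{\lambda}{2}\sum_j w_j^2
\;\equiv\;
\min_{w'} \sum_i L(w'^\top x'_i, y_i) + \frac{\lambda}{2}\sum_j \frac{(w'_j)^2}{\sigma_j^2},
\qquad x'_{ij} = \frac{x_{ij}}{\sigma_j},\; w'_j = \sigma_j w_j .
$$

Dividing feature $j$'s penalty by $\sigma_j^2$ keeps the optimum identical after mapping back via $w_j = w'_j/\sigma_j$; keeping a single $\lambda$ on the scaled problem (the current behavior) changes it.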

@srowen
Member

srowen commented Mar 27, 2015

@tanyinyan are you in a position to implement @jkbradley's suggestion? It's more involved, for sure, but it would indeed not change the solution, which sounds nice. I apologize for taking this down the wrong path of "option 1", exposing it as an option.

@tanyinyan
Author

Hi @jkbradley, @srowen, I've been considering "option 2" these days and looked into what libsvm and liblinear do. I found that in both libsvm and liblinear, scaling changes the performance.

The libsvm guide (http://www.csie.ntu.edu.tw/~cjlin/papers/guide/guide.pdf) says, "We recommend linearly scaling each attribute to the range [−1, +1] or [0, 1]", and Appendix A gives examples where scaling improves accuracy.

The same holds for liblinear (http://cran.r-project.org/web/packages/LiblineaR/LiblineaR.pdf): "Classification models usually perform better if each dimension of the data is first centered and scaled."

The goal of scaling is to improve accuracy and the convergence rate; if the optimal solution vector were identical before and after scaling, scaling would be pointless.

So I suggest calling setFeatureScaling(true) internally, without exposing it as an option, as is done in class LogisticRegressionWithLBFGS.

@jkbradley
Member

@tanyinyan I think what you're arguing for is actually option (1). I propose this combination of the solutions:

Expose setFeatureScaling() as an option. Default to true.

If featureScaling is true, then we scale features and do not adjust regularization. This will change the optimal solution, but as in your references, it is generally better to do anyway. (My experience is the same.)

If featureScaling is false, then we scale features internally but also adjust regularization. This will improve optimization behavior but will not change the optimal solution.

Defaulting to true will mean the algorithm will probably do the best thing by default, but will allow informed users to get what they really want if necessary.

This proposal will also avoid an API change since the meaning of featureScaling will stay the same.
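A conceptual sketch of the proposed semantics, shown where the training data and regularization would be prepared (the helper names are hypothetical, not the MLlib API):

```scala
// Conceptual sketch only; computeColumnStd and divideByStd are assumed helpers.
val featureStd: Array[Double] = computeColumnStd(data)
val scaledData = data.map(p => divideByStd(p, featureStd))

val perFeatureRegParam: Array[Double] =
  if (featureScaling) {
    // featureScaling = true (default): scale features, keep a single regParam.
    // The optimum changes, but accuracy and convergence are usually better.
    Array.fill(featureStd.length)(regParam)
  } else {
    // featureScaling = false: still scale internally, but divide each scaled
    // weight's L2 penalty by std_j^2 so the optimum matches the unscaled problem.
    featureStd.map(s => if (s != 0.0) regParam / (s * s) else regParam)
  }
```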

@tanyinyan
Author

" If featureScaling is false, then we scale features internally but also adjust regularization. This will improve optimization behavior but will not change the optimal solution."

@jkbradley, I don't understand the meaning of "optimization behavior" here; does it mean convergence rate? If we scale features internally and also adjust regularization, then we get the same gradient for every labeled point as without scaling, so I think that if the optimal solution is unchanged, the optimization behavior is unchanged too.

Have I understood correctly?

@jkbradley
Member

@tanyinyan "Optimization behavior" means convergence rate, yes.

If we scale features internally and also adjust regularization, then:

  • The optimal solution will not change. (I agree with you on this.)
  • The optimization behavior will change. This is because we use a single step size for all features.
    • E.g., suppose we have 2 features a and b, where norm(column b) = 1000 * norm(column a). The step size we use needs to be adjusted based on the norm of the feature columns; since column b has a really big norm, we will need to use a very small step size. This means we will progress really slowly, especially if (a) is the useful feature.

Does that make sense?

We could actually adjust the step size for each feature, rather than scaling the data. Come to think of it, that might be a more efficient solution since it's cheaper than creating a new copy of the data. I'll make a JIRA for that since it belongs in another PR.
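To make the step-size point above concrete, here is a tiny standalone toy (plain Scala, not Spark code): gradient descent on a two-feature least-squares problem where one column's scale is 1000x the other's. The single shared step size has to be small enough for the large-norm feature, so the small-norm weight barely moves.

```scala
object StepSizeToy extends App {
  // f(w) = 0.5 * (1.0 * w(0) - 1)^2 + 0.5 * (1000.0 * w(1) - 1)^2
  val scales = Array(1.0, 1000.0)
  var w = Array(0.0, 0.0)
  // A shared step size must satisfy stepSize < 2 / max(scale^2) = 2e-6.
  val stepSize = 1e-6
  for (_ <- 1 to 1000) {
    val grad = scales.zip(w).map { case (s, wi) => s * (s * wi - 1.0) }
    w = w.zip(grad).map { case (wi, g) => wi - stepSize * g }
  }
  // w(1) reaches its optimum (0.001) almost immediately, but after 1000
  // iterations w(0) is still near 0.001, far from its optimum of 1.0.
  println(s"w = ${w.mkString(", ")}")
}
```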

@jkbradley
Member

After speaking with @mengxr, we might want to make some hard decisions about changing behaviors and hiding feature scaling. I noted them here: https://issues.apache.org/jira/browse/SPARK-6683

@AmplabJenkins

Can one of the admins verify this patch?

@srowen
Member

srowen commented Apr 27, 2015

I think this is WontFix in favor of SPARK-6683?

@mengxr
Contributor

mengxr commented Apr 27, 2015

Yes. @tanyinyan Do you mind closing this for now? Thanks everyone for the discussion!

@tanyinyan
Author

OK. @mengxr, @jkbradley, how is SPARK-6683 going? I'd be very glad to join and contribute to it.

@dbtsai
Member

dbtsai commented Apr 28, 2015

SPARK-6683 will be more about doing the feature scaling within the objective function. LinearRegression with L1/L2 (ElasticNet) using OWLQN (#4259) does the scaling internally, so there is no need to create a new dataset.
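Roughly, "doing the scaling within the objective function" means keeping the original dataset and folding the per-feature standard deviations into the margin and gradient computation on the fly. Here is a hedged sketch of that idea (a hypothetical helper using squared loss for simplicity, not the #4259 code):

```scala
// Sketch of scaling inside the objective: no rescaled copy of the data is
// materialized; featureStd is applied while computing the margin and gradient.
def scaledSquaredLossGradient(
    features: Array[Double],
    label: Double,
    weights: Array[Double],
    featureStd: Array[Double]): Array[Double] = {
  // Margin against virtually-scaled features x_j / std_j.
  var margin = 0.0
  var j = 0
  while (j < features.length) {
    if (featureStd(j) != 0.0) margin += weights(j) * features(j) / featureStd(j)
    j += 1
  }
  val diff = margin - label // dLoss/dMargin for squared loss
  Array.tabulate(features.length) { j =>
    if (featureStd(j) != 0.0) diff * features(j) / featureStd(j) else 0.0
  }
}
```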

asfgit closed this in 555213e Apr 28, 2015