[SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py #11203

yinxusen · 2016-02-15T00:25:45Z

Add save/load for feature.py. Also add save/load for ElementwiseProduct on the Scala side and fix a missing setDefault in VectorSlicer and StopWordsRemover.

This PR skips RFormula and RFormulaModel because their Scala implementation is pending in #9884. I'll add them here if #9884 is merged first; otherwise I'll open a follow-up JIRA for RFormula.

yinxusen · 2016-02-15T00:28:04Z

ok to test

holdenk · 2016-02-15T01:06:50Z

python/pyspark/ml/feature.py

@@ -53,6 +53,18 @@ class Binarizer(JavaTransformer, HasInputCol, HasOutputCol):
    >>> params = {binarizer.threshold: -0.5, binarizer.outputCol: "vector"}
    >>> binarizer.transform(df, params).head().vector
    1.0
+     >>> import tempfile


As ended up happening in a follow-up to #10999, we might want to simplify this a bit so it reads more like an example, since doctests are also (ideally) readable by users. Maybe wait for #11197 and then follow the pattern in that PR. cc @mengxr

Good to know. I'll update mine to follow yours after it gets merged.

Cheers

Xusen Yin (尹绪森)
LinkedIn: https://cn.linkedin.com/in/xusenyin

SparkQA · 2016-02-15T01:12:24Z

Test build #51285 has finished for PR 11203 at commit 9aba283.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol, MLReadable, MLWritable):
- class VectorAssembler(JavaTransformer, HasInputCols, HasOutputCol, MLReadable, MLWritable):
- class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, MLReadable, MLWritable):
- class VectorIndexerModel(JavaModel, MLReadable, MLWritable):
- class VectorSlicer(JavaTransformer, HasInputCol, HasOutputCol, MLReadable, MLWritable):
- class Word2Vec(JavaEstimator, HasStepSize, HasMaxIter, HasSeed, HasInputCol, HasOutputCol,
- class Word2VecModel(JavaModel, MLReadable, MLWritable):
- class PCA(JavaEstimator, HasInputCol, HasOutputCol, MLReadable, MLWritable):
- class PCAModel(JavaModel, MLReadable, MLWritable):
- class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, MLReadable,
- class ChiSqSelectorModel(JavaModel, MLReadable, MLWritable):

yinxusen · 2016-02-15T21:20:51Z

retest it please

SparkQA · 2016-02-15T22:05:03Z

Test build #51323 has finished for PR 11203 at commit 7159154.

This patch fails PySpark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-02-16T01:05:19Z

Test build #51327 has finished for PR 11203 at commit bb1a2f4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yinxusen · 2016-02-25T22:51:35Z

test it please

SparkQA · 2016-02-25T23:39:52Z

Test build #52002 has finished for PR 11203 at commit 918f7e3.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, HasSeed, MLReadable,

yinxusen · 2016-02-25T23:43:59Z

@mengxr @yanboliang Ready for review.

yanboliang · 2016-02-26T08:57:06Z

python/pyspark/ml/feature.py

+    >>> loadedHashingTF = HashingTF.load(hashingTFPath)
+    >>> param = loadedHashingTF.getParam("numFeatures")
+    >>> loadedHashingTF.getOrDefault(param) == hashingTF.getOrDefault(param)
+    True


Could you use getNumFeatures in the doctest, like the other transformers do? It will make the test cleaner. HashingTF extends HasNumFeatures, so it has this method.
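
A minimal sketch of what this suggestion amounts to. It uses a hypothetical pure-Python stand-in (`ToyHashingTF`) instead of the real pyspark class so that it runs without a Spark session. The JSON-based `save`/`load` is an assumption made for illustration only and is not how MLReadable/MLWritable persist data.

```python
import json
import os
import tempfile


class ToyHashingTF:
    """Hypothetical stand-in for a transformer with a numFeatures param."""

    def __init__(self, numFeatures):
        self._params = {"numFeatures": numFeatures}

    # Generic accessors, mirroring the verbose style in the doctest under review.
    def getParam(self, name):
        if name not in self._params:
            raise KeyError(name)
        return name

    def getOrDefault(self, param):
        return self._params[param]

    # Dedicated getter, the style the review asks for.
    def getNumFeatures(self):
        return self.getOrDefault("numFeatures")

    # Toy persistence: write the params as JSON (illustration only).
    def save(self, path):
        with open(path, "w") as f:
            json.dump(self._params, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(**json.load(f))


temp_path = tempfile.mkdtemp()
hashingTFPath = os.path.join(temp_path, "hashing-tf")

hashingTF = ToyHashingTF(numFeatures=10)
hashingTF.save(hashingTFPath)
loadedHashingTF = ToyHashingTF.load(hashingTFPath)

# One readable comparison instead of getParam followed by getOrDefault.
same_num_features = loadedHashingTF.getNumFeatures() == hashingTF.getNumFeatures()
print(same_num_features)
```

The round-trip check is the same either way; the dedicated getter just removes the intermediate `param` variable from the doctest.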

yanboliang · 2016-02-26T09:30:29Z

@yinxusen Looks good overall, I left some inline comments. Thanks!

yinxusen · 2016-02-26T09:33:56Z

@yanboliang Thanks for reviewing it! I'll change them soon.

yinxusen · 2016-03-01T22:17:31Z

@yanboliang I left a transform call in the doctests of VectorAssembler, Tokenizer, IDFModel, and MaxAbsScalerModel. The others are fixed, e.g. by removing the unnecessary fit.

SparkQA · 2016-03-01T22:35:14Z

Test build #52257 has finished for PR 11203 at commit 4a63fbe.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2016-03-01T23:01:42Z

Test build #52259 has finished for PR 11203 at commit 162f0c6.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class MaxAbsScaler(JavaEstimator, HasInputCol, HasOutputCol, MLReadable, MLWritable):
- class MaxAbsScalerModel(JavaModel, MLReadable, MLWritable):

SparkQA · 2016-03-02T00:29:04Z

Test build #52256 has finished for PR 11203 at commit 749f01b.

This patch passes all tests.
This patch does not merge cleanly.
This patch adds no public classes.

yanboliang · 2016-03-02T10:03:14Z

python/pyspark/ml/feature.py

+    >>> modelPath = temp_path + "/max-abs-scaler-model"
+    >>> model.save(modelPath)
+    >>> loadedModel = MaxAbsScalerModel.load(modelPath)
+    >>> loadedModel.transform(df).first().scaled == model.transform(df).first().scaled


Here we should check the equality of maxAbs, which is a vector.
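
A rough sketch of why comparing the fitted vector can be a stronger check than comparing one transformed row. `ToyMaxAbsScalerModel` is a hypothetical pure-Python stand-in, not the pyspark class, and its JSON persistence and list-based `maxAbs` attribute are assumptions made so the example runs without Spark.

```python
import json
import os
import tempfile


class ToyMaxAbsScalerModel:
    """Hypothetical stand-in holding a per-feature max-abs vector."""

    def __init__(self, maxAbs):
        self.maxAbs = list(maxAbs)

    def transform(self, rows):
        # Divide each feature by its max absolute value.
        return [[x / m for x, m in zip(row, self.maxAbs)] for row in rows]

    # Toy persistence: write the vector as JSON (illustration only).
    def save(self, path):
        with open(path, "w") as f:
            json.dump(self.maxAbs, f)

    @classmethod
    def load(cls, path):
        with open(path) as f:
            return cls(json.load(f))


rows = [[0.0, 0.0], [2.0, 4.0]]
model = ToyMaxAbsScalerModel([2.0, 4.0])

modelPath = os.path.join(tempfile.mkdtemp(), "max-abs-scaler-model")
model.save(modelPath)
loadedModel = ToyMaxAbsScalerModel.load(modelPath)

# Direct check on the fitted state after the round trip.
round_trip_ok = loadedModel.maxAbs == model.maxAbs

# A model with a different maxAbs still agrees on the first (all-zero) row,
# so comparing a single transformed row cannot detect the difference.
wrong = ToyMaxAbsScalerModel([1.0, 1.0])
first_rows_match = wrong.transform(rows)[0] == model.transform(rows)[0]
vectors_match = wrong.maxAbs == model.maxAbs
print(round_trip_ok, first_rows_match, vectors_match)
```

In this toy setup the first transformed rows match even though the two models differ, which is the kind of false pass a direct comparison of maxAbs avoids.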

yinxusen · 2016-03-03T07:14:30Z

@yanboliang fixed and added the interface

yinxusen · 2016-03-03T07:14:55Z

test it please

SparkQA · 2016-03-03T08:00:45Z

Test build #52381 has finished for PR 11203 at commit ecd1df1.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

yanboliang · 2016-03-03T09:51:34Z

LGTM for me, cc @mengxr

mengxr · 2016-03-03T17:44:02Z

@yinxusen Could you resolve conflicts with master?

yinxusen · 2016-03-04T00:31:49Z

@mengxr Solved.

SparkQA · 2016-03-04T01:22:10Z

Test build #52423 has finished for PR 11203 at commit 730a639.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

mengxr · 2016-03-04T16:32:40Z

Merged into master. Thanks!

Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`. In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in apache#9884. I'll add them in this PR if apache#9884 gets merged first. Or add a follow-up JIRA for `RFormula`. Author: Xusen Yin <yinxusen@gmail.com> Closes apache#11203 from yinxusen/SPARK-13036.

yinxusen added 3 commits February 13, 2016 22:18

add ElementwiseProduct save/load

f7b9f6c

push first part

38eb277

save/load for feature part2

9aba283

holdenk reviewed Feb 15, 2016

fix error in python2.6

7159154

remove unidoc/str equal test

bb1a2f4

yinxusen added 3 commits February 25, 2016 14:47

reduce code

2e4394c

Merge branch 'master' into SPARK-13036

65049fa

merge with master

918f7e3

yanboliang reviewed Feb 26, 2016

yinxusen added 3 commits March 1, 2016 13:45

remove unnecessary fit

749f01b

merge with master

4a63fbe

add load/save for MaxAbsScaler

162f0c6

yanboliang reviewed Mar 2, 2016

add new test and maxAbs interface

ecd1df1

merge with master

730a639

asfgit closed this in 83302c3 Mar 4, 2016
