Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-13036][SPARK-13318][SPARK-13319] Add save/load for feature.py #11203

Closed
wants to merge 13 commits into from

Conversation

yinxusen
Copy link
Contributor

Add save/load for feature.py. Meanwhile, add save/load for ElementwiseProduct in Scala side and fix a bug of missing setDefault in VectorSlicer and StopWordsRemover.

In this PR I ignore the RFormula and RFormulaModel because its Scala implementation is pending in #9884. I'll add them in this PR if #9884 gets merged first. Or add a follow-up JIRA for RFormula.

@yinxusen
Copy link
Contributor Author

ok to test

@@ -53,6 +53,18 @@ class Binarizer(JavaTransformer, HasInputCol, HasOutputCol):
>>> params = {binarizer.threshold: -0.5, binarizer.outputCol: "vector"}
>>> binarizer.transform(df, params).head().vector
1.0
>>> import tempfile
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So as ended up being a follow up to #10999 we might want to simplify this a bit so it is more like an example since doctests are also (ideally) readable by users - maybe waiting for #11197 and then following the pattern in that PR. cc @mengxr

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good to know that. I'll update mine according to yours after it getting
merged.

2016年2月14日星期日,Holden Karau notifications@github.com 写道:

In python/pyspark/ml/feature.py
#11203 (comment):

@@ -53,6 +53,18 @@ class Binarizer(JavaTransformer, HasInputCol, HasOutputCol):
>>> params = {binarizer.threshold: -0.5, binarizer.outputCol: "vector"}
>>> binarizer.transform(df, params).head().vector
1.0

  • >>> import tempfile
    

So as ended up being a follow up to #10999
#10999 we might want to simplify
this a bit so it is more like an example since doctests are also (ideally)
readable by users - maybe waiting for #11197
#11197 and then following the
pattern in that PR. cc @mengxr https://github.com/mengxr


Reply to this email directly or view it on GitHub
https://github.com/apache/spark/pull/11203/files#r52852811.

Cheers

Xusen Yin (尹绪森)
LinkedIn: https://cn.linkedin.com/in/xusenyin

@SparkQA
Copy link

SparkQA commented Feb 15, 2016

Test build #51285 has finished for PR 11203 at commit 9aba283.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol, MLReadable, MLWritable):
    • class VectorAssembler(JavaTransformer, HasInputCols, HasOutputCol, MLReadable, MLWritable):
    • class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol, MLReadable, MLWritable):
    • class VectorIndexerModel(JavaModel, MLReadable, MLWritable):
    • class VectorSlicer(JavaTransformer, HasInputCol, HasOutputCol, MLReadable, MLWritable):
    • class Word2Vec(JavaEstimator, HasStepSize, HasMaxIter, HasSeed, HasInputCol, HasOutputCol,
    • class Word2VecModel(JavaModel, MLReadable, MLWritable):
    • class PCA(JavaEstimator, HasInputCol, HasOutputCol, MLReadable, MLWritable):
    • class PCAModel(JavaModel, MLReadable, MLWritable):
    • class ChiSqSelector(JavaEstimator, HasFeaturesCol, HasOutputCol, HasLabelCol, MLReadable,
    • class ChiSqSelectorModel(JavaModel, MLReadable, MLWritable):

@yinxusen
Copy link
Contributor Author

retest it please

@SparkQA
Copy link

SparkQA commented Feb 15, 2016

Test build #51323 has finished for PR 11203 at commit 7159154.

  • This patch fails PySpark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Feb 16, 2016

Test build #51327 has finished for PR 11203 at commit bb1a2f4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yinxusen
Copy link
Contributor Author

test it please

@SparkQA
Copy link

SparkQA commented Feb 25, 2016

Test build #52002 has finished for PR 11203 at commit 918f7e3.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class QuantileDiscretizer(JavaEstimator, HasInputCol, HasOutputCol, HasSeed, MLReadable,

@yinxusen
Copy link
Contributor Author

@mengxr @yanboliang Ready for review.

>>> loadedHashingTF = HashingTF.load(hashingTFPath)
>>> param = loadedHashingTF.getParam("numFeatures")
>>> loadedHashingTF.getOrDefault(param) == hashingTF.getOrDefault(param)
True
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you use getNumFeatures like other transformers in the doc test? It will make your test clean. HashingTF extends from HasNumFeatures, so it has this method.

@yanboliang
Copy link
Contributor

@yinxusen Looks good overall, I left some inline comments. Thanks!

@yinxusen
Copy link
Contributor Author

@yanboliang Thanks for reviewing it! I'll change them soon.

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 1, 2016

@yanboliang I leave doctests of VectorAssembler, Tokenizer, IDFModel, MaxAbsScalerModel with a transform for each of them. Others are fixed like remove fit.

@SparkQA
Copy link

SparkQA commented Mar 1, 2016

Test build #52257 has finished for PR 11203 at commit 4a63fbe.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Mar 1, 2016

Test build #52259 has finished for PR 11203 at commit 162f0c6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class MaxAbsScaler(JavaEstimator, HasInputCol, HasOutputCol, MLReadable, MLWritable):
    • class MaxAbsScalerModel(JavaModel, MLReadable, MLWritable):

@SparkQA
Copy link

SparkQA commented Mar 2, 2016

Test build #52256 has finished for PR 11203 at commit 749f01b.

  • This patch passes all tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

>>> modelPath = temp_path + "/max-abs-scaler-model"
>>> model.save(modelPath)
>>> loadedModel = MaxAbsScalerModel.load(modelPath)
>>> loadedModel.transform(df).first().scaled == model.transform(df).first().scaled
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Here we should check the equality of maxAbs which is a vector.

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 3, 2016

@yanboliang fixed and added the interface

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 3, 2016

test it please

@SparkQA
Copy link

SparkQA commented Mar 3, 2016

Test build #52381 has finished for PR 11203 at commit ecd1df1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@yanboliang
Copy link
Contributor

LGTM for me, cc @mengxr

@mengxr
Copy link
Contributor

mengxr commented Mar 3, 2016

@yinxusen Could you resolve conflicts with master?

@yinxusen
Copy link
Contributor Author

yinxusen commented Mar 4, 2016

@mengxr Solved.

@SparkQA
Copy link

SparkQA commented Mar 4, 2016

Test build #52423 has finished for PR 11203 at commit 730a639.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@mengxr
Copy link
Contributor

mengxr commented Mar 4, 2016

Merged into master. Thanks!

@asfgit asfgit closed this in 83302c3 Mar 4, 2016
roygao94 pushed a commit to roygao94/spark that referenced this pull request Mar 22, 2016
Add save/load for feature.py. Meanwhile, add save/load for `ElementwiseProduct` in Scala side and fix a bug of missing `setDefault` in `VectorSlicer` and `StopWordsRemover`.

In this PR I ignore the `RFormula` and `RFormulaModel` because its Scala implementation is pending in apache#9884. I'll add them in this PR if apache#9884 gets merged first. Or add a follow-up JIRA for `RFormula`.

Author: Xusen Yin <yinxusen@gmail.com>

Closes apache#11203 from yinxusen/SPARK-13036.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants