[SPARK-7383][ML] Feature Parity in PySpark for ml.features #5991

brkyvz · 2015-05-07T22:49:49Z

Implemented Python wrappers for the Scala feature transformers that don't yet exist in PySpark's ml.feature module (python/pyspark/ml/feature.py).

AmplabJenkins · 2015-05-07T22:52:11Z

Merged build triggered.

AmplabJenkins · 2015-05-07T22:52:19Z

Merged build started.

SparkQA · 2015-05-07T22:53:53Z

Test build #32156 has started for PR 5991 at commit bd39fd2.

SparkQA · 2015-05-08T00:42:10Z

Test build #32156 has finished for PR 5991 at commit bd39fd2.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Binarizer(JavaTransformer, HasInputCol, HasOutputCol):
- class IDF(JavaEstimator, HasInputCol, HasOutputCol):
- class IDFModel(JavaModel):
- class Normalizer(JavaTransformer, HasInputCol, HasOutputCol):
- class OneHotEncoder(JavaTransformer, HasInputCol, HasOutputCol):
- class PolynomialExpansion(JavaTransformer, HasInputCol, HasOutputCol):
- class StandardScaler(JavaEstimator, HasInputCol, HasOutputCol):
- class StandardScalerModel(JavaModel):
- class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol):
- class StringIndexerModel(JavaModel):
- class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
- class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol):
- class Word2Vec(JavaEstimator, HasStepSize, HasMaxIter, HasSeed, HasInputCol, HasOutputCol):
- class Word2VecModel(JavaModel):
- class HasSeed(Params):
- class HasTol(Params):
- class HasStepSize(Params):

AmplabJenkins · 2015-05-08T00:42:15Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-08T00:42:15Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32156/

mengxr · 2015-05-08T05:11:49Z

python/pyspark/ml/feature.py

-    Traceback (most recent call last):
-        ...
-    TypeError: Method setParams forces keyword arguments.
+    >>> df = sc.parallelize([Row(values=0.5)]).toDF()


minor: I'm not sure which one is the recommended approach to create a DataFrame. @rxin

df = sc.parallelize([Row(values=0.5)]).toDF()

vs.

df = sqlContext.createDataFrame([(0.5,)], ["values"]) # don't need to import Row

I prefer the second approach, since it doesn't need the Row import.

mengxr · 2015-05-08T05:15:15Z

@brkyvz Thanks for working on this! It looks good except for the variable naming in the doctests. RegexTokenizer also seems to be missing from the list. Could you add it as well?

AmplabJenkins · 2015-05-08T15:02:11Z

Merged build triggered.

AmplabJenkins · 2015-05-08T15:02:16Z

Merged build started.

SparkQA · 2015-05-08T15:04:01Z

Test build #32240 has started for PR 5991 at commit adcca55.

mengxr · 2015-05-08T16:11:08Z

LGTM pending Jenkins.

SparkQA · 2015-05-08T16:43:50Z

Test build #32240 has finished for PR 5991 at commit adcca55.

This patch passes all tests.
This patch merges cleanly.
This patch adds the following public classes (experimental):
- class Binarizer(JavaTransformer, HasInputCol, HasOutputCol):
- class IDF(JavaEstimator, HasInputCol, HasOutputCol):
- class IDFModel(JavaModel):
- class Normalizer(JavaTransformer, HasInputCol, HasOutputCol):
- class OneHotEncoder(JavaTransformer, HasInputCol, HasOutputCol):
- class PolynomialExpansion(JavaTransformer, HasInputCol, HasOutputCol):
- class RegexTokenizer(JavaTransformer, HasInputCol, HasOutputCol):
- class StandardScaler(JavaEstimator, HasInputCol, HasOutputCol):
- class StandardScalerModel(JavaModel):
- class StringIndexer(JavaEstimator, HasInputCol, HasOutputCol):
- class StringIndexerModel(JavaModel):
- class Tokenizer(JavaTransformer, HasInputCol, HasOutputCol):
- class VectorIndexer(JavaEstimator, HasInputCol, HasOutputCol):
- class Word2Vec(JavaEstimator, HasStepSize, HasMaxIter, HasSeed, HasInputCol, HasOutputCol):
- class Word2VecModel(JavaModel):
- class HasSeed(Params):
- class HasTol(Params):
- class HasStepSize(Params):

AmplabJenkins · 2015-05-08T16:43:55Z

Merged build finished. Test PASSed.

AmplabJenkins · 2015-05-08T16:43:55Z

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/32240/

mengxr · 2015-05-08T18:15:08Z

Merged into master and branch-1.4. Thanks!

Implemented python wrappers for Scala functions that don't exist in `ml.features`

Author: Burak Yavuz <brkyvz@gmail.com>

Closes #5991 from brkyvz/ml-feat-PR and squashes the following commits:

adcca55 [Burak Yavuz] add regex tokenizer to __all__
b91cb44 [Burak Yavuz] addressed comments
bd39fd2 [Burak Yavuz] remove addition
b82bd7c [Burak Yavuz] Parity in PySpark for ml.features

(cherry picked from commit f5ff4a8)
Signed-off-by: Xiangrui Meng <meng@databricks.com>

brkyvz added 2 commits May 7, 2015 15:44

Parity in PySpark for ml.features

b82bd7c

remove addition

bd39fd2

mengxr reviewed May 8, 2015

brkyvz added 2 commits May 8, 2015 07:59

addressed comments

b91cb44

add regex tokenizer to __all__

adcca55

asfgit closed this in f5ff4a8 May 8, 2015
