[SPARK-26133][ML] Remove deprecated OneHotEncoder and rename OneHotEncoderEstimator to OneHotEncoder

## What changes were proposed in this pull request?

We have deprecated `OneHotEncoder` at Spark 2.3.0 and introduced `OneHotEncoderEstimator`. At 3.0.0, we remove deprecated `OneHotEncoder` and rename `OneHotEncoderEstimator` to `OneHotEncoder`.

TODO: According to ML migration guide, we need to keep `OneHotEncoderEstimator` as an alias after renaming. This is not done at this patch in order to facilitate review.

## How was this patch tested?

Existing tests.

Closes apache#23100 from viirya/remove_one_hot_encoder.

Authored-by: Liang-Chi Hsieh <viirya@gmail.com>
Signed-off-by: DB Tsai <d_tsai@apple.com>

kai-chi · Nov 29, 2018 · 8bfea86
1 parent fa0d4bf
commit 8bfea86
Showing 12 changed files with 841 additions and 1,222 deletions.
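
For code that has to run on both sides of this rename, one option is to look up the estimator-style class by whichever name the installed release exposes. The following is only a sketch: `resolve_encoder` is a hypothetical helper, not part of Spark, and the two `SimpleNamespace` objects are stand-ins for `pyspark.ml.feature` so the snippet runs without a Spark installation. It checks for `OneHotEncoderEstimator` first because in 2.3/2.4 the name `OneHotEncoder` still refers to the deprecated transformer; whether an alias survives in 3.0 is left open by the TODO in the commit message.

```python
from types import SimpleNamespace


def resolve_encoder(feature_module):
    """Return the estimator-style one-hot encoder class from a module.

    Hypothetical helper: prefers `OneHotEncoderEstimator` (the 2.3/2.4 name)
    and falls back to `OneHotEncoder` (the name after this commit).
    """
    estimator = getattr(feature_module, "OneHotEncoderEstimator", None)
    if estimator is not None:
        return estimator
    return feature_module.OneHotEncoder


# Stand-in classes and modules, used only to exercise the lookup.
class DeprecatedTransformer:
    pass


class EstimatorStyleEncoder:
    pass


spark_2_4_like = SimpleNamespace(
    OneHotEncoder=DeprecatedTransformer,
    OneHotEncoderEstimator=EstimatorStyleEncoder,
)
spark_3_0_like = SimpleNamespace(OneHotEncoder=EstimatorStyleEncoder)

print(resolve_encoder(spark_2_4_like).__name__)
print(resolve_encoder(spark_3_0_like).__name__)
```

In a real job the argument would be the imported `pyspark.ml.feature` module, and the returned class would be constructed with `inputCols` and `outputCols` as in the examples changed by this commit.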
diff --git a/docs/ml-features.md b/docs/ml-features.md
@@ -779,43 +779,37 @@ for more details on the API.
 </div>
 </div>
 
-## OneHotEncoder (Deprecated since 2.3.0)
-
-Because this existing `OneHotEncoder` is a stateless transformer, it is not usable on new data where the number of categories may differ from the training data. In order to fix this, a new `OneHotEncoderEstimator` was created that produces an `OneHotEncoderModel` when fitting. For more detail, please see [SPARK-13030](https://issues.apache.org/jira/browse/SPARK-13030).
-
-`OneHotEncoder` has been deprecated in 2.3.0 and will be removed in 3.0.0. Please use [OneHotEncoderEstimator](ml-features.html#onehotencoderestimator) instead.
-
-## OneHotEncoderEstimator
+## OneHotEncoder
 
 [One-hot encoding](http://en.wikipedia.org/wiki/One-hot) maps a categorical feature, represented as a label index, to a binary vector with at most a single one-value indicating the presence of a specific feature value from among the set of all feature values. This encoding allows algorithms which expect continuous features, such as Logistic Regression, to use categorical features. For string type input data, it is common to encode categorical features using [StringIndexer](ml-features.html#stringindexer) first.
 
-`OneHotEncoderEstimator` can transform multiple columns, returning an one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using [VectorAssembler](ml-features.html#vectorassembler).
+`OneHotEncoder` can transform multiple columns, returning an one-hot-encoded output vector column for each input column. It is common to merge these vectors into a single feature vector using [VectorAssembler](ml-features.html#vectorassembler).
 
-`OneHotEncoderEstimator` supports the `handleInvalid` parameter to choose how to handle invalid input during transforming data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical index) and 'error' (throw an error).
+`OneHotEncoder` supports the `handleInvalid` parameter to choose how to handle invalid input during transforming data. Available options include 'keep' (any invalid inputs are assigned to an extra categorical index) and 'error' (throw an error).
 
 **Examples**
 
 <div class="codetabs">
 <div data-lang="scala" markdown="1">
 
-Refer to the [OneHotEncoderEstimator Scala docs](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoderEstimator) for more details on the API.
+Refer to the [OneHotEncoder Scala docs](api/scala/index.html#org.apache.spark.ml.feature.OneHotEncoder) for more details on the API.
 
-{% include_example scala/org/apache/spark/examples/ml/OneHotEncoderEstimatorExample.scala %}
+{% include_example scala/org/apache/spark/examples/ml/OneHotEncoderExample.scala %}
 </div>
 
 <div data-lang="java" markdown="1">
 
-Refer to the [OneHotEncoderEstimator Java docs](api/java/org/apache/spark/ml/feature/OneHotEncoderEstimator.html)
+Refer to the [OneHotEncoder Java docs](api/java/org/apache/spark/ml/feature/OneHotEncoder.html)
 for more details on the API.
 
-{% include_example java/org/apache/spark/examples/ml/JavaOneHotEncoderEstimatorExample.java %}
+{% include_example java/org/apache/spark/examples/ml/JavaOneHotEncoderExample.java %}
 </div>
 
 <div data-lang="python" markdown="1">
 
-Refer to the [OneHotEncoderEstimator Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoderEstimator) for more details on the API.
+Refer to the [OneHotEncoder Python docs](api/python/pyspark.ml.html#pyspark.ml.feature.OneHotEncoder) for more details on the API.
 
-{% include_example python/ml/onehot_encoder_estimator_example.py %}
+{% include_example python/ml/onehot_encoder_example.py %}
 </div>
 </div>
 

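The encoding and the two `handleInvalid` options described in the documentation text above can be illustrated in plain Python. This is a sketch of the semantics only, not Spark's implementation: `one_hot` is a hypothetical function, the output is a dense list rather than a Spark vector, and the `drop_last` flag is an assumption modeled on the encoder's `dropLast` parameter, which this diff does not show.

```python
def one_hot(index, num_categories, handle_invalid="error", drop_last=False):
    """Encode a label index as a list of 0/1 values (illustrative only).

    handle_invalid="error": raise on an index outside [0, num_categories).
    handle_invalid="keep":  reserve one extra trailing slot for invalid input.
    drop_last=True:         omit the final slot, so the last category
                            (or the invalid slot under "keep") is all zeros.
    """
    if handle_invalid not in ("error", "keep"):
        raise ValueError(f"unsupported handle_invalid: {handle_invalid!r}")

    valid = 0 <= index < num_categories
    size = num_categories

    if handle_invalid == "keep":
        size += 1  # extra categorical index for invalid inputs
        if not valid:
            index = num_categories
    elif not valid:
        raise ValueError(f"invalid category index: {index}")

    if drop_last:
        size -= 1

    vector = [0] * size
    if index < size:  # a dropped last slot leaves the vector all zeros
        vector[index] = 1
    return vector


print(one_hot(1, 3))
print(one_hot(2, 3, drop_last=True))
print(one_hot(5, 3, handle_invalid="keep"))
```

With `drop_last=True` the result contains at most one `1`, which matches the "at most a single one-value" wording in the documentation text.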
diff --git a/docs/ml-guide.md b/docs/ml-guide.md
@@ -104,6 +104,12 @@ MLlib is under active development.
 The APIs marked `Experimental`/`DeveloperApi` may change in future releases,
 and the migration guide below will explain all changes between releases.
 
+## From 2.4 to 3.0
+
+### Breaking changes
+
+* `OneHotEncoder`, which was deprecated in 2.3, is removed in 3.0, and `OneHotEncoderEstimator` is renamed to `OneHotEncoder`.
+
 ## From 2.2 to 2.3
 
 ### Breaking changes

diff --git a/...ml/JavaOneHotEncoderEstimatorExample.java → ...examples/ml/JavaOneHotEncoderExample.java b/...ml/JavaOneHotEncoderEstimatorExample.java → ...examples/ml/JavaOneHotEncoderExample.java
@@ -23,7 +23,7 @@
 import java.util.Arrays;
 import java.util.List;
 
-import org.apache.spark.ml.feature.OneHotEncoderEstimator;
+import org.apache.spark.ml.feature.OneHotEncoder;
 import org.apache.spark.ml.feature.OneHotEncoderModel;
 import org.apache.spark.sql.Dataset;
 import org.apache.spark.sql.Row;
@@ -34,11 +34,11 @@
 import org.apache.spark.sql.types.StructType;
 // $example off$
 
-public class JavaOneHotEncoderEstimatorExample {
+public class JavaOneHotEncoderExample {
   public static void main(String[] args) {
     SparkSession spark = SparkSession
       .builder()
-      .appName("JavaOneHotEncoderEstimatorExample")
+      .appName("JavaOneHotEncoderExample")
       .getOrCreate();
 
     // Note: categorical features are usually first encoded with StringIndexer
@@ -59,7 +59,7 @@ public static void main(String[] args) {
 
     Dataset<Row> df = spark.createDataFrame(data, schema);
 
-    OneHotEncoderEstimator encoder = new OneHotEncoderEstimator()
+    OneHotEncoder encoder = new OneHotEncoder()
       .setInputCols(new String[] {"categoryIndex1", "categoryIndex2"})
       .setOutputCols(new String[] {"categoryVec1", "categoryVec2"});
 

diff --git a/...on/ml/onehot_encoder_estimator_example.py → .../main/python/ml/onehot_encoder_example.py b/...on/ml/onehot_encoder_estimator_example.py → .../main/python/ml/onehot_encoder_example.py
@@ -18,14 +18,14 @@
 from __future__ import print_function
 
 # $example on$
-from pyspark.ml.feature import OneHotEncoderEstimator
+from pyspark.ml.feature import OneHotEncoder
 # $example off$
 from pyspark.sql import SparkSession
 
 if __name__ == "__main__":
     spark = SparkSession\
         .builder\
-        .appName("OneHotEncoderEstimatorExample")\
+        .appName("OneHotEncoderExample")\
         .getOrCreate()
 
     # Note: categorical features are usually first encoded with StringIndexer
@@ -39,8 +39,8 @@
         (2.0, 0.0)
     ], ["categoryIndex1", "categoryIndex2"])
 
-    encoder = OneHotEncoderEstimator(inputCols=["categoryIndex1", "categoryIndex2"],
-                                     outputCols=["categoryVec1", "categoryVec2"])
+    encoder = OneHotEncoder(inputCols=["categoryIndex1", "categoryIndex2"],
+                            outputCols=["categoryVec1", "categoryVec2"])
     model = encoder.fit(df)
     encoded = model.transform(df)
     encoded.show()

diff --git a/...es/ml/OneHotEncoderEstimatorExample.scala → ...rk/examples/ml/OneHotEncoderExample.scala b/...es/ml/OneHotEncoderEstimatorExample.scala → ...rk/examples/ml/OneHotEncoderExample.scala
@@ -19,15 +19,15 @@
 package org.apache.spark.examples.ml
 
 // $example on$
-import org.apache.spark.ml.feature.OneHotEncoderEstimator
+import org.apache.spark.ml.feature.OneHotEncoder
 // $example off$
 import org.apache.spark.sql.SparkSession
 
-object OneHotEncoderEstimatorExample {
+object OneHotEncoderExample {
   def main(args: Array[String]): Unit = {
     val spark = SparkSession
       .builder
-      .appName("OneHotEncoderEstimatorExample")
+      .appName("OneHotEncoderExample")
       .getOrCreate()
 
     // Note: categorical features are usually first encoded with StringIndexer
@@ -41,7 +41,7 @@ object OneHotEncoderEstimatorExample {
       (2.0, 0.0)
     )).toDF("categoryIndex1", "categoryIndex2")
 
-    val encoder = new OneHotEncoderEstimator()
+    val encoder = new OneHotEncoder()
       .setInputCols(Array("categoryIndex1", "categoryIndex2"))
       .setOutputCols(Array("categoryVec1", "categoryVec2"))
     val model = encoder.fit(df)