
[SPARK-24489][ML]Check for invalid input type of weight data in ml.PowerIterationClustering #21509

Closed · wants to merge 1 commit

Conversation

shahidki31
Contributor

What changes were proposed in this pull request?

The test case below produces the following failure; ml.PIC currently has no check on the data type of the weight column.

```
test("invalid input types for weight") {
  val invalidWeightData = spark.createDataFrame(Seq(
    (0L, 1L, "a"),
    (2L, 3L, "b")
  )).toDF("src", "dst", "weight")

  val pic = new PowerIterationClustering()
    .setWeightCol("weight")

  val result = pic.assignClusters(invalidWeightData)
}
```
```
Job aborted due to stage failure: Task 0 in stage 8077.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8077.0 (TID 882, localhost, executor driver): scala.MatchError: [0,1,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
	at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
	at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
	at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
```

This PR adds a type check for the weight column.
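For readers outside the diff, the merged change boils down to validating the weight column's type up front. A minimal sketch of a `checkNumericType`-style helper, in a simplified, assumed form rather than the exact SchemaUtils code:

```scala
import org.apache.spark.sql.types._

// Sketch of a numeric-type check for a weight column, in the style of
// Spark's SchemaUtils.checkNumericType; simplified, assumed form rather
// than the exact code merged in this PR.
def checkNumericType(schema: StructType, colName: String): Unit = {
  val actual = schema(colName).dataType
  require(actual.isInstanceOf[NumericType],
    s"Column $colName must be of type NumericType but was actually of type $actual.")
}
```

With a check like this in place, a `StringType` weight column fails fast with a clear message instead of a `scala.MatchError` deep inside GraphX.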

How was this patch tested?

Unit test added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@shahidki31 shahidki31 changed the title Check for invalid input type of weight data in ml.PowerIterationClustering [24489]Check for invalid input type of weight data in ml.PowerIterationClustering Jun 7, 2018
@shahidki31 shahidki31 changed the title [24489]Check for invalid input type of weight data in ml.PowerIterationClustering [SPARK-24489]Check for invalid input type of weight data in ml.PowerIterationClustering Jun 7, 2018
@@ -166,6 +166,7 @@ class PowerIterationClustering private[clustering] (
val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) {
lit(1.0)
} else {
SchemaUtils.checkColumnTypes(dataset.schema, $(weightCol), Seq(FloatType, DoubleType))
Contributor
shall we check if it is a NumericType? An integer column with value 1 is a valid input as well and if someone is using it, this may introduce a regression.
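To illustrate the concern, this is the kind of input the reviewer has in mind: a sketch reusing the column names (and the SparkSession `spark`) from the test above.

```scala
// An integer-typed weight column is meaningful input; a check limited to
// FloatType and DoubleType would start rejecting frames like this one.
// Sketch only: assumes the same SparkSession `spark` as the test above.
val intWeightData = spark.createDataFrame(Seq(
  (0L, 1L, 1),
  (2L, 3L, 2)
)).toDF("src", "dst", "weight")
```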

Contributor Author

@shahidki31 shahidki31 Jun 8, 2018

@mgaido91 Yes. To be consistent with the test case "supported input types" and the previous PR (a471880), I was checking only for `FloatType` and `DoubleType` for the similarity column. I have modified the code.

@shahidki31 shahidki31 changed the title [SPARK-24489]Check for invalid input type of weight data in ml.PowerIterationClustering [SPARK-24489][ML]Check for invalid input type of weight data in ml.PowerIterationClustering Jun 8, 2018
Contributor

@holdenk holdenk left a comment
Thanks for working on this; surfacing improved error messages earlier is great. One question about the function used to check types :)

@@ -166,6 +166,8 @@ class PowerIterationClustering private[clustering] (
val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) {
lit(1.0)
} else {
SchemaUtils.checkColumnTypes(dataset.schema, $(weightCol), Seq(FloatType, DoubleType,
Contributor
There is a built-in function called checkNumericType which I think might be what you want to use (unless we don't support decimal input types here)?
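The practical difference between the two approaches, sketched with assumed simplified forms (not the actual SchemaUtils signatures):

```scala
import org.apache.spark.sql.types._

// Simplified, assumed forms of the two checks under discussion:
// an explicit whitelist rejects integer and decimal weights,
def passesWhitelist(dt: DataType): Boolean =
  Seq[DataType](FloatType, DoubleType).contains(dt)

// while a NumericType check accepts any numeric column.
def passesNumeric(dt: DataType): Boolean =
  dt.isInstanceOf[NumericType]
```

`IntegerType` or `DecimalType(10, 2)` passes the numeric check but fails the whitelist, which is the regression concern raised earlier in the thread.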

Contributor Author
Thanks @holdenk. I have updated accordingly; kindly review.

@SparkQA

SparkQA commented Dec 28, 2018

Test build #4489 has finished for PR 21509 at commit c33d2ec.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@shahidki31
Contributor Author

Jenkins, retest this please

@holdenk
Contributor

holdenk commented Jan 4, 2019

Jenkins retest this please.

@holdenk
Contributor

holdenk commented Jan 4, 2019

LGTM pending Jenkins.

@SparkQA

SparkQA commented Jan 4, 2019

Test build #100748 has finished for PR 21509 at commit b5e6deb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Jan 7, 2019

Thanks for working on this, merged to master :)

@asfgit asfgit closed this in 71183b2 Jan 7, 2019
@shahidki31
Contributor Author

Thanks a lot @holdenk

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…owerIterationClustering


Closes apache#21509 from shahidki31/testCasePic.

Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Holden Karau <holden@pigscanfly.ca>