[SPARK-24489][ML]Check for invalid input type of weight data in ml.PowerIterationClustering #21509
Conversation
```
@@ -166,6 +166,7 @@ class PowerIterationClustering private[clustering] (
     val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) {
       lit(1.0)
     } else {
+      SchemaUtils.checkColumnTypes(dataset.schema, $(weightCol), Seq(FloatType, DoubleType))
```
Shall we check if it is a NumericType instead? An integer column with value 1 is a valid input as well, and if someone is using one, this change may introduce a regression.
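The concern can be illustrated with a small, self-contained sketch (plain Scala, not Spark's actual `SchemaUtils`; the type names merely mimic Spark's): a strict Float/Double whitelist rejects an integer weight column that a generic numeric check would still accept.

```scala
// Simplified stand-ins for Spark SQL data types (hypothetical, for illustration only).
sealed trait DataType
case object IntegerType extends DataType
case object LongType extends DataType
case object FloatType extends DataType
case object DoubleType extends DataType
case object StringType extends DataType

// Strict whitelist, as in the first version of this patch.
def isAllowed(actual: DataType, allowed: Seq[DataType]): Boolean =
  allowed.contains(actual)

// Broader "numeric" check, as suggested in the review.
def isNumeric(dt: DataType): Boolean = dt match {
  case IntegerType | LongType | FloatType | DoubleType => true
  case _ => false
}

// An integer weight column of all 1s was previously valid input:
val weightColType: DataType = IntegerType
println(isAllowed(weightColType, Seq(FloatType, DoubleType))) // false -> would regress
println(isNumeric(weightColType))                             // true  -> still accepted
```

Under the whitelist, the integer column is rejected even though it was accepted before the patch; the numeric check keeps it working.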
Thanks for working on this; surfacing improved error messages earlier is great. One question around the function used to check types :)
```
@@ -166,6 +166,8 @@ class PowerIterationClustering private[clustering] (
     val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) {
       lit(1.0)
     } else {
+      SchemaUtils.checkColumnTypes(dataset.schema, $(weightCol), Seq(FloatType, DoubleType,
```
There is a built-in function called checkNumericType, which I think might be what you want to use (unless we don't support decimal input types here)?
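As a hedged sketch of what a checkNumericType-style helper buys over enumerating types (again plain Scala, not Spark's real `SchemaUtils.checkNumericType`; the error message below only imitates Spark's style): matching on a common numeric supertype also admits decimal columns without listing them explicitly.

```scala
// Hypothetical model of a numeric type hierarchy (illustration only).
sealed trait DataType
sealed trait NumericType extends DataType
case object IntegerType extends NumericType
case object FloatType extends NumericType
case object DoubleType extends NumericType
final case class DecimalType(precision: Int, scale: Int) extends NumericType
case object StringType extends DataType

// One match arm accepts every numeric subtype, including DecimalType.
def checkNumericType(colName: String, dt: DataType): Unit = dt match {
  case _: NumericType => () // accepted
  case other =>
    throw new IllegalArgumentException(
      s"Column $colName must be of type NumericType but was actually $other.")
}

checkNumericType("weight", DecimalType(10, 2)) // passes
// checkNumericType("weight", StringType)      // would throw IllegalArgumentException
```

The whitelist version would have had to add DecimalType (and every precision/scale combination is one type constructor) by hand; the supertype match gets it for free.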
Thanks @holdenk. I have updated accordingly; kindly review.
Test build #4489 has finished for PR 21509 at commit
Force-pushed the branch from c33d2ec to f529798.
Force-pushed the branch from f529798 to b5e6deb.
Jenkins, retest this please.
Jenkins retest this please.
LGTM pending Jenkins.
Test build #100748 has finished for PR 21509 at commit
Thanks for working on this, merged to master :)
Thanks a lot @holdenk
[SPARK-24489][ML] Check for invalid input type of weight data in ml.PowerIterationClustering

What changes were proposed in this pull request?

Currently in ml.PIC there is no check for the data type of the weight column, so the following test case fails:

```
test("invalid input types for weight") {
  val invalidWeightData = spark.createDataFrame(Seq(
    (0L, 1L, "a"),
    (2L, 3L, "b")
  )).toDF("src", "dst", "weight")
  val pic = new PowerIterationClustering()
    .setWeightCol("weight")
  val result = pic.assignClusters(invalidWeightData)
}
```

```
Job aborted due to stage failure: Task 0 in stage 8077.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8077.0 (TID 882, localhost, executor driver): scala.MatchError: [0,1,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
	at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
	at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
	at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
```

This PR adds a type check for the weight column.

How was this patch tested?

UT added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

Closes apache#21509 from shahidki31/testCasePic.

Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Holden Karau <holden@pigscanfly.ca>
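The behavioral change described above can be sketched with a small self-contained example (plain Scala, no Spark; all names here are hypothetical stand-ins): validating the weight column up front replaces a MatchError deep inside the job with an immediate, descriptive error.

```scala
// Hypothetical miniature of PIC's input handling (illustration only).
final case class Edge(src: Long, dst: Long, weight: Any)

def assignClusters(edges: Seq[Edge], checkTypesFirst: Boolean): Seq[(Long, Long, Double)] = {
  if (checkTypesFirst) {
    // Fail fast with a clear message, analogous to the schema check added in this PR.
    require(edges.forall(_.weight.isInstanceOf[Double]),
      "Column weight must be of numeric type.")
  }
  // Without the up-front check, a bad weight only surfaces here as a
  // scala.MatchError, mirroring the stack trace in the PR description.
  edges.map { case Edge(s, d, w: Double) => (s, d, w) }
}
```

With `checkTypesFirst = true`, a string weight raises an IllegalArgumentException immediately at the call site; with `false`, the same input fails later, mid-computation, with an opaque MatchError.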