
[SPARK-24489][ML]Check for invalid input type of weight data in ml.PowerIterationClustering #21509

Closed · wants to merge 1 commit

Conversation

shahidki31
Contributor

What changes were proposed in this pull request?

The test case below produces the following failure; ml.PIC currently has no check on the data type of the weight column.

```
test("invalid input types for weight") {
  val invalidWeightData = spark.createDataFrame(Seq(
    (0L, 1L, "a"),
    (2L, 3L, "b")
  )).toDF("src", "dst", "weight")

  val pic = new PowerIterationClustering()
    .setWeightCol("weight")

  val result = pic.assignClusters(invalidWeightData)
}
```
```
Job aborted due to stage failure: Task 0 in stage 8077.0 failed 1 times, most recent failure: Lost task 0.0 in stage 8077.0 (TID 882, localhost, executor driver): scala.MatchError: [0,1,null] (of class org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema)
	at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
	at org.apache.spark.ml.clustering.PowerIterationClustering$$anonfun$3.apply(PowerIterationClustering.scala:178)
	at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
	at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
	at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
	at scala.collection.Iterator$class.foreach(Iterator.scala:893)
	at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
	at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
	at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
	at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:847)
```

This PR adds a type check for the weight column.
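For readers outside the diff, the merged change boils down to validating the weight column's type up front. A minimal sketch of a `checkNumericType`-style helper, in a simplified, assumed form rather than the exact SchemaUtils code:

```scala
import org.apache.spark.sql.types._

// Sketch of a numeric-type check for a weight column, in the style of
// Spark's SchemaUtils.checkNumericType; simplified, assumed form rather
// than the exact code merged in this PR.
def checkNumericType(schema: StructType, colName: String): Unit = {
  val actual = schema(colName).dataType
  require(actual.isInstanceOf[NumericType],
    s"Column $colName must be of type NumericType but was actually of type $actual.")
}
```

With a check like this in place, a `StringType` weight column fails fast with a clear message instead of a `scala.MatchError` deep inside GraphX.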

How was this patch tested?

Unit test added.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@shahidki31 shahidki31 changed the title Check for invalid input type of weight data in ml.PowerIterationClustering [24489]Check for invalid input type of weight data in ml.PowerIterationClustering Jun 7, 2018
@shahidki31 shahidki31 changed the title [24489]Check for invalid input type of weight data in ml.PowerIterationClustering [SPARK-24489]Check for invalid input type of weight data in ml.PowerIterationClustering Jun 7, 2018
@@ -166,6 +166,7 @@ class PowerIterationClustering private[clustering] (
val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) {
lit(1.0)
} else {
SchemaUtils.checkColumnTypes(dataset.schema, $(weightCol), Seq(FloatType, DoubleType))
Contributor
shall we check if it is a NumericType? An integer column with value 1 is a valid input as well and if someone is using it, this may introduce a regression.
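To illustrate the concern, this is the kind of input the reviewer has in mind: a sketch reusing the column names (and the SparkSession `spark`) from the test above.

```scala
// An integer-typed weight column is meaningful input; a check limited to
// FloatType and DoubleType would start rejecting frames like this one.
// Sketch only: assumes the same SparkSession `spark` as the test above.
val intWeightData = spark.createDataFrame(Seq(
  (0L, 1L, 1),
  (2L, 3L, 2)
)).toDF("src", "dst", "weight")
```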

Contributor Author

@shahidki31 shahidki31 Jun 8, 2018

@mgaido91 Yes. To be consistent with the test case "supported input types" and the previous PR (a471880), I was checking only for `FloatType` and `DoubleType` for the similarity column. I have modified the code.

@shahidki31 shahidki31 changed the title [SPARK-24489]Check for invalid input type of weight data in ml.PowerIterationClustering [SPARK-24489][ML]Check for invalid input type of weight data in ml.PowerIterationClustering Jun 8, 2018
Contributor

@holdenk holdenk left a comment
Thanks for working on this; surfacing improved error messages earlier is great. One question about the function used to check types :)

@@ -166,6 +166,8 @@ class PowerIterationClustering private[clustering] (
val w = if (!isDefined(weightCol) || $(weightCol).isEmpty) {
lit(1.0)
} else {
SchemaUtils.checkColumnTypes(dataset.schema, $(weightCol), Seq(FloatType, DoubleType,
Contributor
There is a built-in function called checkNumericType which I think might be what you want to use (unless we don't support decimal input types here)?
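The practical difference between the two approaches, sketched with assumed simplified forms (not the actual SchemaUtils signatures):

```scala
import org.apache.spark.sql.types._

// Simplified, assumed forms of the two checks under discussion:
// an explicit whitelist rejects integer and decimal weights,
def passesWhitelist(dt: DataType): Boolean =
  Seq[DataType](FloatType, DoubleType).contains(dt)

// while a NumericType check accepts any numeric column.
def passesNumeric(dt: DataType): Boolean =
  dt.isInstanceOf[NumericType]
```

`IntegerType` or `DecimalType(10, 2)` passes the numeric check but fails the whitelist, which is the regression concern raised earlier in the thread.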

Contributor Author
Thanks @holdenk. I have updated accordingly; kindly review.

@SparkQA

SparkQA commented Dec 28, 2018

Test build #4489 has finished for PR 21509 at commit c33d2ec.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@shahidki31
Contributor Author

Jenkins, retest this please

@holdenk
Contributor

holdenk commented Jan 4, 2019

Jenkins retest this please.

@holdenk
Contributor

holdenk commented Jan 4, 2019

LGTM pending Jenkins.

@SparkQA

SparkQA commented Jan 4, 2019

Test build #100748 has finished for PR 21509 at commit b5e6deb.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@holdenk
Contributor

holdenk commented Jan 7, 2019

Thanks for working on this, merged to master :)

@asfgit asfgit closed this in 71183b2 Jan 7, 2019
@shahidki31
Contributor Author

Thanks a lot @holdenk

jackylee-ch pushed a commit to jackylee-ch/spark that referenced this pull request Feb 18, 2019
…owerIterationClustering


Closes apache#21509 from shahidki31/testCasePic.

Authored-by: Shahid <shahidki31@gmail.com>
Signed-off-by: Holden Karau <holden@pigscanfly.ca>