
[SPARK-20493][R] De-duplicate parse logics for DDL-like type strings in R #17785

Closed
HyukjinKwon wants to merge 4 commits into apache:master from HyukjinKwon:SPARK-20493

Conversation

HyukjinKwon (Member) commented Apr 27, 2017

What changes were proposed in this pull request?

It seems we are using SQLUtils.getSQLDataType to parse the type string in structField. It looks like we can replace this with CatalystSqlParser.parseDataType.

Both accept similar DDL-like type definitions, as below:

scala> Seq(Tuple1(Tuple1("a"))).toDF.show()
+---+
| _1|
+---+
|[a]|
+---+
scala> Seq(Tuple1(Tuple1("a"))).toDF.select($"_1".cast("struct<_1:string>")).show()
+---+
| _1|
+---+
|[a]|
+---+
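
For reference, below is the parser this change delegates to, checked directly in spark-shell (an illustrative check; the exact REPL output formatting is assumed from Spark 2.x):

scala> import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser

scala> CatalystSqlParser.parseDataType("struct<_1:string>")
res0: org.apache.spark.sql.types.DataType = StructType(StructField(_1,StringType,true))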

The same type string behaves identically on the R side, as below:

> write.df(sql("SELECT named_struct('_1', 'a') as struct"), "/tmp/aa", "parquet")
> collect(read.df("/tmp/aa", "parquet", structType(structField("struct", "struct<_1:string>"))))
  struct
1      a

The R side is stricter, because we validate the type string with regular expressions on the R side beforehand.

The actual parsing logics differ a bit, but since we validate the type string on the R side first, I believe replacing it introduces no behaviour changes. To make sure of this, dedicated tests were added in SPARK-20105. (It looks like structField is the only place that calls this method.)

How was this patch tested?

Existing tests at https://github.com/apache/spark/blob/master/R/pkg/inst/tests/testthat/test_sparkSQL.R#L143-L194 should cover this.

HyukjinKwon (Author)

(I will cc related reviewers after the tests have passed.)

SparkQA commented Apr 27, 2017

Test build #76228 has finished for PR 17785 at commit 8874de1.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

SparkQA commented Apr 27, 2017

Test build #76230 has finished for PR 17785 at commit b802e36.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

HyukjinKwon reopened this Apr 27, 2017

HyukjinKwon (Author)

cc @felixcheung, @hvanhovell and @gatorsmile. Could you take a look please?

felixcheung (Member) left a comment

I guess we target master (and not branch-2.2) for this change?

R/pkg/R/utils.R Outdated
@@ -864,6 +864,14 @@ captureJVMException <- function(e, method) {
   # Extract the first message of JVM exception.
   first <- strsplit(msg[2], "\r?\n\tat")[[1]][1]
   stop(paste0(rmsg, "no such table - ", first), call. = FALSE)
+ } else if (any(grep("org.apache.spark.sql.catalyst.parser.ParseException: ", stacktrace))) {
+   msg <- strsplit(stacktrace, "org.apache.spark.sql.catalyst.parser.ParseException: ",
+     fixed = TRUE)[[1]]
felixcheung (Member):
indent

 def createStructField(name: String, dataType: String, nullable: Boolean): StructField = {
-  val dtObj = getSQLDataType(dataType)
+  val dtObj = CatalystSqlParser.parseDataType(dataType)
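
(For context, the whole method after this change would read roughly as below; a minimal sketch, with the rest of SQLUtils omitted. It can be pasted into spark-shell as-is:)

import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.types.StructField

def createStructField(name: String, dataType: String, nullable: Boolean): StructField = {
  // Delegate the DDL-like type string to the shared Catalyst parser
  // instead of the hand-rolled getSQLDataType.
  val dtObj = CatalystSqlParser.parseDataType(dataType)
  StructField(name, dtObj, nullable)
}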
felixcheung (Member):
Haven't checked myself - what are the differences, if any, between getSQLDataType and CatalystSqlParser.parseDataType?

felixcheung (Member):
Is it, though? From the description:

"The R side is stricter, because we validate the type string with regular expressions on the R side beforehand."

HyukjinKwon (Author) commented Apr 28, 2017:

To my knowledge, getSQLDataType supports the types below:

binary
boolean
byte
character
date
double
float
integer
logical
numeric
raw
string
timestamp
array<...>
struct<...>
map<...>

and these look case-sensitive, whereas parseDataType supports ...

bigint
binary
boolean
byte
char
date
decimal
double
float
int
integer
long
short
smallint
string
timestamp
tinyint
varchar
array<...>
struct<...>
map<...>

and these look case-insensitive.
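
A quick way to see the case-insensitivity in spark-shell (an illustrative check; output formatting assumed from Spark 2.x):

scala> CatalystSqlParser.parseDataType("INT")
res0: org.apache.spark.sql.types.DataType = IntegerType

scala> CatalystSqlParser.parseDataType("Struct<_1:String>")
res1: org.apache.spark.sql.types.DataType = StructType(StructField(_1,StringType,true))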

I think the initial intention for getSQLDataType was to support R native type strings too, but those paths look unreachable now, because we check the type strings in checkType on the R side before getSQLDataType is actually called.

If the type does not satisfy !is.null(PRIMITIVE_TYPES[[type]]) (a case-sensitive lookup), it looks like an error is thrown. The accepted types are:

bigint
binary
boolean
byte
date
decimal
double
float
int
integer
smallint
string
timestamp
tinyint
array<...>
map<...>
struct<...>

In short, I think there should be no behaviour change for the types below (the intersection between getSQLDataType and parseDataType) ...

binary
string
double
float
boolean
timestamp
date
integer
byte
array<...>
map<...>
struct<...>

and these should be case-sensitive.

Additionally, we will now support the types below (which are listed in R's PRIMITIVE_TYPES but which getSQLDataType did not support before):

tinyint
smallint
int
bigint
decimal

Before

> structField("_col", "tinyint")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type tinyint
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "smallint")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type smallint
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "int")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type int
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "bigint")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type bigint
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "decimal")
  ...
  java.lang.IllegalArgumentException: Invalid type decimal
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...

After

> structField("_col", "tinyint")
StructField(name = "_col", type = "ByteType", nullable = TRUE)
> structField("_col", "smallint")
StructField(name = "_col", type = "ShortType", nullable = TRUE)
> structField("_col", "int")
StructField(name = "_col", type = "IntegerType", nullable = TRUE)
> structField("_col", "bigint")
StructField(name = "_col", type = "LongType", nullable = TRUE)
> structField("_col", "decimal")
StructField(name = "_col", type = "DecimalType(10,0)", nullable = TRUE)

HyukjinKwon (Author):

I wrote up the details above as best I could. And yes, I think this should target master, not branch-2.2.

felixcheung (Member):

Thanks for looking into it. If I take the diff,

character
logical
numeric
raw

these are actually R native type names, and if I had to guess, it was intentional that structField supports R native types as well as Scala/Spark types.

I'm not sure how much coverage we have for something like this, but is that still going to work with this change?

HyukjinKwon (Author) commented Apr 28, 2017:

Yea; however, for those types we can't actually create the field, because the checkType check fails since they are not among the keys of PRIMITIVE_TYPES, as below:

> structField("_col", "character")
Error in checkType(type) : Unsupported type for SparkDataframe: character
> structField("_col", "logical")
Error in checkType(type) : Unsupported type for SparkDataframe: logical
> structField("_col", "numeric")
Error in checkType(type) : Unsupported type for SparkDataframe: numeric
> structField("_col", "raw")
Error in checkType(type) : Unsupported type for SparkDataframe: raw

I double-checked that this is the only place where we call getSQLDataType, so those paths look unreachable (I hope you can double-check this when you have some time, just in case I missed something).

felixcheung (Member):

I see - then I suppose this was "broken" when checkType was added, over the past 2 years or so.
I'm OK with this change then - could you please check that we have an R test for each of the cases in checkType?

HyukjinKwon (Author):

Sure. Probably the 12 cases in test_sparkSQL.R#L143-L194 plus the 5 cases in #17785 (comment), 17 cases in total, will cover checkType above.

Let me double-check them and leave a comment here today.

HyukjinKwon (Author):

Let me add positive cases for ...

bigint
tinyint
int
smallint
decimal

and negative cases (see the quick check after the list) for ...

short
varchar
long
char
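
(The negative cases matter because CatalystSqlParser itself accepts some of these; it is the R-side checkType that rejects them. For illustration, in spark-shell - output formatting assumed from Spark 2.x:)

scala> CatalystSqlParser.parseDataType("short")
res0: org.apache.spark.sql.types.DataType = ShortType

scala> CatalystSqlParser.parseDataType("long")
res1: org.apache.spark.sql.types.DataType = LongType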

SparkQA commented Apr 28, 2017

Test build #76255 has finished for PR 17785 at commit 257e625.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

felixcheung (Member)

re: your title De-deuplicate -> De-duplicate

HyukjinKwon changed the title from "[SPARK-20493][R] De-deuplicate parse logics for DDL-like type strings in R" to "[SPARK-20493][R] De-duplicate parse logics for DDL-like type strings in R" on Apr 28, 2017
felixcheung (Member) left a comment

LGTM if we check as per #17785 (comment)

And AppVeyor passes

SparkQA commented Apr 28, 2017

Test build #76270 has finished for PR 17785 at commit e9d672a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

felixcheung (Member) left a comment

LGTM

felixcheung (Member)

merged to master, thanks!

asfgit closed this in 70f1bcd on Apr 29, 2017
HyukjinKwon (Author)

Thank you for checking this closely with me and merging it @felixcheung.

HyukjinKwon deleted the SPARK-20493 branch on January 2, 2018