[SPARK-20493][R] De-duplicate parse logics for DDL-like type strings in R #17785

Closed · wants to merge 4 commits
8 changes: 8 additions & 0 deletions R/pkg/R/utils.R
@@ -864,6 +864,14 @@ captureJVMException <- function(e, method) {
# Extract the first message of JVM exception.
first <- strsplit(msg[2], "\r?\n\tat")[[1]][1]
stop(paste0(rmsg, "no such table - ", first), call. = FALSE)
} else if (any(grep("org.apache.spark.sql.catalyst.parser.ParseException: ", stacktrace))) {
msg <- strsplit(stacktrace, "org.apache.spark.sql.catalyst.parser.ParseException: ",
fixed = TRUE)[[1]]
# Extract "Error in ..." message.
rmsg <- msg[1]
# Extract the first message of JVM exception.
first <- strsplit(msg[2], "\r?\n\tat")[[1]][1]
stop(paste0(rmsg, "parse error - ", first), call. = FALSE)
} else {
stop(stacktrace, call. = FALSE)
}
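
For illustration, here is roughly how the new branch surfaces to an R caller. This is a minimal sketch mirroring the updated test in test_utils.R; it assumes an active SparkR backend, and the exact JVM message text may vary:

# Hedged sketch: trigger the ParseException path via an unparsable type string.
tryCatch(
  callJStatic("org.apache.spark.sql.api.r.SQLUtils", "createStructField",
              "col", "unknown", TRUE),
  error = function(e) captureJVMException(e, "createStructField"))
# Expected to stop with a message along the lines of:
#   Error in createStructField : parse error - DataType unknown is not supported.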
13 changes: 11 additions & 2 deletions R/pkg/inst/tests/testthat/test_sparkSQL.R
@@ -150,7 +150,12 @@ test_that("structField type strings", {
binary = "BinaryType",
boolean = "BooleanType",
timestamp = "TimestampType",
date = "DateType")
date = "DateType",
tinyint = "ByteType",
smallint = "ShortType",
int = "IntegerType",
bigint = "LongType",
decimal = "DecimalType(10,0)")

complexTypes <- list("map<string,integer>" = "MapType(StringType,IntegerType,true)",
"array<string>" = "ArrayType(StringType,true)",
@@ -174,7 +179,11 @@
numeric = "numeric",
character = "character",
raw = "raw",
logical = "logical")
logical = "logical",
short = "short",
varchar = "varchar",
long = "long",
char = "char")

complexErrors <- list("map<string, integer>" = " integer",
"array<String>" = "String",
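
To illustrate what the new cases assert: with the parser-based path, the DDL-style type strings above map directly to Catalyst types, while strings such as "short" or "varchar" are still expected to error out. A sketch assuming a SparkR session (the printed form follows the structField output shown later in this thread):

structField("_col", "tinyint")   # StructField(name = "_col", type = "ByteType", nullable = TRUE)
structField("_col", "bigint")    # StructField(name = "_col", type = "LongType", nullable = TRUE)
structField("_col", "decimal")   # StructField(name = "_col", type = "DecimalType(10,0)", nullable = TRUE)
structField("_col", "short")     # expected to error: "short" is not an accepted type string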
6 changes: 3 additions & 3 deletions R/pkg/inst/tests/testthat/test_utils.R
@@ -167,13 +167,13 @@ test_that("convertToJSaveMode", {
})

test_that("captureJVMException", {
method <- "getSQLDataType"
method <- "createStructField"
expect_error(tryCatch(callJStatic("org.apache.spark.sql.api.r.SQLUtils", method,
"unknown"),
"col", "unknown", TRUE),
error = function(e) {
captureJVMException(e, method)
}),
"Error in getSQLDataType : illegal argument - Invalid type unknown")
"parse error - .*DataType unknown.*not supported.")
})

test_that("hashCode", {
43 changes: 2 additions & 41 deletions sql/core/src/main/scala/org/apache/spark/sql/api/r/SQLUtils.scala
@@ -31,6 +31,7 @@ import org.apache.spark.broadcast.Broadcast
import org.apache.spark.rdd.RDD
import org.apache.spark.sql._
import org.apache.spark.sql.catalyst.expressions.GenericRowWithSchema
import org.apache.spark.sql.catalyst.parser.CatalystSqlParser
import org.apache.spark.sql.execution.command.ShowTablesCommand
import org.apache.spark.sql.internal.StaticSQLConf.CATALOG_IMPLEMENTATION
import org.apache.spark.sql.types._
@@ -92,48 +93,8 @@ private[sql] object SQLUtils extends Logging {
def r: Regex = new Regex(sc.parts.mkString, sc.parts.tail.map(_ => "x"): _*)
}

def getSQLDataType(dataType: String): DataType = {
dataType match {
case "byte" => org.apache.spark.sql.types.ByteType
case "integer" => org.apache.spark.sql.types.IntegerType
case "float" => org.apache.spark.sql.types.FloatType
case "double" => org.apache.spark.sql.types.DoubleType
case "numeric" => org.apache.spark.sql.types.DoubleType
case "character" => org.apache.spark.sql.types.StringType
case "string" => org.apache.spark.sql.types.StringType
case "binary" => org.apache.spark.sql.types.BinaryType
case "raw" => org.apache.spark.sql.types.BinaryType
case "logical" => org.apache.spark.sql.types.BooleanType
case "boolean" => org.apache.spark.sql.types.BooleanType
case "timestamp" => org.apache.spark.sql.types.TimestampType
case "date" => org.apache.spark.sql.types.DateType
case r"\Aarray<(.+)${elemType}>\Z" =>
org.apache.spark.sql.types.ArrayType(getSQLDataType(elemType))
case r"\Amap<(.+)${keyType},(.+)${valueType}>\Z" =>
if (keyType != "string" && keyType != "character") {
throw new IllegalArgumentException("Key type of a map must be string or character")
}
org.apache.spark.sql.types.MapType(getSQLDataType(keyType), getSQLDataType(valueType))
case r"\Astruct<(.+)${fieldsStr}>\Z" =>
if (fieldsStr(fieldsStr.length - 1) == ',') {
throw new IllegalArgumentException(s"Invalid type $dataType")
}
val fields = fieldsStr.split(",")
val structFields = fields.map { field =>
field match {
case r"\A(.+)${fieldName}:(.+)${fieldType}\Z" =>
createStructField(fieldName, fieldType, true)

case _ => throw new IllegalArgumentException(s"Invalid type $dataType")
}
}
createStructType(structFields)
case _ => throw new IllegalArgumentException(s"Invalid type $dataType")
}
}

def createStructField(name: String, dataType: String, nullable: Boolean): StructField = {
val dtObj = getSQLDataType(dataType)
val dtObj = CatalystSqlParser.parseDataType(dataType)

Member:
Haven't checked myself - what are the differences, if any, between getSQLDataType and CatalystSqlParser.parseDataType?

Member:

Is it that R's one is stricter, because we are checking the types via regular expressions on the R side ahead of time?

Member Author (@HyukjinKwon, Apr 28, 2017):
To my knowledge, getSQLDataType supports the types below:

binary
boolean
byte
character
date
double
float
integer
logical
numeric
raw
string
timestamp
array<...>
struct<...>
map<...>

and these appear to require exact (case-sensitive) matches, whereas parseDataType supports ...

bigint
binary
boolean
byte
char
date
decimal
double
float
int
integer
long
short
smallint
string
timestamp
tinyint
varchar
array<...>
struct<...>
map<...>

and these appear to be case-insensitive.

I think the initial intention for getSQLDataType was to support R type string conversions too, but those paths look like unreachable code now because we check the type strings in checkType before actually calling getSQLDataType.

If a type does not satisfy !is.null(PRIMITIVE_TYPES[[type]]) (case-sensitive), it looks like an error is thrown. The accepted keys are:

bigint
binary
boolean
byte
date
decimal
double
float
int
integer
smallint
string
timestamp
tinyint
array<...>
map<...>
struct<...>

In short, I think there should not be a behaviour change for the types below (the intersection between getSQLDataType and parseDataType) ...

binary
string
double
float
boolean
timestamp
date
integer
byte
array<...>
map<...>
struct<...>

and these should be case-sensitive.

Additionally, we will support the types below (which are listed in R's PRIMITIVE_TYPES but which getSQLDataType did not support before):
tinyint
smallint
int
bigint
decimal

Before

> structField("_col", "tinyint")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type tinyint
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "smallint")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type smallint
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "int")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type int
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "bigint")
...
Error in handleErrors(returnStatus, conn) :
  java.lang.IllegalArgumentException: Invalid type bigint
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...
> structField("_col", "decimal")
  ...
  java.lang.IllegalArgumentException: Invalid type decimal
	at org.apache.spark.sql.api.r.SQLUtils$.getSQLDataType(SQLUtils.scala:131)
	at org.apache.spark.sql.api.r.SQLUtils$.createStructField(SQLUtils.scala:136)
	at org.apache.spark.sql.api.r.SQLUtils.createStructField(SQLUtils.scala)
	...

After

> structField("_col", "tinyint")
StructField(name = "_col", type = "ByteType", nullable = TRUE)>
> structField("_col", "smallint")
StructField(name = "_col", type = "ShortType", nullable = TRUE)>
> structField("_col", "int")
StructField(name = "_col", type = "IntegerType", nullable = TRUE)>
> structField("_col", "bigint")
StructField(name = "_col", type = "LongType", nullable = TRUE)>
> structField("_col", "decimal")
StructField(name = "_col", type = "DecimalType(10,0)", nullable = TRUE)>

Member Author:
I wrote up the details about this as best I could. Yes, I think this should be targeted to master, not 2.2.

Member:
Thanks for looking into it. If I take the diff,

character
logical
numeric
raw

these are actually R native type names; if I have to guess, it was intentional that structField supports R native types as well as Scala/Spark types.

I'm not sure how much coverage we have for something like this, but will that still work with this change?

Member Author (@HyukjinKwon, Apr 28, 2017):

Yea; however, for those types we can't create the field anyway, because the check via checkType fails since they are not among the keys in PRIMITIVE_TYPES, as below:

> structField("_col", "character")
Error in checkType(type) : Unsupported type for SparkDataframe: character
> structField("_col", "logical")
Error in checkType(type) : Unsupported type for SparkDataframe: logical
> structField("_col", "numeric")
Error in checkType(type) : Unsupported type for SparkDataframe: numeric
> structField("_col", "raw")
Error in checkType(type) : Unsupported type for SparkDataframe: raw

I double-checked that this is the only place where we call getSQLDataType, so those branches look unreachable (I hope you could double-check this when you have some time, just in case I missed something).

Member:
I see - then I suppose this was "broken" when checkType was added, over the past 2 years or so.
I'm OK with this change then - could you please check that we have an R test for each of the cases in checkType?

Member Author:
Sure. Probably the 12 cases in test_sparkSQL.R#L143-L194 plus the 5 cases in #17785 (comment), 17 cases in total, will cover checkType above.

Let me double-check them and leave a comment here today.

Member Author:
Let me add positive cases for ...

bigint
tinyint
int
smallint
decimal

and negative cases for ...

short
varchar
long
char

StructField(name, dtObj, nullable)
}
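
Complex DDL-like strings should keep working through the new path as well, since CatalystSqlParser.parseDataType understands the same array/map/struct syntax the regex-based code handled. A hedged sketch from the R side, matching the existing complexTypes expectations in test_sparkSQL.R:

structField("_col", "array<string>")        # type "ArrayType(StringType,true)"
structField("_col", "map<string,integer>")  # type "MapType(StringType,IntegerType,true)"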
