[SPARK-25252][SQL] Support arrays of any types by to_json #22226
Conversation
Test build #95227 has finished for PR 22226 at commit
@HyukjinKwon Please have a look at the PR.
Test build #95235 has finished for PR 22226 at commit
cc @dongjoon-hyun Could you review this PR?
@MaxGekk btw, why did you attach this PR to the resolved JIRA? Is it a follow-up?
@maropu The JIRA ticket was about both
@@ -65,6 +66,8 @@ private[sql] class JacksonGenerator(
     (arr: SpecializedGetters, i: Int) => {
       writeObject(writeMapData(arr.getMap(i), mt, mapElementWriter))
     }
+  case _ => throw new UnsupportedOperationException(
+    s"Initial type ${dataType.catalogString} must be an array, a struct or a map")
nit: s"Initial type ${dataType.catalogString} must be an array, a struct or a map")
-> s"Initial type ${dataType.catalogString} must be an ${ArrayType.simpleString}, a ${StructType.simpleString} or a ${MapType.simpleString}")
?
Changed here and above
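The new fallback branch above makes the generator reject any root type other than an array, a struct or a map. A minimal plain-Python sketch of that dispatch idea (all names here are invented for illustration; Spark's JacksonGenerator works on DataType subclasses, not string tags):

```python
import json

def make_root_writer(data_type):
    # data_type is a simplified schema tag such as "array", "struct" or "map".
    # Any other root type is rejected up front, mirroring the new `case _ =>`.
    if data_type in ("array", "struct", "map"):
        return lambda value: json.dumps(value, separators=(",", ":"))
    raise TypeError(
        f"Initial type {data_type} must be an array, a struct or a map")

writer = make_root_writer("array")
print(writer([1, 2, 3]))  # -> [1,2,3]
```

Failing fast on the root type keeps the per-element writers simple, since they can assume a supported container.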
def verifyType(name: String, dataType: DataType): Unit = dataType match {
  case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType |
    DoubleType | StringType | TimestampType | DateType | BinaryType | _: DecimalType =>
def verifyType(name: String, dataType: DataType): Unit = dataType match {
private?
I extracted it to use it outside.
-- to_json - array type
select to_json(array('1','2','3'));
select to_json(array(array(1,2,3),array(4)));
super nit: add space after `,`, e.g., select to_json(array('1', '2', '3'));
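For reference, the JSON text such queries are expected to produce has the same shape as a compact JSON serialization of the corresponding values. A plain-Python illustration (not Spark; Spark's Jackson output formatting is assumed to be compact):

```python
import json

# Compact serialization approximating the shape of to_json's output.
compact = lambda v: json.dumps(v, separators=(",", ":"))

print(compact(["1", "2", "3"]))   # -> ["1","2","3"]
print(compact([[1, 2, 3], [4]]))  # -> [[1,2,3],[4]]
```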
Probably, you'd better file a separate JIRA for each function.
+1 for separate JIRA.
I created the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25252
Test build #95291 has finished for PR 22226 at commit
@@ -613,8 +613,7 @@ case class JsonToStructs(
 }

 /**
- * Converts a [[StructType]], [[ArrayType]] of [[StructType]]s, [[MapType]]
- * or [[ArrayType]] of [[MapType]]s to a json output string.
+ * Converts a [[StructType]], [[ArrayType]] or [[MapType]] to a json output string.
not a big deal but JSON, while we are here
python/pyspark/sql/functions.py
Outdated
 into a JSON string. Throws an exception, in the case of an unsupported type.

-:param col: name of column containing the struct, array of the structs, the map or
-    array of the maps.
+:param col: name of column containing a struct, an array or a map.
 :param options: options to control converting. accepts the same options as the json datasource
ditto
Seems okay but I or someone else should take a closer look before getting this in.
Test build #95331 has finished for PR 22226 at commit
Test build #95347 has finished for PR 22226 at commit
@@ -28,7 +27,7 @@ import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData}
 import org.apache.spark.sql.types._

 /**
- * `JackGenerator` can only be initialized with a `StructType` or a `MapType`.
+ * `JackGenerator` can only be initialized with a `StructType`, a `MapType` ot `ArrayType`.
typo: ot
R/pkg/R/functions.R
Outdated
-#' \code{to_json}: Converts a column containing a \code{structType}, array of \code{structType},
-#' a \code{mapType} or array of \code{mapType} into a Column of JSON string.
+#' \code{to_json}: Converts a column containing a \code{structType}, a \code{mapType}
+#' or an array into a Column of JSON string.
does \code{arrayType} work here?
I am not sure
it should
could we add some tests for this in R?
Let's add one simple python doctest as well
I added tests for Python and R. Please take a look at them.
-require(dataType.isInstanceOf[StructType] || dataType.isInstanceOf[MapType],
+// `JackGenerator` can only be initialized with a `StructType`, a `MapType` or a `ArrayType`.
+require(dataType.isInstanceOf[StructType] || dataType.isInstanceOf[MapType]
+  || dataType.isInstanceOf[ArrayType],
   s"JacksonGenerator only supports to be initialized with a ${StructType.simpleString} " +
maybe need a `,` between ${StructType.simpleString} and ${MapType.simpleString}?
 * Throws an exception, in the case of an unsupported type.
 *
-* @param e a column containing a struct or array of the structs.
+* @param e a column containing a struct, a array or a map.
typo: a array -> an array
@@ -469,4 +469,53 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {

   checkAnswer(sql("""select json[0] from jsonTable"""), Seq(Row(null)))
 }

+test("to_json - array of primitive type") {
typo: primitive type -> primitive types
Test build #95496 has finished for PR 22226 at commit
Test build #95575 has finished for PR 22226 at commit
Looks good. Let me take another look before getting this in.
}

// `ValueWriter` for array data storing rows of the schema.
private lazy val arrElementWriter: ValueWriter = dataType match {
  case at: ArrayType => makeWriter(at.elementType)
  case st: StructType =>
Can we do case _: StructType | _: MapType => makeWriter(dataType)?
def verifyType(name: String, dataType: DataType): Unit = dataType match {
  case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType |
    DoubleType | StringType | TimestampType | DateType | BinaryType | _: DecimalType =>
def verifyType(name: String, dataType: DataType): Unit = dataType match {
We can do:
def verifyType(name: String, dataType: DataType): Unit = {
dataType match {
case ...
}
}
to reduce the diff.
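The `verifyType` being discussed walks a schema recursively and raises for unsupported types. A hypothetical plain-Python sketch of that recursion (the tuple-based schema tags here are invented for illustration; Spark matches on DataType subclasses instead):

```python
# Atomic types that need no further recursion, echoing the big pattern
# match in the diff above.
ATOMIC = {"null", "boolean", "byte", "short", "integer", "long", "float",
          "double", "string", "timestamp", "date", "binary", "decimal"}

def verify_type(name, data_type):
    # data_type is either a string tag, ("array", elem),
    # ("map", key, value), or ("struct", [(field_name, field_type), ...]).
    if isinstance(data_type, str):
        if data_type in ATOMIC:
            return
        raise TypeError(f"Unable to convert column {name} of type {data_type}")
    tag = data_type[0]
    if tag == "array":
        verify_type(name, data_type[1])
    elif tag == "map":
        verify_type(name, data_type[1])
        verify_type(name, data_type[2])
    elif tag == "struct":
        for field_name, field_type in data_type[1]:
            verify_type(field_name, field_type)
    else:
        raise TypeError(f"Unable to convert column {name} of type {tag}")

verify_type("a", ("array", ("array", "integer")))  # arrays of arrays pass
```

Making the verification recursive is what lets arrays of any supported element type (including nested arrays) pass the check.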
Test build #95622 has finished for PR 22226 at commit
-  case mt: MapType => mt
-  case ArrayType(mt: MapType, _) => mt
-}
+lazy val rowSchema = child.dataType
nit: rowSchema -> inputSchema. I named this rowSchema because it was always the schema for the row itself. Now, it seems it can be other types as well.
Let's make it val.
I tried to remove lazy and got many errors on tests like:
Invalid call to dataType on unresolved object, tree: 'a
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'a
at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
at org.apache.spark.sql.catalyst.expressions.StructsToJson.&lt;init&gt;(jsonExpressions.scala:665)
If you don't mind, I will keep it lazy.
case ArrayType(_: MapType, _) =>
  (arr: Any) =>
    gen.write(arr.asInstanceOf[ArrayData])
    getAndReset()
  }
}

override def dataType: DataType = StringType

override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
nit: child.dataType -> inputSchema
  }
}

override def dataType: DataType = StringType

override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
-  case _: StructType | ArrayType(_: StructType, _) =>
+  case _: StructType =>
nit: case _: StructType and use it instead of rowSchema.asInstanceOf[StructType].
try {
  JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
  TypeCheckResult.TypeCheckSuccess
} catch {
  case e: UnsupportedOperationException =>
    TypeCheckResult.TypeCheckFailure(e.getMessage)
}
-  case _: MapType | ArrayType(_: MapType, _) =>
+  case _: MapType =>
nit: case mapType: MapType => and use it below likewise.
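The checkInputDataTypes code quoted above follows a simple pattern: run schema verification and turn any thrown UnsupportedOperationException into a type-check failure result. A hedged Python sketch of that pattern (all names invented; Spark returns TypeCheckResult objects, not tuples):

```python
def check_input_data_types(schema, verify):
    # Run verification; success yields a success marker, a raised error
    # is converted into a failure result carrying the message.
    try:
        verify(schema)
        return ("success", None)
    except TypeError as e:
        return ("failure", str(e))

def verify(schema):
    # Toy verifier: only container root types are accepted.
    if schema not in ("struct", "map", "array"):
        raise TypeError(f"{schema} is not supported")

print(check_input_data_types("array", verify))   # -> ('success', None)
print(check_input_data_types("int", verify)[0])  # -> failure
```

Converting the exception into a result value lets the analyzer report a clean error message instead of propagating the exception.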
@@ -685,33 +679,29 @@ case class StructsToJson(
 (row: Any) =>
child.dataType -> inputSchema
R/pkg/R/functions.R
Outdated
-#' \code{to_json}: Converts a column containing a \code{structType}, array of \code{structType},
-#' a \code{mapType} or array of \code{mapType} into a Column of JSON string.
+#' \code{to_json}: Converts a column containing a \code{structType}, a \code{mapType}
+#' or an array into a Column of JSON string.
\code{arrayType}. It seems to have been missed.
LGTM otherwise
Test build #95671 has finished for PR 22226 at commit
Merged to master.
What changes were proposed in this pull request?

In the PR, I propose to extend to_json and support any types as element types of input arrays. It should allow converting arrays of primitive types and arrays of arrays.

How was this patch tested?

Added a couple of SQL tests for arrays of primitive types and of arrays. Also I added a round-trip test from_json -> to_json.
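The round-trip property the tests check can be illustrated outside Spark with plain JSON parsing and serialization (a sketch only, not the from_json/to_json implementation):

```python
import json

def round_trip(json_text):
    # Parse (from_json analogue) then serialize back compactly
    # (to_json analogue); a lossless round trip returns the input.
    return json.dumps(json.loads(json_text), separators=(",", ":"))

assert round_trip('[1,2,3]') == '[1,2,3]'
assert round_trip('[[1,2],[3]]') == '[[1,2],[3]]'
```

For Spark itself, the round trip is to_json(from_json(col, schema)) over a column of JSON strings; with the array support added by this PR, the schema may now be an array of any supported element type.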