[SPARK-25252][SQL] Support arrays of any types by to_json #22226

Closed
MaxGekk wants to merge 13 commits

Conversation

MaxGekk
Member

@MaxGekk MaxGekk commented Aug 24, 2018

What changes were proposed in this pull request?

In the PR, I propose to extend to_json to support any type as the element type of input arrays. This should allow converting arrays of primitive types as well as arrays of arrays. For example:

select to_json(array('1','2','3'))
> ["1","2","3"]
select to_json(array(array(1,2,3),array(4)))
> [[1,2,3],[4]]
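
For illustration, the same conversions through the DataFrame API would look roughly like this (a minimal sketch assuming a running SparkSession named `spark`; the expected output reflects the behaviour proposed here):

```scala
import org.apache.spark.sql.functions.{array, lit, to_json}

// Minimal sketch of the proposed behaviour via the DataFrame API
// (assumes a SparkSession named `spark`).
val df = spark.range(1)

// Array of primitive (string) elements.
df.select(to_json(array(lit("1"), lit("2"), lit("3")))).show(false)
// Expected with this change: ["1","2","3"]

// Array of arrays of integers.
df.select(to_json(array(array(lit(1), lit(2), lit(3)), array(lit(4))))).show(false)
// Expected with this change: [[1,2,3],[4]]
```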

How was this patch tested?

Added a couple of SQL tests for arrays of primitive types and arrays of arrays. I also added a round-trip test from_json -> to_json (a rough sketch of the idea is below).
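
A minimal sketch of such a round trip (hypothetical snippet, not the exact test added in this PR; assumes a SparkSession named `spark` and its implicits):

```scala
import org.apache.spark.sql.functions.{from_json, to_json}
import org.apache.spark.sql.types.{ArrayType, IntegerType}
import spark.implicits._  // assumes a SparkSession named `spark`

// Hypothetical round-trip check: parse a JSON array with from_json, then
// serialize it back with to_json and compare against the original string.
val df = Seq("[1,2,3]").toDF("json")
val roundTripped = df.select(
  to_json(from_json($"json", ArrayType(IntegerType))).as("json"))
roundTripped.show(false)
// Expected with this change: [1,2,3]
```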

@SparkQA

SparkQA commented Aug 24, 2018

Test build #95227 has finished for PR 22226 at commit e9a5f50.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@MaxGekk
Member Author

MaxGekk commented Aug 24, 2018

@HyukjinKwon Please have a look at the PR.

@SparkQA

SparkQA commented Aug 25, 2018

Test build #95235 has finished for PR 22226 at commit c1755de.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gatorsmile
Member

cc @dongjoon-hyun Could you review this PR?

@maropu
Member

maropu commented Aug 27, 2018

@MaxGekk BTW, why did you attach this PR to the resolved JIRA? Is it a follow-up?

@MaxGekk
Member Author

MaxGekk commented Aug 27, 2018

@maropu The JIRA ticket was about both to_json and from_json originally.

@@ -65,6 +66,8 @@ private[sql] class JacksonGenerator(
(arr: SpecializedGetters, i: Int) => {
writeObject(writeMapData(arr.getMap(i), mt, mapElementWriter))
}
case _ => throw new UnsupportedOperationException(
s"Initial type ${dataType.catalogString} must be an array, a struct or a map")
Member

nit: s"Initial type ${dataType.catalogString} must be an array, a struct or a map") -> s"Initial type ${dataType.catalogString} must be an ${ArrayType.simpleString}, a ${StructType.simpleString} or a ${MapType.simpleString}")?

Member Author

Changed here and above

def verifyType(name: String, dataType: DataType): Unit = dataType match {
case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType |
DoubleType | StringType | TimestampType | DateType | BinaryType | _: DecimalType =>
def verifyType(name: String, dataType: DataType): Unit = dataType match {
Member

private?

Member Author

I extracted it so it can be used outside.
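
For readers following the thread, a rough sketch of what such a standalone type check can look like (illustrative only; the leaf types mirror the diff above, but this is not the exact Spark code):

```scala
import org.apache.spark.sql.types._

// Illustrative sketch: accept JSON-representable leaf types and recurse into
// arrays, maps and structs; reject anything else.
def verifyType(name: String, dataType: DataType): Unit = dataType match {
  case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType |
       FloatType | DoubleType | StringType | TimestampType | DateType |
       BinaryType | _: DecimalType =>
    // supported as-is
  case at: ArrayType => verifyType(name, at.elementType)
  case mt: MapType => verifyType(name, mt.valueType)
  case st: StructType => st.fields.foreach(f => verifyType(f.name, f.dataType))
  case _ =>
    throw new UnsupportedOperationException(
      s"Unable to convert column $name of type ${dataType.catalogString} to JSON.")
}
```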

-- to_json - array type
select to_json(array('1','2','3'));
select to_json(array(array(1,2,3),array(4)));

Member

super nit: add a space after `,`, e.g., select to_json(array('1', '2', '3'));

@maropu
Member

maropu commented Aug 27, 2018

Probably, it'd be better to file a separate JIRA for each function.

@HyukjinKwon
Member

+1 for separate JIRA.

@MaxGekk MaxGekk changed the title [SPARK-24391][SQL] Support arrays of any types by to_json [SPARK-25252][SQL] Support arrays of any types by to_json Aug 27, 2018
@MaxGekk
Member Author

MaxGekk commented Aug 27, 2018

Probably, it'd be better to file a separate JIRA for each function.
+1 for separate JIRA.

I created the JIRA ticket: https://issues.apache.org/jira/browse/SPARK-25252

@SparkQA

SparkQA commented Aug 27, 2018

Test build #95291 has finished for PR 22226 at commit 906a301.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -613,8 +613,7 @@ case class JsonToStructs(
}

/**
* Converts a [[StructType]], [[ArrayType]] of [[StructType]]s, [[MapType]]
* or [[ArrayType]] of [[MapType]]s to a json output string.
* Converts a [[StructType]], [[ArrayType]] or [[MapType]] to a json output string.
Member

not a big deal, but let's write "JSON" while we are here

into a JSON string. Throws an exception, in the case of an unsupported type.

:param col: name of column containing the struct, array of the structs, the map or
array of the maps.
:param col: name of column containing a struct, an array or a map.
:param options: options to control converting. accepts the same options as the json datasource
Member

ditto

@HyukjinKwon
Member

Seems okay but I or someone else should take a closer look before getting this in.

@SparkQA

SparkQA commented Aug 28, 2018

Test build #95331 has finished for PR 22226 at commit e91a33c.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Aug 28, 2018

Test build #95347 has finished for PR 22226 at commit 2c3a641.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -28,7 +27,7 @@ import org.apache.spark.sql.catalyst.util.{ArrayData, DateTimeUtils, MapData}
import org.apache.spark.sql.types._

/**
* `JackGenerator` can only be initialized with a `StructType` or a `MapType`.
* `JackGenerator` can only be initialized with a `StructType`, a `MapType` ot `ArrayType`.
Member

typo: ot

#' \code{to_json}: Converts a column containing a \code{structType}, array of \code{structType},
#' a \code{mapType} or array of \code{mapType} into a Column of JSON string.
#' \code{to_json}: Converts a column containing a \code{structType}, a \code{mapType}
#' or an array into a Column of JSON string.
Member

does \code{arrayType} work here?

Member Author

I am not sure

Member

It should.
Could we add some tests for this in R?

Member

Let's add one simple Python doctest as well.

Member Author

I added tests for Python and R. Please take a look at them.

require(dataType.isInstanceOf[StructType] || dataType.isInstanceOf[MapType],
// `JackGenerator` can only be initialized with a `StructType`, a `MapType` or a `ArrayType`.
require(dataType.isInstanceOf[StructType] || dataType.isInstanceOf[MapType]
|| dataType.isInstanceOf[ArrayType],
s"JacksonGenerator only supports to be initialized with a ${StructType.simpleString} " +
Member

maybe we need a `,` between ${StructType.simpleString} and ${MapType.simpleString}?

* Throws an exception, in the case of an unsupported type.
*
* @param e a column containing a struct or array of the structs.
* @param e a column containing a struct, a array or a map.
Member

typo: a array -> an array

@@ -469,4 +469,53 @@ class JsonFunctionsSuite extends QueryTest with SharedSQLContext {

checkAnswer(sql("""select json[0] from jsonTable"""), Seq(Row(null)))
}

test("to_json - array of primitive type") {
Member

typo: primitive type -> primitive types

@SparkQA

SparkQA commented Aug 31, 2018

Test build #95496 has finished for PR 22226 at commit 2f73c9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA

SparkQA commented Sep 1, 2018

Test build #95575 has finished for PR 22226 at commit ce3b17c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon HyukjinKwon left a comment

Looks good. Let me take another look before getting this in.

}

// `ValueWriter` for array data storing rows of the schema.
private lazy val arrElementWriter: ValueWriter = dataType match {
case at: ArrayType => makeWriter(at.elementType)
case st: StructType =>
Member

Can we do `case _: StructType | _: MapType => makeWriter(dataType)`?
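
In other words, the suggestion has roughly the following shape (a sketch of the member inside JacksonGenerator, reusing its existing dataType, makeWriter and ValueWriter; not a standalone snippet):

```scala
// Sketch of the suggested simplification: a single branch for struct and map
// that delegates to makeWriter on the full dataType, instead of separate cases.
private lazy val arrElementWriter: ValueWriter = dataType match {
  case at: ArrayType => makeWriter(at.elementType)
  case _: StructType | _: MapType => makeWriter(dataType)
  case _ => throw new UnsupportedOperationException(
    s"Initial type ${dataType.catalogString} must be an array, a struct or a map")
}
```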

def verifyType(name: String, dataType: DataType): Unit = dataType match {
case NullType | BooleanType | ByteType | ShortType | IntegerType | LongType | FloatType |
DoubleType | StringType | TimestampType | DateType | BinaryType | _: DecimalType =>
def verifyType(name: String, dataType: DataType): Unit = dataType match {
Member

We can do:

def verifyType(name: String, dataType: DataType): Unit = {
  dataType match {
    case ...
  }
}

to reduce the diff.

@SparkQA

SparkQA commented Sep 3, 2018

Test build #95622 has finished for PR 22226 at commit 72d2628.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

case mt: MapType => mt
case ArrayType(mt: MapType, _) => mt
}
lazy val rowSchema = child.dataType
Member

nit: rowSchema -> inputSchema. I named this rowSchema because it was always the schema of the row itself. Now it seems it can be other types as well.

Member

Let's make it val.

Member Author

I tried to remove lazy and got many errors on tests like:

Invalid call to dataType on unresolved object, tree: 'a
org.apache.spark.sql.catalyst.analysis.UnresolvedException: Invalid call to dataType on unresolved object, tree: 'a
	at org.apache.spark.sql.catalyst.analysis.UnresolvedAttribute.dataType(unresolved.scala:105)
	at org.apache.spark.sql.catalyst.expressions.StructsToJson.<init>(jsonExpressions.scala:665)

If you don't mind, I will keep it lazy.
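
For context, a simplified illustration of why the laziness matters (hypothetical class, not the real StructsToJson): an unresolved child attribute throws from dataType, so the schema must only be read after the plan is analyzed.

```scala
import org.apache.spark.sql.catalyst.expressions.Expression
import org.apache.spark.sql.types.DataType

// Hypothetical, simplified expression: reading child.dataType eagerly in the
// constructor throws UnresolvedException while `child` is still an unresolved
// attribute ('a); a lazy val defers the read until the attribute is resolved.
case class StructsToJsonLike(child: Expression) {
  lazy val inputSchema: DataType = child.dataType
  // val inputSchema: DataType = child.dataType  // eager: fails for unresolved children
}
```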

case ArrayType(_: MapType, _) =>
(arr: Any) =>
gen.write(arr.asInstanceOf[ArrayData])
getAndReset()
}
}

override def dataType: DataType = StringType

override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
Member

nit: child.dataType -> inputSchema

}
}

override def dataType: DataType = StringType

override def checkInputDataTypes(): TypeCheckResult = child.dataType match {
case _: StructType | ArrayType(_: StructType, _) =>
case _: StructType =>
Member

nit: bind it, e.g. case struct: StructType =>, and use it instead of rowSchema.asInstanceOf[StructType].

try {
JacksonUtils.verifySchema(rowSchema.asInstanceOf[StructType])
TypeCheckResult.TypeCheckSuccess
} catch {
case e: UnsupportedOperationException =>
TypeCheckResult.TypeCheckFailure(e.getMessage)
}
case _: MapType | ArrayType(_: MapType, _) =>
case _: MapType =>
Member

nit: case mapType: MapType => and use it below likewise.

@@ -685,33 +679,29 @@ case class StructsToJson(
(row: Any) =>
Member

child.dataType -> inputSchema

#' \code{to_json}: Converts a column containing a \code{structType}, array of \code{structType},
#' a \code{mapType} or array of \code{mapType} into a Column of JSON string.
#' \code{to_json}: Converts a column containing a \code{structType}, a \code{mapType}
#' or an array into a Column of JSON string.
Member

\code{arrayType}. It seems to have been missed.

@HyukjinKwon HyukjinKwon left a comment

LGTM otherwise

@SparkQA

SparkQA commented Sep 4, 2018

Test build #95671 has finished for PR 22226 at commit 90c9687.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@HyukjinKwon
Member

Merged to master.

@asfgit asfgit closed this in d749d03 Sep 6, 2018
@MaxGekk MaxGekk deleted the to_json-array branch August 17, 2019 13:33