[SPARK-43063][SQL][FOLLOWUP] Add ToPrettyString expression for Dataset.show #40922
Conversation
I think the PR still introduces user-facing changes.

We can, by duplicating the code of …

I suppose it is fine to have changes in `Cast`. Would it be possible to check the example queries in my comment in #40699 and see what results they return? Or let me know when this is ready for review, and I can check out the PR and try those examples on my machine.

Does this PR need #40699? I was under the assumption that we had to revert the original patch and have another solution instead.

Most of the changes in #40699 are updating tests, and we still need them as we don't revert the behavior change. It's ready for review now.
```diff
@@ -67,6 +67,8 @@ object StringUtils extends Logging {
     "(?s)" + out.result() // (?s) enables dotall mode, causing "." to match new lines
   }

+  def getHexString(bytes: Array[Byte]): String = bytes.map("%02X".format(_)).mkString("[", " ", "]")
```
It seems all functions in this file have javadoc. Shall we also add one for this method?
```scala
SchemaUtils.escapeMetaCharacters(cell.toString)
}
// Escapes meta-characters not to break the `showString` format
val str = SchemaUtils.escapeMetaCharacters(cell.toString)
```
This could potentially throw a `NullPointerException` if `cell` is null. This was handled in the original code with a match-case statement. We could replace it with `s"$cell"` or handle null separately.
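For reference, a tiny standalone illustration of the two behaviors:

```scala
val cell: Any = null
// cell.toString       // would throw NullPointerException
val s = s"$cell"       // "null": the s-interpolator renders null as the string "null"
```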
This should never be null, as `ToPrettyString.nullable` is false. I can add an assert to enforce it.
```scala
override def eval(input: InternalRow): Any = {
  val v = child.eval(input)
  if (v == null) UTF8String.fromString("NULL") else castFunc(v)
}
```
I think this should be `UTF8String.fromString(nullString)`.
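Applied to the snippet above, the suggestion would look something like this (a sketch of the suggested change, not the merged code):

```scala
override def eval(input: InternalRow): Any = {
  val v = child.eval(input)
  // use the configurable nullString instead of a hard-coded "NULL"
  if (v == null) UTF8String.fromString(nullString) else castFunc(v)
}
```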
```scala
|${childCode.code}
|UTF8String ${ev.value};
|if (${childCode.isNull}) {
|  ${ev.value} = UTF8String.fromString("NULL");
```
Same here, `nullString` instead of `"NULL"`.
```scala
override protected def useDecimalPlainString: Boolean = true

override protected def useHexFormatForBinary: Boolean = true
```
Will this be applied to the Thrift server, or will it still keep the string representation?
I don't think so. The Thrift server still calls `HiveResult` to generate strings. @yikf can you confirm?
Sorry for the late reply; I was busy last week. Yes, the `spark-sql` CLI and the Thrift server still call `hiveResultString` to generate the string.
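For context, here is roughly what those two overrides mean for `df.show()` output; the table below is illustrative rather than copied from a run:

```scala
// Binary renders as bracketed hex ('A' = 0x41, 'B' = 0x42), and decimals
// render in plain rather than scientific notation. Output shape approximate.
spark.sql("SELECT CAST('AB' AS BINARY) AS b").show()
// +-------+
// |      b|
// +-------+
// |[41 42]|
// +-------+
```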
```scala
/**
 * Returns a pretty string of the byte array which prints each byte as a hex digit and adds
 * spaces between them. For example, [1A C0].
 */
def getHexString(bytes: Array[Byte]): String = bytes.map("%02X".format(_)).mkString("[", " ", "]")
```
Any reason to use uppercase here? FYI, JDK 17 has `HexFormat`, and its default is the lowercase characters "0-9", "a-f".
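For comparison, a small sketch of both formats (assuming JDK 17 for `java.util.HexFormat`):

```scala
import java.util.HexFormat

val bytes = Array[Byte](0x1A, 0xC0.toByte)

// Existing Spark helper style: uppercase, space-delimited, bracketed.
bytes.map("%02X".format(_)).mkString("[", " ", "]")  // "[1A C0]"

// JDK 17 HexFormat defaults to lowercase "0-9", "a-f".
HexFormat.ofDelimiter(" ").formatHex(bytes)          // "1a c0"
```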
Oh, I missed that this is an existing behavior.
Thanks for the review, merging to master!

+1, late LGTM
Closes apache#40922 from cloud-fan/pretty.

Authored-by: Wenchen Fan <wenchen@databricks.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
```scala
builder.append(keyToUTF8String(keyArray.get(0, kt)).asInstanceOf[UTF8String])
builder.append(" ->")
if (valueArray.isNullAt(0)) {
  if (nullString.nonEmpty) builder.append(nullString)
```
Sorry for the late review. This seems to cause a bug, because previously we had the following:

```scala
if (!legacyCastToStr) builder.append(" NULL")
```

In other words, we had a space.

```
scala> spark.version
res9: String = 3.4.0

scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
| {k -> null}|
+------------+
```

Now, the master branch shows the following. There is no space between `->` and `NULL`.

```
scala> spark.version
res9: String = 3.5.0-SNAPSHOT

scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
| {k ->NULL}|
+------------+
```
I made a follow-up for this.
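A standalone sketch of what the fix amounts to (inferred from the BEFORE/AFTER below, not copied from the actual patch):

```scala
// The null marker needs a leading space, mirroring the existing " ->" append.
val builder = new StringBuilder("{k ->")
val nullString = "NULL"
if (nullString.nonEmpty) builder.append(" " + nullString) // was: builder.append(nullString)
println(builder.append("}")) // {k -> NULL}
```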
### What changes were proposed in this pull request?

This is a follow-up of #40922. This PR aims to add a space between `->` and the value. It seems to have been missed because the original PR already has the same code pattern in another place:
https://github.com/apache/spark/blob/74b04eeffdc4765f56fe3a9e97165b15ed4e2c73/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/ToStringBase.scala#L114

### Why are the changes needed?

**BEFORE**
```
scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
| {k ->NULL}|
+------------+
```

**AFTER**
```
scala> sql("select map('k', null)").show()
+------------+
|map(k, NULL)|
+------------+
| {k -> NULL}|
+------------+
```

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

Manual review.

Closes #41432 from dongjoon-hyun/SPARK-43063.

Authored-by: Dongjoon Hyun <dongjoon@apache.org>
Signed-off-by: Hyukjin Kwon <gurwls223@apache.org>
The following are the changes needed:

* PySpark 3.5 has deprecated support for Python 3.7. This required changes to the Delta test infra to install the appropriate Python version and other packages. The `Dockerfile` used for running tests is also updated to have the required Python version and packages, and uses the same base image as the PySpark test infra in Apache Spark.
* `StructType.toAttributes` and `StructType.fromAttributes` methods are moved into a utility class, `DataTypeUtils`.
* The `iceberg` module is disabled, as there is no released version of `iceberg` that works with Spark 3.5 yet.
* Remove the URI path hack used in `DMLWithDeletionVectorsHelper` to get around a bug in Spark 3.4.
* Remove the unrelated tutorial in `delta/examples/tutorials/saiseu19`.
* Test failure fixes:
  * `org.apache.spark.sql.delta.DeltaHistoryManagerSuite` - the error message has changed.
  * `org.apache.spark.sql.delta.DeltaOptionSuite` - the Parquet file name using the LZ4 codec has changed due to apache/parquet-java#1000 in the `parquet-mr` dependency.
  * `org.apache.spark.sql.delta.deletionvectors.DeletionVectorsSuite` - Parquet now generates a `row-index` whenever the `_metadata` column is selected; however, Spark 3.5 has a bug where a row group containing more than 2bn rows fails. For now, don't return any `row-index` column in `_metadata`, by overriding `metadataSchemaFields: Seq[StructField]` in `DeltaParquetFileFormat`.
  * `org.apache.spark.sql.delta.perf.OptimizeMetadataOnlyDeltaQuerySuite` - a behavior change from apache/spark#40922: in Spark plans, a new function called `ToPrettyString` is used instead of `cast(aggExpr TO STRING)` for `Dataset.show()` usage.
  * `org.apache.spark.sql.delta.DeltaCDCStreamDeletionVectorSuite` and `org.apache.spark.sql.delta.DeltaCDCStreamSuite` - a regression in the Spark 3.5 RC, fixed by apache/spark#42774 before the Spark 3.5 release.

Closes delta-io#1986

GitOrigin-RevId: b0e4a81b608a857e45ecba71b070309347616a30
What changes were proposed in this pull request?

This is a followup of #40699 to avoid changing the `Cast` behavior. It pulls the cast-to-string code out into a base trait and adds a new expression, `ToPrettyString`, that extends this trait with a little customization. It also handles binary values inside array/struct/map so that they, too, print in hex format, for `df.show` only, not for `Cast`.
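As a rough sketch of the refactoring shape described above (the names mirror the PR, but the bodies are simplified stand-ins, not the actual Spark source):

```scala
// Simplified stand-in for the pattern this PR describes, not the real Spark code:
// shared cast-to-string logic lives in a base trait, and Cast / ToPrettyString
// each override a few hooks.
trait ToStringBaseSketch {
  protected def nullString: String             // text used for null values/elements
  protected def useHexFormatForBinary: Boolean // hex vs. raw bytes for binary

  def render(v: Any): String = v match {
    case null => nullString
    case b: Array[Byte] if useHexFormatForBinary =>
      b.map("%02X".format(_)).mkString("[", " ", "]")
    case b: Array[Byte] => new String(b, "UTF-8")
    case other => other.toString
  }
}

// Cast-like rendering: legacy lowercase "null", raw binary.
object CastSketch extends ToStringBaseSketch {
  protected def nullString = "null"
  protected def useHexFormatForBinary = false
}

// ToPrettyString-like rendering for df.show(): "NULL" and hex binary.
object PrettySketch extends ToStringBaseSketch {
  protected def nullString = "NULL"
  protected def useHexFormatForBinary = true
}

// PrettySketch.render(Array[Byte](0x41, 0x42)) == "[41 42]"
// CastSketch.render(null)                      == "null"
```

This keeps `Cast` untouched while letting `df.show` customize the null text and binary rendering.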
Why are the changes needed?

To avoid a behavior change.

Does this PR introduce any user-facing change?

It changes back the behavior of casting array/map/struct to string regarding null elements. It was `null`, then changed to `NULL` in #40699, and is `null` again after this PR.

How was this patch tested?

Existing tests.