[SPARK-49334][SQL] `str_to_map` should check whether the collation values of all parameter types are the same #47825
Conversation
It looks reasonable. cc @uros-db, you touched the expression last time; please take a look at the changes.
```scala
override def checkInputDataTypes(): TypeCheckResult = {
  val defaultCheck = super.checkInputDataTypes()
  if (defaultCheck.isFailure) {
    defaultCheck
  } else if (!TypeCoercion.haveSameType(children.map(_.dataType))) {
    DataTypeMismatch(
      errorSubClass = "DATA_DIFF_TYPES",
      messageParameters = Map(
        "functionName" -> toSQLId(prettyName),
        "dataType" -> children.map(_.dataType).map(toSQLType).mkString("[", ", ", "]")
      )
    )
  } else {
    TypeCheckSuccess
  }
}
```
This is not how we usually enforce collation type coercion; please see CollationTypeCasts.scala.
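For readers unfamiliar with the approach being suggested: instead of rejecting mismatched collations in `checkInputDataTypes`, Spark's analyzer normally coerces all string arguments to one common collation during analysis. The following is a self-contained toy sketch of that idea, with illustrative names only (this is not Spark's actual `CollationTypeCasts` API, which additionally distinguishes explicit from implicit collations):

```scala
// Toy model of collated string types; illustrative only, not Spark's real classes.
case class StringType(collation: String) // e.g. "UTF8_BINARY", "UTF8_LCASE"

object CollationCoercion {
  // True when every argument already carries the same collation.
  def collationsMatch(types: Seq[StringType]): Boolean =
    types.map(_.collation).distinct.size <= 1

  // "Cast" every argument to the target collation, mimicking how an
  // analyzer rule would wrap children in Cast expressions.
  def coerce(types: Seq[StringType], target: String): Seq[StringType] =
    types.map(_ => StringType(target))
}
```

Under this scheme a query with mixed `UTF8_BINARY`/`UTF8_LCASE` arguments either gets all arguments coerced to one collation or, when the mix involves conflicting explicit collations, fails with a dedicated collation error rather than a generic type mismatch.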
Let me take a closer look, thank you!
```scala
val tableName = "t_diff_collation"
withTable(tableName) {
  sql(s"CREATE TABLE $tableName (" +
    s"text STRING COLLATE UTF8_BINARY, " +
    s"pairDelim STRING COLLATE UTF8_LCASE, " +
    s"keyValueDelim STRING COLLATE UTF8_BINARY) " +
    s"USING parquet")
  checkError(
    exception = intercept[AnalysisException] {
      sql(s"SELECT str_to_map(text, pairDelim, keyValueDelim) from $tableName")
    },
    errorClass = "DATATYPE_MISMATCH.DATA_DIFF_TYPES",
    sqlState = "42K09",
    parameters = Map(
      "functionName" -> "`str_to_map`",
      "dataType" -> "[\"STRING\", \"STRING COLLATE UTF8_LCASE\", \"STRING\"]",
      "sqlExpr" -> "\"str_to_map(text, pairDelim, keyValueDelim)\""),
    context = ExpectedContext(
      fragment = "str_to_map(text, pairDelim, keyValueDelim)",
      start = 7,
      stop = 48)
  )
}
```
We shouldn't use DATATYPE_MISMATCH.DATA_DIFF_TYPES for collation match verification; please see COLLATION_MISMATCH.EXPLICIT. There should be many tests across the codebase marked with "// Collation mismatch".
```scala
withTable(tableName) {
  sql(s"CREATE TABLE $tableName (" +
    s"text STRING COLLATE UTF8_BINARY, " +
    s"pairDelim STRING COLLATE UTF8_LCASE, " +
    s"keyValueDelim STRING COLLATE UTF8_BINARY) " +
    s"USING parquet")
```
Also, there is an ongoing effort to move such tests to the collations.sql golden file; please see #47828 and https://issues.apache.org/jira/browse/SPARK-48779.
Updated; I have moved the related tests to collations.sql.
Please try to include collation-related tickets (such as https://issues.apache.org/jira/browse/SPARK-49334) in the corresponding Collation support epic: https://issues.apache.org/jira/browse/SPARK-46830. Even better, we have a String function support parent under this epic: https://issues.apache.org/jira/browse/SPARK-46837. This way, we can all see and discuss the ongoing collation tickets.
To sum up, I do agree that type coercion should be enforced here. The reasoning is: `pairDelim` and `keyValueDelim` are treated as (collation-dependent) delimiters, rather than (collation-unaware) regular expressions. Please make the requested changes, and feel free to request a re-review as needed.
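To make the "collation-dependent delimiter" point concrete, here is a small self-contained sketch (not Spark code; all names are illustrative) of a `str_to_map`-style split where the delimiter comparison is pluggable, standing in for a collation comparator. Under a case-insensitive collation such as UTF8_LCASE, `'X'` and `'x'` must be treated as the same delimiter, which is why the delimiter arguments cannot silently carry a different collation than the text:

```scala
// Illustrative sketch only: splitting with a pluggable character-equality
// function standing in for a collation-aware comparator.
object CollatedSplit {
  type Eq = (Char, Char) => Boolean
  val binaryEq: Eq = (a, b) => a == b                // UTF8_BINARY-like
  val lcaseEq: Eq = (a, b) => a.toLower == b.toLower // UTF8_LCASE-like

  // Split `text` on the single-character delimiter `delim` using `eq`.
  def split(text: String, delim: Char, eq: Eq): List[String] = {
    val out = scala.collection.mutable.ListBuffer(new StringBuilder)
    for (c <- text) {
      if (eq(c, delim)) out += new StringBuilder else out.last.append(c)
    }
    out.map(_.toString).toList
  }
}
```

The same input splits differently depending on which comparator is chosen, so the engine has to decide on a single collation for the whole call.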
I have moved this PR (SPARK-49334) as a subtask under SPARK-46837.
This PR has been updated accordingly.
```sql
@@ -90,3 +90,14 @@ select 'a' collate en_ci_ai = 'Å';
select 'Kypper' collate sv < 'Köpfe';
select 'Kypper' collate de > 'Köpfe';
select 'I' collate tr_ci = 'ı';

-- create table for str_to_map
create table t1 (text string collate utf8_binary, pairDelim string collate utf8_lcase, keyValueDelim string collate utf8_binary) using parquet;
```
```diff
-create table t1 (text string collate utf8_binary, pairDelim string collate utf8_lcase, keyValueDelim string collate utf8_binary) using parquet;
+create table t4 (text string collate utf8_binary, pairDelim string collate utf8_lcase, keyValueDelim string collate utf8_binary) using parquet;
```

Let's just use another table name here, so that we don't confuse it with the three tables above in the future (I know that the naming is not currently consistent, but we do have an ongoing effort to revamp this golden file: https://issues.apache.org/jira/browse/SPARK-48779).
I fully agree; it has been updated!
lgtm, just one small nit
+1, LGTM. Merging to master.
@uros-db @MaxGekk please help review it when you have free time, thank you very much!
…values of all parameter types are the same

### What changes were proposed in this pull request?
The same principle as #47825 (review): the parameter `delimiter` in expression `split_part` is treated as a (collation-dependent) delimiter, rather than a (collation-unaware) regular expression.

### Why are the changes needed?
Strengthen the parameter data type check of expression `split_part` to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add some test cases to `collations.sql`.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47845 from panbingkun/SPARK-49354.

Lead-authored-by: panbingkun <panbingkun@baidu.com>
Co-authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

… values of all parameter types are the same

### What changes were proposed in this pull request?
The same principle as #47825 (review): the parameters (`left` and `right`) in expression `levenshtein` are collation-dependent, rather than collation-unaware.

### Why are the changes needed?
Strengthen the parameter data type check of expression `levenshtein` to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add some test cases to `collations.sql`.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47847 from panbingkun/SPARK-49355.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>

…values of all parameter types are the same

### What changes were proposed in this pull request?
From the implementation logic of expression `str_to_map`, we found that it only extracts the collation of the first parameter, `text`, for the subsequent logical processing, as follows:
https://github.com/apache/spark/blob/7f1a0c62a03c0d6498503987c4855327c0b6cae1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L588
https://github.com/apache/spark/blob/7f1a0c62a03c0d6498503987c4855327c0b6cae1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L594-L595
https://github.com/apache/spark/blob/7f1a0c62a03c0d6498503987c4855327c0b6cae1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L600-L601
I propose to perform consistency checks on the collation values of all parameters to avoid potential issues.

### Why are the changes needed?
Strengthen the parameter data type check of expression `str_to_map` to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add new UT.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47825 from panbingkun/SPARK-49334.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
### What changes were proposed in this pull request?

From the implementation logic of expression `str_to_map`, we found that it only extracts the collation of the first parameter, `text`, for the subsequent logical processing, as follows:

- spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala, line 588 in 7f1a0c6
- spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala, lines 594 to 595 in 7f1a0c6
- spark/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala, lines 600 to 601 in 7f1a0c6

I propose to perform consistency checks on the collation values of all parameters to avoid potential issues.

### Why are the changes needed?

Strengthen the parameter data type check of expression `str_to_map` to avoid potential issues.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

- Add new UT.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?

No.
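For context, the semantics being guarded can be sketched outside Spark as follows. This toy `str_to_map` (a simplified illustration, not the actual `complexTypeCreator.scala` code) splits `text` on `pairDelim` and each pair on `keyValueDelim` using plain binary string matching; the PR's point is that in Spark each of these comparisons is collation-aware, so all three parameters must agree on one collation for the result to be well defined:

```scala
// Simplified, binary-matching illustration of str_to_map semantics.
object ToyStrToMap {
  def strToMap(text: String, pairDelim: String, kvDelim: String): Map[String, String] =
    text.split(java.util.regex.Pattern.quote(pairDelim)).iterator.map { pair =>
      // Split each pair into key/value; a pair without the delimiter maps to null,
      // mirroring str_to_map's behavior for missing values.
      pair.split(java.util.regex.Pattern.quote(kvDelim), 2) match {
        case Array(k, v) => k -> v
        case Array(k)    => k -> null
      }
    }.toMap
}
```

For example, `ToyStrToMap.strToMap("a:1,b:2", ",", ":")` yields `Map("a" -> "1", "b" -> "2")`; with a case-insensitive collation on the delimiters, the set of split points could differ, which is exactly why mismatched collations are now rejected.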