Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-49334][SQL] str_to_map should check whether the collation values of all parameter types are the same #47825

Closed
wants to merge 7 commits into from

Conversation

panbingkun
Copy link
Contributor

@panbingkun panbingkun commented Aug 21, 2024

What changes were proposed in this pull request?

From the implementation logic of expression str_to_map, we found that it only extracts collation of the first parameter text for the next logical processing, as follows:

private final lazy val collationId: Int = text.dataType.asInstanceOf[StringType].collationId

val keyValues = CollationAwareUTF8String.splitSQL(inputString.asInstanceOf[UTF8String],
stringDelimiter.asInstanceOf[UTF8String], -1, collationId)

val keyValueArray = CollationAwareUTF8String.splitSQL(
keyValues(i), keyValueDelimiterUTF8String, 2, collationId)

I propose to perform consistency checks on the collation values of all parameters to avoid potential issues.

Why are the changes needed?

Strengthen the parameter data type check of expression str_to_map to avoid potential issues.

Does this PR introduce any user-facing change?

No.

How was this patch tested?

  • Add new UT.
  • Pass GA.

Was this patch authored or co-authored using generative AI tooling?

No.

…values of all parameter types are the same
@github-actions github-actions bot added the SQL label Aug 21, 2024
@panbingkun panbingkun marked this pull request as ready for review August 21, 2024 09:15
@panbingkun
Copy link
Contributor Author

cc @MaxGekk @cloud-fan

@panbingkun panbingkun marked this pull request as draft August 21, 2024 11:12
@panbingkun panbingkun marked this pull request as ready for review August 21, 2024 11:35
Copy link
Member

@MaxGekk MaxGekk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks reasonable. cc @uros-db You touched the expression last time, please, take a look at the changes.

Comment on lines 587 to 602
override def checkInputDataTypes(): TypeCheckResult = {
val defaultCheck = super.checkInputDataTypes()
if (defaultCheck.isFailure) {
defaultCheck
} else if (!TypeCoercion.haveSameType(children.map(_.dataType))) {
DataTypeMismatch(
errorSubClass = "DATA_DIFF_TYPES",
messageParameters = Map(
"functionName" -> toSQLId(prettyName),
"dataType" -> children.map(_.dataType).map(toSQLType).mkString("[", ", ", "]")
)
)
} else {
TypeCheckSuccess
}
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is not how we usually enforce collation type coercion

please see CollationTypeCasts.scala

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let me take a closer look, thank you!

Comment on lines 1000 to 1022
val tableName = "t_diff_collation"
withTable(tableName) {
sql(s"CREATE TABLE $tableName (" +
s"text STRING COLLATE UTF8_BINARY, " +
s"pairDelim STRING COLLATE UTF8_LCASE, " +
s"keyValueDelim STRING COLLATE UTF8_BINARY) " +
s"USING parquet")
checkError(
exception = intercept[AnalysisException] {
sql(s"SELECT str_to_map(text, pairDelim, keyValueDelim) from $tableName")
},
errorClass = "DATATYPE_MISMATCH.DATA_DIFF_TYPES",
sqlState = "42K09",
parameters = Map(
"functionName" -> "`str_to_map`",
"dataType" -> "[\"STRING\", \"STRING COLLATE UTF8_LCASE\", \"STRING\"]",
"sqlExpr" -> "\"str_to_map(text, pairDelim, keyValueDelim)\""),
context = ExpectedContext(
fragment = "str_to_map(text, pairDelim, keyValueDelim)",
start = 7,
stop = 48)
)
}
Copy link
Contributor

@uros-db uros-db Aug 21, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we shouldn't use DATATYPE_MISMATCH.DATA_DIFF_TYPES for collation match verification

please see COLLATION_MISMATCH.EXPLICIT, there should be many tests across the codebase market with "// Collation mismatch"

Comment on lines 1001 to 1006
withTable(tableName) {
sql(s"CREATE TABLE $tableName (" +
s"text STRING COLLATE UTF8_BINARY, " +
s"pairDelim STRING COLLATE UTF8_LCASE, " +
s"keyValueDelim STRING COLLATE UTF8_BINARY) " +
s"USING parquet")
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

also, there is an ongoing effort to move such tests to collations.sql golden file

please see: #47828 and https://issues.apache.org/jira/browse/SPARK-48779

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Updated, I have moved the related tests to file collations.sql.

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

please try to include collation-related tickets (such as https://issues.apache.org/jira/browse/SPARK-49334) in the corresponding Collation support epic: https://issues.apache.org/jira/browse/SPARK-46830

even better, we have a String function support parent under this epic: https://issues.apache.org/jira/browse/SPARK-46837

this way, we can all see and discuss the ongoing collation tickets

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to sum up, I do agree that type coercion should be enforced here

the reasoning is: pairDelim and keyValueDelim are treated as (collation-dependent) delimiters, rather than (collation-unaware) regular expressions

please make the requsted changes, and feel free to request a re-review as needed

@panbingkun
Copy link
Contributor Author

please try to include collation-related tickets (such as https://issues.apache.org/jira/browse/SPARK-49334) in the corresponding Collation support epic: https://issues.apache.org/jira/browse/SPARK-46830

even better, we have a String function support parent under this epic: https://issues.apache.org/jira/browse/SPARK-46837

this way, we can all see and discuss the ongoing collation tickets

I have moved this pr(SPARK-49334) as a subtask to SPARK-46837

@panbingkun
Copy link
Contributor Author

panbingkun commented Aug 22, 2024

to sum up, I do agree that type coercion should be enforced here

the reasoning is: pairDelim and keyValueDelim are treated as (collation-dependent) delimiters, rather than (collation-unaware) regular expressions

please make the requsted changes, and feel free to request a re-review as needed

This PR has been updated according to the requsted changes,
please review it again when you have free time, thank you very much!

@@ -90,3 +90,14 @@ select 'a' collate en_ci_ai = 'Å';
select 'Kypper' collate sv < 'Köpfe';
select 'Kypper' collate de > 'Köpfe';
select 'I' collate tr_ci = 'ı';

-- create table for str_to_map
create table t1 (text string collate utf8_binary, pairDelim string collate utf8_lcase, keyValueDelim string collate utf8_binary) using parquet;
Copy link
Contributor

@uros-db uros-db Aug 22, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
create table t1 (text string collate utf8_binary, pairDelim string collate utf8_lcase, keyValueDelim string collate utf8_binary) using parquet;
create table t4 (text string collate utf8_binary, pairDelim string collate utf8_lcase, keyValueDelim string collate utf8_binary) using parquet;

let's just use another table name here, so that we don't confuse it with the three tables above in the future (I know that the naming is not currently consistent, but we do have an ongoing effort to revamp this golden file - https://issues.apache.org/jira/browse/SPARK-48779)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I highly agree, it has been updated!

Copy link
Contributor

@uros-db uros-db left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm, just one small nit

@MaxGekk
Copy link
Member

MaxGekk commented Aug 22, 2024

+1, LGTM. Merging to master.
Thank you, @panbingkun and @uros-db for review.

@panbingkun
Copy link
Contributor Author

@uros-db @MaxGekk
I have checked all expressions with spark input data type StringTypeAnyCollation, and I believe that split_part and levenshtein should also need similar checks and enhancements. Therefore, I have submitted the following two PRs to complete it.

please help review it when you have free time, thank you very much!

MaxGekk pushed a commit that referenced this pull request Aug 26, 2024
…values of all parameter types are the same

### What changes were proposed in this pull request?
The same principle as #47825 (review), the parameter `delimiter` in expression `split_part` are treated as (`collation-dependent`) delimiters, rather than (`collation-unaware`) regular expressions.

### Why are the changes needed?
Strengthen the parameter data type check of expression `split_part`  to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add some `test case` to `collations.sql`.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47845 from panbingkun/SPARK-49354.

Lead-authored-by: panbingkun <panbingkun@baidu.com>
Co-authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
MaxGekk pushed a commit that referenced this pull request Aug 27, 2024
… values of all parameter types are the same

### What changes were proposed in this pull request?
The same principle as #47825 (review), the parameters (`left` and `right`) in expression `levenshtein` are `collation-dependent`, rather than `collation-unaware`.

### Why are the changes needed?
Strengthen the parameter data type check of expression `levenshtein`  to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add some `test case` to `collations.sql`.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes #47847 from panbingkun/SPARK-49355.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
…values of all parameter types are the same

### What changes were proposed in this pull request?
From the implementation logic of expression `str_to_map`, we found that it only extracts `collation` of the first parameter `text` for the `next` logical processing, as follows:
https://github.com/apache/spark/blob/7f1a0c62a03c0d6498503987c4855327c0b6cae1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L588

https://github.com/apache/spark/blob/7f1a0c62a03c0d6498503987c4855327c0b6cae1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L594-L595

https://github.com/apache/spark/blob/7f1a0c62a03c0d6498503987c4855327c0b6cae1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L600-L601

I propose to perform `consistency checks` on the `collation` values of all parameters to avoid potential issues.

### Why are the changes needed?
Strengthen the parameter data type check of expression `str_to_map`  to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add new UT.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47825 from panbingkun/SPARK-49334.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
…values of all parameter types are the same

### What changes were proposed in this pull request?
The same principle as apache#47825 (review), the parameter `delimiter` in expression `split_part` are treated as (`collation-dependent`) delimiters, rather than (`collation-unaware`) regular expressions.

### Why are the changes needed?
Strengthen the parameter data type check of expression `split_part`  to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add some `test case` to `collations.sql`.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47845 from panbingkun/SPARK-49354.

Lead-authored-by: panbingkun <panbingkun@baidu.com>
Co-authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
IvanK-db pushed a commit to IvanK-db/spark that referenced this pull request Sep 20, 2024
… values of all parameter types are the same

### What changes were proposed in this pull request?
The same principle as apache#47825 (review), the parameters (`left` and `right`) in expression `levenshtein` are `collation-dependent`, rather than `collation-unaware`.

### Why are the changes needed?
Strengthen the parameter data type check of expression `levenshtein`  to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add some `test case` to `collations.sql`.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47847 from panbingkun/SPARK-49355.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…values of all parameter types are the same

### What changes were proposed in this pull request?
From the implementation logic of expression `str_to_map`, we found that it only extracts `collation` of the first parameter `text` for the `next` logical processing, as follows:
https://github.com/apache/spark/blob/7f1a0c62a03c0d6498503987c4855327c0b6cae1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L588

https://github.com/apache/spark/blob/7f1a0c62a03c0d6498503987c4855327c0b6cae1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L594-L595

https://github.com/apache/spark/blob/7f1a0c62a03c0d6498503987c4855327c0b6cae1/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/expressions/complexTypeCreator.scala#L600-L601

I propose to perform `consistency checks` on the `collation` values of all parameters to avoid potential issues.

### Why are the changes needed?
Strengthen the parameter data type check of expression `str_to_map`  to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add new UT.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47825 from panbingkun/SPARK-49334.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…values of all parameter types are the same

### What changes were proposed in this pull request?
The same principle as apache#47825 (review), the parameter `delimiter` in expression `split_part` are treated as (`collation-dependent`) delimiters, rather than (`collation-unaware`) regular expressions.

### Why are the changes needed?
Strengthen the parameter data type check of expression `split_part`  to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add some `test case` to `collations.sql`.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47845 from panbingkun/SPARK-49354.

Lead-authored-by: panbingkun <panbingkun@baidu.com>
Co-authored-by: panbingkun <pbk1982@gmail.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
… values of all parameter types are the same

### What changes were proposed in this pull request?
The same principle as apache#47825 (review), the parameters (`left` and `right`) in expression `levenshtein` are `collation-dependent`, rather than `collation-unaware`.

### Why are the changes needed?
Strengthen the parameter data type check of expression `levenshtein`  to avoid potential issues.

### Does this PR introduce _any_ user-facing change?
No.

### How was this patch tested?
- Add some `test case` to `collations.sql`.
- Pass GA.

### Was this patch authored or co-authored using generative AI tooling?
No.

Closes apache#47847 from panbingkun/SPARK-49355.

Authored-by: panbingkun <panbingkun@baidu.com>
Signed-off-by: Max Gekk <max.gekk@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants