-
Notifications
You must be signed in to change notification settings - Fork 28.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SPARK-50792][SQL] Format binary data as a binary literal in JDBC. #49452
[SPARK-50792][SQL] Format binary data as a binary literal in JDBC. #49452
Conversation
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
val select = s"SELECT * FROM $catalogName.$tableName WHERE binary_col = $binary" | ||
assert(spark.sql(select).collect().length === 1, s"Binary literal test failed: $select") | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please add the similar test case into JDBCV2Suite
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please prepare data at tablePreparation
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
tablePreparation
Sorry, I'm not quite familiar with the test infrastructure. In case I make mistakes, let me confirm this question.
To mixin the tablePreparation and dataPreparation from trait defined in V2JDBCTest.scala, we need to update all the integration tests and call the these functions defined in trait.
And duplicate the extra call to multiple integration tests is OK, am I right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, Just realized I have to use Spark SQL to create table and use the types defined in Spark SQL. If I prepare table and data in tablePreparation and dataPreparation, that will have to be database specific. The code will definitely have to be duplicated for connectors of all the databases.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes. If we update the basic class JdbcDialect
, we should test all the built-in integration tests.
tablePreparation
used to customize the DDL, I'm afraid Spark SQL can covers all the built-in integration tests. But you could do your best effort, let's see the result and make the decision.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, just realized each dialect embeds a builder which can override the implementation. Let me have a try.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oracle support is ready for review, PTAL. Thanks.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can override visitBinaryComparison
in OracleSQLBuilder
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There are couple threads discussing this topic. Let me copy the comment in case it's missed.
We can only rewrite comparison when one of the arguments is BLOB. For other cases, we have to use existing implementation. But unfortunately, the signature of visitBinaryComparison is accepting everything in string which loss the type information to understand if one of the arguments is binary type.
protected String visitBinaryComparison(String name, String l, String r) {
if (name.equals("<=>")) {
return "((" + l + " IS NOT NULL AND " + r + " IS NOT NULL AND " + l + " = " + r + ") " +
"OR (" + l + " IS NULL AND " + r + " IS NULL))";
}
return l + " " + name + " " + r;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All tests passed, but downloading report failed. We can rerun the whole test again to clear all the checks, but it takes quite some time to finish.
https://github.com/sunxiaoguang/spark/actions/runs/12744731853
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Depends on apache#49452 and apache#49453 Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Depends on apache#49452 and apache#49453 Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
f729330
to
8ddfcc2
Compare
// Unfornately, Oracle can only compare two BLOBs with a special function dbms_lob.compare | ||
// The V2ExpressionSQLBuilder can not support rewriting the '=' operator. | ||
// We don't test binary data on Oracle and do not support binary data on Oracle. | ||
if (!this.isInstanceOf[OracleIntegrationSuite]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The test case skips the Oracle, but the users still possible use this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that requires a significant change in how V2ExpressionSQLBuilder works. The binary literal was not working before this PR anyway. We can make it work at least on other databases this time. And propose another design changes to V2ExpressionSQLBuilder to make it flexible enable to support cases like this.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As Oracle is supported, the test is now enabled for Oracle.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's not a good idea if the bug still exists when using oracle dialect. Is there a way to fix the bug?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, that requires a significant change in how V2ExpressionSQLBuilder works. The binary literal was not working before this PR anyway. We can make it work at least on other databases this time. And propose another design changes to V2ExpressionSQLBuilder to make it flexible enable to support cases like this.
Why we need change V2ExpressionSQLBuilder? Please describe the detail.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry for the confusion; I didn't check the inheritance structure initially. After realizing that each dialect embeds a builder inherited from V2ExpressionSQLBuilder, I made all the changes in OracleDialect. PTAL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
BTW: I just changed test to verify the returned result in addition. It's going to run for a while.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All tests passed, but downloading report failed. We can rerun the whole test again to clear all the checks, but it takes quite some time to finish.
https://github.com/sunxiaoguang/spark/actions/runs/12744731853
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
} | ||
} | ||
|
||
override def build(expr: Expression): String = expr match { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You just need override visitBinaryComparison
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can only rewrite comparison when one of the arguments is BLOB. For other cases, we have to use existing implementation. But unfortunately, the signature of visitBinaryComparison is accepting everything in string which loss the type information to understand if one of the arguments is binary type.
protected String visitBinaryComparison(String name, String l, String r) {
if (name.equals("<=>")) {
return "((" + l + " IS NOT NULL AND " + r + " IS NOT NULL AND " + l + " = " + r + ") " +
"OR (" + l + " IS NULL AND " + r + " IS NULL))";
}
return l + " " + name + " " + r;
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
All tests passed, but downloading report failed. We can rerun the whole test again to clear all the checks, but it takes quite some time to finish.
https://github.com/sunxiaoguang/spark/actions/runs/12744731853
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Got it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. The code override build looks good to me. But we may need to add the protected method protected String visitBinaryComparison(String name, Expression l, Expression r)
in V2ExpressionSQLBuilder
.
What's your opinion? @cloud-fan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Cool, if we agree changing the signature of visitBinaryComparison in this PR, I can make the change all together to make the code easier to maintain.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please wait a moment. We should change the signature in another PR if we agree on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure.
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
Outdated
Show resolved
Hide resolved
connector/docker-integration-tests/src/test/scala/org/apache/spark/sql/jdbc/v2/V2JDBCTest.scala
Outdated
Show resolved
Hide resolved
testBinaryLiteral("<>", lessThanBinary, 1) | ||
testBinaryLiteral("<=>", binary, 1) | ||
testBinaryLiteral("<=>", lessThanBinary, 0) | ||
testBinaryLiteral("<=>", greaterThanBinary, 0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We will run these tests with all builtin dialects?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, the V2JDBCTest
trait mixed in existing integrations and run against all the dialects. With this test, we figured out that Microsoft SQL Server
, Oracle
and PostgreSQL
require extra work to support Binary Literal
.
val op = if (operator == "<=>") "=" else operator | ||
val compare = s"DBMS_LOB.COMPARE($l, $r) $op 0" | ||
if (operator == "<=>") { | ||
s"(($l IS NOT NULL AND $r IS NOT NULL AND $compare) OR ($l IS NULL AND $r IS NULL))" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is dirty but I don't have better ideas...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it's dirty. Fortunately, we only found Oracle requires a special function to compare blob values.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Even if we change signature of visitBinaryComparison to make the override better, we still cannot get rid of these dirty rewrites of <=>
and DBMS_LOB.COMPARE
.
@@ -3097,4 +3097,19 @@ class JDBCV2Suite extends QueryTest with SharedSparkSession with ExplainSuiteHel | |||
assert(rows.contains(Row(null))) | |||
assert(rows.contains(Row("a a a"))) | |||
} | |||
|
|||
test("SPARK-50792 Format binary data as a binary literal in JDBC.") { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think for dialect-specific fixes, it's sufficient to put the tests V2JDBCTest.scala
. This test suite is for testing the shared code paths for all dialects.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Right, this PR is somewhat complicated, it has changes in shared code paths to support Binary Literal
. And also has changes in MsSqlServerDialect
, OracleDialect
and PostgresDialect
. If this test is not necessary, we can remove it.
…park/sql/jdbc/v2/V2JDBCTest.scala Co-authored-by: Wenchen Fan <cloud0fan@gmail.com>
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Question: Prior to these changes is there any scenario where some data sources might have returned something, although incorrect ? Is there a chance that someone has adapted to this "wrong" behaviour ?
Should we protect this change behind a flag ?
My understanding is that this never produced any kind of results that customers could have adapted to - hence flag is not required. Is this your understanding as well ?
Also. please lets check that pushdown is happening, to make sure we are testing what we want
sql(s"CREATE TABLE $tableName (binary_col BINARY)") | ||
sql(s"INSERT INTO $tableName VALUES ($binary)") | ||
|
||
val select = s"SELECT * FROM $tableName WHERE binary_col = $binary" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets verify that this is actually pushed down - e.g. verify that there is no FilterExec present in plan
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Verification added, PTAL. Thanks
sql(s"INSERT INTO $tableName VALUES ($binary)") | ||
|
||
def testBinaryLiteral(operator: String, literal: String, expected: Int): Unit = { | ||
val sql = s"SELECT * FROM $tableName WHERE binary_col $operator $literal" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lets verify that pushdown is present
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
checkFilterPushed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Check added, PTAL. Thanks
Signed-off-by: Xiaoguang Sun <sunxiaoguang@gmail.com>
Can you provide your take on this:
|
thanks, merging to master/4.0! |
### What changes were proposed in this pull request? Format binary data as a binary literal in JDBC. ### Why are the changes needed? The binary data is not handled to format it as binary literal in JDBC connectors These are the steps to reproduce. 1. CREATE TABLE test_binary_literal(b BINARY); 2. INSERT INTO test_binary_literal VALUES(x'010203'); 3. SELECT * FROM test_binary_literal WHERE b=x'010203'; <img width="1180" alt="image" src="https://github.com/user-attachments/assets/92800c55-5400-46b0-b3f1-d95b85d89cb5" /> ### Does this PR introduce _any_ user-facing change? 'No' ### How was this patch tested? Added new integration tests ### Was this patch authored or co-authored using generative AI tooling? 'No' Closes #49452 from sunxiaoguang/fix_binary_literal_format_in_jdbc. Lead-authored-by: Xiaoguang Sun <sunxiaoguang@gmail.com> Co-authored-by: Xiaoguang Sun <sunxiaoguang@users.noreply.github.com> Signed-off-by: Wenchen Fan <wenchen@databricks.com> (cherry picked from commit 813591f) Signed-off-by: Wenchen Fan <wenchen@databricks.com>
What changes were proposed in this pull request?
Format binary data as a binary literal in JDBC.
Why are the changes needed?
The binary data is not handled to format it as binary literal in JDBC connectors
These are the steps to reproduce.
Does this PR introduce any user-facing change?
'No'
How was this patch tested?
Added new integration tests
Was this patch authored or co-authored using generative AI tooling?
'No'