Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat(spark): Custom annotation for more string functions #4156

Merged
merged 2 commits into from
Sep 27, 2024

Conversation

VaggelisD
Copy link
Collaborator

@VaggelisD VaggelisD commented Sep 25, 2024

This is a follow up on #4004, as during my investigation for SUBSTRING I came across other Spark string functions (CONCAT, LPAD, RPAD) which are input type dependent.

Although it's not documented clearly, these functions will return a BINARY only if all their "string" operands are BINARY, otherwise they return a STRING:

spark-sql (default)> with tbl as (select cast('test' as binary) as bin_col, 'str' as str_col) 
select 
 typeof(concat(bin_col, bin_col)), 
 typeof(concat(str_col, bin_col)), 
 typeof(concat(bin_col, str_col)), 
 typeof(concat(str_col, str_col)) 
from tbl;

binary  string  string  string
spark-sql (default)> with tbl as (select cast('test' as binary) as bin_col, 'str' as str_col) 
select 
 typeof(lpad(bin_col, 1, bin_col)), 
 typeof(lpad(str_col, 1, bin_col)), 
 typeof(lpad(bin_col, 1, str_col)), 
 typeof(lpad(str_col, 1, str_col))
from tbl;

binary  string  string  string

Docs

Databricks LPAD | Databricks CONCAT

@VaggelisD VaggelisD changed the title feat(spark, databricks): Custom annotation for string functions feat(spark): Custom annotation for more string functions Sep 25, 2024
Copy link
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we consolidate _annotate_by_same_args and _annotate_by_args somehow? They look very similar, perhaps a flag could help?

sqlglot/dialects/spark2.py Outdated Show resolved Hide resolved
sqlglot/dialects/spark2.py Outdated Show resolved Hide resolved
tests/test_optimizer.py Outdated Show resolved Hide resolved
@VaggelisD
Copy link
Collaborator Author

Can we consolidate _annotate_by_same_args and _annotate_by_args somehow? They look very similar, perhaps a flag could help?

Some parts are overlapping but the "access pattern" is different. I think as each dialect is starting to form their own annotators we should keep that logic separate from the common functions; If some boilerplate parts are getting repetitive, we should of course factor these out in time.

Regarding the Spark logic, I came up with a new set of rules for that annotator that biases STRING less now (that still remains the fallback type though). What do you think of that?

@VaggelisD VaggelisD force-pushed the vaggelisd/spark_annotators branch from 1caba37 to ed25681 Compare September 26, 2024 15:20
@VaggelisD VaggelisD force-pushed the vaggelisd/spark_annotators branch from ed25681 to 0c21fec Compare September 26, 2024 16:00
Copy link
Collaborator

@georgesittas georgesittas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's test unknowns as well

@VaggelisD
Copy link
Collaborator Author

@georgesittas I did add a few test cases with UNKNOWNs in the previous commit, do you have more cases in mind?

@georgesittas
Copy link
Collaborator

@georgesittas I did add a few test cases with UNKNOWNs in the previous commit, do you have more cases in mind?

Ah I didn't see them, apologies. LGTM.

@georgesittas georgesittas merged commit 7af33a2 into main Sep 27, 2024
6 checks passed
@georgesittas georgesittas deleted the vaggelisd/spark_annotators branch September 27, 2024 14:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants