Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(bigquery): Improve roundtrip of typed STRUCTs #4684

Closed
wants to merge 1 commit into from

Conversation

VaggelisD
Copy link
Collaborator

Context #4671

  • Before this PR:
>>> sqlglot.parse_one("SELECT STRUCT<x INT64, y STRING>(1, 'bar')", dialect="bigquery").sql("bigquery")
"SELECT CAST(STRUCT(1, 'bar') AS STRUCT<x INT64, y STRING>)"

  • After this PR:
>>> sqlglot.parse_one("SELECT STRUCT<x INT64, y STRING>(1, 'bar')", dialect="bigquery").sql("bigquery")
"SELECT STRUCT<x INT64, y STRING>(1, 'bar')"

cc: @sean-rose

Docs

BQ STRUCT

@VaggelisD
Copy link
Collaborator Author

I think I'll actually go ahead and close this for now. The issue is that we canonicalize typed STRUCTs into cast, but during generation we can't deduce if the same cast was a user vs a canonicalized one. Notice how it's not always safe to transform these casts back into the typed versions, e.g:

  • This cast is fine
bq> SELECT CAST(STRUCT(1 AS old_name) AS STRUCT<new_name INT64>) AS strct;
strct.new_name
1

  • This inlined construction is not
bq> SELECT STRUCT<new_name INT64>(1 AS old_name) AS strct;

Error: STRUCT constructors cannot specify both an explicit type and field names with AS at [1:36]

This is the reason I went ahead and added the this.find(exp.PropertyEQ) call, but this can add performance hits for users with wide and/or deeply nested STRUCTs since we may traverse the entire sub-AST.

@VaggelisD
Copy link
Collaborator Author

PS: For future reference, it seems that typed structs will error only if the top-level fields are named, otherwise it's fine:

bq> SELECT STRUCT<test INT64, bar STRUCT<foo INT64>>(1, STRUCT(2 AS baz));
strct
"{
  ""strct"": {
    ""test"": ""1"",
    ""bar"": {
      ""foo"": ""2""
    }
  }
}"

This means that we can make the Generator check a linear scan, but that still incurs hits for wide STRUCTs:

+++ b/sqlglot/dialects/bigquery.py
@@ -1236,7 +1236,7 @@ class BigQuery(Dialect):
             if isinstance(this, exp.Array):
                 return f"{self.sql(expression, 'to')}{self.sql(this)}"
 
-            if isinstance(this, exp.Struct) and not this.find(exp.PropertyEQ):
+            if isinstance(this, exp.Struct) and not any(isinstance(expr, exp.PropertyEQ) for expr in this.expressions):

@VaggelisD VaggelisD closed this Jan 30, 2025
@georgesittas georgesittas deleted the vaggelisd/bq_typed_structs branch January 30, 2025 12:17
@sean-rose
Copy link

sean-rose commented Jan 30, 2025

The issue is that we canonicalize typed STRUCTs into cast, but during generation we can't deduce if the same cast was a user vs a canonicalized one.

That canonicalization change made in #3751 is losing information, which is problematic since as you've demonstrated the two forms of struct definition aren't always equivalent in BigQuery.

Could we perhaps change BigQuery parsing to add some sort of annotation to such canonicalized Cast expressions to indicate whether it's a normal cast or a type-annotation (e.g. kind)? Then that could be used by BigQuery SQL generation to preserve such type-annotated arrays & structs during a roundtrip through SQLGlot.

@georgesittas
Copy link
Collaborator

which is problematic since as you've demonstrated the two forms of struct definition aren't always equivalent in BigQuery.

@sean-rose can you point out how converting the STRUCT<...>(...) into a cast is losing information? As long as the semantics are preserved, this should be fine from SQLGlot's standpoint. I'm in favor of preserving the original syntax, but in this case I think the added complexity is somewhat questionable. Happy to be persuaded otherwise if this conversion is erroneous for whatever reason.

@sean-rose
Copy link

@sean-rose can you point out how converting the STRUCT<...>(...) into a cast is losing information? As long as the semantics are preserved, this should be fine from SQLGlot's standpoint. I'm in favor of preserving the original syntax, but in this case I think the added complexity is somewhat questionable. Happy to be persuaded otherwise if this conversion is erroneous for whatever reason.

It's losing the information about whether the original form was a literal CAST() or a type-annotated STRUCT<...>() constructor. Since it appears BigQuery doesn't balk at casting structs like it does at casting arrays this isn't a dealbreaker, but IMO being able to preserve the original syntax as much as possible during a roundtrip through SQLGlot is worth some added complexity. For example, I've been considering using SQLGlot to replace some existing SQL formatting logic, but I wouldn't want it to arbitrarily change idiomatic BigQuery SQL like STRUCT<...>(...) into CAST(... AS STRUCT<...>).

@georgesittas
Copy link
Collaborator

I hear you, however I think for this particular case we'll leave it as is, since BigQuery works with either form. I'd suggest overriding the relevant parts of the dialect if this is unacceptable, e.g. if implementing a formatting tool like you mentioned.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants