fix(bigquery): Improve roundtrip of typed STRUCTs #4684

VaggelisD · 2025-01-30T10:49:11Z

Context: #4671

Before this PR:

>>> sqlglot.parse_one("SELECT STRUCT<x INT64, y STRING>(1, 'bar')", dialect="bigquery").sql("bigquery")
"SELECT CAST(STRUCT(1, 'bar') AS STRUCT<x INT64, y STRING>)"

After this PR:

>>> sqlglot.parse_one("SELECT STRUCT<x INT64, y STRING>(1, 'bar')", dialect="bigquery").sql("bigquery")
"SELECT STRUCT<x INT64, y STRING>(1, 'bar')"

cc: @sean-rose

Docs: BQ STRUCT

VaggelisD · 2025-01-30T12:08:39Z

I think I'll actually go ahead and close this for now. The issue is that we canonicalize typed STRUCTs into casts, but during generation we can't tell whether a given cast was written by the user or produced by that canonicalization. Notice that it's not always safe to transform these casts back into the typed versions, e.g.:

This cast is fine:

bq> SELECT CAST(STRUCT(1 AS old_name) AS STRUCT<new_name INT64>) AS strct;
strct.new_name
1

This inlined construction is not:

bq> SELECT STRUCT<new_name INT64>(1 AS old_name) AS strct;

Error: STRUCT constructors cannot specify both an explicit type and field names with AS at [1:36]

This is the reason I went ahead and added the this.find(exp.PropertyEQ) call, but it can hurt performance for users with wide and/or deeply nested STRUCTs, since we may traverse the entire sub-AST.

VaggelisD · 2025-01-30T12:08:48Z

PS: For future reference, it seems that typed structs will error only if the top-level fields are named; otherwise it's fine:

bq> SELECT STRUCT<test INT64, bar STRUCT<foo INT64>>(1, STRUCT(2 AS baz));
strct
{
  "strct": {
    "test": "1",
    "bar": {
      "foo": "2"
    }
  }
}

This means that we can reduce the Generator check to a linear scan over the top-level fields, but that still incurs a performance hit for wide STRUCTs:

+++ b/sqlglot/dialects/bigquery.py
@@ -1236,7 +1236,7 @@ class BigQuery(Dialect):
             if isinstance(this, exp.Array):
                 return f"{self.sql(expression, 'to')}{self.sql(this)}"
 
-            if isinstance(this, exp.Struct) and not this.find(exp.PropertyEQ):
+            if isinstance(this, exp.Struct) and not any(isinstance(expr, exp.PropertyEQ) for expr in this.expressions):
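To illustrate the difference between the two checks in the diff above, here is a minimal sketch on a toy AST. It does not use SQLGlot: the `Node` class, its `kind` strings, and both helper functions are hypothetical stand-ins for `exp.Struct`, `exp.PropertyEQ`, `this.find(...)`, and the scan over `this.expressions`.

```python
from dataclasses import dataclass, field
from typing import Iterator, List


@dataclass
class Node:
    """Toy AST node; `kind` stands in for the expression class name."""

    kind: str
    children: List["Node"] = field(default_factory=list)

    def walk(self) -> Iterator["Node"]:
        # Depth-first traversal of the whole subtree, starting with this node.
        yield self
        for child in self.children:
            yield from child.walk()


def has_named_field_deep(struct: Node) -> bool:
    # Analogue of `this.find(exp.PropertyEQ)`: may visit every descendant.
    return any(node.kind == "PropertyEQ" for node in struct.walk())


def has_named_field_shallow(struct: Node) -> bool:
    # Analogue of scanning `this.expressions`: inspects direct children only.
    return any(child.kind == "PropertyEQ" for child in struct.children)


# STRUCT(1, STRUCT(2 AS baz)): the only named field is nested one level down.
nested = Node(
    "Struct",
    [Node("Literal"), Node("Struct", [Node("PropertyEQ", [Node("Literal")])])],
)

# STRUCT(1 AS old_name): the named field sits at the top level.
top_level = Node("Struct", [Node("PropertyEQ", [Node("Literal")])])
```

In this toy model the deep check flags both trees, while the shallow check flags only `top_level`, which matches the observation that BigQuery rejects the typed constructor only when top-level fields are named. The shallow check still costs one pass over the struct's direct fields, so it scales with width but not with nesting depth.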

sean-rose · 2025-01-30T16:33:45Z

> The issue is that we canonicalize typed STRUCTs into cast, but during generation we can't deduce if the same cast was a user vs a canonicalized one.

That canonicalization change made in #3751 is losing information, which is problematic since as you've demonstrated the two forms of struct definition aren't always equivalent in BigQuery.

Could we perhaps change BigQuery parsing to add some sort of annotation to such canonicalized Cast expressions to indicate whether it's a normal cast or a type-annotation (e.g. kind)? Then that could be used by BigQuery SQL generation to preserve such type-annotated arrays & structs during a roundtrip through SQLGlot.
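A rough sketch of what such an annotation could look like, using a toy class rather than SQLGlot's own `exp.Cast`. The `ToyCast` class, its `from_typed_constructor` flag, and the `generate` function are all hypothetical names chosen for illustration; nothing here reflects an existing SQLGlot API.

```python
from dataclasses import dataclass


@dataclass
class ToyCast:
    """Toy stand-in for a canonicalized cast node (not SQLGlot's exp.Cast)."""

    args_sql: str  # already-rendered struct arguments, e.g. "1, 'bar'"
    type_sql: str  # already-rendered target type, e.g. "STRUCT<x INT64>"
    # Hypothetical marker set by the parser when the source used the
    # type-annotated constructor form instead of an explicit CAST().
    from_typed_constructor: bool = False


def generate(cast: ToyCast) -> str:
    # Restore the original syntax when the marker says the cast was synthesized.
    if cast.from_typed_constructor:
        return f"{cast.type_sql}({cast.args_sql})"
    return f"CAST(STRUCT({cast.args_sql}) AS {cast.type_sql})"


typed = ToyCast("1, 'bar'", "STRUCT<x INT64, y STRING>", from_typed_constructor=True)
explicit = ToyCast("1, 'bar'", "STRUCT<x INT64, y STRING>")
```

With the marker in place, the generator needs no AST traversal at all: a cast that came from a typed constructor could never have carried `AS` field names in the first place, so the decision reduces to reading one flag.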

georgesittas · 2025-01-30T22:19:21Z

> which is problematic since as you've demonstrated the two forms of struct definition aren't always equivalent in BigQuery.

@sean-rose can you point out how converting the STRUCT<...>(...) into a cast is losing information? As long as the semantics are preserved, this should be fine from SQLGlot's standpoint. I'm in favor of preserving the original syntax, but in this case I think the added complexity is somewhat questionable. Happy to be persuaded otherwise if this conversion is erroneous for whatever reason.

sean-rose · 2025-02-05T17:20:58Z

> @sean-rose can you point out how converting the STRUCT<...>(...) into a cast is losing information? As long as the semantics are preserved, this should be fine from SQLGlot's standpoint. I'm in favor of preserving the original syntax, but in this case I think the added complexity is somewhat questionable. Happy to be persuaded otherwise if this conversion is erroneous for whatever reason.

It's losing the information about whether the original form was a literal CAST() or a type-annotated STRUCT<...>() constructor. Since it appears BigQuery doesn't balk at casting structs like it does at casting arrays, this isn't a dealbreaker, but IMO being able to preserve the original syntax as much as possible during a roundtrip through SQLGlot is worth some added complexity. For example, I've been considering using SQLGlot to replace some existing SQL formatting logic, but I wouldn't want it to arbitrarily change idiomatic BigQuery SQL like STRUCT<...>(...) into CAST(... AS STRUCT<...>).

georgesittas · 2025-02-05T17:48:02Z

I hear you, however I think for this particular case we'll leave it as is, since BigQuery works with either form. I'd suggest overriding the relevant parts of the dialect if this is unacceptable, e.g. if implementing a formatting tool like you mentioned.

fix(bigquery): Improve roundtrip of typed STRUCTs

6488504

VaggelisD mentioned this pull request Jan 30, 2025

fix(bigquery)!: Inline type-annotated ARRAY literals #4671

Merged

VaggelisD closed this Jan 30, 2025

georgesittas deleted the vaggelisd/bq_typed_structs branch January 30, 2025 12:17
