fix(BigQuery): explicitly quote columns in select_star #16822
Codecov Report
@@ Coverage Diff @@
## master #16822 +/- ##
==========================================
+ Coverage 77.00% 77.02% +0.02%
==========================================
Files 1018 1027 +9
Lines 54654 54987 +333
Branches 7454 7454
==========================================
+ Hits 42086 42354 +268
- Misses 12324 12389 +65
Partials 244 244
superset/db_engine_specs/base.py
    if show_cols:
        # Explicitly quote all column names, as BigQuery doesn't quote column
        # names that are also identifiers (eg, "limit") by default.
        fields = [text(quote(col["name"])) for col in cols]
Is this going to affect all DBs? Do you think we should create a property on the DB engine spec like "force_column_quotes"?
Right, it will. It should be safe, and note that we use the same method to quote schemas and tables below.
@villebro thoughts on this?
I would expect the BQ dialect to automatically quote reserved keywords. I tried this quickly on sqlite:
create table bad_colnames("limit" varchar(10), "offset" varchar(10), regular varchar(10));
here's what it rendered:
SELECT "limit",
"offset",
regular
FROM main.bad_colnames
LIMIT 100
OFFSET 0
So I'd rather we submit a PR on the BQ connector to make sure reserved keywords are quoted correctly.
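For anyone who wants to reproduce the check above without Superset, here is a minimal sketch using Python's built-in sqlite3 module. The table and column names come from the comment; the run helper is only there for illustration. It compares an unquoted reference to the limit column with the quoted form.

```python
import sqlite3

# In-memory database with the table from the comment above.
conn = sqlite3.connect(":memory:")
conn.execute(
    'CREATE TABLE bad_colnames '
    '("limit" VARCHAR(10), "offset" VARCHAR(10), regular VARCHAR(10))'
)
conn.execute("INSERT INTO bad_colnames VALUES ('a', 'b', 'c')")


def run(sql):
    """Illustrative helper: return rows, or the error text if SQLite rejects the SQL."""
    try:
        return conn.execute(sql).fetchall()
    except sqlite3.OperationalError as exc:
        return f"error: {exc}"


# Unquoted reserved word: the parser reads LIMIT as a keyword, not a column.
unquoted = run("SELECT limit, regular FROM bad_colnames LIMIT 100 OFFSET 0")

# Quoted column names: the statement parses and returns the row.
quoted = run(
    'SELECT "limit", "offset", regular FROM bad_colnames LIMIT 100 OFFSET 0'
)

print(unquoted)
print(quoted)
```

This only exercises SQLite itself; it says nothing about how the BigQuery dialect decides which names to quote.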
While I agree this is probably quite safe, I think @eschutho's recommendation to add a property for forcing quotes would be a good solution, so as to avoid adding workarounds that aren't necessary for all engines.
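To make the suggestion concrete, here is a rough sketch of what an opt-in flag could look like. Everything in it is hypothetical: the class names, the quote and select_star_fields helpers, and the force_column_quotes attribute itself are invented for illustration and are not Superset's actual API.

```python
from typing import Dict, List


class EngineSpecSketch:
    """Hypothetical stand-in for a base engine spec (not Superset's class)."""

    # Assumed flag from the review suggestion: off by default.
    force_column_quotes = False

    @staticmethod
    def quote(name: str) -> str:
        # ANSI-style double quotes, doubling any embedded quote character.
        return '"' + name.replace('"', '""') + '"'

    @classmethod
    def select_star_fields(cls, cols: List[Dict[str, str]]) -> List[str]:
        names = [col["name"] for col in cols]
        if cls.force_column_quotes:
            return [cls.quote(name) for name in names]
        return names


class BigQuerySketch(EngineSpecSketch):
    """Hypothetical engine that opts in to forced quoting."""

    force_column_quotes = True

    @staticmethod
    def quote(name: str) -> str:
        # Backtick quoting; assumes the name itself contains no backticks.
        return f"`{name}`"


cols = [{"name": "limit"}, {"name": "regular"}]
print(EngineSpecSketch.select_star_fields(cols))
print(BigQuerySketch.select_star_fields(cols))
```

The point of the flag would be that only engines that need the workaround pay for it, while every other engine keeps its current output.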
Ahh, looks like someone changed the behavior of _get_fields in BigQuery: https://github.com/apache/superset/blob/master/superset/db_engine_specs/bigquery.py#L284-L296
This goes back 3+ years: #5655
@classmethod
def _get_fields(cls, cols: List[Dict[str, Any]]) -> List[ColumnClause]:
    """
    BigQuery dialect requires us to not use backtick in the fieldname which are
    nested.
    Using literal_column handles that issue.
    https://docs.sqlalchemy.org/en/latest/core/tutorial.html#using-more-specific-text-with-table-literal-column-and-column
    Also explicitly specifying column names so we don't encounter duplicate
    column names in the result.
    """
    return [
        literal_column(c["name"]).label(c["name"].replace(".", "__")) for c in cols
    ]
This was implemented in 2018 (#5655) and is no longer needed. I tested loading the preview for a table with nested records and it works fine when the _get_fields method is removed; each part gets quoted separately:
Note that the data preview query quotes the parts correctly, even though it fails (for an unrelated reason).
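As a rough illustration of "each part gets quoted separately", here is a small sketch in plain Python. The helper names (quote_parts, label_for, select_list) are made up for this example, and it assumes field names contain no backticks; the "__" label convention is taken from the _get_fields snippet above. The real quoting is done by the SQLAlchemy dialect, not by code like this.

```python
from typing import List


def quote_parts(name: str) -> str:
    """Quote every dot-separated part of a (possibly nested) field name."""
    return ".".join(f"`{part}`" for part in name.split("."))


def label_for(name: str) -> str:
    """Flatten a nested name into a label, as the old _get_fields did."""
    return name.replace(".", "__")


def select_list(names: List[str]) -> str:
    """Build the column list for a preview query (illustrative only)."""
    return ", ".join(
        f"{quote_parts(name)} AS `{label_for(name)}`" for name in names
    )


print(select_list(["limit", "author.name"]))
```

Quoting the whole dotted name as one identifier would instead look for a single column literally called author.name, which is why the parts are handled one at a time.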
Thanks for untangling this! Would it be possible to add a few tests for this behavior? Given how complex this logic is, it would be great to add some simple tests for the basic cases and the currently known special cases covered in the docstring.
This is awesome work @betodealmeida and really sets a new standard for db engine spec tests! LGTM 🎉
* fix (BigQuery): explicitly quote columns in select_star * Fix test * Fix SELECT * in BQ * Add unit tests * Remove type changes (cherry picked from commit c993c58)
SUMMARY
BigQuery is not quoting column names correctly when generating a SELECT * statement. Columns with reserved names are not quoted because the DB engine spec overrides the _get_fields() method and intentionally removes the quotes. This is done because at some point nested fields were not working with backticks (#5655).
I tested previewing data without the custom _get_fields and the SQL is generated correctly today, so I removed the use of literal_column to bring the quoting back.
Additionally, I fixed SELECT * so that arrays of records work correctly. Currently, if you have a column foo that is an array of records (a int, b string) the generated data preview query looks like this:
This fails, because the expanded syntax only works in records, not arrays of records. Eg, in the example I used we have a column author that is a record with fields (name, email, ...). In that case, we want the data preview query to have:
Note that when we have an array of records the pseudo-columns show up in the metadata browser:
But they're not present in the preview.
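The rule described above (expand the fields of a record, but not the fields of an array of records) can be sketched roughly as follows. This is an illustration only: the column metadata shape, the type strings, and the drop_array_pseudo_columns helper are assumptions made for the example, not the code in this PR.

```python
from typing import Dict, List


def drop_array_pseudo_columns(cols: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Drop dotted pseudo-columns whose parent column is an array."""
    # Assumed convention: array columns have a type string starting with "ARRAY".
    array_prefixes = {
        col["name"] for col in cols if col["type"].upper().startswith("ARRAY")
    }
    return [
        col
        for col in cols
        if "." not in col["name"]
        or col["name"].split(".")[0] not in array_prefixes
    ]


# Hypothetical metadata: author is a record, foo is an array of records.
cols = [
    {"name": "author", "type": "RECORD"},
    {"name": "author.name", "type": "STRING"},
    {"name": "author.email", "type": "STRING"},
    {"name": "foo", "type": "ARRAY<RECORD>"},
    {"name": "foo.a", "type": "INTEGER"},
    {"name": "foo.b", "type": "STRING"},
]

kept = [col["name"] for col in drop_array_pseudo_columns(cols)]
print(kept)
```

With this filter the nested fields of author stay available for the preview, while foo.a and foo.b are left out and foo is selected as a whole.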
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
Before we get an error because limit and offset are not quoted.
Note also that LIMIT is applied twice, since the parser is unable to find the limit of the query due to the lack of quoting.
After (the table has no data):
The LIMIT is applied correctly now, the query is sent as:
And runs as:
TESTING INSTRUCTIONS
ADDITIONAL INFORMATION