Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(BigQuery): explicitly quote columns in select_star #16822

Merged
merged 5 commits into from
Oct 6, 2021

Conversation

betodealmeida
Copy link
Member

@betodealmeida betodealmeida commented Sep 23, 2021

SUMMARY

BigQuery is not quoting column names correctly when generating a SELECT * statement. Columns with reserved names are not quoted because the DB engine spec overrides the _get_fields() method and intentionally removes the quotes. This is done because at some point nested fields were not working with backticks (#5655).

I tested previewing data without the custom _get_fields and the SQL is generated correctly today, so I removed the use of literal_column to bring the quoting back. Additionally, I fixed SELECT * so that arrays of records work correctly. Currently, if you have a column foo that is an array of records (a int, b string) the generated data preview query looks like this:

SELECT
  foo,
  foo.a,  -- bad
  foo.b   -- also bad
FROM ...

This fails, because the expanded syntax only works in records, not arrays of records. Eg, in the example I used we have a column author that is a record with fields (name, email, ...). In that case, we want the data preview query to have:

SELECT
  author,  -- maybe this should be removed?
  author.name,
  author.email,
  ...
FROM ...

Note that when we have an array of records the pseudo-columns show up in the metadata browser:

Screenshot 2021-09-27 at 16-44-33  DEV  Superset

But they're not present in the preview.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Before we get an error because limit and offset are not quoted.

Screenshot 2021-09-23 at 15-37-49  DEV  Superset

Note also that LIMIT is applied twice, since the parser is unable to find the limit of the query due to the lack of quoting.

After (the table has no data):

Screenshot 2021-09-23 at 15-41-12  DEV  Superset

The LIMIT is applied correctly now, the query is sent as:

SELECT `limit`,       `offset`FROM `test`.`bad_column_names`LIMIT 100

And runs as:

SELECT `limit`,
 `offset`
FROM `test`.`bad_column_names`
LIMIT 101

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@betodealmeida betodealmeida changed the title fix (BigQuery): explicitly quote columns in select_star fix(BigQuery): explicitly quote columns in select_star Sep 23, 2021
@betodealmeida betodealmeida marked this pull request as ready for review September 23, 2021 22:58
@codecov
Copy link

codecov bot commented Sep 23, 2021

Codecov Report

Merging #16822 (c015e97) into master (4086bed) will increase coverage by 0.02%.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master   #16822      +/-   ##
==========================================
+ Coverage   77.00%   77.02%   +0.02%     
==========================================
  Files        1018     1027       +9     
  Lines       54654    54987     +333     
  Branches     7454     7454              
==========================================
+ Hits        42086    42354     +268     
- Misses      12324    12389      +65     
  Partials      244      244              
Flag Coverage Δ
hive 81.45% <100.00%> (-0.04%) ⬇️
mysql 81.90% <100.00%> (+0.06%) ⬆️
postgres 81.91% <100.00%> (-0.03%) ⬇️
presto 81.78% <100.00%> (+<0.01%) ⬆️
python 82.42% <100.00%> (-0.03%) ⬇️
sqlite 81.58% <100.00%> (+0.06%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/db_engine_specs/bigquery.py 85.97% <100.00%> (-0.87%) ⬇️
superset/result_set.py 98.38% <100.00%> (ø)
superset/db_engine_specs/databricks.py 90.00% <0.00%> (-10.00%) ⬇️
superset/errors.py 94.20% <0.00%> (-5.80%) ⬇️
superset/db_engine_specs/__init__.py 46.15% <0.00%> (-3.85%) ⬇️
superset/db_engine_specs/druid.py 86.27% <0.00%> (-2.62%) ⬇️
superset/exceptions.py 91.26% <0.00%> (-1.92%) ⬇️
superset/views/utils.py 81.81% <0.00%> (-1.76%) ⬇️
superset/dao/base.py 95.12% <0.00%> (-1.76%) ⬇️
... and 38 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4086bed...c015e97. Read the comment docs.

if show_cols:
# Explicitly quote all column names, as BigQuery doesn't quote column
# names that are also identifiers (eg, "limit") by default.
fields = [text(quote(col["name"])) for col in cols]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is this going to affect all dbs? Do you think we should create a property on the db engine spec like "force_column_quotes"?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, it will. It should be safe, and note that we use the same method to quote schemas and tables below.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@villebro thoughts on this?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would expect the BQ dialect to automatically quote reserved keywords. I tried this quickly on sqlite:

create table bad_colnames("limit" varchar(10), "offset" varchar(10), regular varchar(10));

here's what it rendered:

SELECT "limit",
       "offset",
       regular
FROM main.bad_colnames
LIMIT 100
OFFSET 0

So I'd rather we submit a PR on the BQ connector to make sure reserved keywords are quoted correctly.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree this is probably quite safe, I think @eschutho's recommendation to add a property for forcing quotes would be a good solution, so as to avoid adding workarounds that aren't necessary for all engines.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, SQLite does it the right way, BigQuery is the culprit here.

I'll file a ticket upstream and try to make a PR, and incorporate @eschutho's suggestion into this PR so we don't have to wait for a new release upstream.

Thanks, @villebro!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ahh, looks like someone changed the behavior of _get_fields in BigQuery: https://github.com/apache/superset/blob/master/superset/db_engine_specs/bigquery.py#L284-L296

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This goes back 3+ years: #5655

Comment on lines -284 to -296
@classmethod
def _get_fields(cls, cols: List[Dict[str, Any]]) -> List[ColumnClause]:
"""
BigQuery dialect requires us to not use backtick in the fieldname which are
nested.
Using literal_column handles that issue.
https://docs.sqlalchemy.org/en/latest/core/tutorial.html#using-more-specific-text-with-table-literal-column-and-column
Also explicility specifying column names so we don't encounter duplicate
column names in the result.
"""
return [
literal_column(c["name"]).label(c["name"].replace(".", "__")) for c in cols
]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This was implemented in 2018 (#5655) and is no longer needed. I tested loading the preview for a table with nested records and it works fine when the _get_fields method is removed, each part gets quoted separately:

Screenshot 2021-09-27 at 16-40-41  DEV  Superset

Note that the data preview query quotes the parts correctly, even though it fails (for an unrelated reason).

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for untangling this! Would it be possible to add a few tests for this behavior? Given how complex this logic is, it would be great to add some simple tests for the basic cases and the currently known special cases covered in the docstring.

Copy link
Member

@villebro villebro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome work @betodealmeida and really sets a new standard for db engine spec tests! LGTM 🎉

@betodealmeida betodealmeida merged commit c993c58 into apache:master Oct 6, 2021
@betodealmeida betodealmeida mentioned this pull request Oct 7, 2021
9 tasks
@villebro villebro added the v1.4 label Oct 12, 2021
eschutho pushed a commit to preset-io/superset that referenced this pull request Oct 27, 2021
* fix (BigQuery): explicitly quote columns in select_star

* Fix test

* Fix SELECT * in BQ

* Add unit tests

* Remove type changes

(cherry picked from commit c993c58)
opus-42 pushed a commit to opus-42/incubator-superset that referenced this pull request Nov 14, 2021
* fix (BigQuery): explicitly quote columns in select_star

* Fix test

* Fix SELECT * in BQ

* Add unit tests

* Remove type changes
QAlexBall pushed a commit to QAlexBall/superset that referenced this pull request Dec 28, 2021
* fix (BigQuery): explicitly quote columns in select_star

* Fix test

* Fix SELECT * in BQ

* Add unit tests

* Remove type changes
@mistercrunch mistercrunch added 🍒 1.4.0 🍒 1.4.1 🍒 1.4.2 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.5.0 labels Mar 13, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/L v1.4 🍒 1.4.0 🍒 1.4.1 🍒 1.4.2 🚢 1.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants