fix(sqla-query): order by aggregations in Presto and Hive #13739
Codecov Report
@@ Coverage Diff @@
## master #13739 +/- ##
==========================================
- Coverage 77.83% 77.05% -0.79%
==========================================
Files 934 938 +4
Lines 47320 47750 +430
Branches 5913 6039 +126
==========================================
- Hits 36831 36793 -38
- Misses 10346 10814 +468
Partials 143 143
@@ -465,7 +467,7 @@ class SqlaTable( # pylint: disable=too-many-public-methods,too-many-instance-at
     database_id = Column(Integer, ForeignKey("dbs.id"), nullable=False)
     fetch_values_predicate = Column(String(1000))
     owners = relationship(owner_class, secondary=sqlatable_user, backref="tables")
-    database = relationship(
+    database: Database = relationship(
Bycatch: this hint allows quick jump to instance methods/attributes in the IDE.
if db_engine_spec.allows_alias_in_select:
    label = db_engine_spec.make_label_compatible(label_expected)
    sqla_col = sqla_col.label(label)
return sqla_col
Moved to make sure built-in methods are at the top of the file.
- data_["time_grain_sqla"] = grains
+ data_["time_grain_sqla"] = [
+     (g.duration, g.name) for g in self.database.grains() or []
+ ]
Remove indirectness to fix typing.
if isinstance(col, Label) and re.search(
    f"\\b{col.name}\\b", orderby_clause
):
    col.name = f"{col.name}__"
This is the easiest way to resolve column name conflicts. As it turns out, adding table prefix to source columns used in metrics as proposed here is not as easy as it seems, especially when we need to account for free form custom SQL.
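The renaming approach can be sketched outside of SQLAlchemy as follows. This is only an illustration under stated assumptions: `LabeledColumn` and `rename_conflicting_labels` are hypothetical names, not part of Superset, and the sketch returns new objects instead of mutating labels in place as the diff does. It also escapes the alias before building the regex, which the snippet above does not.

```python
import re
from dataclasses import dataclass


@dataclass
class LabeledColumn:
    """Hypothetical stand-in for a labeled SELECT expression."""

    name: str        # alias used in the SELECT clause
    expression: str  # SQL text of the aliased expression


def rename_conflicting_labels(columns, orderby_clause):
    """Append "__" to every alias that appears as a whole word in ORDER BY."""
    renamed = []
    for col in columns:
        # re.escape guards against aliases that contain regex metacharacters
        if re.search(rf"\b{re.escape(col.name)}\b", orderby_clause):
            renamed.append(LabeledColumn(f"{col.name}__", col.expression))
        else:
            renamed.append(col)
    return renamed


columns = [LabeledColumn("name", "name"), LabeledColumn("score", "SUM(score)")]
result = rename_conflicting_labels(columns, "MAX(score) DESC")
print([c.name for c in result])
```

Only the `score` alias is renamed here, because it is the only one that also appears in the ORDER BY text.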
:param select_exprs: all columns in the select clause
:return: columns to be included in the final select clause
"""
return select_exprs
Bycatch: this method is no longer useful after #9954.
Unless users have custom engines that use this method, this cleanup should be safe.
}
else:
    # remove `poll_interval` from databases that do not support it
    extra = {**extra, "engine_params": {}}
Bycatch: this fixes the case where you repeatedly run independent tests with different backends, since the poll_interval param is only accepted by the Presto connector.
response = responses["queries"][0]
assert len(response) == 2
assert response["language"] == "sql"
return response["query"]
Refactor to DRY.
Bump... would appreciate reviews 🙏
LGTM
the same as a source column. In this case, we update the SELECT alias to
another name to avoid the conflict.
"""
if self.database.db_engine_spec.allows_alias_to_source_column:
I recommend making this change by default.
For instance, in the following Postgres query the ORDER BY clause still refers to the original score column. We should try to avoid using the same name for a source column and a column alias:
SELECT name, sum(score) AS score
FROM (
    SELECT 'a' AS name, 4 AS score
    UNION ALL
    SELECT 'b', 5
    UNION ALL
    SELECT 'a', 4
) t
GROUP BY name
ORDER BY max(score) DESC
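As a rough illustration of what renaming the alias buys, the sketch below runs a variant of the query above in which the alias is `score__` instead of `score`, so the `score` inside ORDER BY can only mean the source column. It uses Python's built-in `sqlite3` module purely as a convenient stand-in; SQLite is an assumption here, and its name-resolution rules are not guaranteed to match Postgres, Presto, or Hive.

```python
import sqlite3

# In-memory SQLite database, used only as a stand-in engine for illustration.
conn = sqlite3.connect(":memory:")

query = """
SELECT name, SUM(score) AS score__
FROM (
    SELECT 'a' AS name, 4 AS score
    UNION ALL
    SELECT 'b', 5
    UNION ALL
    SELECT 'a', 4
) t
GROUP BY name
ORDER BY MAX(score) DESC
"""

# Group 'a' has SUM 8 and MAX 4; group 'b' has SUM 5 and MAX 5,
# so sorting by MAX(score) descending puts 'b' first.
rows = conn.execute(query).fetchall()
print(rows)  # [('b', 5), ('a', 8)]
conn.close()
```

With the alias renamed, the sort key and the selected metric are visibly different expressions, which is the ambiguity the reviewer is pointing at.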
The goal is to allow users to run the generated SQL in SQL Lab directly and get the same output, so I'm a little hesitant to change the column alias users provided unless absolutely necessary.
Good work!
A couple of small nits, and it would be great to cover the code paths flagged by Codecov.
}
else:
    # remove `poll_interval` from databases that do not support it
    extra = {**extra, "engine_params": {}}
`extra = {**extra, "engine_params": {}}`: I think it would be better to figure out where poll_interval is set and not set it for the databases that do not support it. The `setup_presto_if_needed` function should not modify other database settings.
It's set by `setup_presto_if_needed`. We only have one database connection for test cases, so when I run a test multiple times with different backends, it updates the config of the one example database connection in Superset. This function has to update that database connection unless we delete and re-add it every time we run a test. `setup_presto_if_needed` shouldn't update settings for other database backends, but it also didn't clean up after itself.
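For reference, a narrower cleanup than resetting `engine_params` wholesale would be to drop only the Presto-specific key. The sketch below is hypothetical: `strip_poll_interval` is not an existing Superset helper, the `engine_params -> connect_args -> poll_interval` nesting is an assumption about how the test fixture stores the value, and `timeout` is an illustrative key only.

```python
import copy


def strip_poll_interval(extra):
    """Return a copy of `extra` with connect_args.poll_interval removed.

    Hypothetical helper; the nested layout is an assumption.
    """
    cleaned = copy.deepcopy(extra)
    # Both lookups fall back to a throwaway dict, so a missing key is a no-op.
    connect_args = cleaned.get("engine_params", {}).get("connect_args", {})
    connect_args.pop("poll_interval", None)
    return cleaned


presto_extra = {
    "engine_params": {"connect_args": {"poll_interval": 0.1, "timeout": 30}}
}
print(strip_poll_interval(presto_extra))
```

This keeps any other engine parameters intact, at the cost of being slightly more code than the one-line reset in the diff.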
SUMMARY
Bugfixes for a couple of engine-specific issues in SQLA generator:
Adding two `DBEngineSpec` attributes to handle these cases. We could potentially just 1) always use random strings for column aliases and 2) always add ORDER BY metrics to SELECT, but adding these flags helps keep the generated SQL clean.
Closes #13228
Closes #13426
BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF
Presto / Before
When a metric has a label named the same as an existing column, adding "Sort by" aggregations on this column will generate invalid Presto queries and throw an error:
The root cause in Presto is prestodb/presto#4698 but it seems Presto has no plan to fix this issue.
Presto / After
The column alias in the SELECT query is updated, but this shouldn't affect column names in charts (including the Data preview panel) because column names are updated post-query anyway.
Hive / Before
Sort-by metrics (aggregation clauses) are created directly in ORDER BY, which doesn't work in Hive. It throws an "Invalid table alias or column reference" error.
Hive / After
Always add ORDER BY expressions to the SELECT clause and reference them by alias in ORDER BY.
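The Hive behavior can be sketched with plain string building. This is a simplified illustration, not the SQLAlchemy-based code in the PR: `build_select` and the `orderby_N` alias scheme are hypothetical names chosen for the example.

```python
def build_select(select_exprs, orderby_exprs, table, group_by):
    """Build a query whose ORDER BY only references SELECT aliases.

    select_exprs:  list of (expression, alias) pairs
    orderby_exprs: list of (expression, ascending) pairs
    """
    select_items = [f"{expr} AS {alias}" for expr, alias in select_exprs]
    order_items = []
    for i, (expr, ascending) in enumerate(orderby_exprs):
        alias = f"orderby_{i}"  # hypothetical alias scheme
        # Move the aggregation into SELECT and sort by its alias instead.
        select_items.append(f"{expr} AS {alias}")
        order_items.append(f"{alias} {'ASC' if ascending else 'DESC'}")
    return (
        f"SELECT {', '.join(select_items)} "
        f"FROM {table} GROUP BY {', '.join(group_by)} "
        f"ORDER BY {', '.join(order_items)}"
    )


sql = build_select(
    select_exprs=[("name", "name"), ("SUM(score)", "total")],
    orderby_exprs=[("MAX(score)", False)],
    table="t",
    group_by=["name"],
)
print(sql)
```

The trade-off is an extra column in the result set, which is why the PR gates this behavior behind an engine-spec flag instead of applying it to every database.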
TEST PLAN
Manually tested with local Presto and Hive clusters. To start your own, run these commands:
Then create new Databases in Superset.
Go to SQL Lab, use this query to create a sample virtual table:
Also added some unit tests.
ADDITIONAL INFORMATION