Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(sqla-query): order by aggregations in Presto and Hive #13739

Merged
merged 6 commits into from
Apr 2, 2021

Conversation

ktmud
Copy link
Member

@ktmud ktmud commented Mar 22, 2021

SUMMARY

Bugfixes for a couple of engine-specific issues in SQLA generator:

  1. Issue 1 in Issues with Sort By on Table viz (for Presto) #13228 (in Presto, SELECT alias cannot be the same as a source column if used in ORDER BY).
  2. ORDER BY metrics in Hive could not reference a metric not used in SELECT.

Adding two DBEngineSpec attributes to handle these cases. We can potentially just: 1) always use random strings for column aliases; 2) always add ORDER BY metrics to SELECT, but adding these flags helps keep the generated SQL clean.

Closes #13228
Closes #13426

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

Presto / Before

When a metric has a label named the same as an existing column, adding "Sort by" aggregations on this column will generate invalid Presto queries and throw an error:

dup-alias-before-query

dup-alias-before

The root cause in Presto is prestodb/presto#4698 but it seems Presto has no plan to fix this issue.

Presto / After

dup-alias-after

dup-alias-after-query

Column alias in the SELECT query was updated but it shouldn't affect column names in charts (including the Data preview panel) because columns are updated post-query anyway.

Hive / Before

hive-before

Sort by metrics (aggregation clause) are directly created in ORDER BY, which doesn't work in Hive. It throws a "Invalid table alias or column reference" error.

Hive / After

hive-after

Always add ORDER BY clauses to select and use aliases to reference them in ORDER BY.

TEST PLAN

Manually tested with local Presto and Hive clusters. To start your own, run these commands:

# start a Presto cluster in the foreground
docker run -d -p 127.0.0.1:17777:8080 --name presto starburstdata/presto

# Start a local Hive cluster in the foreground
docker-compose -f scripts/databases/hive/docker-compose.yml up

Then create new Databases in Superset.

Go to SQL Lab, use this query to create a sample virtual table:

SELECT name, score
FROM (
  SELECT 'aaa' as name, 1.1231 as score
  UNION
  SELECT 'bac', 0.39e-14
  UNION
  SELECT 'aaa', 10e19
) t

Also added some unit tests.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Changes UI
  • Requires DB Migration.
  • Confirm DB Migration upgrade and downgrade tested.
  • Introduces new feature or API
  • Removes existing feature or API

@ktmud ktmud changed the title fix(sqla-query): Presto bug with alias to source column fix(sqla-query): order by aggregations in Presto and Hive Mar 23, 2021
@ktmud ktmud force-pushed the presto-sql-generator branch from 12ce76f to 1928d10 Compare March 23, 2021 20:24
@pull-request-size pull-request-size bot added size/L and removed size/M labels Mar 23, 2021
@ktmud ktmud marked this pull request as ready for review March 23, 2021 20:24
@codecov
Copy link

codecov bot commented Mar 23, 2021

Codecov Report

Merging #13739 (89c7a58) into master (9156f08) will decrease coverage by 0.78%.
The diff coverage is 94.59%.

❗ Current head 89c7a58 differs from pull request most recent head 047d93a. Consider uploading reports for the commit 047d93a to get more accurate results
Impacted file tree graph

@@            Coverage Diff             @@
##           master   #13739      +/-   ##
==========================================
- Coverage   77.83%   77.05%   -0.79%     
==========================================
  Files         934      938       +4     
  Lines       47320    47750     +430     
  Branches     5913     6039     +126     
==========================================
- Hits        36831    36793      -38     
- Misses      10346    10814     +468     
  Partials      143      143              
Flag Coverage Δ
cypress 56.05% <0.00%> (+0.02%) ⬆️
hive 80.26% <84.48%> (+<0.01%) ⬆️
mysql ?
postgres ?
presto 80.28% <91.37%> (+0.01%) ⬆️
python 80.77% <96.55%> (-0.34%) ⬇️
sqlite ?

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
...ontrols/AnnotationLayerControl/AnnotationLayer.jsx 52.40% <ø> (ø)
...mponents/controls/AnnotationLayerControl/index.jsx 82.85% <ø> (ø)
...ponents/controls/DndColumnSelectControl/Option.tsx 0.00% <ø> (ø)
.../controls/DndColumnSelectControl/OptionWrapper.tsx 0.00% <0.00%> (ø)
.../src/explore/components/controls/SelectControl.jsx 94.30% <ø> (+0.55%) ⬆️
.../explore/components/controls/TextControl/index.tsx 83.92% <ø> (-4.45%) ⬇️
...ontrols/DndColumnSelectControl/DndFilterSelect.tsx 6.46% <83.33%> (-2.77%) ⬇️
superset/connectors/sqla/models.py 90.53% <94.87%> (+0.05%) ⬆️
...ontrols/DndColumnSelectControl/DndColumnSelect.tsx 22.58% <100.00%> (ø)
...ontrols/DndColumnSelectControl/DndMetricSelect.tsx 10.31% <100.00%> (ø)
... and 34 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 73a2cc3...047d93a. Read the comment docs.

@ktmud ktmud force-pushed the presto-sql-generator branch from 3d7cbd5 to bfbdf6f Compare March 23, 2021 22:37
@@ -465,7 +467,7 @@ class SqlaTable( # pylint: disable=too-many-public-methods,too-many-instance-at
database_id = Column(Integer, ForeignKey("dbs.id"), nullable=False)
fetch_values_predicate = Column(String(1000))
owners = relationship(owner_class, secondary=sqlatable_user, backref="tables")
database = relationship(
database: Database = relationship(
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bycatch: this hint allows quick jump to instance methods/attributes in the IDE.

if db_engine_spec.allows_alias_in_select:
label = db_engine_spec.make_label_compatible(label_expected)
sqla_col = sqla_col.label(label)
return sqla_col
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moved to make sure built-in methods are at the top of the file.

data_["time_grain_sqla"] = grains
data_["time_grain_sqla"] = [
(g.duration, g.name) for g in self.database.grains() or []
]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove indirectness to fix typing.

if isinstance(col, Label) and re.search(
f"\\b{col.name}\\b", orderby_clause
):
col.name = f"{col.name}__"
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is the easiest way to resolve column name conflicts. As it turns out, adding table prefix to source columns used in metrics as proposed here is not as easy as it seems, especially when we need to account for free form custom SQL.

:param select_exprs: all columns in the select clause
:return: columns to be included in the final select clause
"""
return select_exprs
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bycatch, this methods is not useful anymore after #9954

Unless users have custom engines that use this method, this clean up should be safe.

}
else:
# remove `poll_interval` from databases that do not support it
extra = {**extra, "engine_params": {}}
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bycatch, this fixes the case when you repetitively run independent tests with different backends as poll_interval param is only accepted by Presto connector.

response = responses["queries"][0]
assert len(response) == 2
assert response["language"] == "sql"
return response["query"]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Refactor to DRY.

@ktmud ktmud requested review from villebro, bkyryliuk, betodealmeida and etr2460 and removed request for villebro and bkyryliuk March 24, 2021 03:15
@ktmud ktmud force-pushed the presto-sql-generator branch 7 times, most recently from f2dee34 to 3690a89 Compare March 26, 2021 22:20
@ktmud ktmud added the need:review Requires a code review label Mar 26, 2021
@ktmud
Copy link
Member Author

ktmud commented Mar 29, 2021

@etr2460 @villebro This is ready for review. Do you mind taking a look?

@ktmud ktmud force-pushed the presto-sql-generator branch from 3690a89 to f3149aa Compare March 30, 2021 21:15
@junlincc
Copy link
Member

bump.. appreciate reviews 🙏
@zhaoyongjie @amitmiran137

@zhaoyongjie
Copy link
Member

zhaoyongjie commented Mar 31, 2021

Generally, order by clause is "executed" after the select clause, so the Presto raise an invalid reference error.
I have tested in Presto cluster, LGTM!

image

Copy link
Member

@zhaoyongjie zhaoyongjie left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

the same as a source column. In this case, we update the SELECT alias to
another name to avoid the conflict.
"""
if self.database.db_engine_spec.allows_alias_to_source_column:
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I recommend making this change by default.
for instance, this SQL in Postgres. the 'ORDER BY' clause still refer to the original score column. We should try to avoid the same Column Name and Column Alias

SELECT name, sum(score) as score
FROM (
  SELECT 'a' as name, 4 as score
  UNION ALL
  SELECT 'b', 5
  UNION ALL
  SELECT 'a', 4
) t
GROUP BY name
ORDER by max(score) desc

image

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is to allow users to run the generated SQL in SQL Lab directly and get the same output, so I'm a little hesitant to change change the column alias users provided unless absolutely necessary.

Copy link
Member

@bkyryliuk bkyryliuk left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good work !
couple small nits and it would be great to cover codepath flagged by the codecov

superset/connectors/sqla/models.py Outdated Show resolved Hide resolved
}
else:
# remove `poll_interval` from databases that do not support it
extra = {**extra, "engine_params": {}}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

extra = {**extra, "engine_params": {}} - I think it is better to figure out where poll_interval is set and not to set it for the databases that do not support it.

setup_presto_if_needed function should not modify other database settings

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's set by setup_presto_if_needed. We only have one database connection for test cases, when I want to run a test multiple times with different backend, it updates configs of the one example database connection in Superset. This function has to update this database connection unless we delete and re-add one every time we run a test.

setup_presto_if_needed shouldn't update settings for other database backends, but it also didn't clean up itself.

@ktmud ktmud force-pushed the presto-sql-generator branch from f3149aa to 047d93a Compare April 1, 2021 17:58
@ktmud ktmud merged commit 4789074 into apache:master Apr 2, 2021
@ktmud ktmud deleted the presto-sql-generator branch April 2, 2021 01:10
amitmiran137 pushed a commit that referenced this pull request Apr 2, 2021
* master: (26 commits)
  chore: bump to new superset-ui version (#13932)
  fix: do not run containers as root by default in Helm chart (#13917)
  feat(explore): adhoc column formatting for Table chart (#13758)
  fix(sqla-query): order by aggregations in Presto and Hive (#13739)
  feat(alert/report): add ALERTS_ATTACH_REPORTS feature flags + feature (#13894)
  test: Fixes PropertiesModal_spec (#13548)
  fix: Pin Prophet dependency after breaking changes (#13852)
  test: Adds tests to dnd controls (#13650)
  test: Adds tests to the AnnotationLayer component (#13748)
  test: Refactor and enhance tests for the Explore DatasourcePanel Component (#13799)
  Add tests (#13778)
  test: DisplayQueryButton (#13750)
  Fixing condition around left margin for dashboard layout. Fixes #13863 (#13905)
  Revert "fix: select table overlay (#13694)" (#13901)
  test: Adds tests to the OptionControls component (#13729)
  test: DatasourceControl (#13605)
  tests for function handleScroll (#13896)
  test: Adds tests to the CustomFrame component (#13675)
  test: Adds tests to the AdvancedFrame component (#13664)
  test: DataTableControl (#13668)
  ...
@junlincc junlincc removed the need:review Requires a code review label Apr 2, 2021
lyndsiWilliams pushed a commit to preset-io/superset that referenced this pull request Apr 7, 2021
@junlincc junlincc added the risk:hard-to-test The change will be hard to test label Apr 13, 2021
allanco91 pushed a commit to allanco91/superset that referenced this pull request May 21, 2021
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 1.2.0 labels Mar 12, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels risk:hard-to-test The change will be hard to test size/L 🚢 1.2.0
Projects
None yet
5 participants