Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix: Ensure Presto database engine spec correctly handles Trino #20729

Closed
wants to merge 4 commits into from

Conversation

john-bodley
Copy link
Member

@john-bodley john-bodley commented Jul 16, 2022

SUMMARY

This PR remedies a few issues with Trino—which extends the Presto database engine spec. Specifically it adds a few more safeguards when fetching extra metadata as part of the SQL Lab schema/table workflow.

BEFORE/AFTER SCREENSHOTS OR ANIMATED GIF

TESTING INSTRUCTIONS

CI.

ADDITIONAL INFORMATION

  • Has associated issue:
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@codecov
Copy link

codecov bot commented Jul 16, 2022

Codecov Report

Merging #20729 (d5c5da0) into master (f89ba0c) will decrease coverage by 11.48%.
The diff coverage is 45.83%.

❗ Current head d5c5da0 differs from pull request most recent head 0ea8123. Consider uploading reports for the commit 0ea8123 to get more accurate results

@@             Coverage Diff             @@
##           master   #20729       +/-   ##
===========================================
- Coverage   66.33%   54.85%   -11.49%     
===========================================
  Files        1767     1767               
  Lines       67295    67305       +10     
  Branches     7144     7144               
===========================================
- Hits        44643    36919     -7724     
- Misses      20824    28558     +7734     
  Partials     1828     1828               
Flag Coverage Δ
hive 53.14% <45.83%> (+<0.01%) ⬆️
mysql ?
postgres ?
presto 53.04% <37.50%> (+<0.01%) ⬆️
python 57.77% <45.83%> (-23.70%) ⬇️
sqlite ?
unit 50.47% <37.50%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/db_engine_specs/presto.py 40.87% <38.88%> (-47.11%) ⬇️
superset/db_engine_specs/trino.py 39.39% <66.66%> (-34.02%) ⬇️
superset/utils/dashboard_import_export.py 0.00% <0.00%> (-100.00%) ⬇️
superset/key_value/commands/update.py 0.00% <0.00%> (-88.89%) ⬇️
superset/key_value/commands/delete.py 0.00% <0.00%> (-85.30%) ⬇️
superset/key_value/commands/delete_expired.py 0.00% <0.00%> (-80.77%) ⬇️
superset/dashboards/commands/importers/v0.py 15.62% <0.00%> (-76.25%) ⬇️
superset/datasets/commands/update.py 25.00% <0.00%> (-69.05%) ⬇️
superset/datasets/commands/create.py 29.41% <0.00%> (-68.63%) ⬇️
superset/datasets/commands/importers/v0.py 24.03% <0.00%> (-67.45%) ⬇️
... and 279 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests. Learn more

)

if cols:
full_table_name = table_name
Copy link
Member Author

@john-bodley john-bodley Jul 16, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same logic as previous just indented under the if cols: statement.

Unlike Presto where get_indexes returns [] for a non-partition table, Trino returns [{'name': 'partition', 'column_names': [], 'unique': False}]. Rather than overriding the engine specific normalize_indexes method I though it would be more prudent to make this method more robust given there was already an expectation that there may be no columns associated with the index, i.e., a non-partitioned table.


engine = cls.get_engine(database, schema)
with closing(engine.raw_connection()) as conn:
cursor = conn.cursor()
sql = f"SHOW CREATE VIEW {schema}.{table}"
try:
cls.execute(cursor, sql)
return cls.fetch_data(cursor, 1)[0][0]
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should never have been a pyhive.exc exception to begin with.

@@ -892,19 +892,6 @@ def test_get_create_view_exception(self):
)
schema = "schema"
table = "table"
with self.assertRaises(Exception):
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not entirely sure when/why a broad exception should be raised.

@dungdm93
Copy link
Contributor

dungdm93 commented Jul 22, 2022

@john-bodley Trino using official driver, not PyHive as Presto.
This mean that a lot of functions like get_table_names, get_view_names, handle_cursor, etc... are very different between 2 drivers. Further more, Trino and Presto are now 2 different projects, and going to different direction. So their features will be more and more different.

Tightly couple though 2 engine specs will make it hard to extends and introduce more bugs. Do u think we should revert #20152 to avoid inherit messy code from PrestoEngineSpec.

@john-bodley john-bodley force-pushed the john-bodley--trino-presto branch 3 times, most recently from 29247f7 to 93ec699 Compare August 4, 2022 23:19
@john-bodley
Copy link
Member Author

@dungdm93 I think in the future when Presto and Trino further diverge it makes sense, however at the moment there is a significant amount of the code that is the same and I sense adhering to the DRY principle (where possible makes sense).

@john-bodley john-bodley requested a review from ktmud August 5, 2022 00:25
@john-bodley john-bodley force-pushed the john-bodley--trino-presto branch from 93ec699 to 10ece47 Compare August 5, 2022 00:26
Copy link
Member

@ktmud ktmud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a couple of non-blocking nits

if cols:
full_table_name = table_name
if schema_name and "." not in table_name:
full_table_name = "{}.{}".format(schema_name, table_name)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

f-string?

Copy link
Member Author

@john-bodley john-bodley Aug 5, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ktmud this code is unchanged, i.e., it's now nested under the if cols: section and thus I would prefer not make changes to said code, at least in this PR.

latest_parts = tuple([None] * len(col_names))
metadata["partitions"] = {
"cols": cols,
"latest": dict(zip(col_names, latest_parts)),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I feel we should probably change the signature of latest_partition() itself to return this dict---would worth another PR.

For this code, it seems it can be simplified as:

Suggested change
"latest": dict(zip(col_names, latest_parts)),
"latest": {
col: latest_parts[i] if latest_parts else None
for i, col in enumerate(col_names)
}

@dungdm93
Copy link
Contributor

dungdm93 commented Aug 5, 2022

@john-bodley 2 projects are now different enough to have separated EngineSpec (at least in driver side).
You just save a few line of code by let TrinoEngineSpec inherit messy code from PrestoEngineSpec (and introducing bugs).

at the moment there is a significant amount of the code that is the same

The significant amount of the code you said is used to patch PyHive driver. Trino in the other hand strictly follow DB-API 2, so default implementations from BaseEngineSpec are good enough.
Other utility functions like estimate_statement_cost, query_cost_formatter, etc... may the same. However, those func can be reused by move to an utils.py if u wanna do DRY.

@john-bodley john-bodley force-pushed the john-bodley--trino-presto branch 3 times, most recently from 2d29e64 to c6608a6 Compare August 5, 2022 17:58

except DatabaseError: # not a VIEW
return cls.fetch_data(cursor, 1)[0][0]
except SupersetDBAPIDatabaseError: # not a VIEW
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sadly because we're using a raw connection, i.e., outside of the SQLAlchemy realm, the error comes from the underlying DBAPI, i.e., it's wrapped by SQLAlchemy, and thus we need to define the get_dbapi_exception_mapping mapping.

@john-bodley
Copy link
Member Author

@ktmud thanks for the review. I've actually had to update the logic since your approval per #20729 (comment).

@john-bodley john-bodley force-pushed the john-bodley--trino-presto branch from c6608a6 to 0ea8123 Compare August 5, 2022 18:04
@dungdm93
Copy link
Contributor

dungdm93 commented Aug 6, 2022

@john-bodley an other way to reusing code is the mixing pattern.

@villebro
Copy link
Member

villebro commented Aug 9, 2022

@john-bodley I have to agree with @dungdm93 here - Over the years the Presto spec has caused me considerable grief due to the code being IMO messy and difficult to maintain. For that reason I've been very happy to see the work that @dungdm93 has done on the Trino spec, which IMO is of a great quality and easy follow and extend. So rather than double up on using the old Presto spec code, I'd rather see us gradually phasing it out.

WRT to compatibility of the drivers, I think they're fairly far away from each other right now. For instance, currently clicking on a table in SQL Lab with the recommended trino driver gives off a metadata fetching error (causes a 500 in the backend):
image

This is due to the latest_partition method in the Presto spec being incompatible with the trino driver. There are other similar issues; I'm opening a PR soon to fix query cancellation, which also works slightly differently, but there appear to be other differences, too.

If we really want to share code between the Presto and Trino specs I would recommend doing something similar to what we're doing for Postgres and Postgres-like databases, where we have the uncontroversial shared pieces in an abstract PostgresBaseEngineSpec which the actual implementations, like PostgresEngineSpec, HanaEngineSpec, RedshiftEngineSpec etc use. In this case, we could have a PrestoBaseEngineSpec into which we move the high quality shared code, and then leave the messy bits in the old PrestoEngineSpec, while gradually refactoring them into the shared PrestoBaseEngineSpec, all while adding proper and relevant tests for them.

@dungdm93
Copy link
Contributor

@villebro perhaps latest_partition issue is fixed in trinodb/trino-python-client#219. Waiting for next release.

@john-bodley
Copy link
Member Author

Closing in favor of #21066.

@john-bodley john-bodley closed this Sep 2, 2022
@mistercrunch mistercrunch deleted the john-bodley--trino-presto branch March 26, 2024 16:14
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants