Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(mysql): handle string typed decimal results #24241

Merged
merged 7 commits into from
Sep 29, 2023

Conversation

villebro
Copy link
Member

@villebro villebro commented May 29, 2023

SUMMARY

MySQL returns DECIMAL typed values as strings, rather than Decimal or float. This causes issues in the frontend, as most plugins expect GenericDataType.NUMERIC values to be of type number, not string. To get around this we introduce a new property column_type_mutators on BaseEngineSpec, where type-specific callbacks can be introduced to mutate the results coming back from the database. Here we implement logic on the MySQL spec to change string-based decimals to Decimal, similar to what Postgres already does automatically (I assume most connectors do the same).

The new logic can be expensive for big datasets, due to the need to iterate over the full result set. For this reason the for-loop is only triggered if there are columns present that need to be mutated.

AFTER

Now the Big Number chart works as expected on Trino Decimal types (notice the value is now number in the response payload):
image

BEFORE

Previously Big Number chart would display "No data" due to the data type being string:
image

TESTING INSTRUCTIONS

ADDITIONAL INFORMATION

  • Has associated issue: closes EXPORT_CSV decimal separator format doesn't work for Decimal MySQL column #18218
  • Required feature flags:
  • Changes UI
  • Includes DB Migration (follow approval process in SIP-59)
    • Migration is atomic, supports rollback & is backwards-compatible
    • Confirm DB migration upgrade and downgrade tested
    • Runtime estimates and downtime expectations provided
  • Introduces new feature or API
  • Removes existing feature or API

@codecov
Copy link

codecov bot commented May 29, 2023

Codecov Report

Merging #24241 (043668c) into master (50535e4) will increase coverage by 0.00%.
The diff coverage is 100.00%.

❗ Current head 043668c differs from pull request most recent head 261baf4. Consider uploading reports for the commit 261baf4 to get more accurate results

@@           Coverage Diff           @@
##           master   #24241   +/-   ##
=======================================
  Coverage   68.30%   68.31%           
=======================================
  Files        1957     1957           
  Lines       75594    75611   +17     
  Branches     8224     8224           
=======================================
+ Hits        51637    51654   +17     
  Misses      21849    21849           
  Partials     2108     2108           
Flag Coverage Δ
hive 53.37% <64.28%> (+<0.01%) ⬆️
mysql 78.91% <75.00%> (-0.01%) ⬇️
postgres 78.98% <75.00%> (-0.01%) ⬇️
presto 53.29% <64.28%> (+<0.01%) ⬆️
python 82.82% <100.00%> (+<0.01%) ⬆️
sqlite 77.52% <75.00%> (-0.01%) ⬇️
unit 53.44% <100.00%> (+0.02%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Impacted Files Coverage Δ
superset/db_engine_specs/base.py 90.96% <100.00%> (+0.15%) ⬆️
superset/db_engine_specs/mysql.py 98.85% <100.00%> (+0.02%) ⬆️
superset/db_engine_specs/trino.py 84.12% <100.00%> (+0.38%) ⬆️

📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more

Comment on lines 378 to 380
[["1.23456", "abc"]],
[("dec", "decimal(12,6)"), ("str", "varchar(3)")],
[(Decimal("1.23456"), "abc")],
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice how we're changing from list[list[Any]] to list[tuple[Any, ...]] here - this is because this is what SQLA drivers should return. This will save us from having to do the conversion later here:

# only do expensive recasting if datatype is not standard list of tuples
if data and (not isinstance(data, list) or not isinstance(data[0], tuple)):
data = [tuple(row) for row in data]

for row in description
if (
func := cls.column_type_mutators.get(
type(cls.get_sqla_column_type(cls.get_datatype(row[1])))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we add a comment that explains the logic here? This dict comprehension might be hard to understand without the context of this PR

Copy link
Member

@kgabryje kgabryje left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM + a small nitpick about adding comments

)
)
}
if column_type_mutators:
Copy link
Member

@zhaoyongjie zhaoyongjie May 29, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Typically decimal is a fixed point data structure but native float in Python is not fixed point data, so some drivers return a string to represent fixed point. column_type_mutators adds some extra transformation on the decimal column so the query performance might become lower than before.

From the screenshot before and after, I saw the coltypes field changed from 1 -> 0, do we have some other approach which that just change the coltypes instead to mutate entire values of column?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch about the coltypes, let me investigate why that's gone missing.

Wrt to the query performance hit, another option is to make the frontend more resilient to receiving various types. In this case we could cast the value to number in the frontend if the coltype is GenericDataType.NUMERIC. However, I'd prefer to do this type of transformation in the backend to ensure that the plugins don't have to deal with all the nuances that SQLAlchemy throws our way.

I'd be curious to hear other peoples thoughts, @michael-s-molina @john-bodley any thoughts here?

Copy link
Member Author

@villebro villebro May 30, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It hit me, that without this change, the linked issue won't be fixed, as CSV export happens purely in the backend. With this new pattern we could even normalize non-standard timestamps early in the flow, eliminating the need for doing it in the frontend, further decoupling the quirks of the analytical databases from the downstream logic.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Btw, I just noticed we already have logic like this in the codebase:

# Peek at the schema to determine which column values, if any,
# require sanitization.
columns_to_sanitize: List[PlacedSanitizeFunc] = _find_columns_to_sanitize(
cursor
)

@john-bodley
Copy link
Member

john-bodley commented May 30, 2023

@villebro for Trino the DECIMAL type is no longer experimental per trinodb/trino-python-client#308, i.e., now by default these types use the native Python type.

Furthermore, regarding MySQL, doesn't the handling of the DECIMAL depend on how the various dialects handle this?

@villebro
Copy link
Member Author

@villebro for Trino the DECIMAL type is no longer experimental per trinodb/trino-python-client#308, i.e., now by default these types use the native Python type.

Oh nice, thanks for the heads up. Let me confirm that by bumping the version 👍 I notice the original PR was changing other return types, too, have we confirmed that Superset handles them well?

Furthermore, regarding MySQL, doesn't the handling of the DECIMAL depend on how the various dialects handle this?

That may well be - let me check with the author of the issue which dialect they were using.

@john-bodley
Copy link
Member

@villebro in #24069 I bumped the version of python-trino-client to the latest version somewhat recently, i.e., within the past couple of weeks.

@villebro
Copy link
Member Author

villebro commented Jun 1, 2023

@villebro in #24069 I bumped the version of python-trino-client to the latest version somewhat recently, i.e., within the past couple of weeks.

Oh nice, I didn't know it had this many improvements. Let me also check if those pesky non-standard timestamp strings are now replaced with proper datetimes that I had to work around in #23339

@john-bodley
Copy link
Member

@villebro do you plan on resurrecting this PR?

@villebro
Copy link
Member Author

villebro commented Aug 1, 2023

@john-bodley I do, if there's still apetite for it. But as far as Trino is concerned, this doesn't seem needed anymore. So not sure if we want to introduce this additional complexity if it's not really needed.

@villebro villebro changed the title fix(trino,mysql): handle string typed decimal results fix(mysql): handle string typed decimal results Sep 29, 2023
@pull-request-size pull-request-size bot added size/M and removed size/L labels Sep 29, 2023
@villebro
Copy link
Member Author

FYI @john-bodley I removed the Trino logic as it's now redundant, but for MySQL this does seem to be needed. I'm going to follow-up with the author of the Ocient connector to see if we can migrate their custom mutator logic to this more standardized model.

@villebro villebro merged commit 7eab59a into apache:master Sep 29, 2023
31 checks passed
@villebro villebro deleted the villebro/fix-decimal-type branch September 29, 2023 17:50
@michael-s-molina michael-s-molina added the v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch label Oct 2, 2023
michael-s-molina pushed a commit that referenced this pull request Oct 2, 2023
saghatelian added a commit to 10webio/superset that referenced this pull request Oct 23, 2023
* fix: is_select with UNION (apache#25290)

(cherry picked from commit bb002d6)

* fix: Add explicit ON DELETE CASCADE for dashboard_roles (apache#25320)

(cherry picked from commit d54e827)

* fix(chart): Supporting custom SQL as temporal x-axis column with filter (apache#25126)

Co-authored-by: Kamil Gabryjelski <kamil.gabryjelski@gmail.com>

* fix: Use RLS clause instead of ID for cache key (apache#25229)

(cherry picked from commit fba66c6)

* fix: Improve the reliability of alerts & reports (apache#25239)

(cherry picked from commit f672d5d)

* fix: DashboardRoles cascade operation (apache#25349)

(cherry picked from commit a971a28)

* fix: datetime with timezone excel export (apache#25318)

Co-authored-by: Michael S. Molina <70410625+michael-s-molina@users.noreply.github.com>
(cherry picked from commit 5ebcd2a)

* fix: Workaround for Cypress ECONNRESET error (apache#25399)

(cherry picked from commit d76ff39)

* fix(sqllab): invalid persisted tab state (apache#25308) (apache#25398)

* fix: Rename on_delete parameter to ondelete (apache#25424)

(cherry picked from commit 893b45f)

* fix: preventing save button from flickering in SQL Lab (apache#25106)

(cherry picked from commit 296ff17)

* fix: chart import (apache#25425)

(cherry picked from commit a4d8f36)

* fix: swagger UI CSP error (apache#25368)

(cherry picked from commit 1716b9f)

* fix: smarter date formatter (apache#25404)

(cherry picked from commit f0080f9)

* fix(sqllab): invalid start date (apache#25437)

* fix(nativeFilters): Speed up native filters by removing unnecessary rerenders (apache#25282)

Co-authored-by: JUST.in DO IT <justin.park@airbnb.com>
(cherry picked from commit a0eeb4d)

* fix(SqlLab): make icon placement even (apache#25372)

(cherry picked from commit 11b49a6)

* fix: Duplicate items when pasting into Select (apache#25447)

(cherry picked from commit 7cf96cd)

* fix: update the SQLAlchemy model definition at json column for Log table (apache#25445)

(cherry picked from commit e83a76a)

* fix(helm chart): set chart appVersion to 3.0.0 (apache#25373)

* fix(mysql): handle string typed decimal results (apache#24241)

(cherry picked from commit 7eab59a)

* fix: Styles not loading because of faulty CSP setting (apache#25468)

(cherry picked from commit 0cebffd)

* fix(sqllab): error with lazy_gettext for tab titles (apache#25469)

(cherry picked from commit ddde178)

* fix: Address Mypy issue which is causing CI to fail (apache#25494)

(cherry picked from commit 36ed617)

* chore: Adds 3.0.1 CHANGELOG

* fix: Unable to sync columns when database or dataset name contains `+` (apache#25390)

(cherry picked from commit dbe0838)

* fix(sqllab): Broken query containing 'children' (apache#25490)

(cherry picked from commit b92957e)

* chore: Expand error detail on screencapture (apache#25519)

(cherry picked from commit ba541e8)

* fix: tags permissions error message (apache#25516)

(cherry picked from commit 50b0816)

* fix: Apply normalization to all dttm columns (apache#25147)

(cherry picked from commit 58fcd29)

* fix: REST API CSRF exempt list (apache#25590)

(cherry picked from commit 549abb5)

* fix(RLS): Fix Info Tooltip + Button Alignment on RLS Modal (apache#25400)

(cherry picked from commit a6d0e6f)

* fix: thubmnails loading - Talisman default config (apache#25486)

(cherry picked from commit 52f631a)

* fix(Presto): catch DatabaseError when testing Presto views (apache#25559)

Co-authored-by: Rui Zhao <zhaorui@dropbox.com>
(cherry picked from commit be3714e)

* fix(Charts): Set max row limit + removed the option to use an empty row limit value (apache#25579)

(cherry picked from commit f556ef5)

* fix(window): unavailable localStorage and sessionStorage (apache#25599)

* fix: finestTemporalGrainFormatter (apache#25618)

(cherry picked from commit 62bffaf)

* fix: revert fix(sqllab): Force trino client async execution (apache#24859) (apache#25541)

(cherry picked from commit e56e0de)

* chore: Updates 3.0.1 CHANGELOG

* fix(sqllab): Mistitled for new tab after rename (apache#25523)

(cherry picked from commit a520124)

* fix(sqllab): template validation error within comments (apache#25626)

(cherry picked from commit b370c66)

* fix: avoid 500 errors with SQLLAB_BACKEND_PERSISTENCE (apache#25553)

(cherry picked from commit 99f79f5)

* fix(import): Make sure query context is overwritten for overwriting imports (apache#25493)

(cherry picked from commit a0a0d80)

* fix: permalink save/overwrites in explore (apache#25112)

Co-authored-by: Elizabeth Thompson <eschutho@gmail.com>
(cherry picked from commit e58a3ab)

* fix(header navlinks): link navlinks to path prefix (apache#25495)

(cherry picked from commit 51c56dd)

* fix: improve upload ZIP file validation (apache#25658)

* fix: warning of nth-child (apache#23638)

(cherry picked from commit 16cc089)

* fix(dremio): Fixes issue with Dremio SQL generation for Charts with Series Limit (apache#25657)

(cherry picked from commit be82657)

---------

Co-authored-by: Beto Dealmeida <roberto@dealmeida.net>
Co-authored-by: John Bodley <4567245+john-bodley@users.noreply.github.com>
Co-authored-by: Zef Lin <zef@preset.io>
Co-authored-by: Kamil Gabryjelski <kamil.gabryjelski@gmail.com>
Co-authored-by: Jack Fragassi <jfragassi98@gmail.com>
Co-authored-by: Michael S. Molina <70410625+michael-s-molina@users.noreply.github.com>
Co-authored-by: JUST.in DO IT <justin.park@airbnb.com>
Co-authored-by: Jack <41238731+fisjac@users.noreply.github.com>
Co-authored-by: Daniel Vaz Gaspar <danielvazgaspar@gmail.com>
Co-authored-by: Stepan <66589759+Always-prog@users.noreply.github.com>
Co-authored-by: Corbin Bullard <corbindbullard@gmail.com>
Co-authored-by: Gyuil Han <cnabro91@gmail.com>
Co-authored-by: Celalettin Calis <celalettin1286@gmail.com>
Co-authored-by: Ville Brofeldt <33317356+villebro@users.noreply.github.com>
Co-authored-by: ʈᵃᵢ <tdupreetan@gmail.com>
Co-authored-by: Michael S. Molina <michael.s.molina@gmail.com>
Co-authored-by: mapledan <mapledan829@gmail.com>
Co-authored-by: Igor Khrol <khroliz@gmail.com>
Co-authored-by: Rui Zhao <105950525+zhaorui2022@users.noreply.github.com>
Co-authored-by: Fabien <18534166+frassinier@users.noreply.github.com>
Co-authored-by: Hugh A. Miles II <hughmil3s@gmail.com>
Co-authored-by: OskarNS <soerensen.oskar@gmail.com>
cccs-rc pushed a commit to CybercentreCanada/superset that referenced this pull request Mar 6, 2024
@mistercrunch mistercrunch added 🍒 3.0.1 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 3.1.0 labels Mar 8, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels size/M v3.0 Label added by the release manager to track PRs to be included in the 3.0 branch 🍒 3.0.1 🍒 3.0.2 🍒 3.0.3 🍒 3.0.4 🚢 3.1.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

EXPORT_CSV decimal separator format doesn't work for Decimal MySQL column
6 participants