Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sql] Correct SQL parameter formatting #5178

Merged
merged 1 commit into from
Jul 21, 2018

Conversation

john-bodley
Copy link
Member

@john-bodley john-bodley commented Jun 12, 2018

TL;DR This PR removes the need for double escaping the percent character both in SQL Lab and explorer and ensures both approaches interface with the SQLA engine in a consistent manner.

This PR came about due to an error when exporting a SQL Lab result to CSV. For example in Presto one can successfully run the following query:

SELECT '%'

which executes directly via the SQLAlchemy engine, however an error was thrown when trying to export to CSV which uses the Pandas interface where the parameters argument is {} which causes an exception in PyHive which uses the pyformat SQL parameterstyle. In essence,

SELECT '%' %  {}

is invalid and thus one would need to escape the % character, i.e.,

SELECT '%%' %  {}

The reason for the error is the Pandas SQL uses the engine connection (which doesn't expose the execute arguments) whereas SQL Lab uses a raw connection (which exposes the execute arguments).

We should ensure that both SQL Lab and explore use the same interface, which means that either i) the compiled SQL statements include escaping of % to satisfy pyformat and enforce the parameter args to be either () or {}, or ii) the compiled SQL statements remove all escaping and enforce the parameter args to be None. Note in both cases one can copy-and-paste explorer queries directly in SQL Lab, though this PR implements the second option as i) one doesn't need to escape the % character for the appropriate dialects, and ii) the raw SQL is what one would run in the native DB CLI.

The fix is simply to bypass the Pandas API for querying and mimic the logic that SQL Lab uses which ensures that for presto parameters is None and thus it will not re-format the string.

Note this PR also fixes the Hive and MySQL engine specs inline with their API, i.e., both Hive and MySQL explicitly test whether the parameters and args respectively are None (rather than doing a boolean check).

This means for MySQL in SQL Lab one can run queries of the form,

SELECT * FROM table WHERE column LIKE '%foo%'

as opposed to:

SELECT * FROM table WHERE column LIKE '%%foo%%'

since previously the MySQL engine spec defined args to {} causing the string to be formatted with the empty dictionary args.

Closes #4098, #4857

to: @betodealmeida @michellethomas @mistercrunch
cc: @graceguo-supercat @timifasubaa

@john-bodley john-bodley changed the title John bodley sql percent sign [get_df] Correct SQL parameter formatting Jun 12, 2018
@john-bodley john-bodley changed the title [get_df] Correct SQL parameter formatting [wip][sql] Correct SQL parameter formatting Jun 12, 2018
@mistercrunch
Copy link
Member

mistercrunch commented Jun 12, 2018

Very nice. Happy to see this as read_sql has been getting in the way of getting consistency across SQL Lab & charts.

@john-bodley john-bodley force-pushed the john-bodley-sql-percent-sign branch 5 times, most recently from 94bc031 to 3ce3994 Compare June 12, 2018 18:55
@john-bodley john-bodley changed the title [wip][sql] Correct SQL parameter formatting [sql] Correct SQL parameter formatting Jun 12, 2018
@codecov-io
Copy link

codecov-io commented Jun 12, 2018

Codecov Report

Merging #5178 into master will increase coverage by 0.05%.
The diff coverage is 91.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #5178      +/-   ##
==========================================
+ Coverage   60.71%   60.77%   +0.05%     
==========================================
  Files         258      258              
  Lines       19701    19707       +6     
  Branches     1970     1970              
==========================================
+ Hits        11962    11977      +15     
+ Misses       7730     7721       -9     
  Partials        9        9
Impacted Files Coverage Δ
superset/sql_lab.py 75.4% <100%> (ø) ⬆️
superset/connectors/sqla/models.py 78% <100%> (+0.7%) ⬆️
superset/db_engine_specs.py 53.85% <80%> (+0.97%) ⬆️
superset/models/core.py 86.71% <94.11%> (+0.14%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d483ed1...eeebcb0. Read the comment docs.

with closing(engine.raw_connection()) as conn:
with closing(conn.cursor()) as cursor:
for sql in sqls:
cursor.execute(sql, **self.db_engine_spec.cursor_execute_kwargs)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mistercrunch my one concern is for Hive and whether running async=True (as defined by the engine spec) makes sense here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm yeah looks like this won't work. The async=True is to get the progress bar. Some options:

  • Make db_engine_spec.handle_cursor broader as in an execute method that returns the last recordset or dataframe. This means moving a fair amount of logic from SQL Lab and explore into db_engine_spec. It's probably the right thing to do.
  • hacks around intercepting and disabling async when in explore/dashboard, but not in SQL Lab. That's more hacky but less disruptive

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I originally was thinking of wrapping the execute method, though one still needs to decipher when to run in sync (explorer) vs. async (SQL Lab) mode.

@john-bodley john-bodley force-pushed the john-bodley-sql-percent-sign branch 6 times, most recently from 5168c7e to 87c205e Compare June 17, 2018 03:49
@john-bodley
Copy link
Member Author

@mistercrunch I address your comment regarding the async logic for SQL Lab vs. explorer. Would you mind taking a look?

for sql in sqls:
self.db_engine_spec.execute(cursor, sql)
df = pd.DataFrame.from_records(
data=list(cursor.fetchall()),
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This LGTM. There's a bit of complexity/differences still left around the way explore vs sqllab get their data. I think the main difference is around Hive, the async param and whether handle_cursor is called or not (called only in SQL Lab).

This PR is a step in the right direction. Maybe eventually we want to make bring more into db_engine_spec, and have a get_records that receives update_process=False that internally would deal with async.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@mistercrunch do you want me to makes these changes here, or should these issues be addressed in a future PR?

@@ -36,7 +36,7 @@ setenv =
SUPERSET_CONFIG = tests.superset_test_config
SUPERSET_HOME = {envtmpdir}
py27-mysql: SUPERSET__SQLALCHEMY_DATABASE_URI = mysql://mysqluser:mysqluserpassword@localhost/superset?charset=utf8
py34-mysql: SUPERSET__SQLALCHEMY_DATABASE_URI = mysql://mysqluser:mysqluserpassword@localhost/superset
py{34,36}-mysql: SUPERSET__SQLALCHEMY_DATABASE_URI = mysql://mysqluser:mysqluserpassword@localhost/superset
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Side note, we may be able to drop 3.4 support soon. While 2.7 should have a long tail of support, I don't think it's as needed with 3.4

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This PR just adds the missing Python 3.6 version which I was using for testing locally. Note we've kind of already dropped 3.4 per here. I'll create a new PR to remove reference to it in Tox.

@john-bodley
Copy link
Member Author

Note this PR is gated by #5206 as I didn't want to have to mutate multiple columns in the migration script.

@john-bodley
Copy link
Member Author

@mistercrunch are you ok if I merge this PR?

@mistercrunch
Copy link
Member

mistercrunch commented Jul 20, 2018

Yup @john-bodley go for it.

@john-bodley john-bodley merged commit 7fcc2af into apache:master Jul 21, 2018
@john-bodley john-bodley deleted the john-bodley-sql-percent-sign branch July 21, 2018 19:01
timifasubaa pushed a commit to airbnb/superset-fork that referenced this pull request Jul 25, 2018
graceguo-supercat pushed a commit to graceguo-supercat/superset that referenced this pull request Jul 26, 2018
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0 labels Feb 27, 2024
@betodealmeida betodealmeida mentioned this pull request Apr 29, 2024
9 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.28.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

About data_format problem such as '%%Y-%%m-%%d' in SQL editor and Visualize
3 participants