Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[sql lab] a better approach at limiting queries #4947

Merged
merged 5 commits into from
May 14, 2018

Conversation

mistercrunch
Copy link
Member

Currently there are two mechanisms that we use to enforce the row
limiting constraints, depending on the database engine:

  1. use dbapi's cursor.fetchmany()
  2. wrap the SQL into a limiting subquery

Method 1 isn't great as it can result in the database server storing
larger than required result sets in memory expecting another fetch
command while we know we don't need that.

Method 2 has a positive side of working with all database engines,
whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy
does the work as specified for the dialect. On the downside though
the query optimizer might not be able to optimize this as much as an
approach that doesn't use a subquery.

Since most modern DBs use the LIMIT syntax, this adds a regex approach
to modify the query and force a LIMIT clause without using a subquery
for the database that support this syntax and uses method 2 for all
others.

Currently there are two mechanisms that we use to enforce the row
limiting constraints, depending on the database engine:
1. use dbapi's `cursor.fetchmany()`
2. wrap the SQL into a limiting subquery

Method 1 isn't great as it can result in the database server storing
larger than required result sets in memory expecting another fetch
command while we know we don't need that.

Method 2 has a positive side of working with all database engines,
whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy
does the work as specified for the dialect. On the downside though
the query optimizer might not be able to optimize this as much as an
approach that doesn't use a subquery.

Since most modern DBs use the LIMIT syntax, this adds a regex approach
to modify the query and force a LIMIT clause without using a subquery
for the database that support this syntax and uses method 2 for all
others.
@codecov-io
Copy link

codecov-io commented May 10, 2018

Codecov Report

Merging #4947 into master will increase coverage by 0.02%.
The diff coverage is 90.9%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master    #4947      +/-   ##
==========================================
+ Coverage    77.1%   77.13%   +0.02%     
==========================================
  Files          44       44              
  Lines        8644     8649       +5     
==========================================
+ Hits         6665     6671       +6     
+ Misses       1979     1978       -1
Impacted Files Coverage Δ
superset/sql_parse.py 98.79% <100%> (-0.06%) ⬇️
superset/models/core.py 86.47% <100%> (-0.08%) ⬇️
superset/sql_lab.py 75.27% <100%> (+0.95%) ⬆️
superset/db_engine_specs.py 53.74% <84.61%> (+0.87%) ⬆️

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update b391676...706d860. Read the comment docs.

Copy link
Member

@betodealmeida betodealmeida left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, but I have a few comments.

)
return database.compile_sqla_query(qry)
elif LimitMethod.FORCE_LIMIT:
no_limit = re.sub(r'(?i)\s+LIMIT\s+\d+;?(\s|;)*$', '', sql)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Scary! :-O

BTW, you can use backreferences here to replace the LIMIT number in one pass (untested):

sql = re.sub(r'(?i)(.*\s+LIMIT\s+)\d+\s*;?\s*$', r'\1 {0}'.format(limit), sql)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I barely understand what I wrote here :/

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some people, when confronted with a problem, think "I know, I'll use regular expressions." Now they have two problems.

Yeah, I just realized my example does not work when there's no LIMIT clause. :-/

You know about verbose regexes? Eg:

no_limit = re.sub(r"""
    (?ix)        # case insensitive, verbose
    \s+          # whitespace
    LIMIT\s+\d+  # LIMIT $ROWS
    ;?           # optional semi-colon
    (\s|;)*$     # any number of trailing whitespace or semi-colons, til the end
""", '', sql)

Though in this case I think it's pretty straightforward. You might wanna rstrip whitespace and semicolons from the end of sql, that would simplify the regex, no?

).limit(limit)
)
return self.compile_sqla_query(qry)
return self.db_engine_spec.apply_limit_to_sql(sql, limit, self)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Better update the method name here.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I mean, rename wrap_sql_limit to something else.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

went with apply_limit_to_sql

db = Database(sqlalchemy_uri='mysql://localhost')
limited = MySQLEngineSpec.apply_limit_to_sql(sql, 1000, db)
expected = 'SELECT * FROM (SELECT * FROM a LIMIT 10) LIMIT 1000'
self.assertEquals(expected, limited)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm nervous about the regular expression, I'd add a few more unit tests covering some weird cases, eg:

SELECT
    'LIMIT 1000' AS a
  , b
FROM
    table
LIMIT
    1000
<EOF>

And maybe some stuff with extra space at the end:

...
LIMIT      1000  ;    
  <EOF>

@mistercrunch
Copy link
Member Author

Addressed comments, merging

@john-bodley
Copy link
Member

john-bodley commented May 19, 2018

@betodealmeida @mistercrunch are you sure we want to be augmenting a query that someone authors? This seems overly magical to me, i.e., if you believe you're running the SELECT * FROM schema.table query under the covers you're really running SELECT * FROM schema.table LIMIT n. This can be somewhat confusing if one tries to download and analyze a CSV file assuming all the rows are there or if one was trying to track said query using the audit logs etc.

SQL is a powerful language and it seems like educating users about how best to use SQL rather than adding somewhat opaque handrails.

michellethomas pushed a commit to michellethomas/panoramix that referenced this pull request May 24, 2018
* [sql lab] a better approach at limiting queries

Currently there are two mechanisms that we use to enforce the row
limiting constraints, depending on the database engine:
1. use dbapi's `cursor.fetchmany()`
2. wrap the SQL into a limiting subquery

Method 1 isn't great as it can result in the database server storing
larger than required result sets in memory expecting another fetch
command while we know we don't need that.

Method 2 has a positive side of working with all database engines,
whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy
does the work as specified for the dialect. On the downside though
the query optimizer might not be able to optimize this as much as an
approach that doesn't use a subquery.

Since most modern DBs use the LIMIT syntax, this adds a regex approach
to modify the query and force a LIMIT clause without using a subquery
for the database that support this syntax and uses method 2 for all
others.

* Fixing build

* Fix lint

* Added more tests

* Fix tests
@mistercrunch
Copy link
Member Author

I think we agree that we need a way to prevent users from running infinite billion rows query that will hurt the db server, network, web server and ultimately crash their browser.

Now we've been doing this two ways depending on the DB engine, fetchmany, which I think puts too much strain on the database server. It's rough on Presto/Hive for instance and in many cases is a similar internal cost as running the whole thing from the db's perspective. Then there's the wrap-in-a-subquery approach, that's better and prefer that to fetchmany, though I think many db optimizer won't do as good with that as with the LIMIT in the main core query. It requires more complex query editing but is preferable to the subquery approach as it may result in a simpler query plan.

I also looked at how other tools seem to be doing this, and they edit the user query much like we do here, likely for the reasons listed above.

timifasubaa pushed a commit to timifasubaa/incubator-superset that referenced this pull request May 31, 2018
* [sql lab] a better approach at limiting queries

Currently there are two mechanisms that we use to enforce the row
limiting constraints, depending on the database engine:
1. use dbapi's `cursor.fetchmany()`
2. wrap the SQL into a limiting subquery

Method 1 isn't great as it can result in the database server storing
larger than required result sets in memory expecting another fetch
command while we know we don't need that.

Method 2 has a positive side of working with all database engines,
whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy
does the work as specified for the dialect. On the downside though
the query optimizer might not be able to optimize this as much as an
approach that doesn't use a subquery.

Since most modern DBs use the LIMIT syntax, this adds a regex approach
to modify the query and force a LIMIT clause without using a subquery
for the database that support this syntax and uses method 2 for all
others.

* Fixing build

* Fix lint

* Added more tests

* Fix tests
timifasubaa pushed a commit to airbnb/superset-fork that referenced this pull request Jul 25, 2018
* [sql lab] a better approach at limiting queries

Currently there are two mechanisms that we use to enforce the row
limiting constraints, depending on the database engine:
1. use dbapi's `cursor.fetchmany()`
2. wrap the SQL into a limiting subquery

Method 1 isn't great as it can result in the database server storing
larger than required result sets in memory expecting another fetch
command while we know we don't need that.

Method 2 has a positive side of working with all database engines,
whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy
does the work as specified for the dialect. On the downside though
the query optimizer might not be able to optimize this as much as an
approach that doesn't use a subquery.

Since most modern DBs use the LIMIT syntax, this adds a regex approach
to modify the query and force a LIMIT clause without using a subquery
for the database that support this syntax and uses method 2 for all
others.

* Fixing build

* Fix lint

* Added more tests

* Fix tests
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
* [sql lab] a better approach at limiting queries

Currently there are two mechanisms that we use to enforce the row
limiting constraints, depending on the database engine:
1. use dbapi's `cursor.fetchmany()`
2. wrap the SQL into a limiting subquery

Method 1 isn't great as it can result in the database server storing
larger than required result sets in memory expecting another fetch
command while we know we don't need that.

Method 2 has a positive side of working with all database engines,
whether they use LIMIT, ROWNUM, TOP or whatever else since sqlalchemy
does the work as specified for the dialect. On the downside though
the query optimizer might not be able to optimize this as much as an
approach that doesn't use a subquery.

Since most modern DBs use the LIMIT syntax, this adds a regex approach
to modify the query and force a LIMIT clause without using a subquery
for the database that support this syntax and uses method 2 for all
others.

* Fixing build

* Fix lint

* Added more tests

* Fix tests
@mistercrunch mistercrunch added 🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.26.0 labels Feb 27, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
🏷️ bot A label used by `supersetbot` to keep track of which PR where auto-tagged with release labels 🚢 0.26.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants