Fetch a batch of rows from bigquery #5632

Merged
merged 3 commits into from
Aug 15, 2018

Contributor

@sumedhsakdeo sumedhsakdeo commented Aug 14, 2018

While running Superset with Google BigQuery as the database, I found that queries were very slow: fetching 1000 rows took about 2 minutes. On further investigation I found that, the way our cursor is configured, it makes a REST API call for every row fetched instead of one API call per batch of rows. pybigquery handles this batch-fetch configuration, but that does not carry over to the cursor obtained from the raw_connection used in Superset.

After my change, a query fetching 1000 rows from BigQuery takes ~1.5 seconds in Superset, down from ~120 seconds, which is on par with the query runtime when run from the BigQuery query editor.
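To illustrate why `arraysize` matters here, below is a minimal sketch that models a cursor whose rows come from a paged remote API. `FakePagedCursor` and `count_calls` are hypothetical names invented for this illustration; the real BigQuery DB-API cursor is implemented differently, and this only mimics the behavior described above (one round trip per page of `arraysize` rows).

```python
class FakePagedCursor:
    """Hypothetical stand-in for a DB-API cursor backed by a paged REST API."""

    def __init__(self, total_rows):
        self._remaining = list(range(total_rows))
        self._buffer = []
        self.arraysize = 1  # assumed default: one row per page
        self.api_calls = 0

    def _fetch_page(self):
        # One simulated REST round trip returns up to `arraysize` rows.
        self.api_calls += 1
        page = self._remaining[: self.arraysize]
        self._remaining = self._remaining[self.arraysize:]
        return page

    def fetchmany(self, size=None):
        size = self.arraysize if size is None else size
        rows = []
        while len(rows) < size:
            if not self._buffer:
                self._buffer = self._fetch_page()
                if not self._buffer:
                    break  # backend exhausted
            rows.append(self._buffer.pop(0))
        return rows


def count_calls(arraysize, limit=1000):
    """Return (rows fetched, simulated API calls) for a given arraysize."""
    cursor = FakePagedCursor(total_rows=limit)
    cursor.arraysize = arraysize
    rows = cursor.fetchmany(limit)
    return len(rows), cursor.api_calls
```

In this model, `count_calls(1)` needs 1000 simulated round trips to return 1000 rows, while `count_calls(5000)` needs one, which is the shape of the speedup reported above.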

If this value is not set, the default value is set to 1, as described here,
https://googlecloudplatform.github.io/google-cloud-python/latest/_modules/google/cloud/bigquery/dbapi/cursor.html#Cursor
"""
arraysize = 5000
Member

I like the idea of setting the arraysize as a class attr; that would be the expected way to set it in derived classes. Maybe add these two lines in BaseEngineSpec.fetch_data:

if cls.arraysize:
    cursor.arraysize = cls.arraysize

and set arraysize = None in the base class
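A minimal sketch of how this suggestion could look, assuming a simplified `fetch_data` that only calls `fetchmany`. The class bodies here are illustrative assumptions, not the actual Superset code, and `demo` is a hypothetical helper that uses the standard library's `sqlite3` purely to obtain a real DB-API cursor with an `arraysize` attribute.

```python
import sqlite3


class BaseEngineSpec:
    # None means "leave the driver's default cursor.arraysize alone".
    arraysize = None

    @classmethod
    def fetch_data(cls, cursor, limit):
        # Only override the cursor's batch size when a subclass sets one.
        if cls.arraysize:
            cursor.arraysize = cls.arraysize
        return cursor.fetchmany(limit)


class BQEngineSpec(BaseEngineSpec):
    # Value taken from the PR; the review asks for a comment on its origin.
    arraysize = 5000


def demo():
    """Hypothetical helper: run both specs against an in-memory SQLite table."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE t (n INTEGER)")
    conn.executemany("INSERT INTO t VALUES (?)", [(i,) for i in range(10)])

    base_cursor = conn.execute("SELECT n FROM t ORDER BY n")
    base_rows = BaseEngineSpec.fetch_data(base_cursor, 3)

    bq_cursor = conn.execute("SELECT n FROM t ORDER BY n")
    bq_rows = BQEngineSpec.fetch_data(bq_cursor, 3)

    return base_cursor.arraysize, bq_cursor.arraysize, base_rows, bq_rows
```

With this shape, engines that do not define `arraysize` keep their driver's default, and only the derived class that opts in changes the cursor's batch size.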

Member

Also, let's add a short comment on the arraysize = 5000 line that says where you got that number from (the default value in pybigquery).

codecov-io commented Aug 15, 2018

Codecov Report

Merging #5632 into master will increase coverage by <.01%.
The diff coverage is 75%.

@@            Coverage Diff             @@
##           master    #5632      +/-   ##
==========================================
+ Coverage   63.48%   63.49%   +<.01%     
==========================================
  Files         360      360              
  Lines       22892    22896       +4     
  Branches     2548     2551       +3     
==========================================
+ Hits        14534    14537       +3     
- Misses       8343     8344       +1     
  Partials       15       15
| Impacted Files | Coverage Δ | |
|---|---|---|
| superset/db_engine_specs.py | 54.31% <75%> (+0.12%) | ⬆️ |

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update d601ff4...18c48ff.

@mistercrunch mistercrunch merged commit c9bd5a6 into apache:master Aug 15, 2018
mistercrunch pushed a commit to lyft/incubator-superset that referenced this pull request Aug 21, 2018
* Fetch a batch of rows from bigquery

* unused const

* review comments

(cherry picked from commit c9bd5a6)
betodealmeida pushed a commit to lyft/incubator-superset that referenced this pull request Aug 22, 2018
* Fetch a batch of rows from bigquery

* unused const

* review comments

(cherry picked from commit c9bd5a6)
wenchma pushed a commit to wenchma/incubator-superset that referenced this pull request Nov 16, 2018
* Fetch a batch of rows from bigquery

* unused const

* review comments
@mistercrunch mistercrunch added the 🏷️ bot and 🚢 0.28.0 labels Feb 27, 2024