Fetch a batch of rows from bigquery #5632
Conversation
superset/db_engine_specs.py
```python
If this value is not set, the default value is set to 1, as described here,
https://googlecloudplatform.github.io/google-cloud-python/latest/_modules/google/cloud/bigquery/dbapi/cursor.html#Cursor
"""
arraysize = 5000
```
I like the idea of setting the arraysize as a class attr, that would be the expected way to set it in derived classes. Maybe add those two lines in `BaseEngineSpec.fetch_data`:

    if cls.arraysize:
        cursor.arraysize = cls.arraysize

and set `arraysize = None` in the base class.
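For context, a minimal sketch of what that suggestion could look like; the real body of `fetch_data` in `superset/db_engine_specs.py` is not shown in this thread, so the fetch logic below is assumed:

```python
class BaseEngineSpec(object):
    # None means "leave the driver's default alone"; engine-specific
    # subclasses can override this to fetch rows in batches.
    arraysize = None

    @classmethod
    def fetch_data(cls, cursor, limit):
        # The reviewer's suggested two lines: push the class-level batch
        # size onto the DB-API cursor before fetching.
        if cls.arraysize:
            cursor.arraysize = cls.arraysize
        if limit:
            return cursor.fetchmany(limit)
        return cursor.fetchall()
```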
Also, let's add a short comment on the `arraysize = 5000` line that says where you got that number from (the default value in pybigquery).
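Something along these lines would address that; the exact class shape and `engine` attribute are inferred from the rest of the PR, so treat this as an illustrative sketch rather than the final diff:

```python
class BigQueryEngineSpec(BaseEngineSpec):
    engine = 'bigquery'
    # 5000 is the default arraysize used by pybigquery; setting it here makes
    # the raw DB-API cursor fetch rows in batches instead of one at a time.
    arraysize = 5000
```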
Codecov Report
|          | master | #5632 | +/-    |
|----------|--------|-------|--------|
| Coverage | 63.48% | 63.49% | +<.01% |
| Files    | 360    | 360    |        |
| Lines    | 22892  | 22896  | +4     |
| Branches | 2548   | 2551   | +3     |
| Hits     | 14534  | 14537  | +3     |
| Misses   | 8343   | 8344   | +1     |
| Partials | 15     | 15     |        |
Continue to review full report at Codecov.
* Fetch a batch of rows from bigquery
* unused const
* review comments

(cherry picked from commit c9bd5a6)
While running Superset with Google BigQuery as the database, I found that queries were very slow: fetching 1000 rows took approximately 2 minutes. On further investigation, I found that the way our cursor is configured, it makes a REST API call for every row fetched, instead of one API call to fetch a batch of rows. pybigquery handles this batch-fetch configuration, however it does not apply to the cursor from the raw_connection used in Superset.
After my change, the query to fetch 1000 rows from BigQuery takes ~1.5 seconds in Superset, down from ~120 seconds, which is on par with the query runtime when run from the BigQuery query editor.
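As a standalone illustration of the underlying mechanism (not the Superset code path itself), setting `arraysize` on a BigQuery DB-API cursor before fetching is roughly what this change does; the query and public dataset below are just examples:

```python
from google.cloud import bigquery
from google.cloud.bigquery import dbapi

client = bigquery.Client()
connection = dbapi.connect(client)
cursor = connection.cursor()

# The DB-API cursor's default arraysize is 1, so without this each fetched
# row can trigger its own round trip to the BigQuery REST API.
cursor.arraysize = 5000

cursor.execute(
    'SELECT word, word_count '
    'FROM `bigquery-public-data.samples.shakespeare` LIMIT 1000')
rows = cursor.fetchall()  # rows are now pulled in batches of up to 5000
print(len(rows))
```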