
refactor(bigquery): to_dataframe uses faster to_arrow + to_pandas when pyarrow is available #10027

Merged (7 commits) on Jan 15, 2020

Conversation

@tswast (Contributor) commented Dec 30, 2019

…pyarrow is available

Related to the similar PR #9997, but for the google-cloud-bigquery library.

Fixes https://issuetracker.google.com/140579733
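
For context, the fast path this refactor takes when pyarrow is installed can be pictured roughly like the sketch below (illustrative only, with a hypothetical helper name; not the library's actual code):

# Illustrative sketch only -- not the library's actual implementation.
# `record_batches` stands in for the Arrow record batches streamed from the
# BigQuery Storage API; the helper name is hypothetical.
import pyarrow as pa


def to_dataframe_via_arrow(record_batches):
    # Build a single Arrow table from all downloaded batches, then convert
    # to pandas once. This avoids converting each page to a DataFrame in
    # Python and concatenating many DataFrames afterwards.
    arrow_table = pa.Table.from_batches(record_batches)
    return arrow_table.to_pandas()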

@tswast requested a review from a team on December 30, 2019 19:39
@googlebot added the "cla: yes" label (This human has signed the Contributor License Agreement.) on Dec 30, 2019
@tswast requested a review from plamut on December 30, 2019 19:41
@tswast added the "kokoro:run" label (forces Kokoro to re-run the tests) on Jan 3, 2020
@yoshi-kokoro removed the "kokoro:run" label on Jan 3, 2020
@tswast force-pushed the b140579733-bq-to_dataframe-part-2 branch from 12654c9 to 374608b on January 7, 2020 20:01
@tswast requested a review from shollyman on January 7, 2020 20:01
@plamut (Contributor) left a comment

Looks fine code-wise IMO.

Two remarks:

  • Do we have a common representative table at hand to verify the stated performance gains and compare the results? If not, I can still manually create a dummy table with 10M floats, just like in the bug description.
  • The coverage check failure is legitimate and should be fixed.

@tswast (Contributor, Author) commented Jan 10, 2020

In https://friendliness.dev/2019/07/29/bigquery-arrow/, we sampled the tlc_green_trips public data by running a SQL query like:

SELECT *
FROM table_name
WHERE RAND() < 0.06

and wrote the results to a destination table so that reads come straight from that table (keeping query time out of the benchmark).
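
For reference, one way to materialize such a sample into a destination table with the Python client might look like this sketch (the destination table ID is a placeholder, not the exact one used for the benchmark):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table; adjust project/dataset/table IDs as needed.
destination = bigquery.TableReference.from_string(
    "my-project.to_dataframe_benchmark.tlc_green_6pct"
)
job_config = bigquery.QueryJobConfig(destination=destination)

# `table_name` is a placeholder for the tlc_green_trips public table being sampled.
sql = """
SELECT *
FROM table_name
WHERE RAND() < 0.06
"""

client.query(sql, job_config=job_config).result()  # wait for the query job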

@tswast force-pushed the b140579733-bq-to_dataframe-part-2 branch from 45ea992 to de7af59 on January 10, 2020 21:26
@tswast (Contributor, Author) commented Jan 13, 2020

Using the same instance (n1-standard-4) as in #9997, I tested the speedup in the same way.

cd google-cloud-python/bigquery
git checkout b140579733-bq-to_dataframe-part-2
pip3 install -e .
cd ../..

benchmark_bq.py

import sys
from google.cloud import bigquery

client = bigquery.Client()

table_id = "swast-scratch.to_dataframe_benchmark.tlc_green_{}pct".format(sys.argv[1])
dataframe = client.list_rows(table_id).to_dataframe(create_bqstorage_client=True)
print("Got {} rows.".format(len(dataframe.index)))

After:

swast@instance-1:~$ time python3 benchmark_bq.py 6
Got 950246 rows.

real    0m6.856s
user    0m3.644s
sys     0m1.132s

swast@instance-1:~$ time python3 benchmark_bq.py 12_5
Got 1980854 rows.

real    0m10.847s
user    0m6.012s
sys     0m1.464s

swast@instance-1:~$ time python3 benchmark_bq.py 25
Got 3963696 rows.

real    0m18.387s
user    0m11.240s
sys     0m2.564s

swast@instance-1:~$ time python3 benchmark_bq.py 50
Got 7917713 rows.

real    0m31.600s
user    0m21.752s
sys     0m4.432s

Before:

swast@instance-1:~$ time python3 benchmark_bq.py 6
Got 950246 rows.

real    0m8.255s
user    0m8.008s
sys     0m1.324s

swast@instance-1:~$ time python3 benchmark_bq.py 12_5
Got 1980854 rows.

real    0m14.623s
user    0m15.284s
sys     0m1.980s

swast@instance-1:~$ time python3 benchmark_bq.py 25
Got 3963696 rows.

real    0m26.064s
user    0m30.008s
sys     0m3.344s

swast@instance-1:~$ time python3 benchmark_bq.py 50
Got 7917713 rows.

real    0m48.729s
user    1m1.356s
sys     0m7.452s

The speedup here, about 1.542x on the largest (50%) table (48.729 s → 31.600 s), is a bit better than what we saw in #9997, though still not quite the 2x I was seeing in the summer. I think the reason the difference is larger here is that we're reading from multiple streams in parallel: the download finishes faster, and there are probably more per-stream dataframes to concatenate into the final result.
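
Roughly speaking, the effect of many parallel streams on the two code paths can be pictured like this (an illustrative sketch, not the library's actual code; `tables` stands in for one Arrow table per download stream):

import pandas as pd
import pyarrow as pa


def old_path(tables):
    # Convert each stream's result to pandas, then concatenate many
    # DataFrames -- the pandas-level work grows with the number of streams.
    return pd.concat([table.to_pandas() for table in tables], ignore_index=True)


def new_path(tables):
    # Concatenate cheaply at the Arrow level, then convert to pandas once.
    return pa.concat_tables(tables).to_pandas()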

@tswast changed the title from "fix(bigquery): to_dataframe uses 2x faster to_arrow + to_pandas when …" to "refactor(bigquery): to_dataframe uses faster to_arrow + to_pandas when pyarrow is available" on Jan 13, 2020
@tswast requested a review from plamut on January 13, 2020 20:50
@plamut (Contributor) left a comment

As with #9997, the observed performance gains are not huge on my 50 Mbps internet connection (network I/O consumes most of the time), but they are nevertheless more or less consistently reproducible.
