
refactor(bigquery): to_dataframe uses faster to_arrow + to_pandas when pyarrow is available #10027

Merged (7 commits) on Jan 15, 2020

Conversation

@tswast (Contributor) commented Dec 30, 2019

…pyarrow is available

Related to the similar PR #9997, but for the google-cloud-bigquery library.

Fixes https://issuetracker.google.com/140579733
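
For context, the fast path this refactor takes when pyarrow is installed can be pictured roughly like the sketch below (illustrative only, with a hypothetical helper name; not the library's actual code):

# Illustrative sketch only -- not the library's actual implementation.
# `record_batches` stands in for the Arrow record batches streamed from the
# BigQuery Storage API; the helper name is hypothetical.
import pyarrow as pa


def to_dataframe_via_arrow(record_batches):
    # Build a single Arrow table from all downloaded batches, then convert
    # to pandas once. This avoids converting each page to a DataFrame in
    # Python and concatenating many DataFrames afterwards.
    arrow_table = pa.Table.from_batches(record_batches)
    return arrow_table.to_pandas()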

@tswast requested a review from a team on December 30, 2019 19:39
@googlebot added the "cla: yes" label (This human has signed the Contributor License Agreement.) on Dec 30, 2019
@tswast requested a review from plamut on December 30, 2019 19:41
@tswast added the "kokoro:run" label (forces Kokoro to re-run the tests) on Jan 3, 2020
@yoshi-kokoro removed the "kokoro:run" label on Jan 3, 2020
@tswast force-pushed the b140579733-bq-to_dataframe-part-2 branch from 12654c9 to 374608b on January 7, 2020 20:01
@tswast requested a review from shollyman on January 7, 2020 20:01
@plamut (Contributor) left a comment

Looks fine code-wise IMO.

Two remarks:

  • Do we have a common representative table at hand to verify the stated performance gains and compare the results? If not, I can still manually create a dummy table with 10M floats, just like in the bug description.
  • The coverage check failure is legitimate and should be fixed.

@tswast (Contributor, Author) commented Jan 10, 2020

In https://friendliness.dev/2019/07/29/bigquery-arrow/, we sampled the tlc_green_trips public data by running a SQL query like:

SELECT *
FROM table_name
WHERE RAND() < 0.06

and wrote the results to a destination table so that reads come straight from that table (keeping query time out of the benchmark).
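
For reference, one way to materialize such a sample into a destination table with the Python client might look like this sketch (the destination table ID is a placeholder, not the exact one used for the benchmark):

from google.cloud import bigquery

client = bigquery.Client()

# Placeholder destination table; adjust project/dataset/table IDs as needed.
destination = bigquery.TableReference.from_string(
    "my-project.to_dataframe_benchmark.tlc_green_6pct"
)
job_config = bigquery.QueryJobConfig(destination=destination)

# `table_name` is a placeholder for the tlc_green_trips public table being sampled.
sql = """
SELECT *
FROM table_name
WHERE RAND() < 0.06
"""

client.query(sql, job_config=job_config).result()  # wait for the query job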

@tswast force-pushed the b140579733-bq-to_dataframe-part-2 branch from 45ea992 to de7af59 on January 10, 2020 21:26
@tswast (Contributor, Author) commented Jan 13, 2020

Using the same instance (n1-standard-4) as in #9997, I tested the speedup in the same way.

cd google-cloud-python/bigquery
git checkout b140579733-bq-to_dataframe-part-2
pip3 install -e .
cd ../..

benchmark_bq.py

import sys
from google.cloud import bigquery

client = bigquery.Client()

table_id = "swast-scratch.to_dataframe_benchmark.tlc_green_{}pct".format(sys.argv[1])
dataframe = client.list_rows(table_id).to_dataframe(create_bqstorage_client=True)
print("Got {} rows.".format(len(dataframe.index)))

After:

swast@instance-1:~$ time python3 benchmark_bq.py 6
Got 950246 rows.

real    0m6.856s
user    0m3.644s
sys     0m1.132s

swast@instance-1:~$ time python3 benchmark_bq.py 12_5
Got 1980854 rows.

real    0m10.847s
user    0m6.012s
sys     0m1.464s

swast@instance-1:~$ time python3 benchmark_bq.py 25
Got 3963696 rows.

real    0m18.387s
user    0m11.240s
sys     0m2.564s

swast@instance-1:~$ time python3 benchmark_bq.py 50
Got 7917713 rows.

real    0m31.600s
user    0m21.752s
sys     0m4.432s

Before:

swast@instance-1:~$ time python3 benchmark_bq.py 6
Got 950246 rows.

real    0m8.255s
user    0m8.008s
sys     0m1.324s

swast@instance-1:~$ time python3 benchmark_bq.py 12_5
Got 1980854 rows.

real    0m14.623s
user    0m15.284s
sys     0m1.980s

swast@instance-1:~$ time python3 benchmark_bq.py 25
Got 3963696 rows.

real    0m26.064s
user    0m30.008s
sys     0m3.344s

swast@instance-1:~$ time python3 benchmark_bq.py 50
Got 7917713 rows.

real    0m48.729s
user    1m1.356s
sys     0m7.452s

The speedup here, about 1.542x on the largest (50%) table (48.729 s → 31.600 s), is a bit better than what we saw in #9997, though still not quite the 2x I was seeing in the summer. I think the reason the difference is larger here is that we're reading from multiple streams in parallel: the download finishes faster, and there are probably more per-stream dataframes to concatenate into the final result.
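
Roughly speaking, the effect of many parallel streams on the two code paths can be pictured like this (an illustrative sketch, not the library's actual code; `tables` stands in for one Arrow table per download stream):

import pandas as pd
import pyarrow as pa


def old_path(tables):
    # Convert each stream's result to pandas, then concatenate many
    # DataFrames -- the pandas-level work grows with the number of streams.
    return pd.concat([table.to_pandas() for table in tables], ignore_index=True)


def new_path(tables):
    # Concatenate cheaply at the Arrow level, then convert to pandas once.
    return pa.concat_tables(tables).to_pandas()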

@tswast changed the title from "fix(bigquery): to_dataframe uses 2x faster to_arrow + to_pandas when …" to "refactor(bigquery): to_dataframe uses faster to_arrow + to_pandas when pyarrow is available" on Jan 13, 2020
@tswast requested a review from plamut on January 13, 2020 20:50
@plamut (Contributor) left a comment

As with #9997, the observed performance gains are not huge on my 50 Mbps internet connection (network I/O consumes most of the time), but they are nevertheless more or less consistently reproducible.
