
refactor(bigquerystorage): to_dataframe on an arrow stream uses faster to_arrow + to_pandas, internally #9997

Merged
1 commit merged on Jan 15, 2020

Conversation

tswast (Contributor) commented Dec 18, 2019

…to_arrow + to_pandas, internally

Towards https://issuetracker.google.com/140579733
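For readers landing here, a minimal sketch of the idea behind the change (illustrative only, not the library's actual internals; record_batches is a placeholder name): instead of building a DataFrame per Arrow record batch and concatenating them, the batches are collected into a single pyarrow.Table, which is converted to pandas in one step.

import pyarrow as pa

def batches_to_dataframe(record_batches):
    # Old approach (roughly): one DataFrame per batch, then pandas.concat().
    # New approach: one Arrow Table, converted to pandas in a single step.
    table = pa.Table.from_batches(record_batches)
    return table.to_pandas()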

@tswast tswast requested a review from a team December 18, 2019 20:02
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Dec 18, 2019
@tswast tswast added api: bigquerystorage Issues related to the BigQuery Storage API. and removed cla: yes This human has signed the Contributor License Agreement. labels Dec 18, 2019
@googlebot googlebot added the cla: yes This human has signed the Contributor License Agreement. label Dec 18, 2019
@tswast tswast requested a review from plamut December 19, 2019 16:25
@tswast tswast requested a review from shollyman January 7, 2020 17:20
plamut (Contributor) left a comment

Looks good to me code-wise, but I still need to verify the performance gain on a "big" table (ran out of time in this review pass 😃 ).

plamut (Contributor) commented Jan 11, 2020

Unfortunately, I cannot confirm the speedup; the timings with and without this fix are about the same on my machine (i7, 16 GB RAM, ~25 Mbps network).

That is probably because I also wasn't able to reproduce the speed difference in the first place using the sample from the issue description (tested on a 2 × 10e6 table of random floats).

@tswast What data did you use in testing? Maybe I can try with that ...

tswast (Contributor, author) commented Jan 13, 2020

I'm getting a more modest speed-up than I was getting this past summer, too. It's at least some speedup, though. The taxi cab data has a wider variety of data types. I wonder if the latest pandas does a better job of combining floating-point columns with concatenate now?
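(Not part of this PR, but one way to sanity-check that hunch in isolation is a standalone micro-benchmark like the sketch below; the batch count and column layout are made up for illustration.)

import time

import numpy as np
import pandas as pd
import pyarrow as pa

# Fifty batches of two float columns, 100k rows each.
batches = [
    pa.RecordBatch.from_pandas(
        pd.DataFrame({"x": np.random.rand(100_000), "y": np.random.rand(100_000)})
    )
    for _ in range(50)
]

# Path 1: convert each batch to pandas, then concatenate.
start = time.perf_counter()
df_concat = pd.concat([batch.to_pandas() for batch in batches], ignore_index=True)
print("per-batch to_pandas + concat:", time.perf_counter() - start)

# Path 2: build one Arrow Table, convert to pandas once.
start = time.perf_counter()
df_arrow = pa.Table.from_batches(batches).to_pandas()
print("Table.from_batches + to_pandas:", time.perf_counter() - start)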

From a VM in us-central:

sudo apt-get install python3-pip git
pip3 install --upgrade pandas
pip3 install --upgrade pyarrow
pip3 install --upgrade google-cloud-core
pip3 install --upgrade google-api-core[grpc]
git clone https://github.com/tswast/google-cloud-python.git
cd google-cloud-python/bigquery_storage
git checkout b140579733-bq-to_dataframe
pip3 install -e .
cd ../..

benchmark_bqstorage.py

import sys
from google.cloud import bigquery_storage_v1beta1

# TODO(developer): Set the project_id variable.
#
# The read session is created in this project. This project can be
# different from that which contains the table.
project_id = 'swast-scratch'

client = bigquery_storage_v1beta1.BigQueryStorageClient()

# This example reads NYC TLC green taxi trip data from the benchmark tables.
table_ref = bigquery_storage_v1beta1.types.TableReference()
table_ref.project_id = "swast-scratch"
table_ref.dataset_id = "to_dataframe_benchmark"
table_ref.table_id = "tlc_green_{}pct".format(sys.argv[1])

parent = "projects/{}".format(project_id)
session = client.create_read_session(
    table_ref,
    parent,
    format_=bigquery_storage_v1beta1.enums.DataFormat.ARROW,
    sharding_strategy=(bigquery_storage_v1beta1.enums.ShardingStrategy.LIQUID),
    requested_streams=1,
)  # API request.

reader = client.read_rows(
    bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[0])
)

dataframe = reader.rows(session).to_dataframe()
print("Got {} rows.".format(len(dataframe.index)))

After this change:

swast@instance-1:~$ time python3 benchmark_bqstorage.py 6
Got 950246 rows.

real    0m7.219s
user    0m3.296s
sys     0m1.564s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 12_5
Got 1980854 rows.

real    0m10.859s
user    0m5.952s
sys     0m2.156s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 25
Got 3963696 rows.

real    0m22.156s
user    0m11.188s
sys     0m4.340s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 50
Got 7917713 rows.

real    0m44.529s
user    0m21.124s
sys     0m8.976s

Before this change:

cd google-cloud-python
git checkout master
cd ..

swast@instance-1:~$ time python3 benchmark_bqstorage.py 6
Got 950246 rows.

real    0m7.301s
user    0m6.572s
sys     0m1.148s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 12_5
Got 1980854 rows.

real    0m13.595s
user    0m12.584s
sys     0m2.056s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 25
Got 3963696 rows.

real    0m26.442s
user    0m24.552s
sys     0m3.904s

swast@instance-1:~$ time python3 benchmark_bqstorage.py 50
Got 7917713 rows.

real    0m50.157s
user    0m48.556s
sys     0m7.876s

@tswast tswast changed the title from "fix(bigquerystorage): to_dataframe on an arrow stream uses 2x faster …" to "fix(bigquerystorage): to_dataframe on an arrow stream uses faster to_arrow + to_pandas, internally" Jan 13, 2020
@tswast tswast changed the title from "fix(bigquerystorage): to_dataframe on an arrow stream uses faster to_arrow + to_pandas, internally" to "refactor(bigquerystorage): to_dataframe on an arrow stream uses faster to_arrow + to_pandas, internally" Jan 13, 2020
@tswast tswast requested a review from plamut January 13, 2020 20:09
tswast (Contributor, author) commented Jan 13, 2020

I made the swast-scratch.to_dataframe_benchmark.tlc_green_6pct, etc. tables publicly readable, so you should be able to reproduce my results as well.

I've updated the PR title to remove the "2x", as it's a more modest 1.126x speedup (50.157 s / 44.529 s ≈ 1.126 on the 50pct table).

plamut (Contributor) commented Jan 15, 2020

@tswast Thanks for the data, and the benchmarking results!

I was finally able to reproduce a significant difference pretty consistently (~33 seconds vs. ~35 seconds for the 6pct test, for example).

As this was much slower than your results, it became apparent that network I/O dominated the timings on my end (I'm on a 50 Mbps fiber-optic connection). By shutting down everything that might have affected the average download speed, I was able to bring the variance down enough to not lose the signal in the noise (I didn't initially expect such measures to be necessary, given the reported 2x gains 🙂).

plamut (Contributor) left a comment

The speedup is indeed noticeable 👍

OmarMAmin commented

@tswast
Sorry for asking here; I'm not sure if there's somewhere else to ask. Does that mean the results from this post are no longer valid?
https://medium.com/google-cloud/announcing-google-cloud-bigquery-version-1-17-0-1fc428512171

That is, is to_dataframe() now equivalent to to_arrow().to_pandas() when pyarrow is installed, and is fastavro no longer a requirement?

tswast (Contributor, author) commented Dec 12, 2022

Yes, that's true. We should see similar performance for to_dataframe() compared to to_arrow().to_pandas() after this PR.
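Concretely, after this PR the two spellings below should be interchangeable. This is a rough sketch reusing the client and session objects from the benchmark script above; fresh_reader is a helper introduced here for illustration, since each conversion consumes the underlying read stream.

def fresh_reader():
    # Re-open the stream from the beginning for each conversion path.
    return client.read_rows(
        bigquery_storage_v1beta1.types.StreamPosition(stream=session.streams[0])
    )

# Convenience path (now uses Arrow internally).
df_direct = fresh_reader().rows(session).to_dataframe()

# Explicit Arrow path.
df_via_arrow = fresh_reader().rows(session).to_arrow().to_pandas()

assert len(df_direct.index) == len(df_via_arrow.index)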
