Long download times for large query results in bigrquery #199
@Praxiteles - Hi, I'm the 'individual'! Your speeds sound far worse than those I was experiencing when I was using the default
Looks something like the below. Assumes you have
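The snippet this comment refers to appears to have been lost in the page export. A minimal sketch of the GCS round-trip being described, assuming the current bigrquery API (`bq_project_query()` / `bq_table_save()`) and hypothetical project and bucket names:

```r
library(bigrquery)

# Run the query, materialising the result as a BigQuery table.
tb <- bq_project_query(
  "my-project",  # hypothetical project id
  "SELECT * FROM `bigquery-public-data.samples.shakespeare`"
)

# Extract the result table to Google Cloud Storage as gzipped CSV shards.
bq_table_save(
  tb,
  "gs://my-bucket/results-*.csv.gz",  # hypothetical bucket
  destination_format = "CSV",
  compression = "GZIP"
)

# Then download the shards outside R, e.g.:
#   gsutil -m cp 'gs://my-bucket/results-*.csv.gz' ./data/
```

The `-m` flag tells gsutil to copy the shards in parallel, which is where most of the speedup over paging rows through the BigQuery REST API comes from.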
@craigcitro can you confirm that this is expected behaviour? (i.e. to get good performance you should save to GCS, and then download from there?)
Intuitively, this sounds about right. In the Web UI, you usually have to save a destination table if the results are too large and then export that table to GCS. It doesn't even give you the option to save as CSV.
I should add also that bigqueryR
Could someone please provide a reprex with a query on a public dataset? (i.e. some query that yields around 10 MB in size). I need to figure out where the bottleneck is.
@hadley - to confirm, this isn't an issue specific to this package - we've had the same issue with Python and Tableau. Our contact at Google more or less confirmed that this was expected behaviour (not in quite so many words). Couldn't get reprex() going on my Linux box, but this query yields data that's about 10 MB:
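The query itself seems to have been dropped from the page export. An illustrative stand-in (not the original) against a public dataset, with the row count as a rough guess to land near 10 MB:

```r
library(bigrquery)

# Standard SQL against a public sample table; adjust LIMIT to hit the
# payload size you want to benchmark.
sql <- "
  SELECT word, word_count, corpus, corpus_date
  FROM `bigquery-public-data.samples.shakespeare`
  LIMIT 100000
"

df <- query_exec(sql, project = "my-project",  # hypothetical project id
                 use_legacy_sql = FALSE)
```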
@barnettjacob @hadley Just to clarify the issue: the exact same bigrquery query takes just minutes to download data in RStudio but hours when running the same query in a Jupyter notebook. That seems to defy the idea that this is expected behavior. Is there something about the package and the way it is used by Jupyter that could cause this?
@Praxiteles - that's a strange one. It's not what I was describing in my 'community' post and isn't what I've had confirmed by others - i.e. downloading from BQ into RStudio via this package is much slower than using GCS as an intermediary step. I think the discussion above about expected behaviour relates to the differential between the respective download speeds from BQ and GCS.
To be clear, I don't think bigrquery will ever be as fast as saving to GCS and downloading using Google's command line tool (because that tool does a bunch of nice parallel HTTP streaming stuff), but I don't think there should be such a massive difference as there is now.
Hmmmm, @barnettjacob that takes only 7s for me. If I double it to ~24 meg, it takes 15s. If I double it again to ~47 meg, it takes 31s. |
The bottleneck is currently mostly in jsonlite parsing the json. There's no obvious way to make this faster without a lot of work. |
Closing in favour of #224 — currently the biggest overhead is parsing the json, but we should be able to skip that step by doing the parsing "by hand" in C++. I think that should give around a 5x speedup. Once that is complete I might explore doing the http requests in parallel, but that will only make a difference if it's the http request that's the bottleneck. |
For future sleuths, I'm linking this to: googleapis/python-bigquery-pandas#167 I have also written a documented solution here that uses 'reticulate' and the Google Cloud python API |
Our query_exec call is taking 2 hours to download a 27 MB dataset (~320K rows) when max_pages is set to Inf and page_size is set to 10K.
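For reference, the call pattern being described looks something like this (the query and project id are placeholders, not the reporter's actual values):

```r
library(bigrquery)

df <- query_exec(
  "SELECT * FROM mydataset.mytable",  # hypothetical query
  project   = "my-project",           # hypothetical project id
  max_pages = Inf,                    # fetch every page of results
  page_size = 1e4                     # 10,000 rows per HTTP request
)
```

At 10K rows per page, ~320K rows means roughly 32 sequential HTTP requests, each of whose JSON payload must then be parsed, which is where the time goes.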
We seem to be having the same issue as this individual: https://community.rstudio.com/t/bigrquery-large-datasets/2632
They seem to have found that their 1GB file download times went from 1 hour down to 2 mins when they downloaded it as JSON from Google Cloud Storage.
Theoretically, our dataset streaming down from BigQuery should take just minutes as well.
Given the discrepant times between services for the same dataset, is this a bug or is there a solution to download large datasets from Google BigQuery via bigrquery more quickly?
To clarify: The exact same bigrquery query takes just minutes to download data in RStudio but hours when running the same query in a Jupyter notebook.