
Long download times for large query results in bigrquery #199

Closed
Praxiteles opened this issue Feb 17, 2018 · 13 comments
Labels: api 🕸️, feature (a feature request or enhancement)

Comments


Praxiteles commented Feb 17, 2018

Our query_exec command is taking 2 hours to download a 27MB dataset (~320K rows) when max_pages is set to Inf and page_size is set to 10K.

We seem to be having the same issue as this individual: https://community.rstudio.com/t/bigrquery-large-datasets/2632
They seem to have found that their 1GB file download times went from 1 hour down to 2 mins when they downloaded it as JSON from Google Cloud Storage.

Theoretically, our dataset streaming down from BigQuery should take just minutes as well.

Given the discrepant times between services for the same dataset, is this a bug or is there a solution to download large datasets from Google BigQuery via bigrquery more quickly?

To clarify: The exact same bigrquery query takes just minutes to download data in RStudio but hours when running the same query in a Jupyter notebook.
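For reference, the call in question looks roughly like this (the project and table below are placeholders, not our actual names):

library(bigrquery)

# placeholders only; our real project and table names differ
sql <- "SELECT * FROM `my-project.my_dataset.my_table`"
df <- query_exec(sql,
                 project = "my-project",
                 use_legacy_sql = FALSE,
                 max_pages = Inf,    # fetch every page of the result
                 page_size = 10000)  # 10K rows per page, as described above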

@barnettjacob

@Praxiteles - Hi, I'm the 'individual'! Your speeds sound far worse than those I was experiencing when I was using the default page_size = 1000. We did a couple of things to alleviate this:

  1. Used larger page sizes - we tested a bunch and 100k seemed to be the quickest on datasets of about 1GB (although we have had some failures recently with page sizes that big). A minimal call is sketched just after this list.

  2. Created a function based on the feedback I got in that thread, but also inspired by (copied from) https://twitter.com/KanAugust/status/931177068313788416, which automates exporting the table to GCS, downloads the file(s), and imports them into R as a single data frame (and cleans up all the temporary objects). This gives a dramatic speed improvement.
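A minimal sketch of option 1 (the project name is a placeholder; the best page_size will vary with your data):

library(bigrquery)

# option 1: just ask for bigger pages (placeholder project name)
sql <- "SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 20000"
df <- query_exec(sql,
                 project = "my-project",
                 use_legacy_sql = FALSE,
                 max_pages = Inf,
                 page_size = 100000)  # 100k pages were quickest for our ~1GB results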

Looks something like the below. Assumes you have googleCloudStorageR installed and authenticated (I authenticate with a JSON key file from the Google project - see the package docs, which are pretty good).

bq_query_gcs <- function(sql, project = 'xx-my-project-xx', use_legacy_sql = FALSE, quiet = FALSE,
                         target_dataset = 'xx-my-dataset-xx', target_bucket = 'xx-my-bucket-xx',
                         target_directory = getwd(), job_name = NULL, multi_file = FALSE,
                         read_csv_col_types = NULL, read_csv_guess_max = 10000){

  # set a random job name if none was supplied
  if(is.null(job_name)){
    job_name <- paste0(sample(LETTERS, 15), collapse = '')
  }

  destination_table <- paste0(target_dataset, '.', job_name)

  # write query results to a temporary table
  job <- bigrquery::insert_query_job(sql, project = project, destination_table = destination_table,
                                     default_dataset = NULL, use_legacy_sql = use_legacy_sql)
  job <- bigrquery::wait_for(job, quiet = quiet)

  # export the temporary table to GCS (wildcard name when the export spans multiple files)
  if(multi_file == TRUE){
    job_name_gcs <- paste0(job_name, '_*')
  } else {
    job_name_gcs <- job_name
  }

  job <- bigrquery::insert_extract_job(project = project, dataset = target_dataset, table = job_name,
                                       destination_uris = paste0('gs://', target_bucket, '/', job_name_gcs, '.csv.gz'),
                                       print_header = TRUE, field_delimiter = ",",
                                       destination_format = "CSV", compression = "GZIP")
  job <- bigrquery::wait_for(job, quiet = quiet)

  # remove the temporary table from BigQuery
  bigrquery::delete_table(project = project, dataset = target_dataset, table = job_name)

  # download the exported file(s) from GCS, then delete them from the bucket
  objects <- googleCloudStorageR::gcs_list_objects(target_bucket)
  to_download <- grep(job_name, objects$name, value = TRUE)
  local_files <- file.path(target_directory, to_download)

  mapply(function(name, path){
    googleCloudStorageR::gcs_get_object(name, bucket = target_bucket, overwrite = TRUE, saveToDisk = path)
    googleCloudStorageR::gcs_delete_object(name, bucket = target_bucket)
  }, to_download, local_files)

  # read the temporary files into a single data frame (requires readr and dplyr)
  df <- dplyr::bind_rows(lapply(local_files, function(file){
    suppressMessages(readr::read_csv(file, progress = FALSE, col_types = read_csv_col_types,
                                     guess_max = read_csv_guess_max))
  }))

  # clear the temporary files from disk
  file.remove(local_files)

  return(df)
}
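An example call, with placeholder project, dataset, and bucket names (readr and dplyr also need to be installed, and googleCloudStorageR must already be authenticated):

# example usage with placeholder names
sql <- "SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 20000"
df <- bq_query_gcs(sql,
                   project        = "my-project",
                   target_dataset = "scratch_dataset",
                   target_bucket  = "my-extract-bucket",
                   multi_file     = FALSE)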


hadley commented Mar 28, 2018

@craigcitro can you confirm that this is expected behaviour? (i.e. to get good performance you should save to gcs, and then download from there?)

hadley added the api 🕸️ and feature labels on Mar 28, 2018

j450h1 commented Mar 29, 2018

Intuitively, this sounds about right. In the Web UI, you usually have to save the results to a destination table if they are too large, and then export that table to GCS. It doesn't even give you the option to download them directly as CSV.

@riccardopinosio

I should also add that bigQueryR uses exactly this approach for larger queries.


hadley commented Apr 8, 2018

Could someone please provide a reprex with a query on a public dataset (i.e. some query that yields around 10 MB of data)? I need to figure out where the bottleneck is.

@barnettjacob

@hadley - to confirm, this isn't an issue specific to this package - we've had the same issue with Python and Tableau. Our contact at Google more or less confirmed that this was expected behaviour (not in quite so many words).

Couldn't get reprex() going on my Linux box, but this query yields data that's about 10 MB:

library(bigrquery)
sql <- "SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 20000"
df <- query_exec(sql, project = 'my-project', use_legacy_sql = F)

@Praxiteles (Author)

@barnettjacob @hadley Just to clarify the issue: the exact same bigrquery query takes just minutes to download data in RStudio but hours when running the same query in a Jupyter notebook. That seems to defy the idea that this is expected behavior. Is there something about the package and the way it is used by Jupyter that could cause this?


barnettjacob commented Apr 9, 2018

@Praxiteles - that's a strange one. It's not what I was describing in my 'community' post and isn't what I've had confirmed by others - i.e. that downloading from BQ into RStudio via this package is much slower than using GCS as an intermediary step.

I think the discussion above about expected behaviour relates to the differential between the respective download speeds from BQ and GCS.


hadley commented Apr 9, 2018

To be clear, I don't think bigrquery will ever be as fast as saving to GCS and downloading using google's command line tool (because it does a bunch of nice parallel http streaming stuff), but I don't think there should be such a massive difference as there is now.


hadley commented Apr 9, 2018

Hmmmm, @barnettjacob that takes only 7s for me. If I double it to ~24 meg, it takes 15s. If I double it again to ~47 meg, it takes 31s.


hadley commented Apr 9, 2018

The bottleneck is currently mostly in jsonlite parsing the json. There's no obvious way to make this faster without a lot of work.
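If anyone wants to verify this on their own queries, here is a rough sketch using base R's profiler (the query and project are placeholders):

library(bigrquery)

# profile a download to see where the time goes (placeholder project)
Rprof("bq_profile.out")
df <- query_exec("SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 20000",
                 project = "my-project", use_legacy_sql = FALSE)
Rprof(NULL)
# json parsing should show up near the top of the call summary, per the comment above
head(summaryRprof("bq_profile.out")$by.total, 10)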


hadley commented Apr 11, 2018

Closing in favour of #224 — currently the biggest overhead is parsing the json, but we should be able to skip that step by doing the parsing "by hand" in C++. I think that should give around a 5x speedup. Once that is complete I might explore doing the http requests in parallel, but that will only make a difference if it's the http request that's the bottleneck.

hadley closed this as completed Apr 11, 2018

nmatare commented Jun 15, 2018

For future sleuths, I'm linking this to: googleapis/python-bigquery-pandas#167

I have also written a documented solution here that uses 'reticulate' and the Google Cloud python API
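As a rough sketch of that idea (not the exact code from the linked write-up; assumes the google-cloud-bigquery Python package is installed and credentials are configured; the project id is a placeholder):

library(reticulate)

# import the Google Cloud BigQuery client library via reticulate
bq <- import("google.cloud.bigquery")
client <- bq$Client(project = "my-project")  # placeholder project id

sql <- "SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 20000"
# run the query and pull the result back as a pandas data frame,
# which reticulate converts to an R data frame
df <- client$query(sql)$to_dataframe()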
