
Long download times for large query results in bigrquery #199

Closed
Praxiteles opened this issue Feb 17, 2018 · 13 comments
Labels: api 🕸️, feature (a feature request or enhancement)

Comments


Praxiteles commented Feb 17, 2018

Our query_exec command is taking 2 hours to download a 27MB dataset (~320K rows) when max_pages is set to Inf and page_size is set to 10K.

We seem to be having the same issue as this individual: https://community.rstudio.com/t/bigrquery-large-datasets/2632
They seem to have found that their 1GB file download times went from 1 hour down to 2 mins when they downloaded it as JSON from Google Cloud Storage.

Theoretically, our dataset streaming down from BigQuery should take just minutes as well.

Given the discrepant times between services for the same dataset, is this a bug or is there a solution to download large datasets from Google BigQuery via bigrquery more quickly?

To clarify: The exact same bigrquery query takes just minutes to download data in RStudio but hours when running the same query in a Jupyter notebook.
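For reference, the call in question looks roughly like this (the project and table below are placeholders, not our actual names):

library(bigrquery)

# placeholders only; our real project and table names differ
sql <- "SELECT * FROM `my-project.my_dataset.my_table`"
df <- query_exec(sql,
                 project = "my-project",
                 use_legacy_sql = FALSE,
                 max_pages = Inf,    # fetch every page of the result
                 page_size = 10000)  # 10K rows per page, as described above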

@barnettjacob

@Praxiteles - Hi, I'm the 'individual'! Your speeds sound far worse than those I was experiencing when I was using the default page_size = 1000. We did a couple of things to alleviate this:

  1. Used larger page sizes - we tested a bunch and 100k seemed to be the quickest on datasets of about 1GB (although we have had some failures recently with page sizes that big). A minimal call is sketched just after this list.

  2. Created a function based on the feedback I got in that thread, but also inspired by (copied from) https://twitter.com/KanAugust/status/931177068313788416, which automates exporting the table to GCS, downloads the file(s), and imports them into R as a single data frame (and cleans up all the temporary objects). This gives a dramatic speed improvement.
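A minimal sketch of option 1 (the project name is a placeholder; the best page_size will vary with your data):

library(bigrquery)

# option 1: just ask for bigger pages (placeholder project name)
sql <- "SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 20000"
df <- query_exec(sql,
                 project = "my-project",
                 use_legacy_sql = FALSE,
                 max_pages = Inf,
                 page_size = 100000)  # 100k pages were quickest for our ~1GB results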

Looks something like the below. Assumes you have googleCloudStorageR installed and authenticated (I authenticate with a JSON key file from the Google project - see the package docs, which are pretty good).

bq_query_gcs <- function(sql, project = 'xx-my-project-xx', use_legacy_sql = FALSE, quiet = FALSE,
                         target_dataset = 'xx-my-dataset-xx', target_bucket = 'xx-my-bucket-xx',
                         target_directory = getwd(), job_name = NULL, multi_file = FALSE,
                         read_csv_col_types = NULL, read_csv_guess_max = 10000){

  # set a random job name if none was supplied
  if(is.null(job_name)){
    job_name <- paste0(sample(LETTERS, 15), collapse = '')
  }

  destination_table <- paste0(target_dataset, '.', job_name)

  # write query results to a temporary table
  job <- bigrquery::insert_query_job(sql, project = project, destination_table = destination_table,
                                     default_dataset = NULL, use_legacy_sql = use_legacy_sql)
  job <- bigrquery::wait_for(job, quiet = quiet)

  # export the temporary table to GCS (wildcard name when the export spans multiple files)
  if(multi_file == TRUE){
    job_name_gcs <- paste0(job_name, '_*')
  } else {
    job_name_gcs <- job_name
  }

  job <- bigrquery::insert_extract_job(project = project, dataset = target_dataset, table = job_name,
                                       destination_uris = paste0('gs://', target_bucket, '/', job_name_gcs, '.csv.gz'),
                                       print_header = TRUE, field_delimiter = ",",
                                       destination_format = "CSV", compression = "GZIP")
  job <- bigrquery::wait_for(job, quiet = quiet)

  # remove the temporary table from BigQuery
  bigrquery::delete_table(project = project, dataset = target_dataset, table = job_name)

  # download the exported file(s) from GCS, then delete them from the bucket
  objects <- googleCloudStorageR::gcs_list_objects(target_bucket)
  to_download <- grep(job_name, objects$name, value = TRUE)
  local_files <- file.path(target_directory, to_download)

  mapply(function(name, path){
    googleCloudStorageR::gcs_get_object(name, bucket = target_bucket, overwrite = TRUE, saveToDisk = path)
    googleCloudStorageR::gcs_delete_object(name, bucket = target_bucket)
  }, to_download, local_files)

  # read the temporary files into a single data frame (requires readr and dplyr)
  df <- dplyr::bind_rows(lapply(local_files, function(file){
    suppressMessages(readr::read_csv(file, progress = FALSE, col_types = read_csv_col_types,
                                     guess_max = read_csv_guess_max))
  }))

  # clear the temporary files from disk
  file.remove(local_files)

  return(df)
}
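An example call, with placeholder project, dataset, and bucket names (readr and dplyr also need to be installed, and googleCloudStorageR must already be authenticated):

# example usage with placeholder names
sql <- "SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 20000"
df <- bq_query_gcs(sql,
                   project        = "my-project",
                   target_dataset = "scratch_dataset",
                   target_bucket  = "my-extract-bucket",
                   multi_file     = FALSE)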


hadley commented Mar 28, 2018

@craigcitro can you confirm that this is expected behaviour? (i.e. to get good performance you should save to gcs, and then download from there?)

hadley added the api 🕸️ and feature labels on Mar 28, 2018

j450h1 commented Mar 29, 2018

Intuitively, this sounds about right. In the Web UI, you usually have to save the results to a destination table if they are too large, and then export that table to GCS. It doesn't even give you the option to download them directly as CSV.

@riccardopinosio

I should also add that bigQueryR uses exactly this approach for larger queries.


hadley commented Apr 8, 2018

Could someone please provide a reprex with a query on a public dataset (i.e. some query that yields around 10 MB of data)? I need to figure out where the bottleneck is.

@barnettjacob

@hadley - to confirm, this isn't an issue specific to this package - we've had the same issue with Python and Tableau. Our contact at Google more or less confirmed that this was expected behaviour (not in quite so many words).

Couldn't get reprex() going on my Linux box, but this query yields data that's about 10 MB:

library(bigrquery)
sql <- "SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 20000"
df <- query_exec(sql, project = 'my-project', use_legacy_sql = F)

@Praxiteles (Author)

@barnettjacob @hadley Just to clarify the issue: the exact same bigrquery query takes just minutes to download data in RStudio but hours when running the same query in a Jupyter notebook. That seems to defy the idea that this is expected behavior. Is there something about the package and the way it is used by Jupyter that could cause this?


barnettjacob commented Apr 9, 2018

@Praxiteles - that's a strange one. It's not what I was describing in my 'community' post and isn't what I've had confirmed by others - i.e. that downloading from BQ into RStudio via this package is much slower than using GCS as an intermediary step.

I think the discussion above about expected behaviour relates to the differential between the respective download speeds from BQ and GCS.


hadley commented Apr 9, 2018

To be clear, I don't think bigrquery will ever be as fast as saving to GCS and downloading using google's command line tool (because it does a bunch of nice parallel http streaming stuff), but I don't think there should be such a massive difference as there is now.


hadley commented Apr 9, 2018

Hmmmm, @barnettjacob that takes only 7s for me. If I double it to ~24 meg, it takes 15s. If I double it again to ~47 meg, it takes 31s.


hadley commented Apr 9, 2018

The bottleneck is currently mostly in jsonlite parsing the json. There's no obvious way to make this faster without a lot of work.
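If anyone wants to verify this on their own queries, here is a rough sketch using base R's profiler (the query and project are placeholders):

library(bigrquery)

# profile a download to see where the time goes (placeholder project)
Rprof("bq_profile.out")
df <- query_exec("SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 20000",
                 project = "my-project", use_legacy_sql = FALSE)
Rprof(NULL)
# json parsing should show up near the top of the call summary, per the comment above
head(summaryRprof("bq_profile.out")$by.total, 10)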


hadley commented Apr 11, 2018

Closing in favour of #224 — currently the biggest overhead is parsing the json, but we should be able to skip that step by doing the parsing "by hand" in C++. I think that should give around a 5x speedup. Once that is complete I might explore doing the http requests in parallel, but that will only make a difference if it's the http request that's the bottleneck.

hadley closed this as completed Apr 11, 2018

nmatare commented Jun 15, 2018

For future sleuths, I'm linking this to: googleapis/python-bigquery-pandas#167

I have also written a documented solution here that uses 'reticulate' and the Google Cloud python API
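As a rough sketch of that idea (not the exact code from the linked write-up; assumes the google-cloud-bigquery Python package is installed and credentials are configured; the project id is a placeholder):

library(reticulate)

# import the Google Cloud BigQuery client library via reticulate
bq <- import("google.cloud.bigquery")
client <- bq$Client(project = "my-project")  # placeholder project id

sql <- "SELECT * FROM `bigquery-public-data.hacker_news.comments` LIMIT 20000"
# run the query and pull the result back as a pandas data frame,
# which reticulate converts to an R data frame
df <- client$query(sql)$to_dataframe()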
