Uploads Using Python Google Cloud Storage/Bigquery Client Libraries are Very Slow #238
Comments
Thanks for the report. I'm not sure which upload code path is actually shared across the BQ and Storage clients, so I've moved this issue to that shared package.
I looked at the source to …
Could …
It can, but doesn't by default.
Here's an update. I decided to start with google-cloud-storage and gsutil, since they are lower-level. I ran the … Uploading it with gsutil and with the storage Python client (GCS) took the same time, ~16s. I also created a 3GB randomly generated CSV. Uploading it with gsutil took about 168s and with the GCS client ~150s. I ran the tests on a Compute Engine instance to try to minimize network effects. A complication was that there seemed to be some sort of throttling at play: successive runs got slower, sometimes a lot slower. I found I needed to wait a couple of minutes between runs to get consistent results. In any case, I can't reproduce the reported slowness of the storage Python client. Some additional notes:
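For concreteness, here is a minimal timing sketch of the kind of GCS-vs-gsutil comparison described above. The bucket and file names are placeholders, and `timed_upload`/`throughput_mb_s` are hypothetical helpers, not part of the client library:

```python
import os
import time

def throughput_mb_s(num_bytes, seconds):
    """Transfer rate in MB/s for num_bytes moved in `seconds` seconds."""
    return (num_bytes / (1024 * 1024)) / seconds

def timed_upload(bucket_name, source_path, dest_name):
    """Upload a local file with the GCS client and report elapsed time and rate."""
    from google.cloud import storage  # pip install google-cloud-storage
    blob = storage.Client().bucket(bucket_name).blob(dest_name)
    start = time.monotonic()
    blob.upload_from_filename(source_path)
    elapsed = time.monotonic() - start
    return elapsed, throughput_mb_s(os.path.getsize(source_path), elapsed)
```

Timing `gsutil cp` on the same file (e.g. `time gsutil cp sample.csv gs://<bucket>/`) then gives the baseline to compare against.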
On to BigQuery. :)
I'm not finding a significant difference in uploading … Here are some times: … The library is a bit slower. It's using a small (1MB) chunk size. Increasing that to 100MB gives times of 1m52s and 1m42s.
@KevinTydlacka your script has: …
But I don't see it being reused. Could your times be including that? Generating the file takes a long time. I adapted your script slightly: …
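Since generating the file takes a long time, one way to keep generation out of the measured times is to create the CSV once and reuse it on later runs. A sketch under an assumed column layout (the original script's columns are not shown here):

```python
import os
import random
import string

def ensure_sample_csv(path, target_bytes):
    """Generate a random CSV of at least target_bytes once; reuse it afterwards."""
    if os.path.exists(path) and os.path.getsize(path) >= target_bytes:
        return path  # already generated on an earlier run; skip the slow step
    written = 0
    with open(path, "w") as f:
        written += f.write("id,name,value\n")
        row_id = 0
        while written < target_bytes:
            name = "".join(random.choices(string.ascii_lowercase, k=12))
            written += f.write(f"{row_id},{name},{random.random()}\n")
            row_id += 1
    return path
```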
Hmm, no. I comment that out / delete it and just reuse the same generated file after it is created on the first run. Since it was quick, I just tested against Python 3.9.6 and am seeing the same (slow) behavior. It seems odd that it would be a credentialing issue. I will try it out on a little Compute Engine instance again and see if I can get you more info. It's worth noting that whoever responded to my initial bug report was able to replicate the issue; I'm not sure if you can see who that is, to collect details from them too in case it helps target the investigation.
OK @jimfulton, I tried to duplicate your setup and am seeing the same (originally reported, slow) behavior. Details: I made a new Compute Engine e2-micro instance in us-central-1a with the following specs: … Set … Installed … Installed …
Moved over a 938MB sample file generated from my script above. Attempted the upload (…). Ran the script from your gist against the same file, with the same destination (I only edited the dataset and table in your script), and got the same slow behavior. Output from your script was: …
The difference in Python versions shouldn't matter, since I still see the same issues on recent Python versions (…).
Sorry, just saw this. I'll try reproducing your setup. :)
I tried to reproduce this. I created an e2-micro instance in us-central-1a using a ubuntu-minimal-1804-bionic-v20210720 image (the only 18.04 image I saw). The …
The timing:
It appears to take ~17s to transmit the data and 48s to process. The overall time is ~65s (as low as 63s on other runs). That's much larger than the 19s you're reporting; I have no idea why you're seeing much shorter times. Using the library takes ~124s for me, which is in the same ballpark you're seeing, and about twice as long as `bq`. (I tried running with Python 2, which is what …) Increasing the chunk size used internally by the library from 1MB to 100MB yielded times as low as 58s, which is in line with `bq`. I need to look at what `bq` is doing internally.
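The chunk-size experiment above can be reproduced by overriding the library's internal default. Note that `_DEFAULT_CHUNKSIZE` is a private constant of `google.cloud.bigquery.client`, so this is version-dependent and unsupported; it is shown only as a test sketch:

```python
MB = 1024 * 1024

def set_bq_upload_chunk_size(num_bytes=100 * MB):
    """Override the default chunk size the BQ client uses for resumable uploads.

    Relies on a private attribute (_DEFAULT_CHUNKSIZE), so it may break
    across google-cloud-bigquery versions; shown only to reproduce the test.
    """
    from google.cloud.bigquery import client as bq_client
    bq_client._DEFAULT_CHUNKSIZE = num_bytes
    return num_bytes
```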
They said:
Their times are in line with what I'm seeing. The library is taking about 127s, and the total … Comparing just the upload time with the library time doesn't make a lot of sense, because the library time includes both upload and processing.
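To separate the two phases, one can time the upload (the `load_table_from_file()` call) and the processing (`job.result()`) independently. A hedged sketch; the project/dataset/table IDs are placeholders and `timed_load` is a made-up helper:

```python
import time

def timed_load(project, dataset, table, path):
    """Return (upload_seconds, processing_seconds) for one CSV load job."""
    from google.cloud import bigquery  # pip install google-cloud-bigquery
    client = bigquery.Client(project=project)
    job_config = bigquery.LoadJobConfig(
        source_format=bigquery.SourceFormat.CSV,
        skip_leading_rows=1,
        autodetect=True,
    )
    with open(path, "rb") as f:
        start = time.monotonic()
        job = client.load_table_from_file(
            f, f"{project}.{dataset}.{table}", job_config=job_config
        )
        uploaded = time.monotonic()  # bytes transmitted, job submitted
    job.result()  # block until BigQuery finishes processing the load
    done = time.monotonic()
    return uploaded - start, done - uploaded
```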
I verified that … Note that the total time, including processing time, can vary quite a bit for both … In any case, we'll update the library chunk size to 100MB.
This was originally opened as a Stack Overflow question, then reported as a bug through the IssueTracker site, but neither has gotten any traction. I'm thinking this is the best place to report this issue, since the maintainers might have quicker insight into what the cause might be.
Issue
Uploads using the Python google.cloud.bigquery.client.Client load_table_from_file() method, or uploading files first to my storage bucket using google.cloud.storage.blob.Blob upload_from_file() (to then load into BQ), are both awfully slow (1-4 MB/s), while gsutil and Dropbox show 10x those speeds on the same machine/environment.
I am using Python 3.6.9. All metrics were tested using a single-file upload. I have tried:
Running in a docker container on a Google Compute Engine Ubuntu VM.
Running in a docker container on my mac.
Running on my Mac using just python (no docker).
Uploading the whole file from memory and from disk, both uncompressed and gzipped. No difference.
Using older and the most recent python client library versions.
For older clients (1.24.0 BigQuery, 1.25.0 Storage) I see 1-3 MB/s upload speeds. For the 2.13.1 BigQuery client I see 3-4 MB/s.
All these tests resulted in identically slow performance.
I have 900+ Mbps up/down on my Mac. The Dropbox Python client library running on the same setups easily smokes these speeds using the exact same files, and using gsutil/bq on my Mac also shows 10x+ speeds for the same file.
Environment details
- Python version: 3.6
- pip version: 21.1.3
- google-cloud-bigquery version: multiple (see above)

Steps to reproduce
Attempt to upload a large (1-4GB) CSV file to an existing BQ table using the Python BigQuery client load_table_from_file() method. See the code example for a sample script, as well as a sample bq command-line comparison. The same limited speed can be observed when using the Python storage client blob.upload_from_file().
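As a companion sketch for the storage path: `Blob` accepts an explicit `chunk_size` (a multiple of 256 KB) that controls resumable-upload chunking, and varying it is a cheap experiment when chasing throughput. Names marked UPDATE_THIS are placeholders:

```python
def upload_with_chunk_size(bucket_name, source_path, chunk_mb=100):
    """Upload a file with an explicit resumable-upload chunk size."""
    from google.cloud import storage  # pip install google-cloud-storage
    chunk_size = chunk_mb * 1024 * 1024  # must be a multiple of 256 KB
    bucket = storage.Client().bucket(bucket_name)
    blob = bucket.blob(source_path, chunk_size=chunk_size)
    blob.upload_from_filename(source_path)

# Example (requires credentials and an existing bucket):
# upload_with_chunk_size("UPDATE_THIS-bucket", "UPDATE_THIS.csv", chunk_mb=100)
```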
Code example (find and replace UPDATE_THIS): …
Thanks!