Uploads Using Python Google Cloud Storage/Bigquery Client Libraries are Very Slow #238

Closed
KevinTydlacka opened this issue Jul 15, 2021 · 14 comments · Fixed by googleapis/python-bigquery#799

@KevinTydlacka

This was originally opened as a Stack Overflow question, then reported as a bug through the IssueTracker site, but neither has gotten any traction. I'm thinking this is the best place to report the issue, since the maintainers might have quicker insight into what the problem might be.

Issue

Using the python google.cloud.bigquery.client.Client load_table_from_file() method, or first uploading files to my storage bucket with google.cloud.storage.blob.Blob upload_from_file() (to then load into BQ), both result in awfully slow uploads (1-4 MBps), while gsutil/Dropbox show 10x those speeds on the same machine/environment.

I am using python 3.6.9. All metrics are tested using a single-file upload. I have tried:

  • Running in a docker container on a Google Compute Engine Ubuntu VM.
  • Running in a docker container on my Mac.
  • Running on my Mac using just python (no docker).
  • Uploading the whole file from memory, from disk, uncompressed, and gzipped. No difference.
  • Using both older and the most recent python client library versions. For the older clients (1.24.0 BigQuery, 1.25.0 Storage) I see 1-3 MBps upload speeds; for the 2.13.1 BigQuery client I see 3-4 MBps.

All these tests resulted in identically slow performance.

I have 900+ Mbps up/down on my mac. The Dropbox python client library running on the same setups easily smokes these speeds using the exact same files, and gsutil/bq on my mac also shows 10x+ speeds for the same file.

Environment details

  • OS type and version:
  • Python version: 3.6
  • pip version: 21.1.3
  • google-cloud-bigquery version: Multiple

Steps to reproduce

  1. Attempt to upload a large (1-4GB) CSV file to an existing BQ table using the python bigquery client load_table_from_file() method. See the code example below for a sample script as well as a sample bq command line comparison.

  2. The same limited speed can be observed when using the python storage client blob.upload_from_file() (a minimal sketch follows this list).
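
For reference, a minimal sketch of that storage-client path (the bucket name is a placeholder, and it assumes default application credentials are configured):

from google.cloud import storage

storage_client = storage.Client()
bucket = storage_client.bucket("UPDATE_THIS")  # placeholder bucket name
blob = bucket.blob("test_table_data.csv")

with open("test_table_data.csv", "rb") as source_file:
    blob.upload_from_file(source_file)  # shows the same 1-4 MBps ceiling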

Code example (Find and Replace UPDATE_THIS)

import csv
import logging
import random
import string

from google.cloud import bigquery
from google.oauth2 import service_account


SVC_ACCT_JSON_FILENAME = "creds.json"
GCP_PROJECT_ID = "UPDATE_THIS"
DATASET_ID = "UPDATE_THIS"
BQ_TABLE_ID = "UPDATE_THIS"

CSV_FILENAME = "test_table_data.csv"
CSV_NUM_ROWS = 3000000

"""
-- Create table in BQ before running this script
CREATE TABLE UPDATE_THIS.UPDATE_THIS
(
  col1_str STRING,
  col2_str STRING,
  col3_str STRING,
  col4_str STRING,
  col5_str STRING,
  col6_str STRING,
  col7_int INT64,
  col8_int INT64
)

"""

# Command line comparison. This uses full bandwidth ~70 MBps on my mac vs
# about 4MBps on the same machine using this script/python client
"""
bq load \
    --source_format=CSV \
    --replace=true \
    --skip_leading_rows=1 \
    UPDATE_THIS:UPDATE_THIS.UPDATE_THIS \
    ./test_table_data.csv \
    col1_str:STRING,col2_str:STRING,col3_str:STRING,col4_str:STRING,col5_str:STRING,col6_str:STRING,col7_int:INTEGER,col8_int:INTEGER
"""


def main():
    generate_csv()  # Run first time then reuse

    # Create client
    credentials = service_account.Credentials.from_service_account_file(
        SVC_ACCT_JSON_FILENAME,
        scopes=["https://www.googleapis.com/auth/cloud-platform"],
    )
    bq_client = bigquery.Client(
        credentials=credentials,
        project=GCP_PROJECT_ID
    )

    dataset_ref = bq_client.dataset(DATASET_ID)
    table_ref = dataset_ref.table(BQ_TABLE_ID)

    config = bigquery.LoadJobConfig()
    config.autodetect = False
    config.source_format = "CSV"
    config.skip_leading_rows = 1
    config.write_disposition = "WRITE_TRUNCATE"

    logging.info("Beginning load job...")
    with open(CSV_FILENAME, "rb") as source_file:
        job = bq_client.load_table_from_file(
            source_file,
            table_ref,
            job_config=config
        )
    job.result()  # Waits for the already-started load job to complete.
    logging.info("Job ID: %s", job.job_id)
    if job.errors is None and job.error_result is None:
        logging.info("BQ load job complete without error!")
        logging.info(
            "Loaded %d rows",
            job.output_rows
        )
    else:
        msg = ("bderr: BQ load job failed with error_result: "
               f"{job.error_result} and errors: {job.errors}")
        logging.error(msg)


def generate_csv():
    """Generates csv of string/int data types. File size should be around 1GB
    and include the header.

    """
    logging.info("Generating CSV...")
    header = [
        "col1_str",
        "col2_str",
        "col3_str",
        "col4_str",
        "col5_str",
        "col6_str",
        "col7_int",
        "col8_int"
    ]

    char_bank = string.ascii_letters + string.digits
    with open(CSV_FILENAME, "w") as fout:
        w_csv = csv.writer(fout)
        w_csv.writerow(header)
        for x in range(CSV_NUM_ROWS):
            if x % 100000 == 0:
                logging.info("Written %d out of %d rows...", x, CSV_NUM_ROWS)
            w_csv.writerow([
                "".join(random.choices(char_bank, k=48)),
                "".join(random.choices(char_bank, k=48)),
                "".join(random.choices(char_bank, k=48)),
                "".join(random.choices(char_bank, k=48)),
                "".join(random.choices(char_bank, k=48)),
                "".join(random.choices(char_bank, k=48)),
                random.randrange(100000000),
                random.randrange(100000000),
            ])


if __name__ == "__main__":
    fmt = "%(asctime)s %(name)-25s  %(module)-24s %(levelname)9s: %(message)s"
    logging.basicConfig(format=fmt)
    logging.getLogger().setLevel(logging.INFO)
    logging.info("SCRIPT START")
    main()
    logging.info("SCRIPT END")


Thanks!

@tswast tswast transferred this issue from googleapis/python-bigquery Jul 15, 2021
@tswast (Contributor) commented Jul 15, 2021

Thanks for the report. I'm not sure what bq load is doing differently from the Python client, so maybe that gives an area to look for improvement.

Upload code path is actually shared across the BQ and Storage clients, so I've moved this issue to that shared package.

@tswast (Contributor) commented Jul 15, 2021

I looked at the source to bigquery_client.py (the bq CLI tool). It's using https://github.com/googleapis/google-api-python-client/blob/11d78e0fe9290d1a8c516a0e2f2e019dbd1877f9/googleapiclient/http.py#L552 rather than this resumable-media package, so maybe there are some performance-gaining lessons that can be derived from that?
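
For comparison, a minimal sketch of that googleapiclient code path (the mimetype and the 100MB chunksize are illustrative assumptions; the 100MB figure is confirmed further down this thread):

from googleapiclient.http import MediaFileUpload

# Chunked, resumable upload of the CSV; bq drives this through its own
# discovery-based BigQuery client rather than google-resumable-media.
media = MediaFileUpload(
    "./test_table_data.csv",
    mimetype="application/octet-stream",
    chunksize=100 * 1024 * 1024,
    resumable=True,
)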

@tswast tswast added the api: bigquery Issues related to the BigQuery API. label Jul 15, 2021
@jimfulton jimfulton self-assigned this Jul 15, 2021
@tseaver (Contributor) commented Jul 15, 2021

Could gsutil actually be breaking the file apart into chunks, uploading them separately in parallel, and composing them together when done?

@jimfulton

Could gsutil actually be breaking the file apart into chunks, uploading them separately in parallel, and composing them together when done?

It can, but doesn't by default.

@jimfulton

Here's an update. I decided to start with google-cloud-storage and gsutil, since they are lower-level.

I ran the generate_csv function to generate an 895MB CSV file.

Uploading this with gsutil and the storage Python client (GCS) took the same time, ~16s.

I also created a 3GB randomly-generated CSV. Uploading with gsutil took about 168s and with GCS ~150s.

I ran tests on a compute-engine instance, to try to minimize network effects.

A complication was that there seemed to be some sort of throttling at play: successive runs got slower, sometimes a lot slower. I found I needed to wait a couple of minutes between runs to get consistent results.

In any case, I can't reproduce the reported slowness of the storage Python client.

Some additional notes:

  • I verified that the Python client is using a 100MB chunk size by default. Increasing it to 300MB didn't improve the time for the 3GB upload (see the sketch after this list).
  • gsutil is written in Python :)
  • I played with the parallel-upload feature of gsutil, but didn't see much of an improvement for these file sizes.
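
A minimal sketch of how that chunk size can be overridden on the storage client (the bucket name is a placeholder; chunk_size must be a multiple of 256 KiB):

from google.cloud import storage

client = storage.Client()
bucket = client.bucket("UPDATE_THIS")  # placeholder bucket name
# 300MB chunk size, matching the experiment above (a multiple of 256 KiB).
blob = bucket.blob("test_table_data.csv", chunk_size=300 * 1024 * 1024)

with open("test_table_data.csv", "rb") as source_file:
    blob.upload_from_file(source_file)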

On to BigQuery. :)

@jimfulton commented Jul 16, 2021

I'm not finding a significant difference in uploading test_table_data.csv (895MB) between bq and the bigquery client library.

Here are some times:
bq: 1m21.745s, 2m5.444s
BigQuery Python library: 3m9s, 2m39s, 2m55s, 2m9s

The library is a bit slower. It's using a small (1MB) chunk size. Increasing that to 100MB gives times of 1m52s and 1m42s (a sketch of that override follows).
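
A hedged sketch of that override: it pokes at a private constant (_DEFAULT_CHUNKSIZE, as named in the client versions discussed in this thread), so treat it as a workaround under that assumption, not a supported API:

from google.cloud.bigquery import client as bq_client_module

# Private constant; the name may change or disappear in other releases.
bq_client_module._DEFAULT_CHUNKSIZE = 100 * 1024 * 1024  # 100MB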

@jimfulton

@KevinTydlacka your script has:

generate_csv() # Run first time then reuse

But I don't see it being reused. Could your times be including that? Generating the file takes a long time.

I adapted your script slightly:

  • I separated file generation into a separate script that I ran once.
  • I'm running on a bare tiny compute-engine instance.
  • I'm using Python 3.6.14.
  • I'm setting up creds via the GOOGLE_APPLICATION_CREDENTIALS environment variable (see the sketch after this list).
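
With that variable set, the client can be constructed without explicitly loading a service-account file; a minimal sketch (the project ID is a placeholder):

from google.cloud import bigquery

# Credentials are discovered from GOOGLE_APPLICATION_CREDENTIALS automatically.
bq_client = bigquery.Client(project="UPDATE_THIS")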

@KevinTydlacka (Author)

Hmm, no. I comment that out/delete it and just reuse the same generated file after it is created on the first run.

Since it was quick, I just tested against Python 3.9.6 and am seeing the same (slow) behavior. It seems odd that it would be a credentialing issue. I will try it out on a little Compute Engine instance again and see if I can get you more info.

It's worth noting that whoever responded to my initial bug report was able to replicate the issue. I'm not sure if you can see who that is to collect details from them too in case it helps target the investigation.

@KevinTydlacka (Author) commented Jul 16, 2021

OK @jimfulton I tried to duplicate your setup and am seeing the same (originally reported/slow) behavior. Details:

Made a new Compute Engine e2-micro instance in us-central1-a with the following specs:
[screenshot of instance specs]

Set GOOGLE_APPLICATION_CREDENTIALS env variable to point to the .json credentials file for a service account on the project with the following Roles:
[screenshot of service account roles]

Installed venv, installed and upgraded pip3, then made a virtual environment using the default Python 3.6.9 that comes with Ubuntu 18.04 LTS.

Installed the google-cloud-bigquery python module in the virtual environment. pip freeze shows this:

cachetools==4.2.2
certifi==2021.5.30
cffi==1.14.6
charset-normalizer==2.0.3
google-api-core==1.31.0
google-auth==1.33.0
google-cloud-bigquery==2.21.0
google-cloud-core==1.7.1
google-crc32c==1.1.2
google-resumable-media==1.3.1
googleapis-common-protos==1.53.0
grpcio==1.38.1
idna==3.2

Moved over a 938MB sample file generated from my script above.

Attempted the upload (bq load) using BigQuery CLI 2.0.69 that comes stock on the VM using the full command I have commented out in the script above. The whole thing took 19 seconds to complete successfully.

Ran your script from your gist against the same file, with the same destination (I only edited the dataset and table in your script) and got the same slow behavior. Output from your script was:

[]
[]
 154.9734525680542
[Row((3000000,), {'f0_': 0})]

The difference in python versions shouldn't matter, since I still see the same issues on recent Python versions (3.9.6). And the only real difference between what we're doing is a different credential, project, and table. Let me know if you think of anything else. Thanks for working on this!

@unforced unforced added type: bug Error or flaw in code with unintended results or allowing sub-optimal usage patterns. priority: p2 Moderately-important priority. Fix may not be included in next release. labels Jul 17, 2021
@jimfulton

Sorry, just saw this. I'll try reproducing your setup. :)

@jimfulton

I tried to reproduce this. I created an e2-micro instance in us-central1-a using a ubuntu-minimal-1804-bionic-v20210720 image (the only 18.04 image I saw). bq reports "This is BigQuery CLI 2.0.70".

The bq load command I'm using:

bq load \
    --source_format=CSV \
    --replace=true \
    --skip_leading_rows=1 \
    riversnake.test_table_data \
    ./test_table_data.csv \
    col1_str:STRING,col2_str:STRING,col3_str:STRING,col4_str:STRING,col5_str:STRING,col6_str:STRING,col7_int:INTEGER,col8_int:INTEGER

The timing:

jim@test-bq-upload-vm:~/p/238$ time sh bq.sh
Upload complete.
Waiting on bqjob_r27cd9c58679eb36c_0000017aca9f2563_1 ... (48s) Current status: DONE   

real	1m4.794s
user	0m2.792s
sys	0m2.246s

It appears to be taking ~17s to transmit the data and 48s to process. The overall time is ~65s (as low as 63s on other runs).

That's much larger than the 19s you're reporting. I have no idea why you're seeing much shorter times.

Using the library is taking ~124s for me, which is in the same ballpark that you're seeing, and is about twice as long as bq.

(I tried running with Python 2, which is what bq is using, and that took longer, ~160s.)

Increasing the chunk size used internally by the library from 1MB to 100MB yielded times as low as 58s, which is in line with bq. With large chunk sizes, however, the times are a lot more variable.

I need to look at what bq is doing internally.

@jimfulton

It's worth noting that whoever responded to my initial bug report was able to replicate the issue.
I'm not sure if you can see who that is to collect details from them too in case it helps target the investigation.

They said:

We are still investigating the issue. At this point we obtained [1] for the script and [2] for the bq command, where the "Upload complete" was achieved in about 11 seconds.
Further updates will be published here.

[1]
2021-06-30 06:55:01,496 root test_uploads INFO: Beginning load job...
2021-06-30 06:57:08,662 root test_uploads INFO: Job ID: XXXXXXXX-XXXX-XXXX-XXXX-XXXXXXXXXXXX
2021-06-30 06:57:08,662 root test_uploads INFO: BQ load job complete without error!

[2]
Upload complete.
Waiting on bqjob_XXXXXXXXXXXXXXXXX_XXXXXXXXXXXXXXXX_X ... (48s) Current status: DONE

Their times are in line with what I'm seeing. The library is taking about 127s and the total bq time is 59s (11s + 48s). (They aren't showing wall time for the bq run, so it's a little hard to compare with my wall time for the total.)

Comparing just upload time with the library time doesn't make a lot of sense, because the library time includes upload and processing.
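
One way to separate the two phases is to time the upload call and the job wait independently; a minimal sketch, reusing the bq_client, table_ref, and config objects from the script earlier in this thread:

import time

t0 = time.monotonic()
with open(CSV_FILENAME, "rb") as source_file:
    # Returns once the bytes have been uploaded and the load job is created.
    job = bq_client.load_table_from_file(source_file, table_ref, job_config=config)
t1 = time.monotonic()
job.result()  # Blocks until BigQuery finishes processing the loaded data.
t2 = time.monotonic()
print(f"upload: {t1 - t0:.1f}s, processing: {t2 - t1:.1f}s")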

@jimfulton

I verified that bq uses a chunk size of 100MB. Upping the library's chunk size to 100MB makes the total times for the library and bq the same.

Note that total time, including processing time, can vary quite a bit for both bq and the library, because processing time can vary a lot depending on the computing resources allocated to the job. That makes measurement tricky. :)

In any case, we'll update the library chunk size to 100MB.
