'table.insert_data()' frequently fails in the Python BigQuery client with error: [Errno 32] Broken pipe #2491

Closed
rishsriv opened this issue Oct 4, 2016 · 10 comments

rishsriv commented Oct 4, 2016

Streaming inserts to a BigQuery table frequently fail with error: [Errno 32] Broken pipe on Debian 7.11 with Python 2.7.3 and google-cloud-python 0.20. The error occurs around 30% of the time.

This happens with a small number of rows (<2000), where each row is less than 2 KB. Hence, it is unlikely to be a quota problem.

The code is executed on a Google Cloud Compute VM every 30 minutes. It has failed in 6 out of the last 26 runs.

Here is the code that I am using:

from google.cloud import bigquery
client = bigquery.Client(project=PROJECT_NAME)
dataset = client.dataset(DATASET_NAME)
table = dataset.table(TABLE_NAME, schema=SCHEMA)

# some code to generate ROWS to be inserted

table.insert_data(ROWS)

Here is the full error traceback:

Traceback (most recent call last):
  File "scrape_cronjob.py", line 277, in <module>
    errors = table.insert_data(ROWS)
 File "/usr/local/lib/python2.7/dist-packages/google/cloud/bigquery/table.py", line 773, in insert_data
    data=data)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/connection.py", line 346, in api_request
    target_object=_target_object)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/connection.py", line 244, in _make_request
    return self._do_request(method, url, headers, data, target_object)
  File "/usr/local/lib/python2.7/dist-packages/google/cloud/connection.py", line 273, in _do_request
    body=data)
  File "/usr/local/lib/python2.7/dist-packages/oauth2client/transport.py", line 169, in new_request
    redirections, connection_type)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1609, in request
    (response, content) = self._request(conn, authority, uri, request_uri, method, body, headers, redirections, cachekey)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1351, in _request
    (response, content) = self._conn_request(conn, request_uri, method, body, headers)
  File "/usr/local/lib/python2.7/dist-packages/httplib2/__init__.py", line 1273, in _conn_request
    conn.request(method, request_uri, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1000, in request
    self._send_request(method, url, body, headers)
  File "/usr/lib/python2.7/httplib.py", line 1034, in _send_request
    self.endheaders(body)
  File "/usr/lib/python2.7/httplib.py", line 996, in endheaders
    self._send_output(message_body)
  File "/usr/lib/python2.7/httplib.py", line 847, in _send_output
    self.send(msg)
  File "/usr/lib/python2.7/httplib.py", line 823, in send
    self.sock.sendall(data)
  File "/usr/lib/python2.7/ssl.py", line 229, in sendall
    v = self.send(data[count:])
  File "/usr/lib/python2.7/ssl.py", line 198, in send
    v = self._sslobj.write(data)
error: [Errno 32] Broken pipe
daspecster added the api: bigquery label on Oct 4, 2016
daspecster (Contributor) commented:

Thanks for reporting!

Do you happen to know roughly how long this operation takes on average?


rishsriv commented Oct 4, 2016

Happy to report! It takes less than a second or two.

daspecster (Contributor) commented:

I'm not entirely sure of the scenario, but if you're streaming the data across locations then you might see this more often.

I don't see any recent incidents on https://status.cloud.google.com.

I would suggest writing logic to retry on a failure like this. If I understand correctly, though, this operation is non-transactional, so retrying could insert duplicate rows (although the duplicates could be removed).

@tseaver @dhermes, any ideas?


rishsriv commented Oct 4, 2016

Thanks for the update. I am indeed streaming the data across locations (Asia to North America).

I am currently using the following logic, which ensures that the rows are inserted virtually all of the time, but it would be nice to have ways to make this more robust :)

# retry the insert up to 5 times before giving up
inserted_successfully = False
num_tries = 0
while not inserted_successfully and num_tries < 5:
    try:
        errors = table.insert_data(ROWS)
        print 'num_errors', len(errors)
        inserted_successfully = True
    except Exception as e:
        print 'insert failed with exception: %s\ntrying again' % e
        num_tries += 1
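
A more robust variant of the same loop (an illustrative sketch, not code from the thread) would sleep with exponential backoff plus jitter between attempts, so repeated failures back off instead of hammering the API:

import random
import time

inserted_successfully = False
num_tries = 0
while not inserted_successfully and num_tries < 5:
    try:
        errors = table.insert_data(ROWS)
        print 'num_errors', len(errors)
        inserted_successfully = True
    except Exception as e:
        num_tries += 1
        # back off 1s, 2s, 4s, ... plus up to 1s of random jitter
        delay = 2 ** (num_tries - 1) + random.random()
        print 'insert failed with exception: %s\nretrying in %.1fs' % (e, delay)
        time.sleep(delay)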


tseaver commented Oct 4, 2016

@rishsriv It might be "safer" to pass row_ids with your rows, allowing the back-end to de-duplicate any which happen to be inserted twice. I have logged #2492 to make that more obvious.
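
A sketch of that suggestion, assuming the row_ids parameter of insert_data in this version of the library; the hashing scheme here is illustrative, and any ID that stays stable across retries of the same logical row would do:

import hashlib

# derive a stable ID from each row's contents, so a retried insert of
# the same row carries the same ID and the back-end can de-duplicate it
row_ids = [hashlib.md5(repr(row)).hexdigest() for row in ROWS]
errors = table.insert_data(ROWS, row_ids=row_ids)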

tseaver added the flaky label on Oct 4, 2016

tseaver commented Oct 4, 2016

We should probably also look at promoting our "retry handlers" from private helpers in our system_tests/ into actual public APIs of the library.
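
As a rough, entirely hypothetical illustration of what such a public helper might look like (the actual system_tests/ handlers may be structured quite differently):

import functools
import socket
import time

def retry_on(exceptions, tries=5, delay=1.0, backoff=2.0):
    # hypothetical decorator: retry the wrapped call on the given
    # exceptions, doubling the wait between attempts
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            wait = delay
            for attempt in range(tries):
                try:
                    return fn(*args, **kwargs)
                except exceptions:
                    if attempt == tries - 1:
                        raise
                    time.sleep(wait)
                    wait *= backoff
        return wrapper
    return decorator

# usage: wrap insert_data so broken pipes are retried automatically
safe_insert = retry_on((socket.error,))(table.insert_data)
errors = safe_insert(ROWS)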

daspecster (Contributor) commented:

@tseaver, could GAPIC LROs help here if this were implemented with GAPIC?


tseaver commented Oct 4, 2016

@daspecster nope, for two reasons:

  • BigQuery is a JSON-only service.
  • Table.insert_data does not return an LRO.


dhermes commented Oct 4, 2016

@tseaver We get retries for free with urllib3 / requests. I prefer to just finish ditching httplib2 to fix these broken socket errors.
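
For context, urllib3's retry support, used through requests, looks like this (a standalone sketch of what "retries for free" means, not the client's actual plumbing):

import requests
from requests.adapters import HTTPAdapter
from urllib3.util.retry import Retry

session = requests.Session()
# retry transient server errors up to 5 times with exponential backoff
retries = Retry(total=5, backoff_factor=0.5,
                status_forcelist=[500, 502, 503, 504])
session.mount('https://', HTTPAdapter(max_retries=retries))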

lukesneeringer added the priority: p2 label on Apr 19, 2017

tswast commented Aug 11, 2017

Closing this issue, as we are tracking it internally as a GA requirement.
