pubsub Error: The operation was aborted #2661

Closed
c0b opened this issue Oct 9, 2017 · 15 comments
@c0b
Contributor

c0b commented Oct 9, 2017

Background: I have a simple one-function project that continuously loads data into a BigQuery table. It is triggered by GCS pubsub-notifications: whenever a new file finishes uploading to a particular bucket with a particular prefix, a pubsub notification triggers my function. This looked like a perfect use case for GCF, but it is blocked by https://issuetracker.google.com/issues/66695033 (see also https://stackoverflow.com/questions/45304673/random-apierror-invalid-credentials-calling-bigquery-from-google-cloud-functi): the function fails with a random ApiError: Invalid Credentials once every couple of days. I believe that is a GCF operational issue, and that Google Cloud Functions isn't really production ready.

So I switched to a plain GCE VM, with a small wrapper that subscribes manually to the same pubsub topic (fed by the same GCS pubsub-notifications) and, whenever a message comes in, calls the same function designed for GCF. The function is very simple; it just calls bigquery.dataset('...').table('...').import(gcs.bucket('...').file('...')).then('...')

It ran fine but didn't last long: each run lasts no more than a week on GCE before the process aborts itself. My workaround is a shell wrapper running an endless loop: while :; do node ... ; done

But I wonder why pubsub aborted the gRPC connection in this case? The libraries in use are the latest:
@google-cloud/pubsub@0.14.4 @google-cloud/bigquery@0.9.6 with node-v8.6.0

events.js:182
      throw er; // Unhandled 'error' event
      ^

Error: The operation was aborted.
    at ClientDuplexStream.<anonymous> (/path/to/my-project/node_modules/@google-cloud/pubsub/src/connection-pool.js:218:21)
    at emitOne (events.js:115:13)
    at ClientDuplexStream.emit (events.js:210:7)
    at ClientDuplexStream._emitStatusIfDone (/path/to/my-project/node_modules/grpc/src/node/src/client.js:260:10)
    at ClientDuplexStream._receiveStatus (/path/to/my-project/node_modules/grpc/src/node/src/client.js:233:8)
    at /path/to/my-project/node_modules/grpc/src/node/src/client.js:757:12
@stephenplusplus added the "api: pubsub" label (Issues related to the Pub/Sub API) on Oct 9, 2017
@callmehiphop
Contributor

@murgatroid99 any ideas why we might be getting an aborted error here?

@murgatroid99

I'm not that familiar with the surface APIs of this library; what exactly is it that is running for a week before failing? Do you know what is supposed to be happening there, in terms of how the gRPC API is being called? Is it a sequence of requests or a single long-lived stream, or something else?

@callmehiphop
Contributor

For this specific usage we open 5 bidi streams. When any of the streams closes we usually replace it with a new stream, but not on ABORTED errors.
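
The replace-on-close behavior described here could be sketched roughly as follows. This is only an illustration under stated assumptions: `StreamPool`, `openStream`, and `isRetryable` are hypothetical names, and plain EventEmitters stand in for the bidi streams; it is not the library's actual connection pool.

```javascript
const { EventEmitter } = require('events');

// Hypothetical pool: keeps `size` streams open and replaces any stream whose
// final status is considered retryable. Not the library's real implementation.
class StreamPool extends EventEmitter {
  constructor(openStream, size, isRetryable) {
    super();
    this.openStream = openStream;   // factory returning a new stream
    this.size = size;
    this.isRetryable = isRetryable; // (code) => boolean
    this.streams = new Set();
  }

  open() {
    while (this.streams.size < this.size) this.add();
  }

  add() {
    const stream = this.openStream();
    this.streams.add(stream);
    // The stream reports its final status as an event carrying a numeric code.
    stream.once('status', (status) => {
      this.streams.delete(stream);
      if (this.isRetryable(status.code)) {
        this.add(); // replace the closed stream
        return;
      }
      const err = new Error(status.details);
      err.code = status.code;
      this.emit('error', err); // surface non-retryable closes to the caller
    });
  }
}

// Usage with fake streams (plain EventEmitters) standing in for bidi streams.
const opened = [];
const pool = new StreamPool(
  () => { const s = new EventEmitter(); opened.push(s); return s; },
  5,
  (code) => code !== 10 // assumption: ABORTED (10) is the one code not retried
);
const errors = [];
pool.on('error', (err) => errors.push(err));
pool.open();

opened[0].emit('status', { code: 14, details: 'unavailable' });                // replaced
opened[1].emit('status', { code: 10, details: 'The operation was aborted.' }); // surfaced
console.log(pool.streams.size, opened.length, errors.length);
```

With a predicate that excludes ABORTED, the second close shrinks the pool and reaches the caller as an `error` event, which is the situation reported in this issue.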

@murgatroid99

OK. It looks like the ABORTED gRPC status is never generated by the library; it will only be sent by the server application code. You can find our recommended use of that status code here. So this is most likely a server-side issue.

@c0b
Contributor Author

c0b commented Oct 9, 2017

What do you mean by a server-side issue?

As far as I can share, the application has only the GCF function and the wrapper function as main:

// This was meant to be deployed on GCF, but I designed the main wrapper to call it the same way
function loadData(event) {
  // parse the event to get the GCS bucket and filename from the message
  return bigquery.dataset('...').table('...')
    .import(gcs.bucket('...').file('...'))
    .then(([ job, _ ]) => job.promise())
}
exports.loadData = loadData;

async function main() {
  const topic = pubsub.topic('...');
  const [ sub, _ ] = await topic.createSubscription(...);
  // ackDeadlineSeconds is 90 seconds, although the BigQuery import takes no more than 20s
  sub.on('message', async function(msg) {
    try {
      await loadData({ data: msg });
    } catch (err) { console.error(err); }
    msg.ack();
  });
}

if (require.main === module) {
  main();
}

Because GCF currently still runs Node v6 rather than the v8 I wanted, I kept the loadData function on the Promise API only, without async/await, so it can still be deployed to GCF at any time.

In our architecture, another piece of code uploads one Avro file to GCS every minute; loadData's job is just to import it into a BigQuery table.

So on GCE at runtime, this function is supposed to be called once every minute (on every Avro file upload). We monitor this node process and it looks normal: CPU and memory usage don't go up. However, after running continuously for no more than 7 days, the node process goes down because of the pubsub Error: The operation was aborted

@murgatroid99

I mean that this is an error that is originating from the PubSub server.

@callmehiphop
Contributor

@jganetsk do we know why the server would return this kind of error after a week?

@c0b
Contributor Author

c0b commented Oct 9, 2017

Then, in the case where the PubSub server aborts the connection, is there any workaround so that my main function can handle it and reconnect, and keep the node process from going down?

@callmehiphop
Contributor

I think you should be able to catch the error via the error event and re-open the subscription.

var ABORTED = 10;
var subscription;

openSubscription();

function openSubscription() {
  subscription = pubsub.subscription('my-sub');
  subscription.on('error', handleError).on('message', onMessage);
}

function handleError(err) {
  if (err.code === ABORTED) {
    subscription.close(openSubscription);
  }
}

function onMessage(message) {
  message.ack();
}
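
For context on why attaching an `error` listener keeps the process alive: Node's EventEmitter throws when `error` is emitted with no listener registered, which matches the "Unhandled 'error' event" line in the stack trace above. A minimal sketch, assuming a plain EventEmitter as a stand-in for the subscription object (`emitAborted` is a hypothetical helper, not part of any library):

```javascript
const { EventEmitter } = require('events');

// Hypothetical helper: emits an error shaped like the one in the report.
function emitAborted(emitter) {
  const err = new Error('The operation was aborted.');
  err.code = 10; // gRPC ABORTED
  emitter.emit('error', err);
}

// Without an 'error' listener, emit('error') throws the error itself.
// Uncaught, that throw is what takes the process down.
const unhandled = new EventEmitter();
let crashed = false;
try {
  emitAborted(unhandled);
} catch (err) {
  crashed = true;
}

// With a listener, the same error is delivered to the handler instead.
const handled = new EventEmitter();
let seenCode = null;
handled.on('error', (err) => { seenCode = err.code; });
emitAborted(handled);

console.log({ crashed, seenCode });
```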

@jganetsk

jganetsk commented Oct 9, 2017

@callmehiphop When subscribing, the client library should catch all retryable errors (including "The operation was aborted") and reconnect instead of propagating the error to the caller.

@callmehiphop
Contributor

@jganetsk can we get a definitive list of which errors should be retryable in PubSub? Historically we've only been requested to retry on the following errors:

  • UNKNOWN
  • RESOURCE_EXHAUSTED
  • INTERNAL
  • UNAVAILABLE
  • DATA_LOSS

In the PubSub client we also retry on CANCELLED and DEADLINE_EXCEEDED errors.

@jganetsk

jganetsk commented Oct 9, 2017

We did discuss this previously, and I was concerned that we were not covering enough codes for retry. Assuming that this is the list of codes: https://github.com/googleapis/googleapis/blob/master/google/rpc/code.proto

These are definitely not retryable because the user needs to take action:

  • NOT_FOUND
  • UNAUTHENTICATED
  • PERMISSION_DENIED

These would be unexpected and it's your call whether to retry them or not (I would lean towards retrying these):

  • INVALID_ARGUMENT
  • ALREADY_EXISTS
  • FAILED_PRECONDITION
  • OUT_OF_RANGE

The rest are retryable. ABORTED is definitely retryable. When pulling for messages, you really need a very good reason not to retry. And note that the client library does propagate an error if all retries consistently fail for an extended period of time (I forget what we used for that value).
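
The classification above could be expressed as a small predicate. This is a sketch, not the client library's code: the numeric values are those defined in google/rpc/code.proto, while `isRetryable` and the `retryUnexpected` option are hypothetical names, and treating OK as not retryable is an assumption, since it is not an error.

```javascript
// Status codes as defined in google/rpc/code.proto.
const Code = {
  OK: 0, CANCELLED: 1, UNKNOWN: 2, INVALID_ARGUMENT: 3, DEADLINE_EXCEEDED: 4,
  NOT_FOUND: 5, ALREADY_EXISTS: 6, PERMISSION_DENIED: 7, RESOURCE_EXHAUSTED: 8,
  FAILED_PRECONDITION: 9, ABORTED: 10, OUT_OF_RANGE: 11, UNIMPLEMENTED: 12,
  INTERNAL: 13, UNAVAILABLE: 14, DATA_LOSS: 15, UNAUTHENTICATED: 16,
};

// The user must take action, so retrying cannot help.
const NOT_RETRYABLE = new Set([
  Code.NOT_FOUND, Code.UNAUTHENTICATED, Code.PERMISSION_DENIED,
]);

// Unexpected while pulling; whether to retry is a judgment call.
const UNEXPECTED = new Set([
  Code.INVALID_ARGUMENT, Code.ALREADY_EXISTS,
  Code.FAILED_PRECONDITION, Code.OUT_OF_RANGE,
]);

// Hypothetical predicate. `retryUnexpected` defaults to true, following the
// stated lean towards retrying the unexpected codes.
function isRetryable(code, { retryUnexpected = true } = {}) {
  if (code === Code.OK) return false; // assumption: OK is not an error
  if (NOT_RETRYABLE.has(code)) return false;
  if (UNEXPECTED.has(code)) return retryUnexpected;
  return true; // "the rest are retryable", including ABORTED
}

console.log(isRetryable(Code.ABORTED), isRetryable(Code.NOT_FOUND));
```

A real client would presumably pair such a predicate with an overall retry deadline, since the comment notes that an error is still propagated when retries keep failing for an extended period.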

@c0b
Contributor Author

c0b commented Oct 10, 2017

Yeah, I would like to see that ABORTED retry logic in the client library instead of in my app.
But will that be happening anytime soon? I would prefer to just do a pubsub library upgrade.

@callmehiphop
Contributor

@c0b I'm going to open a PR right now, it'll be pretty small so I imagine I'll have a release out by EOD.

@callmehiphop
Contributor

@c0b I've published 0.14.5 and it adds support for retrying on abort errors.
