Translate gRPC Canceled code to Nexus HandlerErrorTypeInternal #1680
Conversation
```diff
@@ -462,11 +462,11 @@ func convertServiceError(err error) error {
 	errMessage := err.Error()

 	switch st.Code() {
-	case codes.AlreadyExists, codes.Canceled, codes.InvalidArgument, codes.FailedPrecondition, codes.OutOfRange:
+	case codes.AlreadyExists, codes.InvalidArgument, codes.FailedPrecondition, codes.OutOfRange:
```
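For context, a minimal sketch of how the mapping might look after this change, assuming the `nexus-rpc/sdk-go` error helpers (`nexus.HandlerErrorf`, `nexus.HandlerErrorTypeBadRequest`, `nexus.HandlerErrorTypeInternal`). The surrounding SDK code and the full set of handled codes will differ, so treat the grouping below as illustrative only:

```go
package sketch

import (
	"github.com/nexus-rpc/sdk-go/nexus"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// convertServiceErrorSketch mirrors the shape of convertServiceError above:
// gRPC codes that indicate a caller mistake map to a non-retryable BadRequest
// handler error, while Canceled now maps to a retryable Internal handler error
// so a single failed RPC does not fail the whole operation.
func convertServiceErrorSketch(err error) error {
	st, ok := status.FromError(err)
	if !ok {
		return err
	}
	errMessage := err.Error()

	switch st.Code() {
	case codes.AlreadyExists, codes.InvalidArgument, codes.FailedPrecondition, codes.OutOfRange:
		// Non-retryable: the request itself is at fault.
		return nexus.HandlerErrorf(nexus.HandlerErrorTypeBadRequest, "%s", errMessage)
	case codes.Canceled:
		// Retryable: the RPC (e.g. StartWorkflowExecution) was canceled,
		// but the Nexus operation as a whole should be retried.
		return nexus.HandlerErrorf(nexus.HandlerErrorTypeInternal, "%s", errMessage)
	}
	return err
}
```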
(making comment here because it's easier to have threads as PR comments)
> To ensure these errors can be retried by the Nexus machinery in the server.

Don't we want gRPC Canceled errors to not be retried? They aren't retried anywhere else, and they're usually the result of a user cancel or a user-supplied timeout.
Here the timeout is for a single RPC, not the entire operation; we want to retry the RPC (to start a workflow).
We're seeing CI failures since introducing the gRPC status translation logic, and there's definitely no intent to fail these Nexus operations. I'm not 100% sure where these Canceled errors are coming from; they could be coming internally from the server or from the SDK.
I think it's worth investigating. I can't think of a scenario where we'd ever want to retry this status code. It's almost always due to an explicit cancel or timeout, unless there's a scenario I'm not familiar with.
Request timeout should only fail a single Nexus HTTP request, not the entire operation.
If a Nexus request is timing out on the worker, we shouldn't even respond.
I tried finding this error in the logs and couldn't, so I'm still not sure where it's coming from. All I know right now is that our release validation tests fail with `"400 Bad Request": context canceled` when running an SDK version with this error translation enabled, where they didn't fail before.
I found this case in the logs. I can definitely see the server reporting that the SDK is returning a BadRequest error, and what seems to be the corresponding StartWorkflowExecution RPC failing with a gRPC Canceled status.
We also know that the Nexus handler context in the SDK wasn't canceled, because otherwise it would not have sent a response back and the server would have reported this as a timeout. I don't see a strong enough reason to fail a Nexus operation for a single RPC that ended up as gRPC Canceled.
`Canceled` can also be returned on the client side if the user canceled some gRPC request they are making in a handler. I understand it's desirable to retry it in this case, but it isn't obvious that all users would want to map this to `HandlerErrorTypeInternal`. I think we should prioritize how we are going to allow users to customize this policy. Updating the SDK because a client needs a different policy is not a scalable solution.
To me, a handler implementer making a gRPC call in a handler and then canceling it (which may be due to the context being otherwise canceled), causing the gRPC call to return Canceled, is the same as an application error, both in Nexus handlers and in Temporal activities. But if it was due to an actual context cancel from the caller, it should not be retryable, just like Temporal activities, which is what the eager `ctx.Err()` checks seem to accomplish (though even those may be wrong if a context cancel is not always a caller cancel, e.g. it could be worker shutdown, so I'd have to see integration tests for both).
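A minimal sketch of the kind of eager `ctx.Err()` check described here; `translateStartError` is a hypothetical helper for illustration, not the SDK's actual code path:

```go
package sketch

import "context"

// translateStartError: if the handler context itself is already canceled
// (caller cancel or worker shutdown), surface that directly instead of
// translating the RPC's Canceled status into a retryable internal error.
func translateStartError(ctx context.Context, rpcErr error, convert func(error) error) error {
	if ctx.Err() != nil {
		// Eager check: don't mask a caller/worker cancel as a retryable error.
		return ctx.Err()
	}
	// Otherwise fall through to the gRPC-to-Nexus error mapping
	// (e.g. convertServiceError from the diff above).
	return convert(rpcErr)
}
```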
So I think I kinda agree with what this PR is doing.
I think the confusion may come from the fact that Nexus chose to split application errors into "bad request" and "internal" based solely on retryability, when really, just because an error is "retryable" doesn't mean it's "internal".
Maybe the following integration tests (if they don't already exist) would help confirm:
- Caller of operation cancels operation mid-gRPC call in handler, what does caller see. I expect operation cancel.
- Operation is canceled due to worker shutdown mid-gRPC call in handler, what does caller see. I expect retryable error.
- Handler implementer has their own context and calls cancel on it mid-gRPC call in handler, what does caller see. I expect retryable error (I think?).
> I think the confusion may come from the fact that Nexus chose to split application errors into "bad request" and "internal" based solely on retryability, when really, just because an error is "retryable" doesn't mean it's "internal".
Agree, we may need more statuses to better express more scenarios.
> Caller of operation cancels operation mid-gRPC call in handler, what does caller see. I expect operation cancel.

What do you mean by "cancels operation"? Are you referring to the `CancelOperation` RPC from the Nexus spec?
If so, yes, though that is only applicable to async operations (there's going to be support for sync operations soon, as well as for canceling operations that haven't started yet). If you're referring to canceling the RPC, the caller would get a `ctx.Err()` and may choose to handle it however they see fit. They probably don't want to consider the operation canceled though; once the first request is sent, it's important to see the operation through and not abandon it (unless the caller explicitly wants to abandon it).
> Operation is canceled due to worker shutdown mid-gRPC call in handler, what does caller see. I expect retryable error.

Seems like you're referring to the RPC here; yes, the RPC will return a retryable error with this change.
> Handler implementer has their own context and calls cancel on it mid-gRPC call in handler, what does caller see. I expect retryable error (I think?).

That would result in either a `context.Canceled` error or a gRPC `Canceled` status error; both of these are retryable, and should be IMHO (this is the same case as above).
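To make those two error shapes concrete, here is a small illustrative snippet; `isRetryableCancel` and `example` are made-up names for this sketch, not SDK API:

```go
package sketch

import (
	"context"
	"errors"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// isRetryableCancel recognizes the two shapes a handler-initiated cancel can
// take: a plain context.Canceled error, or a gRPC status error with
// codes.Canceled returned by the client stub.
func isRetryableCancel(err error) bool {
	return errors.Is(err, context.Canceled) || status.Code(err) == codes.Canceled
}

// example: the handler implementer cancels their own context while an RPC
// issued with that context is in flight.
func example(call func(ctx context.Context) error) bool {
	ctx, cancel := context.WithCancel(context.Background())
	go func() {
		time.Sleep(10 * time.Millisecond)
		cancel() // the handler's own cancel, not a caller cancel
	}()
	err := call(ctx)
	return isRetryableCancel(err) // true for both error shapes above
}
```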
Anything blocking this PR now that we've clarified the behavior?
I'm going to potentially revisit this auto error translation soon. I think these are good defaults, but we'll see how this behaves in the wild. We may need to allow users to opt out of this, or maybe only do auto error translation for