-
Notifications
You must be signed in to change notification settings - Fork 216
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Translate gRPC Canceled code to Nexus HandlerErrorTypeInternal #1680
Merged
bergundy
merged 2 commits into
temporalio:master
from
bergundy:nexus-grpc-canceled-to-internal
Oct 25, 2024
Merged
Changes from 1 commit
Commits
Show all changes
2 commits
Select commit
Hold shift + click to select a range
File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
(making comment here because it's easier to have threads as PR comments)
Don't we want gRPC canceled errors to not be retried? They aren't anywhere else and are usually the result of user cancel or user-supplied timeout.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here the timeout is for a single RPC not the entire operation, we want to retry the RPC (to start a workflow).
We're seeing CI failures since introducing the gRPC status translation logic and there's definitely no intent to fail these nexus operations. Not 100% where these Canceled errors are coming from, they could be coming internally from the server or the SDK.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's worth investigating. I can't think of a scenario where we want to ever retry this status code. It's almost always due to explicit cancel/timeout unless there's a scenario I'm not familiar with.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Request timeout should only fail a single Nexus HTTP request, not the entire operation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If a nexus request is timing out on the worker we shouldn't even respond
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I tried finding this error in the logs and couldn't find it so still not sure where this is coming from. All I know now is that our release validation tests are failing when running an SDK version with this error translation enabled with
"400 Bad Request": context canceled
where they didn't fail before.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I found this case in the logs. I can definitely see the server reporting that the SDK is returning a BadRequest error and what seems to be the corresponding StartWorkflowExecution RPC with a grpc Cancelled status.
We also know that the Nexus handler context in the SDK wasn't canceled because otherwise it would not send a response back and the server would have reported this as a timeout. I don't see a strong enough reason to fail a Nexus operation for a single RPC that ended up as grpc Cancelled.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Canceled
can also be returned by the client side if the user cancelled some grpc request they are making in a handler. In this case I understand it is desirable in this case to retry it , but it is't obvious that all users would want to map this to aHandlerErrorTypeInternal
. I think we should prioritizing how we are going to allow users to customize this policy. Updating the SDK because a client needs a different policy is not a scalable solution.https://grpc.github.io/grpc/core/md_doc_statuscodes.html
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To me a handler implementer making a gRPC call in a handler and then canceling it (which may be due to context being otherwise canceled) causing the gRPC call to return canceled is the same as an application error, both in Nexus handlers and in Temporal activities. But if it was due to context cancellation from actual context cancel from caller, it should not be retryable, like Temporal activities, which is what the eager
ctx.Err()
checks seem to accomplish (though even they may be wrong if context cancel is not always caller cancel (e.g. could be worker shutdown), so I'd have to see integration tests for both).So I think I agree kinda with what this PR is doing.
I think the confusion may be the fact that Nexus chose to split application error into bad request and internal just based on retryability when really just because an error is "retryable" doesn't mean it's "internal".
Maybe the following integration tests (if they don't already exists) would help confirm:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree, we may need more statuses to better express more scenarios.
What do you mean by "cancels operation"? Are you referring to the
CancelOperation
RPC from the Nexus spec?If so, yes, that is only applicable for async operations (there's going to be support for sync too soon and canceling operations that haven't started yet). If you're referring to canceling the RPC, the caller would get a
ctx.Err()
and may choose to handle it however they see fit. They probably don't want to consider the operation canceled though, once the first request is sent it's important to see the operation through and not abandon it (unless the caller explicitly wants to abandon).Seems like you're referring to the RPC here, yes the RPC will return a retryable error with this change.
That would result in either
context.Canceled
or gRPCCanceled
status error, both of these are retryable, and should be IMHO (this is the same case as above).