Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix nexus_task_execution_failed to include OperationError outcome in start requests #1664

Merged
merged 4 commits into from
Oct 14, 2024

Conversation

bergundy
Copy link
Member

@bergundy bergundy commented Oct 8, 2024

Why?

Consistency with other failure metrics.

@Quinn-With-Two-Ns
Copy link
Contributor

Do we not have any tests for metrics emitted that need to be updated?

@bergundy
Copy link
Member Author

bergundy commented Oct 8, 2024

Do we not have any tests for metrics emitted that need to be updated?

No tests for metrics in this case. I guess I can add some.

@bergundy bergundy enabled auto-merge (squash) October 9, 2024 00:37
// Increment failure in all forms of errors:
// Internal error processing the task.
// Failure from user handler.
// Special case for the start response with operation error.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this a task failure if the process task call doesn't return a failure? I wouldn't expect failure to start things like a child workflow to increment this value.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is when a handler returns an UnsuccessfulOperationError. A user would consider it a failure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hrmm, may need a definition for this metric. We have temporal_workflow_task_execution_failed which is actually a workflow task failure (a defined Temporal term) and we have temporal_activity_execution_failed which is an activity attempt/execution failure (we don't call this a task failure).

Is this expected to be a task failure (i.e. unexpected, non-user-returned/panic, failure to process the task) or is this an execution failure (i.e. expected, user-returned error)? I wonder if there should be two metrics, temporal_nexus_task_execution_failed and temporal_nexus_operation_execution_failed?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd be fine renaming the metric to something like temporal_nexus_operation_execution_failed not sure why we need two metrics if activity only needs one

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My issue with your suggestion is that there are other ways that the execution would fail, e.g. by returning a non retryable error in your handler or the workflow in the workflow run operation fails asynchronously.
For activities we don't distinguish between retryable and non retryable errors, so I figured this model is good enough.
What we have now maps to whether a handler method returned an error or not (or there was an internal error within the SDK).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Definition is, it's incremented if any of the following are true:

  1. The user defined handler returns any error (or panics)
  2. There's an internal error in the SDK (e.g. bug) while processing the task

Copy link
Member

@cretz cretz Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The user defined handler returns any error (or panics)

Does this include UnsuccessfulOperationError and ErrOperationStillRunning errors on GetResult? Do we really want a metric that can't differentiate between a proper failure and an internal one? Or should we not count operation failures?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We haven't implemented GetResult but I think we could exclude ErrOperationStillRunning. This change includes UnsuccessfulOperationError in the metric, where it didn't used to be included before.

The rationale is that all of HandlerErrorTypeBadRequest, non-retryable ApplicationError, and UnsuccessfulOperationError are non-retryable errors from the caller perspective and essentially fail the operation.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But I agree that it's not perfect. It wasn't perfect before either.

Copy link
Member

@cretz cretz Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the metrics should be split between user-returned operation errors and unexpected RPC failures similar to workflows. People often just stop using activity failure metrics because they are using intentional failures for some things, and now they can't even know what panics. They have to create their own metrics to differentiate. The difference is more obvious in other languages where unexpected failures/exceptions are more common and the languages are used to differentiating explicit RPC failure from accidental one.

But if the definition of this metric is decided otherwise, ok.

@bergundy
Copy link
Member Author

Discussed offline, we'll add a label to this metric to help distinguish between the different error types in a followup PR.
I can do a fast follow next week.

@bergundy bergundy merged commit e503995 into temporalio:master Oct 14, 2024
13 checks passed
ReyOrtiz pushed a commit to ReyOrtiz/temporal-sdk-go that referenced this pull request Dec 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants