Fix nexus_task_execution_failed to include OperationError outcome in start requests #1664

bergundy · 2024-10-08T22:19:53Z

Why?

Consistency with other failure metrics.


Quinn-With-Two-Ns · 2024-10-08T22:29:41Z

Do we not have any tests for metrics emitted that need to be updated?

bergundy · 2024-10-08T23:00:48Z

> Do we not have any tests for metrics emitted that need to be updated?

No tests for metrics in this case. I guess I can add some.

cretz · 2024-10-09T12:58:54Z

internal/internal_nexus_task_poller.go

+	// Increment failure in all forms of errors:
+	// Internal error processing the task.
+	// Failure from user handler.
+	// Special case for the start response with operation error.


Is this a task failure if the process task call doesn't return a failure? I wouldn't expect failure to start things like a child workflow to increment this value.

This is when a handler returns an UnsuccessfulOperationError. A user would consider it a failure.
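The three cases listed in the diff comment can be sketched roughly as follows. This is a minimal illustration, not the SDK's code: `taskResult`, its fields, and `countsAsFailure` are hypothetical names that only model the outcomes discussed here.

```go
package main

import (
	"errors"
	"fmt"
)

// taskResult is a hypothetical stand-in for the outcome of processing one
// Nexus task. The SDK's real types are different.
type taskResult struct {
	internalErr    error // the SDK failed while processing the task
	handlerFailure error // the user handler returned a failure
	operationErr   error // a start response carrying an unsuccessful operation outcome
}

// countsAsFailure reports whether the failure counter should be incremented.
// The third condition models the special case this PR adds.
func countsAsFailure(r taskResult) bool {
	return r.internalErr != nil || r.handlerFailure != nil || r.operationErr != nil
}

// errExample is an illustrative error value used only in this sketch.
var errExample = errors.New("operation failed")

func main() {
	fmt.Println(countsAsFailure(taskResult{}))
	fmt.Println(countsAsFailure(taskResult{operationErr: errExample}))
}
```

In this model a start request whose response carries an operation error is counted even though neither of the other two fields is set.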

Hrmm, may need a definition for this metric. We have temporal_workflow_task_execution_failed which is actually a workflow task failure (a defined Temporal term) and we have temporal_activity_execution_failed which is an activity attempt/execution failure (we don't call this a task failure).

Is this expected to be a task failure (i.e. unexpected, non-user-returned/panic, failure to process the task) or is this an execution failure (i.e. expected, user-returned error)? I wonder if there should be two metrics, temporal_nexus_task_execution_failed and temporal_nexus_operation_execution_failed?

I'd be fine renaming the metric to something like temporal_nexus_operation_execution_failed. Not sure why we need two metrics if activity only needs one.

My issue with your suggestion is that there are other ways the execution could fail, e.g. by returning a non-retryable error in your handler, or when the workflow in a workflow run operation fails asynchronously.
For activities we don't distinguish between retryable and non-retryable errors, so I figured this model is good enough.
What we have now maps to whether a handler method returned an error or not (or there was an internal error within the SDK).

Definition is, it's incremented if any of the following are true:

- The user defined handler returns any error (or panics)
- There's an internal error in the SDK (e.g. bug) while processing the task
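As a rough sketch of the first condition, a wrapper can treat a returned error and a panic the same way for counting purposes. Everything here is an assumption made for illustration: `failedCount` is a plain in-memory integer rather than a real metrics counter, and `invoke` is a hypothetical helper, not an SDK function. An internal SDK error would increment the same counter but is not modeled.

```go
package main

import (
	"errors"
	"fmt"
)

// failedCount is a hypothetical in-memory stand-in for the failure counter.
var failedCount int

// errHandler is an illustrative error a user handler might return.
var errHandler = errors.New("handler error")

// invoke runs a handler, converts a panic into an error, and increments
// the counter whenever the handler ends with an error of either kind.
func invoke(handler func() error) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("handler panicked: %v", r)
		}
		if err != nil {
			failedCount++
		}
	}()
	return handler()
}

func main() {
	_ = invoke(func() error { return nil })        // success: not counted
	_ = invoke(func() error { return errHandler }) // returned error: counted
	_ = invoke(func() error { panic("boom") })     // panic: counted
	fmt.Println(failedCount)
}
```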

> The user defined handler returns any error (or panics)

Does this include UnsuccessfulOperationError and ErrOperationStillRunning errors on GetResult? Do we really want a metric that can't differentiate between a proper failure and an internal one? Or should we not count operation failures?

We haven't implemented GetResult, but I think we could exclude ErrOperationStillRunning. This change includes UnsuccessfulOperationError in the metric, which was not included before.

The rationale is that all of HandlerErrorTypeBadRequest, non-retryable ApplicationError, and UnsuccessfulOperationError are non-retryable errors from the caller perspective and essentially fail the operation.

But I agree that it's not perfect. It wasn't perfect before either.

I think the metrics should be split between user-returned operation errors and unexpected RPC failures, similar to workflows. People often just stop using activity failure metrics because they use intentional failures for some things, and then they can't even tell what panicked. They have to create their own metrics to differentiate. The difference is more obvious in other languages, where unexpected failures/exceptions are more common and which are used to differentiating an explicit RPC failure from an accidental one.

But if the definition of this metric is decided otherwise, ok.

bergundy · 2024-10-11T17:49:12Z

Discussed offline, we'll add a label to this metric to help distinguish between the different error types in a follow-up PR.
I can do a fast follow next week.

Fix nexus_task_execution_failed to include OperationError outcome in start requests

c260384

bergundy requested a review from a team as a code owner October 8, 2024 22:19

bergundy mentioned this pull request Oct 8, 2024

Add Nexus related SDK metrics temporalio/documentation#3134

Merged

Merge branch 'master' into nexus-fail-metric-fix

fcd60b2

Quinn-With-Two-Ns mentioned this pull request Oct 8, 2024

Change when nexus_task_execution_failed is emitted temporalio/sdk-java#2261

Merged

Add metrics tests

15b8106

bergundy enabled auto-merge (squash) October 9, 2024 00:37

cretz reviewed Oct 9, 2024


Quinn-With-Two-Ns approved these changes Oct 9, 2024


Merge branch 'master' into nexus-fail-metric-fix

5a03c9a

bergundy merged commit e503995 into temporalio:master Oct 14, 2024
13 checks passed

bergundy mentioned this pull request Oct 14, 2024

Add Nexus failure_reason metric tag #1671

Merged

ReyOrtiz pushed a commit to ReyOrtiz/temporal-sdk-go that referenced this pull request Dec 5, 2024

Fix nexus_task_execution_failed to include OperationError outcome in start requests (temporalio#1664)

5541e4f
