Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[consumer] Allow annotating consumer errors with metadata #9041

Closed
wants to merge 45 commits into from
Closed
Show file tree
Hide file tree
Changes from 18 commits
Commits
Show all changes
45 commits
Select commit Hold shift + click to select a range
58704ec
Add new consumererror error types
evan-bradley Dec 5, 2023
a4f738d
impi
evan-bradley Jul 3, 2024
6a932c1
crosslink
evan-bradley Jul 3, 2024
2888dbc
tidy
evan-bradley Jul 3, 2024
4447c00
Merge branch 'main' into issue-7047
evan-bradley Jul 9, 2024
7ddbc0d
tidy
evan-bradley Jul 10, 2024
9e27e93
crosslink
evan-bradley Jul 10, 2024
f4216c5
Update consumer/consumererror/README.md
evan-bradley Jul 14, 2024
a1ca2e1
Address PR feedback
evan-bradley Jul 14, 2024
3fb5426
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Jul 24, 2024
3b80b56
Update design doc
evan-bradley Jul 24, 2024
fe1802b
Update design doc
evan-bradley Jul 25, 2024
ea959b6
Update design doc
evan-bradley Jul 25, 2024
fc40a8a
Add example
evan-bradley Jul 25, 2024
05d6fa9
Address PR feedback
evan-bradley Jul 26, 2024
d19438a
Include obsreport as an error consumer
evan-bradley Jul 26, 2024
d5f7a37
Draft implementation of error type
evan-bradley Jul 26, 2024
9fe7880
Update
evan-bradley Jul 26, 2024
53cbc05
Improve metadata
evan-bradley Jul 26, 2024
0d5af43
Add metadata to interface
evan-bradley Jul 26, 2024
2aa9e86
Address PR feedback
evan-bradley Jul 30, 2024
8c05443
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Jul 30, 2024
c8c40ef
Address PR feedback
evan-bradley Aug 5, 2024
c26e31a
Update consumer/consumererror/README.md
evan-bradley Aug 5, 2024
3aadc29
Address PR feedback
evan-bradley Aug 5, 2024
a0b8bfd
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Aug 5, 2024
6f85be2
Fix interface implementation
evan-bradley Aug 7, 2024
b55f661
go mod tidy
evan-bradley Aug 7, 2024
ef40c0d
Reorder imports
evan-bradley Aug 7, 2024
dba92c6
make crosslink
evan-bradley Aug 7, 2024
dd69f4a
Fix CI
evan-bradley Aug 7, 2024
7300bda
Polish and tests
evan-bradley Aug 8, 2024
e953920
Update changelog
evan-bradley Aug 8, 2024
a3fb82d
Minor cleanup
evan-bradley Aug 8, 2024
12ffaba
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Aug 8, 2024
b3cbde5
Update consumer/consumererror/error.go
evan-bradley Aug 9, 2024
1d1affc
Address PR feedback
evan-bradley Aug 9, 2024
e776426
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Aug 29, 2024
404ebd5
Combine ErrorContainer and Error into a single type
evan-bradley Aug 29, 2024
be9b41a
Address PR feedback
evan-bradley Aug 29, 2024
ebe59e6
Add more tests
evan-bradley Aug 30, 2024
0bf7ee0
Merge branch 'main' into issue-7047
evan-bradley Aug 30, 2024
951b019
Minor updates to readme
evan-bradley Aug 30, 2024
3a56990
Clarify example
evan-bradley Sep 3, 2024
55ea049
Merge remote-tracking branch 'upstream/main' into issue-7047
evan-bradley Sep 3, 2024
File filter

Filter by extension

Filter by extension


Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
25 changes: 25 additions & 0 deletions .chloggen/breaking-retryable-errors.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: breaking

# The name of the component, or a single word describing the area of concern, (e.g. otlpreceiver)
component: consumererror

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Add optional arguments to `NewTraces`, `NewMetrics`, and `NewLogs`

# One or more tracking issues or pull requests related to the change
issues: [7047]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext: This change has no impact unless you depend on the type signature of these functions.

# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: [api]
25 changes: 25 additions & 0 deletions .chloggen/issue-7047.yaml
Original file line number Diff line number Diff line change
@@ -0,0 +1,25 @@
# Use this changelog template to create an entry for release notes.

# One of 'breaking', 'deprecation', 'new_component', 'enhancement', 'bug_fix'
change_type: enhancement

# The name of the component, or a single word describing the area of concern, (e.g. otlpreceiver)
component: consumererror

# A brief description of the change. Surround your text with quotes ("") if it needs to start with a backtick (`).
note: Provide additional error types to allow annotating errors

# One or more tracking issues or pull requests related to the change
issues: [7047]

# (Optional) One or more lines of additional information to render under the primary note.
# These lines will be padded with 2 spaces and then inserted directly into the document.
# Use pipe (|) for multiline entries.
subtext: Allows putting things like status codes, retry information, etc. on consumer errors.

# Optional: The change log or logs in which this entry should be included.
# e.g. '[user]' or '[user, api]'
# Include 'user' if the change is relevant to end users.
# Include 'api' if there is a change to a library API.
# Default: '[user]'
change_logs: []
257 changes: 257 additions & 0 deletions consumer/consumererror/README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,257 @@
# Consumer errors

This package contains error types that should be returned by a consumer when an
error occurs while processing telemetry. The error types included in this
package provide functionality for communicating details about
the error for use upstream in the pipeline. Ideally the error returned by a
component in its `consume` function should be from this package.

## Error classes

**Retryable**: Errors are retryable if re-submitting data to a sink may result
in a successful submission.
Comment on lines +11 to +12
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to carry some informations:

  • [required] what to retry on;
  • [optional] retry interval?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a classification of error, I cover the details of what retryable errors contain here: https://github.com/open-telemetry/opentelemetry-collector/pull/9041/files#diff-e0fa8222784a2c0c2683b70e2b6a7ccf2c54b6477acd7e2518839e38060bdcf5R90.

We discussed this last week and determined that the caller can provide the data to retry, but we're going to focus on exactly what goes into a retryable error after this PR. Right now we're mainly focusing on the HTTP/gRPC errors.


**Permanent**: Errors are permanent if submission will always fail for the
current data. Errors are considered permanent unless they are explicitly marked
as retryable.
Comment on lines +14 to +16
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For permanent errors, we should probably always return the "number of not accepted" entries? Is that reasonable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to do that since the caller will already know how many items failed.


## Use cases

**Retry logic**: Errors should be allowed to include information necessary to
perform retries.

**Indicating partial success**: Errors can indicate that not all items were
accepted, for example as in an OTLP partial success message.
evan-bradley marked this conversation as resolved.
Show resolved Hide resolved

**Communicating network error codes**: Errors should allow embedding information
necessary for the Collector to act as a proxy for a backend, i.e. relay a status
code returned from a backend in a response to a system upstream from the
Collector.
Comment on lines +28 to +31
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @jpkrohling had a more detailed use-case here, but would be interesting to hear what problem we are solving. If I remember correctly was something that the collector should not retry and all retries to be done by the caller, but cannot remember exactly, so better to document in details here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can provide more examples if you like, but the simplest one is the one I described here: the Collector can act as a proxy and relay a code from a backend back to the caller. This works for retries too, if the code is e.g. an HTTP 429 status code.


## Current targets for using errors
evan-bradley marked this conversation as resolved.
Show resolved Hide resolved

**Receivers**: Receivers should be able to consume multiple errors downstream
and determine the best course of action based on the user's configuration. This
may entail either keeping the retry queue inside of the Collector by having the
receiver keep track of retries, or may involve having the caller manage the
retry queue by returning a retryable network code with relevant information.

**exporterhelper**: When an exporter returns a retryable error, the
exporterhelper can use this information to manage the sending queue if it is
enabled. Permanent errors will be forwarded back up the pipeline.
evan-bradley marked this conversation as resolved.
Show resolved Hide resolved

**obsreport**: Recording partial success information can ensure we correctly
track the number of failed telemetry records in the pipeline. Right now, all
records will be considered to be failed, which isn't accurate when partial
successes occur.

## Creating Errors

Errors can be created by calling `consumererror.New(err, opts...)` where `err`
is the underlying error, and `opts` is one of the provided options for supplying
additional metadata:

- `consumererror.WithRetry`
- `consumererror.WithPartial`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can a error be retry and partial? Are any combination possible?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll add a section about this. I think we have to assume that the user knows what they are doing, even if we likely consider it an error. The rule is that an error is retryable if it includes retry information, so in this case we have to assume that a component had partial success, but also supports retries on partial success, even if OTLP considers partial success permanent.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume is retryable, what data are you retrying? Based on my understanding "partial" only contains the number of failed entries, so you cannot retry because you don't know what failed.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You would have to retry the whole payload, and the backend would need to support filtering out the items that have already been accepted. From the perspective of a component upstream from where this error was created, we don't know anything about where the data will be sent and will just have to work with the available context.

I don't expect anyone to actually do this, but I also don't see any meaningful way for us to prevent it. The best we could likely do would be to swallow one of the options and log a warning on the console if we can determine a precedence for the errors, but I would still prefer to leave that to upstream components to deal with. I'd also prefer to keep things resilient and not return an error while creating the error.

Even if we break out the error cases into separate types, if you have to join anything with errors.Join, you're back in the same place where contradictory errors can be placed next to each other.

- `consumererror.WithGRPCStatus`
- `consumererror.WithHTTPStatus`
Comment on lines +61 to +62
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question, can these 2 be merged?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If a component uses both of these, the assumption is that it has its own HTTP<->gRPC translation, and wants to specify both statuses. The error will produce the given status if it has it, otherwise will convert from the status from the other transport.

- `consumererror.WithMetadata`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not seeing any use-case for this, why do we need it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for extensibility. I don't think we would add this until it was requested, but basically a custom exporter and receiver could communicate using a custom struct type by putting the struct inside the error with WithMetadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove it for now? We should problably ensure the model is extendable but not over-prescriptive for something we don't have yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If extensibility is an explicit goal, I'd like to keep this in the design doc. I've added warnings to things that don't actually exist right now.


All options can be combined, we assume that the component knows what it is doing
when seemingly conflicting options.

Two examples:

- `WithRetry` and `WithPartial` are included together: Partial successes are
considered permanent errors in OTLP, which conflicts with making an error
retryable by including `WithRetry`. However, per our definition of what makes
a permanent error, this error has been marked as retryable, and therefore we
assume the component producing this error supports retyable partial success
errors.
Comment on lines +84 to +89
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my previous question about this and what it means if I know that 5 out of 10 spans failed to send.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I know that 5/10 spans failed, then I can record that in obsreport, but will resubmit all 10; maybe the backend uses the span IDs to make span submission idempotent. I don't have a real use-case in mind for this, what I want to cover in this section is that we can use the rules established here to handle seemingly invalid error combinations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a real use-case in mind for this, what I want to cover in this section is that we can use the rules established here to handle seemingly invalid error combinations.

When in doubt leave it out. I want us to not add any API if we don't have a good use-case.

Copy link
Contributor Author

@evan-bradley evan-bradley Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This doesn't add any API, the point of this paragraph is just to explain how seemingly conflicting error combinations could be reconciled. WithRetry and WithPartial are intended to be used separately, and already have use cases in mind. We just needed to explain what happens if a component author attempts to use them together.

- `WithGRPCStatus` and `WithHTTPStatus` are included together: While the
component likely only sent data over one of these transports, our errors will
produce the given status if it is included on the error, otherwise it will
translate a status from the status for the other transport. If both of these
are given, we assume the component wanted to perform its own status
conversion, and we will simply give the status for the requested transport
without performing any conversion.

**Example**:

```go
consumererror.New(err,
consumererror.WithRetry(
consumerrerror.WithRetryDelay(10 * time.Second)
),
consumererror.WithGRPCStatus(codes.InvalidArgument),
)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This example must be invalid. codes.InvalidArgument should not be retryable.

In general, it still seems like all these options provide too much flexibility, making it confusing to use and easy to misuse.

For example, why GRPC and HTTP have to be options? Why can't we just have different constructors like consumererror.NewFromGRPC(grpc.Status)? The status should have the retry information that we can retrieve if applicable. If we expect exporters to add extra info on top of grpc error, they can manually parse the grpc status and create a collector "internal" error another way.

Copy link
Contributor Author

@evan-bradley evan-bradley Sep 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is an example of an intentionally unusual option combination since we describe above how combinations that are invalid in OTLP are handled. I'll move this and change it to a more normal combination so we're not highlighting anything we wouldn't recommend. It is worth noting that this is only non-retryable per the OTLP spec and other protocols could consider this a retryable code.

I still think options are the best path forward even if they allow states that are not valid in OTLP, but I'm open to exploring different approaches. A few questions on how we would transition from HTTP/gRPC options to constructors:

Why can't we just have different constructors like consumererror.NewFromGRPC(grpc.Status)? The status should have the retry information that we can retrieve if applicable.

Wouldn't this still be susceptible to the same issue, where an exporter creates a gRPC status with both codes.InvalidArgument and retryable information? Also, where would this information be given when using an HTTP constructor like consumer.NewFromHTTP?

If we expect exporters to add extra info on top of grpc error, they can manually parse the grpc status and create a collector "internal" error another way.

Could you give an example of how we would do this?

```

### Retrying data submission

*NOTE*: This section is a draft and will be developed separately from the rest
of this proposal.

If an error is transient, the `WithRetry` option corresponding to the relevant
signal should be used to indicate that the error is retryable and to pass on any
retry settings. These settings can come from the data sink or be determined by
the component, such as through runtime conditions or from user settings.

The data for the retry will be provided by the component performing the retry.
This will require all processing to be completely redone; in the future,
including data from the failed component so as to not retry this processing may
be made as an available option.

To ensure only the failed pipeline branch is retried, the sequence of components
that created the error will be recorded by a pipeline utility as the error goes
back up the pipeline.

**Note**: If retry information is not included in an error, the error will be
considered permanent and will not be retried.

**Usage:**

```go
consumererror.WithRetry(
consumerrerror.WithRetryDelay(10 * time.Second)
)
```

The delay is an optional setting that can be provided if it is available.

### Indicating partial success

If the component receives an OTLP partial success message (or other indication
of partial success), it should include this information with a count of the
failed records.

**Usage:**

```go
consumererror.WithPartial(failedRecords)
```

### Indicating error codes from network transports

If the failure occurred due to a network transaction, the exporter should record
the status code of the message received from the backend. This information can
be then relayed to the receiver caller if necessary. Note that when the upstream
component reads a code, it will read a code for its desired transport, and the
code may be translated depending whether the input and output transports are
different. For example, a gRPC exporter may record a gRPC status. If a gRPC
receiver reads this status, it will be exactly the provided status. If an HTTP
receiver reads the status, it wants an HTTP status, and the gRPC status will be
converted to an equivalent HTTP code.

**Usage:**

```go
consumererror.WithGRPCStatus(codes.InvalidArgument)
consumererror.WithHTTPStatus(http.StatusTooManyRequests)
```

### Including custom data

Custom data can be included as well for any additional information that needs to
be propagated back up the pipeline. It is up to the consuming component if or
how this data will be used.

**Usage:**

```go
consumererror.WithMetadata(MyMetadataStuct{})
```

To keep error analysis simple when looking at an error upstream in a pipeline,
the component closest to the source of an error or set of errors should make a
decision about the nature of the error. The following are a few places where
special considerations may need to be made.

## Using errors

### Fanouts

When a fanout receives multiple errors, it will combine them with
`(consumererror.Error).Combine(errs...)` and pass them upstream. The upstream
component can then later pull all errors out for analysis.

### Retrieving errors

When a receiver gets a response that includes an error, it can get the data out
by doing something similar to the following. Note that this uses the `ErrorData`
type, which is for reading error data, as opposed to the `Error` type, which is
for recording errors.

```go
cerr := consumererror.Error{}
var errData []consumerError.ErrorData

if errors.As(err, &cerr) {
errData := cerr.Data()

for _, data := range errData {
data.HTTPStatus()
data.Retryable()
data.Partial()
}
}
```

### Error data

Obtaining data from an error can be done using an interface that looks something like this:

```go
type ErrorData interface {
// Returns the underlying error
Error() error

// Returns nil if a status is not available.
HTTPStatus() *int

// Returns nil if a status is not available.
GRPCStatus() *status.Status

// Returns nil if the error contains no retry information.
Retryable() *Retryable

// Returns a count of failed records or nil if no partial success information was recorded.
Partial() *int
evan-bradley marked this conversation as resolved.
Show resolved Hide resolved
}

type Retryable struct {}

// Returns nil if no delay was set, indicating to use the default.
// This makes it so a delay of `0` indicates to resend immediately.
func (r *Retryable) Delay() *time.Duration {}
```

## Other considerations

### Mixed error classes

When a receiver sees a mixture of permanent and retryable errors from downstream
in the pipeline, it must first consider whether retries are enabled within the
Collector.

**Retries are enabled**: Ignore the permanent errors, retry data submission for
only components that indicated retryable errors.

**Retries are disabled**: In an asynchronous pipeline, simply do not retry any
data. In a synchronous pipeline, the receiver should return a permanent error
code indicating to the caller that it should not retry the data submission. This
is intended to not induce extra failures where we know the data submission will
fail, but this behavior could be made configurable by the user.

### Signal conversion

When converting between signals in a pipeline, it is expected that the connector
performing the conversion should perform the translation necessary in the error
for any signal item counts. If the converted count cannot be determined, the full
count of pre-converted signals should be returned.

### Asynchronous processing
djaglowski marked this conversation as resolved.
Show resolved Hide resolved

The use of any components that do asynchronous processing in a pipeline will cut
off error backpropagation at the asynchronous component. The asynchronous
component may communicate error information using the Collector's own signals.
Loading
Loading