Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[consumer] Allow annotating consumer errors with metadata #9041

Open
wants to merge 37 commits into
base: main
Choose a base branch
from

Conversation

evan-bradley
Copy link
Contributor

Description:

Revival of #7439

This explores one possible way to allow adding metadata to errors returned from consumers. The goal here is to allow transmitting more data back up the pipeline if there is an error at some stage, with the goal of it being used by an upstream component, e.g. a component that will retry data, or a receiver that will propagate an error code back to the sender.

The current design eliminates the permanent/retryable error types in favor of a single error type that supports adding signal data to be retried. If no data is added to be retried, the error is considered permanent. Currently there is no distinction made between the signals for the sake of simplicity, the caller should know what signal is used when retrieving the retryable items from the error. Any options for retrying the data (e.g. a delay) are offered as options when adding data to retry.

The error type currently supports a few general metadata fields that are copied when a downstream error is wrapped:

  • Partial successes can be expressed by setting the number of rejected items.
  • gRPC and HTTP status codes can be set and translated between if necessary.

Link to tracking Issue:

Resolves #7047

cc @dmitryax

consumer/consumererror/consumererror.go Outdated Show resolved Hide resolved
consumer/consumererror/consumererror.go Outdated Show resolved Hide resolved
consumer/consumererror/consumererror.go Outdated Show resolved Hide resolved
@jmacd
Copy link
Contributor

jmacd commented Jan 10, 2024

Happy to see this added. As discussed in #9260, there is a potential to propagate backwards the information contained in PartialSuccess responses from OTLP exports.

I worry about the code complexity introduced to have "success error" responses, meaning error != nil but the interpretation being success. However, this is what it will take to back-propagate partial successes, we want callers to see success with metadata about the number of rejected points if possible. Great to see this, thanks @evan-bradley.

@jmacd
Copy link
Contributor

jmacd commented Jan 10, 2024

As discussed in open-telemetry/oteps#238, it would be useful for setting the correct otel.outcome label, for callers to have access to the underlying gRPC and/or HTTP error code. Thanks!

consumer/consumererror/consumererror.go Outdated Show resolved Hide resolved
consumer/consumererror/consumererror.go Outdated Show resolved Hide resolved
consumer/consumererror/partial.go Outdated Show resolved Hide resolved
@mx-psi mx-psi added needed-for-1.0 release:required-for-ga Must be resolved before GA release and removed needed-for-1.0 labels Feb 7, 2024
Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Copy link
Contributor

This PR was marked stale due to lack of activity. It will be closed in 14 days.

Copy link

codecov bot commented Apr 11, 2024

Codecov Report

Attention: Patch coverage is 98.70130% with 1 line in your changes missing coverage. Please review.

Project coverage is 91.67%. Comparing base (ecbe02e) to head (1d1affc).

Files Patch % Lines
consumer/consumererror/error.go 98.03% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #9041      +/-   ##
==========================================
+ Coverage   91.64%   91.67%   +0.02%     
==========================================
  Files         406      408       +2     
  Lines       19001    19078      +77     
==========================================
+ Hits        17414    17490      +76     
- Misses       1227     1228       +1     
  Partials      360      360              

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@evan-bradley evan-bradley marked this pull request as ready for review April 11, 2024 19:31
@evan-bradley evan-bradley requested a review from a team as a code owner April 11, 2024 19:31
consumer/consumererror/statuserrors.go Outdated Show resolved Hide resolved
consumer/consumererror/partial.go Outdated Show resolved Hide resolved
consumer/consumererror/README.md Outdated Show resolved Hide resolved
Comment on lines 49 to 50
- `consumererror.WithRetry`
- `consumererror.WithPartial`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Let's assume is retryable, what data are you retrying? Based on my understanding "partial" only contains the number of failed entries, so you cannot retry because you don't know what failed.

Comment on lines +11 to +12
**Retryable**: Errors are retryable if re-submitting data to a sink may result
in a successful submission.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This needs to carry some informations:

  • [required] what to retry on;
  • [optional] retry interval?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just a classification of error, I cover the details of what retryable errors contain here: https://github.com/open-telemetry/opentelemetry-collector/pull/9041/files#diff-e0fa8222784a2c0c2683b70e2b6a7ccf2c54b6477acd7e2518839e38060bdcf5R90.

We discussed this last week and determined that the caller can provide the data to retry, but we're going to focus on exactly what goes into a retryable error after this PR. Right now we're mainly focusing on the HTTP/gRPC errors.

Comment on lines +14 to +16
**Permanent**: Errors are permanent if submission will always fail for the
current data. Errors are considered permanent unless they are explicitly marked
as retryable.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For permanent errors, we should probably always return the "number of not accepted" entries? Is that reasonable?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we need to do that since the caller will already know how many items failed.

Comment on lines +26 to +29
**Communicating network error codes**: Errors should allow embedding information
necessary for the Collector to act as a proxy for a backend, i.e. relay a status
code returned from a backend in a response to a system upstream from the
Collector.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think @jpkrohling had a more detailed use-case here, but would be interesting to hear what problem we are solving. If I remember correctly was something that the collector should not retry and all retries to be done by the caller, but cannot remember exactly, so better to document in details here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can provide more examples if you like, but the simplest one is the one I described here: the Collector can act as a proxy and relay a code from a backend back to the caller. This works for retries too, if the code is e.g. an HTTP 429 status code.

consumer/consumererror/README.md Outdated Show resolved Hide resolved
consumer/consumererror/README.md Show resolved Hide resolved
- `consumererror.WithPartial`
- `consumererror.WithGRPCStatus`
- `consumererror.WithHTTPStatus`
- `consumererror.WithMetadata`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not seeing any use-case for this, why do we need it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is for extensibility. I don't think we would add this until it was requested, but basically a custom exporter and receiver could communicate using a custom struct type by putting the struct inside the error with WithMetadata.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove it for now? We should problably ensure the model is extendable but not over-prescriptive for something we don't have yet

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If extensibility is an explicit goal, I'd like to keep this in the design doc. I've added warnings to things that don't actually exist right now.

Comment on lines +65 to +70
- `WithRetry` and `WithPartial` are included together: Partial successes are
considered permanent errors in OTLP, which conflicts with making an error
retryable by including `WithRetry`. However, per our definition of what makes
a permanent error, this error has been marked as retryable, and therefore we
assume the component producing this error supports retyable partial success
errors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See my previous question about this and what it means if I know that 5 out of 10 spans failed to send.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If I know that 5/10 spans failed, then I can record that in obsreport, but will resubmit all 10; maybe the backend uses the span IDs to make span submission idempotent. I don't have a real use-case in mind for this, what I want to cover in this section is that we can use the rules established here to handle seemingly invalid error combinations.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a real use-case in mind for this, what I want to cover in this section is that we can use the rules established here to handle seemingly invalid error combinations.

When in doubt leave it out. I want us to not add any API if we don't have a good use-case.

Copy link
Contributor Author

@evan-bradley evan-bradley Aug 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Agreed. This doesn't add any API, the point of this paragraph is just to explain how seemingly conflicting error combinations could be reconciled. WithRetry and WithPartial are intended to be used separately, and already have use cases in mind. We just needed to explain what happens if a component author attempts to use them together.

Copy link
Member

@dmitryax dmitryax left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we briefly define a transition plan in the doc?

consumer/consumererror/README.md Outdated Show resolved Hide resolved
consumer/consumererror/README.md Outdated Show resolved Hide resolved
- `consumererror.WithPartial`
- `consumererror.WithGRPCStatus`
- `consumererror.WithHTTPStatus`
- `consumererror.WithMetadata`
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe remove it for now? We should problably ensure the model is extendable but not over-prescriptive for something we don't have yet

consumer/consumererror/error.go Outdated Show resolved Hide resolved
consumer/consumererror/error.go Outdated Show resolved Hide resolved
consumer/consumererror/error.go Outdated Show resolved Hide resolved
consumer/consumererror/error.go Outdated Show resolved Hide resolved
@evan-bradley
Copy link
Contributor Author

Should we briefly define a transition plan in the doc?

I'm okay with that. I've put it in a section at the bottom for now.

@evan-bradley
Copy link
Contributor Author

@dmitryax @bogdandrutu This is ready for another round of reviews. The scope should be fairly minimal now, we can cover additional functionality in follow-ups.

I will be on PTO starting next week, so it would be nice if we could get this in this week.

Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall feedback for me is that we just want to add more "things". I don't want to block this, but I also feel that we should make sure we add only the things we are 100% sure about it.

Comment on lines +65 to +70
- `WithRetry` and `WithPartial` are included together: Partial successes are
considered permanent errors in OTLP, which conflicts with making an error
retryable by including `WithRetry`. However, per our definition of what makes
a permanent error, this error has been marked as retryable, and therefore we
assume the component producing this error supports retyable partial success
errors.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have a real use-case in mind for this, what I want to cover in this section is that we can use the rules established here to handle seemingly invalid error combinations.

When in doubt leave it out. I want us to not add any API if we don't have a good use-case.

Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall:

  1. I really like the MultiErr (current Error) approach and the APIs related to that. It is a nice way to encapsulate fanout errors.
  2. I am not sure if the ErrorData is the right thing though. Is this only for "Transport" errors? I would not return that bool but instead have a enum or a way to say "IsHttp" or "isGrpc" then have "AsHttp" or "AsGrpc" name to be discussed.

// out by an upstream component by calling `Error.Data`.
//
// Experimental: This API is at the early stage of development and may change without backward compatibility
type ErrorData interface {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the point of making it interface vs struct?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The goal is that component authors can't instantiate it themselves. If we're not concerned about that, I don't see any issues making it a struct.

consumer/consumererror/error.go Outdated Show resolved Hide resolved
}

// ErrorOption allows annotating an Error with metadata.
type ErrorOption func(error *errorData)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not recommended to have a public type that wraps a private type. I think the best way to do this is to use the interface pattern, see

type FactoryOption interface {

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I like this a bit better, thanks. Implemented.

errors []ErrorData
}

// ErrorData is intended to be used to encapsulate various information
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would call this "Error" and the other "Errors"(or ErrorSlice)?

Maybe "MultiErr" for the container? Not sure, but throwing ideas.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've renamed them to ErrorContainer (for the type that accumulates errors as they move upstream) and Error for the actual error with the data. What do you think?

// Data returns all the accumulated ErrorData errors
// emitted by components downstream in the pipeline.
// These can then be aggregated or worked with individually.
func (e *Error) Data() []ErrorData {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is wrong I think, we should not return the internal slice since the caller can modify it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From what I can tell, this creates a new slice instance where modifications don't affect the original slice.

I added a test here that seems to show this isn't an issue, is there anything else I should check for to ensure we don't pass out a mutable copy of the slice? https://github.com/open-telemetry/opentelemetry-collector/pull/9041/files#diff-8c40d97e3ff1a76b10c0c2b02a2cea669f1ffdffc2363cfdf86ba8489cb9f59cR92

evan-bradley and others added 2 commits August 9, 2024 08:05
Co-authored-by: Bogdan Drutu <bogdandrutu@gmail.com>
@evan-bradley
Copy link
Contributor Author

evan-bradley commented Aug 9, 2024

Overall feedback for me is that we just want to add more "things". I don't want to block this, but I also feel that we should make sure we add only the things we are 100% sure about it.

Right now we're only adding HTTP/gRPC status code propagation, everything else is marked as unimplemented and still in the design stage. Do you have concerns with HTTP/gRPC status codes, or the way they're communicated?

I really like the MultiErr (current Error) approach and the APIs related to that. It is a nice way to encapsulate fanout errors.

I played with that, but the issue I encountered is that we may want to read these error types at the receiver level, when there may be multiple fanouts in the pipeline. So you could see something like the following:

I don't see how we pull out the joined errors without traversing the whole error tree by repeatedly calling Unwrap. Having a container type to keep track of all the errors just seems simpler.

I see now, I mistook what you said for the Uber multierr package. Glad you like it.

I am not sure if the ErrorData is the right thing though. Is this only for "Transport" errors? I would not return that bool but instead have a enum or a way to say "IsHttp" or "isGrpc" then have "AsHttp" or "AsGrpc" name to be discussed.

This is meant to hold information related to anything the component would like to communicate upstream. Right now we're starting with HTTP and gRPC status codes, but we would like to add retry information, partial success information, etc. in the future. The goal is to start small and iterate as we design the additional options.

@bogdandrutu
Copy link
Member

This is meant to hold information related to anything the component would like to communicate upstream. Right now we're starting with HTTP and gRPC status codes, but we would like to add retry information, partial success information, etc. in the future. The goal is to start small and iterate as we design the additional options.

Can you please respond to the rest of the comments and see how the design of the "ErrorData" would look like?

@evan-bradley
Copy link
Contributor Author

Can you please respond to the rest of the comments and see how the design of the "ErrorData" would look like?

Sure thing. The other questions were based on the first question, so I wanted to make sure we were on the same page there if possible.

I am not sure if the ErrorData is the right thing though. Is this only for "Transport" errors? I would not return that bool but instead have a enum or a way to say "IsHttp" or "isGrpc" then have "AsHttp" or "AsGrpc" name to be discussed.

  1. Enum: We could do a list of enums, e.g. [IS_RETRYABLE, IS_GRPC], then expose this with a method like Types() []Type. Since an error can contain multiple error types this would have to be a list instead of just a single enum value.
  2. IsHTTP + AsHTTP: This essentially conveys the same information as HTTPStatus() (code int, isHTTP bool), so from a functionality standpoint it's not an issue. I would rename IsHTTP to HasHTTP though, since errors can contain multiple types of data. I'm not sure I see the benefit of retrieving this information across two methods, though, what's wrong with the boolean?

@bogdandrutu
Copy link
Member

bogdandrutu commented Aug 11, 2024

Enum: We could do a list of enums, e.g. [IS_RETRYABLE, IS_GRPC], then expose this with a method like Types() []Type. Since an error can contain multiple error types this would have to be a list instead of just a single enum value.
IsHTTP + AsHTTP: This essentially conveys the same information as HTTPStatus() (code int, isHTTP bool), so from a functionality standpoint it's not an issue. I would rename IsHTTP to HasHTTP though, since errors can contain multiple types of data. I'm not sure I see the benefit of retrieving this information across two methods, though, what's wrong with the boolean?

Before answering any of these questions, need to better understand how "correctly" (recommended, or whatever is the right term) chained consumers should error? For me to know the "right" design, I need to understand better the usage. Here are some examples, and we can debate on them what is the correct way (eventually document this). For these examples assume to use logs:

First Example

A -> B (drops logs with invalid trace identifiers) -> C -> D -> E (exports to Loki)

For a request with 10 logs:

  • B finds 2 "logs" that are not valid, for example "trace.id" is 34 hex characters.
  • E tries to export the logs and 2 out of the 8 remaining (assuming B dropped the 2 invalid logs) are reject with 400 http request.

Can you please tell me exactly which component should propagate/append errors and how?

Second Exmple

                       / -> E - > F
        / ->  C -> D -
A -> B -               \ -> G -> H
        \
         \ -> I -> J -> K

For a request with 10 logs:

  • F fails to send 4 logs with 429 http request.
  • H fails to send 2 logs with 400 http request.
  • D drops 2 logs because of invalid trace identifiers (trace.id too large).
  • K fails to send 1 logs with INVALID_ARGUMENT grpc status code.

Can you please tell me exactly which component should propagate/append errors and how?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release:required-for-ga Must be resolved before GA release
Projects
Status: Waiting for reviews
Development

Successfully merging this pull request may close these issues.

Investigate how to expose exporterhelper.NewThrottleRetry in the consumererror
10 participants