
[exporter/otlp] Retry RESOURCE_EXHAUSTED only if the server returns RetryInfo #5147

Merged

Conversation

svrakitin
Contributor

@svrakitin svrakitin commented Apr 4, 2022

Description:

This makes us retry on `ResourceExhausted` only if retry info is provided. It also makes the code a bit more explicit.

In my case this caused an issue in production where the upstream denied requests above `max_recv_msg_size` and we kept retrying.

Link to tracking Issue:
#1123
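
For context, a minimal sketch of what "retry only if retry info is provided" means in gRPC terms; the helper names are hypothetical and this is not the exporter's actual code:

```go
package sketch

import (
	"time"

	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// getRetryInfo returns the RetryInfo detail attached to a gRPC status,
// or nil if the server did not provide one.
func getRetryInfo(st *status.Status) *errdetails.RetryInfo {
	for _, detail := range st.Details() {
		if ri, ok := detail.(*errdetails.RetryInfo); ok {
			return ri
		}
	}
	return nil
}

// shouldRetryResourceExhausted treats RESOURCE_EXHAUSTED as transient only
// when the server signals backpressure via RetryInfo; otherwise it is
// considered permanent and the data is dropped rather than retried.
func shouldRetryResourceExhausted(err error) (bool, time.Duration) {
	st, ok := status.FromError(err)
	if !ok || st.Code() != codes.ResourceExhausted {
		return false, 0
	}
	if ri := getRetryInfo(st); ri != nil {
		return true, ri.GetRetryDelay().AsDuration()
	}
	return false, 0
}
```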

@svrakitin svrakitin requested review from a team and jpkrohling April 4, 2022 19:24
@codecov

codecov bot commented Apr 4, 2022

Codecov Report

Merging #5147 (ce65d16) into main (ef7754a) will increase coverage by 0.12%.
The diff coverage is 100.00%.

@@            Coverage Diff             @@
##             main    #5147      +/-   ##
==========================================
+ Coverage   90.52%   90.65%   +0.12%     
==========================================
  Files         187      187              
  Lines       11041    11052      +11     
==========================================
+ Hits         9995    10019      +24     
+ Misses        824      814      -10     
+ Partials      222      219       -3     
| Impacted Files | Coverage Δ |
|----------------|------------|
| exporter/otlpexporter/otlp.go | 90.72% <100.00%> (+16.30%) ⬆️ |


@Aneurysm9
Member

Can we instead change that case to determine whether the retry info has been provided in the response? The spec requires retry of transient errors and says that RESOURCE_EXHAUSTED SHOULD be treated as transient. I think it makes sense to treat it as transient if retry info is provided and permanent otherwise.

@svrakitin
Contributor Author

@Aneurysm9 Makes sense, gonna work on this.

@svrakitin svrakitin force-pushed the do-not-retry-resource-exhausted branch 2 times, most recently from afd826c to 02cc1e1 Compare April 5, 2022 15:42
@svrakitin svrakitin changed the title [exporter/otlp] Do not retry on ResourceExhausted error [exporter/otlp] Only retry on ResourceExhausted with retry info Apr 5, 2022
@svrakitin
Contributor Author

svrakitin commented Apr 5, 2022

@Aneurysm9 I updated the PR to separate "always-retryable" and "never-retryable" error codes. Does it make sense? Meanwhile I am going to write some tests for that, as those seem to be missing (which turned out to be tricky).
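
A minimal sketch of the "always-retryable" vs. "never-retryable" split described above, with `RESOURCE_EXHAUSTED` as the conditional case; the function name is hypothetical and the exporter's actual implementation may differ:

```go
package sketch

import (
	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
)

// shouldRetry classifies gRPC codes: always-retryable codes, the conditional
// RESOURCE_EXHAUSTED case, and everything else treated as permanent.
func shouldRetry(code codes.Code, retryInfo *errdetails.RetryInfo) bool {
	switch code {
	case codes.Canceled,
		codes.DeadlineExceeded,
		codes.Aborted,
		codes.OutOfRange,
		codes.Unavailable,
		codes.DataLoss:
		// Always retryable.
		return true
	case codes.ResourceExhausted:
		// Retryable only if the server told us when it is safe to retry.
		return retryInfo != nil
	default:
		// Never retryable.
		return false
	}
}
```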

@Aneurysm9
Member

I updated the PR to separate "always-retryable" and "never-retryable" error codes.

This seems like a good approach.

@bogdandrutu
Member

@tigrannajaryan

Member

@tigrannajaryan tigrannajaryan left a comment


I do not understand what this change does. A couple of questions:

  • Is the OTLP specification correct, or does something need to be changed there?
  • If the specification is correct, which part of it is currently violated by the implementation in this repo that this PR fixes?

(Blocking temporarily to avoid accidental merging until all is clarified).

@svrakitin
Contributor Author

svrakitin commented Apr 7, 2022

@tigrannajaryan This implementation effectively treats ResourceExhausted as transient (retryable) only if upstream signals backpressure. Otherwise, it will be treated as permanent.

It doesn't violate the spec, as this only modifies the client behavior to drop data immediately if `ResourceExhausted` can't be retried. I would suggest updating the spec to recommend that clients retry on `ResourceExhausted` only if the upstream signals backpressure.
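
For illustration, a hedged sketch of how a server can signal backpressure by attaching `RetryInfo` to a `RESOURCE_EXHAUSTED` status; the helper is hypothetical and not part of this PR:

```go
package sketch

import (
	"time"

	"google.golang.org/genproto/googleapis/rpc/errdetails"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	"google.golang.org/protobuf/types/known/durationpb"
)

// throttledError builds a RESOURCE_EXHAUSTED status carrying RetryInfo,
// i.e. a backpressure signal the client can honor by retrying after delay.
func throttledError(delay time.Duration) error {
	st := status.New(codes.ResourceExhausted, "server is throttling")
	st, err := st.WithDetails(&errdetails.RetryInfo{
		RetryDelay: durationpb.New(delay),
	})
	if err != nil {
		// WithDetails only fails if the detail cannot be marshaled;
		// fall back to a plain status without RetryInfo.
		return status.Error(codes.ResourceExhausted, "server is throttling")
	}
	return st.Err()
}
```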

@tigrannajaryan
Member

It doesn't violate the spec as this only modifies the client behavior to drop data immediately if ResourceExhausted can't be retried.

This does appear to contradict what the spec says:

The client SHOULD interpret gRPC status codes as retryable or not-retryable according to the following table:

|gRPC Code|Retryable?|
|---------|----------|
|RESOURCE_EXHAUSTED|Yes|

So, to make it clear, you are saying that RESOURCE_EXHAUSTED should not be an unconditional "Yes" in that table. It should be conditional:

  • If the response includes a RetryInfo then RESOURCE_EXHAUSTED is retryable.
  • If the response does not include a RetryInfo then RESOURCE_EXHAUSTED is not retryable.

Is this correct?

@svrakitin
Contributor Author

@tigrannajaryan Exactly.

@tigrannajaryan
Member

So, I think we need to update the spec to reflect this. I believe this can be considered a bug in the specification, so it is an allowed change.

@svrakitin
Contributor Author

@tigrannajaryan Should I prepare the change to the spec before proceeding with this?

@tigrannajaryan
Member

@tigrannajaryan Should I prepare the change to the spec before proceeding with this?

Yes, that would be great.

@svrakitin svrakitin force-pushed the do-not-retry-resource-exhausted branch from 02cc1e1 to fea9bf6 Compare April 8, 2022 16:54
@svrakitin svrakitin changed the title [exporter/otlp] Only retry on ResourceExhausted with retry info [exporter/otlp] Retry ResourceExhausted only if the server returns RetryInfo Apr 8, 2022
@svrakitin svrakitin changed the title [exporter/otlp] Retry ResourceExhausted only if the server returns RetryInfo [exporter/otlp] Retry RESOURCE_EXHAUSTED only if the server returns RetryInfo Apr 8, 2022
@svrakitin
Contributor Author

I made the code follow the new addition to the spec and added a test to cover the RetryInfo / no-RetryInfo cases.
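
A minimal sketch of what such a test can look like, reusing the hypothetical helpers from the earlier sketches; the test actually added in this PR may be structured differently:

```go
package sketch

import (
	"testing"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func TestResourceExhaustedRetryability(t *testing.T) {
	// With RetryInfo the error should be retried after the advertised delay.
	withRetryInfo := throttledError(5 * time.Second)
	if ok, delay := shouldRetryResourceExhausted(withRetryInfo); !ok || delay != 5*time.Second {
		t.Fatalf("expected retry with 5s delay, got ok=%v delay=%v", ok, delay)
	}

	// Without RetryInfo the error should be treated as permanent.
	withoutRetryInfo := status.Error(codes.ResourceExhausted, "request too large")
	if ok, _ := shouldRetryResourceExhausted(withoutRetryInfo); ok {
		t.Fatal("expected RESOURCE_EXHAUSTED without RetryInfo to be permanent")
	}
}
```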

Member

@tigrannajaryan tigrannajaryan left a comment


Much better now, thanks.
Let's wait for open-telemetry/opentelemetry-specification#2480 to be merged before merging this.

@tigrannajaryan tigrannajaryan dismissed their stale review April 8, 2022 17:30

Logic is clear now.

@svrakitin svrakitin force-pushed the do-not-retry-resource-exhausted branch from 4cc4819 to 3acf067 Compare April 12, 2022 10:26
@tigrannajaryan tigrannajaryan added the ready-to-merge Code review completed; ready to merge by maintainers label Apr 13, 2022
@tigrannajaryan
Member

@Aneurysm9 @bogdandrutu any further comments or good to merge?

@bogdandrutu
Member

@svrakitin please resolve conflicts, and I will merge

@bogdandrutu
Member

@svrakitin unit tests are failing

@svrakitin
Contributor Author

@bogdandrutu Yeah, I didn't notice that the tests now use go.uber.org/atomic. Updated the test.
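
For reference, the go.uber.org/atomic idiom the tests rely on, shown with a hypothetical mock receiver rather than the collector's actual test helper:

```go
package sketch

import "go.uber.org/atomic"

// mockReceiver counts export requests across goroutines; go.uber.org/atomic
// wraps the counter so tests can read it without a data race.
type mockReceiver struct {
	requestCount *atomic.Int32
}

func newMockReceiver() *mockReceiver {
	return &mockReceiver{requestCount: atomic.NewInt32(0)}
}

func (r *mockReceiver) handleRequest() {
	r.requestCount.Inc()
}

func (r *mockReceiver) requests() int32 {
	return r.requestCount.Load()
}
```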

@tigrannajaryan tigrannajaryan merged commit cdd11e7 into open-telemetry:main Apr 19, 2022
@svrakitin svrakitin deleted the do-not-retry-resource-exhausted branch April 19, 2022 18:41
Nicholaswang pushed a commit to Nicholaswang/opentelemetry-collector that referenced this pull request Jun 7, 2022
…etryInfo (open-telemetry#5147)

Nicholaswang added a commit to Nicholaswang/opentelemetry-collector that referenced this pull request Jun 7, 2022
Nicholaswang added a commit to Nicholaswang/opentelemetry-collector that referenced this pull request Jul 12, 2022