Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: tolerate immediately recoverable stream faults, improve logging #1019

Merged

Conversation

guidobrei
Copy link
Member

This PR

  • Improves error logging for in process resolver with remote mode.
  • Harmonizes backoff implementations across different gRPC handlers.
  • Uses FlagdOptions.getRetryBackoffMs() to initialize the backoff in all Backoff scenarios. GrpcStreamConnector previously used a hardcoded value of 2 seconds.
  • Immediately reconnect on first stream error in GrpcStreamConnector. This removes a backoff when a planned deadline exceeds and the connector reconnects.
  • Unified standard max jitter of 250ms for all backoff use-cases

Fixes #1010

Notes

Different to #1010, error logs are not written when the max retry delay is reached, but already at the second error in a row.
Waiting for max retry delay (120 seconds) with exponential backoff starting with 2 seconds would require 126 seconds until the first error gets visible.

Instead, error logs are generated whenever an error queue payload is emitted. Only on the first error we try to reconnect immediately without any backoff (only with default jitter 250ms max) and without emitting an error payload. Starting with the second error in a row we log an error and emit the error payload.

The initial Backoff is now FlagdOptions.getRetryBackoffMs() in GrpcStreamConnector (new) and GrpcConnector (no change).
For the GrpcStreamConnector this means an initial Backoff of 1 sec (default option) instead of 2 secs.

I've also removed the special handling of DEADLINE_EXCEEDED' errors, as the connector now tries to reconnect silently on any first error. This also solves DEADLINE_EXCEEDED` issues related to Envoy, where a wrong gRPC status code is reported. See here

With the first immediate retry the new Backoff times for GrpcStreamConnector are now:

  • 0s
  • 1s
  • 2s
  • 4s
  • 8s
  • ...
  • 120s

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>
Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>
Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>
Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>
@toddbaert

This comment was marked as resolved.

@toddbaert

This comment was marked as resolved.

@toddbaert
Copy link
Member

toddbaert commented Oct 11, 2024

Wow... so this is illuminating. Apparently, deadlines can impact the reconnect logic of the underlying CONNECTION (not stream). gRPC has an underlying connection retry mechanism independent of the stream reconnect. I seems at least in grpc-java, our deadline, as well as the fact our testbed is now down for 5 seconds (instead of just for an instant), has had an impact on that.

I'm convinced at this point that the changes in our deadlines and to bring flagd down for 5s (see resolved comment above) instead of just for a second are what broke our e2e tests. The simple solution was to change the cadence of the "unstable" containers... instead of being down for 5s and up for 5s, we go down for 5s and stay up for 20s... considering the connection backoff algorithm linked above this sensibly improves the reliability of the test; before this change I couldn't make the test pass; now it passes consistently; the downside is the tests can take longer if we are unlucky - for that reason I've updated the reconnect tests to wait a max of 240s (though they generally complete much before that). See my changes.

@toddbaert toddbaert force-pushed the feat/1010-improve-flagd-error-logging branch 2 times, most recently from f6ebe2f to 276644c Compare October 11, 2024 02:57
Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>
Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>
Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>
Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>
@toddbaert toddbaert force-pushed the feat/1010-improve-flagd-error-logging branch from 276644c to 801e5d0 Compare October 11, 2024 03:02
@toddbaert toddbaert self-requested a review October 11, 2024 03:03
@toddbaert
Copy link
Member

toddbaert commented Oct 11, 2024

Could you inject your backoff stuff into the EventStreamObserver.java (for the RPC resolver) and take advantage of the same logging/eventing rules there too, for consistency (meaning immediately and silently retry 1 time)? And could you remove the DEADLINE_EXCEEDING exception logic there as well? https://github.com/guidobrei/open-feature-java-sdk-contrib/blob/801e5d0e4ee3dc8adaf0cbb79878279c996899f6/providers/flagd/src/main/java/dev/openfeature/contrib/providers/flagd/resolver/grpc/EventStreamObserver.java#L58-L68

Copy link
Member

@toddbaert toddbaert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Approved, but consider: #1019 (comment)

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>
@toddbaert toddbaert force-pushed the feat/1010-improve-flagd-error-logging branch from 801e5d0 to 6d9b7f8 Compare October 11, 2024 14:56
@toddbaert
Copy link
Member

@guidobrei I updated this by rebasing on main, sorry!

git reset --hard HEAD~10 && git pull should sync your local branch back up with this.

@toddbaert toddbaert self-requested a review October 15, 2024 17:56
Copy link
Member

@toddbaert toddbaert left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work! Nice to see this improvement. I will soon use this as the basis for some new specifications in the flagd provider specs.

@toddbaert toddbaert changed the title feat(flagd): Improve flagd retry logic and error logging feat: tolerate "instantly recoverable" stream faults, improve logging Oct 15, 2024
@toddbaert toddbaert changed the title feat: tolerate "instantly recoverable" stream faults, improve logging feat: tolerate immediately recoverable stream faults, improve logging Oct 15, 2024
@toddbaert toddbaert merged commit 3110076 into open-feature:main Oct 15, 2024
6 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[flagd] improve error logging for reconnect scenarios
6 participants