feat: tolerate immediately recoverable stream faults, improve logging #1019

guidobrei · 2024-10-10T19:35:20Z

This PR

Improves error logging for the in-process resolver in remote mode.
Harmonizes the backoff implementations across the different gRPC handlers.
Uses FlagdOptions.getRetryBackoffMs() to initialize the backoff in all backoff scenarios. GrpcStreamConnector previously used a hardcoded value of 2 seconds.
Reconnects immediately on the first stream error in GrpcStreamConnector. This removes the backoff that previously applied when a planned deadline was exceeded and the connector reconnected.
Unifies the standard max jitter of 250 ms across all backoff use cases.
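
For illustration, the unified jitter could look roughly like the sketch below. This is not the code from this PR: the class and method names are hypothetical, and adding the jitter on top of the base delay is an assumption. Only the 250 ms cap comes from the description above.

```java
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical helper for illustration; not part of the flagd provider API.
final class JitterSketch {
    // Unified max jitter named in the PR description.
    static final long MAX_JITTER_MS = 250;

    /** Returns baseDelayMs plus a random jitter in the range [0, maxJitterMs]. */
    static long withJitter(long baseDelayMs, long maxJitterMs) {
        if (baseDelayMs < 0 || maxJitterMs < 0) {
            throw new IllegalArgumentException("delays must not be negative");
        }
        // The bound of nextLong is exclusive, hence the + 1 to include maxJitterMs.
        return baseDelayMs + ThreadLocalRandom.current().nextLong(maxJitterMs + 1);
    }
}
```

Calling `JitterSketch.withJitter(1000, JitterSketch.MAX_JITTER_MS)` would yield a delay between 1000 ms and 1250 ms, so clients that fail at the same moment do not all reconnect at exactly the same time.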

Notes

Unlike #1010, error logs are not deferred until the max retry delay is reached; they are written as early as the second consecutive error.
With exponential backoff starting at 2 seconds, waiting for the max retry delay (120 seconds) would mean that 126 seconds pass (2 + 4 + 8 + 16 + 32 + 64) before the first error becomes visible.
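
The 126-second figure can be checked with a few lines of arithmetic. The sketch below is illustrative only: the names are hypothetical, and it assumes plain doubling without jitter, summing the delays that elapse before a delay would reach the cap.

```java
// Hypothetical helper for illustration; not part of the flagd provider.
final class BackoffArithmeticSketch {
    /**
     * Sums the doubling delays that elapse before the next delay would
     * reach or exceed maxSeconds.
     */
    static long secondsUntilMaxDelay(long initialSeconds, long maxSeconds) {
        if (initialSeconds <= 0) {
            throw new IllegalArgumentException("initialSeconds must be positive");
        }
        long elapsed = 0;
        long delay = initialSeconds;
        while (delay < maxSeconds) {
            elapsed += delay; // wait out this delay
            delay *= 2;       // exponential growth
        }
        return elapsed;
    }
}
```

`BackoffArithmeticSketch.secondsUntilMaxDelay(2, 120)` returns 126: the delays 2, 4, 8, 16, 32 and 64 elapse, and the next delay (128) would be capped at 120.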

Instead, an error log is generated whenever an error payload is emitted to the queue. On the first error only, we try to reconnect immediately without any backoff (only the default jitter, 250 ms max) and without emitting an error payload. From the second consecutive error onward, we log an error and emit the error payload.
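
The rule above can be sketched as a small counter. This is a minimal illustration, not the connector code: the class and method names are hypothetical, and resetting the counter once the stream recovers is an assumption derived from the phrase "in a row".

```java
// Hypothetical sketch of the "tolerate the first error" rule; not the PR's implementation.
final class StreamErrorPolicySketch {
    private int consecutiveErrors = 0;

    /**
     * Call when the stream fails. Returns true if the error should be
     * logged and an error payload emitted, false if the connector should
     * just reconnect silently.
     */
    boolean onStreamError() {
        consecutiveErrors++;
        // First error in a row: silent, immediate retry. Second and later: report.
        return consecutiveErrors >= 2;
    }

    /** Call when the stream works again, so the next fault counts as a first error. */
    void onStreamRecovered() {
        consecutiveErrors = 0;
    }
}
```

With this shape, a planned reconnect such as an exceeded deadline produces a single tolerated error, while a real outage is reported on the second failed attempt.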

The initial backoff is now FlagdOptions.getRetryBackoffMs() in both GrpcStreamConnector (new) and GrpcConnector (no change).
For GrpcStreamConnector this means an initial backoff of 1 second (the default option) instead of 2 seconds.

I've also removed the special handling of `DEADLINE_EXCEEDED` errors, as the connector now tries to reconnect silently on any first error. This also solves `DEADLINE_EXCEEDED` issues related to Envoy, where a wrong gRPC status code is reported. See here

With the immediate first retry, the backoff times for GrpcStreamConnector are now:

0s
1s
2s
4s
8s
...
120s
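
The schedule listed above can be reproduced with a short sketch. It is illustrative only and assumes a 1000 ms initial backoff (the default mentioned above), a 120000 ms cap, plain doubling, and no jitter. The class and method names are hypothetical and do not come from the PR.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical schedule generator for illustration; not the PR's BackoffService.
final class BackoffScheduleSketch {
    /**
     * Returns the delay (in ms) before each of the first `attempts` reconnects:
     * 0 for the first one, then initialMs doubling up to maxMs.
     */
    static List<Long> delaysMs(long initialMs, long maxMs, int attempts) {
        List<Long> delays = new ArrayList<>();
        long next = initialMs;
        for (int i = 0; i < attempts; i++) {
            if (i == 0) {
                delays.add(0L); // first error: reconnect immediately
                continue;
            }
            delays.add(Math.min(next, maxMs));
            next = Math.min(next * 2, maxMs); // double, but never exceed the cap
        }
        return delays;
    }
}
```

For example, `BackoffScheduleSketch.delaysMs(1000, 120000, 10)` yields 0, 1000, 2000, 4000, 8000, 16000, 32000, 64000, 120000, 120000, which matches the list above once the values are read in seconds.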

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>

toddbaert · 2024-10-11T02:46:35Z

Wow... so this is illuminating. Apparently, deadlines can impact the reconnect logic of the underlying CONNECTION (not stream). gRPC has an underlying connection retry mechanism independent of the stream reconnect. It seems that, at least in grpc-java, our deadline, together with the fact that our testbed is now down for 5 seconds (instead of just for an instant), has had an impact on that.

I'm convinced at this point that our deadline changes, together with bringing flagd down for 5s (see resolved comment above) instead of just for a second, are what broke our e2e tests. The simple solution was to change the cadence of the "unstable" containers: instead of being down for 5s and up for 5s, they now go down for 5s and stay up for 20s. Considering the connection backoff algorithm linked above, this noticeably improves the reliability of the test. Before this change I couldn't make the test pass; now it passes consistently. The downside is that the tests can take longer if we are unlucky, so I've updated the reconnect tests to wait a max of 240s (though they generally complete well before that). See my changes.

toddbaert · 2024-10-11T03:04:47Z

Could you inject your backoff stuff into EventStreamObserver.java (for the RPC resolver) and take advantage of the same logging/eventing rules there too, for consistency (meaning immediately and silently retry one time)? And could you remove the DEADLINE_EXCEEDED exception logic there as well? https://github.com/guidobrei/open-feature-java-sdk-contrib/blob/801e5d0e4ee3dc8adaf0cbb79878279c996899f6/providers/flagd/src/main/java/dev/openfeature/contrib/providers/flagd/resolver/grpc/EventStreamObserver.java#L58-L68

toddbaert

Approved, but consider: #1019 (comment)

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

toddbaert · 2024-10-12T23:42:47Z

@guidobrei I updated this by rebasing on main, sorry!

git reset --hard HEAD~10 && git pull should sync your local branch back up with this.

toddbaert

Great work! Nice to see this improvement. I will soon use this as the basis for some new specifications in the flagd provider specs.

guidobrei added 4 commits October 10, 2024 08:50

feat(flagd): Log stream errors and metadata errors separately

6ac6414

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>

feat(flagd): Add reusable backoff strategies

e42f663

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>

feat(flagd): Use backoff service in GrpcConnector

786f489

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>

feat(flagd): Small refactorings

18807cf

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>

guidobrei requested a review from a team as a code owner October 10, 2024 19:35

github-actions bot assigned beeme1mr, Kavindu-Dodan, thisthat and toddbaert Oct 10, 2024

github-actions bot requested review from beeme1mr, Kavindu-Dodan, thisthat and toddbaert October 10, 2024 19:35

toddbaert force-pushed the feat/1010-improve-flagd-error-logging branch 2 times, most recently from f6ebe2f to 276644c Compare October 11, 2024 02:57

toddbaert reviewed Oct 11, 2024

...ain/java/dev/openfeature/contrib/providers/flagd/resolver/common/backoff/BackoffService.java

toddbaert reviewed Oct 11, 2024

...ain/java/dev/openfeature/contrib/providers/flagd/resolver/common/backoff/BackoffService.java

toddbaert reviewed Oct 11, 2024

...in/java/dev/openfeature/contrib/providers/flagd/resolver/common/backoff/CombinedBackoff.java

guidobrei added 4 commits October 10, 2024 23:02

feat(flagd): Log stream errors and metadata errors separately

9858567

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>

feat(flagd): Add reusable backoff strategies

923932b

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>

feat(flagd): Use backoff service in GrpcConnector

77760bb

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>

feat(flagd): Small refactorings

9c4ab05

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>

toddbaert force-pushed the feat/1010-improve-flagd-error-logging branch from 276644c to 801e5d0 Compare October 11, 2024 03:02

toddbaert self-requested a review October 11, 2024 03:03

toddbaert approved these changes Oct 11, 2024

fixup: reconnect tests

6d9b7f8

Signed-off-by: Todd Baert <todd.baert@dynatrace.com>

toddbaert force-pushed the feat/1010-improve-flagd-error-logging branch from 801e5d0 to 6d9b7f8 Compare October 11, 2024 14:56

guidobrei added 2 commits October 14, 2024 17:45

Merge branch 'feat/1010-improve-flagd-error-logging' of https://githu…

05cd1fe

…b.com/guidobrei/open-feature-java-sdk-contrib into feat/1010-improve-flagd-error-logging

feat(flagd): Use GrpcStreamConnectorBackoffService in GrpcConnector

396db2b

Signed-off-by: Guido Breitenhuber <guido.breitenhuber@dynatrace.com>

guidobrei commented Oct 15, 2024

...ature/contrib/providers/flagd/resolver/common/backoff/GrpcStreamConnectorBackoffService.java

Merge branch 'main' into feat/1010-improve-flagd-error-logging

6eaa5fc

aepfli approved these changes Oct 15, 2024

toddbaert self-requested a review October 15, 2024 17:56

toddbaert approved these changes Oct 15, 2024

toddbaert changed the title ~~feat(flagd): Improve flagd retry logic and error logging~~ feat: tolerate "instantly recoverable" stream faults, improve logging Oct 15, 2024

toddbaert changed the title ~~feat: tolerate "instantly recoverable" stream faults, improve logging~~ feat: tolerate immediately recoverable stream faults, improve logging Oct 15, 2024

toddbaert merged commit 3110076 into open-feature:main Oct 15, 2024
6 checks passed

github-actions bot mentioned this pull request Oct 15, 2024

chore(main): release dev.openfeature.contrib.providers.flagd 0.9.1 #1001

Merged
