grpclb: keep RR Subchannel state in TRANSIENT_FAILURE until becoming READY #7816

Merged 1 commit on Jan 15, 2021

zhangkun83 (Contributor)

If all RR servers are unhealthy, at least one connection may be
CONNECTING at any given moment, which causes RR to stay in
CONNECTING. It's better to keep the TRANSIENT_FAILURE state in that
case so that fail-fast RPCs can fail fast.

The same changes have been made for RoundRobinLoadBalancer in #6657.
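
The rule in the description can be sketched in isolation. This is a minimal illustration, not the grpclb source: `StickyFailureTracker`, its nested `State` enum (a stand-in for `io.grpc.ConnectivityState`, with SHUTDOWN omitted), and `onSubchannelState` are hypothetical names. It assumes that "until becoming READY" means every non-READY report is masked once the subchannel has failed.

```java
/** Illustrative only: a hypothetical tracker, not the grpclb implementation. */
final class StickyFailureTracker {
  /** Stand-in for io.grpc.ConnectivityState (SHUTDOWN omitted for brevity). */
  enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

  // The state exposed to the aggregation logic, which may lag the reported one.
  private State effective = State.IDLE;

  /** Records a state reported by the subchannel and returns the state used for aggregation. */
  State onSubchannelState(State reported) {
    // Assumption: once failed, stay in TRANSIENT_FAILURE until the subchannel becomes READY.
    if (effective == State.TRANSIENT_FAILURE && reported != State.READY) {
      return effective;
    }
    effective = reported;
    return effective;
  }

  public static void main(String[] args) {
    StickyFailureTracker tracker = new StickyFailureTracker();
    System.out.println(tracker.onSubchannelState(State.CONNECTING));        // CONNECTING
    System.out.println(tracker.onSubchannelState(State.TRANSIENT_FAILURE)); // TRANSIENT_FAILURE
    System.out.println(tracker.onSubchannelState(State.CONNECTING));        // TRANSIENT_FAILURE (retry is masked)
    System.out.println(tracker.onSubchannelState(State.READY));             // READY
  }
}
```

With this rule, a backend that keeps cycling between CONNECTING and TRANSIENT_FAILURE contributes a steady TRANSIENT_FAILURE to the overall state, which is what lets fail-fast RPCs fail immediately.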

zhangkun83 requested a review from ejona86 on January 15, 2021 21:59
zhangkun83 merged commit 23d2796 into grpc:master on Jan 15, 2021
zhangkun83 deleted the grpclb-staytransientfailure branch on January 15, 2021 23:19
// Switch subchannel1 to TRANSIENT_FAILURE, making the general state TRANSIENT_FAILURE too.
Status error = Status.UNAVAILABLE.withDescription("error1");
deliverSubchannelState(subchannel1, ConnectivityStateInfo.forTransientFailure(error));
inOrder.verify(helper).updateBalancingState(eq(TRANSIENT_FAILURE), pickerCaptor.capture());

voidzcy (Contributor) commented on Mar 11, 2021

This test is weird. subchannel2 is still connecting, so the overall state should be CONNECTING. TRANSIENT_FAILURE is reported here mostly because subchannels in the CONNECTING state are ignored when aggregating the overall state.

So, observing TRANSIENT_FAILURE isn't related to your change, and I believe that aggregation behavior should be fixed. For this test case specifically, however, you should bring all subchannels into TRANSIENT_FAILURE first and then test the behaviors related to your change.
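
To make the point above concrete, here is a rough sketch of an aggregation that ignores CONNECTING subchannels, as the comment describes. It is an assumption-laden illustration rather than the actual grpclb code: `IgnoreConnectingAggregation`, its `State` enum, and `aggregate` are hypothetical names.

```java
/** Illustrative only: models the aggregation behavior described in the review comment. */
final class IgnoreConnectingAggregation {
  /** Stand-in for io.grpc.ConnectivityState (SHUTDOWN omitted for brevity). */
  enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

  /** READY if any subchannel is READY; otherwise TRANSIENT_FAILURE if any has failed. */
  static State aggregate(State... subchannels) {
    boolean sawFailure = false;
    for (State s : subchannels) {
      if (s == State.READY) {
        return State.READY;
      }
      if (s == State.TRANSIENT_FAILURE) {
        sawFailure = true;
      }
      // CONNECTING and IDLE subchannels do not influence the result here.
    }
    return sawFailure ? State.TRANSIENT_FAILURE : State.CONNECTING;
  }

  public static void main(String[] args) {
    // Mirrors the test scenario: subchannel1 has failed, subchannel2 is still connecting.
    System.out.println(aggregate(State.TRANSIENT_FAILURE, State.CONNECTING)); // TRANSIENT_FAILURE
  }
}
```

Under this model a single failed subchannel is enough to produce TRANSIENT_FAILURE, regardless of whether the sticky-state change is present, which is why the assertion in the test does not exercise the new behavior.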

Member

It looks like we should stop ignoring CONNECTING with this change (and use the same logic as RR and elsewhere).
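
One way to read "the same logic as RR" is an aggregation in which a CONNECTING or IDLE subchannel takes precedence over a failed one. The sketch below assumes that reading. `RoundRobinStyleAggregation`, its `State` enum, and `aggregate` are hypothetical names and are not taken from grpc-java.

```java
/** Illustrative only: an aggregation that no longer ignores CONNECTING subchannels. */
final class RoundRobinStyleAggregation {
  /** Stand-in for io.grpc.ConnectivityState (SHUTDOWN omitted for brevity). */
  enum State { IDLE, CONNECTING, READY, TRANSIENT_FAILURE }

  /** READY if any is READY; else CONNECTING if any is CONNECTING or IDLE; else TRANSIENT_FAILURE. */
  static State aggregate(State... subchannels) {
    boolean pending = false;
    for (State s : subchannels) {
      if (s == State.READY) {
        return State.READY;
      }
      if (s == State.CONNECTING || s == State.IDLE) {
        pending = true;
      }
    }
    return pending ? State.CONNECTING : State.TRANSIENT_FAILURE;
  }

  public static void main(String[] args) {
    // One failed and one still-connecting subchannel now aggregate to CONNECTING.
    System.out.println(aggregate(State.TRANSIENT_FAILURE, State.CONNECTING));        // CONNECTING
    // Only when every subchannel has failed is the overall state TRANSIENT_FAILURE.
    System.out.println(aggregate(State.TRANSIENT_FAILURE, State.TRANSIENT_FAILURE)); // TRANSIENT_FAILURE
  }
}
```

If each subchannel's state is also held at TRANSIENT_FAILURE until it becomes READY, as the PR description proposes, a set of backends that are all unhealthy still aggregates to TRANSIENT_FAILURE under this rule, while a backend that has never failed keeps the overall state at CONNECTING.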

github-actions bot locked as resolved and limited conversation to collaborators on Jun 10, 2021