Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

PickFirstLeafLoadBalancer does not emit TRANSIENT_FAILURE states #11082

Closed
raboof opened this issue Apr 6, 2024 · 16 comments
Closed

PickFirstLeafLoadBalancer does not emit TRANSIENT_FAILURE states #11082

raboof opened this issue Apr 6, 2024 · 16 comments
Assignees
Milestone

Comments

@raboof
Copy link
Contributor

raboof commented Apr 6, 2024

When using channel.notifyWhenStateChanged and trying to connect to addresses that don't accept connections, the PickFirstLoadBalancer emits TRANSIENT_FAILURE states, while the PickFirstLeafLoadBalancer just stays in CONNECTING.

What version of gRPC-Java are you using?

I noticed this after updating from 1.62.2 to 1.63.0. I can also reproduce it on 1.62.2 when I set the GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST environment variable to true (which has become the default in 1.63.0).

What is your environment?

Linux (NixOS unstable), Oracle Java 1.8.0_362

What did you expect to see?

Alternating between CONNECTING and TRANSIENT_FAILURE states

What did you see instead?

Silence after entering the CONNECTING state

Steps to reproduce the bug

I don't have a particularly minimal reproducer, but can reliably show the problem with the "NonBalancingIntegrationSpecNetty" test in Akka gRPC (apache/pekko-grpc#271 (comment))

@temawi
Copy link
Contributor

temawi commented Apr 8, 2024

@larry-safran You might be best to investigate this one.

@YifeiZhuang
Copy link
Contributor

I can not reproduce it yet in the unit test. The existing UT covers some basic situations for the channel to report TRANSIENT_FAILURE after CONNECTING, e.g. with multiple and one address, initial iteration complete for all the addresses
https://github.com/grpc/grpc-java/blob/master/core/src/test/java/io/grpc/internal/PickFirstLeafLoadBalancerTest.java#L333
https://github.com/grpc/grpc-java/blob/master/core/src/test/java/io/grpc/internal/PickFirstLeafLoadBalancerTest.java#L531

@raboof
Copy link
Contributor Author

raboof commented Apr 11, 2024

Thank you for looking into this, much appreciated - I'll see if I can find the time to dig into what's different between those grpc-java UT's and the behavior I'm seeing in the pekko-grpc integration test. Might have to wait a couple of weeks, though - busy period here.

@larry-safran
Copy link
Contributor

Are you providing multiple addresses or only a single address? Is it only in CONNECTING for a limited time or remains that way permanently?

@raboof
Copy link
Contributor Author

raboof commented Apr 16, 2024 via email

@raboof
Copy link
Contributor Author

raboof commented Apr 25, 2024

I created a possible unit test reproducer in https://github.com/grpc/grpc-java/compare/master...raboof:grpc-java:test-for-PickFirstLeafLoadBalancer-11082?expand=1 - I'm not too familiar with the grpc-java codebase, so it is possible that I'm misunderstanding something and not accurately reproducing the issue, but it might be a good starting point for further analysis. The behaviour does look similar to what I'm seeing in the pekko-grpc failure, where isPassComplete keeps returning false (because addressIndex.isValid() remains true).

@pjfanning
Copy link

1.64.0 seems to work fin in the Apache Pekko gRPC tests. - apache/pekko-grpc#311

@raboof
Copy link
Contributor Author

raboof commented May 23, 2024

I'm OK with closing this as the Pekko gRPC tests suggests the problem is gone on 1.64. My reproducer at https://github.com/grpc/grpc-java/compare/master...raboof:grpc-java:test-for-PickFirstLeafLoadBalancer-11082?expand=1 still fails when rebased, but I guess that's likely a problem with the test rather than with the implementation.

@ejona86
Copy link
Member

ejona86 commented May 23, 2024

We'd expect 1.64 would be fixed, because we disabled PickFirstLeafLoadBalancer (by default). But we'll be turning it on again in the future, so this is appropriate to keep open and us address.

@ejona86 ejona86 added this to the 1.65 milestone May 23, 2024
@raboof
Copy link
Contributor Author

raboof commented May 23, 2024

We'd expect 1.64 would be fixed, because we disabled PickFirstLeafLoadBalancer (by default).

Ah 😄

But we'll be turning it on again in the future, so this is appropriate to keep open and us address.

Makes sense, LMK if you want me to test anything!

@ejona86 ejona86 modified the milestones: 1.65, 1.66 Jun 11, 2024
@larry-safran
Copy link
Contributor

@raboof Looking at your test code for reproducing, you are sending the subchannel state twice to the same subchannel which is why isPassComplete keeps returning false. Changing the logic for the second onSubchannelState call makes the test case pass.

    inOrder.verify(mockSubchannel2).start(stateListenerCaptor.capture());
    SubchannelStateListener stateListener2 = stateListenerCaptor.getValue();
    stateListener2.onSubchannelState(ConnectivityStateInfo.forTransientFailure(CONNECTION_ERROR));

Is it possible that the pacheko test that was failing isn't actually providing a failure on the second subchannel?

@raboof
Copy link
Contributor Author

raboof commented Jun 19, 2024

ah, sorry for botching the reproducer ;)

Is it possible that the pekko test that was failing isn't actually providing a failure on the second subchannel?

hmm, that seems unlikely, as I'm not failing subchannels explicitly there (just passing a service with two invalid addresses).

@larry-safran
Copy link
Contributor

@raboof There was another change that went into 1.64 related to load balancers that might have also had an impact on the problem. Could you try 1.64 with GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST=true and see if the problem is still there?

raboof added a commit to raboof/grpc-java that referenced this issue Jun 21, 2024
Possible reproducer for grpc#11082 .

I'm not too familiar with the grpc-java codebase, so it is possible
that I'm misunderstanding something and not accurately reproducing
the issue, but it might be a good starting point for further
analysis. The behaviour does look similar to what I'm seeing in
the pekko-grpc failure, where `isPassComplete` keeps returning
`false` (because `addressIndex.isValid()` remains `true`).
@raboof
Copy link
Contributor Author

raboof commented Jun 21, 2024

OK, I can confirm our test still fails on 1.64 with GRPC_EXPERIMENTAL_ENABLE_NEW_PICK_FIRST=true, but there seems to be something else going on than I assumed earlier:

In the pekko test we're passing a list of two EquivalentAddressGroups to io.grpc.NameResolver.Listener#onAddresses that both contain the same (invalid) address, ending up with a single subchannel but an addressIndex of size 2. Indeed when we pass two different invalid addresses it works fine.

Is that invalid input or something you'd like to support/handle?

@kannanjgithub kannanjgithub modified the milestones: 1.66, 1.67 Aug 1, 2024
@ejona86
Copy link
Member

ejona86 commented Sep 3, 2024

This was fixed by #11342 + #11345

@ejona86 ejona86 closed this as completed Sep 3, 2024
@ejona86 ejona86 modified the milestones: 1.67, 1.66 Sep 3, 2024
@raboof
Copy link
Contributor Author

raboof commented Sep 3, 2024

Nice! Indeed pekko-grpc updated to v1.66.0 which should have this fix and default to PickFirstLeafLoadBalancer, and the integration tests are happy. Thanks!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants