What version of gRPC-Java are you using?
1.63.0
What is your environment?
Linux Ubuntu 20.04
Openjdk version "21.0.2" 2024-01-16 LTS
What did you expect to see?
No "Uncaught exception in the SynchronizationContext. Panic!" errors in client requests (java.lang.NullPointerException: Cannot invoke "io.grpc.internal.PickFirstLeafLoadBalancer$SubchannelData.getSubchannel()" because "subchannelData" is null).
What did you see instead?
In the logs I see NPE errors in PickFirstLeafLoadBalancer, causing client requests to fail.
These errors are intermittent and happen on some apparently random VMs.
After some investigation I found temporary DNS lookup failures that happened before these NPEs started to appear.
The DNS issues are something happening on our side, but I believe the client should be able to handle temporary resolution failures.
Logs are provided below:
2024-05-20 17:28:13,099 WARN [grpc-default-executor-8397] i.g.i.ManagedChannelImpl: [Channel<17>: (some.internal.host.net:1234)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host some.internal.host.net, cause=java.lang.RuntimeException: java.net.UnknownHostException: some.internal.host.net
at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:282)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.net.UnknownHostException: some.internal.host.net.net
at java.base/java.net.InetAddress$CachedLookup.get(InetAddress.java:988)
at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1818)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1688)
at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:632)
at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:219)
... 5 more
}
2024-05-20 17:28:18,481 ERROR [grpc-default-executor-8395] i.g.i.ManagedChannelImpl: [Channel<17>: (some.internal.host.net:1234)] Uncaught exception in the SynchronizationContext. Panic!
java.lang.NullPointerException: Cannot invoke "io.grpc.internal.PickFirstLeafLoadBalancer$SubchannelData.getSubchannel()" because "subchannelData" is null
at io.grpc.internal.PickFirstLeafLoadBalancer.acceptResolvedAddresses(PickFirstLeafLoadBalancer.java:138)
at io.grpc.internal.AutoConfiguredLoadBalancerFactory$AutoConfiguredLoadBalancer.tryAcceptResolvedAddresses(AutoConfiguredLoadBalancerFactory.java:142)
at io.grpc.internal.ManagedChannelImpl$NameResolverListener$1NamesResolved.run(ManagedChannelImpl.java:1877)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:126)
at io.grpc.internal.ManagedChannelImpl$NameResolverListener.onResult(ManagedChannelImpl.java:1891)
at io.grpc.internal.RetryingNameResolver$RetryingListener.onResult(RetryingNameResolver.java:98)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:333)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
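For context, the failing requests go through an ordinary client channel; a minimal sketch of the assumed setup is below (the target host and port are the placeholders from the logs, and no load-balancing policy is configured, so gRPC's default pick_first policy and the default DNS name resolver are in use):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Minimal sketch of the assumed client setup (placeholder host/port from the logs).
// With no LB policy configured, the default pick_first policy is used and the
// default DNS resolver looks up "some.internal.host.net".
public class ClientChannelSketch {
    public static void main(String[] args) {
        ManagedChannel channel = ManagedChannelBuilder
                .forTarget("dns:///some.internal.host.net:1234")
                .usePlaintext()
                .build();
        // ... create stubs and issue RPCs on `channel` ...
        channel.shutdown();
    }
}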
Steps to reproduce the bug
I don't have simple steps to reproduce, unfortunately, but I think I found the call sequence causing the issue:
1. io.grpc.internal.DnsNameResolver#resolveAddresses throws a RuntimeException with UnknownHostException as the cause.
2. io.grpc.internal.DnsNameResolver#doResolve catches it and returns result.error = Status.UNAVAILABLE.
3. io.grpc.internal.DnsNameResolver.Resolve#run then calls savedListener.onError(result.error), where savedListener is io.grpc.internal.ManagedChannelImpl.NameResolverListener.
4. io.grpc.internal.ManagedChannelImpl.NameResolverListener#onError effectively calls helper.lb.handleNameResolutionError(error), where lb is io.grpc.internal.PickFirstLeafLoadBalancer.
5. io.grpc.internal.PickFirstLeafLoadBalancer#handleNameResolutionError finally calls subchannels.clear(), but does nothing with addressIndex.
6. As a result, subchannels is cleared while addressIndex stays intact, leaving the two in an inconsistent state: addressIndex.seekTo(previousAddress) returns true, but subchannels.get(previousAddress) returns null.
7. This inconsistency causes the NPE above (a simplified sketch of the state mismatch follows).
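A minimal, self-contained sketch of the suspected state mismatch, using hypothetical names and simplified types (this is not the actual gRPC code, only an illustration of why the lookup in step 6 comes back null):

import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model: `subchannels` holds per-address subchannel data, while
// `addressIndex` independently remembers addresses seen by the load balancer.
public class PickFirstStateSketch {
    private static final Map<SocketAddress, Object> subchannels = new HashMap<>();
    private static final Set<SocketAddress> addressIndex = new HashSet<>();

    public static void main(String[] args) {
        SocketAddress previousAddress = new InetSocketAddress("10.0.0.1", 1234);
        subchannels.put(previousAddress, new Object());
        addressIndex.add(previousAddress);

        // Step 5 above: on a name-resolution error, subchannels are cleared,
        // but addressIndex is left untouched.
        subchannels.clear();

        // On the next successful resolution, the previous address is still present in
        // addressIndex, so the code assumes a subchannel exists for it and gets null.
        if (addressIndex.contains(previousAddress)) {
            Object subchannelData = subchannels.get(previousAddress); // null
            subchannelData.toString(); // NullPointerException, as in the logs above
        }
    }
}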
This looks to be a very useful report. We had been told of this before, but had little to go on because it had only been seen on random Android devices with little context about what happened earlier. I had done a brief audit, but nothing jumped out at me.
@nyukhalov, this issue does not impact the latest patch release of any version, because we disabled the newer code path there. This issue tracks fixing the new code path so we can enable it again.