What version of gRPC-Java are you using?
1.63.0
What is your environment?
Linux Ubuntu 20.04
Openjdk version "21.0.2" 2024-01-16 LTS
What did you expect to see?
No "Uncaught exception in the SynchronizationContext. Panic!" errors in client requests (java.lang.NullPointerException: Cannot invoke "io.grpc.internal.PickFirstLeafLoadBalancer$SubchannelData.getSubchannel()" because "subchannelData" is null).
What did you see instead?
In the logs I see NPE errors in PickFirstLeafLoadBalancer, causing client requests to fail.
These errors are intermittent and happen on some apparently random VMs.
After some investigation I found temporary DNS lookup failures that happened before these NPEs started to appear.
The DNS issues are something happening on our side, but I believe the client should be able to handle temporary resolution failures.
Logs are provided below:
2024-05-20 17:28:13,099 WARN [grpc-default-executor-8397] i.g.i.ManagedChannelImpl: [Channel<17>: (some.internal.host.net:1234)] Failed to resolve name. status=Status{code=UNAVAILABLE, description=Unable to resolve host some.internal.host.net, cause=java.lang.RuntimeException: java.net.UnknownHostException: some.internal.host.net
at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:223)
at io.grpc.internal.DnsNameResolver.doResolve(DnsNameResolver.java:282)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:318)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
Caused by: java.net.UnknownHostException: some.internal.host.net.net
at java.base/java.net.InetAddress$CachedLookup.get(InetAddress.java:988)
at java.base/java.net.InetAddress.getAllByName0(InetAddress.java:1818)
at java.base/java.net.InetAddress.getAllByName(InetAddress.java:1688)
at io.grpc.internal.DnsNameResolver$JdkAddressResolver.resolveAddress(DnsNameResolver.java:632)
at io.grpc.internal.DnsNameResolver.resolveAddresses(DnsNameResolver.java:219)
... 5 more
}
2024-05-20 17:28:18,481 ERROR [grpc-default-executor-8395] i.g.i.ManagedChannelImpl: [Channel<17>: (some.internal.host.net:1234)] Uncaught exception in the SynchronizationContext. Panic!
java.lang.NullPointerException: Cannot invoke "io.grpc.internal.PickFirstLeafLoadBalancer$SubchannelData.getSubchannel()" because "subchannelData" is null
at io.grpc.internal.PickFirstLeafLoadBalancer.acceptResolvedAddresses(PickFirstLeafLoadBalancer.java:138)
at io.grpc.internal.AutoConfiguredLoadBalancerFactory$AutoConfiguredLoadBalancer.tryAcceptResolvedAddresses(AutoConfiguredLoadBalancerFactory.java:142)
at io.grpc.internal.ManagedChannelImpl$NameResolverListener$1NamesResolved.run(ManagedChannelImpl.java:1877)
at io.grpc.SynchronizationContext.drain(SynchronizationContext.java:94)
at io.grpc.SynchronizationContext.execute(SynchronizationContext.java:126)
at io.grpc.internal.ManagedChannelImpl$NameResolverListener.onResult(ManagedChannelImpl.java:1891)
at io.grpc.internal.RetryingNameResolver$RetryingListener.onResult(RetryingNameResolver.java:98)
at io.grpc.internal.DnsNameResolver$Resolve.run(DnsNameResolver.java:333)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
at java.base/java.lang.Thread.run(Thread.java:1583)
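For context, the failing requests go through an ordinary client channel; a minimal sketch of the assumed setup is below (the target host and port are the placeholders from the logs, and no load-balancing policy is configured, so gRPC's default pick_first policy and the default DNS name resolver are in use):

import io.grpc.ManagedChannel;
import io.grpc.ManagedChannelBuilder;

// Minimal sketch of the assumed client setup (placeholder host/port from the logs).
// With no LB policy configured, the default pick_first policy is used and the
// default DNS resolver looks up "some.internal.host.net".
public class ClientChannelSketch {
    public static void main(String[] args) {
        ManagedChannel channel = ManagedChannelBuilder
                .forTarget("dns:///some.internal.host.net:1234")
                .usePlaintext()
                .build();
        // ... create stubs and issue RPCs on `channel` ...
        channel.shutdown();
    }
}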
Steps to reproduce the bug
I don't have simple steps to reproduce, unfortunately, but I think I found the call sequence causing the issue:
1. io.grpc.internal.DnsNameResolver#resolveAddresses throws a RuntimeException with UnknownHostException as the cause.
2. io.grpc.internal.DnsNameResolver#doResolve catches it and returns result.error = Status.UNAVAILABLE.
3. io.grpc.internal.DnsNameResolver.Resolve#run then calls savedListener.onError(result.error), where savedListener is io.grpc.internal.ManagedChannelImpl.NameResolverListener.
4. io.grpc.internal.ManagedChannelImpl.NameResolverListener#onError effectively calls helper.lb.handleNameResolutionError(error), where lb is io.grpc.internal.PickFirstLeafLoadBalancer.
5. io.grpc.internal.PickFirstLeafLoadBalancer#handleNameResolutionError finally calls subchannels.clear(), but does nothing with addressIndex.
6. As a result, subchannels is cleared while addressIndex stays intact, leaving the two in an inconsistent state: addressIndex.seekTo(previousAddress) returns true, but subchannels.get(previousAddress) returns null.
7. This inconsistency causes the NPE above (a simplified sketch of the state mismatch follows).
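A minimal, self-contained sketch of the suspected state mismatch, using hypothetical names and simplified types (this is not the actual gRPC code, only an illustration of why the lookup in step 6 comes back null):

import java.net.InetSocketAddress;
import java.net.SocketAddress;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Simplified model: `subchannels` holds per-address subchannel data, while
// `addressIndex` independently remembers addresses seen by the load balancer.
public class PickFirstStateSketch {
    private static final Map<SocketAddress, Object> subchannels = new HashMap<>();
    private static final Set<SocketAddress> addressIndex = new HashSet<>();

    public static void main(String[] args) {
        SocketAddress previousAddress = new InetSocketAddress("10.0.0.1", 1234);
        subchannels.put(previousAddress, new Object());
        addressIndex.add(previousAddress);

        // Step 5 above: on a name-resolution error, subchannels are cleared,
        // but addressIndex is left untouched.
        subchannels.clear();

        // On the next successful resolution, the previous address is still present in
        // addressIndex, so the code assumes a subchannel exists for it and gets null.
        if (addressIndex.contains(previousAddress)) {
            Object subchannelData = subchannels.get(previousAddress); // null
            subchannelData.toString(); // NullPointerException, as in the logs above
        }
    }
}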
This looks to be a very useful report. We had been told of this before, but had little to go on because it had only been seen on random Android devices with little context about what happened earlier. I had done a brief audit, but nothing jumped out at me.
@nyukhalov, this issue does not impact the latest patch release of any version, because we disabled the newer code path there. This issue tracks fixing the new code path so we can enable it again.