
Failure to shutdown NettyEventLoopProvider.releaseEventLoop() leaves bad executor in cache #445

Closed
philipa opened this issue Oct 15, 2020 · 11 comments · Fixed by #447

philipa commented Oct 15, 2020

If this happens...

io.netty.channel.ChannelException: eventfd_write() failed: Bad file descriptor
	at io.netty.channel.epoll.Native.eventFdWrite(Native Method)
	at io.netty.channel.epoll.EpollEventLoop.wakeup(EpollEventLoop.java:167)
	at io.netty.util.concurrent.SingleThreadEventExecutor.shutdownGracefully(SingleThreadEventExecutor.java:670)
	at io.netty.util.concurrent.MultithreadEventExecutorGroup.shutdownGracefully(MultithreadEventExecutorGroup.java:163)
	at com.hivemq.client.internal.netty.NettyEventLoopProvider.releaseEventLoop(NettyEventLoopProvider.java:123)
	at com.hivemq.client.internal.mqtt.MqttClientConfig.lambda$releaseEventLoop$0(MqttClientConfig.java:200)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.epoll.EpollEventLoop.run(EpollEventLoop.java:333)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

Then all subsequent connect operations using the same netty executor fail:

2020-10-14 19:19:49.452|WARN|main||Force-closing a channel whose registration task was not accepted by an event loop: [id: 0xebe333de]|io.netty.channel.AbstractChannel
java.util.concurrent.RejectedExecutionException: event executor terminated
	at io.netty.util.concurrent.SingleThreadEventExecutor.reject(SingleThreadEventExecutor.java:926)
	at io.netty.util.concurrent.SingleThreadEventExecutor.offerTask(SingleThreadEventExecutor.java:353)
	at io.netty.util.concurrent.SingleThreadEventExecutor.addTask(SingleThreadEventExecutor.java:346)
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:828)
	at io.netty.util.concurrent.SingleThreadEventExecutor.execute(SingleThreadEventExecutor.java:818)
	at io.netty.channel.AbstractChannel$AbstractUnsafe.register(AbstractChannel.java:471)
	at io.netty.channel.SingleThreadEventLoop.register(SingleThreadEventLoop.java:87)
	at io.netty.channel.SingleThreadEventLoop.register(SingleThreadEventLoop.java:81)
	at io.netty.bootstrap.AbstractBootstrap.initAndRegister(AbstractBootstrap.java:323)
	at io.netty.bootstrap.Bootstrap.doResolveAndConnect(Bootstrap.java:155)
	at io.netty.bootstrap.Bootstrap.connect(Bootstrap.java:148)
	at com.hivemq.client.internal.mqtt.handler.connect.MqttConnAckSingle.connect(MqttConnAckSingle.java:91)
	at com.hivemq.client.internal.mqtt.handler.connect.MqttConnAckSingle.subscribeActual(MqttConnAckSingle.java:68)
	at io.reactivex.Single.subscribe(Single.java:3666)
	at io.reactivex.internal.operators.single.SingleObserveOn.subscribeActual(SingleObserveOn.java:35)
	at io.reactivex.Single.subscribe(Single.java:3666)
	at com.hivemq.client.internal.rx.RxFutureConverter$RxSingleFuture.<init>(RxFutureConverter.java:113)
	at com.hivemq.client.internal.rx.RxFutureConverter.toFuture(RxFutureConverter.java:43)
	at com.hivemq.client.internal.mqtt.MqttAsyncClient.connect(MqttAsyncClient.java:116)
	at com.hivemq.client.internal.mqtt.mqtt3.Mqtt3AsyncClientView.connect(Mqtt3AsyncClientView.java:97)
	at com.hivemq.client.internal.mqtt.mqtt3.Mqtt3AsyncClientView.connect(Mqtt3AsyncClientView.java:89)

philipa commented Oct 15, 2020

This is happening occasionally in some integration tests I have. I'm going to try a trivial fix:

diff --git a/src/main/java/com/hivemq/client/internal/netty/NettyEventLoopProvider.java b/src/main/java/com/hivemq/client/internal/netty/NettyEventLoopProvider.java
index 68975154..6b9a127e 100644
--- a/src/main/java/com/hivemq/client/internal/netty/NettyEventLoopProvider.java
+++ b/src/main/java/com/hivemq/client/internal/netty/NettyEventLoopProvider.java
@@ -119,10 +119,10 @@ public class NettyEventLoopProvider {
     public synchronized void releaseEventLoop(final @Nullable Executor executor) {
         final Entry entry = entries.get(executor);
         if (--entry.referenceCount == 0) {
+            entries.remove(executor);
             if (!(executor instanceof MultithreadEventLoopGroup)) {
                 entry.eventLoopGroup.shutdownGracefully(0, 0, TimeUnit.MILLISECONDS);
             }
-            entries.remove(executor);
         }
     }

I'll report back whether this works soon. Do with it what you wish. (I'm not going to sign your CLA for this trivial fix).

@SgtSilvio

Hi @philipa
We really want to find the cause of the error you are seeing.
Could you provide a sample of your integration tests?
How many integration tests do you have that use the client library?
How many client instances do you use per test?

Unfortunately, the patch you provided will very likely not fix this: the code is inside a synchronized method, so reordering the lines should not change anything.


philipa commented Oct 16, 2020

Hi Silvio,

The good news is that the fix above does appear to stabilise our tests. They weren't failing every time, so let me monitor for a few more days. With the fix, I'm not even seeing the event loop exception logged, which is strange.

This isn't a concurrency issue, so synchronized isn't relevant. The problem is that shutdownGracefully invalidates the executor before throwing the exception, leaving a bad executor in the entries cache. It could also be fixed using a try/finally.
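To illustrate the try/finally alternative, here is a minimal, self-contained sketch of a reference-counted cache with the same cleanup guarantee. The class `EventLoopCache` and its `acquire`/`release` methods are hypothetical stand-ins for the real `NettyEventLoopProvider` code, and the `Runnable` stands in for `eventLoopGroup.shutdownGracefully(...)`:

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not the actual NettyEventLoopProvider code: the entry is
// evicted from the cache even if the shutdown callback throws (as
// shutdownGracefully() does when it hits the bad file descriptor).
class EventLoopCache {

    static final class Entry {
        int referenceCount = 1;
        final Runnable shutdown; // stands in for eventLoopGroup.shutdownGracefully(...)
        Entry(final Runnable shutdown) { this.shutdown = shutdown; }
    }

    private final Map<Object, Entry> entries = new HashMap<>();

    public synchronized void acquire(final Object key, final Runnable shutdown) {
        final Entry entry = entries.get(key);
        if (entry == null) {
            entries.put(key, new Entry(shutdown));
        } else {
            entry.referenceCount++;
        }
    }

    public synchronized void release(final Object key) {
        final Entry entry = entries.get(key);
        if (--entry.referenceCount == 0) {
            try {
                entry.shutdown.run(); // may throw, e.g. ChannelException
            } finally {
                entries.remove(key); // entry is removed regardless
            }
        }
    }

    public synchronized boolean contains(final Object key) {
        return entries.containsKey(key);
    }
}
```

Either ordering (remove-then-shutdown, as in the diff above, or shutdown inside try/finally) guarantees the stale entry never survives a throwing shutdown.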

I have one integration test suite, with a handful of integration tests, run serially. Each test starts a MqttClient, connecting over TCP. There's little point in sharing the tests themselves because they rely on other unreleased, proprietary code. Clients that manage to connect successfully are explicitly disconnected.

Searching for netty eventFdWrite and "Bad file descriptor" shows several matches for other netty users. Regardless of the underlying cause, ensuring the entries cache is always cleaned up seems like good defensive programming.


SgtSilvio commented Oct 16, 2020

@philipa You are right, I had overlooked that it is the shutdownGracefully call that throws the exception and then leaves the corrupt event loop in the map.
So we will apply your patch together with a comment.
Still it would be interesting to find out the cause why shutdownGracefully throws the "Bad file descriptor" exception.


philipa commented Oct 16, 2020

Thanks. I'll keep investigating and let you know.

@SgtSilvio

This seems to be the exact same problem: grpc/grpc-java#6002
The root cause seems to be netty/netty#9362, fixed here: netty/netty#9535

@SgtSilvio

@philipa Which version of netty-transport-native-epoll are you using?
The stacktrace seems to be from a version between 4.1.33 and 4.1.36?
The root cause seems to be fixed in 4.1.42.

@SgtSilvio SgtSilvio added the bug label Oct 19, 2020
@SgtSilvio SgtSilvio self-assigned this Oct 19, 2020
@SgtSilvio SgtSilvio added this to the 1.2.2 milestone Oct 19, 2020

philipa commented Oct 19, 2020

Ahhh - dependency hell.

Looks like I've got a version of littleproxy that shades in 4.1.34. I'm using the fork from https://github.com/mrog/LittleProxy. I see no reason to use the shaded version, so I'll stop doing that, switch back to mqtt-client 1.2.1, and report back here if this also solves the problem.


SgtSilvio commented Oct 20, 2020

If you use Maven, please use:

    <dependencies>
        <dependency>
            <groupId>com.hivemq</groupId>
            <artifactId>hivemq-mqtt-client</artifactId>
            <version>1.2.1</version>
        </dependency>
        <dependency>
            <groupId>com.hivemq</groupId>
            <artifactId>hivemq-mqtt-client-epoll</artifactId>
            <version>1.2.1</version>
            <type>pom</type>
        </dependency>
    </dependencies>

If you use Gradle, please use:

dependencies {
    implementation("com.hivemq:hivemq-mqtt-client:1.2.1")
    implementation(platform("com.hivemq:hivemq-mqtt-client-epoll:1.2.1"))
}


philipa commented Oct 21, 2020

Yes, that's what I'm using.

Now that I've straightened out my dependency conflicts, I can report that hivemq-mqtt-client 1.2.1 / netty 4.1.35 is working fine.

Thanks!
