Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

所有nioEventLoopGroup在进入ReplicaSession.colseSession后无限等待 #1355

Closed
kkk6285137 opened this issue Feb 16, 2023 · 3 comments
Closed
Labels
good-first-issue type/bug This issue reports a bug.

Comments

@kkk6285137
Copy link

kkk6285137 commented Feb 16, 2023

现象

2月14号晚上9点半某机房3个客户端服务实例(该机房总共7个实例)突然报大量pegasus调用超时异常,持续近一个小时没有恢复。在将报异常的客户端实例重启后,客户端恢复正常。检查pegasus服务端相关的监控打点,未见QPS激增、耗时增加等情况。

问题追踪

保留了一个客户端实例的jstack现场,发现所有的nioEventLoopGroup在调用进入ReplicaSession.colseSession逻辑后,处于wait状态。部分线程调用栈信息如下所示。

"nioEventLoopGroup-2-3" #678 prio=10 os_prio=0 tid=0x00007f85487f4000 nid=0x6fc0 in Object.wait() [0x00007f81fda9b000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:502)
	at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:252)
	- locked <0x0000000749ba8ad0> (a io.netty.channel.DefaultChannelPromise)
	at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131)
	at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30)
	at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:403)
	at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119)
	at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.closeSession(ReplicaSession.java:121)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.tryNotifyFailureWithSeqID(ReplicaSession.java:295)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession$4.run(ReplicaSession.java:345)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

"nioEventLoopGroup-2-2" #677 prio=10 os_prio=0 tid=0x00007f85487f3000 nid=0x6fbf in Object.wait() [0x00007f83a76f7000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:502)
	at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:252)
	- locked <0x0000000749861118> (a io.netty.channel.DefaultChannelPromise)
	at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131)
	at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30)
	at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:403)
	at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119)
	at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.closeSession(ReplicaSession.java:121)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.tryNotifyFailureWithSeqID(ReplicaSession.java:295)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession$4.run(ReplicaSession.java:345)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:497)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

"nioEventLoopGroup-2-1" #676 prio=10 os_prio=0 tid=0x00007f854aa4b000 nid=0x6fbe in Object.wait() [0x00007f81f6023000]
   java.lang.Thread.State: WAITING (on object monitor)
	at java.lang.Object.wait(Native Method)
	at java.lang.Object.wait(Object.java:502)
	at io.netty.util.concurrent.DefaultPromise.await(DefaultPromise.java:252)
	- locked <0x00000007496942e0> (a io.netty.channel.DefaultChannelPromise)
	at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:131)
	at io.netty.channel.DefaultChannelPromise.await(DefaultChannelPromise.java:30)
	at io.netty.util.concurrent.DefaultPromise.sync(DefaultPromise.java:403)
	at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:119)
	at io.netty.channel.DefaultChannelPromise.sync(DefaultChannelPromise.java:30)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.closeSession(ReplicaSession.java:121)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.tryNotifyFailureWithSeqID(ReplicaSession.java:295)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession$4.run(ReplicaSession.java:345)
	at io.netty.util.concurrent.PromiseTask.runTask(PromiseTask.java:98)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:170)
	at io.netty.util.concurrent.AbstractEventExecutor.safeExecute(AbstractEventExecutor.java:164)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:472)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:500)
	at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:989)
	at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
	at io.netty.util.concurrent.FastThreadLocalRunnable.run(FastThreadLocalRunnable.java:30)
	at java.lang.Thread.run(Thread.java:748)

继续追踪当天晚上的运行日志,发现nioEventLoopGroup线程在9点24分输出

2023-02-14 21:24:37,147 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.136.160.9:54801): actively close the session because it's not responding for 10 seconds

但是未能继续输出channel to rpc_address %s closed 的日志 。表明这些线程在9点24分后,就卡在wait状态。这个时间点也与我收到故障报警的时间相同。
我比较疑惑为什么pegasus线程会进入该状态,为什么没有能从wait状态恢复正常?

部分日志信息如下:

2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005109 failed: 
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.1.15:54801)) rpc timeout for gpid(gpid(16.11)), operator(com.xiaomi.infra.pegasus.operator.rrdb_get_operator@1dfa3ab9), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309853 failed: 
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309853 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309854 failed: 
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.2.44:54801)) rpc timeout for gpid(gpid(16.4)), operator(com.xiaomi.infra.pegasus.operator.rrdb_get_operator@1b6db4a0), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005110 failed: 
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005110 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005111 failed: 
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005111 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309855 failed: 
2023-02-14 21:24:29,231 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309855 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,133 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): actively close the session because it's not responding for 10 seconds
2023-02-14 21:24:32,160 [nioEventLoopGroup-2-3] INFO  - channel to rpc_address(10.114.1.15:54801) closed
2023-02-14 21:24:32,160 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.1.15:54801)) doesn't serve gpid(gpid(16.12)), operator(com.xiaomi.infra.pegasus.operator.rrdb_check_and_set_operator@68bb8261), try(1), error_code(ERR_SESSION_RESET), need query meta
2023-02-14 21:24:32,160 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309856 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,169 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309857 failed: 
2023-02-14 21:24:32,169 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): actively close the session because it's not responding for 10 seconds
2023-02-14 21:24:32,169 [nioEventLoopGroup-2-3] INFO  - channel to rpc_address(10.114.2.44:54801) closed
2023-02-14 21:24:32,169 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.2.44:54801)) doesn't serve gpid(gpid(16.4)), operator(com.xiaomi.infra.pegasus.operator.rrdb_check_and_set_operator@1c6b0b76), try(1), error_code(ERR_SESSION_RESET), need query meta
2023-02-14 21:24:32,169 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005113 failed: 
2023-02-14 21:24:32,169 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.1.15:54801)) rpc timeout for gpid(gpid(16.12)), operator(com.xiaomi.infra.pegasus.operator.rrdb_check_and_set_operator@5088d5b0), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:32,169 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309858 failed: 
2023-02-14 21:24:32,169 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309858 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,169 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005114 failed: 
2023-02-14 21:24:32,169 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005114 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005115 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005115 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005116 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005116 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309859 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309859 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309860 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309860 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309861 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309861 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005117 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005117 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005118 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005118 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309862 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309862 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309863 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005119 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005119 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.2.44:54801)) rpc timeout for gpid(gpid(16.3)), operator(com.xiaomi.infra.pegasus.operator.rrdb_get_operator@26b6b3b5), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309864 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309864 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005120 failed: 
2023-02-14 21:24:32,188 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005120 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,189 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.1.15:54801)) rpc timeout for gpid(gpid(16.12)), operator(com.xiaomi.infra.pegasus.operator.rrdb_check_and_set_operator@122a9588), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:32,352 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005122 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,370 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005123 failed: 
2023-02-14 21:24:32,370 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005123 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,370 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309865 failed: 
2023-02-14 21:24:32,370 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.2.44:54801)) rpc timeout for gpid(gpid(16.3)), operator(com.xiaomi.infra.pegasus.operator.rrdb_check_and_set_operator@565e1c00), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309866 failed: 
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309866 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005124 failed: 
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005124 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309867 failed: 
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309867 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309868 failed: 
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309868 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.2.44:54801)) rpc timeout for gpid(gpid(16.3)), operator(com.xiaomi.infra.pegasus.operator.rrdb_get_operator@40ed662e), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309869 failed: 
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309869 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:32,371 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309870 failed: 
--
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005187 failed: 
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005187 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005188 failed: 
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005188 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005189 failed: 
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005189 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309948 failed: 
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.2.44:54801)) rpc timeout for gpid(gpid(16.3)), operator(com.xiaomi.infra.pegasus.operator.rrdb_get_operator@1741089c), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005190 failed: 
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005190 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005191 failed: 
2023-02-14 21:24:37,074 [nioEventLoopGroup-2-3] WARN  - passport_ptn_login_cache: replica server(rpc_address(10.114.1.15:54801)) rpc timeout for gpid(gpid(16.12)), operator(com.xiaomi.infra.pegasus.operator.rrdb_get_operator@490c0e10), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:37,075 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309949 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005192 failed: 
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005192 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309950 failed: 
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309950 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005193 failed: 
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005193 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - passport_user_action_region: replica server(rpc_address(10.114.2.44:54801)) rpc timeout for gpid(gpid(11.19)), operator(com.xiaomi.infra.pegasus.operator.rrdb_get_operator@6f691167), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309951 failed: 
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309951 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309952 failed: 
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309952 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309953 failed: 
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309953 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005194 failed: 
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005194 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - passport_user_action_region: replica server(rpc_address(10.114.1.15:54801)) rpc timeout for gpid(gpid(11.109)), operator(com.xiaomi.infra.pegasus.operator.rrdb_get_operator@23bbede), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005195 failed: 
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005195 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - passport_user_action_region: replica server(rpc_address(10.114.2.44:54801)) rpc timeout for gpid(gpid(11.100)), operator(com.xiaomi.infra.pegasus.operator.rrdb_get_operator@1d2153ed), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309954 failed: 
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309954 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005196 failed: 
2023-02-14 21:24:37,120 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005196 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,138 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309955 failed: 
2023-02-14 21:24:37,138 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309955 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,138 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309956 failed: 
2023-02-14 21:24:37,138 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309956 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,138 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.1.15:54801) write seqid 514005197 failed: 
2023-02-14 21:24:37,138 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005197 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,138 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309957 failed: 
2023-02-14 21:24:37,138 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309957 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,138 [nioEventLoopGroup-2-3] WARN  - passport_user_action_region: replica server(rpc_address(10.114.1.15:54801)) rpc timeout for gpid(gpid(11.94)), operator(com.xiaomi.infra.pegasus.operator.rrdb_get_operator@34f4a2d7), try(1), error_code(ERR_TIMEOUT), not retry
2023-02-14 21:24:37,138 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.1.15:54801): 514005198 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,147 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309958 failed: 
2023-02-14 21:24:37,147 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309958 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,147 [nioEventLoopGroup-2-3] INFO  - rpc_address(10.114.2.44:54801) write seqid 514309959 failed: 
2023-02-14 21:24:37,147 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.114.2.44:54801): 514309959 is removed by others, current error ERR_TIMEOUT, isTimeoutTask false
2023-02-14 21:24:37,147 [nioEventLoopGroup-2-3] WARN  - rpc_address(10.136.160.9:54801): actively close the session because it's not responding for 10 seconds
2023-02-14 22:37:13,355 [nioEventLoopGroup-7-2] WARN  - Channel [id: 0x3acda6b5, L:/10.162.4.3:57031 ! R:/10.132.145.43:34601] for session rpc_address(10.132.145.43:34601) is inactive
2023-02-14 22:37:13,355 [nioEventLoopGroup-7-2] INFO  - rpc_address(10.132.145.43:34601): mark the session to be disconnected from state=CONNECTED
2023-02-14 22:37:13,355 [nioEventLoopGroup-7-2] WARN  - Channel [id: 0xfdf4e624, L:/10.162.4.3:60985 ! R:/10.162.62.11:35801] for session rpc_address(10.162.62.11:35801) is inactive
2023-02-14 22:37:13,355 [nioEventLoopGroup-7-2] INFO  - rpc_address(10.162.62.11:35801): mark the session to be disconnected from state=CONNECTED
2023-02-14 22:37:13,355 [nioEventLoopGroup-7-3] WARN  - Channel [id: 0xe942d2fa, L:/10.162.4.3:35711 ! R:/10.132.43.24:35801] for session rpc_address(10.132.43.24:35801) is inactive
2023-02-14 22:37:13,355 [nioEventLoopGroup-7-3] INFO  - rpc_address(10.132.43.24:35801): mark the session to be disconnected from state=CONNECTED
2023-02-14 22:37:13,355 [nioEventLoopGroup-7-1] WARN  - Channel [id: 0xc5748ec6, L:/10.162.4.3:38554 ! R:/10.162.64.31:35801] for session rpc_address(10.162.64.31:35801) is inactive
2023-02-14 22:37:13,355 [nioEventLoopGroup-7-1] INFO  - rpc_address(10.162.64.31:35801): mark the session to be disconnected from state=CONNECTED
2023-02-14 22:37:13,355 [nioEventLoopGroup-7-3] WARN  - Channel [id: 0x5a6ea942, L:/10.162.4.3:63661 ! R:/10.132.42.43:35801] for session rpc_address(10.132.42.43:35801) is inactive
2023-02-14 22:37:13,355 [nioEventLoopGroup-7-3] INFO  - rpc_address(10.132.42.43:35801): mark the session to be disconnected from state=CONNECTED
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-1] WARN  - Channel [id: 0x99203515, L:/10.162.4.3:16570 ! R:/10.132.43.4:35801] for session rpc_address(10.132.43.4:35801) is inactive
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-1] INFO  - rpc_address(10.132.43.4:35801): mark the session to be disconnected from state=CONNECTED
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-4] WARN  - Channel [id: 0xc7b622b6, L:/10.162.4.3:61679 ! R:/10.162.62.32:35801] for session rpc_address(10.162.62.32:35801) is inactive
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-4] INFO  - rpc_address(10.162.62.32:35801): mark the session to be disconnected from state=CONNECTED
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-2] WARN  - Channel [id: 0x4c823fb1, L:/10.162.4.3:21723 ! R:/10.162.65.1:35801] for session rpc_address(10.162.65.1:35801) is inactive
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-2] INFO  - rpc_address(10.162.65.1:35801): mark the session to be disconnected from state=CONNECTED
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-4] WARN  - Channel [id: 0x64beb514, L:/10.162.4.3:59103 ! R:/10.132.42.23:35801] for session rpc_address(10.132.42.23:35801) is inactive
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-4] INFO  - rpc_address(10.132.42.23:35801): mark the session to be disconnected from state=CONNECTED
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-3] WARN  - Channel [id: 0x2709b68d, L:/10.162.4.3:41936 ! R:/10.132.43.2:35801] for session rpc_address(10.132.43.2:35801) is inactive
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-3] INFO  - rpc_address(10.132.43.2:35801): mark the session to be disconnected from state=CONNECTED
2023-02-14 22:37:13,356 [nioEventLoopGroup-7-4] WARN  - Channel [id: 0x91f9c939, L:/10.162.4.3:30249 ! R:/10.132.42.3:35801] for session rpc_address(10.132.42.3:35801) is inactive

当然,这个问题看起来也很奇怪,它有一定的偶然性:客户端实例离上次重启有几个月的时间,突然报此问题,且仅有部分实例报此问题。又有一定的必然性:同时有三个实例在同一时间都出现了这一问题。
我初步怀疑可能当时同一机房的这三个实例在那个时间点出现了一定的网络抖动,导致nioEventLoopGroup进入了一个错误状态,未能恢复。但我没有足够的证据。
如果需要更多的日志信息,可以与我联系。

@kkk6285137 kkk6285137 added the type/bug This issue reports a bug. label Feb 16, 2023
@acelyc111
Copy link
Member

@kkk6285137 Thanks for your reporting!

请问Pegasus及java client的版本是怎样的?

@kkk6285137
Copy link
Author

jkd版本:openjdk version "1.8.0_202"
pegasus版本:https://github.com/XiaoMi/pegasus-java-client/tree/1.11-thrift-0.11.0-inlined-release

@acelyc111
Copy link
Member

acelyc111 commented Mar 4, 2023

@kkk6285137 抱歉这个还没找到原因,可以加微信详聊下吗?
微信号:acelyc111

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good-first-issue type/bug This issue reports a bug.
Projects
None yet
Development

No branches or pull requests

2 participants