[CI] ConcurrentSnapshotsIT testMasterFailoverOnFinalizationLoop failing #101876

Closed
tlrx opened this issue Nov 7, 2023 · 2 comments
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs medium-risk An open issue or test failure that is a medium risk to future releases Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. >test-failure Triaged test failures from CI

@tlrx
Member

tlrx commented Nov 7, 2023

This test failed twice today, each time because a listener was executed twice:

org.elasticsearch.snapshots.ConcurrentSnapshotsIT > testMasterFailoverOnFinalizationLoop FAILED
    com.carrotsearch.randomizedtesting.UncaughtExceptionError: Captured an uncaught exception in thread: Thread[id=7911, name=Thread-51, state=RUNNABLE, group=TGRP-ConcurrentSnapshotsIT]
        at __randomizedtesting.SeedInfo.seed([54D1EB587A8E7FD5:55807F457CEDE6BA]:0)

        Caused by:
        java.lang.AssertionError: org.elasticsearch.ElasticsearchException: org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/ChannelActionListener{TaskTransportChannel{task=73}{TcpTransportChannel{req=33}{cluster:admin/snapshot/delete}{Netty4TcpChannel{localAddress=/127.0.0.1:20923, remoteAddress=/127.0.0.1:45894, profile=default}}}}/org.elasticsearch.action.support.master.TransportMasterNodeAction$$Lambda/0x00007f15e4953ce0@f49da3c
            at __randomizedtesting.SeedInfo.seed([54D1EB587A8E7FD5]:0)
            at org.elasticsearch.action.ActionListener$4.assertFirstRun(ActionListener.java:324)
            at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:335)
            at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.handleException(TransportMasterNodeAction.java:269)

A similar test failure was reported in #101652 and fixed by #99355, but because the stack trace is different, I am reporting it as a new issue.

https://gradle-enterprise.elastic.co/s/taabgxondrawu/
https://gradle-enterprise.elastic.co/s/vs5aky72bagkk/

Build scan:
https://gradle-enterprise.elastic.co/s/vs5aky72bagkk/tests/:server:internalClusterTest/org.elasticsearch.snapshots.ConcurrentSnapshotsIT/testMasterFailoverOnFinalizationLoop
Reproduction line:

./gradlew ':server:internalClusterTest' --tests "org.elasticsearch.snapshots.ConcurrentSnapshotsIT.testMasterFailoverOnFinalizationLoop" -Dtests.seed=A9C979C6EBC7A9E5 -Dtests.locale=pt-BR -Dtests.timezone=US/Pacific -Druntime.java=21

Applicable branches:
main, 8.11

Reproduces locally?:
Didn't try

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.snapshots.ConcurrentSnapshotsIT&tests.test=testMasterFailoverOnFinalizationLoop

Failure excerpt:

java.lang.AssertionError: org.elasticsearch.ElasticsearchException: org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/ChannelActionListener{TaskTransportChannel{task=75}{TcpTransportChannel{req=21}{cluster:admin/snapshot/delete}{Netty4TcpChannel{localAddress=/127.0.0.1:18582, remoteAddress=/127.0.0.1:37788, profile=default}}}}/org.elasticsearch.action.support.master.TransportMasterNodeAction$$Lambda/0x00007ff8b890f628@446be68a

  at __randomizedtesting.SeedInfo.seed([A9C979C6EBC7A9E5]:0)
  at org.elasticsearch.action.ActionListener$4.assertFirstRun(ActionListener.java:324)
  at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:335)
  at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$1.handleException(TransportMasterNodeAction.java:269)
  at org.elasticsearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1421)
  at org.elasticsearch.transport.InboundHandler.doHandleException(InboundHandler.java:475)
  at org.elasticsearch.transport.InboundHandler.handleException(InboundHandler.java:462)
  at org.elasticsearch.transport.InboundHandler.handlerResponseError(InboundHandler.java:453)
  at org.elasticsearch.transport.InboundHandler.executeResponseHandler(InboundHandler.java:145)
  at org.elasticsearch.transport.InboundHandler.messageReceived(InboundHandler.java:122)
  at org.elasticsearch.transport.InboundHandler.inboundMessage(InboundHandler.java:96)
  at org.elasticsearch.transport.TcpTransport.inboundMessage(TcpTransport.java:832)
  at org.elasticsearch.transport.InboundPipeline.forwardFragments(InboundPipeline.java:124)
  at org.elasticsearch.transport.InboundPipeline.doHandleBytes(InboundPipeline.java:96)
  at org.elasticsearch.transport.InboundPipeline.handleBytes(InboundPipeline.java:61)
  at org.elasticsearch.transport.netty4.Netty4MessageInboundHandler.channelRead(Netty4MessageInboundHandler.java:48)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
  at io.netty.handler.codec.MessageToMessageDecoder.channelRead(MessageToMessageDecoder.java:103)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:444)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
  at io.netty.channel.AbstractChannelHandlerContext.fireChannelRead(AbstractChannelHandlerContext.java:412)
  at io.netty.channel.DefaultChannelPipeline$HeadContext.channelRead(DefaultChannelPipeline.java:1410)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:440)
  at io.netty.channel.AbstractChannelHandlerContext.invokeChannelRead(AbstractChannelHandlerContext.java:420)
  at io.netty.channel.DefaultChannelPipeline.fireChannelRead(DefaultChannelPipeline.java:919)
  at io.netty.channel.nio.AbstractNioByteChannel$NioByteUnsafe.read(AbstractNioByteChannel.java:166)
  at io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:788)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeysPlain(NioEventLoop.java:689)
  at io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:652)
  at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:562)
  at io.netty.util.concurrent.SingleThreadEventExecutor$4.run(SingleThreadEventExecutor.java:997)
  at io.netty.util.internal.ThreadExecutorMap$2.run(ThreadExecutorMap.java:74)
  at java.lang.Thread.run(Thread.java:1583)

  Caused by: org.elasticsearch.ElasticsearchException: org.elasticsearch.action.ActionListenerImplementations$RunBeforeActionListener/ChannelActionListener{TaskTransportChannel{task=75}{TcpTransportChannel{req=21}{cluster:admin/snapshot/delete}{Netty4TcpChannel{localAddress=/127.0.0.1:18582, remoteAddress=/127.0.0.1:37788, profile=default}}}}/org.elasticsearch.action.support.master.TransportMasterNodeAction$$Lambda/0x00007ff8b890f628@446be68a

    at org.elasticsearch.action.ActionListener$4.assertFirstRun(ActionListener.java:323)
    at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:335)
    at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction.lambda$doStart$2(TransportMasterNodeAction.java:233)
    at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.acceptException(ActionListenerImplementations.java:186)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListenerImplementations$DelegatingResponseActionListener.onFailure(ActionListenerImplementations.java:191)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    at org.elasticsearch.action.ActionListenerImplementations.safeAcceptException(ActionListenerImplementations.java:62)
    at org.elasticsearch.action.ActionListenerImplementations.safeOnFailure(ActionListenerImplementations.java:73)
    at org.elasticsearch.action.DelegatingActionListener.onFailure(DelegatingActionListener.java:27)
    at org.elasticsearch.action.support.ContextPreservingActionListener.onFailure(ContextPreservingActionListener.java:39)
    at org.elasticsearch.action.ActionListener.onFailure(ActionListener.java:247)
    at org.elasticsearch.snapshots.SnapshotsService.failListenersIgnoringException(SnapshotsService.java:2763)
    at org.elasticsearch.snapshots.SnapshotsService.failAllListenersOnMasterFailOver(SnapshotsService.java:2479)
    at org.elasticsearch.snapshots.SnapshotsService$RemoveSnapshotDeletionAndContinueTask.onFailure(SnapshotsService.java:2532)
    at org.elasticsearch.cluster.service.MasterService$ExecutionResult.notifyFailure(MasterService.java:975)
    at org.elasticsearch.cluster.service.MasterService.executeAndPublishBatch(MasterService.java:223)
    at org.elasticsearch.cluster.service.MasterService$BatchingTaskQueue$Processor.lambda$run$2(MasterService.java:1626)
    at org.elasticsearch.action.ActionListener.run(ActionListener.java:368)
    at org.elasticsearch.cluster.service.MasterService$BatchingTaskQueue$Processor.run(MasterService.java:1623)
    at org.elasticsearch.cluster.service.MasterService$5.lambda$doRun$0(MasterService.java:1237)
    at org.elasticsearch.action.ActionListener.run(ActionListener.java:368)
    at org.elasticsearch.cluster.service.MasterService$5.doRun(MasterService.java:1216)
    at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:983)
    at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642)
    at java.lang.Thread.run(Thread.java:1583)

@tlrx tlrx added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Nov 7, 2023
@elasticsearchmachine elasticsearchmachine added blocker Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. labels Nov 7, 2023
@elasticsearchmachine
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@volodk85 volodk85 self-assigned this Nov 7, 2023
@volodk85 volodk85 added medium-risk An open issue or test failure that is a medium risk to future releases and removed blocker labels Nov 7, 2023
@craigtaverner
Contributor

This failed this morning, and the test history shows four failures in the last week. According to the test-triage rules, we should mute a test that fails more than once, so I opened the mute PR at #102404.

aniljangirdev pushed a commit to aniljangirdev/elasticsearch that referenced this issue Nov 21, 2023
This failed four times in the last week. The issue to fix is elastic#101876
elasticsearchmachine pushed a commit that referenced this issue Nov 27, 2023
Snapshot listeners can be resolved concurrently from two different task
threads, *clusterApplierService* and *masterService*. If a listener is a
**single action** listener, meaning that it must be resolved only once,
the following trace can occur (see #101876):

```
at org.elasticsearch.action.ActionListener$4.assertFirstRun(ActionListener.java:324)
at org.elasticsearch.action.ActionListener$4.onFailure(ActionListener.java:335)
```

Fix

Line up the *resolve listener* and *remove it from the tracking collection*
operations on snapshot listeners so that separate threads cannot invoke
the same listener twice.

Fix for #101876
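
The idea in the commit message above can be illustrated with a minimal sketch. It is not the actual `SnapshotsService` change: the class `SnapshotListenerSketch`, the `Listener` interface, the `track`/`failAll` methods, and the `"delete-1"` key are all hypothetical names made up for illustration. The sketch assumes the tracking collection is a concurrent map and relies on `ConcurrentHashMap.remove` being atomic, so only the thread that actually removes the entry goes on to notify the listeners.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.atomic.AtomicInteger;

class SnapshotListenerSketch {

    /** Hypothetical single-shot listener; a stand-in, not the Elasticsearch ActionListener API. */
    interface Listener {
        void onFailure(Exception e);
    }

    /** Counts invocations so that a double completion would be visible. */
    static final class CountingListener implements Listener {
        final AtomicInteger calls = new AtomicInteger();

        @Override
        public void onFailure(Exception e) {
            calls.incrementAndGet();
        }
    }

    // Tracking collection: hypothetical operation key -> listeners waiting on that operation.
    private final Map<String, List<Listener>> listenersByKey = new ConcurrentHashMap<>();

    /** Registers a listener; the list is only mutated inside the atomic compute call. */
    void track(String key, Listener listener) {
        listenersByKey.compute(key, (k, existing) -> {
            List<Listener> list = existing == null ? new ArrayList<>() : existing;
            list.add(listener);
            return list;
        });
    }

    /**
     * Removes the listeners from tracking first, then fails them.
     * Only one caller can receive a non-null list for a given key,
     * so each tracked listener is invoked at most once.
     */
    void failAll(String key, Exception e) {
        List<Listener> removed = listenersByKey.remove(key);
        if (removed == null) {
            return; // another thread already took ownership of these listeners
        }
        for (Listener listener : removed) {
            listener.onFailure(e);
        }
    }

    /** Two threads race to fail the same listener; returns how often it was invoked. */
    static int demo() {
        SnapshotListenerSketch service = new SnapshotListenerSketch();
        CountingListener listener = new CountingListener();
        service.track("delete-1", listener);

        Runnable fail = () -> service.failAll("delete-1", new IllegalStateException("no longer master"));
        Thread applier = new Thread(fail, "applier-sketch");
        Thread master = new Thread(fail, "master-sketch");
        applier.start();
        master.start();
        try {
            applier.join();
            master.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return listener.calls.get();
    }

    public static void main(String[] args) {
        System.out.println("listener invocations: " + demo()); // prints "listener invocations: 1"
    }
}
```

If `failAll` instead read the list, notified the listeners, and only then removed the entry, both threads could see the same list and complete the same listener twice, which is the kind of double invocation that `assertFirstRun` reports in the traces above.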
timgrein pushed a commit to timgrein/elasticsearch that referenced this issue Nov 30, 2023
…102439)
2lambda123 pushed a commit to 2lambda123/elastic-elasticsearch that referenced this issue May 3, 2024
This failed four times in the last week. The issue to fix is elastic/elasticsearch#101876