
[Segment Replication] Review Cross Cluster Replication compatibility with Segment Replication Enabled #3823

Closed
Tracked by #2194
Rishikesh1159 opened this issue Jul 8, 2022 · 28 comments
Assignees
Labels
enhancement (Enhancement or improvement to existing feature or request), Indexing:Replication (Issues and PRs related to core replication framework, e.g. segrep)

Comments

@Rishikesh1159
Member

Is your feature request related to a problem? Please describe.
As part of this issue, we want to identify whether a cluster with Segment Replication enabled has any issues with Cross Cluster Replication. If there are issues, identify the changes needed to make them compatible.

Describe the solution you'd like
-> Run clusters locally with segment replication enabled and verify that CCR (Cross Cluster Replication) works as expected and does not break.
-> Add an integration test with segment replication enabled and CCR, and make sure this test passes.

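For reference, opting an index into segment replication is done per index via the `index.replication.type` setting at creation time. A minimal sketch of the create-index request body (index name, shard/replica counts, and helper name are illustrative, not from this issue):

```python
import json

def segrep_index_body(shards=1, replicas=1):
    """Build a create-index request body that opts the index into
    segment replication via index.replication.type = SEGMENT."""
    return {
        "settings": {
            "index": {
                "number_of_shards": shards,
                "number_of_replicas": replicas,
                "replication.type": "SEGMENT",
            }
        }
    }

# Would be sent as:  PUT /leader-index  with this JSON body
body = segrep_index_body(shards=1, replicas=1)
print(json.dumps(body, indent=2))
```

With the setting omitted, the index falls back to the default (document) replication, which is what the comparison runs below exercise.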
@Rishikesh1159 Rishikesh1159 added enhancement Enhancement or improvement to existing feature or request untriaged distributed framework and removed untriaged labels Jul 8, 2022
@Rishikesh1159 Rishikesh1159 self-assigned this Jul 8, 2022
@Rishikesh1159
Member Author

Rishikesh1159 commented Jul 18, 2022

Tested running the CCR plugin (Gradle check/integTests) against all segment replication changes. All of this verification was done on a local machine.

Setup:
-> Cloned the OpenSearch repo locally and cherry-picked all segment replication commits into the 2.x branch, resolving a few merge conflicts in the process.
-> Published this build to mavenLocal.
-> Cloned the cross-cluster-replication and common-utils plugin repos and changed the snapshot version in build.gradle to whatever the 2.x branch builds to.
-> Published common-utils to mavenLocal and resolved any errors when running the CCR plugin.

Findings:
-> Gradle check and all integTests pass with segment replication both enabled and disabled.
-> When clusters are started using ./gradlew clean run -PnumNodes=3, OpenSearch logs are generated in CCR, for example: build/testclusters/leaderCluster-0/logs/leaderCluster.log
-> Tried replicating a few docs by following the steps mentioned here: Getting Started. All steps completed successfully.
-> Conclusion: none of the segment replication changes so far break the CCR plugin.

Next Steps:
-> Backport all phase 1 segment replication changes to the 2.x branch.
-> Make the CCR plugin compatible with the 2.x branch and make sure it works with segment replication enabled or disabled.
-> Write integration tests on the CCR plugin side to make sure everything works as expected with segment replication enabled or disabled.

@Bukhtawar
Collaborator

/cc : @ankitkala

@ankitkala
Member

Yep, overall the testing looks good to me. Segment replication for replicas doesn't break logical cross-cluster replication.

@ankitkala
Member

ankitkala commented Jul 22, 2022

Hey @Rishikesh1159, I had a few follow-up questions.

  1. While running the integ tests, was segment replication enabled for all the indices (i.e. the setting enabled by default)? I just want to ensure the tests pass with CCR on leader indices that have segment replication enabled.
  2. Can you also do a quick test where the leader index has replicas doing segment copy? For CCR we try to load balance: the follower shard pulls translog operations from the leader's primary as well as its replica shards. I want to verify that fetching from a replica shard also happens as expected.

@Rishikesh1159
Member Author

Hey @ankitkala, sorry for the late response, and thanks for pointing these out. I realized that when running the CCR plugin's integ test suite, I didn't have segment replication enabled. There is no setting that turns segment replication on for all indices by default; we have to add the index setting .put(IndexMetadata.SETTING_REPLICATION_TYPE, ReplicationType.SEGMENT) to each integ test manually before running it.

There is currently no way to enable segrep in one place and run the entire CCR integTest suite. Instead, we have to add the setting to each integTest class manually before running it, and only then can we successfully verify that CCR works with segrep enabled.

Sorry for missing this part earlier. I will start testing the two scenarios you mentioned in the above comment.

@Rishikesh1159
Member Author

Rishikesh1159 commented Aug 8, 2022

  • Set everything up before testing using these steps.
  • Run these two tests with segment replication enabled at the index level and make sure they pass.
  • If either of the two tests fails, open new issues for the failures.
  • After those two tests, run the remaining integTests manually and open issues for any failures.

Manually testing each integTest is no longer needed after trying Ankit's suggestion of enabling segrep as the default index setting.

@ankitkala
Member

I think you can build OpenSearch locally such that the feature flag opensearch.experimental.feature.replication_type.enabled and the index setting index.replication.type default to segment replication. With this, you won't have to make any additional changes to the CCR tests.

@Rishikesh1159
Member Author

Thanks @ankitkala for the suggestion. I tried setting segrep as the default replication type and ran the CCR integTests. A lot of test cases are failing, most of them with the same failures. Here are a few:

REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.replication.integ.rest.ClusterRerouteFollowerIT.test replication works after rerouting a shard from one node to another in follower cluster" -Dtests.seed=1021840EC08B8CCA -Dtests.security.manager=true -Dtests.locale=zh-Hant-HK -Dtests.timezone=America/Paramaribo -Druntime.java=17
 
org.opensearch.replication.integ.rest.ClusterRerouteFollowerIT > test replication works after rerouting a shard from one node to another in follower cluster FAILED
    org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:43871], URI [/_plugins/_replication/follower_index/_start?wait_for_restore=false], status line [HTTP/1.1 500 Internal Server Error]
    {"error":{"root_cause":[{"type":"illegal_state_exception","reason":"Timed out when waiting for persistent task after 30s"}],"type":"illegal_state_exception","reason":"Timed out when waiting for persistent task after 30s"},"status":500}
        at __randomizedtesting.SeedInfo.seed([1021840EC08B8CCA:CA23175B7A0C1250]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at app//org.opensearch.replication.ReplicationHelpersKt.startReplication(ReplicationHelpers.kt:91)
        at app//org.opensearch.replication.ReplicationHelpersKt.startReplication$default(ReplicationHelpers.kt:60)
        at app//org.opensearch.replication.integ.rest.ClusterRerouteFollowerIT.test replication works after rerouting a shard from one node to another in follower cluster(ClusterRerouteFollowerIT.kt:44)
 
 
Suite: Test class org.opensearch.replication.integ.rest.ClusterRerouteFollowerIT
  1> [2022-08-09T14:53:35,199][INFO ][o.o.r.i.r.ClusterRerouteFollowerIT] [test replication works after rerouting a shard from one node to another in follower cluster] before test
  2> Aug 09, 2022 2:53:35 PM org.opensearch.client.RestClient logResponse
  2> WARNING: request [PUT http://[::1]:37555/_template/all] returned 1 warnings: [299 OpenSearch-2.3.0-SNAPSHOT-f8bc77074508da63049eecacfa5cbcefa7cd00a6 "Deprecated field [template] used, replaced by [index_patterns]"]
  1> [2022-08-09T14:54:05,563][INFO ][o.o.r.i.r.ClusterRerouteFollowerIT] [test replication works after rerouting a shard from one node to another in follower cluster] after test
  2> REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.replication.integ.rest.ClusterRerouteFollowerIT.test replication works after rerouting a shard from one node to another in follower cluster" -Dtests.seed=1021840EC08B8CCA -Dtests.security.manager=true -Dtests.locale=zh-Hant-HK -Dtests.timezone=America/Paramaribo -Druntime.java=17
  2> org.opensearch.client.ResponseException: method [PUT], host [http://127.0.0.1:43871], URI [/_plugins/_replication/follower_index/_start?wait_for_restore=false], status line [HTTP/1.1 500 Internal Server Error]
    {"error":{"root_cause":[{"type":"illegal_state_exception","reason":"Timed out when waiting for persistent task after 30s"}],"type":"illegal_state_exception","reason":"Timed out when waiting for persistent task after 30s"},"status":500}
        at __randomizedtesting.SeedInfo.seed([1021840EC08B8CCA:CA23175B7A0C1250]:0)
        at app//org.opensearch.client.RestClient.convertResponse(RestClient.java:375)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:345)
        at app//org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at app//org.opensearch.replication.ReplicationHelpersKt.startReplication(ReplicationHelpers.kt:91)
        at app//org.opensearch.replication.ReplicationHelpersKt.startReplication$default(ReplicationHelpers.kt:60)
        at app//org.opensearch.replication.integ.rest.ClusterRerouteFollowerIT.test replication works after rerouting a shard from one node to another in follower cluster(ClusterRerouteFollowerIT.kt:44)
  2> NOTE: leaving temporary files on disk at: /home/rrpasham/Documents/cross-cluster-replication/build/testrun/integTest/temp/org.opensearch.replication.integ.rest.ClusterRerouteFollowerIT_1021840EC08B8CCA-001
  2> NOTE: test params are: codec=Asserting(Lucene92): {}, docValues:{}, maxPointsInLeafNode=970, maxMBSortInHeap=7.9275913059125305, sim=Asserting(RandomSimilarity(queryNorm=true): {}), locale=zh-Hant-HK, timezone=America/Paramaribo
  2> NOTE: Linux 5.15.0-1015-aws amd64/Oracle Corporation 17.0.2 (64-bit)/cpus=48,threads=1,free=370506496,total=536870912
  2> NOTE: All tests run in this JVM: [BasicReplicationIT, MultiClusterSetupIT, ReplicationIntegTestCaseIT, ClusterRerouteFollowerIT]
 
REPRODUCE WITH: ./gradlew ':integTest' --tests "org.opensearch.replication.integ.rest.ClusterRerouteLeaderIT.test replication works after rerouting a shard from one node to another in leader cluster" -Dtests.seed=1021840EC08B8CCA -Dtests.security.manager=true -Dtests.locale=en-MT -Dtests.timezone=America/Argentina/Tucuman -Druntime.java=17
 
org.opensearch.replication.integ.rest.ClusterRerouteLeaderIT > test replication works after rerouting a shard from one node to another in leader cluster FAILED
    java.net.ConnectException: Connection refused
        at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:953)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:332)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:335)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:351)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:351)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at org.opensearch.replication.MultiClusterRestTestCase.getPrimaryNodeForShard(MultiClusterRestTestCase.kt:487)
        at org.opensearch.replication.integ.rest.ClusterRerouteLeaderIT.test_replication_works_after_rerouting_a_shard_from_one_node_to_another_in_leader_cluster$lambda-2(ClusterRerouteLeaderIT.kt:62)
        at org.opensearch.test.OpenSearchTestCase.assertBusy(OpenSearchTestCase.java:1037)
        at org.opensearch.replication.integ.rest.ClusterRerouteLeaderIT.test replication works after rerouting a shard from one node to another in leader cluster(ClusterRerouteLeaderIT.kt:61)
 
        Caused by:
        java.net.ConnectException: Connection refused
            at java.base/sun.nio.ch.Net.pollConnect(Native Method)
            at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
            at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946)
            at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
            at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
            at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
            at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
            at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
            at java.base/java.lang.Thread.run(Thread.java:833)
 
    java.net.ConnectException: Connection refused
        at org.opensearch.client.RestClient.extractAndWrapCause(RestClient.java:953)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:332)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:335)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:351)
        at org.opensearch.client.RestClient.performRequest(RestClient.java:320)
        at org.opensearch.replication.MultiClusterRestTestCase.wipeIndicesFromCluster(MultiClusterRestTestCase.kt:397)
        at org.opensearch.replication.MultiClusterRestTestCase.wipeCluster(MultiClusterRestTestCase.kt:344)
        at org.opensearch.replication.MultiClusterRestTestCase.wipeClusters(MultiClusterRestTestCase.kt:339)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
        at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:77)
        at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
        at java.base/java.lang.reflect.Method.invoke(Method.java:568)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:996)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:44)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
        at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
        at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
        at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
        at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
        at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
        at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
        at org.junit.rules.RunRules.evaluate(RunRules.java:20)
        at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
        at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
        at java.base/java.lang.Thread.run(Thread.java:833)
 
        Caused by:
        java.net.ConnectException: Connection refused
            at java.base/sun.nio.ch.Net.pollConnect(Native Method)
            at java.base/sun.nio.ch.Net.pollConnectNow(Net.java:672)
            at java.base/sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:946)
            at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvent(DefaultConnectingIOReactor.java:174)
            at org.apache.http.impl.nio.reactor.DefaultConnectingIOReactor.processEvents(DefaultConnectingIOReactor.java:148)
            at org.apache.http.impl.nio.reactor.AbstractMultiworkerIOReactor.execute(AbstractMultiworkerIOReactor.java:351)
            at org.apache.http.impl.nio.conn.PoolingNHttpClientConnectionManager.execute(PoolingNHttpClientConnectionManager.java:221)
            at org.apache.http.impl.nio.client.CloseableHttpAsyncClientBase$1.run(CloseableHttpAsyncClientBase.java:64)
            ... 1 more

@ankitkala
Member

Can you also check the cluster logs? They should be under build/testclusters/followCluster-0/logs/followCluster.log and build/testclusters/leaderCluster-0/logs/leaderCluster.log.

@ankitkala
Member

So I was trying to manually test CCR with segment replication on the leader. I was able to test the happy case for two scenarios:

  1. Single-node cluster without replicas on the leader.
  2. Multi-node cluster with replicas configured on the leader index.

However, I observed two issues, which I've called out below. I was still able to test the end-to-end flow regardless of these issues.

As a next step, I'll verify that all CCR integration tests pass with segrep-enabled indices.
I've pushed a few minor changes to the CCR repo here,
which can be used in case anyone wants to test.


Issue 1. SegRep on system index

Since we were creating segment-replication-enabled indices by default for testing, CCR's system index had issues where reads after writes were inconsistent. I was able to work around this by enforcing logical replication on the system index. Hopefully this issue will be resolved by the time we make segment the default replication type.

[2022-08-16T20:32:09,376][WARN ][o.o.p.PersistentTasksClusterService] [followCluster-0] persistent task replication:[remote-index][0] failed
org.opensearch.ResourceNotFoundException: Metadata for remote-index doesn't exist
        at org.opensearch.replication.metadata.store.ReplicationMetadataStore.getMetadata(ReplicationMetadataStore.kt:146) ~[?:?]
        at org.opensearch.replication.metadata.store.ReplicationMetadataStore$getMetadata$1.invokeSuspend(ReplicationMetadataStore.kt) ~[?:?]
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33) ~[?:?]
        at kotlinx.coroutines.UndispatchedCoroutine.afterResume(CoroutineContext.kt:147) ~[?:?]
        at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102) ~[?:?]
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46) ~[?:?]
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106) ~[?:?]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
[2022-08-16T20:32:09,422][ERROR][o.o.r.t.s.ShardReplicationTask] [followCluster-0] [remote-index][0] Task failed due to ResourceNotFoundException[Metadata for remote-index doesn't exist]
        at org.opensearch.replication.metadata.store.ReplicationMetadataStore.getMetadata(ReplicationMetadataStore.kt:146)
        at org.opensearch.replication.metadata.store.ReplicationMetadataStore$getMetadata$1.invokeSuspend(ReplicationMetadataStore.kt)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:33)
        at kotlinx.coroutines.UndispatchedCoroutine.afterResume(CoroutineContext.kt:147)
        at kotlinx.coroutines.AbstractCoroutine.resumeWith(AbstractCoroutine.kt:102)
        at kotlin.coroutines.jvm.internal.BaseContinuationImpl.resumeWith(ContinuationImpl.kt:46)
        at kotlinx.coroutines.DispatchedTask.run(DispatchedTask.kt:106)
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747)
        at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
        at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
        at java.base/java.lang.Thread.run(Thread.java:832)


Issue 2. Segment Replication failing on the follower index

These are the segment replication failures I observed on the follower cluster. I don't think they are a side effect of doing cross-cluster replication, though.

Caused by: org.opensearch.cluster.coordination.CoordinationStateRejectedException: incoming term 5 does not match current term 6
	at org.opensearch.cluster.coordination.CoordinationState.handleJoin(CoordinationState.java:256) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.coordination.Coordinator.handleJoin(Coordinator.java:1172) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at java.util.Optional.ifPresent(Optional.java:176) ~[?:?]
	at org.opensearch.cluster.coordination.Coordinator.processJoinRequest(Coordinator.java:640) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.coordination.Coordinator.lambda$handleJoinRequest$7(Coordinator.java:603) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.transport.ClusterConnectionManager.connectToNode(ClusterConnectionManager.java:138) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.transport.TransportService.connectToNode(TransportService.java:437) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.transport.TransportService.connectToNode(TransportService.java:421) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.coordination.Coordinator.handleJoinRequest(Coordinator.java:588) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.coordination.JoinHelper.lambda$new$1(JoinHelper.java:184) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]

org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:235) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.ActionListener.completeWith(ActionListener.java:345) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.indices.replication.SegmentReplicationTarget.finalizeReplication(SegmentReplicationTarget.java:189) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.indices.replication.SegmentReplicationTarget.lambda$startReplication$2(SegmentReplicationTarget.java:147) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:126) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:343) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at java.util.ArrayList.forEach(ArrayList.java:1511) [?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.BaseFuture.set(BaseFuture.java:160) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture.onResponse(ListenableFuture.java:141) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.StepListener.innerOnResponse(StepListener.java:77) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.NotifyOnceListener.onResponse(NotifyOnceListener.java:55) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:161) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:161) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1369) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1369) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) [opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
        at java.lang.Thread.run(Thread.java:832) [?:?]
Caused by: org.opensearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
Caused by: org.opensearch.common.util.concurrent.UncategorizedExecutionException: Failed execution
        at org.opensearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:109) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.FutureUtils.rethrowExecutionException(FutureUtils.java:109) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:101) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:101) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:125) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:125) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        ... 35 more
        ... 35 more
Caused by: java.util.concurrent.ExecutionException: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /Volumes/ws/cross-cluster-replication/build/testclusters/followCluster-1/data/nodes/0/indices/p6-3kiEPQkmjSkdn5HvRyA/0/index/write.lock
Caused by: java.util.concurrent.ExecutionException: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /Volumes/ws/cross-cluster-replication/build/testclusters/followCluster-2/data/nodes/0/indices/p6-3kiEPQkmjSkdn5HvRyA/0/index/write.lock
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:286) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.getValue(BaseFuture.java:286) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:260) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.BaseFuture$Sync.get(BaseFuture.java:260) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:82) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.BaseFuture.get(BaseFuture.java:82) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:94) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.FutureUtils.get(FutureUtils.java:94) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:125) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.ListenableFuture$1.doRun(ListenableFuture.java:125) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
        ... 35 more
        ... 35 more
Caused by: org.apache.lucene.store.LockObtainFailedException: Lock held by this virtual machine: /Volumes/ws/cross-cluster-replication/build/testclusters/followCluster-1/data/nodes/0/indices/p6-3kiEPQkmjSkdn5HvRyA/0/index/write.lock

@dreamer-89 dreamer-89 self-assigned this Aug 17, 2022
dreamer-89 commented Aug 19, 2022

Thanks @ankitkala for sharing the error stack trace. I am able to reproduce the errors above by running the integ tests (./gradlew integTest).

I noticed that both primary and replica use ReplicationEngine (which extends InternalEngine from OpenSearch), though NRTReplicationEngine is needed on replica shards for segment replication to work.
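The engine-selection gap can be sketched as follows. This is a minimal stand-in in plain Java, not the actual OpenSearch EngineFactory API; the types and names here are illustrative only.

```java
import java.util.function.Function;

public class EngineFactorySketch {
    public enum ReplicationType { DOCUMENT, SEGMENT }

    // Hypothetical trimmed-down view of an engine config.
    public record EngineConfig(ReplicationType replicationType, boolean isReadOnlyReplica) {}

    public interface Engine {}
    public static class InternalEngine implements Engine {}
    public static class NRTReplicationEngine implements Engine {}

    // The selection replicas were missing: both primary and replica received
    // an InternalEngine, but segrep replicas need an NRT-style read-only engine.
    public static final Function<EngineConfig, Engine> FACTORY = config ->
        (config.replicationType() == ReplicationType.SEGMENT && config.isReadOnlyReplica())
            ? new NRTReplicationEngine()
            : new InternalEngine();

    public static void main(String[] args) {
        Engine replica = FACTORY.apply(new EngineConfig(ReplicationType.SEGMENT, true));
        Engine primary = FACTORY.apply(new EngineConfig(ReplicationType.SEGMENT, false));
        System.out.println(replica.getClass().getSimpleName()); // NRTReplicationEngine
        System.out.println(primary.getClass().getSimpleName()); // InternalEngine
    }
}
```

The point of the sketch is only the branch condition: a replica shard of a segment-replicated index must get a different engine than the primary.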

dreamer-89 commented Aug 19, 2022

Tried to update the engine factory lambda in opensearch-project/cross-cluster-replication#486.

The test failures surface as timeout errors. This happens because INDEX_REPLICATION_TYPE_SETTING is not getting set, so fetching an engine instance fails. Adding the INDEX_REPLICATION_TYPE_SETTING setting while creating the index doesn't work either.

    {"error":{"root_cause":[{"type":"illegal_state_exception","reason":"Timed out when waiting for persistent task after 30s"}],"type":"illegal_state_exception","reason":"Timed out when waiting for persistent task after 30s"},"status":500}

Meanwhile, the follower cluster's primary node shows org.opensearch.indices.recovery.RecoveryFailedException (trace below). This appears to be a case of empty/null settings tripping the engine factory method while it reads the index settings.

...
Caused by: java.lang.NullPointerException: Cannot invoke "String.equals(Object)" because the return value of "org.opensearch.common.settings.Settings.get(String)" is null
	at org.opensearch.replication.ReplicationPlugin.getEngineFactory$lambda-15(ReplicationPlugin.kt:367) ~[?:?]
	at org.opensearch.index.shard.IndexShard.innerOpenEngineAndTranslog(IndexShard.java:1965) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:1929) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.index.shard.StoreRecovery.lambda$restore$7(StoreRecovery.java:549) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	... 23 more
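The NPE above comes from comparing against a setting value that can be null. A null-safe version of that lookup can be sketched like this (stand-in code with hypothetical names; the real fix belongs in the plugin's engine factory lambda, and the setting key is assumed to be "index.replication.type"):

```java
import java.util.Map;

public class SettingsGuardSketch {
    // Null-safe: an absent setting is treated as the default (document
    // replication) instead of triggering String.equals on a null receiver.
    public static boolean isSegRepEnabled(Map<String, String> indexSettings) {
        return "SEGMENT".equals(indexSettings.get("index.replication.type"));
    }

    public static void main(String[] args) {
        System.out.println(isSegRepEnabled(Map.of())); // false, no NPE
        System.out.println(isSegRepEnabled(Map.of("index.replication.type", "SEGMENT"))); // true
    }
}
```

Putting the constant on the left of equals() is what makes the missing-setting case safe.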

dreamer-89 commented Aug 21, 2022

The tests were failing with resource_not_found_exception, possibly because the CCR system index .replication-metadata-store needs to be document-replicated for immediate-read semantics (suggested by @ankitkala, #3823 (comment)).

Code Branches Used

OpenSearch: dreamer-89@14c48e2 (feature_flag on engine, hard-coding index type to SEGMENT)
CCR: dreamer-89/cross-cluster-replication@c179a08 (hard code DOCUMENT type for system index .replication-metadata-store, which gets updated to SEGMENT due to engine side changes)
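The hard-coding described above can be sketched as follows (hypothetical helper, not the actual plugin code; the special-casing of the system index name is the only point being illustrated):

```java
public class ReplicationTypeSketch {
    // The CCR metadata system index needs read-after-write semantics, so it
    // stays document-replicated even when segrep is forced elsewhere.
    public static String replicationTypeFor(String indexName) {
        if (".replication-metadata-store".equals(indexName)) {
            return "DOCUMENT";
        }
        return "SEGMENT";
    }

    public static void main(String[] args) {
        System.out.println(replicationTypeFor(".replication-metadata-store")); // DOCUMENT
        System.out.println(replicationTypeFor("my-index")); // SEGMENT
    }
}
```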

dreamer-89 commented Aug 21, 2022

I re-ran the tests with the fix above, along with the changes captured in the branches above. The tests now pass successfully.

ubuntu@ip-172-31-41-165:~/cross-cluster-replication$ ./gradlew integTest
=======================================
OpenSearch Build Hamster says Hello!
  Gradle Version        : 7.4.2
  OS Info               : Linux 5.15.0-1017-aws (amd64)
  JDK Version           : 17 (OpenJDK)
  JAVA_HOME             : /home/ubuntu/sw/jdk-17.0.1
  Random Testing Seed   : 4524C9251D7BF75B
  In FIPS 140 mode      : false
=======================================

> Task :integTest
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=warn; support was removed in 17.0
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.opensearch.bootstrap.BootstrapForTesting (file:/home/ubuntu/.m2/repository/org/opensearch/test/framework/2.3.0-SNAPSHOT/framework-2.3.0-SNAPSHOT.jar)
WARNING: Please consider reporting this to the maintainers of org.opensearch.bootstrap.BootstrapForTesting
WARNING: System::setSecurityManager will be removed in a future release
WARNING: A terminally deprecated method in java.lang.System has been called
WARNING: System::setSecurityManager has been called by org.gradle.api.internal.tasks.testing.worker.TestWorker (file:/home/ubuntu/.gradle/wrapper/dists/gradle-7.4.2-all/9uukhhbclvbegdvsww0j0cr3p/gradle-7.4.2/lib/plugins/gradle-testing-base-7.4.2.jar)
WARNING: Please consider reporting this to the maintainers of org.gradle.api.internal.tasks.testing.worker.TestWorker
WARNING: System::setSecurityManager will be removed in a future release

> Task :test
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=warn; support was removed in 17.0
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=warn; support was removed in 17.0
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=warn; support was removed in 17.0
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=warn; support was removed in 17.0
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=warn; support was removed in 17.0
OpenJDK 64-Bit Server VM warning: Ignoring option --illegal-access=warn; support was removed in 17.0

Deprecated Gradle features were used in this build, making it incompatible with Gradle 8.0.

You can use '--warning-mode all' to show the individual deprecation warnings and determine if they come from your own scripts or plugins.

See https://docs.gradle.org/7.4.2/userguide/command_line_interface.html#sec:command_line_warnings

BUILD SUCCESSFUL in 10m 37s
12 actionable tasks: 3 executed, 9 up-to-date

@ankitkala : As existing tests seem to work, I will be closing this issue. Please let me know if there are any open issues.

@ankitkala

Great! Thanks @dreamer-89 . Can you also share a snapshot of the test result?
It should be present in cross-cluster-replication/build/reports/tests/integTest/index.html

dreamer-89 commented Aug 22, 2022

Great! Thanks @dreamer-89 . Can you also share a snapshot of the test result? It should be present in cross-cluster-replication/build/reports/tests/integTest/index.html

Thanks @ankitkala. Uploaded the report below.
integTest.zip

Instance: c5.24xlarge
OS: ubuntu

For visibility pasting the report snapshot as well.

Screen Shot 2022-08-21 at 7 18 32 PM

@ankitkala

Thanks. Looks good.

dreamer-89 commented Aug 23, 2022

@ankitkala : From a segment replication compatibility perspective, I have only run the integ tests (./gradlew integTest), which pass. Please let me know if there are any additional tests (or manual steps) that can confirm compatibility.

Also, I would suggest running the integ and manual tests on your end before we close out this issue.

CC @mch2

@dreamer-89

@ankitkala : Just wanted to check if you got a chance to verify that CCR and segment replication are compatible and working as expected.

ankitkala commented Aug 31, 2022

Hey, apologies for the delayed response. I was able to test the changes from the following branches, and the CCR tests ran successfully.
CCR Branch: https://github.com/dreamer-89/cross-cluster-replication/commits/temp
Opensearch: https://github.com/dreamer-89/OpenSearch/commits/ccr_testing

However, I observed that the replicas on the follower didn't load NRTReplicationEngine. After fixing that (i.e., checking only config.isReadOnlyReplica when loading the engine inside the CCR plugin), the CCR tests started failing with the exceptions below on the follower cluster. Basically, cross-cluster replication worked as expected, but I'm assuming there was some issue with stop replication.

[2022-09-01T15:52:22,462][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [followCluster-0] fatal error in thread [opensearch[followCluster-0][clusterApplierService#updateTask][T#1]], exiting
java.lang.AssertionError: should not be called by a cluster state applier. reason [the applied cluster state is not yet available]
	at org.opensearch.cluster.service.ClusterApplierService.assertNotCalledFromClusterStateApplier(ClusterApplierService.java:427) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.state(ClusterApplierService.java:211) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterService.state(ClusterService.java:170) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterService.localNode(ClusterService.java:154) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.indices.replication.checkpoint.PublishCheckpointAction.publish(PublishCheckpointAction.java:108) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.indices.replication.checkpoint.SegmentReplicationCheckpointPublisher.publish(SegmentReplicationCheckpointPublisher.java:36) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.index.shard.CheckpointRefreshListener.afterRefresh(CheckpointRefreshListener.java:44) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.apache.lucene.search.ReferenceManager.notifyRefreshListenersRefreshed(ReferenceManager.java:275) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
	at org.apache.lucene.search.ReferenceManager.doMaybeRefresh(ReferenceManager.java:182) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
	at org.apache.lucene.search.ReferenceManager.maybeRefreshBlocking(ReferenceManager.java:240) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
	at org.opensearch.index.engine.InternalEngine.refresh(InternalEngine.java:1840) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.index.engine.InternalEngine.flush(InternalEngine.java:1969) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.index.engine.Engine.flush(Engine.java:1179) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.index.engine.Engine.flushAndClose(Engine.java:1939) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.index.shard.IndexShard.close(IndexShard.java:1631) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.index.IndexService.closeShard(IndexService.java:586) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.index.IndexService.removeShard(IndexService.java:566) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.index.IndexService.close(IndexService.java:365) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:898) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.indices.cluster.IndicesClusterStateService.removeIndices(IndicesClusterStateService.java:429) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.indices.cluster.IndicesClusterStateService.applyClusterState(IndicesClusterStateService.java:270) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:606) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.callClusterStateAppliers(ClusterApplierService.java:593) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.applyChanges(ClusterApplierService.java:561) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService.runTask(ClusterApplierService.java:484) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.cluster.service.ClusterApplierService$UpdateTask.run(ClusterApplierService.java:186) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedOpenSearchThreadPoolExecutor.java:282) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.PrioritizedOpenSearchThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedOpenSearchThreadPoolExecutor.java:245) ~[opensearch-2.3.0-SNAPSHOT.jar:2.3.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630) ~[?:?]
	at java.lang.Thread.run(Thread.java:832) [?:?]

@dreamer-89

Hey, apologies for the delayed response. I was able to test out the changes from following branches and CCR tests did run successfully. CCR Branch: https://github.com/dreamer-89/cross-cluster-replication/commits/temp Opensearch: https://github.com/dreamer-89/OpenSearch/commits/ccr_testing

However I did observed that the replicas on the follower didn't load NRTReplicationEngine. Upon fixing that(i.e. only check for config.isReadOnlyReplica for loading the engine inside CCR plugin), CCR tests started failing with these exceptions on follower cluster. Basically, the cross cluster replication worked as expected but i'm assuming there was some issue with stop replication.


Thanks @ankitkala for sharing this. As discussed separately, I will try to run the Basic Integration Test to reproduce the above. I will update with my findings.

@ankitkala

I think the issue is that when we stop replication for CCR, we close and reopen the index to reload the engine.
The index close is handled by ClusterApplierService and IndicesClusterStateService, which eventually invoke Engine.flushAndClose, as seen in the stack trace above.
This flushAndClose triggers a refresh and subsequently PublishCheckpointAction.publish.

This leads to the stack trace above, where calls made from ClusterApplierService end up re-entering ClusterApplierService and failing the assertion.
This behaviour isn't specific to CCR and might also happen if we simply close the index. I think we might need to make the refresh listener mechanism asynchronous to avoid this cyclic call into ClusterApplierService.

mch2 commented Sep 2, 2022

I think the issue is that when we stop the replication for CCR, we close and reopen the index to reload the Engine. The index close is being handled by ClusterApplierService & IndicesClusterStateService which eventually is invoking Engine.flushAndClose as seen in the stacktrace above. This flushAndClose leads to a refresh and subsequently PublishCheckpointAction.publish.

This is leading to the stacktrace above where ClusterApplierService calls are ending up again going to ClusterApplierService and failing with assertion error. This behaviour isn't specific to CCR and might also happen if we try to close the index i guess. I think we might need to refactor the refresh listener mechanism to async maybe to avoid this cyclic call to ClusterApplierService.

What is the state of the shard during the refresh listener's publish? Shouldn't we block publish here if the shard is closed? Maybe we need another check on the shard state in the action?
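The shard-state guard suggested here could look roughly like this (hypothetical names, not the actual PublishCheckpointAction code): only publish a checkpoint when the shard is started, so a flush-on-close refresh cannot re-enter the cluster applier.

```java
public class PublishGuardSketch {
    // Simplified stand-in for IndexShardState.
    public enum ShardState { CREATED, RECOVERING, STARTED, CLOSED }

    // Publish checkpoints only from a fully started shard; a shard being
    // closed (as during CCR stop-replication) skips publication.
    public static boolean shouldPublish(ShardState state) {
        return state == ShardState.STARTED;
    }

    public static void main(String[] args) {
        System.out.println(shouldPublish(ShardState.STARTED)); // true
        System.out.println(shouldPublish(ShardState.CLOSED));  // false
    }
}
```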

dreamer-89 commented Sep 14, 2022

Apologies for the delay on this task; I got occupied with 2.3 release work. I re-looked into the integration with the latest changes from 2.x (2.4.0-SNAPSHOT).
I am seeing a different assertion failure after the stop replication action (stack trace below). The error happens because of an assertion that the engine type is InternalEngine, which doesn't hold true after the index is re-opened (the close/re-open of the stop-replication workflow). Digging more into this.

Error trace

[2022-09-13T18:21:42,976][ERROR][o.o.b.OpenSearchUncaughtExceptionHandler] [followCluster-0] fatal error in thread [opensearch[followCluster-0][fetch_shard_started][T#1]], exiting
java.lang.AssertionError: null
	at org.opensearch.index.shard.IndexShard.lambda$getProcessedLocalCheckpoint$17(IndexShard.java:2735) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at java.util.Optional.orElseGet(Optional.java:364) ~[?:?]
	at org.opensearch.index.shard.IndexShard.getProcessedLocalCheckpoint(IndexShard.java:2733) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at org.opensearch.index.shard.IndexShard.lambda$getLatestReplicationCheckpoint$6(IndexShard.java:1410) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at java.util.Optional.map(Optional.java:260) ~[?:?]
	at org.opensearch.index.shard.IndexShard.getLatestReplicationCheckpoint(IndexShard.java:1405) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at org.opensearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:214) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at org.opensearch.gateway.TransportNodesListGatewayStartedShards.nodeOperation(TransportNodesListGatewayStartedShards.java:81) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at org.opensearch.action.support.nodes.TransportNodesAction.nodeOperation(TransportNodesAction.java:200) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at org.opensearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:328) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at org.opensearch.action.support.nodes.TransportNodesAction$NodeTransportHandler.messageReceived(TransportNodesAction.java:324) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-2.4.0-SNAPSHOT.jar:2.4.0-SNAPSHOT]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) ~[?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) ~[?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
[2022-09-13T18:21:42,978][DEBUG][o.o.c.c.PublicationTransportHandler] [followCluster-0] received diff cluster state version [34] with uuid [Nsz67eMYTUmVZvtHr7mCKA], diff size [545]

dreamer-89 commented Sep 15, 2022

This error appears because when index is closed the NoOpEngine is returned inside IndicesService.java. TransportNodesListGatewayStartedShards transport (used to fetch shard infos) recently updated to fetch ReplicationCheckpoint when index has segment replication enabled. This checkpoint is currently fetched from IndexShard.getLatestReplicationCheckpoint, which asserts engine should be InternalEngine which doesn’t hold true for closed indices. Added a fix after which all tests pass. There are test failures but they are not reproducible when run in isolation.
CC @ankitkala @mch2
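As a self-contained illustration of the fix described above (all names here are hypothetical stand-ins, not the actual OpenSearch implementation): guard getLatestReplicationCheckpoint so it reports no checkpoint for a closed index instead of tripping the InternalEngine assertion, since closed indices are backed by a NoOpEngine.

```java
// Minimal sketch of the closed-index guard. Types and method names are
// simplified stand-ins for IndexShard.getLatestReplicationCheckpoint.
public class CheckpointGuardSketch {
    enum IndexState { OPEN, CLOSE }

    static final class ReplicationCheckpoint {
        final long segmentInfosVersion;
        ReplicationCheckpoint(long v) { this.segmentInfosVersion = v; }
    }

    // Stand-in for IndexShard.getLatestReplicationCheckpoint: return no
    // checkpoint for closed indices (NoOpEngine) instead of asserting that
    // the engine is an InternalEngine.
    static ReplicationCheckpoint latestReplicationCheckpoint(IndexState state, long version) {
        if (state == IndexState.CLOSE) {
            return null; // closed index: nothing to report
        }
        return new ReplicationCheckpoint(version);
    }

    public static void main(String[] args) {
        assert latestReplicationCheckpoint(IndexState.CLOSE, 7L) == null;
        assert latestReplicationCheckpoint(IndexState.OPEN, 7L).segmentInfosVersion == 7L;
        System.out.println("ok");
    }
}
```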

Used the latest changes from 2.x. Branches used for repro:

  1. https://github.com/dreamer-89/OpenSearch/tree/ccr_testing
  2. https://github.com/dreamer-89/cross-cluster-replication/tree/ccr_testing
  3. https://github.com/opensearch-project/common-utils/tree/ccr_testing

Segment replication specific

The test below fails on 2.x with the same stack trace as mentioned here. This is fixed by adding a closed-index check on getLatestReplicationCheckpoint in commit

    public void testIndexReopenClose() throws Exception {
        internalCluster().startNodes(2);
        createIndex(INDEX_NAME);
        ensureGreen(INDEX_NAME);

        client().prepareIndex(INDEX_NAME).setId("1").setSource("foo", "bar").setRefreshPolicy(WriteRequest.RefreshPolicy.IMMEDIATE).get();
        flushAndRefresh(INDEX_NAME);
        waitForReplicaUpdate();

        client().admin().indices().prepareClose(INDEX_NAME).get();
        // Add another node to kick off TransportNodesListGatewayStartedShards which fetches latestReplicationCheckpoint for SegRep enabled indices
        internalCluster().startNode();
        client().admin().indices().prepareOpen(INDEX_NAME).get();
        ensureGreen(INDEX_NAME);
    }

@dreamer-89 (Member) commented Sep 15, 2022

I tried running the tests against OpenSearch main (3.0.0-SNAPSHOT), but it looks broken currently. We need to use a getHistoryOperationsFromTranslogFile alternative to fix this integration before the CCR integ tests against OpenSearch can be retried.

@ankitkala (Member) commented:
@mch2 I tested the compatibility between segrep & CCR again.

  • Didn't observe any issues with manual testing. I verified that both the leader index and the follower index were using segrep, while CCR itself was using docrep.
  • Integration tests on CCR were failing with various issues (persistent task didn't come up, quorum loss, refresh failing on the CCR system index, etc.). I forced the CCR system index to use docrep and all of those issues went away. All integ test cases except one passed (screenshot below). I retried the failing test and it succeeded on a subsequent run.

[Screenshot: integ test results, 2023-02-21 at 7:29:49 PM]

Changes done to test:
https://github.com/ankitkala/cross-cluster-replication/tree/ccr_segrep_compatibility
https://github.com/ankitkala/OpenSearch/tree/ccr_segrep_compatibility

@ankitkala
Copy link
Member

Since new changes are sent to the follower shard by a leader shard copy chosen at random (primary or replica), I also verified that the leader's replica shards are able to provide the translog operations. We can close this issue now.

A few additional changes we'll need to make in the future:

  • For SegRep with remote store, always fetch changes from the leader's primary shard.
  • If SegRep is enabled for all indices in the future, force the CCR system index onto docrep (assuming we are still seeing issues with segrep on system indices).
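A minimal sketch of the selection behavior described above (hypothetical names, not the actual CCR plugin code): pick any in-sync leader shard copy at random, except when the leader uses remote-store-backed SegRep, in which case only the primary is used.

```java
import java.util.List;
import java.util.Random;

// Sketch of leader shard copy selection for serving translog operations
// to the follower. ShardCopy and pickSource are illustrative stand-ins.
public class LeaderShardSelection {
    record ShardCopy(String nodeId, boolean primary) {}

    static ShardCopy pickSource(List<ShardCopy> inSyncCopies, boolean remoteStoreSegRep, Random random) {
        if (remoteStoreSegRep) {
            // With remote store, fetch changes only from the primary shard.
            return inSyncCopies.stream().filter(ShardCopy::primary).findFirst().orElseThrow();
        }
        // Otherwise any in-sync copy (primary or replica) can serve the ops.
        return inSyncCopies.get(random.nextInt(inSyncCopies.size()));
    }

    public static void main(String[] args) {
        List<ShardCopy> copies = List.of(new ShardCopy("n1", true), new ShardCopy("n2", false));
        assert pickSource(copies, true, new Random(0)).primary();
        assert copies.contains(pickSource(copies, false, new Random(0)));
        System.out.println("ok");
    }
}
```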
