
[Segment Replication] [BUG] file handle leak in SegmentFileTransferHandler #4205

Closed · dreamer-89 opened this issue Aug 12, 2022 · 2 comments
Labels: bug (Something isn't working), distributed framework, enhancement (Enhancement or improvement to existing feature or request)

dreamer-89 (Member) commented Aug 12, 2022

Describe the bug

store.directory().openInput in SegmentFileTransferHandler leaks file handles, which surfaces as a runtime file-handle-leak exception in tests. Based on the remaining usages, this call can be wrapped in a try-with-resources block, since IndexInput implements the Closeable interface.

java.lang.RuntimeException: file handle leaks: [FileChannel(/Users/singhnjb/OpenSearch/server/build/testrun/internalClusterTest/temp/org.opensearch.indices.replication.SegmentReplicationIT_E528474D6EDB4FDE-001/tempDir-005/node_t1/d0/nodes/0/indices/m2VhNPQ4SueBsbqz3y6juA/0/index/_b_Lucene90_0.pos), FileChannel(/Users/singhnjb/OpenSearch/server/build/testrun/internalClusterTest/temp/org.opensearch.indices.replication.SegmentReplicationIT_E528474D6EDB4FDE-001/tempDir-005/node_t1/d0/nodes/0/indices/m2VhNPQ4SueBsbqz3y6juA/0/index/_b.fdt), FileChannel(/Users/singhnjb/OpenSearch/server/build/testrun/internalClusterTest/temp/org.opensearch.indices.replication.SegmentReplicationIT_E528474D6EDB4FDE-001/tempDir-005/node_t1/d0/nodes/0/indices/m2VhNPQ4SueBsbqz3y6juA/0/index/_b_Lucene90_0.pos), FileChannel(/Users/singhnjb/OpenSearch/server/build/testrun/internalClusterTest/temp/org.opensearch.indices.replication.SegmentReplicationIT_E528474D6EDB4FDE-001/tempDir-005/node_t1/d0/nodes/0/indices/m2VhNPQ4SueBsbqz3y6juA/0/index/_b.fdt), FileChannel(/Users/singhnjb/OpenSearch/server/build/testrun/internalClusterTest/temp/org.opensearch.indices.replication.SegmentReplicationIT_E528474D6EDB4FDE-001/tempDir-005/node_t1/d0/nodes/0/indices/m2VhNPQ4SueBsbqz3y6juA/0/index/_b.fdt)]
	at __randomizedtesting.SeedInfo.seed([E528474D6EDB4FDE]:0)
	at org.apache.lucene.tests.mockfile.LeakFS.onClose(LeakFS.java:63)
	at org.apache.lucene.tests.mockfile.FilterFileSystem.close(FilterFileSystem.java:69)
	at org.apache.lucene.tests.mockfile.FilterFileSystem.close(FilterFileSystem.java:70)
	at org.apache.lucene.tests.util.TestRuleTemporaryFilesCleanup.afterAlways(TestRuleTemporaryFilesCleanup.java:223)
	at com.carrotsearch.randomizedtesting.rules.TestRuleAdapter$1.afterAlways(TestRuleAdapter.java:31)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:833)
Caused by: java.lang.Exception
	at org.apache.lucene.tests.mockfile.LeakFS.onOpen(LeakFS.java:46)
	at org.apache.lucene.tests.mockfile.HandleTrackingFS.callOpenHook(HandleTrackingFS.java:82)
	at org.apache.lucene.tests.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:202)
	at org.apache.lucene.tests.mockfile.HandleTrackingFS.newFileChannel(HandleTrackingFS.java:171)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:298)
	at java.base/java.nio.channels.FileChannel.open(FileChannel.java:357)
	at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78)
	at org.opensearch.index.store.FsDirectoryFactory$HybridDirectory.openInput(FsDirectoryFactory.java:166)
	at org.apache.lucene.tests.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:816)
	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101)
	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101)
	at org.opensearch.indices.replication.SegmentFileTransferHandler$1.onNewResource(SegmentFileTransferHandler.java:107)
	at org.opensearch.indices.replication.SegmentFileTransferHandler$1.onNewResource(SegmentFileTransferHandler.java:97)
	at org.opensearch.indices.recovery.MultiChunkTransfer.getNextRequest(MultiChunkTransfer.java:183)
	at org.opensearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:157)
	at org.opensearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:98)
	at org.opensearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:123)
	at org.opensearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:111)
	at org.opensearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:102)
	at org.opensearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:109)
	at org.opensearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$3(MultiChunkTransfer.java:151)
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80)
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299)
	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180)
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299)
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:161)
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69)
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1369)
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393)
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387)
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635)
	... 1 more

To Reproduce
Steps to reproduce the behavior:
This issue occurs randomly when executing the SegmentReplicationIT.java tests. I found it while running the new integration test below (created to reproduce a different issue):

    public void testPrimaryShardAllocatorUsesFurthestAheadReplica() throws Exception {
        final Settings settings = Settings.builder()
            .put(indexSettings()).put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 6)
            .put(IndexMetadata.SETTING_REPLICATION_TYPE, ReplicationType.SEGMENT)
            .build();
        final String clusterManagerNode = internalCluster().startClusterManagerOnlyNode(Settings.EMPTY);
        final String primaryNode = internalCluster().startDataOnlyNode(Settings.EMPTY);
        createIndex(INDEX_NAME, settings);
        final String firstReplica = internalCluster().startDataOnlyNode(Settings.EMPTY);
        final String secondReplica = internalCluster().startDataOnlyNode(Settings.EMPTY);

        // Index docs & refresh to bring all replicas to initial checkpoint
        indexDocs(scaledRandomIntBetween(20, 200));
        flushAndRefresh(INDEX_NAME);

        final String thirdReplica = internalCluster().startDataOnlyNode(Settings.EMPTY);
        final String fourthReplica = internalCluster().startDataOnlyNode(Settings.EMPTY);
        final String fifthReplica = internalCluster().startDataOnlyNode(Settings.EMPTY);
        final String sixthReplica = internalCluster().startDataOnlyNode(Settings.EMPTY);

        for (int i = 0; i < 10; i++) {
            logger.info("Iteration {} --> ", i);
            indexDocsAndRefresh(scaledRandomIntBetween(10, 100));
        }

        final Index index = resolveIndex(INDEX_NAME);
        logger.info("--> primaryShard RC {}", getIndexShard(primaryNode).getLatestReplicationCheckpoint());
        internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNode));

        for (int i = 0; i < 5; i++) {
            logger.info("Iteration {} --> ", i);
            indexDocsAndRefresh(scaledRandomIntBetween(10, 100));
        }
        ensureYellow(INDEX_NAME);

    }

Expected behavior
Test should not fail with the file handle leak runtime exception.

Host/Environment (please complete the following information):

  • OS: macOS
@dreamer-89 dreamer-89 added bug Something isn't working untriaged labels Aug 12, 2022
@dreamer-89 dreamer-89 changed the title [Segment Replication] [BUG] file handle leak [Segment Replication] [BUG] file handle leak in SegmentFileTransferHandler Aug 12, 2022
dreamer-89 (Member, Author) commented:

This issue also shows up as open files preventing the directory close action:

[2022-08-16T00:21:24,972][WARN ][o.o.i.IndexService       ] [node_t1] [test-idx-1] [0] failed to close store on shard removal (reason: [shutdown])
java.lang.RuntimeException: MockDirectoryWrapper: cannot close: there are still 1 open files: {_d.cfs=1}
	at org.apache.lucene.tests.store.MockDirectoryWrapper.close(MockDirectoryWrapper.java:876) ~[lucene-test-framework-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
	at org.apache.lucene.store.FilterDirectory.close(FilterDirectory.java:111) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
	at org.apache.lucene.store.FilterDirectory.close(FilterDirectory.java:111) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
	at org.opensearch.index.store.Store$StoreDirectory.innerClose(Store.java:878) ~[main/:?]
	at org.opensearch.index.store.Store.closeInternal(Store.java:474) ~[main/:?]
	at org.opensearch.index.store.Store$1.closeInternal(Store.java:187) ~[main/:?]
	at org.opensearch.common.util.concurrent.AbstractRefCounted.decRef(AbstractRefCounted.java:76) ~[opensearch-core-3.0.0-SNAPSHOT.jar:?]
	at org.opensearch.index.store.Store.decRef(Store.java:449) ~[main/:?]
	at org.opensearch.index.store.Store.close(Store.java:456) ~[main/:?]
	at org.opensearch.index.IndexService.closeShard(IndexService.java:613) [main/:?]
	at org.opensearch.index.IndexService.removeShard(IndexService.java:581) [main/:?]
	at org.opensearch.index.IndexService.close(IndexService.java:368) [main/:?]
	at org.opensearch.indices.IndicesService.removeIndex(IndicesService.java:903) [main/:?]
	at org.opensearch.indices.IndicesService.lambda$doStop$2(IndicesService.java:413) [main/:?]
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
	at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: java.lang.RuntimeException: unclosed IndexInput: _d.cfs
	at org.apache.lucene.tests.store.MockDirectoryWrapper.addFileHandle(MockDirectoryWrapper.java:783) ~[lucene-test-framework-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
	at org.apache.lucene.tests.store.MockDirectoryWrapper.openInput(MockDirectoryWrapper.java:835) ~[lucene-test-framework-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
	at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
	at org.opensearch.indices.replication.SegmentFileTransferHandler$1.onNewResource(SegmentFileTransferHandler.java:107) ~[main/:?]
	at org.opensearch.indices.replication.SegmentFileTransferHandler$1.onNewResource(SegmentFileTransferHandler.java:97) ~[main/:?]
	at org.opensearch.indices.recovery.MultiChunkTransfer.getNextRequest(MultiChunkTransfer.java:183) ~[main/:?]
	at org.opensearch.indices.recovery.MultiChunkTransfer.handleItems(MultiChunkTransfer.java:157) ~[main/:?]
	at org.opensearch.indices.recovery.MultiChunkTransfer$1.write(MultiChunkTransfer.java:98) ~[main/:?]
	at org.opensearch.common.util.concurrent.AsyncIOProcessor.processList(AsyncIOProcessor.java:123) ~[main/:?]
	at org.opensearch.common.util.concurrent.AsyncIOProcessor.drainAndProcessAndRelease(AsyncIOProcessor.java:111) ~[main/:?]
	at org.opensearch.common.util.concurrent.AsyncIOProcessor.put(AsyncIOProcessor.java:99) ~[main/:?]
	at org.opensearch.indices.recovery.MultiChunkTransfer.addItem(MultiChunkTransfer.java:109) ~[main/:?]
	at org.opensearch.indices.recovery.MultiChunkTransfer.lambda$handleItems$3(MultiChunkTransfer.java:151) ~[main/:?]
	at org.opensearch.action.ActionListener$1.onResponse(ActionListener.java:80) ~[main/:?]
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[main/:?]
	at org.opensearch.action.ActionListener$4.onResponse(ActionListener.java:180) ~[main/:?]
	at org.opensearch.action.ActionListener$6.onResponse(ActionListener.java:299) ~[main/:?]
	at org.opensearch.action.support.RetryableAction$RetryingListener.onResponse(RetryableAction.java:161) ~[main/:?]
	at org.opensearch.action.ActionListenerResponseHandler.handleResponse(ActionListenerResponseHandler.java:69) ~[main/:?]
	at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleResponse(TransportService.java:1369) ~[main/:?]
	at org.opensearch.transport.InboundHandler.doHandleResponse(InboundHandler.java:393) ~[main/:?]
	at org.opensearch.transport.InboundHandler.lambda$handleResponse$1(InboundHandler.java:387) ~[main/:?]
	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[main/:?]
	... 3 more

mch2 (Member) commented Aug 19, 2022

I think the root problem here is that we are not properly cancelling replication events on the former primary. I've opened PR #4225 to properly close them when the shard is demoted. I'm also seeing #4178 happen with this same integ test with 6 replicas; I'll include that test with the fix for #4178.
