[Segment Replication] [BUG] Primary hits NoSuchFileException computing metadata snapshot. #4178

Closed · mch2 opened this issue Aug 9, 2022 · 10 comments · Fixed by #4402
Labels: bug (Something isn't working), distributed framework

mch2 (Member) commented Aug 9, 2022

Describe the bug
While running some perf tests for Segment Replication, I hit an exception while computing a metadata snapshot off of the primary's latest segment infos. I think this is caused by the primary missing a segments_N file on disk that is still referenced by the latest SegmentInfos returned from getLatestSegmentInfos. The exception is hit when the referenced files and the on-disk files diverge:

Segment Infos files: [_1.cfs, _0.cfe, _0.si, _1.cfe, _1.si, _2.si, _0.cfs, _2.cfe, _2.cfs, segments_2]
On disk files: [_0.cfe, _0.cfs, _0.si, _1.cfe, _1.cfs, _1.si, _2.cfe, _2.cfs, _2.si, segments_3, write.lock]

This means the primary needs to refresh its reader before we compute the metadata or we could hit this situation.

Further, we invoke indexWriter.incRefDeleter while fetching the latest SegmentInfos and return it as a closeable that releases the files. However, incRefDeleter does not incRef the segments_N file (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L5831). This means that this file could be merged away before we finish copying it to a replica.
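
For context, a simplified sketch of what that metadata computation does (this is not the actual Store.MetadataSnapshot code; the helper name is made up): every file referenced by the in-memory SegmentInfos, including its segments_N, is opened and checksummed, so a segments_N that has already been superseded on disk surfaces as a NoSuchFileException.

    import java.io.IOException;
    import java.util.Collection;

    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    // Simplified sketch of the metadata computation, not the actual Store.MetadataSnapshot code.
    static void checksumAllReferencedFiles(SegmentInfos infos, Directory directory) throws IOException {
        Collection<String> referenced = infos.files(true); // true = include the segments_N file itself
        for (String file : referenced) {
            try (IndexInput in = directory.openInput(file, IOContext.READONCE)) {
                // the real code reads the codec footer checksum here; omitted in this sketch
            }
        }
    }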

Trace:

2022-08-09T18:10:36,763][WARN ][o.o.c.r.a.AllocationService] [node-3] failing shard [failed shard, shard [so][2], node[QKw3L21KRQeGJ8tmF1ENpQ], [R], s[STARTED], a[id=7ETNwg5VQ9GEpOPHnX1UXw], message [shard failure, reason [replication failure]], failure [OpenSearchException[Segment Replication failed]; nested: RemoteTransportException[[node-3][:9300][internal:index/shard/replication/get_checkpoint_info]]; nested: NoSuchFileException[/home/ec2-user/opensearch-3.0.0/data/nodes/0/indices/sBLLTSH0QtqQ5OVQKJqUDQ/2/index/segments_2]; ], markAsStale [true]]
org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:235) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-3.0.0.jar:3.0.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:201) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:193) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1370) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-3.0.0.jar:3.0.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [node-3][:9300][internal:index/shard/replication/get_checkpoint_info]
Caused by: java.nio.file.NoSuchFileException: /home/ec2-user/opensearch-3.0.0/data/nodes/0/indices/sBLLTSH0QtqQ5OVQKJqUDQ/2/index/segments_2
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:298) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:357) ~[?:?]
        at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
        at org.opensearch.index.store.FsDirectoryFactory$HybridDirectory.openInput(FsDirectoryFactory.java:166) ~[opensearch-3.0.0.jar:3.0.0]
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
        at org.opensearch.index.store.Store$MetadataSnapshot.checksumFromLuceneFile(Store.java:1046) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:1032) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.index.store.Store$MetadataSnapshot.<init>(Store.java:895) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.index.store.Store.getMetadata(Store.java:333) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.indices.replication.common.CopyState.<init>(CopyState.java:58) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.indices.replication.OngoingSegmentReplications.getCachedCopyState(OngoingSegmentReplications.java:83) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:158) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:103) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:86) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.0.0.jar:3.0.0]
        ... 3 more
dreamer-89 (Member) commented:

Looking into it.

dreamer-89 (Member) commented Aug 16, 2022

@mch2: In order to repro this issue, I am trying to run integration tests (please ignore other changes added for debugging) in SegmentReplicationIT.java, but I am not able to hit this issue (I saw it a couple of times previously, but not right now). The only issues I am consistently hitting are #4205 and #4216.

I will keep this issue open and come back if I see this failure again. Until then, I am focusing on #4216, which is failing more consistently.

./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationIT.*" -Dtests.seed=9EFE1B4F25B3092E -Dtests.iters=100 -Dtests.opensearch.logger.level=INFO --rerun-tasks

mch2 (Member, Author) commented Aug 16, 2022

I'm able to see this consistently when running benchmarks, but it's more difficult to get the primary into this state within an IT. I think at a minimum we should add retries here on the replica instead of failing the shard.

mch2 (Member, Author) commented Aug 18, 2022

OK, I've figured out a bit more of what's going on here; I think there are two things lacking.

  1. This should not fail the replica - the replica should gracefully retry. However, even after a retry the primary could still be in this state; a primary refresh will not necessarily resolve this if no new segments are flushed.
  2. This happens here when computing a metadata snapshot on the primary:
            final String segmentsFile = segmentInfos.getSegmentsFileName();
            checksumFromLuceneFile(directory, segmentsFile, builder, logger, maxVersion, true);

The primary is in a state where the latest in-memory infos reference a segments file that is no longer on disk; the on-disk version is one generation ahead. I think in this case we can send the replica the latest segments_N and ignore the missing file. The replica does not actually need the missing _N file, particularly if the primary doesn't either, because we send the SegmentInfos bytes over the wire. I think in this case we could skip the metadata for the missing file and continue? A sketch of that idea is below.
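
A minimal sketch of that "skip the missing segments_N" idea (not the final fix; the method name is hypothetical), assuming the primary-side metadata computation can tolerate the absence of the commit file because the replica receives the SegmentInfos bytes over the wire and never reads segments_N itself:

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.nio.file.NoSuchFileException;

    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    // Hypothetical helper: tolerate a missing segments_N while building the metadata snapshot.
    // Every other file is incRef'd by the writer for the duration of replication, so only the
    // commit file can legitimately disappear between fetching the infos and checksumming.
    static void addSegmentsFileMetadataLeniently(SegmentInfos infos, Directory directory) throws IOException {
        final String segmentsFile = infos.getSegmentsFileName();
        try (IndexInput in = directory.openInput(segmentsFile, IOContext.READONCE)) {
            // the real code would record the file length and footer checksum here
        } catch (NoSuchFileException | FileNotFoundException e) {
            // the commit point moved on (e.g. segments_3 on disk vs segments_2 in memory); skip its metadata
        }
    }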

mch2 (Member, Author) commented Aug 19, 2022

I'm thinking we could do something like this on the primary to return the on-disk infos if their generation is higher than what's in memory:

    @Override
    public GatedCloseable<SegmentInfos> getSegmentInfosSnapshot() {
        final SegmentInfos latestInMemoryInfos = getLatestSegmentInfos();
        SegmentInfos result;
        try {
            final long lastCommitGeneration = SegmentInfos.getLastCommitGeneration(store.directory());
            final long generation = latestInMemoryInfos.getGeneration();
            logger.info("Latest gen {} - {}", generation, lastCommitGeneration);
            if (generation < lastCommitGeneration) {
                // the latest in-memory infos are behind disk; read the latest SegmentInfos from disk and return that
                result = SegmentInfos.readLatestCommit(store.directory());
            } else {
                result = latestInMemoryInfos;
            }
            // incRef the returned infos so its files are not deleted while replication copies them
            indexWriter.incRefDeleter(result);
        } catch (IOException e) {
            throw new EngineException(shardId, e.getMessage(), e);
        }
        return new GatedCloseable<>(result, () -> indexWriter.decRefDeleter(result));
    }
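
For context, a caller would presumably consume this snapshot with try-with-resources so the incRef'd files stay protected until copying finishes. A hypothetical caller sketch (assuming GatedCloseable's get()/close() API; "engine" is just a placeholder):

    // Hypothetical caller sketch: hold the snapshot open while computing metadata / copying files,
    // then let close() run decRefDeleter so the referenced files become deletable again.
    try (GatedCloseable<SegmentInfos> snapshot = engine.getSegmentInfosSnapshot()) {
        final SegmentInfos infos = snapshot.get();
        // ... build the metadata snapshot and copy the referenced files to the replica ...
    } catch (IOException e) {
        // close() can throw IOException from decRefDeleter; real code would propagate or handle it
    }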

dreamer-89 (Member) commented Aug 22, 2022

@mch2: This bug seems interesting. Is the in-memory SegmentInfos falling behind the on-disk copy an expected state? Do we need to handle this state while fetching the actual files in get_segment_files, where we filter in the on-disk copies?

Can this inconsistency between the in-memory and on-disk state be problematic for segment replication?

mch2 (Member, Author) commented Aug 22, 2022

@mch2: This bug seems interesting. Is the in-memory SegmentInfos falling behind the on-disk copy an expected state?

I think this is happening when the primary commits without creating any new segments, because a subsequent call to refresh does not actually refresh the active reader. It happens quite frequently on the primary in testing. If we return the on-disk version here, there is no risk: we'll send the infos byte[] and recreate it on the replica to load into its reader.

Do we need to handle this state while fetching the actual files in get_segment_files, where we filter in the on-disk copies?

Hmm, I think we need to throw & retry replication if the missing file is not segments_N. All files other than _N will be incRef'd until replication completes, so if one of those files is missing there is a bug. If segments_N is missing, the replica will still fetch the byte[] and be able to read any new segments; however, in this case we can't clear the replica's xlog and must preserve the latest on-disk commit point, which is done here.
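
A small sketch of that distinction (a hypothetical helper; the real check would live wherever the missing file is detected), using Lucene's IndexFileNames.SEGMENTS prefix to tell the commit file apart from segment files:

    import org.apache.lucene.index.IndexFileNames;

    // Hypothetical helper: only a missing commit file (segments_N) is tolerable, because replicas
    // consume the SegmentInfos bytes sent over the wire; a missing segment file means something was
    // deleted despite being incRef'd, so replication should fail and be retried.
    static boolean isTolerableMissingFile(String missingFileName) {
        return missingFileName.startsWith(IndexFileNames.SEGMENTS); // "segments" prefix, e.g. segments_2
    }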

Can this inconsistency between the in-memory and on-disk state be problematic for segment replication?

With the primary-only version of segrep, we need to ensure we always have a valid commit point fsync'd on disk that we can potentially restart/recover from and replay ops from the xlog.

dreamer-89 (Member) commented:

I am using the test below to try to reproduce the NoSuchFileException without the fix. I saw one instance where the NoSuchFileException happened, but I am not able to reproduce it anymore after running the test multiple times.

    public void testDropPrimaryDuringReplication() throws Exception {
        final Settings settings = Settings.builder()
            .put(indexSettings())
            .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 6)
            .put(IndexMetadata.SETTING_REPLICATION_TYPE, ReplicationType.SEGMENT)
            .build();
        final String clusterManagerNode = internalCluster().startClusterManagerOnlyNode();
        final String primaryNode = internalCluster().startDataOnlyNode(Settings.EMPTY);
        createIndex(INDEX_NAME, settings);
        internalCluster().startDataOnlyNodes(6);

        int initialDocCount = scaledRandomIntBetween(100, 200);
        try (
            BackgroundIndexer indexer = new BackgroundIndexer(
                INDEX_NAME,
                "_doc",
                client(),
                -1,
                RandomizedTest.scaledRandomIntBetween(2, 5),
                false,
                random()
            )
        ) {
            indexer.start(initialDocCount);
            waitForDocs(initialDocCount, indexer);
            refresh(INDEX_NAME);
            // don't wait for replication to complete, stop the primary immediately.
            internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNode));
            ensureYellow(INDEX_NAME);

            // start another replica.
            internalCluster().startDataOnlyNode();
            ensureGreen(INDEX_NAME);

            // index another doc and refresh - without this the new replica won't catch up.
            client().prepareIndex(INDEX_NAME).setId("1").setSource("foo", "bar").get();

            flushAndRefresh(INDEX_NAME);
            waitForReplicaUpdate();
            assertSegmentStats(6);
        }
    }

dreamer-89 (Member) commented Aug 26, 2022

One observation from a code scan is that during replica promotion, the replica commits SegmentInfos using Store.commitSegmentInfos; after the IndexWriter commit, the segment generation starts to differ between the in-memory and on-disk copies of SegmentInfos in the failing tests. This difference happens only in a few cases. I am not sure whether this inconsistency in segment generation can lead to other side effects.

mch2 (Member, Author) commented Sep 1, 2022

@dreamer-89 I've drafted a PR that could solve this by only sending segments over the wire and ignoring commit points - #4366. I think this is OK because we send the byte[] of the SegmentInfos to replicas and can commit that if needed for durability. Right now the replica only "commits" if it's being promoted to primary, but we'll want to perform commits when a new segment generation is received from the primary and whenever the engine closes. My draft doesn't do that right now; I am looking at adding it, but may want to split it up into two changes.
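
A minimal sketch of that durability idea (not the PR's actual code), assuming the replica has already materialized the received SegmentInfos and has access to its store directory:

    import java.io.IOException;

    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;

    // Hypothetical sketch: persist the SegmentInfos a replica received as a local commit point,
    // e.g. on promotion to primary or when the engine closes, so there is always a durable
    // segments_N to restart or recover from.
    static void commitReceivedInfos(SegmentInfos receivedInfos, Directory directory) throws IOException {
        receivedInfos.commit(directory); // writes and fsyncs a new segments_N for these infos
    }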
