[Segment Replication] [BUG] Primary hits NoSuchFileException computing metadata snapshot. #4178

Closed · mch2 opened this issue Aug 9, 2022 · 10 comments · Fixed by #4402
Labels: bug (Something isn't working), distributed framework

mch2 (Member) commented Aug 9, 2022

Describe the bug
While running some perf tests for Segment Replication, I hit an exception while computing a metadata snapshot off of the primary's latest segment infos. I think this is caused by the primary missing a segments_N file on disk that is still referenced by the latest SegmentInfos returned from getLatestSegmentInfos. The exception is hit when the referenced files and the on-disk files diverge:

Segment Infos files: [_1.cfs, _0.cfe, _0.si, _1.cfe, _1.si, _2.si, _0.cfs, _2.cfe, _2.cfs, segments_2]
On disk files: [_0.cfe, _0.cfs, _0.si, _1.cfe, _1.cfs, _1.si, _2.cfe, _2.cfs, _2.si, segments_3, write.lock]

This means the primary needs to refresh its reader before we compute the metadata or we could hit this situation.

Further, we invoke indexWriter.incRefDeleter while fetching the latest SegmentInfos and return it as a closeable that releases the files. However, incRefDeleter does not incRef the segments_N file (https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/IndexWriter.java#L5831). This means that this file could be merged away before we finish copying it to a replica.
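
For context, a simplified sketch of what that metadata computation does (this is not the actual Store.MetadataSnapshot code; the helper name is made up): every file referenced by the in-memory SegmentInfos, including its segments_N, is opened and checksummed, so a segments_N that has already been superseded on disk surfaces as a NoSuchFileException.

    import java.io.IOException;
    import java.util.Collection;

    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    // Simplified sketch of the metadata computation, not the actual Store.MetadataSnapshot code.
    static void checksumAllReferencedFiles(SegmentInfos infos, Directory directory) throws IOException {
        Collection<String> referenced = infos.files(true); // true = include the segments_N file itself
        for (String file : referenced) {
            try (IndexInput in = directory.openInput(file, IOContext.READONCE)) {
                // the real code reads the codec footer checksum here; omitted in this sketch
            }
        }
    }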

Trace:

2022-08-09T18:10:36,763][WARN ][o.o.c.r.a.AllocationService] [node-3] failing shard [failed shard, shard [so][2], node[QKw3L21KRQeGJ8tmF1ENpQ], [R], s[STARTED], a[id=7ETNwg5VQ9GEpOPHnX1UXw], message [shard failure, reason [replication failure]], failure [OpenSearchException[Segment Replication failed]; nested: RemoteTransportException[[node-3][:9300][internal:index/shard/replication/get_checkpoint_info]]; nested: NoSuchFileException[/home/ec2-user/opensearch-3.0.0/data/nodes/0/indices/sBLLTSH0QtqQ5OVQKJqUDQ/2/index/segments_2]; ], markAsStale [true]]
org.opensearch.OpenSearchException: Segment Replication failed
        at org.opensearch.indices.replication.SegmentReplicationTargetService$3.onFailure(SegmentReplicationTargetService.java:235) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.ActionRunnable.onFailure(ActionRunnable.java:103) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:54) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.OpenSearchExecutors$DirectExecutorService.execute(OpenSearchExecutors.java:341) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.notifyListener(ListenableFuture.java:120) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.lambda$done$0(ListenableFuture.java:112) ~[opensearch-3.0.0.jar:3.0.0]
        at java.util.ArrayList.forEach(ArrayList.java:1511) ~[?:?]
        at org.opensearch.common.util.concurrent.ListenableFuture.done(ListenableFuture.java:112) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.BaseFuture.setException(BaseFuture.java:178) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.ListenableFuture.onFailure(ListenableFuture.java:149) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.StepListener.innerOnFailure(StepListener.java:82) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.NotifyOnceListener.onFailure(NotifyOnceListener.java:62) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.ActionListener$4.onFailure(ActionListener.java:190) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.ActionListener$6.onFailure(ActionListener.java:309) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFinalFailure(RetryableAction.java:201) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.support.RetryableAction$RetryingListener.onFailure(RetryableAction.java:193) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.action.ActionListenerResponseHandler.handleException(ActionListenerResponseHandler.java:74) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.TransportService$ContextRestoreResponseHandler.handleException(TransportService.java:1370) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.InboundHandler.lambda$handleException$3(InboundHandler.java:420) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:747) ~[opensearch-3.0.0.jar:3.0.0]
        at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1136) [?:?]
        at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:635) [?:?]
        at java.lang.Thread.run(Thread.java:833) [?:?]
Caused by: org.opensearch.transport.RemoteTransportException: [node-3][:9300][internal:index/shard/replication/get_checkpoint_info]
Caused by: java.nio.file.NoSuchFileException: /home/ec2-user/opensearch-3.0.0/data/nodes/0/indices/sBLLTSH0QtqQ5OVQKJqUDQ/2/index/segments_2
        at sun.nio.fs.UnixException.translateToIOException(UnixException.java:92) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:106) ~[?:?]
        at sun.nio.fs.UnixException.rethrowAsIOException(UnixException.java:111) ~[?:?]
        at sun.nio.fs.UnixFileSystemProvider.newFileChannel(UnixFileSystemProvider.java:181) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:298) ~[?:?]
        at java.nio.channels.FileChannel.open(FileChannel.java:357) ~[?:?]
        at org.apache.lucene.store.NIOFSDirectory.openInput(NIOFSDirectory.java:78) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
        at org.opensearch.index.store.FsDirectoryFactory$HybridDirectory.openInput(FsDirectoryFactory.java:166) ~[opensearch-3.0.0.jar:3.0.0]
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
        at org.apache.lucene.store.FilterDirectory.openInput(FilterDirectory.java:101) ~[lucene-core-9.3.0.jar:9.3.0 d25cebcef7a80369f4dfb9285ca7360a810b75dc - ivera - 2022-07-25 12:30:23]
        at org.opensearch.index.store.Store$MetadataSnapshot.checksumFromLuceneFile(Store.java:1046) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.index.store.Store$MetadataSnapshot.loadMetadata(Store.java:1032) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.index.store.Store$MetadataSnapshot.<init>(Store.java:895) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.index.store.Store.getMetadata(Store.java:333) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.indices.replication.common.CopyState.<init>(CopyState.java:58) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.indices.replication.OngoingSegmentReplications.getCachedCopyState(OngoingSegmentReplications.java:83) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.indices.replication.OngoingSegmentReplications.prepareForReplication(OngoingSegmentReplications.java:158) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:103) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.indices.replication.SegmentReplicationSourceService$CheckpointInfoRequestHandler.messageReceived(SegmentReplicationSourceService.java:86) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.RequestHandlerRegistry.processMessageReceived(RequestHandlerRegistry.java:106) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.transport.InboundHandler$RequestHandler.doRun(InboundHandler.java:453) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-3.0.0.jar:3.0.0]
        at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.0.0.jar:3.0.0]
        ... 3 more
dreamer-89 (Member) commented:

Looking into it.

dreamer-89 (Member) commented Aug 16, 2022

@mch2: In order to repro this issue, I am trying to run integration tests (please ignore other changes added for debugging) in SegmentReplicationIT.java, but I am not able to hit this issue (I saw it a couple of times previously, but not right now). The only issues I am consistently hitting are #4205 and #4216.

I will keep this issue open and come back if I see this failure again. Until then, I am focusing on #4216, which is failing more consistently.

./gradlew ':server:internalClusterTest' --tests "org.opensearch.indices.replication.SegmentReplicationIT.*" -Dtests.seed=9EFE1B4F25B3092E -Dtests.iters=100 -Dtests.opensearch.logger.level=INFO --rerun-tasks

mch2 (Member, Author) commented Aug 16, 2022

I'm able to see this consistently when running benchmarks, but it's more difficult to get the primary into this state within an IT. I think at a minimum we should add retries here on the replica instead of failing the shard.

mch2 (Member, Author) commented Aug 18, 2022

OK, I've figured out a bit more of what's going on here; I think there are two things lacking.

  1. This should not fail the replica - the replica should gracefully retry. However, even after a retry the primary could still be in this state; a primary refresh will not necessarily resolve this if no new segments are flushed.
  2. This happens here when computing a metadata snapshot on the primary:
            final String segmentsFile = segmentInfos.getSegmentsFileName();
            checksumFromLuceneFile(directory, segmentsFile, builder, logger, maxVersion, true);

The primary is in a state where the latest in-memory infos reference a segments file that is no longer on disk; the on-disk version is one generation ahead. I think in this case we can send the replica the latest segments_N and ignore the missing file. The replica does not actually need the missing _N file, particularly if the primary doesn't either, because we send the SegmentInfos bytes over the wire. I think in this case we could skip the metadata for the missing file and continue? A sketch of that idea is below.
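
A minimal sketch of that "skip the missing segments_N" idea (not the final fix; the method name is hypothetical), assuming the primary-side metadata computation can tolerate the absence of the commit file because the replica receives the SegmentInfos bytes over the wire and never reads segments_N itself:

    import java.io.FileNotFoundException;
    import java.io.IOException;
    import java.nio.file.NoSuchFileException;

    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;
    import org.apache.lucene.store.IOContext;
    import org.apache.lucene.store.IndexInput;

    // Hypothetical helper: tolerate a missing segments_N while building the metadata snapshot.
    // Every other file is incRef'd by the writer for the duration of replication, so only the
    // commit file can legitimately disappear between fetching the infos and checksumming.
    static void addSegmentsFileMetadataLeniently(SegmentInfos infos, Directory directory) throws IOException {
        final String segmentsFile = infos.getSegmentsFileName();
        try (IndexInput in = directory.openInput(segmentsFile, IOContext.READONCE)) {
            // the real code would record the file length and footer checksum here
        } catch (NoSuchFileException | FileNotFoundException e) {
            // the commit point moved on (e.g. segments_3 on disk vs segments_2 in memory); skip its metadata
        }
    }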

mch2 (Member, Author) commented Aug 19, 2022

I'm thinking we could do something like this on the primary to return the on-disk infos if their generation is higher than what's in memory:

    @Override
    public GatedCloseable<SegmentInfos> getSegmentInfosSnapshot() {
        final SegmentInfos latestInMemoryInfos = getLatestSegmentInfos();
        SegmentInfos result;
        try {
            final long lastCommitGeneration = SegmentInfos.getLastCommitGeneration(store.directory());
            final long generation = latestInMemoryInfos.getGeneration();
            logger.info("Latest gen {} - {}", generation, lastCommitGeneration);
            if (generation < lastCommitGeneration) {
                // the latest in-memory infos are behind disk; read the latest SegmentInfos from disk and return that
                result = SegmentInfos.readLatestCommit(store.directory());
            } else {
                result = latestInMemoryInfos;
            }
            // incRef the returned infos so its files are not deleted while replication copies them
            indexWriter.incRefDeleter(result);
        } catch (IOException e) {
            throw new EngineException(shardId, e.getMessage(), e);
        }
        return new GatedCloseable<>(result, () -> indexWriter.decRefDeleter(result));
    }
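
For context, a caller would presumably consume this snapshot with try-with-resources so the incRef'd files stay protected until copying finishes. A hypothetical caller sketch (assuming GatedCloseable's get()/close() API; "engine" is just a placeholder):

    // Hypothetical caller sketch: hold the snapshot open while computing metadata / copying files,
    // then let close() run decRefDeleter so the referenced files become deletable again.
    try (GatedCloseable<SegmentInfos> snapshot = engine.getSegmentInfosSnapshot()) {
        final SegmentInfos infos = snapshot.get();
        // ... build the metadata snapshot and copy the referenced files to the replica ...
    } catch (IOException e) {
        // close() can throw IOException from decRefDeleter; real code would propagate or handle it
    }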

dreamer-89 (Member) commented Aug 22, 2022

@mch2: This bug seems interesting. Is the in-memory SegmentInfos falling behind the on-disk copy an expected state? Do we need to handle this state while fetching the actual files in get_segment_files, where we filter in the on-disk copies?

Can this inconsistency between the in-memory and on-disk state be problematic for segment replication?

mch2 (Member, Author) commented Aug 22, 2022

@mch2: This bug seems interesting. Is the in-memory SegmentInfos falling behind the on-disk copy an expected state?

I think this is happening when the primary commits without creating any new segments, because a subsequent call to refresh does not actually refresh the active reader. It happens quite frequently on the primary in testing. If we return the on-disk version here, there is no risk: we'll send the infos byte[] and recreate it on the replica to load into its reader.

Do we need to handle this state while fetching the actual files in get_segment_files, where we filter in the on-disk copies?

Hmm, I think we need to throw & retry replication if the missing file is not segments_N. All files other than _N will be incRef'd until replication completes, so if one of those files is missing there is a bug. If segments_N is missing, the replica will still fetch the byte[] and be able to read any new segments; however, in this case we can't clear the replica's xlog and must preserve the latest on-disk commit point, which is done here.
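
A small sketch of that distinction (a hypothetical helper; the real check would live wherever the missing file is detected), using Lucene's IndexFileNames.SEGMENTS prefix to tell the commit file apart from segment files:

    import org.apache.lucene.index.IndexFileNames;

    // Hypothetical helper: only a missing commit file (segments_N) is tolerable, because replicas
    // consume the SegmentInfos bytes sent over the wire; a missing segment file means something was
    // deleted despite being incRef'd, so replication should fail and be retried.
    static boolean isTolerableMissingFile(String missingFileName) {
        return missingFileName.startsWith(IndexFileNames.SEGMENTS); // "segments" prefix, e.g. segments_2
    }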

Can this inconsistency between the in-memory and on-disk state be problematic for segment replication?

With the primary-only version of segrep, we need to ensure we always have a valid commit point fsync'd on disk that we can potentially restart/recover from and replay ops from the xlog.

dreamer-89 (Member) commented:

I am using the test below to try to reproduce the NoSuchFileException without the fix. I saw one instance where the NoSuchFileException happened, but I am not able to reproduce it anymore after running the test multiple times.

    public void testDropPrimaryDuringReplication() throws Exception {
        final Settings settings = Settings.builder()
            .put(indexSettings())
            .put(IndexMetadata.SETTING_NUMBER_OF_REPLICAS, 6)
            .put(IndexMetadata.SETTING_REPLICATION_TYPE, ReplicationType.SEGMENT)
            .build();
        final String clusterManagerNode = internalCluster().startClusterManagerOnlyNode();
        final String primaryNode = internalCluster().startDataOnlyNode(Settings.EMPTY);
        createIndex(INDEX_NAME, settings);
        internalCluster().startDataOnlyNodes(6);

        int initialDocCount = scaledRandomIntBetween(100, 200);
        try (
            BackgroundIndexer indexer = new BackgroundIndexer(
                INDEX_NAME,
                "_doc",
                client(),
                -1,
                RandomizedTest.scaledRandomIntBetween(2, 5),
                false,
                random()
            )
        ) {
            indexer.start(initialDocCount);
            waitForDocs(initialDocCount, indexer);
            refresh(INDEX_NAME);
            // don't wait for replication to complete, stop the primary immediately.
            internalCluster().stopRandomNode(InternalTestCluster.nameFilter(primaryNode));
            ensureYellow(INDEX_NAME);

            // start another replica.
            internalCluster().startDataOnlyNode();
            ensureGreen(INDEX_NAME);

            // index another doc and refresh - without this the new replica won't catch up.
            client().prepareIndex(INDEX_NAME).setId("1").setSource("foo", "bar").get();

            flushAndRefresh(INDEX_NAME);
            waitForReplicaUpdate();
            assertSegmentStats(6);
        }
    }

dreamer-89 (Member) commented Aug 26, 2022

One observation from a code scan is that during replica promotion, the replica commits SegmentInfos using Store.commitSegmentInfos; after the IndexWriter commit, the segment generation starts to differ between the in-memory and on-disk copies of SegmentInfos in the failing tests. This difference happens only in a few cases. I am not sure whether this inconsistency in segment generation can lead to other side effects.

mch2 (Member, Author) commented Sep 1, 2022

@dreamer-89 I've drafted a PR that could solve this by only sending segments over the wire and ignoring commit points - #4366. I think this is OK because we send the byte[] of the SegmentInfos to replicas and can commit that if needed for durability. Right now the replica only "commits" if it's being promoted to primary, but we'll want to perform commits when a new segment generation is received from the primary and whenever the engine closes. My draft doesn't do that right now; I am looking at adding it, but may want to split it up into two changes.
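
A minimal sketch of that durability idea (not the PR's actual code), assuming the replica has already materialized the received SegmentInfos and has access to its store directory:

    import java.io.IOException;

    import org.apache.lucene.index.SegmentInfos;
    import org.apache.lucene.store.Directory;

    // Hypothetical sketch: persist the SegmentInfos a replica received as a local commit point,
    // e.g. on promotion to primary or when the engine closes, so there is always a durable
    // segments_N to restart or recover from.
    static void commitReceivedInfos(SegmentInfos receivedInfos, Directory directory) throws IOException {
        receivedInfos.commit(directory); // writes and fsyncs a new segments_N for these infos
    }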
