Replica recovery could go into an endless flushing loop #28350
Conversation
@dnhatn great find. I think that we would actually want the flush to happen in this case so that the translog can be cleaned up. The current approach here says: there's more than 500mb worth of uncommitted data (which is actually all committed), but no uncommitted change to Lucene, so let's ignore this. If we forcibly flushed even though there are no changes to Lucene, that would allow us to free the translog.
This is a great find. I'm not sure though that this is the right fix. The main problem is that the uncommitted-bytes stat is off: all ops in the translog are actually in Lucene. The problem is that uncommitted bytes are calculated based on the translog generation file that the last Lucene commit points to. This is amplified by the fact that we now ship more of the translog to create history on the replica, which is not relevant for the flushing logic. I wonder if we should always force a flush at the end of recovery as an easy fix. Another option is to flush when Lucene doesn't point to the right generation, even if there are no pending ops. I want to think about this some more.
Agreed. It is a broader issue that has implications for the entire replication group. Last time we talked about it we thought of having a fail-safe along the lines of "if a specific in-sync shard lags behind by more than x ops, fail it". x can be something large, like 10K ops. The downside, of course, is that it will hide bugs.
Agreed. I am not sure this is the right approach either. I was trying to fix this by only sending translog operations after the local checkpoint in peer recovery. However, this can happen in other cases, hence I switched to this approach.
We don't do this by design - we need to build a translog with history on the replica. |
Thx Nhat. Reviewing this and talking things through with @ywelsch, we came up with a model that is conceptually simpler to digest and that we feel better about than what we came up with yesterday.
Here's the idea:
- Remove shouldFlush from the translog and only have these decisions made in the Engine.
- The shouldFlush check in the engine shouldn't rely on translog generations but rather only work with uncommitted bytes. Concretely (see the sketch below):
a) if uncommittedBytes is below the flush threshold (512MB by default), return false
b) expose the Translog#sizeOfGensAboveSeqNoInBytes method that's currently unused
c) check if sizeOfGensAboveSeqNoInBytes(localCheckpoint + 1) < uncommittedBytes. If it is, return true (as we will gain some bytes); if not, return false
WDYT?
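To make the proposal concrete, here is a minimal, self-contained sketch of that check. The `TranslogStats` interface is a hypothetical stand-in for the real translog (only `uncommittedSizeInBytes` and `sizeOfGensAboveSeqNoInBytes` are taken from the discussion above); this is not the merged implementation.

```java
// Sketch of the proposed shouldFlush logic, not the actual Elasticsearch code.
final class FlushDecisionSketch {

    interface TranslogStats {
        long uncommittedSizeInBytes();                // bytes above the generation referenced by the last commit
        long sizeOfGensAboveSeqNoInBytes(long seqNo); // bytes of generations still needed above the given seq#
    }

    static boolean shouldFlush(TranslogStats translog, long localCheckpoint, long flushThresholdBytes) {
        final long uncommittedBytes = translog.uncommittedSizeInBytes();
        if (uncommittedBytes < flushThresholdBytes) {
            return false; // (a) not enough uncommitted translog to bother
        }
        // (c) only flush if a new commit would actually shrink the uncommitted size,
        // i.e. the generations still required after the flush are smaller than what
        // we currently retain; this is what lets the flushing loop terminate.
        return translog.sizeOfGensAboveSeqNoInBytes(localCheckpoint + 1) < uncommittedBytes;
    }
}
```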
@@ -1492,7 +1511,7 @@ public CommitId flush(boolean force, boolean waitIfOngoing) throws EngineException
             logger.trace("acquired flush lock immediately");
         }
         try {
-            if (indexWriter.hasUncommittedChanges() || force) {
+            if (indexWriter.hasUncommittedChanges() || force || shouldFlush()) {
can we add a comment explaining why we have 3 things? Basically something like: we check if
- we're forced,
- there are uncommitted docs in Lucene, or
- there are translog-related reasons to create a new commit that points to a different place in the translog (see the sketch below).
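As an illustration only, the requested comment could be folded into a small helper like the following. The `mustCommit` helper and its parameter names are hypothetical; the actual PR keeps the condition inline in the `flush(force, waitIfOngoing)` method shown in the diff above.

```java
// Hypothetical refactoring for illustration; the PR keeps this condition inline in flush().
final class FlushReasonsSketch {
    static boolean mustCommit(boolean force, boolean hasUncommittedLuceneDocs, boolean translogNeedsNewCommit) {
        return force                       // 1. the caller explicitly forced the flush
            || hasUncommittedLuceneDocs    // 2. Lucene itself has uncommitted documents
            || translogNeedsNewCommit;     // 3. a new commit would point to a newer translog
                                           //    generation, letting old generations be trimmed
    }
}
```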
Done
        }
        /*
         * We should only flush if the shouldFlush condition can become false after flushing. This condition will change if:
         * 1. The min translog gen of the next commit points to a different translog gen than the last commit
I think this deserves a comment explaining why we don't take IW#hasUncommittedChanges() into account.
should we call ensureOpen() here as well?
Done
@@ -817,6 +817,12 @@ public final boolean refreshNeeded() {
    // NOTE: do NOT rename this to something containing flush or refresh!
    public abstract void writeIndexingBuffer() throws EngineException;

    /**
     * Checks if this engine should be flushed.
can you explain that this can return false even if there are uncommitted changes? It's more of a maintenance function. Maybe we should call it something different, like shouldFlushForMaintenance or maintenanceFlushPending(); just suggestions to make it more clear.
Yannick and I came up with shouldFlushToFreeTranslog
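For illustration, a sketch of the kind of Javadoc requested above, attached to the renamed method. The wording and the exact signature are illustrative, not the merged text.

```java
/**
 * Sketch only (assumed signature): returns true if a flush is needed purely to free translog
 * generations, i.e. a new commit would point to a newer translog generation than the last
 * commit, so old generations could be trimmed. Note that this may return false even when the
 * IndexWriter has uncommitted changes: it is a maintenance signal for translog cleanup,
 * not a general "are there uncommitted docs" check.
 */
public abstract boolean shouldFlushToFreeTranslog();
```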
I've addressed your feedback. Could you please take another look? Thank you!
@@ -306,4 +307,26 @@ public void testSequenceBasedRecoveryKeepsTranslog() throws Exception {
        }
    }

    public void testShouldFlushAfterPeerRecovery() throws Exception {
can you add Javadoc to this method to explain what the goal of this test is?
        final long flushThreshold = config().getIndexSettings().getFlushThresholdSize().getBytes();
        final long uncommittedSizeOfCurrentCommit = translog.uncommittedSizeInBytes();
        // If flushThreshold is too small, we may continuously flush even when there are no uncommitted operations.
        if (uncommittedSizeOfCurrentCommit < flushThreshold || translog.uncommittedOperations() == 0) {
maybe put the check translog.uncommittedOperations() == 0 at the beginning of the shouldFlushToFreeTranslog method.
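A minimal sketch of what that reordering could look like, written as a free-standing method with the values passed in; the real method reads them from the engine config and translog as in the snippet above, and the rest of its condition is elided here.

```java
// Sketch only: the early return on "no uncommitted operations" comes first, so the
// size comparison only runs when a flush could actually free something.
final class ShouldFlushGuardSketch {
    static boolean shouldFlushToFreeTranslog(int uncommittedOperations,
                                             long uncommittedSizeInBytes,
                                             long flushThresholdBytes) {
        if (uncommittedOperations == 0) {
            return false; // nothing to free; also guards tiny thresholds against endless flushing
        }
        // The remainder of the real condition (comparing against the translog generations
        // a new commit would still retain) is omitted in this sketch.
        return uncommittedSizeInBytes >= flushThresholdBytes;
    }
}
```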
        shards.startAll();
        long translogSizeOnPrimary = 0;
        int numDocs = shards.indexDocs(between(10, 100));
        translogSizeOnPrimary += shards.getPrimary().getTranslog().uncommittedSizeInBytes();
just define translogSizeOnPrimary here (no need to initialize)
Today, after writing an operation to an engine, we call `IndexShard#afterWriteOperation` to flush a new commit if needed. The `shouldFlush` condition is purely based on the uncommitted translog size and the translog flush threshold size setting. However, this can cause a replica to execute an infinite loop of flushing in the following situation:
1. The primary has a fully baked index commit with its local checkpoint equal to max_seqno.
2. The primary sends that fully baked commit, then replays all retained translog operations to the replica.
3. No operations are added to Lucene on the replica, as the seqno of these operations is at most the local checkpoint.
4. Once the translog operations are replayed, the target calls `IndexShard#afterWriteOperation` to flush. If the total size of the replayed operations exceeds the flush threshold size, this call will invoke `Engine#flush`. However, the engine won't flush, as its index writer does not have any uncommitted operations. `IndexShard#afterWriteOperation` will then keep flushing because the `shouldFlush` condition is still true.

This issue can be avoided if we always flush when the `shouldFlush` condition is true. A sketch of the loop is shown below.
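To make the failure mode concrete, here is a self-contained sketch under simplified assumptions: the `Engine` interface below is a hypothetical stand-in, and the real `IndexShard#afterWriteOperation` schedules asynchronous flushes rather than looping inline, but the effect is the same repeated-flush cycle when a flush never changes the condition.

```java
// Hypothetical, simplified model of the bug: flush() is a no-op when Lucene has no
// uncommitted changes, so the translog-based shouldFlush() condition never turns false.
final class EndlessFlushLoopSketch {

    interface Engine {
        boolean shouldFlush(); // based purely on uncommitted translog size vs. flush threshold
        void flush();          // skips the commit when the IndexWriter has no uncommitted changes
    }

    static void afterWriteOperation(Engine engine) {
        while (engine.shouldFlush()) {
            engine.flush(); // no-op on the recovered replica -> condition stays true -> never terminates
        }
    }
}
```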
If the translog flush threshold is too small (e.g., smaller than the translog header), we may repeatedly flush even when there are no uncommitted operations, because the shouldFlush condition can still be true after flushing. This is currently avoided by an extra guard against the uncommitted operations. However, this extra guard makes shouldFlush complicated. This commit replaces that extra guard with a lower bound for the translog flush threshold. We keep the lower bound small for convenience in testing. Relates elastic#28350 Relates elastic#23606
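For illustration, a sketch of how such a lower bound might be expressed, assuming the `Setting.byteSizeSetting` overload that takes explicit min/max bounds. The key is the real `index.translog.flush_threshold_size` setting, but the concrete bound values here are illustrative and not taken from the commit.

```java
import org.elasticsearch.common.settings.Setting;
import org.elasticsearch.common.settings.Setting.Property;
import org.elasticsearch.common.unit.ByteSizeUnit;
import org.elasticsearch.common.unit.ByteSizeValue;

final class FlushThresholdSettingSketch {
    // Sketch only: a minimum value keeps the threshold from being set smaller than a
    // translog header, which would otherwise allow shouldFlush to stay true forever.
    static final Setting<ByteSizeValue> FLUSH_THRESHOLD_SIZE_SETTING =
        Setting.byteSizeSetting(
            "index.translog.flush_threshold_size",
            new ByteSizeValue(512, ByteSizeUnit.MB),               // default
            new ByteSizeValue(64, ByteSizeUnit.BYTES),             // illustrative lower bound
            new ByteSizeValue(Long.MAX_VALUE, ByteSizeUnit.BYTES), // effectively no upper bound
            Property.Dynamic, Property.IndexScope);
}
```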
* master: (23 commits)
- Update Netty to 4.1.16.Final (elastic#28345)
- Fix peer recovery flushing loop (elastic#28350)
- REST high-level client: add support for exists alias (elastic#28332)
- REST high-level client: move to POST when calling API to retrieve which support request body (elastic#28342)
- Add Indices Aliases API to the high level REST client (elastic#27876)
- Java Api clean up: remove deprecated `isShardsAcked` (elastic#28311)
- [Docs] Fix explanation for `from` and `size` example (elastic#28320)
- Adapt bwc version after backport elastic#28358
- Always return the after_key in composite aggregation response (elastic#28358)
- Adds test name to MockPageCacheRecycler exception (elastic#28359)
- Adds a note in the `terms` aggregation docs regarding pagination (elastic#28360)
- [Test] Fix DiscoveryNodesTests.testDeltas() (elastic#28361)
- Update packaging tests to work with meta plugins (elastic#28336)
- Remove Painless Type from MethodWriter in favor of Java Class. (elastic#28346)
- [Doc] Fixs typo in reverse-nested-aggregation.asciidoc (elastic#28348)
- Reindex: Shore up rethrottle test
- Only assert single commit iff index created on 6.2
- isHeldByCurrentThread should return primitive bool
- [Docs] Clarify `html` encoder in highlighting.asciidoc (elastic#27766)
- Fix GeoDistance query example (elastic#28355)
- ...
Good change and catch @dnhatn, quite some insight into the system is needed to get there; the dark side of the force is strong down there ;)
In elastic#28350, we fixed an endless flushing loop which can happen on replicas by tightening the relation between the flush action and the periodic flush condition:
1. The periodic flush condition is enabled only if it will be disabled after a flush.
2. If the periodic flush condition is true then a flush will actually happen regardless of Lucene state.
(1) and (2) guarantee that a flushing loop will terminate. Sadly, condition (1) can be violated in edge cases, as we used two different algorithms to evaluate the current and future uncommitted size:
- We use the method `uncommittedSizeInBytes` to calculate the current uncommitted size. It is the sum of the translogs whose generation is at least the minGen (determined by a given seqno). We pick a continuous range of translogs starting at the minGen to evaluate the current uncommitted size.
- We use the method `sizeOfGensAboveSeqNoInBytes` to calculate the future uncommitted size. It is the sum of the translogs whose maxSeqNo is at least the given seqNo. Here we don't pick a range but select translogs one by one.
Suppose we have 3 translogs gen1={#1,#2}, gen2={}, gen3={#3} and seqno=#1: uncommittedSizeInBytes is the sum of gen1, gen2, and gen3, while sizeOfGensAboveSeqNoInBytes is the sum of gen1 and gen3. Gen2 is excluded because its maxSeqNo is still -1. This commit ensures that sizeOfGensAboveSeqNoInBytes uses the same algorithm as uncommittedSizeInBytes. Closes elastic#29097
In #28350, we fixed an endless flushing loop which may happen on replicas by tightening the relation between the flush action and the periodic flush condition:
1. The periodic flush condition is enabled only if it will be disabled after a flush.
2. If the periodic flush condition is enabled then a flush will actually happen regardless of Lucene state.
(1) and (2) guarantee that a flushing loop will terminate. Sadly, condition (1) can be violated in edge cases, as we used two different algorithms to evaluate the current and future uncommitted translog size:
- We use the method `uncommittedSizeInBytes` to calculate the current uncommitted size. It is the sum of the translogs whose generation is at least the minGen (determined by a given seqno). We pick a continuous range of translogs starting at the minGen to evaluate the current uncommitted size.
- We use the method `sizeOfGensAboveSeqNoInBytes` to calculate the future uncommitted size. It is the sum of the translogs whose maxSeqNo is at least the given seqNo. Here we don't pick a range but select translogs one by one.
Suppose we have 3 translogs `gen1={#1,#2}, gen2={}, gen3={#3}` and `seqno=#1`: `uncommittedSizeInBytes` is the sum of gen1, gen2, and gen3, while `sizeOfGensAboveSeqNoInBytes` is the sum of gen1 and gen3. Gen2 is excluded because its maxSeqNo is still -1. This commit removes both the `sizeOfGensAboveSeqNoInBytes` and `uncommittedSizeInBytes` methods, then enforces that an engine uses only the `sizeInBytesByMinGen` method to evaluate the periodic flush condition. A sketch of the two computations follows below. Closes #29097 Relates #28350
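To make the mismatch concrete, here is a self-contained sketch of the two computations on the example above. The `Gen` record and method names are simplified stand-ins, not the real Translog code.

```java
import java.util.List;

// Sketch only: models a translog generation by its number, its max sequence number
// (-1 when empty), and its size in bytes, then compares the two size computations.
final class TranslogSizeSketch {

    record Gen(long gen, long maxSeqNo, long sizeInBytes) {}

    // Contiguous-range algorithm ("by min gen"): find the minimum generation still
    // containing an op >= seqNo, then sum every generation from there on, empty or not.
    static long sizeByMinGen(List<Gen> gens, long seqNo) {
        long minGen = gens.stream().filter(g -> g.maxSeqNo() >= seqNo)
                          .mapToLong(Gen::gen).min().orElse(Long.MAX_VALUE);
        return gens.stream().filter(g -> g.gen() >= minGen).mapToLong(Gen::sizeInBytes).sum();
    }

    // Per-generation algorithm ("gens above seqNo"): only count generations whose
    // maxSeqNo is >= seqNo, which skips empty generations (maxSeqNo == -1).
    static long sizeOfGensAboveSeqNo(List<Gen> gens, long seqNo) {
        return gens.stream().filter(g -> g.maxSeqNo() >= seqNo).mapToLong(Gen::sizeInBytes).sum();
    }

    public static void main(String[] args) {
        List<Gen> gens = List.of(new Gen(1, 2, 100), new Gen(2, -1, 10), new Gen(3, 3, 100));
        System.out.println(sizeByMinGen(gens, 1));         // 210: gen1 + gen2 + gen3
        System.out.println(sizeOfGensAboveSeqNo(gens, 1)); // 200: gen2 skipped, so the two disagree
    }
}
```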
testShouldFlushAfterPeerRecovery was added in #28350 to make sure that the flushing loop triggered by afterWriteOperation eventually terminates. This test relies on the fact that we call afterWriteOperation after making changes to the translog. In #44756, we roll a new generation in RecoveryTarget#finalizeRecovery but do not call afterWriteOperation. Relates #28350 Relates #45073