-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use local checkpoint to calculate min translog gen for recovery #51905
Conversation
Pinging @elastic/es-distributed (:Distributed/Engine) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think that the uncommittedSizeInBytes
and uncommittedOperations
metrics are pretty useless today, as they are not a measure of how much data needs to be recovered/replayed after a crash. I would rather base these metrics on the local checkpoint of the safe commit in all cases, and completely disregard the last commit (which is irrelevant for recovery).
server/src/main/java/org/elasticsearch/index/engine/CombinedDeletionPolicy.java
Show resolved
Hide resolved
++. I pushed c4bccea. |
There are relevant test failures here |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This LGTM. I am not familiar enough with all 6.x development that I can confidently say that this will work in both rolling and full restart upgrades, but my search for an issue did not reveal anything. So better wait for Yannick's approval too.
server/src/main/java/org/elasticsearch/index/engine/NoOpEngine.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/index/engine/InternalEngineTests.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/index/shard/IndexShardIT.java
Outdated
Show resolved
Hide resolved
server/src/test/java/org/elasticsearch/index/engine/NoOpEngineTests.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
@ywelsch @henningandersen Thanks for reviewing. |
We roll a new translog generation and trim operations that are above the global checkpoint during primary-replica resync. If the initOperations is empty, then the stale operation on the replica2 will be discarded as it is the only operation in a translog file (since #51905 where we started using the local checkpoint to calculate the minimum required translog generation for recovery). Otherwise, the stale op will be retained along with initOperations but will be skipped in snapshots. Relates #51905 Closes #52148
Since #51905, we use the local checkpoint of the safe commit to calculate the number of uncommitted operations of a translog stats. If a periodic flush triggered by afterWriteOperation completes before we sync translog, then the last commit is not safe. We also need to sync translog from Engine instead of the translog so that we can advance the safe commit. Relates #51905 Closes #52223
We need to reduce the translog sync interval for indices with translog async setting so that we can have the safe commit in the assertBusy interval. This is needed since #51905, where we use the local checkpoint of the safe commit to calculate the number of uncommitted operations of a translog stats. Closes #52251 Relates #51905
We need to reduce the translog sync interval for indices with translog async setting so that we can have the safe commit in the assertBusy interval. This is needed since #51905, where we use the local checkpoint of the safe commit to calculate the number of uncommitted operations of a translog stats. Closes #52251 Relates #51905
Asserts that no new operations are made into the translog since we re-opened the engine. Relates elastic#51905 Closes elastic#52410
…tic#51905) Today we use the translog_generation of the safe commit as the minimum required translog generation for recovery. This approach has a limitation, where we won't be able to clean up translog unless we flush. Reopening an already recovered engine will create a new empty translog, and we leave it there until we force flush. This commit removes the translog_generation commit tag and uses the local checkpoint of the safe commit to calculate the minimum required translog generation for recovery instead. Closes elastic#49970
Since elastic#51905, we skip translog recovery if the local checkpoint of the safe commit equals to the global checkpoint. This change adjusts the test not to create a new snapshot in that case. Closes elastic#52221 Relates elastic#51905
Since elastic#51905, we use the local checkpoint of the safe commit to calculate the number of uncommitted operations of a translog stats. If a periodic flush triggered by afterWriteOperation completes before we sync translog, then the last commit is not safe. We also need to sync translog from Engine instead of the translog so that we can advance the safe commit. Relates elastic#51905 Closes elastic#52223
Asserts that no new operations are made into the translog since we re-opened the engine. Relates elastic#51905 Closes elastic#52410
Today we use the translog_generation of the safe commit as the minimum required translog generation for recovery. This approach has a limitation, where we won't be able to clean up translog unless we flush. Reopening an already recovered engine will create a new empty translog, and we leave it there until we force flush. This commit removes the translog_generation commit tag and uses the local checkpoint of the safe commit to calculate the minimum required translog generation for recovery instead. Closes #49970
Since #51905, we use the local checkpoint of the safe commit to calculate the number of uncommitted operations of a translog stats. If a periodic flush triggered by afterWriteOperation completes before we sync translog, then the last commit is not safe. We also need to sync translog from Engine instead of the translog so that we can advance the safe commit. Relates #51905 Closes #52223
Today we use the translog_generation of the safe commit as the minimum required translog generation for recovery. This approach has a limitation, where we won't be able to clean up translog unless we flush. Reopening an already recovered engine will create a new empty translog, and we leave it there until we force flush.
This commit removes the translog_generation commit tag and uses the local checkpoint of the safe commit to calculate the minimum required translog generation for recovery instead.
Closes #49970