-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replicate write actions before fsyncing them #49746
Conversation
Currently this is rarely an issue as the AsyncIOProcessor runs synchronously as long as there is one request at a time This means that if you do non-concurrent indexing, that we always first fsync on the primary before we do on the replica.
…al checkpoint after fsync
Pinging @elastic/es-distributed (:Distributed/CRUD) |
@ywelsch just a heads-up, this still needs a fix to make
|
Jenkins run elasticsearch-ci/1 (test failure is unrelated GCS mock repo issue, that I'm investigating right now) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for looking into this @ywelsch . I did an initial read-through and have primarily one comment. AFAICS, we now reach out to replicas before/concurrently with fsync'ing the primary. I wonder if this is safe in case the primary dies after sending replication messages but before fsync'ing. In that case, I am not sure who will mark the primary as stale/not in-sync?
Also, in the case where the primary cannot talk to any replicas and at the same time the primary's own fsync fails, we might mark all copies stale (except for the todo noted in the code now).
server/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java
Show resolved
Hide resolved
...r/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java
Outdated
Show resolved
Hide resolved
We discussed my concerns on another channel. The first one is not valid, since the global checkpoint will only advance when all in-sync copies have advanced their fsync'ed LCP. The second one is really an existing issue that should be addressed separately from this PR. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good :) But I don't understand the implications of this well enough to LGTM I'm afraid :(
...in/ccr/src/test/java/org/elasticsearch/xpack/ccr/action/ShardFollowTaskReplicationTests.java
Outdated
Show resolved
Hide resolved
...rc/main/java/org/elasticsearch/xpack/ccr/action/bulk/TransportBulkShardOperationsAction.java
Outdated
Show resolved
Hide resolved
...in/ccr/src/test/java/org/elasticsearch/xpack/ccr/action/ShardFollowTaskReplicationTests.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like the fix. I left a small comment about the listener. LGTM.
server/src/main/java/org/elasticsearch/action/support/replication/ReplicationOperation.java
Outdated
Show resolved
Hide resolved
...r/src/main/java/org/elasticsearch/action/support/replication/TransportReplicationAction.java
Outdated
Show resolved
Hide resolved
Thanks for all your comments. I've pushed a5ed86d. |
This commit fixes a number of issues with data replication: - Local and global checkpoints are not updated after the new operations have been fsynced, but might capture a state before the fsync. The reason why this probably went undetected for so long is that AsyncIOProcessor is synchronous if you index one item at a time, and hence working as intended unless you have a high enough level of concurrent indexing. As we rely in other places on the assumption that we have an up-to-date local checkpoint in case of synchronous translog durability, there's a risk for the local and global checkpoints not to be up-to-date after replication completes, and that this won't be corrected by the periodic global checkpoint sync. - AsyncIOProcessor also has another "bad" side effect here: if you index one bulk at a time, the bulk is always first fsynced on the primary before being sent to the replica. Further, if one thread is tasked by AsyncIOProcessor to drain the processing queue and fsync, other threads can easily pile more bulk requests on top of that thread. Things are not very fair here, and the thread might continue doing a lot more fsyncs before returning (as the other threads pile more and more on top), which blocks it from returning as a replication request (e.g. if this thread is on the primary, it blocks the replication requests to the replicas from going out, and delaying checkpoint advancement). This commit fixes all these issues, and also simplifies the code that coordinates all the after write actions.
This commit fixes a number of issues with data replication: - Local and global checkpoints are not updated after the new operations have been fsynced, but might capture a state before the fsync. The reason why this probably went undetected for so long is that AsyncIOProcessor is synchronous if you index one item at a time, and hence working as intended unless you have a high enough level of concurrent indexing. As we rely in other places on the assumption that we have an up-to-date local checkpoint in case of synchronous translog durability, there's a risk for the local and global checkpoints not to be up-to-date after replication completes, and that this won't be corrected by the periodic global checkpoint sync. - AsyncIOProcessor also has another "bad" side effect here: if you index one bulk at a time, the bulk is always first fsynced on the primary before being sent to the replica. Further, if one thread is tasked by AsyncIOProcessor to drain the processing queue and fsync, other threads can easily pile more bulk requests on top of that thread. Things are not very fair here, and the thread might continue doing a lot more fsyncs before returning (as the other threads pile more and more on top), which blocks it from returning as a replication request (e.g. if this thread is on the primary, it blocks the replication requests to the replicas from going out, and delaying checkpoint advancement). This commit fixes all these issues, and also simplifies the code that coordinates all the after write actions.
This PR fixes a number of issues with data replication:
AsyncIOProcessor
is synchronous if you index one item at a time, and hence working as intended unless you have a high enough level of concurrent indexing. As we rely in other places on the assumption that we have an up-to-date local checkpoint in case of synchronous translog durability, there's a risk for the local and global checkpoints not to be up-to-date after replication completes, and that this won't be corrected by the periodic global checkpoint sync.AsyncIOProcessor
also has another "bad" side effect here: if you index one bulk at a time, the bulk is always first fsynced on the primary before being sent to the replica. Further, if one thread is tasked byAsyncIOProcessor
to drain the processing queue and fsync, other threads can easily pile more bulk requests on top of that thread. Things are not very fair here, and the thread might continue doing a lot more fsyncs before returning (as the other threads pile more and more on top), which blocks it from returning as a replication request (e.g. if this thread is on the primary, it blocks the replication requests to the replicas from going out, and delaying checkpoint advancement).This PR fixes all these issues, and also simplifies the code that coordinates all the after write actions.