
Replicate write actions before fsyncing them #49746

Merged: 14 commits into elastic:master on Dec 3, 2019

Conversation

@ywelsch (Contributor) commented Dec 2, 2019

This PR fixes a number of issues with data replication:

  • Local and global checkpoints are not guaranteed to be updated only after the new operations have been fsynced; they might capture a state from before the fsync. This probably went undetected for so long because AsyncIOProcessor is synchronous when you index one item at a time, and hence works as intended unless the level of concurrent indexing is high enough. Since we rely in other places on the assumption that the local checkpoint is up to date when translog durability is synchronous, there is a risk that the local and global checkpoints are stale after replication completes, and that this is not corrected by the periodic global checkpoint sync.
  • AsyncIOProcessor also has another bad side effect here: if you index one bulk at a time, the bulk is always fsynced on the primary before being sent to the replica. Further, if one thread is tasked by AsyncIOProcessor to drain the processing queue and fsync, other threads can easily pile more bulk requests on top of it. This is not fair: the draining thread may keep performing fsyncs as more work piles up, which blocks it from completing its own replication request. If that thread is on the primary, for example, it holds back the replication requests to the replicas and delays checkpoint advancement. (A minimal sketch of this batching behavior follows after the summary below.)

This PR fixes all these issues, and also simplifies the code that coordinates all the after-write actions.
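To make the fairness problem concrete, here is a minimal, self-contained sketch of an AsyncIOProcessor-style batching processor. This is an illustration only, not Elasticsearch's org.elasticsearch.common.util.concurrent.AsyncIOProcessor: the class name BatchingProcessor and its methods are hypothetical, but the drain-under-a-permit pattern is the behavior described above.

    import java.util.ArrayDeque;
    import java.util.concurrent.Semaphore;

    // Illustration only -- NOT Elasticsearch's AsyncIOProcessor. A caller
    // enqueues its item; whichever thread wins the permit drains the queue
    // and performs the fsync-like write for everyone, including items that
    // other threads keep piling on while it drains.
    final class BatchingProcessor<T> {
        private final ArrayDeque<T> queue = new ArrayDeque<>();
        private final Semaphore drainPermit = new Semaphore(1);

        void put(T item) {
            synchronized (queue) {
                queue.add(item);
            }
            // Losers of the permit return immediately and rely on the winner
            // to process their item; the winner re-checks the queue so that
            // late arrivals are not left behind.
            while (hasPending() && drainPermit.tryAcquire()) {
                try {
                    drain();
                } finally {
                    drainPermit.release();
                }
            }
        }

        private boolean hasPending() {
            synchronized (queue) {
                return !queue.isEmpty();
            }
        }

        private void drain() {
            T item;
            while ((item = poll()) != null) {
                write(item); // stands in for the translog fsync
            }
        }

        private T poll() {
            synchronized (queue) {
                return queue.poll();
            }
        }

        private void write(T item) {
            // fsync-like work happens here
        }
    }

The thread that wins drainPermit keeps polling the queue, so every item another thread enqueues while it drains extends its stay in the loop; this is exactly how a primary-side thread can end up fsyncing on behalf of everyone else before it can send its own replication request.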

Currently this is rarely an issue, as AsyncIOProcessor runs synchronously as long as there is only one request at a time.

This means that with non-concurrent indexing, we always fsync on the primary before we do on the replica.
@ywelsch added the >bug, :Distributed Indexing/CRUD, v8.0.0, and v7.6.0 labels Dec 2, 2019
@elasticmachine (Collaborator) commented:

Pinging @elastic/es-distributed (:Distributed/CRUD)

@original-brownbear (Member) commented:

@ywelsch just a heads-up, this still needs a fix to make org.elasticsearch.xpack.ccr.action.ShardFollowTaskReplicationTests.CcrAction#performOnPrimary compile:

> Task :x-pack:plugin:ccr:compileTestJava FAILED
/home/brownbear/src/elasticsearch/x-pack/plugin/ccr/src/test/java/org/elasticsearch/xpack/ccr/action/ShardFollowTaskReplicationTests.java:682: error: cannot find symbol
                        ccrResult.respond(listener);
                                 ^
  symbol:   method respond(ActionListener<BulkShardOperationsResponse>)
  location: variable ccrResult of type WritePrimaryResult<BulkShardOperationsRequest,BulkShardOperationsResponse>
Note: Some input files use or override a deprecated API.
Note: Recompile with -Xlint:deprecation for details.
1 error
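For context, the adaptation the test needed looks roughly like the following. This is a hedged sketch, not the exact diff from the PR: the hook name runPostReplicationActions and the finalResponseIfSuccessful field are assumptions about the replacement API, and ccrResult and listener come from the surrounding test.

    // Hypothetical replacement for the failing ccrResult.respond(listener)
    // call, assuming respond(...) was split into a Void-completing
    // post-replication hook plus a separately exposed final response:
    ccrResult.runPostReplicationActions(ActionListener.wrap(
        ignored -> listener.onResponse(ccrResult.finalResponseIfSuccessful),
        listener::onFailure));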

@original-brownbear (Member) commented:

Jenkins run elasticsearch-ci/1 (the test failure is an unrelated GCS mock repo issue that I'm investigating right now)

@henningandersen (Contributor) left a comment:


Thanks for looking into this @ywelsch. I did an initial read-through and have primarily one comment. AFAICS, we now reach out to replicas before/concurrently with fsync'ing the primary. I wonder if this is safe in case the primary dies after sending replication messages but before fsync'ing. In that case, I am not sure who will mark the primary as stale/not in-sync?

Also, in the case where the primary cannot talk to any replicas and at the same time the primary's own fsync fails, we might mark all copies stale (except for the todo noted in the code now).

@henningandersen
Copy link
Contributor

We discussed my concerns on another channel. The first one is not valid, since the global checkpoint will only advance once all in-sync copies have advanced their fsync'ed local checkpoint (LCP).

The second one is really an existing issue that should be addressed separately from this PR.
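For illustration, here is a small self-contained sketch of the invariant henningandersen is referring to. It mimics the rule, not Elasticsearch's actual ReplicationTracker code: the global checkpoint is the minimum of the fsync'ed local checkpoints of the in-sync copies, so an operation that was replicated but never fsync'ed on the primary is never covered by it.

    import java.util.Map;

    // Toy model of the safety argument: the global checkpoint (GCP) can only
    // advance to the minimum fsync'ed local checkpoint (LCP) across in-sync
    // copies, so a primary that dies after replicating but before fsync'ing
    // cannot have exposed those operations through the GCP.
    final class GlobalCheckpointSketch {
        static long globalCheckpoint(Map<String, Long> fsyncedLocalCheckpoints) {
            return fsyncedLocalCheckpoints.values().stream()
                .mapToLong(Long::longValue)
                .min()
                .orElse(-1L); // -1 mirrors SequenceNumbers.NO_OPS_PERFORMED
        }

        public static void main(String[] args) {
            // The primary replicated op 10 to both replicas but has not yet
            // fsync'ed it, so its own LCP is still 9; the GCP stays at 9.
            System.out.println(globalCheckpoint(
                Map.of("primary", 9L, "replica-1", 10L, "replica-2", 10L)));
        }
    }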

@original-brownbear (Member) left a comment:


Code looks good :) But I don't understand the implications of this well enough to LGTM I'm afraid :(

@dnhatn (Member) left a comment:


I like the fix. I left a small comment about the listener. LGTM.

@ywelsch (Contributor, Author) commented Dec 3, 2019

Thanks for all your comments. I've pushed a5ed86d.

@ywelsch merged commit 8c165e0 into elastic:master on Dec 3, 2019
ywelsch added a commit that referenced this pull request Dec 3, 2019
SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
@mfussenegger mentioned this pull request Mar 24, 2020
Labels
>bug, :Distributed Indexing/CRUD, v7.6.0, v8.0.0-alpha1