Interpret `?timeout=-1` as infinite ack timeout #107675
APIs which perform cluster state updates typically accept the `?master_timeout=` and `?timeout=` parameters to respectively set the pending task queue timeout and the acking timeout for the cluster state update. Both of these parameters accept the value `-1`, but `?master_timeout=-1` means to wait indefinitely whereas `?timeout=-1` means the same thing as `?timeout=0`, namely that acking times out immediately on commit. There are some situations where it makes sense to wait for as long as possible for nodes to ack a cluster state update. In practice this wait is bounded by other mechanisms (e.g. the lag detector will remove the node from the cluster after a couple of minutes of failing to apply cluster state updates) but these are not really the concern of clients. Therefore with this commit we change the meaning of `?timeout=-1` to mean that the acking timeout is infinite.
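As a minimal sketch of the semantics proposed above (not the actual Elasticsearch implementation), the three cases for the acking timeout can be written out as follows. The names `AckTimeoutSketch`, `AckWait` and `classify` are hypothetical and exist only for illustration. Treating every negative value as unbounded is an assumption that mirrors the `ackTimeout.millis() < 0` check in this PR's diff; only `-1` is described here.

```java
// Hypothetical illustration only: these names do not exist in Elasticsearch.
final class AckTimeoutSketch {

    /** How long the master waits for nodes to ack a committed cluster state update. */
    enum AckWait {
        TIMES_OUT_ON_COMMIT, // ?timeout=0: acking times out immediately on commit
        BOUNDED,             // ?timeout=<positive>: wait up to the given duration
        UNBOUNDED            // ?timeout=-1 after this change: no acking timeout
    }

    /**
     * Classifies an ack timeout given in milliseconds.
     * Assumption: any negative value is treated like -1.
     */
    static AckWait classify(long ackTimeoutMillis) {
        if (ackTimeoutMillis < 0) {
            return AckWait.UNBOUNDED;
        }
        if (ackTimeoutMillis == 0) {
            return AckWait.TIMES_OUT_ON_COMMIT;
        }
        return AckWait.BOUNDED;
    }
}
```

Before this change, `-1` would have fallen into the same case as `0`. Under this sketch `classify(-1)` returns `UNBOUNDED`, while `classify(0)` still returns `TIMES_OUT_ON_COMMIT`.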
Pinging @elastic/es-distributed (Team:Distributed)
Hi @DaveCTurner, I've created a changelog YAML for you. Note that since this PR is labelled
We've reached consensus that this is an acceptable change to make so this is good to review now.
LGTM
Sorry for the delayed review. It is always a great learning opportunity to see changes around master coordination. Thanks!
    if (ackTimeout.millis() < 0) {
        if (countDown.countDown()) {
            finish();
        }
        return;
Not really related to this PR, but for my learning: I am trying to understand the sequence in which a cluster state update is published. Is the following order correct?
1. Master sends cluster state publish requests to all nodes.
2. After receiving publish responses from a quorum of nodes, this `onCommit` is called.
3. Master sends apply commit requests to all nodes.
4. On each apply commit response, master calls `onNodeAck`.

When step 2 completes and all nodes have responded in step 4, the overall request is considered acknowledged.
Reading the code here, it seems to me that when `onCommit` is called, there is a chance that step 4 has already completed (since it checks the `countDown` and calls `finish`). But I am not sure how that can happen, since `onCommit` is called before any apply commit request can be sent (code)? Or is it to take care of a single-node cluster? I must be missing something (or even many things). I'd appreciate it if you could help clarify. Thanks!
The sequence is right.
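As a rough model of that sequence (a simplified sketch, not the `MasterService` code), acking can be pictured as a counter that needs one count per node plus one for the commit. The class `AckSequenceSketch` and its members are hypothetical names, and the "nodes plus one" starting count is an assumption made for illustration.

```java
// Hypothetical, single-threaded model of commit/ack counting.
final class AckSequenceSketch {
    private int remaining;    // counts still outstanding before acking completes
    private boolean finished; // set once the commit and every node ack have arrived

    AckSequenceSketch(int nodeCount) {
        // Assumption: one count per node, plus one for the commit itself.
        this.remaining = nodeCount + 1;
    }

    /** Step 2: a quorum of publish responses has arrived, so the state is committed. */
    void onCommit() {
        countDown();
    }

    /** Step 4: one node has responded to its apply commit request. */
    void onNodeAck(String nodeName) {
        countDown();
    }

    private void countDown() {
        if (--remaining == 0) {
            finished = true;
        }
    }

    boolean isFinished() {
        return finished;
    }
}
```

Because every event simply counts down, this model reaches `finished` regardless of whether the commit or the last node ack arrives first, so it does not depend on any ordering guarantee between `onCommit` and `onNodeAck`.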
You could be right about `onCommit` never actually finishing the acking today. FWIW this code was added in #31303 (6.4.0) and we've rewritten much of the surrounding code since then. That said, it's a fairly delicate argument to prove this, whereas the "obviously correct" code as written today is robust and not meaningfully less efficient.
In particular, it's hard to see that `onCommit` is called before any `ApplyCommit` is sent. The relevant code is `org.elasticsearch.cluster.coordination.Publication.PublicationTarget#handlePublishResponse`:
    void handlePublishResponse(PublishResponse publishResponse) {
        assert isWaitingForQuorum() : this;
        logger.trace("handlePublishResponse: handling [{}] from [{}])", publishResponse, discoveryNode);
        if (applyCommitRequest.isPresent()) {
            sendApplyCommit();
        } else {
            try {
                Publication.this.handlePublishResponse(discoveryNode, publishResponse).ifPresent(applyCommit -> {
                    assert applyCommitRequest.isPresent() == false;
                    applyCommitRequest = Optional.of(applyCommit);
                    ackListener.onCommit(TimeValue.timeValueMillis(currentTimeSupplier.getAsLong() - startTime));
                    publicationTargets.stream()
                        .filter(PublicationTarget::isWaitingForQuorum)
                        .forEach(PublicationTarget::sendApplyCommit);
                });
            } catch (Exception e) {
                setFailed(e);
                onPossibleCommitFailure();
            }
        }
    }
As written, you could have the committing thread setting `applyCommitRequest`, then pausing before calling `ackListener.onCommit`, while another thread concurrently processes a later `PublishResponse`, discovers that `applyCommitRequest.isPresent()` and sends the `ApplyCommit`. But then if you look at how this code is called, eventually you discover that this all happens underneath `Coordinator#mutex`, so these things cannot happen concurrently. But relying on a mutex in `Coordinator` to protect against concurrency in `Publication` as part of the correctness argument for `MasterService` is too deeply opaque for my liking.
Thanks a lot for the explanation. TIL 🙇
    Can also be set to `-1` to indicate that the request should never timeout.
    end::master-timeout[]
It would be great to also explain how `master_timeout` is computed. By reading the code, it seems to start when the task is added to the queue and expires if the task does not get processed by the master. And I believe during this waiting, the task is visible via the PendingTasks API?
Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled