Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Interpret ?timeout=-1 as infinite ack timeout #107675

Merged

Conversation

DaveCTurner
Copy link
Contributor

APIs which perform cluster state updates typically accept the
?master_timeout= and ?timeout= parameters to respectively set the
pending task queue timeout and the acking timeout for the cluster state
update. Both of these parameters accept the value -1, but
?master_timeout=-1 means to wait indefinitely whereas ?timeout=-1
means the same thing as ?timeout=0, namely that acking times out
immediately on commit.

There are some situations where it makes sense to wait for as long as
possible for nodes to ack a cluster state update. In practice this wait
is bounded by other mechanisms (e.g. the lag detector will remove the
node from the cluster after a couple of minutes of failing to apply
cluster state updates) but these are not really the concern of clients.

Therefore with this commit we change the meaning of ?timeout=-1 to
mean that the acking timeout is infinite.

APIs which perform cluster state updates typically accept the
`?master_timeout=` and `?timeout=` parameters to respectively set the
pending task queue timeout and the acking timeout for the cluster state
update. Both of these parameters accept the value `-1`, but
`?master_timeout=-1` means to wait indefinitely whereas `?timeout=-1`
means the same thing as `?timeout=0`, namely that acking times out
immediately on commit.

There are some situations where it makes sense to wait for as long as
possible for nodes to ack a cluster state update. In practice this wait
is bounded by other mechanisms (e.g. the lag detector will remove the
node from the cluster after a couple of minutes of failing to apply
cluster state updates) but these are not really the concern of clients.

Therefore with this commit we change the meaning of `?timeout=-1` to
mean that the acking timeout is infinite.
@DaveCTurner DaveCTurner added >enhancement >breaking :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v8.15.0 labels Apr 22, 2024
Copy link
Contributor

Documentation preview:

@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@elasticsearchmachine elasticsearchmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Apr 22, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've created a changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@DaveCTurner DaveCTurner requested a review from ywangd April 26, 2024 09:44
@DaveCTurner
Copy link
Contributor Author

We've reached consensus that this is an acceptable change to make so this is good to review now.

Copy link
Member

@ywangd ywangd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Sorry for the delayed review. It is always a great learning opportunity to see changes around master coordination. Thanks!

Comment on lines +715 to +719
if (ackTimeout.millis() < 0) {
if (countDown.countDown()) {
finish();
}
return;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not really related to this PR but for my learning. I am trying to understand the sequence of how cluster state happens. Is the following order correct?

  1. Master sends cluster state publish requests to all nodes
  2. After receiving publish response from a quorum of nodes, this onCommit is called.
  3. Master sends apply commit requests to all nodes
  4. On each apply commit response, master calls onNodeAck.

When step 2 completes and all nodes responded in step 4, the overall request is considered as acknowledged.

Reading the code here, it seems to me that when onCommit is called, there is a chance that step 4 has already completed (since it checks the countDown and call finish). But I am not sure how that can happen since onCommit is called before any apply commit request can be sent (code)? Or is it to take care of single node cluster? I must be missing something (or even many things). I'd appreciate if you could help clarify it. Thanks!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The sequence is right.

You could be right about onCommit never actually finishing the acking today. FWIW this code was added in #31303 (6.4.0) and we've rewritten much of the surrounding code since then. That said, it's a fairly delicate argument to prove this, whereas the "obviously correct" code as written today is robust and not meaningfully less efficient.

In particular, it's hard to see that onCommit is called before any ApplyCommit is sent. The relevant code is org.elasticsearch.cluster.coordination.Publication.PublicationTarget#handlePublishResponse:

        void handlePublishResponse(PublishResponse publishResponse) {
            assert isWaitingForQuorum() : this;
            logger.trace("handlePublishResponse: handling [{}] from [{}])", publishResponse, discoveryNode);
            if (applyCommitRequest.isPresent()) {
                sendApplyCommit();
            } else {
                try {
                    Publication.this.handlePublishResponse(discoveryNode, publishResponse).ifPresent(applyCommit -> {
                        assert applyCommitRequest.isPresent() == false;
                        applyCommitRequest = Optional.of(applyCommit);
                        ackListener.onCommit(TimeValue.timeValueMillis(currentTimeSupplier.getAsLong() - startTime));
                        publicationTargets.stream()
                            .filter(PublicationTarget::isWaitingForQuorum)
                            .forEach(PublicationTarget::sendApplyCommit);
                    });
                } catch (Exception e) {
                    setFailed(e);
                    onPossibleCommitFailure();
                }
            }
        }

As written, you could have the committing thread setting applyCommitRequest, then pausing before calling ackListener.onCommit, while another thread concurrently processes a later PublishResponse, discovers that applyCommitRequest.isPresent() and sends the ApplyCommit. But then if you look at how this code is called eventually you discover that this all happens underneath Coordinator#mutex so these things cannot happen concurrently. But relying on a mutex in Coordinator to protect against concurrency in Publication as part of the correctness argument for MasterService is too deeply opaque for my liking.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for the explanation. TIL 🙇

Comment on lines 1223 to 1224
Can also be set to `-1` to indicate that the request should never timeout.
end::master-timeout[]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It would be great to also explain how master_timeout is computed. By reading the code, it seems to start when the task is added to the queue and expires if the task does not get processed by the master. And I believe during this waiting, the task is visible via the PendingTasks API?

@DaveCTurner DaveCTurner added the auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) label Apr 30, 2024
@elasticsearchmachine
Copy link
Collaborator

Hi @DaveCTurner, I've updated the changelog YAML for you. Note that since this PR is labelled >breaking, you need to update the changelog YAML to fill out the extended information sections.

@elasticsearchmachine elasticsearchmachine merged commit fc287bd into elastic:main Apr 30, 2024
15 checks passed
@DaveCTurner DaveCTurner deleted the 2024/04/22/infinite-ack-timeout branch April 30, 2024 13:54
arteam added a commit that referenced this pull request May 2, 2024
This test doesn't fail anymore, I've run it 1000 times locally. This test
got introduced in #107050, and I believe the test got fixed in #107675.
Unfortunately, the got muted before #107675 got merged, so I can't confirm
that #107675 fixed the test on CI.
arteam added a commit that referenced this pull request May 3, 2024
This test doesn't fail anymore, I've run it 1000 times locally.

This test got introduced in #107050, and I believe the test got fixed in #107675. Unfortunately, the got muted before #107675 got merged, so I can't confirm that PR actually fixed the test on CI.
@DaveCTurner DaveCTurner restored the 2024/04/22/infinite-ack-timeout branch June 17, 2024 06:17
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
auto-merge-without-approval Automatically merge pull request when CI checks pass (NB doesn't wait for reviews!) >breaking :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. v8.15.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants