-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Only ack cluster state updates successfully applied on all nodes #30672
Conversation
Pinging @elastic/es-distributed |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Woah that was easy :)
I left one discussion point regarding the repeated calls to ackedTaskListener.mustAck
.
if (!node.equals(nodes.getMasterNode())) { | ||
return; | ||
} | ||
if (node.equals(masterNode) == false && ackedTaskListener.mustAck(node) == false) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It worries me a little that ackedTaskListener.mustAck(node)
may give a different response here from the call in the constructor. Can we keep a set of the expected-to-ack nodes instead? Maybe only for the purposes of an assertion, or maybe check membership of the set instead of this expression?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems a little overkill to me, especially as we only have very few implementations of this, which are all pretty straightforward. Keeping an additional set of nodes and checking it also seems computationally and memory-wise more expensive as we can potentially have a large number of listeners. I'm wondering if this just adds cruft without any real benefit. How about adding an additional sentence to the Javadocs of mustAck
that states that implementations of this method must, given the same listener instance and the same parameter return the same result?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I'm ok with it going in the docs.
This look good in the sense that it correctly responds with false on an exception. We also talked about exposing the exception in the response. Do you prefer to do it as a follow up? |
I see it unrelated to this PR, which fixes a bug in the acking functionality. I'm also not sure about the use of exposing the exception in the response (I saw that more as a discussion point, rather than a concrete plan to enact). |
I can see that argument and fine with getting this in. PR approved.
Ok. Let's discuss it. I'm curious to hear your opinion. IMO it's the immediate first question after seeing a false in the ack - was there any failure? of course we can go even further and indicate which nodes didn't ack on time and if there were multiple failures, but exposing the last exception seems super trivial and I can see it will be nice to have in our test assertions, at the very least. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM
) The cluster state acking mechanism currently incorrectly acks cluster state updates that have not successfully been applied on all nodes. In a situation, for example, where some of the nodes disconnect during publishing, and don't acknowledge receiving the new cluster state, the user-facing action (e.g. create index request) will still consider this as an ack.
* master: (25 commits) [DOCS] Splits auditing.asciidoc into smaller files Reintroduce mandatory http pipelining support (elastic#30820) Painless: Types Section Clean Up (elastic#30283) Add support for indexed shape routing in geo_shape query (elastic#30760) [test] java tests for archive packaging (elastic#30734) Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813) [DOCS] Fix more edit URLs in Stack Overview (elastic#30704) Use correct cluster state version for node fault detection (elastic#30810) Change serialization version of doc-value fields. [DOCS] Fixes broken link for native realm [DOCS] Clarified audit.index.client.hosts (elastic#30797) [TEST] Don't expect acks when isolating nodes Add a `format` option to `docvalue_fields`. (elastic#29639) Fixes UpdateSettingsRequestStreamableTests mutate bug Mustes {p0=snapshot.get_repository/10_basic/*} YAML test Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled" Only allow x-pack metadata if all nodes are ready (elastic#30743) Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled Use original settings on full-cluster restart (elastic#30780) Only ack cluster state updates successfully applied on all nodes (elastic#30672) ...
* 6.x: [DOCS] Fixes typos in security settings Add support for indexed shape routing in geo_shape query (#30760) [DOCS] Splits auditing.asciidoc into smaller files Painless: Types Section Clean Up (#30283) [test] java tests for archive packaging (#30734) Deprecate http.pipelining setting (#30786) [DOCS] Fix more edit URLs in Stack Overview (#30704) Use correct cluster state version for node fault detection (#30810) [DOCS] Fixes broken link for native realm [DOCS] Clarified audit.index.client.hosts (#30797) Change serialization version of doc-value fields. Add a `format` option to `docvalue_fields`. (#29639) [TEST] Don't expect acks when isolating nodes Fixes UpdateSettingsRequestStreamableTests mutate bug Revert "Add more yaml tests for get alias API (#29513)" Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled" Only allow x-pack metadata if all nodes are ready (#30743) Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled Use original settings on full-cluster restart (#30780) Only ack cluster state updates successfully applied on all nodes (#30672) Replace Request#setHeaders with addHeader (#30588) [TEST] remove endless wait in RestClientTests (#30776) QA: Add xpack tests to rolling upgrade (#30795) Add support for search templates to the high-level REST client. (#30473) Reduce CLI scripts to one-liners on Windows (#30772) Fold RestGetAllSettingsAction in RestGetSettingsAction (#30561) Add more yaml tests for get alias API (#29513) [Docs] Fix script-fields snippet execution (#30693) Convert FieldCapabilitiesResponse to a ToXContentObject. (#30182) Remove assert statements from field caps documentation. (#30601) Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (#30181) Add support for field capabilities to the high-level REST client. (#29664) [DOCS] Add SAML configuration information (#30548) [DOCS] Remove X-Pack references from SQL CLI (#30694) [Docs] Fix typo in circuit breaker docs (#29659) [Feature] Adding a char_group tokenizer (#24186) Increase the maximum number of filters that may be in the cache. (#30655) [Docs] Fix broken cross link in documentation Test: wait for netty threads in a JUnit ClassRule (#30763) [Security] Include an empty json object in an json array when FLS filters out all fields (#30709) [DOCS] fixed incorrect default [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk Enable installing plugins from snapshots.elastic.co (#30765) Ignore empty completion input (#30713) Fix docs failure on language analyzers (#30722) [Docs] Fix inconsistencies in snapshot/restore doc (#30480) Add Delete Repository High Level REST API (#30666) Reduce CLI scripts to one-liners (#30759)
* master: [DOCS] Fixes typos in security settings Fix GeoShapeQueryBuilder serialization after backport [DOCS] Splits auditing.asciidoc into smaller files Reintroduce mandatory http pipelining support (#30820) Painless: Types Section Clean Up (#30283) Add support for indexed shape routing in geo_shape query (#30760) [test] java tests for archive packaging (#30734) Revert "Make http pipelining support mandatory (#30695)" (#30813) [DOCS] Fix more edit URLs in Stack Overview (#30704) Use correct cluster state version for node fault detection (#30810) Change serialization version of doc-value fields. [DOCS] Fixes broken link for native realm [DOCS] Clarified audit.index.client.hosts (#30797) [TEST] Don't expect acks when isolating nodes Add a `format` option to `docvalue_fields`. (#29639) Fixes UpdateSettingsRequestStreamableTests mutate bug Mustes {p0=snapshot.get_repository/10_basic/*} YAML test Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled" Only allow x-pack metadata if all nodes are ready (#30743) Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled Use original settings on full-cluster restart (#30780) Only ack cluster state updates successfully applied on all nodes (#30672) Expose Lucene's FeatureField. (#30618) Fix a grammatical error in the 'search types' documentation. Remove http pipelining from integration test case (#30788)
* Publish timeout is set to `0` so out of order processing of states on the node can lead to a `false` ack response * See elastic#30672 * Closes elastic#36813
i am curious about why add "node.equals(masterNode)" can make this bug fixed? would you please give some interpretation? @ywelsch |
@luyuncheng the main change is this on-line change here: https://github.com/elastic/elasticsearch/pull/30672/files#diff-4fcb364bd9048f957146db1613387765R64 The change in |
The cluster state acking mechanism currently incorrectly acks cluster state updates that have not successfully been applied on all nodes. In a situation, for example, where some of the nodes disconnect during publishing, and don't acknowledge receiving the new cluster state, the user-facing action (e.g. create index request) will still consider this as an ack.
Thanks to @DaveCTurner for finding this.