Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Only ack cluster state updates successfully applied on all nodes #30672

Merged
merged 3 commits into from
May 23, 2018

Conversation

ywelsch
Copy link
Contributor

@ywelsch ywelsch commented May 17, 2018

The cluster state acking mechanism currently incorrectly acks cluster state updates that have not successfully been applied on all nodes. In a situation, for example, where some of the nodes disconnect during publishing, and don't acknowledge receiving the new cluster state, the user-facing action (e.g. create index request) will still consider this as an ack.

Thanks to @DaveCTurner for finding this.

@ywelsch ywelsch added >bug v7.0.0 :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v6.4.0 labels May 17, 2018
@ywelsch ywelsch requested review from bleskes and DaveCTurner May 17, 2018 07:51
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Woah that was easy :)

I left one discussion point regarding the repeated calls to ackedTaskListener.mustAck.

if (!node.equals(nodes.getMasterNode())) {
return;
}
if (node.equals(masterNode) == false && ackedTaskListener.mustAck(node) == false) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It worries me a little that ackedTaskListener.mustAck(node) may give a different response here from the call in the constructor. Can we keep a set of the expected-to-ack nodes instead? Maybe only for the purposes of an assertion, or maybe check membership of the set instead of this expression?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That seems a little overkill to me, especially as we only have very few implementations of this, which are all pretty straightforward. Keeping an additional set of nodes and checking it also seems computationally and memory-wise more expensive as we can potentially have a large number of listeners. I'm wondering if this just adds cruft without any real benefit. How about adding an additional sentence to the Javadocs of mustAck that states that implementations of this method must, given the same listener instance and the same parameter return the same result?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sure, I'm ok with it going in the docs.

@bleskes
Copy link
Contributor

bleskes commented May 19, 2018

This look good in the sense that it correctly responds with false on an exception. We also talked about exposing the exception in the response. Do you prefer to do it as a follow up?

@ywelsch
Copy link
Contributor Author

ywelsch commented May 19, 2018

This look good in the sense that it correctly responds with false on an exception. We also talked about exposing the exception in the response. Do you prefer to do it as a follow up?

I see it unrelated to this PR, which fixes a bug in the acking functionality. I'm also not sure about the use of exposing the exception in the response (I saw that more as a discussion point, rather than a concrete plan to enact).

@bleskes
Copy link
Contributor

bleskes commented May 21, 2018

I see it unrelated to this PR, which fixes a bug in the acking functionality

I can see that argument and fine with getting this in. PR approved.

I'm also not sure about the use of exposing the exception in the response (I saw that more as a discussion point, rather than a concrete plan to enact).

Ok. Let's discuss it. I'm curious to hear your opinion. IMO it's the immediate first question after seeing a false in the ack - was there any failure? of course we can go even further and indicate which nodes didn't ack on time and if there were multiple failures, but exposing the last exception seems super trivial and I can see it will be nice to have in our test assertions, at the very least.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM

@ywelsch ywelsch merged commit cceaa9a into elastic:master May 23, 2018
ywelsch added a commit that referenced this pull request May 23, 2018
)

The cluster state acking mechanism currently incorrectly acks cluster state updates that have not
successfully been applied on all nodes. In a situation, for example, where some of the nodes
disconnect during publishing, and don't acknowledge receiving the new cluster state, the user-facing
action (e.g. create index request) will still consider this as an ack.
ywelsch added a commit that referenced this pull request May 23, 2018
With #30672, acking expects *all* nodes to successfully apply the cluster state.
The testElectMasterWithLatestVersion test was checking for an ack while isolating
one node in the test.

Relates to #30672
ywelsch added a commit that referenced this pull request May 23, 2018
With #30672, acking expects *all* nodes to successfully apply the cluster state.
The testElectMasterWithLatestVersion test was checking for an ack while isolating
one node in the test.

Relates to #30672
jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request May 23, 2018
* master: (25 commits)
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (elastic#30820)
  Painless: Types Section Clean Up (elastic#30283)
  Add support for indexed shape routing in geo_shape query (elastic#30760)
  [test] java tests for archive packaging (elastic#30734)
  Revert "Make http pipelining support mandatory (elastic#30695)" (elastic#30813)
  [DOCS] Fix more edit URLs in Stack Overview (elastic#30704)
  Use correct cluster state version for node fault detection (elastic#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (elastic#30797)
  [TEST] Don't expect acks when isolating nodes
  Add a `format` option to `docvalue_fields`. (elastic#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (elastic#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (elastic#30780)
  Only ack cluster state updates successfully applied on all nodes (elastic#30672)
  ...
dnhatn added a commit that referenced this pull request May 24, 2018
* 6.x:
  [DOCS] Fixes typos in security settings
  Add support for indexed shape routing in geo_shape query (#30760)
  [DOCS] Splits auditing.asciidoc into smaller files
  Painless: Types Section Clean Up (#30283)
  [test] java tests for archive packaging (#30734)
  Deprecate http.pipelining setting (#30786)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  Change serialization version of doc-value fields.
  Add a `format` option to `docvalue_fields`. (#29639)
  [TEST] Don't expect acks when isolating nodes
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Revert "Add more yaml tests for get alias API (#29513)"
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Replace Request#setHeaders with addHeader (#30588)
  [TEST] remove endless wait in RestClientTests (#30776)
  QA: Add xpack tests to rolling upgrade (#30795)
  Add support for search templates to the high-level REST client. (#30473)
  Reduce CLI scripts to one-liners on Windows (#30772)
  Fold RestGetAllSettingsAction in RestGetSettingsAction (#30561)
  Add more yaml tests for get alias API (#29513)
  [Docs] Fix script-fields snippet execution (#30693)
  Convert FieldCapabilitiesResponse to a ToXContentObject. (#30182)
  Remove assert statements from field caps documentation. (#30601)
  Fix a bug in FieldCapabilitiesRequest#equals and hashCode. (#30181)
  Add support for field capabilities to the high-level REST client. (#29664)
  [DOCS] Add SAML configuration information (#30548)
  [DOCS] Remove X-Pack references from SQL CLI (#30694)
  [Docs] Fix typo in circuit breaker docs (#29659)
  [Feature] Adding a char_group tokenizer (#24186)
  Increase the maximum number of filters that may be in the cache. (#30655)
  [Docs] Fix broken cross link in documentation
  Test: wait for netty threads in a JUnit ClassRule (#30763)
  [Security] Include an empty json object in an json array when FLS filters out all fields (#30709)
  [DOCS] fixed incorrect default
  [TEST] Wait for CS to be fully applied in testDeleteCreateInOneBulk
  Enable installing plugins from snapshots.elastic.co (#30765)
  Ignore empty completion input (#30713)
  Fix docs failure on language analyzers (#30722)
  [Docs] Fix inconsistencies in snapshot/restore doc (#30480)
  Add Delete Repository High Level REST API (#30666)
  Reduce CLI scripts to one-liners (#30759)
dnhatn added a commit that referenced this pull request May 24, 2018
* master:
  [DOCS] Fixes typos in security settings
  Fix GeoShapeQueryBuilder serialization after backport
  [DOCS] Splits auditing.asciidoc into smaller files
  Reintroduce mandatory http pipelining support (#30820)
  Painless: Types Section Clean Up (#30283)
  Add support for indexed shape routing in geo_shape query (#30760)
  [test] java tests for archive packaging (#30734)
  Revert "Make http pipelining support mandatory (#30695)" (#30813)
  [DOCS] Fix more edit URLs in Stack Overview (#30704)
  Use correct cluster state version for node fault detection (#30810)
  Change serialization version of doc-value fields.
  [DOCS] Fixes broken link for native realm
  [DOCS] Clarified audit.index.client.hosts (#30797)
  [TEST] Don't expect acks when isolating nodes
  Add a `format` option to `docvalue_fields`. (#29639)
  Fixes UpdateSettingsRequestStreamableTests mutate bug
  Mustes {p0=snapshot.get_repository/10_basic/*} YAML test
  Revert "Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled"
  Only allow x-pack metadata if all nodes are ready (#30743)
  Mutes MachineLearningTests.testNoAttributes_givenSameAndMlEnabled
  Use original settings on full-cluster restart (#30780)
  Only ack cluster state updates successfully applied on all nodes (#30672)
  Expose Lucene's FeatureField. (#30618)
  Fix a grammatical error in the 'search types' documentation.
  Remove http pipelining from integration test case (#30788)
@ywelsch ywelsch mentioned this pull request Sep 13, 2018
ywelsch added a commit that referenced this pull request Jan 24, 2019
As acking can fail for any reason (unrelated node being too slow, node disconnecting), it should not
be required for acking to succeed in order for index requests with dynamic mapping updates to
successfully complete.

Relates to #30672 and Closes #30844
ywelsch added a commit that referenced this pull request Jan 24, 2019
As acking can fail for any reason (unrelated node being too slow, node disconnecting), it should not
be required for acking to succeed in order for index requests with dynamic mapping updates to
successfully complete.

Relates to #30672 and Closes #30844
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Jan 31, 2019
* Publish timeout is set to `0` so out of order processing of states on the node can lead to a `false` ack response
  * See elastic#30672
* Closes elastic#36813
original-brownbear added a commit that referenced this pull request Jan 31, 2019
* Publish timeout is set to `0` so out of order processing of states on the node can lead to a `false` ack response
  * See #30672
* Closes #36813
@luyuncheng
Copy link
Contributor

i am curious about why add "node.equals(masterNode)" can make this bug fixed? would you please give some interpretation? @ywelsch

@ywelsch
Copy link
Contributor Author

ywelsch commented Aug 6, 2019

@luyuncheng the main change is this on-line change here: https://github.com/elastic/elasticsearch/pull/30672/files#diff-4fcb364bd9048f957146db1613387765R64

The change in MasterService was mainly to make acking more resilient against implementation bugs as the system implicitly relied on the fact that mustAck should always return true for the current master. In order to avoid implementors of this method to violate this assumption, it was simply hardcoded in MasterService.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
>bug :Distributed Coordination/Cluster Coordination Cluster formation and cluster state publication, including cluster membership and fault detection. v6.4.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants