Fail shard if IndexShard#storeStats runs into an IOException #32241

Merged

Contributor

@andrershov andrershov commented Jul 20, 2018

Fix for #29008.
There are two places that I'm not sure about; I've marked them with TODO.
Based on previous pull request #29078, which can be closed if this one goes through.

@andrershov andrershov requested review from ywelsch and removed request for DaveCTurner July 20, 2018 18:46
Contributor

@bleskes bleskes left a comment

Looks great. I left some minor comments.

@@ -917,6 +917,13 @@ public StoreStats storeStats() {
try {
return store.stats();
} catch (IOException e) {
failShard("Failing shard because of exception during storeState " + e.getMessage(), e);
//TODO should close method be called inside failShard?
Contributor

This is confusing. Currently it ends up being closed by IndicesClusterStateService#failAndRemoveShard. I agree with you that it would be better for the shard to close itself (probably in IndexShard.ShardEventListener#onFailedEngine), but that's something for another PR. Can you remove the close method call here?

Contributor Author

@bleskes do you mean that in a real application (not in unit tests), IndicesClusterStateService will realize that the shard has failed and will close it? That's why the close call would no longer be needed.

null, new InternalEngineFactory(), () -> {}, EMPTY_EVENT_LISTENER);

recoverShardFromStore(shard);
boolean corruptIndexException = true;
Contributor

nit: make this final and set it within the relevant if clause? That way, you can't change it again by mistake.
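A minimal sketch of what this nit suggests, under stated assumptions: the class and method names (`FinalFlagSketch`, `Setup`, `choose`) are hypothetical, and a plain `IOException` stands in for Lucene's `CorruptIndexException` so the example has no dependencies. The flag is declared as a blank final and assigned exactly once in each branch, so the compiler rejects any later reassignment.

```java
import java.io.IOException;
import java.util.function.Supplier;

final class FinalFlagSketch {
    /** Result of one randomized setup: the exception to inject plus the flag describing it. */
    static final class Setup {
        final Supplier<IOException> exceptionToThrow;
        final boolean corruptIndexException;

        Setup(Supplier<IOException> exceptionToThrow, boolean corruptIndexException) {
            this.exceptionToThrow = exceptionToThrow;
            this.corruptIndexException = corruptIndexException;
        }
    }

    /** The parameter plays the role of randomBoolean() in the test under review. */
    static Setup choose(boolean simulateCorruption) {
        // Blank finals: each must be assigned exactly once on every path.
        final boolean corruptIndexException;
        final Supplier<IOException> exceptionToThrow;
        if (simulateCorruption) {
            // Assumption: IOException stands in for CorruptIndexException here.
            exceptionToThrow = () -> new IOException("Test CorruptIndexException");
            corruptIndexException = true;
        } else {
            exceptionToThrow = () -> new IOException("Test IOException");
            corruptIndexException = false;
        }
        // A second assignment to corruptIndexException here would not compile.
        return new Setup(exceptionToThrow, corruptIndexException);
    }

    public static void main(String[] args) {
        Setup setup = choose(true);
        System.out.println(setup.corruptIndexException + " " + setup.exceptionToThrow.get().getMessage());
    }
}
```

Compared with initializing the flag to `true` and flipping it later, this keeps the flag and the injected exception in the same branch, so they cannot drift apart.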

exceptionToThrow.set(() -> new IOException("Test IOException"));
}
ElasticsearchException e = expectThrows(ElasticsearchException.class, shard::storeStats);
assertThat(shard.state(), equalTo(IndexShardState.CLOSED));
Contributor

you can register a shard failure event listener and go with that for now

Contributor Author

@bleskes You mean that I can just check whether the shard failure event listener has been triggered instead of checking the shard state? If I remove the shard closing from the implementation, then yes, the shard would actually remain open. It was in the patch you originally proposed, which is why I included the logic to close the shard.

Contributor

@bleskes You mean that I can just check whether the shard failure event listener has been triggered instead of checking the shard state?

yes

It was in the patch you originally proposed, which is why I included the logic to close the shard.

I told you I wasn't sure of its quality :)
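The approach agreed on in this thread (assert that the failure listener fired rather than asserting a CLOSED state) can be sketched roughly as follows. This is an illustration only: `FakeShard` and its callback are hypothetical stand-ins, not the real `IndexShard` or its listener API, and `UncheckedIOException` takes the place of `ElasticsearchException`.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.concurrent.atomic.AtomicBoolean;
import java.util.function.Consumer;

final class FailureListenerSketch {
    /** Hypothetical stand-in for a shard; it reports failures to a callback and does not close itself. */
    static final class FakeShard {
        private final Consumer<String> onFailure;
        private final boolean storeBroken;

        FakeShard(Consumer<String> onFailure, boolean storeBroken) {
            this.onFailure = onFailure;
            this.storeBroken = storeBroken;
        }

        long storeStats() {
            try {
                return readStoreSize();
            } catch (IOException e) {
                // Notify the failure listener, then rethrow as an unchecked exception.
                onFailure.accept("Failing shard because of exception during storeStats");
                throw new UncheckedIOException("io exception while building 'store stats'", e);
            }
        }

        private long readStoreSize() throws IOException {
            if (storeBroken) {
                throw new IOException("Test IOException");
            }
            return 42L;
        }
    }

    /** Test-style check: returns true only if the call threw and the listener was triggered. */
    static boolean listenerFiredOnBrokenStore() {
        AtomicBoolean failureCallbackTriggered = new AtomicBoolean(false);
        FakeShard shard = new FakeShard(reason -> failureCallbackTriggered.set(true), true);
        try {
            shard.storeStats();
            return false; // no exception means the failure path was not exercised
        } catch (UncheckedIOException expected) {
            return failureCallbackTriggered.get();
        }
    }

    public static void main(String[] args) {
        System.out.println("listener fired: " + listenerFiredOnBrokenStore());
    }
}
```

The check does not depend on who eventually closes the shard, which is the point of the suggestion: the closing logic stays where it is and the test only observes the failure notification.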

boolean corruptIndexException = true;

if (randomBoolean()) {
exceptionToThrow.set(() -> new CorruptIndexException("Test CorruptIndexException", "Test resource"));
Contributor

💯 for checking this.

* @param globalCheckpointSyncer callback for syncing global checkpoints
* @param indexEventListener index event listener
* @param listeners an optional set of listeners to add to the shard
* @param routing shard routing to use
Contributor

formatting

final DirectoryService directoryService = new DirectoryService(shardId, indexSettings) {
@Override
public Directory newDirectory() throws IOException {
return newFSDirectory(shardPath.resolveIndex());

Contributor

why the extra line?

@colings86
Contributor

@andrershov could you please add labels to this PR (area label, change type label and version label(s))?


//TODO Should it be invoked when closing the shard?
try {
directory.close();
Contributor Author

@bleskes what do you think about explicitly closing the directory here? I've added this line because without it the test fails and reports unclosed resources. But isn't the shard responsible for the directory lifecycle?

Contributor

It's closed by the store, which is, in turn, closed in IndexService#closeShard. That method is called by the failure listener. This is part of why I didn't want you to start changing the closing logic, as it will have a lot of little side effects (but I agree it has become a mess over time).

W.r.t. this test, the easiest is to wrap the store you create in a try-with-resources block, so it will always be closed.
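A rough sketch of the try-with-resources suggestion, assuming (as described above) that closing the store also closes its directory. `FakeStore` and `FakeDirectory` are hypothetical stand-ins for illustration, not the Elasticsearch `Store` or Lucene `Directory` classes.

```java
import java.io.Closeable;
import java.util.concurrent.atomic.AtomicBoolean;

final class TryWithResourcesSketch {
    /** Hypothetical directory that only remembers whether it was closed. */
    static final class FakeDirectory implements Closeable {
        final AtomicBoolean closed = new AtomicBoolean(false);

        @Override
        public void close() {
            closed.set(true);
        }
    }

    /** Hypothetical store that owns the directory and closes it when the store is closed. */
    static final class FakeStore implements Closeable {
        private final FakeDirectory directory;

        FakeStore(FakeDirectory directory) {
            this.directory = directory;
        }

        @Override
        public void close() {
            directory.close();
        }
    }

    /** Simulates a test body that fails; the store (and so the directory) is closed anyway. */
    static boolean directoryClosedAfterFailingTest(FakeDirectory directory) {
        try (FakeStore store = new FakeStore(directory)) {
            throw new IllegalStateException("simulated test failure");
        } catch (IllegalStateException e) {
            // Resources are closed before this handler runs.
            return directory.closed.get();
        }
    }

    public static void main(String[] args) {
        System.out.println("closed: " + directoryClosedAfterFailingTest(new FakeDirectory()));
    }
}
```

With this shape the test needs no explicit `directory.close()` call, and the resource is released on both the success and failure paths.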

@andrershov
Contributor Author

I've fixed the formatting and the nits. @bleskes, I need your reply to the 3 questions, and then I'll fix the rest.

@andrershov
Contributor Author

@bleskes Fixed all the issues: 1) Removed the close call for the shard 2) Replaced the CLOSED state check with a check that the failure handler was triggered 3) Replaced the explicit directory close call with try-with-resources for the store

@andrershov andrershov added the :Distributed Indexing/Engine label Jul 23, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@andrershov
Contributor Author

@bleskes should it be included in 7.0 only?

@andrershov andrershov self-assigned this Jul 23, 2018
Contributor

@bleskes bleskes left a comment

LGTM

@bleskes
Contributor

bleskes commented Jul 23, 2018

@andrershov I think this is fine to go into 6.4 as well

@@ -104,7 +104,7 @@ public void testRestoreSnapshotWithExistingFiles() throws IOException {
shardRouting,
shard.shardPath(),
shard.indexSettings().getIndexMetaData(),
null,
null, null,
Member

Nit: This is probably better on a separate line like the rest of the parameters.

@@ -2020,7 +2100,7 @@ public IndexSearcher wrap(IndexSearcher searcher) throws EngineException {
ShardRoutingHelper.initWithSameId(shard.routingEntry(), RecoverySource.StoreRecoverySource.EXISTING_STORE_INSTANCE),
shard.shardPath(),
shard.indexSettings().getIndexMetaData(),
wrapper,
null, wrapper,
Member

Nit: This is probably better on a separate line like the rest of the parameters.

@@ -917,6 +917,7 @@ public StoreStats storeStats() {
try {
return store.stats();
} catch (IOException e) {
failShard("Failing shard because of exception during storeState " + e.getMessage(), e);
Member

This should say storeStats, not storeState.

Member

One more thing, looking at the implementation of failShard, we don't need to append e.getMessage() here.
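To illustrate the point about not appending `e.getMessage()`: the sketch below assumes that the failure path keeps the exception alongside the reason string. That is an assumption made for illustration, since the implementation of `failShard` is not shown in this conversation; `RecordedFailure` and this `failShard` are hypothetical.

```java
final class FailureMessageSketch {
    /** Hypothetical record of a shard failure: the reason and the cause are kept side by side. */
    static final class RecordedFailure {
        final String reason;
        final Exception cause;

        RecordedFailure(String reason, Exception cause) {
            this.reason = reason;
            this.cause = cause;
        }
    }

    /** Hypothetical stand-in for failShard(reason, e); not the Elasticsearch implementation. */
    static RecordedFailure failShard(String reason, Exception e) {
        return new RecordedFailure(reason, e);
    }

    public static void main(String[] args) {
        Exception e = new java.io.IOException("Test IOException");
        // The reason does not repeat the exception message; the cause still carries it.
        RecordedFailure failure = failShard("Failing shard because of exception during storeStats", e);
        System.out.println(failure.reason);
        System.out.println(failure.cause.getMessage());
    }
}
```

If the cause travels with the reason like this, concatenating its message into the reason only duplicates information.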

@@ -356,7 +364,7 @@ protected IndexShard reinitShard(IndexShard current, ShardRouting routing, Index
routing,
current.shardPath(),
current.indexSettings().getIndexMetaData(),
null,
null, null,
Member

Nit: This is probably better on a separate line like the rest of the parameters.

Member

@jasontedor jasontedor left a comment

I left a few nits, and caught one error in the shard failure message.

@andrershov andrershov removed the request for review from ywelsch July 23, 2018 13:24
@andrershov
Contributor Author

@jasontedor fixed the issues

Member

@jasontedor jasontedor left a comment

LGTM. Thanks @andrershov.

@andrershov andrershov merged commit 33f11e6 into elastic:master Jul 23, 2018
@jasontedor
Member

@andrershov Please wait for all CI checks to complete before merging. This is a step we take to reduce the likelihood that we break the build for the rest of the team. 😄

andrershov added a commit that referenced this pull request Jul 23, 2018
Fail shard if IndexShard#storeStats runs into an IOException. Closes #29008
(cherry picked from commit 33f11e6)
dnhatn added a commit that referenced this pull request Jul 25, 2018
* 6.x:
  Security: revert to old way of merging automata (#32254)
  Fix a test bug in RangeQueryBuilderTests introduced in the field aliases backport.
  Introduce Application Privileges with support for Kibana RBAC (#32309)
  Undo a debugging change that snuck in during the field aliases merge.
  [test] port linux package packaging tests (#31943)
  Painless: Update More Methods to New Naming Scheme (#32305)
  Tribe: Add error with secure settings copied to tribe (#32298)
  Add V_6_3_3 version constant
  Add ERR to ranking evaluation documentation (#32314)
  [DOCS] Added link to 6.3.2 RNs
  [DOCS] Updates 6.3.2 release notes with PRs from ml-cpp repo (#32334)
  [Kerberos] Add Kerberos authentication support (#32263)
  [ML] Extract persistent task methods from MlMetadata (#32319)
  Backport - Add Snapshots Status API to High Level Rest Client (#32295)
  Make release notes ignore the `>test-failure` label. (#31309)
  [DOCS] Adds release highlights for search for 6.4 (#32095)
  Allow Integ Tests to run in a FIPS-140 JVM (#32316)
  Add support for field aliases to 6.x. (#32184)
  Register ERR metric with NamedXContentRegistry (#32320)
  fixes broken build for third-party-tests (#32315) Relates #31918 / Closes infra/issues/6085
  [DOCS] Rollup Caps API incorrectly mentions GET Jobs API (#32280)
  Rest HL client: Add put watch action (#32026) (#32191)
  Add WeightedAvg metric aggregation (#31037)
  Consistent encoder names (#29492)
  Switch monitoring to new style Requests (#32255)
  specify subdirs of lib, bin, modules in package (#32253)
  Rename ranking evaluation `quality_level` to `metric_score` (#32168)
  Add new permission for JDK11 to load JAAS libraries (#32132)
  Switch x-pack:core to new style Requests (#32252)
  Watcher: Store username on watch execution (#31873)
  Silence SSL reload test that fails on JDK 11
  Painless: Clean up add methods in PainlessLookup (#32258)
  CCE when re-throwing "shard not available" exception in TransportShardMultiGetAction (#32185)
  Fail shard if IndexShard#storeStats runs into an IOException (#32241)
  Fix `range` queries on `_type` field for singe type indices (#31756) (#32161)
  AwaitsFix RecoveryIT#testHistoryUUIDIsGenerated
  Add new fields to monitoring template for Beats state (#32085) (#32273)
  [TEST] improve REST high-level client naming conventions check (#32244)
  Check that client methods match API defined in the REST spec (#31825)
dnhatn added a commit that referenced this pull request Jul 25, 2018
* master:
  Security: revert to old way of merging automata (#32254)
  Networking: Fix test leaking buffer (#32296)
  Undo a debugging change that snuck in during the field aliases merge.
  Painless: Update More Methods to New Naming Scheme (#32305)
  [TEST] Fix assumeFalse -> assumeTrue in SSLReloadIntegTests
  Ingest: Support integer and long hex values in convert (#32213)
  Introduce fips_mode setting and associated checks (#32326)
  Add V_6_3_3 version constant
  [DOCS] Removed extraneous callout number.
  Rest HL client: Add put license action (#32214)
  Add ERR to ranking evaluation documentation (#32314)
  Introduce Application Privileges with support for Kibana RBAC (#32309)
  Build: Shadow x-pack:protocol into x-pack:plugin:core (#32240)
  [Kerberos] Add Kerberos authentication support (#32263)
  [ML] Extract persistent task methods from MlMetadata (#32319)
  Add Restore Snapshot High Level REST API
  Register ERR metric with NamedXContentRegistry (#32320)
  fixes broken build for third-party-tests (#32315)
  Allow Integ Tests to run in a FIPS-140 JVM (#31989)
  [DOCS] Rollup Caps API incorrectly mentions GET Jobs API (#32280)
  awaitsfix testRandomClusterStateUpdates
  [TEST] add version skip to weighted_avg tests
  Consistent encoder names (#29492)
  Add WeightedAvg metric aggregation (#31037)
  Switch monitoring to new style Requests (#32255)
  Rename ranking evaluation `quality_level` to `metric_score` (#32168)
  Fix a test bug around nested aggregations and field aliases. (#32287)
  Add new permission for JDK11 to load JAAS libraries (#32132)
  Silence SSL reload test that fails on JDK 11
  [test] package pre-install java check (#32259)
  specify subdirs of lib, bin, modules in package (#32253)
  Switch x-pack:core to new style Requests (#32252)
  awaitsfix SSLConfigurationReloaderTests
  Painless: Clean up add methods in PainlessLookup (#32258)
  Fail shard if IndexShard#storeStats runs into an IOException (#32241)
  AwaitsFix RecoveryIT#testHistoryUUIDIsGenerated
  Remove unnecessary warning supressions (#32250)
  CCE when re-throwing "shard not available" exception in TransportShardMultiGetAction (#32185)
  Add new fields to monitoring template for Beats state (#32085)
Labels
>bug, :Distributed Indexing/Engine, v6.4.0, v7.0.0-beta1
5 participants