Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

drop index.shard.check_on_startup: fix #32279

Merged
merged 21 commits into from
Aug 31, 2018

Conversation

vladimirdolzhenko
Copy link
Contributor

  • drop index.shard.check_on_startup: fix setting
  • put a corrupt marker if it is detected during start up check that index

Relates #31389

@vladimirdolzhenko vladimirdolzhenko added >enhancement :Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. v7.0.0 labels Jul 23, 2018
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed

@vladimirdolzhenko vladimirdolzhenko requested review from DaveCTurner and removed request for bleskes August 20, 2018 12:37
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, and I'm pleased that there can be some tests for these settings. I asked a few questions about the tests.

final IndexMetaData indexMetaData = IndexMetaData.builder(indexShard.indexSettings().getIndexMetaData())
.settings(Settings.builder()
.put(indexShard.indexSettings.getSettings())
.put(IndexSettings.INDEX_CHECK_ON_STARTUP.getKey(), "checksum"))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this test also work if we set this to true? If so, could we choose randomly between these two settings, and rename the test appropriately?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In case of corrupted index IndexShard with checksum option creates corrupted marker file, while in case of true it does not create it - seems due to https://github.com/elastic/elasticsearch/pull/32279/files#diff-49e1a1b834b522f4ae6997c5defe9eb0R1939 - as IllegalStateException is not subclass of IOException https://github.com/elastic/elasticsearch/pull/32279/files#diff-49e1a1b834b522f4ae6997c5defe9eb0R1894

Is it worth to change to IOException as well ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It sounds to me like it's doing the wrong thing here, but these things are subtle. @ywelsch might know: if check_on_startup: true finds a corruption it doesn't currently mark the shard as corrupted. I think it should. WDYT?

Copy link
Contributor Author

@vladimirdolzhenko vladimirdolzhenko Aug 21, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@DaveCTurner @ywelsch
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html

  • checksum - Check for physical corruption.
  • true - Check for both physical and logical corruption. This is much more expensive in terms of CPU and memory usage.

true relies on Lucene CheckIndex check - that is indeed heavier operation - I would say we have to create a corruption marker

// start shard with checksum - it has to pass successfully

final ShardRouting shardRouting = ShardRoutingHelper.initWithSameId(indexShard.routingEntry(),
primary ? RecoverySource.StoreRecoverySource.EXISTING_STORE_INSTANCE : RecoverySource.PeerRecoverySource.INSTANCE
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

primary is always true here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++

assertThat("store is clean",
corruptedMarkerCount.get(), equalTo(0));

// storeProvider that does not perform check index on close - it is corrupted
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't fully understand this comment, or what this storeProvider is for. The heart of this test seems to be:

expectThrows(IndexShardRecoveryException.class, () -> newStartedShard(p -> corruptedShard, primary));

We also want the corruption marker to exist as soon as this is thrown, not after the shard is subsequently closed. If we need to test that opening a shard with a corruption marker fails, and that the corruption marker isn't removed when it's closed, then I think that should be a separate test.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Indeed, that heart of the test is
expectThrows(IndexShardRecoveryException.class, () -> newStartedShard(p -> corruptedShard, primary));

the problem with default store provider - that it checks index on close. In case of corrupted index - it fails on close (as it's corrupted) - and we have to close it to avoid resource leakage.

therefore the key point of adjusted storeProvider is

baseDirectoryWrapper.setCheckIndexOnClose(false);

Agree to split it into 2 tests - the 2nd one with opening a shard with a corruption marker fails

Files.walk(indexPath)
.filter(p -> Files.isRegularFile(p) && IndexWriter.WRITE_LOCK_NAME.equals(p.getFileName().toString()) == false)
.toArray(Path[]::new);
CorruptionUtils.corruptFile(random(), filesToCorrupt);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this definitely corrupt a file that will be checked? It doesn't, for instance, hit a translog file in a generation that's early enough that it's no longer needed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It indeed corrupts only index file due to
final Path indexPath = shardPath.getDataPath().resolve("index");

translog files are stored in another place - shardPath.getDataPath().resolve("translog");

…heckChecksum into two tests: testIndexCheckOnStartup, testShardDoesNotStartIfCorruptedMarkerIsPresent; clean up
@vladimirdolzhenko
Copy link
Contributor Author

@elasticmachine test this please

@vladimirdolzhenko
Copy link
Contributor Author

@DaveCTurner I addressed your comments - tried ran test twice but it fails on completely irrelevant part (passed on private CI) - could you please have another look into this PR ?

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. I asked for a change to one of the tests and a whitespace nit. Also, if you've picked up a flaky commit from master then I recommend trying to merge a later, good, one in.

.put(IndexSettings.INDEX_CHECK_ON_STARTUP.getKey(), randomFrom("true", "checksum")))
.build();

final Path indexPath = corruptIndexFile(shardPath);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this test should create the corruption marker directly, rather than relying on the check-on-startup behaviour. Essentially, it should fail to start even in the presence of a corruption marker, even if the index itself is not corrupt.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

agree

};
Files.walkFileTree(indexPath, corruptedVisitor);

assertThat("store is clean",
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I'd prefer statements to go all on one line if possible. There's a few places where that could be done, such as this one.

@DaveCTurner
Copy link
Contributor

I think this should go into 6.x too - your labels indicate this will only be merged into 7.0. We also need a PR to update the breaking changes docs for 6.x to note that fix is no longer supported. I'm not sure if we should note that in the 6.0 breaking changes, because it's not worked since then.

@vladimirdolzhenko
Copy link
Contributor Author

vladimirdolzhenko commented Aug 21, 2018

@DaveCTurner this PR has to go together with #32281 - so it depends mostly on that PR - as far as it is ready we can target version and back port to 6.x

Vladimir Dolzhenko added 3 commits August 21, 2018 16:15
Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I did another pass and this is nearly there - I left just a few requests to tidy up comments and assertion messages, and spotted one missing assertion.

null, null, indexShard.engineFactory,
indexShard.getGlobalCheckpointSyncer(), EMPTY_EVENT_LISTENER);

expectThrows(IndexShardRecoveryException.class, () -> newStartedShard(p -> corruptedShard, true));
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please could you assert something about the thrown exception - ideally that its message describes the affected shard and contains an appropriate message?

indexShard.flush(new FlushRequest());
closeShards(indexShard);

// start shard with checksum - it has to pass successfully
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment seems misleading: we don't start the shard here, and we expect it not to pass successfully when we do.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

makes sense, addressed your comments


final Path indexPath = corruptIndexFile(shardPath);

// check that corrupt marker is *NOT* there
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this comment is necessary here. The check is the assertion a few lines further down, and it should be clear what that assertion is doing. I think the message in the assertion could use the phrase "corruption marker" rather than "clean" for consistency.

assertThat(exception1.getCause().getMessage(), equalTo(corruptionMessage + " (resource=preexisting_corruption)"));
closeShards(corruptedShard);

// check that corrupt marker is there
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly - this comment isn't needed here.


// create corrupted marker
final String corruptionMessage = "fake ioexception";
try(Store store = createStore(indexShard.indexSettings(), shardPath)){
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: whitespace

@DaveCTurner
Copy link
Contributor

Also please update the labels on this to indicate how it will be backported. I think it should go into 6.5 too.

@vladimirdolzhenko
Copy link
Contributor Author

it has to be backported to 6.5 together with #32281

@DaveCTurner
Copy link
Contributor

We discussed back-porting in another channel and agreed:

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, we need a deprecation warning on using fix so that we can remove it in 7.0.

@vladimirdolzhenko
Copy link
Contributor Author

  • I will add deprecation message for fix

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a simpler way to write the test - see comments. Also suggested adding "which has no effect" to the deprecation message, and to use the word "accepted" rather than "supported".

@@ -299,6 +299,10 @@ public IndexShard(
logger.debug("state: [CREATED]");

this.checkIndexOnStartup = indexSettings.getValue(IndexSettings.INDEX_CHECK_ON_STARTUP);
if ("fix".equals(checkIndexOnStartup)) {
deprecationLogger.deprecated("Setting [index.shard.check_on_startup] is set to deprecated value [fix], "
+ "which will be unsupported in future");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggest

which has no effect and will not be accepted in future

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

++ thanks

return false;
}

public void testCheckOnStartupDeprecatedValue() throws Exception {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this could be simplified to:

    public void testCheckOnStartupDeprecatedValue() throws Exception {
        final IndexShard newShard = newShard(true, Settings.builder().put(IndexSettings.INDEX_CHECK_ON_STARTUP.getKey(), "fix").build());
        assertWarnings("Setting [index.shard.check_on_startup] is set to deprecated value [fix], which will be unsupported in future");
        closeShards(newShard);
    }

(and using assertWarnings means it doesn't need to be in its own test fixture too)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

already figured that out ;)

public void testCheckOnStartupDeprecatedValue() throws Exception {
final Settings settings = Settings.builder().put(IndexSettings.INDEX_CHECK_ON_STARTUP.getKey(), "fix").build();

try(ThreadContext threadContext = new ThreadContext(Settings.EMPTY)) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: missing whitespace after the try.

Copy link
Contributor

@DaveCTurner DaveCTurner left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks, LGTM now.

@vladimirdolzhenko
Copy link
Contributor Author

@elasticmachine test this please

@vladimirdolzhenko
Copy link
Contributor Author

thanks @DaveCTurner 👍

@vladimirdolzhenko vladimirdolzhenko merged commit 3d82a30 into elastic:master Aug 31, 2018
vladimirdolzhenko added a commit that referenced this pull request Aug 31, 2018
Relates #31389

(cherry picked from commit 3d82a30)
vladimirdolzhenko added a commit that referenced this pull request Aug 31, 2018
completely drop `index.shard.check_on_startup: fix` for 7.0 (#33194)

Relates to #32279
dnhatn added a commit that referenced this pull request Sep 1, 2018
* master:
  Mute test watcher usage stats output
  [Rollup] Fix FullClusterRestart test
  Adjust soft-deletes version after backport into 6.5
  completely drop `index.shard.check_on_startup: fix` for 7.0 (#33194)
  Fix AwaitsFix issue number
  Mute SmokeTestWatcherWithSecurityIT testsi
  drop `index.shard.check_on_startup: fix` (#32279)
  tracked at
  [DOCS] Moves ml folder from x-pack/docs to docs (#33248)
  [DOCS] Move rollup APIs to docs (#31450)
  [DOCS] Rename X-Pack Commands section (#33005)
  TEST: Disable soft-deletes in ParentChildTestCase
  Fixes SecurityIntegTestCase so it always adds at least one alias (#33296)
  Fix pom for build-tools (#33300)
  Lazy evaluate java9home (#33301)
  SQL: test coverage for JdbcResultSet (#32813)
  Work around to be able to generate eclipse projects (#33295)
  Highlight that index_phrases only works if no slop is used (#33303)
  Different handling for security specific errors in the CLI. Fix for #33230 (#33255)
  [ML] Refactor delimited file structure detection (#33233)
  SQL: Support multi-index format as table identifier (#33278)
  MINOR: Remove Dead Code from PathTrie (#33280)
  Enable forbiddenapis server java9 (#33245)
dnhatn added a commit that referenced this pull request Sep 1, 2018
* 6.x:
  Mute test watcher usage stats output
  [Rollup] Fix FullClusterRestart test
  TEST: Disable soft-deletes in ParentChildTestCase
  TEST: Disable randomized soft-deletes settings
  Integrates soft-deletes into Elasticsearch (#33222)
  drop `index.shard.check_on_startup: fix` (#32279)
  Fix AwaitsFix issue number
  Mute SmokeTestWatcherWithSecurityIT testsi
  [DOCS] Moves ml folder from x-pack/docs to docs (#33248)
  TEST: mute more SmokeTestWatcherWithSecurityIT tests
  [DOCS] Move rollup APIs to docs (#31450)
  [DOCS] Rename X-Pack Commands section (#33005)
  Fixes SecurityIntegTestCase so it always adds at least one alias (#33296)
  TESTS: Fix Random Fail in MockTcpTransportTests (#33061) (#33307)
  MINOR: Remove Dead Code from PathTrie (#33280) (#33306)
  Fix pom for build-tools (#33300)
  Lazy evaluate java9home (#33301)
  SQL: test coverage for JdbcResultSet (#32813)
  Work around to be able to generate eclipse projects (#33295)
  Different handling for security specific errors in the CLI. Fix for #33230 (#33255)
  [ML] Refactor delimited file structure detection (#33233)
  SQL: Support multi-index format as table identifier (#33278)
  Enable forbiddenapis server java9 (#33245)
  [MUTE] SmokeTestWatcherWithSecurityIT flaky tests
  Add region ISO code to GeoIP Ingest plugin (#31669) (#33276)
  Don't be strict for 6.x
  Update serialization versions for custom IndexMetaData backport
  Replace IndexMetaData.Custom with Map-based custom metadata (#32749)
  Painless: Fix Bindings Bug (#33274)
  SQL: prevent duplicate generation for repeated aggs (#33252)
  TEST: Mute testMonitorClusterHealth
  Fix serialization of empty field capabilities response (#33263)
  Fix nested _source retrieval with includes/excludes (#33180)
  [DOCS] TLS file resources are reloadable (#33258)
  Watcher: Ensure TriggerEngine start replaces existing watches (#33157)
  Ignore module-info in jar hell checks (#33011)
  Fix docs build after #33241
  [DOC] Repository GCS ADC not supported (#33238)
  Upgrade to latest Gradle 4.10  (#32801)
  Fix/30904 cluster formation part2 (#32877)
  Move file-based discovery to core (#33241)
  HLRC: add client side RefreshPolicy (#33209)
  [Kerberos] Add unsupported languages for tests (#33253)
  Watcher: Reload properly on remote shard change (#33167)
  Fix classpath security checks for external tests. (#33066)
  [Rollup] Only allow aggregating on multiples of configured interval (#32052)
  Added deprecation warning for rescore in scroll queries (#33070)
  Apply settings filter to get cluster settings API (#33247)
  [Rollup] Re-factor Rollup Indexer into a generic indexer for re-usability   (#32743)
  HLRC: create base timed request class (#33216)
  HLRC: Use Optional in validation logic (#33104)
  Painless: Add Bindings (#33042)
jasontedor added a commit to martijnvg/elasticsearch that referenced this pull request Sep 2, 2018
* master: (64 commits)
  HLREST: add update by query API (elastic#32760)
  TEST: Increase timeout testFollowIndexAndCloseNode (elastic#33333)
  HLRC: ML Flush job (elastic#33187)
  HLRC: Adding ML Job stats (elastic#33183)
  LLREST: Drop deprecated methods (elastic#33223)
  Mute testSyncerOnClosingShard
  [DOCS] Moves machine learning APIs to docs folder (elastic#31118)
  Mute test watcher usage stats output
  [Rollup] Fix FullClusterRestart test
  Adjust soft-deletes version after backport into 6.5
  completely drop `index.shard.check_on_startup: fix` for 7.0 (elastic#33194)
  Fix AwaitsFix issue number
  Mute SmokeTestWatcherWithSecurityIT testsi
  drop `index.shard.check_on_startup: fix` (elastic#32279)
  tracked at
  [DOCS] Moves ml folder from x-pack/docs to docs (elastic#33248)
  [DOCS] Move rollup APIs to docs (elastic#31450)
  [DOCS] Rename X-Pack Commands section (elastic#33005)
  TEST: Disable soft-deletes in ParentChildTestCase
  Fixes SecurityIntegTestCase so it always adds at least one alias (elastic#33296)
  ...
jasontedor added a commit to debadair/elasticsearch that referenced this pull request Sep 2, 2018
* master: (283 commits)
  HLREST: add update by query API (elastic#32760)
  TEST: Increase timeout testFollowIndexAndCloseNode (elastic#33333)
  HLRC: ML Flush job (elastic#33187)
  HLRC: Adding ML Job stats (elastic#33183)
  LLREST: Drop deprecated methods (elastic#33223)
  Mute testSyncerOnClosingShard
  [DOCS] Moves machine learning APIs to docs folder (elastic#31118)
  Mute test watcher usage stats output
  [Rollup] Fix FullClusterRestart test
  Adjust soft-deletes version after backport into 6.5
  completely drop `index.shard.check_on_startup: fix` for 7.0 (elastic#33194)
  Fix AwaitsFix issue number
  Mute SmokeTestWatcherWithSecurityIT testsi
  drop `index.shard.check_on_startup: fix` (elastic#32279)
  tracked at
  [DOCS] Moves ml folder from x-pack/docs to docs (elastic#33248)
  [DOCS] Move rollup APIs to docs (elastic#31450)
  [DOCS] Rename X-Pack Commands section (elastic#33005)
  TEST: Disable soft-deletes in ParentChildTestCase
  Fixes SecurityIntegTestCase so it always adds at least one alias (elastic#33296)
  ...
@vladimirdolzhenko vladimirdolzhenko deleted the fix/31389_1 branch September 22, 2018 10:42
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. >enhancement v6.5.0 v7.0.0-beta1
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants