-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
drop index.shard.check_on_startup: fix
#32279
drop index.shard.check_on_startup: fix
#32279
Conversation
Pinging @elastic/es-distributed |
f5a6acd
to
843f977
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good, and I'm pleased that there can be some tests for these settings. I asked a few questions about the tests.
final IndexMetaData indexMetaData = IndexMetaData.builder(indexShard.indexSettings().getIndexMetaData()) | ||
.settings(Settings.builder() | ||
.put(indexShard.indexSettings.getSettings()) | ||
.put(IndexSettings.INDEX_CHECK_ON_STARTUP.getKey(), "checksum")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this test also work if we set this to true
? If so, could we choose randomly between these two settings, and rename the test appropriately?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In case of corrupted index IndexShard
with checksum
option creates corrupted marker file, while in case of true
it does not create it - seems due to https://github.com/elastic/elasticsearch/pull/32279/files#diff-49e1a1b834b522f4ae6997c5defe9eb0R1939 - as IllegalStateException
is not subclass of IOException
https://github.com/elastic/elasticsearch/pull/32279/files#diff-49e1a1b834b522f4ae6997c5defe9eb0R1894
Is it worth to change to IOException
as well ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sounds to me like it's doing the wrong thing here, but these things are subtle. @ywelsch might know: if check_on_startup: true
finds a corruption it doesn't currently mark the shard as corrupted. I think it should. WDYT?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@DaveCTurner @ywelsch
https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html
checksum
- Check for physical corruption.true
- Check for both physical and logical corruption. This is much more expensive in terms of CPU and memory usage.
true
relies on Lucene CheckIndex
check - that is indeed heavier operation - I would say we have to create a corruption marker
// start shard with checksum - it has to pass successfully | ||
|
||
final ShardRouting shardRouting = ShardRoutingHelper.initWithSameId(indexShard.routingEntry(), | ||
primary ? RecoverySource.StoreRecoverySource.EXISTING_STORE_INSTANCE : RecoverySource.PeerRecoverySource.INSTANCE |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
primary
is always true here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++
assertThat("store is clean", | ||
corruptedMarkerCount.get(), equalTo(0)); | ||
|
||
// storeProvider that does not perform check index on close - it is corrupted |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand this comment, or what this storeProvider
is for. The heart of this test seems to be:
expectThrows(IndexShardRecoveryException.class, () -> newStartedShard(p -> corruptedShard, primary));
We also want the corruption marker to exist as soon as this is thrown, not after the shard is subsequently closed. If we need to test that opening a shard with a corruption marker fails, and that the corruption marker isn't removed when it's closed, then I think that should be a separate test.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Indeed, that heart of the test is
expectThrows(IndexShardRecoveryException.class, () -> newStartedShard(p -> corruptedShard, primary));
the problem with default store provider - that it checks index on close. In case of corrupted index - it fails on close (as it's corrupted) - and we have to close it to avoid resource leakage.
therefore the key point of adjusted storeProvider is
baseDirectoryWrapper.setCheckIndexOnClose(false);
Agree to split it into 2 tests - the 2nd one with opening a shard with a corruption marker fails
Files.walk(indexPath) | ||
.filter(p -> Files.isRegularFile(p) && IndexWriter.WRITE_LOCK_NAME.equals(p.getFileName().toString()) == false) | ||
.toArray(Path[]::new); | ||
CorruptionUtils.corruptFile(random(), filesToCorrupt); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this definitely corrupt a file that will be checked? It doesn't, for instance, hit a translog file in a generation that's early enough that it's no longer needed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It indeed corrupts only index file due to
final Path indexPath = shardPath.getDataPath().resolve("index");
translog files are stored in another place - shardPath.getDataPath().resolve("translog");
…heckChecksum into two tests: testIndexCheckOnStartup, testShardDoesNotStartIfCorruptedMarkerIsPresent; clean up
@elasticmachine test this please |
@DaveCTurner I addressed your comments - tried ran test twice but it fails on completely irrelevant part (passed on private CI) - could you please have another look into this PR ? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. I asked for a change to one of the tests and a whitespace nit. Also, if you've picked up a flaky commit from master then I recommend trying to merge a later, good, one in.
.put(IndexSettings.INDEX_CHECK_ON_STARTUP.getKey(), randomFrom("true", "checksum"))) | ||
.build(); | ||
|
||
final Path indexPath = corruptIndexFile(shardPath); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this test should create the corruption marker directly, rather than relying on the check-on-startup behaviour. Essentially, it should fail to start even in the presence of a corruption marker, even if the index itself is not corrupt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
agree
}; | ||
Files.walkFileTree(indexPath, corruptedVisitor); | ||
|
||
assertThat("store is clean", |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I'd prefer statements to go all on one line if possible. There's a few places where that could be done, such as this one.
I think this should go into 6.x too - your labels indicate this will only be merged into 7.0. We also need a PR to update the breaking changes docs for 6.x to note that |
@DaveCTurner this PR has to go together with #32281 - so it depends mostly on that PR - as far as it is ready we can target version and back port to 6.x |
…check shard does not start if corruption marker is there; drop unnecessary new lines
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did another pass and this is nearly there - I left just a few requests to tidy up comments and assertion messages, and spotted one missing assertion.
null, null, indexShard.engineFactory, | ||
indexShard.getGlobalCheckpointSyncer(), EMPTY_EVENT_LISTENER); | ||
|
||
expectThrows(IndexShardRecoveryException.class, () -> newStartedShard(p -> corruptedShard, true)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please could you assert something about the thrown exception - ideally that its message describes the affected shard and contains an appropriate message?
indexShard.flush(new FlushRequest()); | ||
closeShards(indexShard); | ||
|
||
// start shard with checksum - it has to pass successfully |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This comment seems misleading: we don't start the shard here, and we expect it not to pass successfully when we do.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense, addressed your comments
|
||
final Path indexPath = corruptIndexFile(shardPath); | ||
|
||
// check that corrupt marker is *NOT* there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this comment is necessary here. The check is the assertion a few lines further down, and it should be clear what that assertion is doing. I think the message in the assertion could use the phrase "corruption marker" rather than "clean" for consistency.
assertThat(exception1.getCause().getMessage(), equalTo(corruptionMessage + " (resource=preexisting_corruption)")); | ||
closeShards(corruptedShard); | ||
|
||
// check that corrupt marker is there |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Similarly - this comment isn't needed here.
|
||
// create corrupted marker | ||
final String corruptionMessage = "fake ioexception"; | ||
try(Store store = createStore(indexShard.indexSettings(), shardPath)){ |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: whitespace
Also please update the labels on this to indicate how it will be backported. I think it should go into 6.5 too. |
it has to be backported to 6.5 together with #32281 |
We discussed back-porting in another channel and agreed:
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actually, we need a deprecation warning on using fix
so that we can remove it in 7.0.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There's a simpler way to write the test - see comments. Also suggested adding "which has no effect" to the deprecation message, and to use the word "accepted" rather than "supported".
@@ -299,6 +299,10 @@ public IndexShard( | |||
logger.debug("state: [CREATED]"); | |||
|
|||
this.checkIndexOnStartup = indexSettings.getValue(IndexSettings.INDEX_CHECK_ON_STARTUP); | |||
if ("fix".equals(checkIndexOnStartup)) { | |||
deprecationLogger.deprecated("Setting [index.shard.check_on_startup] is set to deprecated value [fix], " | |||
+ "which will be unsupported in future"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suggest
which has no effect and will not be accepted in future
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
++ thanks
return false; | ||
} | ||
|
||
public void testCheckOnStartupDeprecatedValue() throws Exception { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this could be simplified to:
public void testCheckOnStartupDeprecatedValue() throws Exception {
final IndexShard newShard = newShard(true, Settings.builder().put(IndexSettings.INDEX_CHECK_ON_STARTUP.getKey(), "fix").build());
assertWarnings("Setting [index.shard.check_on_startup] is set to deprecated value [fix], which will be unsupported in future");
closeShards(newShard);
}
(and using assertWarnings
means it doesn't need to be in its own test fixture too)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
already figured that out ;)
public void testCheckOnStartupDeprecatedValue() throws Exception { | ||
final Settings settings = Settings.builder().put(IndexSettings.INDEX_CHECK_ON_STARTUP.getKey(), "fix").build(); | ||
|
||
try(ThreadContext threadContext = new ThreadContext(Settings.EMPTY)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nit: missing whitespace after the try
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks, LGTM now.
@elasticmachine test this please |
…ster' into fix/31389_1
thanks @DaveCTurner 👍 |
* master: Mute test watcher usage stats output [Rollup] Fix FullClusterRestart test Adjust soft-deletes version after backport into 6.5 completely drop `index.shard.check_on_startup: fix` for 7.0 (#33194) Fix AwaitsFix issue number Mute SmokeTestWatcherWithSecurityIT testsi drop `index.shard.check_on_startup: fix` (#32279) tracked at [DOCS] Moves ml folder from x-pack/docs to docs (#33248) [DOCS] Move rollup APIs to docs (#31450) [DOCS] Rename X-Pack Commands section (#33005) TEST: Disable soft-deletes in ParentChildTestCase Fixes SecurityIntegTestCase so it always adds at least one alias (#33296) Fix pom for build-tools (#33300) Lazy evaluate java9home (#33301) SQL: test coverage for JdbcResultSet (#32813) Work around to be able to generate eclipse projects (#33295) Highlight that index_phrases only works if no slop is used (#33303) Different handling for security specific errors in the CLI. Fix for #33230 (#33255) [ML] Refactor delimited file structure detection (#33233) SQL: Support multi-index format as table identifier (#33278) MINOR: Remove Dead Code from PathTrie (#33280) Enable forbiddenapis server java9 (#33245)
* 6.x: Mute test watcher usage stats output [Rollup] Fix FullClusterRestart test TEST: Disable soft-deletes in ParentChildTestCase TEST: Disable randomized soft-deletes settings Integrates soft-deletes into Elasticsearch (#33222) drop `index.shard.check_on_startup: fix` (#32279) Fix AwaitsFix issue number Mute SmokeTestWatcherWithSecurityIT testsi [DOCS] Moves ml folder from x-pack/docs to docs (#33248) TEST: mute more SmokeTestWatcherWithSecurityIT tests [DOCS] Move rollup APIs to docs (#31450) [DOCS] Rename X-Pack Commands section (#33005) Fixes SecurityIntegTestCase so it always adds at least one alias (#33296) TESTS: Fix Random Fail in MockTcpTransportTests (#33061) (#33307) MINOR: Remove Dead Code from PathTrie (#33280) (#33306) Fix pom for build-tools (#33300) Lazy evaluate java9home (#33301) SQL: test coverage for JdbcResultSet (#32813) Work around to be able to generate eclipse projects (#33295) Different handling for security specific errors in the CLI. Fix for #33230 (#33255) [ML] Refactor delimited file structure detection (#33233) SQL: Support multi-index format as table identifier (#33278) Enable forbiddenapis server java9 (#33245) [MUTE] SmokeTestWatcherWithSecurityIT flaky tests Add region ISO code to GeoIP Ingest plugin (#31669) (#33276) Don't be strict for 6.x Update serialization versions for custom IndexMetaData backport Replace IndexMetaData.Custom with Map-based custom metadata (#32749) Painless: Fix Bindings Bug (#33274) SQL: prevent duplicate generation for repeated aggs (#33252) TEST: Mute testMonitorClusterHealth Fix serialization of empty field capabilities response (#33263) Fix nested _source retrieval with includes/excludes (#33180) [DOCS] TLS file resources are reloadable (#33258) Watcher: Ensure TriggerEngine start replaces existing watches (#33157) Ignore module-info in jar hell checks (#33011) Fix docs build after #33241 [DOC] Repository GCS ADC not supported (#33238) Upgrade to latest Gradle 4.10 (#32801) Fix/30904 cluster formation part2 (#32877) Move file-based discovery to core (#33241) HLRC: add client side RefreshPolicy (#33209) [Kerberos] Add unsupported languages for tests (#33253) Watcher: Reload properly on remote shard change (#33167) Fix classpath security checks for external tests. (#33066) [Rollup] Only allow aggregating on multiples of configured interval (#32052) Added deprecation warning for rescore in scroll queries (#33070) Apply settings filter to get cluster settings API (#33247) [Rollup] Re-factor Rollup Indexer into a generic indexer for re-usability (#32743) HLRC: create base timed request class (#33216) HLRC: Use Optional in validation logic (#33104) Painless: Add Bindings (#33042)
* master: (64 commits) HLREST: add update by query API (elastic#32760) TEST: Increase timeout testFollowIndexAndCloseNode (elastic#33333) HLRC: ML Flush job (elastic#33187) HLRC: Adding ML Job stats (elastic#33183) LLREST: Drop deprecated methods (elastic#33223) Mute testSyncerOnClosingShard [DOCS] Moves machine learning APIs to docs folder (elastic#31118) Mute test watcher usage stats output [Rollup] Fix FullClusterRestart test Adjust soft-deletes version after backport into 6.5 completely drop `index.shard.check_on_startup: fix` for 7.0 (elastic#33194) Fix AwaitsFix issue number Mute SmokeTestWatcherWithSecurityIT testsi drop `index.shard.check_on_startup: fix` (elastic#32279) tracked at [DOCS] Moves ml folder from x-pack/docs to docs (elastic#33248) [DOCS] Move rollup APIs to docs (elastic#31450) [DOCS] Rename X-Pack Commands section (elastic#33005) TEST: Disable soft-deletes in ParentChildTestCase Fixes SecurityIntegTestCase so it always adds at least one alias (elastic#33296) ...
* master: (283 commits) HLREST: add update by query API (elastic#32760) TEST: Increase timeout testFollowIndexAndCloseNode (elastic#33333) HLRC: ML Flush job (elastic#33187) HLRC: Adding ML Job stats (elastic#33183) LLREST: Drop deprecated methods (elastic#33223) Mute testSyncerOnClosingShard [DOCS] Moves machine learning APIs to docs folder (elastic#31118) Mute test watcher usage stats output [Rollup] Fix FullClusterRestart test Adjust soft-deletes version after backport into 6.5 completely drop `index.shard.check_on_startup: fix` for 7.0 (elastic#33194) Fix AwaitsFix issue number Mute SmokeTestWatcherWithSecurityIT testsi drop `index.shard.check_on_startup: fix` (elastic#32279) tracked at [DOCS] Moves ml folder from x-pack/docs to docs (elastic#33248) [DOCS] Move rollup APIs to docs (elastic#31450) [DOCS] Rename X-Pack Commands section (elastic#33005) TEST: Disable soft-deletes in ParentChildTestCase Fixes SecurityIntegTestCase so it always adds at least one alias (elastic#33296) ...
index.shard.check_on_startup: fix
settingRelates #31389