add RemoveCorruptedShardDataCommand #32281

Merged: 93 commits merged into elastic:master on Sep 19, 2018

Conversation

vladimirdolzhenko
Contributor

@vladimirdolzhenko vladimirdolzhenko commented Jul 23, 2018

add elasticsearch-shard command-line tool

  • We should have one tool for dealing with corruption, both in the translog and in the Lucene index.
    The tool will refuse to run if there are no existing corruption markers (i.e., it will only work on shards known to be corrupted).
    • adds the elasticsearch-shard remove-corrupted-segments tool to replace the index.shard.check_on_startup: fix setting, which was dropped in #32279
    • the translog does not actually create a corruption marker file; this has to be addressed in a follow-up PR
  • The tool will first do a dry run and show the user an analysis of what it is going to do, ask for confirmation, and then perform the required operations.
    • a -dry-run option could be provided to give just this overview
  • The tool should fail when CheckIndex fails to drop corrupted segments in Lucene. In the future we can offer users the option to recover only the translog, if needed. We don't feel we need that complexity right now.
  • We should document the implications of the tool for join relationships, as they may be unexpected to users.
  • The tool should generate a new history uuid to prevent ops-based recoveries and CCR.
  • The tool should generate a new allocation id and tell the user what command they need to run in order for the cluster to use this shard (allocate stale primary).
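
The flow in the list above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in this PR: the class `ShardToolFlowSketch`, its `run` method, the `Result` type, and all messages are hypothetical, and `java.util.UUID` stands in for however the tool really generates the history uuid and allocation id.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.UUID;
import java.util.function.Predicate;

// Hypothetical sketch: class, method, and message names are illustrative
// and are not taken from the PR or from Elasticsearch.
final class ShardToolFlowSketch {

    /** What a single run did, so the caller can inspect the outcome. */
    static final class Result {
        final boolean changed;         // true only if the shard was modified
        final String newHistoryUuid;   // null unless changed
        final String newAllocationId;  // null unless changed
        final List<String> messages;   // what the tool would print to the user

        Result(boolean changed, String newHistoryUuid, String newAllocationId, List<String> messages) {
            this.changed = changed;
            this.newHistoryUuid = newHistoryUuid;
            this.newAllocationId = newAllocationId;
            this.messages = messages;
        }
    }

    static Result run(boolean hasCorruptionMarker, boolean dryRun, Predicate<String> confirm) {
        // Refuse to touch shards that are not known to be corrupted.
        if (!hasCorruptionMarker) {
            throw new IllegalStateException("no corruption marker found, refusing to run");
        }
        List<String> messages = new ArrayList<>();
        // The analysis is always shown first, whether or not anything is changed.
        messages.add("analysis: corrupted segments would be dropped");
        if (dryRun) {
            return new Result(false, null, null, messages);
        }
        // Ask for explicit confirmation before any destructive step.
        if (!confirm.test("Continue and accept possible data loss?")) {
            messages.add("aborted by user");
            return new Result(false, null, null, messages);
        }
        // Removing the corrupted data itself is out of scope for this sketch.
        // New ids: the history uuid blocks ops-based recovery, the allocation id
        // forces an explicit "allocate stale primary" step (UUID is a stand-in).
        String historyUuid = UUID.randomUUID().toString();
        String allocationId = UUID.randomUUID().toString();
        messages.add("shard must now be allocated explicitly as a stale primary");
        return new Result(true, historyUuid, allocationId, messages);
    }
}
```

Passing the confirmation step in as a `Predicate` keeps the sketch testable without a terminal; the real tool prompts the user interactively.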

Relates #31389

@vladimirdolzhenko vladimirdolzhenko added >enhancement WIP :Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. v7.0.0 labels Jul 23, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-distributed

@jpountz
Contributor

jpountz commented Jul 23, 2018

Such a tool would be very helpful when no sane shard copies are available anymore, as it is often better to lose a single segment than an entire shard! I left some comments on #31389 about the API.

confirm("Continue and clean up Lucene index files ?", terminal);
// TODO: waiting for Lucene fixes or IndexWriter (to be able to create a new segment while other files there)
// or wait while Lucene CheckIndex makes files non corrupted
Lucene.cleanLuceneIndex(dir);
Contributor

I just had a chat with @vladimirdolzhenko and dropping the whole index is probably the safest option at this point when CheckIndex.Status.missing is true

Contributor Author

I added a link to LUCENE-6762 to the comment

@vladimirdolzhenko vladimirdolzhenko force-pushed the fix/31389_2 branch 7 times, most recently from 9b6aece to 8e0891f Compare August 20, 2018 12:19
@vladimirdolzhenko
Contributor Author

@lcawl could you please have a look at the documentation changes?

@lcawl
Contributor

lcawl commented Aug 20, 2018

I've added a "Command-line tools" section in the Index Modules page, rather than including that link under the settings. @debadair will also provide some feedback.

@vladimirdolzhenko
Contributor Author

@DaveCTurner I've addressed your comments, could you please have another look?

Contributor

@DaveCTurner DaveCTurner left a comment

Minor stuff now, and I asked for another reviewer.

]
}

You must accept the possibility of data loss changing parameter `accept_data_loss` to `true`.
Contributor

nits, sorry:

You must accept the possibility of data loss by changing the parameter accept_data_loss to true.

for (int possibleLockId = fromNodeId; possibleLockId < toNodeId; possibleLockId++) {
final NodeEnvironment.NodeLock nodeLock;
try {
nodeLock = new NodeEnvironment.NodeLock(possibleLockId, logger, environment, p -> {});
Contributor

Could this just be try (Releasable nodeLock = new NodeEnvironment.NodeLock(possibleLockId, logger, environment, p -> {})) {... with the catch at the bottom of this method?

Also, if the directory doesn't exist, I think that FSDirectory.open() creates it, which is not what we want here, so the consumer passed to NodeLock's constructor needs to check for this and bail out sooner. See the protected FSDirectory(Path path, LockFactory lockFactory) constructor for instance. I think this warrants a test that we don't accidentally create a bunch of directories if we can't find the index we were looking for.
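
Both suggestions, taking the lock in a try-with-resources block and bailing out before anything can create a missing directory, can be illustrated with plain JDK file locking. This is only a sketch under stated assumptions: it does not use `NodeEnvironment.NodeLock` or Lucene's `FSDirectory`, and the class name `NodeLockSketch`, the method `tryLockExisting`, and the lock file name `sketch.lock` are hypothetical.

```java
import java.io.IOException;
import java.nio.channels.FileChannel;
import java.nio.channels.FileLock;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch using only JDK classes; it stands in for the
// NodeLock / FSDirectory interaction discussed above, it does not reproduce it.
final class NodeLockSketch {

    /**
     * Tries to lock a file inside an existing directory.
     * Returns false without touching the file system if the directory is missing,
     * so a lookup for a non-existent index never creates directories as a side effect.
     */
    static boolean tryLockExisting(Path dir) throws IOException {
        if (!Files.isDirectory(dir)) {
            return false; // bail out early: nothing is created
        }
        Path lockFile = dir.resolve("sketch.lock"); // illustrative file name
        // try-with-resources releases the lock and closes the channel on every path;
        // a null FileLock (lock held by another process) is simply skipped on close.
        try (FileChannel channel = FileChannel.open(lockFile,
                 StandardOpenOption.CREATE, StandardOpenOption.WRITE);
             FileLock lock = channel.tryLock()) {
            return lock != null;
        }
    }
}
```

A test along the lines requested would call this with a path that does not exist and then assert that the path still does not exist afterwards.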

Contributor Author

💯

private final Lock[] locks;
private final NodePath[] nodePaths;

public NodeLock(final int nodeId, final Logger logger,
Contributor

This looks good to me but I would like another pair of eyes, as it's quite important not to break this. @ywelsch WDYT?

Contributor

@ywelsch ywelsch left a comment

The NodeLock change looks ok to me. I would also like to revive the discussion that was initiated here about removing the max-local-storage "feature".

Contributor

@DaveCTurner DaveCTurner left a comment

LGTM, great work @vladimirdolzhenko

@DaveCTurner DaveCTurner dismissed debadair’s stale review September 19, 2018 08:21

The docs are much improved (and much shorter) compared with the version that was reviewed

@vladimirdolzhenko vladimirdolzhenko merged commit a3e8b83 into elastic:master Sep 19, 2018
@vladimirdolzhenko
Contributor Author

thank you @DaveCTurner for the valuable comments and review 💯

jasontedor added a commit to jasontedor/elasticsearch that referenced this pull request Sep 19, 2018
* master: (46 commits)
  Fixing assertions in integration test (elastic#33833)
  [CCR] Rename idle_shard_retry_delay to poll_timout in auto follow patterns (elastic#33821)
  HLRC: Delete ML calendar (elastic#33775)
  Move DocsStats into Engine (elastic#33835)
  [Docs] Clarify accessing Date methods in painless (elastic#33560)
  add elasticsearch-shard tool (elastic#32281)
  Cut over to unwrap segment reader (elastic#33843)
  SQL: Fix issue with options for QUERY() and MATCH(). (elastic#33828)
  Emphasize that filesystem-level backups don't work (elastic#33102)
  Use the global doc id to generate a random score (elastic#33599)
  Add minimal sanity checks to custom/scripted similarities. (elastic#33564)
  Profiler: Don’t profile NEXTDOC for ConstantScoreQuery. (elastic#33196)
  [CCR] Change FollowIndexAction.Request class to be more user friendly (elastic#33810)
  SQL: day and month name functions tests locale providers enforcement (elastic#33653)
  TESTS: Set SO_LINGER = 0 for MockNioTransport (elastic#32560)
  Test: Relax jarhell gradle test (elastic#33787)
  [CCR] Fail with a descriptive error if leader index does not exist (elastic#33797)
  Add ES version 6.4.2 (elastic#33831)
  MINOR: Remove Some Dead Code in Scripting (elastic#33800)
  Ensure realtime `_get` and `_termvectors` don't run on the network thread (elastic#33814)
  ...
vladimirdolzhenko added a commit to vladimirdolzhenko/elasticsearch that referenced this pull request Sep 19, 2018
@vladimirdolzhenko vladimirdolzhenko deleted the fix/31389_2 branch September 22, 2018 10:39
vladimirdolzhenko added a commit that referenced this pull request Oct 1, 2018
#32281 adds elasticsearch-shard to provide bwc version of elasticsearch-translog for 6.x; have to remove elasticsearch-translog for 7.0

Relates to #31389
kcm pushed a commit that referenced this pull request Oct 30, 2018
#32281 adds elasticsearch-shard to provide bwc version of elasticsearch-translog for 6.x; have to remove elasticsearch-translog for 7.0

Relates to #31389
Labels
:Distributed/Store Issues around managing unopened Lucene indices. If it touches Store.java, this is a likely label. >enhancement v7.0.0-beta1
10 participants