[BUG][Segment Replication] Stuck Initializing Shard w/ Segment Replication enabled #6084
Comments
@mattweber could it be related to this bug [1]? Do you have any exceptions in the log to pinpoint the issue? [1] #5701 |
Thank you @mattweber for reporting this. I have also seen this error once while working on #5898 where replica shard remained |
@reta maybe, I will go dig through the logs again and see what I find. It's a little rough as I have 48 nodes and am not shipping logs in this environment. @dreamer-89 yes, I debated even opening this and decided to do it just in case someone else had hit it and had more info. Looks like that was a good call. You can resolve it using the cluster reroute API with the cancel + allocate_replica commands. My particular index had 24 shards and 1 replica, and only a single replica ran into this issue. |
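For anyone landing here with the same symptom, the workaround above can be sketched as a request body for `POST /_cluster/reroute`. This is only an illustration: the helper name `build_reroute_body` and the index, shard, and node values are hypothetical, and the target node for `allocate_replica` is assumed to be one you have chosen yourself.

```python
import json


def build_reroute_body(index, shard, stuck_node, target_node):
    """Build a cluster reroute body that cancels the stuck replica
    and then asks for a fresh replica allocation.

    All argument values are supplied by the caller; nothing here
    talks to a cluster.
    """
    if shard < 0:
        raise ValueError("shard number must be non-negative")
    return {
        "commands": [
            # Cancel the replica that is stuck in INITIALIZING.
            {"cancel": {"index": index, "shard": shard, "node": stuck_node}},
            # Request a new replica copy on the chosen node.
            {"allocate_replica": {"index": index, "shard": shard, "node": target_node}},
        ]
    }


# Hypothetical values, for illustration only.
example_body = build_reroute_body("my-index", 7, "data-node-3", "data-node-5")
example_json = json.dumps(example_body, indent=2)
```

The resulting JSON would be sent as the body of the reroute request. Cancelling first matters because the cluster will not allocate a second copy of a replica it still considers in progress.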
Looking into it. |
Not able to reproduce this on
Index settings
|
Able to repro the issue on older
Failure
I see shards getting stuck in initialization for a couple of reasons
Ongoing replication
On the cluster-manager node, I see shard failure exceptions (code link) for the stuck shard
Log trace
Stuck recovery
Stuck peer recovery in translog state even though all operations are completed 100%. This might be happening because we were previously recovering translog operations up to the global checkpoint rather than all available operations. Fixed in #6366
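A rough way to spot the second case is to scan the JSON returned by the index recovery API (`GET /_recovery`) for shards that sit in the `TRANSLOG` stage with every translog operation already recovered. The sketch below assumes the usual response shape (index name, then `shards`, each with `stage` and a `translog` object holding `recovered` and `total`); the function name and the sample data are hypothetical. A single snapshot only yields candidates, since a healthy recovery passes through this state briefly, so the check would need to be repeated over time.

```python
def translog_complete_but_not_done(recovery_response):
    """Return (index, shard_id) pairs whose recovery is in the TRANSLOG
    stage even though all translog operations have been recovered."""
    candidates = []
    for index_name, index_info in recovery_response.items():
        for shard in index_info.get("shards", []):
            translog = shard.get("translog", {})
            total = translog.get("total", -1)
            recovered = translog.get("recovered", 0)
            # total < 0 means the operation count is unknown, so skip it.
            if shard.get("stage") == "TRANSLOG" and total >= 0 and recovered >= total:
                candidates.append((index_name, shard.get("id")))
    return candidates


# Hypothetical sample shaped like a recovery API response.
sample = {
    "my-index": {
        "shards": [
            {"id": 0, "stage": "DONE", "translog": {"recovered": 10, "total": 10}},
            {"id": 1, "stage": "TRANSLOG", "translog": {"recovered": 42, "total": 42}},
            {"id": 2, "stage": "TRANSLOG", "translog": {"recovered": 5, "total": 42}},
        ]
    }
}
stuck_candidates = translog_complete_but_not_done(sample)
```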
|
Resolving this issue as it is not repro'able on latest |
Hit this bug today while benchmarking the changes in #6643. Reopening. Setup: 25 primary shards with 1 replica, on 3 data nodes. The problem happened when stopping the opensearch process on one node containing
|
From what you posted, it looks like the recovery is still in progress in the "translog" state. For segrep replica recoveries, the translog step will only write the ops to xlog.
I haven't hit this bug where a replica never finishes recovery, but I think we have a bug here in that we are attempting to copy more xlog ops than necessary, slowing recovery. Edit: With SR we still attempt to recover from the safe commit, so my previous statement is not accurate. |
Closing this as not reproducible, please re-open if this happens with OS 2.7+. |
Describe the bug
I recently enabled segment replication on an OpenSearch 2.5.0 cluster and somehow ran into a stuck initializing shard. The only things different on this cluster are the 2.5.0 upgrade and enabling segment replication. I am leaning toward this being related to segment replication since it is experimental.
To Reproduce
I have not been able to reproduce it and am not sure how it even happened in the first place.
Expected behavior
Shards initialize and finish
Additional context
I didn't see anything suspicious in the logs, and the issue was easy to resolve using the shard reroute API.