
[BUG] Remote store restore using API fails when the index was created, docs indexed (optional) and node was stopped before the first refresh could happen #7923

Closed · ashking94 opened this issue Jun 6, 2023 · 3 comments · Fixed by #9480
Labels: bug (Something isn't working), Storage:Durability (Issues and PRs related to the durability framework), Storage (Issues and PRs relating to data and metadata storage), v2.10.0

ashking94 (Member) commented Jun 6, 2023

Describe the bug
If an index (with remote segments and remote translog enabled) is created and the node hosting the shard is removed/stopped before the first refresh, then restoring the index using the _remotestore/_restore API fails. The following exception is seen:

opensearch-master1  | [2023-06-06T12:28:54,656][INFO ][o.o.c.m.MetadataIndexStateService] [opensearch-master1] closing indices [test-index/ty9eRlntQL2_39LGyksoPg]
opensearch-master1  | [2023-06-06T12:28:54,689][INFO ][o.o.c.m.MetadataIndexStateService] [opensearch-master1] completed closing of indices [test-index]
opensearch-node2    | [2023-06-06T12:28:59,283][INFO ][o.o.p.PluginsService     ] [opensearch-node2] PluginService:onIndexModule index:[test-index/ty9eRlntQL2_39LGyksoPg]
opensearch-node2    | [2023-06-06T12:28:59,297][INFO ][o.o.i.s.RemoteSegmentStoreDirectory] [opensearch-node2] No metadata file found, this can happen for new index with no data uploaded to remote segment store
opensearch-node2    | [2023-06-06T12:28:59,306][INFO ][o.o.i.s.IndexShard       ] [opensearch-node2] [test-index][0] Downloading segments from remote segment store
opensearch-node2    | [2023-06-06T12:28:59,307][INFO ][o.o.i.s.RemoteSegmentStoreDirectory] [opensearch-node2] No metadata file found, this can happen for new index with no data uploaded to remote segment store
opensearch-node2    | [2023-06-06T12:28:59,308][INFO ][o.o.i.s.IndexShard       ] [opensearch-node2] [test-index][0] Downloaded segments: []
opensearch-node2    | [2023-06-06T12:28:59,308][INFO ][o.o.i.s.IndexShard       ] [opensearch-node2] [test-index][0] Skipped download for segments: []
opensearch-node2    | [2023-06-06T12:28:59,315][INFO ][o.o.i.t.RemoteFsTranslog ] [opensearch-node2] Downloading translog files from remote for shard [test-index][0] 
opensearch-node2    | [2023-06-06T12:28:59,320][INFO ][o.o.i.t.t.TranslogTransferManager] [opensearch-node2] Downloading translog files with: Primary Term = 1, Generation = 7, Location = /usr/share/opensearch/data/nodes/0/indices/ty9eRlntQL2_39LGyksoPg/0/translog
opensearch-node2    | [2023-06-06T12:28:59,331][INFO ][o.o.i.t.RemoteFsTranslog ] [opensearch-node2] Downloaded translog files from remote for shard [test-index][0] 
opensearch-node2    | [2023-06-06T12:28:59,336][WARN ][o.o.i.c.IndicesClusterStateService] [opensearch-node2] [test-index][0] marking and sending shard failed due to [failed recovery]
opensearch-node2    | org.opensearch.indices.recovery.RecoveryFailedException: [test-index][0]: Recovery failed on {opensearch-node2}{0VTPak0PQxm8RoYxJpVC8w}{C5Kl4AmmQEKitkRHjcQzvw}{172.19.0.2}{172.19.0.2:9300}{dir}{shard_indexing_pressure_enabled=true}
opensearch-node2    | 	at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$31(IndexShard.java:3418) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$7(StoreRecovery.java:434) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:345) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:121) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.IndexShard.restoreFromRemoteStore(IndexShard.java:2521) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.IndexShard.lambda$startRecovery$26(IndexShard.java:3322) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) [opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
opensearch-node2    | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
opensearch-node2    | 	at java.lang.Thread.run(Thread.java:1623) [?:?]
opensearch-node2    | Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
opensearch-node2    | 	... 12 more
opensearch-node2    | Caused by: org.opensearch.index.translog.TranslogCorruptedException: translog from source [/usr/share/opensearch/data/nodes/0/indices/ty9eRlntQL2_39LGyksoPg/0/translog] is corrupted
opensearch-node2    | 	at org.opensearch.index.translog.Translog.readCheckpoint(Translog.java:1905) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1892) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.IndexShard.loadGlobalCheckpointToReplicationTracker(IndexShard.java:2163) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2189) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:469) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromRemoteStore$1(StoreRecovery.java:123) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	... 9 more
opensearch-node2    | Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "text" is null
opensearch-node2    | 	at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:80) ~[lucene-core-9.7.0-snapshot-4d1ed9e.jar:9.7.0-snapshot-4d1ed9e 4d1ed9ef9f69ebd032538ff4324fe8f6c8356f9a - 2023-05-19 14:51:47]
opensearch-node2    | 	at org.opensearch.index.translog.TranslogHeader.read(TranslogHeader.java:182) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.translog.Translog.readCheckpoint(Translog.java:1901) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1892) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.IndexShard.loadGlobalCheckpointToReplicationTracker(IndexShard.java:2163) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2189) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:469) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromRemoteStore$1(StoreRecovery.java:123) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-node2    | 	... 9 more
opensearch-master1  | [2023-06-06T12:28:59,338][WARN ][o.o.c.r.a.AllocationService] [opensearch-master1] failing shard [failed shard, shard [test-index][0], node[0VTPak0PQxm8RoYxJpVC8w], [P], recovery_source[remote store recovery [7jSQKnwoSyCk8SKGe1ECDA]], s[INITIALIZING], a[id=ZVyL8o1xRk2UikmEjFVTQg], unassigned_info[[reason=EXISTING_INDEX_RESTORED], at[2023-06-06T12:28:59.192Z], delayed=false, details[restore_source[remote_store]], allocation_status[no_attempt]], message [failed recovery], failure [RecoveryFailedException[[test-index][0]: Recovery failed on {opensearch-node2}{0VTPak0PQxm8RoYxJpVC8w}{C5Kl4AmmQEKitkRHjcQzvw}{172.19.0.2}{172.19.0.2:9300}{dir}{shard_indexing_pressure_enabled=true}]; nested: IndexShardRecoveryException[failed recovery]; nested: TranslogCorruptedException[translog from source [/usr/share/opensearch/data/nodes/0/indices/ty9eRlntQL2_39LGyksoPg/0/translog] is corrupted]; nested: NullPointerException[Cannot invoke "java.lang.CharSequence.length()" because "text" is null]; ], markAsStale [true]]
opensearch-master1  | org.opensearch.indices.recovery.RecoveryFailedException: [test-index][0]: Recovery failed on {opensearch-node2}{0VTPak0PQxm8RoYxJpVC8w}{C5Kl4AmmQEKitkRHjcQzvw}{172.19.0.2}{172.19.0.2:9300}{dir}{shard_indexing_pressure_enabled=true}
opensearch-master1  | 	at org.opensearch.index.shard.IndexShard.lambda$executeRecovery$31(IndexShard.java:3418) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.StoreRecovery.lambda$recoveryListener$7(StoreRecovery.java:434) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.action.ActionListener$1.onFailure(ActionListener.java:88) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:345) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:121) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.IndexShard.restoreFromRemoteStore(IndexShard.java:2521) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.IndexShard.lambda$startRecovery$26(IndexShard.java:3322) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.action.ActionRunnable$2.doRun(ActionRunnable.java:88) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:806) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:52) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1144) [?:?]
opensearch-master1  | 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:642) [?:?]
opensearch-master1  | 	at java.lang.Thread.run(Thread.java:1623) [?:?]
opensearch-master1  | Caused by: org.opensearch.index.shard.IndexShardRecoveryException: failed recovery
opensearch-master1  | 	... 12 more
opensearch-master1  | Caused by: org.opensearch.index.translog.TranslogCorruptedException: translog from source [/usr/share/opensearch/data/nodes/0/indices/ty9eRlntQL2_39LGyksoPg/0/translog] is corrupted
opensearch-master1  | 	at org.opensearch.index.translog.Translog.readCheckpoint(Translog.java:1905) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1892) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.IndexShard.loadGlobalCheckpointToReplicationTracker(IndexShard.java:2163) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2189) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:469) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromRemoteStore$1(StoreRecovery.java:123) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	... 9 more
opensearch-master1  | Caused by: java.lang.NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "text" is null
opensearch-master1  | 	at org.apache.lucene.util.BytesRef.<init>(BytesRef.java:80) ~[lucene-core-9.7.0-snapshot-4d1ed9e.jar:9.7.0-snapshot-4d1ed9e 4d1ed9ef9f69ebd032538ff4324fe8f6c8356f9a - 2023-05-19 14:51:47]
opensearch-master1  | 	at org.opensearch.index.translog.TranslogHeader.read(TranslogHeader.java:182) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.translog.Translog.readCheckpoint(Translog.java:1901) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.translog.Translog.readGlobalCheckpoint(Translog.java:1892) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.IndexShard.loadGlobalCheckpointToReplicationTracker(IndexShard.java:2163) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.IndexShard.openEngineAndRecoverFromTranslog(IndexShard.java:2189) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.StoreRecovery.recoverFromRemoteStore(StoreRecovery.java:469) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.index.shard.StoreRecovery.lambda$recoverFromRemoteStore$1(StoreRecovery.java:123) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	at org.opensearch.action.ActionListener.completeWith(ActionListener.java:342) ~[opensearch-3.0.0-SNAPSHOT.jar:3.0.0-SNAPSHOT]
opensearch-master1  | 	... 9 more
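The proximate failure is the NullPointerException at the bottom of both stack traces: Lucene's BytesRef(CharSequence) constructor calls text.length() immediately, so TranslogHeader.read fails when the translog UUID it is handed is null, and that NPE then surfaces as the TranslogCorruptedException. A minimal Java sketch of the failing pattern (illustrative only; translogUuid here is a hypothetical stand-in for whatever value TranslogHeader.read receives on this path):

import org.apache.lucene.util.BytesRef;

public class BytesRefNpeSketch {
    public static void main(String[] args) {
        // Plausibly nothing was uploaded to the remote translog before the node
        // stopped, leaving the UUID consumed on the restore path null.
        CharSequence translogUuid = null; // hypothetical stand-in
        // BytesRef(CharSequence) dereferences text.length() right away, throwing:
        // NullPointerException: Cannot invoke "java.lang.CharSequence.length()" because "text" is null
        new BytesRef(translogUuid);
    }
}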

To Reproduce
Steps to reproduce the behavior:

  1. Create a repository and an index (repository registration is sketched after this list):
curl -X PUT "localhost:9200/test-index?pretty" -H 'Content-Type: application/json' -d'
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 0,
    "replication.type" : "SEGMENT",
    "index.remote_store.enabled": true,
    "index.remote_store.repository" : "seg",
    "index.remote_store.translog.enabled" : true,
    "index.remote_store.translog.repository" : "rem",
    "index.remote_store.translog.buffer_interval" : "300ms",
    "refresh_interval": "1000s"
  }
}
'
  2. Index docs into the index. This step is optional; the issue reproduces even without indexing.
  3. Use the _cat/shards API to find the node hosting the shard and stop that node.
  4. Close the index so that the remote store restore can proceed:
curl -X POST "localhost:9203/test-index/_close?pretty"
  5. Restore the index:
curl -X POST "localhost:9203/_remotestore/_restore" -H 'Content-Type: application/json' -d'{"indices": ["test-index"]}'

Expected behavior
The index should be restored and become green.

Plugins
NA

Screenshots
NA

Host/Environment (please complete the following information):

  • OS: Not related
  • Version: main branch

Additional context
NA

@ashking94 ashking94 added bug Something isn't working untriaged Storage:Durability Issues and PRs related to the durability framework v2.9.0 'Issues and PRs related to version v2.9.0' and removed untriaged labels Jun 6, 2023
@ashking94 ashking94 changed the title [BUG] Remote store restore using API fails when the index was created and node was stopped before refresh [BUG] Remote store restore using API fails when the index was created, docs indexed (optional) and node was stopped before refresh Jun 6, 2023
@ashking94 ashking94 changed the title [BUG] Remote store restore using API fails when the index was created, docs indexed (optional) and node was stopped before refresh [BUG] Remote store restore using API fails when the index was created, docs indexed (optional) and node was stopped before the first refresh could happen Jun 6, 2023
ashking94 (Member, Author) commented:
There is another case: create an index with remote segments and remote translog enabled, do not index anything, perform a flush, stop the node, close the index, and perform a remote store restore. The restore fails with the same error as above. That flush variant is sketched below.
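A hedged command sequence for the flush variant, assuming the same index settings, ports, and hostnames as in the reproduction steps above:

curl -X POST "localhost:9200/test-index/_flush?pretty"
# stop the node hosting the shard (find it via _cat/shards), then:
curl -X POST "localhost:9203/test-index/_close?pretty"
curl -X POST "localhost:9203/_remotestore/_restore" -H 'Content-Type: application/json' -d'{"indices": ["test-index"]}'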

@gbbafna gbbafna self-assigned this Jun 21, 2023
BhumikaSaini-Amazon (Contributor) commented:
Relates to #6188 (i.e., the currently muted testRemoteTranslogRestoreWithNoDataPostCommit integration test awaiting a fix).

sachinpkale (Member) commented:
Taking a look

@dreamer-89 dreamer-89 added v2.10.0 and removed v2.9.0 'Issues and PRs related to version v2.9.0' labels Jul 19, 2023
@Bukhtawar Bukhtawar added the Storage Issues and PRs relating to data and metadata storage label Jul 27, 2023
@sachinpkale sachinpkale self-assigned this Aug 9, 2023