Less Verbose Serialization of Snapshot Failure in SLM Metadata #80942
Conversation
We should not serialize the full exception, including its causes and stack traces, here. This can be a string of multiple MB for a very large cluster in which a large subset of indices/shards fails to snapshot. We can get the full details of what failed for each shard from the repository as well as from the logs anyway. If we fail to finalize the snapshot, with this change we still get the rough reason for that failure and can look at the logs for more details.
Pinging @elastic/es-data-management (Team:Data Management)
// Render only the REST-style representation of the exception (type and
// reason), not the full stack trace string with causes and suppressed
// exceptions.
return exception.map(e -> Strings.toString((builder, params) -> {
    ElasticsearchException.generateThrowableXContent(builder, EXCEPTION_SERIALIZATION_PARAMS, e);
    return builder;
}, EXCEPTION_SERIALIZATION_PARAMS)).orElse(null);
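For illustration, a minimal stand-alone sketch of the same idea: flatten an exception chain to "type: message" entries with no stack traces. This is plain Java with hypothetical names, not the PR's code, which goes through ElasticsearchException.generateThrowableXContent instead:

import java.util.ArrayList;
import java.util.List;

public class ShallowExceptionString {
    // Walk the cause chain and keep only "type: message" per level,
    // mirroring what a REST-style rendering retains.
    static String shallow(Throwable t) {
        List<String> parts = new ArrayList<>();
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            parts.add(cur.getClass().getSimpleName() + ": " + cur.getMessage());
        }
        return String.join("; nested: ", parts);
    }

    public static void main(String[] args) {
        Exception e = new RuntimeException("failed to create snapshot",
            new java.io.IOException("Unable to write blob"));
        System.out.println(shallow(e));
        // -> RuntimeException: failed to create snapshot; nested: IOException: Unable to write blob
    }
}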
I am afraid that in the linked SDH the exception became huge because of 2089 suppressed exceptions (attached to the root exception, not to the nested one):
{"type":"snapshot_exception","reason":"[azure_repository:routine-snapshot-2021.09.29-08uqgt9wskigd-qqjym8ea] failed to create snapshot successfully, 2089 out of 3497 total shards failed","stack_trace":"SnapshotException[[azure_repository:
routine-snapshot-2021.09.29-08uqgt9wskigd-qqjym8ea] failed to create snapshot successfully, 2089 out of 3497 total shards failed]
at org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:111)
at org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:93)
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:77)
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:163)
at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:217)
at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:184)
at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101)
at org.elasticsearch.action.ActionListener.onResponse(ActionListener.java:291)
at org.elasticsearch.snapshots.SnapshotsService.completeListenersIgnoringException(SnapshotsService.java:2850)
at org.elasticsearch.snapshots.SnapshotsService.lambda$finalizeSnapshotEntry$36(SnapshotsService.java:1710)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$finalizeSnapshot$38(BlobStoreRepository.java:1140)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134)
at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:831)
Suppressed: [openapi-iis-live-7.11.2-2021.08.19/h80zl_siTwC5zSzUIsqqlw][[openapi-iis-live-7.11.2-2021.08.19][1]] IndexShardSnapshotFailedException[UncategorizedExecutionException[Failed execution]; nested: ExecutionException[java.io.IOException: Unable to write blob indices/W7tLIyNDQI-N0W6z2QPi8A/1/__8QAl_rcTQn2rV7A30QqeEw]; nested: IOException[Unable to write blob indices/W7tLIyNDQI-N0W6z2QPi8A/1/__8QAl_rcTQn2rV7A30QqeEw]; nested: AbortedSnapshotException[aborted]]
at org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:66)
at org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:54)
at org.elasticsearch.snapshots.SnapshotsService.finalizeSnapshotEntry(SnapshotsService.java:1614)
at org.elasticsearch.snapshots.SnapshotsService.access$2100(SnapshotsService.java:119)
at org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1555)
at org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1552)
at org.elasticsearch.action.ActionListener.onResponse(ActionListener.java:291)
at org.elasticsearch.action.ResultDeduplicator$CompositeListener.onResponse(ResultDeduplicator.java:92)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$getRepositoryData$47(BlobStoreRepository.java:1376)
at org.elasticsearch.action.ResultDeduplicator.executeOnce(ResultDeduplicator.java:40)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:1376)
at org.elasticsearch.snapshots.SnapshotsService.endSnapshot(SnapshotsService.java:1552)
at org.elasticsearch.snapshots.SnapshotsService.access$900(SnapshotsService.java:119)
at org.elasticsearch.snapshots.SnapshotsService$16.clusterStateProcessed(SnapshotsService.java:3355)
at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:523)
at org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:410)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at org.elasticsearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:410)
at org.elasticsearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:270)
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:262)
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:239)
at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:62)
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:140)
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139)
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204)
... 3 more
Suppressed: [openapi-svc-live-7.11.2-2021.09.07/whEHoP6ATW6aOKy8HNOzEA][[openapi-svc-live-7.11.2-2021.09.07][3]] IndexShardSnapshotFailedException[UncategorizedExecutionException[Failed execution]; nested: ExecutionException[java.io.IOE
…
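The size blow-up is easy to reproduce outside Elasticsearch. A minimal stand-alone sketch (plain Java; the shard counts mirror the SDH, everything else is made up):

import java.io.PrintWriter;
import java.io.StringWriter;

public class SuppressedBlowup {
    public static void main(String[] args) {
        Exception root = new Exception(
            "failed to create snapshot successfully, 2089 out of 3497 total shards failed");
        // One suppressed exception per failed shard, attached to the root,
        // which is what the stack trace above shows.
        for (int i = 0; i < 2089; i++) {
            root.addSuppressed(new Exception(
                "[index-" + i + "][0] IndexShardSnapshotFailedException[aborted]"));
        }
        StringWriter out = new StringWriter();
        // printStackTrace walks every suppressed exception, so the resulting
        // string grows linearly with the number of failed shards.
        root.printStackTrace(new PrintWriter(out));
        System.out.println("serialized length: " + out.getBuffer().length() + " chars");
    }
}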
They were using 7.13.1, in case that has been changed since then.
I was looking at the exception JSON and I think we print each suppressed exception twice:
~/Downloads/sdh-5176/api-diagnostics-20211029-145129 ᐅ grep "at org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:66)" details.json -c
4178
~/Downloads/sdh-5176/api-diagnostics-20211029-145129 ᐅ grep "Unable to write blob indices/W7tLIyNDQI-N0W6z2QPi8A/1/__8QAl_rcTQn2rV7A30QqeEw" details.json -c
2
The first time it is written by ExceptionsHelper.stackTrace(throwable) -> e.printStackTrace(printWriter) inside the STACK_TRACE field, and the second time inside our own SUPPRESSED array.
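That duplication follows from Throwable.printStackTrace itself, which already emits a "Suppressed: ..." block for every suppressed exception; a separate suppressed array then repeats them. A stand-alone sketch of just that behaviour (plain Java, not the actual serialization code):

import java.io.PrintWriter;
import java.io.StringWriter;

public class DoubleSerialization {
    public static void main(String[] args) {
        Exception root = new Exception("snapshot failed");
        root.addSuppressed(new Exception("[index][0] shard failure"));

        // Copy 1: the flattened stack trace string already contains the
        // suppressed exception ("Suppressed: ..." lines).
        StringWriter sw = new StringWriter();
        root.printStackTrace(new PrintWriter(sw));
        System.out.println(sw.toString().contains("Suppressed:")); // true

        // Copy 2: iterating getSuppressed() and serializing each entry into
        // a dedicated array writes the same exceptions a second time.
        for (Throwable t : root.getSuppressed()) {
            System.out.println("suppressed entry: " + t.getMessage());
        }
    }
}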
I went a little further here now. As you pointed out to me on another channel, even with this change we were still serialising all the suppressed exceptions.
It turns out these exceptions are pointless without the stack trace (the message doesn't really contain any actionable information). I don't think we can serialise the stack traces for every shard, or even should do so for a subset of shards; it's just a massive number of bytes. Storing them in SLM history items seems outright wrong to me, as there's no telling how large those exceptions could serialize.
-> I just dropped collecting the shard-level exceptions in SLM itself and moved serialisation to what we would emit in a REST response, both for the history store and the cluster state (see the sketch below).
This still gives us a reasonable amount of actionable information when the snapshot request via the internal client fails outright; for the shard-level details we track how many shards failed, and that's it. Further analysis can be done via the GET /_snapshot/ API and, if even more detail is needed, the logs.
Storing this kind of information outside the logs, stack trace included, simply doesn't scale to situations like all shards failing in a 100k-shard snapshot.
@dakrone could you take a look here and see if you're fine with ^^, we really need to do something here :) Otherwise this is going to bite us (already has IMO, with the fact that we indexed O(10M) docs in a linked issue).
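A minimal sketch of the shape this leaves behind (hypothetical, simplified class and field names; the real change lives in the SLM metadata and history-store classes): record a short REST-style failure summary and a failed-shard count, never the raw Throwable graph.

// Hypothetical stand-in for an SLM history/metadata entry after this change.
final class SlmFailureSummary {
    private final String policyId;
    private final String failureSummary; // short "type: reason" string, no stack trace
    private final int failedShards;      // shard-level detail stays in GET /_snapshot/ and the logs

    SlmFailureSummary(String policyId, String failureSummary, int failedShards) {
        this.policyId = policyId;
        this.failureSummary = failureSummary;
        this.failedShards = failedShards;
    }

    static SlmFailureSummary of(String policyId, Exception e, int failedShards) {
        // Flatten to type + message only; the full exception is deliberately dropped.
        return new SlmFailureSummary(
            policyId, e.getClass().getSimpleName() + ": " + e.getMessage(), failedShards);
    }
}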
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for fixing this Armin
@elasticmachine update branch
Jenkins run elasticsearch-ci/part-2
Thanks Ievgen and Lee!
💔 Backport failed
You can use sqren/backport to manually backport by running …
* upstream/master: (150 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...

# Conflicts:
#   server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
We should not serialize the full exception, including its causes and stack traces, here. This can be a string of multiple MB for a very large cluster in which a large subset of indices/shards fails to snapshot.
We can get the full details of what failed for each shard from the repository as well as from the logs anyway. If we fail to finalize the snapshot, with this change we still get the rough reason for that failure and can look at the logs for more details.
Also, moved generating the JSON to the proper utility for that.
Somewhat relates to #77466 (since this is motivated by a user issue that saw a multi-MB string being serialized here).