Less Verbose Serialization of Snapshot Failure in SLM Metadata #80942
Conversation
We should not serialize the full exception, including its causes and stack traces, here. This can be a string of multiple MB for a very large cluster in which a large subset of indices/shards fails to snapshot. We can get the full details of what failed for each shard from the repository as well as from the logs anyway. If we fail to finalize the snapshot, with this change we still get the rough reason for that failure and can look at the logs for more details.
Pinging @elastic/es-data-management (Team:Data Management)
// Render only the REST-style representation of the exception (type and
// reason), not the full stack trace string with causes and suppressed
// exceptions.
return exception.map(e -> Strings.toString((builder, params) -> {
    ElasticsearchException.generateThrowableXContent(builder, EXCEPTION_SERIALIZATION_PARAMS, e);
    return builder;
}, EXCEPTION_SERIALIZATION_PARAMS)).orElse(null);
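For illustration, a minimal stand-alone sketch of the same idea: flatten an exception chain to "type: message" entries with no stack traces. This is plain Java with hypothetical names, not the PR's code, which goes through ElasticsearchException.generateThrowableXContent instead:

import java.util.ArrayList;
import java.util.List;

public class ShallowExceptionString {
    // Walk the cause chain and keep only "type: message" per level,
    // mirroring what a REST-style rendering retains.
    static String shallow(Throwable t) {
        List<String> parts = new ArrayList<>();
        for (Throwable cur = t; cur != null; cur = cur.getCause()) {
            parts.add(cur.getClass().getSimpleName() + ": " + cur.getMessage());
        }
        return String.join("; nested: ", parts);
    }

    public static void main(String[] args) {
        Exception e = new RuntimeException("failed to create snapshot",
            new java.io.IOException("Unable to write blob"));
        System.out.println(shallow(e));
        // -> RuntimeException: failed to create snapshot; nested: IOException: Unable to write blob
    }
}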
I am afraid that in the linked SDH the exception became huge because of 2089 suppressed exceptions (attached to the root exception, not to the nested one):
{"type":"snapshot_exception","reason":"[azure_repository:routine-snapshot-2021.09.29-08uqgt9wskigd-qqjym8ea] failed to create snapshot successfully, 2089 out of 3497 total shards failed","stack_trace":"SnapshotException[[azure_repository:
routine-snapshot-2021.09.29-08uqgt9wskigd-qqjym8ea] failed to create snapshot successfully, 2089 out of 3497 total shards failed]
at org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:111)
at org.elasticsearch.xpack.slm.SnapshotLifecycleTask$1.onResponse(SnapshotLifecycleTask.java:93)
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:83)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:77)
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:31)
at org.elasticsearch.xpack.security.action.filter.SecurityActionFilter.lambda$applyInternal$2(SecurityActionFilter.java:163)
at org.elasticsearch.action.ActionListener$DelegatingFailureActionListener.onResponse(ActionListener.java:217)
at org.elasticsearch.action.ActionListener$DelegatingActionListener.onResponse(ActionListener.java:184)
at org.elasticsearch.action.ActionListener$MappedActionListener.onResponse(ActionListener.java:101)
at org.elasticsearch.action.ActionListener.onResponse(ActionListener.java:291)
at org.elasticsearch.snapshots.SnapshotsService.completeListenersIgnoringException(SnapshotsService.java:2850)
at org.elasticsearch.snapshots.SnapshotsService.lambda$finalizeSnapshotEntry$36(SnapshotsService.java:1710)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$finalizeSnapshot$38(BlobStoreRepository.java:1140)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:134)
at org.elasticsearch.action.ActionRunnable.lambda$supply$0(ActionRunnable.java:47)
at org.elasticsearch.action.ActionRunnable$2.doRun(ActionRunnable.java:62)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:732)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:26)
at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1130)
at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:630)
at java.base/java.lang.Thread.run(Thread.java:831)
Suppressed: [openapi-iis-live-7.11.2-2021.08.19/h80zl_siTwC5zSzUIsqqlw][[openapi-iis-live-7.11.2-2021.08.19][1]] IndexShardSnapshotFailedException[UncategorizedExecutionException[Failed execution]; nested: ExecutionException[java.io.IOException: Unable to write blob indices/W7tLIyNDQI-N0W6z2QPi8A/1/__8QAl_rcTQn2rV7A30QqeEw]; nested: IOException[Unable to write blob indices/W7tLIyNDQI-N0W6z2QPi8A/1/__8QAl_rcTQn2rV7A30QqeEw]; nested: AbortedSnapshotException[aborted]]
at org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:66)
at org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:54)
at org.elasticsearch.snapshots.SnapshotsService.finalizeSnapshotEntry(SnapshotsService.java:1614)
at org.elasticsearch.snapshots.SnapshotsService.access$2100(SnapshotsService.java:119)
at org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1555)
at org.elasticsearch.snapshots.SnapshotsService$7.onResponse(SnapshotsService.java:1552)
at org.elasticsearch.action.ActionListener.onResponse(ActionListener.java:291)
at org.elasticsearch.action.ResultDeduplicator$CompositeListener.onResponse(ResultDeduplicator.java:92)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$getRepositoryData$47(BlobStoreRepository.java:1376)
at org.elasticsearch.action.ResultDeduplicator.executeOnce(ResultDeduplicator.java:40)
at org.elasticsearch.repositories.blobstore.BlobStoreRepository.getRepositoryData(BlobStoreRepository.java:1376)
at org.elasticsearch.snapshots.SnapshotsService.endSnapshot(SnapshotsService.java:1552)
at org.elasticsearch.snapshots.SnapshotsService.access$900(SnapshotsService.java:119)
at org.elasticsearch.snapshots.SnapshotsService$16.clusterStateProcessed(SnapshotsService.java:3355)
at org.elasticsearch.cluster.service.MasterService$SafeClusterStateTaskListener.clusterStateProcessed(MasterService.java:523)
at org.elasticsearch.cluster.service.MasterService$TaskOutputs.lambda$processedDifferentClusterState$1(MasterService.java:410)
at java.base/java.util.ArrayList.forEach(ArrayList.java:1511)
at org.elasticsearch.cluster.service.MasterService$TaskOutputs.processedDifferentClusterState(MasterService.java:410)
at org.elasticsearch.cluster.service.MasterService.onPublicationSuccess(MasterService.java:270)
at org.elasticsearch.cluster.service.MasterService.publish(MasterService.java:262)
at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:239)
at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:62)
at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:140)
at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:139)
at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:177)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:673)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:241)
at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:204)
... 3 more
Suppressed: [openapi-svc-live-7.11.2-2021.09.07/whEHoP6ATW6aOKy8HNOzEA][[openapi-svc-live-7.11.2-2021.09.07][3]] IndexShardSnapshotFailedException[UncategorizedExecutionException[Failed execution]; nested: ExecutionException[java.io.IOE
…
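The size blow-up is easy to reproduce outside Elasticsearch. A minimal stand-alone sketch (plain Java; the shard counts mirror the SDH, everything else is made up):

import java.io.PrintWriter;
import java.io.StringWriter;

public class SuppressedBlowup {
    public static void main(String[] args) {
        Exception root = new Exception(
            "failed to create snapshot successfully, 2089 out of 3497 total shards failed");
        // One suppressed exception per failed shard, attached to the root,
        // which is what the stack trace above shows.
        for (int i = 0; i < 2089; i++) {
            root.addSuppressed(new Exception(
                "[index-" + i + "][0] IndexShardSnapshotFailedException[aborted]"));
        }
        StringWriter out = new StringWriter();
        // printStackTrace walks every suppressed exception, so the resulting
        // string grows linearly with the number of failed shards.
        root.printStackTrace(new PrintWriter(out));
        System.out.println("serialized length: " + out.getBuffer().length() + " chars");
    }
}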
They were using 7.13.1, in case that has been changed since then.
I was looking at the exception JSON and I think we print each suppressed exception twice:
~/Downloads/sdh-5176/api-diagnostics-20211029-145129 ᐅ grep "at org.elasticsearch.snapshots.SnapshotShardFailure.<init>(SnapshotShardFailure.java:66)" details.json -c
4178
~/Downloads/sdh-5176/api-diagnostics-20211029-145129 ᐅ grep "Unable to write blob indices/W7tLIyNDQI-N0W6z2QPi8A/1/__8QAl_rcTQn2rV7A30QqeEw" details.json -c
2
The first time it is written by ExceptionsHelper.stackTrace(throwable) -> e.printStackTrace(printWriter) inside the STACK_TRACE field, and the second time inside our own SUPPRESSED array.
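That duplication follows from Throwable.printStackTrace itself, which already emits a "Suppressed: ..." block for every suppressed exception; a separate suppressed array then repeats them. A stand-alone sketch of just that behaviour (plain Java, not the actual serialization code):

import java.io.PrintWriter;
import java.io.StringWriter;

public class DoubleSerialization {
    public static void main(String[] args) {
        Exception root = new Exception("snapshot failed");
        root.addSuppressed(new Exception("[index][0] shard failure"));

        // Copy 1: the flattened stack trace string already contains the
        // suppressed exception ("Suppressed: ..." lines).
        StringWriter sw = new StringWriter();
        root.printStackTrace(new PrintWriter(sw));
        System.out.println(sw.toString().contains("Suppressed:")); // true

        // Copy 2: iterating getSuppressed() and serializing each entry into
        // a dedicated array writes the same exceptions a second time.
        for (Throwable t : root.getSuppressed()) {
            System.out.println("suppressed entry: " + t.getMessage());
        }
    }
}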
I went a little further here now. As you pointed out to me on another channel, even with this change we were still serialising all the suppressed exceptions.
It turns out these exceptions are pointless without the stack trace (the message doesn't really contain any actionable information). I don't think we can serialise the stack traces for every shard, or even should do so for a subset of shards; it's just a massive number of bytes. Storing them in SLM history items seems outright wrong to me, as there's no telling how large those exceptions could serialize.
-> I just dropped collecting the shard-level exceptions in SLM itself and moved serialisation to what we would emit in a REST response, both for the history store and the cluster state (see the sketch below).
This still gives us a reasonable amount of actionable information when the snapshot request via the internal client fails outright; for the shard-level details we track how many shards failed, and that's it. Further analysis can be done via the GET /_snapshot/ API and, if even more detail is needed, the logs.
Storing this kind of information outside the logs, stack trace included, simply doesn't scale to situations like all shards failing in a 100k-shard snapshot.
@dakrone could you take a look here and see if you're fine with ^^, we really need to do something here :) Otherwise this is going to bite us (already has IMO, with the fact that we indexed O(10M) docs in a linked issue).
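A minimal sketch of the shape this leaves behind (hypothetical, simplified class and field names; the real change lives in the SLM metadata and history-store classes): record a short REST-style failure summary and a failed-shard count, never the raw Throwable graph.

// Hypothetical stand-in for an SLM history/metadata entry after this change.
final class SlmFailureSummary {
    private final String policyId;
    private final String failureSummary; // short "type: reason" string, no stack trace
    private final int failedShards;      // shard-level detail stays in GET /_snapshot/ and the logs

    SlmFailureSummary(String policyId, String failureSummary, int failedShards) {
        this.policyId = policyId;
        this.failureSummary = failureSummary;
        this.failedShards = failedShards;
    }

    static SlmFailureSummary of(String policyId, Exception e, int failedShards) {
        // Flatten to type + message only; the full exception is deliberately dropped.
        return new SlmFailureSummary(
            policyId, e.getClass().getSimpleName() + ": " + e.getMessage(), failedShards);
    }
}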
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM, thanks for fixing this Armin
@elasticmachine update branch
Jenkins run elasticsearch-ci/part-2
Thanks Ievgen and Lee!
💔 Backport failed
You can use sqren/backport to manually backport by running …
* upstream/master: (150 commits)
  Fix ComposableIndexTemplate equals when composed_of is null (elastic#80864)
  Optimize DLS bitset building for matchAll query (elastic#81030)
  URL option for BaseRunAsSuperuserCommand (elastic#81025)
  Less Verbose Serialization of Snapshot Failure in SLM Metadata (elastic#80942)
  Fix shadowed vars pt7 (elastic#80996)
  Fail shards early when we can detect a type missmatch (elastic#79869)
  Delegate Ref Counting to ByteBuf in Netty Transport (elastic#81096)
  Clarify `unassigned.reason` docs (elastic#81017)
  Strip blocks from settings for reindex targets (elastic#80887)
  Split off the values supplier for ScriptDocValues (elastic#80635)
  [ML] Switch message and detail for model snapshot deprecations (elastic#81108)
  [DOCS] Update xrefs for snapshot restore docs (elastic#81023)
  [ML] Updates visiblity of validate API (elastic#81061)
  Track histogram of transport handling times (elastic#80581)
  [ML] Fix datafeed preview with remote indices (elastic#81099)
  [ML] Fix acceptable model snapshot versions in ML deprecation checker (elastic#81060)
  [ML] Add logging for failing PyTorch test (elastic#81044)
  Extending the timeout waiting for snapshot to be ready (elastic#81018)
  [ML] Fix incorrect logging of unexpected model size error (elastic#81089)
  [ML] Make inference timeout test more reliable (elastic#81094)
  ...

# Conflicts:
#   server/src/main/java/org/elasticsearch/index/mapper/NumberFieldMapper.java
We should not serialize the full exception, including its causes and stack traces, here. This can be a string of multiple MB for a very large cluster in which a large subset of indices/shards fails to snapshot.
We can get the full details of what failed for each shard from the repository as well as from the logs anyway. If we fail to finalize the snapshot, with this change we still get the rough reason for that failure and can look at the logs for more details.
Also, moved generating the JSON to the proper utility for that.
Somewhat relates to #77466 (since this is motivated by a user issue that saw a multi-MB string being serialized here).