
ES index: don't log individual def failures #297

Closed
wants to merge 3 commits

Conversation

Dieterbe
Contributor

Otherwise the log can get flooded when ES can't keep up.

We already have metrics reporting, so there's no need to log at this resolution.

@woodsaj
Member

woodsaj commented Aug 25, 2016

These are not debug messages. They indicate that there is a problem and should be in the log file. When benchmarking the index I was able to index 20k metrics/s into ES and never saw a failure.

@Dieterbe
Contributor Author

I was doing 50k/s and getting thousands of these messages.
Excerpt below.

Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.40f59ff5bcedb1d6df731ab130042326 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.804b367de8d313fc9802a563b1dfb899 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.6ad68cafda6ac435f29d6fcea3c3b4c7 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@3656a0a2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.7df814e0ed65496c6d23851684da5f6d failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.aeb2c6decfc94a9e97b6025d6ed3e715 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.014a10b3da9cc58fdeee09a418357517 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.775c9ca01a9bf5720d5af49a2bacc2f6 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"

There's nothing actionable in there for an operator, at least not at the individual metricdef level.

It's much better to report at the level of an entire batch request/response, which we already do at https://github.com/raintank/metrictank/blob/master/idx/elasticsearch/elasticsearch.go#L353-L355

Maybe the solution is to iterate over the erroneous responses and count, for each error type, how many errors there were.
We could log warnings like:

5124 metricdefs encountered the error: es_rejected_execution_exception
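For illustration, a minimal Go sketch of that grouping, assuming the bulk response has already been parsed into a list of failed items that each carry an error-type string (the bulkItemError type and its fields are made up for the example, not the actual elastigo response structs):

```go
package esidx

import "log"

// bulkItemError is a hypothetical stand-in for one failed item in a bulk response.
type bulkItemError struct {
	Id      string // metricdef id, e.g. "1.40f59ff5bcedb1d6df731ab130042326"
	ErrType string // e.g. "es_rejected_execution_exception"
}

// logGroupedFailures emits one warning per error type instead of one line per metricdef.
func logGroupedFailures(failures []bulkItemError) {
	counts := make(map[string]int)
	for _, f := range failures {
		counts[f.ErrType]++
	}
	for errType, n := range counts {
		log.Printf("[W] ES: %d metricdefs encountered the error: %s", n, errType)
	}
}
```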

@woodsaj
Member

woodsaj commented Aug 25, 2016

Grouping by error type sounds like a good idea.

You also might want to decrease max-conns for the ES index so you don't overwhelm Elasticsearch.
This will likely cause ESIndex.Add() to block while it tries to add the metrics to the BulkIndexer, because adding to the BulkIndexer pushes the docs onto an internal buffered channel with a fixed size of 100.
https://github.com/mattbaird/elastigo/blob/master/lib/corebulk.go#L111
https://github.com/mattbaird/elastigo/blob/master/lib/corebulk.go#L287
https://github.com/mattbaird/elastigo/blob/master/lib/corebulk.go#L248

We could work around this by maintaining our own configurable buffer, so that we neither block ingestion nor overwhelm ES.
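A rough sketch of that idea, with made-up names (doc, bufSize, indexDoc) rather than the actual metrictank/elastigo API:

```go
package esidx

import "log"

// doc stands in for a metricdef document to be indexed.
type doc struct {
	id   string
	body []byte
}

// bufferedIndexer puts a configurable buffer in front of the real indexer, so
// Add() doesn't block ingestion when the downstream (e.g. elastigo's fixed
// 100-doc channel) can't keep up.
type bufferedIndexer struct {
	queue chan doc
}

// newBufferedIndexer starts one drainer goroutine that hands docs to indexDoc,
// which stands in for "add to the BulkIndexer".
func newBufferedIndexer(bufSize int, indexDoc func(doc) error) *bufferedIndexer {
	b := &bufferedIndexer{queue: make(chan doc, bufSize)}
	go func() {
		for d := range b.queue {
			if err := indexDoc(d); err != nil {
				log.Printf("[W] ES: failed to index %s: %s", d.id, err)
			}
		}
	}()
	return b
}

// Add enqueues without ever blocking: when the buffer is full it returns false
// so the caller can count the drop (or decide to block/retry) instead.
func (b *bufferedIndexer) Add(d doc) bool {
	select {
	case b.queue <- d:
		return true
	default:
		return false
	}
}
```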

@Dieterbe Dieterbe assigned Dieterbe and unassigned woodsaj Aug 26, 2016
@Dieterbe
Contributor Author

Pushed some commits (untested, because I haven't been able to get the MT datasource working again in raintank-docker).

We could work around this by maintaining our own configurable buffer, so that we neither block ingestion nor overwhelm ES.

We have to think about this a bit more explicitly. There's a whole spectrum of how much backpressure/blocking we want to apply versus how many metricdef indexing failures we want to tolerate.

On one extreme, we could block metric ingestion until all metricdefs have been safely indexed (not practical with ES).
On the other, we could process metricdefs entirely asynchronously and not add any backpressure.

Currently we're somewhere in the middle: we block on updating the in-memory index and on adding to the ES bulkqueue channel, but the actual bulk-index buffering, flushing (and retries) are async (though limited via max-buffer-docs, buffer-delay-max and the 100-slot channel). Metricdefs can get lost if they were in the bulkqueue channel, the bulkqueue buffer, or our retry buffer (only the ones that failed to index).

Adding to the in-memory index should probably always block, so that we have the metricdefs for the data and can serve it (though splitting metric data from metric metadata will complicate this or make it moot).
We can probably remove all ES-induced backpressure, but only if you run multiple instances so that you always have at least one up and running. If you only have one instance and it crashes after a long period of not being able to flush data to ES, you'll have lost metric metadata. We could perhaps change the Kafka offset logic to also take into account which metricdefs haven't been successfully committed yet. With the separate streams concept we could track both offsets separately.
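To make that offset idea concrete, a tiny hypothetical sketch (names made up): only commit the smaller of the data offset and the oldest offset whose metricdef hasn't made it into ES yet.

```go
package esidx

// safeCommitOffset returns the Kafka offset we could safely commit: the data
// offset, held back to the oldest offset whose metricdef is still pending in ES.
func safeCommitOffset(dataOffset int64, pendingDefOffsets []int64) int64 {
	commit := dataOffset
	for _, o := range pendingDefOffsets {
		if o < commit {
			commit = o
		}
	}
	return commit
}
```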

With the carbon input, I guess all bets are off, since there's no acknowledgement mechanism in the protocol anyway.

@Dieterbe Dieterbe removed the this week label Oct 3, 2016
Dieterbe added a commit that referenced this pull request Jan 27, 2017
* it doesn't perform well
* we don't use it and discourage it
* it is not well maintained, and has some known issues

fix #500
fix #387
closes #297
Dieterbe added a commit that referenced this pull request Jan 31, 2017
* it doesn't perform well
* we don't use it and discourage it
* it is not well maintained, and has some known issues

fix #500
fix #387
closes #297
@Dieterbe Dieterbe closed this in 3a43353 Jan 31, 2017
@Dieterbe Dieterbe deleted the no-individual-failures branch January 2, 2018 16:01