
ES index: don't log individual def failures #297

Closed
wants to merge 3 commits

Conversation

Dieterbe
Contributor

Otherwise the log can get flooded when ES can't keep up.

We already have metrics reporting, so there's no need to log at this resolution.

@woodsaj
Member

woodsaj commented Aug 25, 2016

These are not debug messages. They indicate that there is a problem and should be in the log file. When benchmarking the index I was able to index 20k metrics/s into ES and never saw a failure.

@Dieterbe
Contributor Author

I was doing 50k/s and getting thousands of these messages.
Excerpt below.

Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.40f59ff5bcedb1d6df731ab130042326 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.804b367de8d313fc9802a563b1dfb899 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.6ad68cafda6ac435f29d6fcea3c3b4c7 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@3656a0a2 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.7df814e0ed65496c6d23851684da5f6d failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.aeb2c6decfc94a9e97b6025d6ed3e715 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.014a10b3da9cc58fdeee09a418357517 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"
Aug 25 12:37:50 dieter-mt2 metrictank[10340]: 2016/08/25 12:37:50 [W] ES: 1.775c9ca01a9bf5720d5af49a2bacc2f6 failed: es_rejected_execution_exception: "rejected execution of org.elasticsearch.transport.TransportService$4@110ad163 on EsThreadPoolExecutor[bulk, queue capacity = 50, org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor@6b75405d[Running, pool size = 4, active threads = 4, queued tasks = 50, completed tasks = 46]]"

There's nothing actionable in there for an operator, at least not at the individual metricdef level.

It's much better to report at the level of an entire batch request/response, which we already do at https://github.com/raintank/metrictank/blob/master/idx/elasticsearch/elasticsearch.go#L353-L355

Maybe the solution is to iterate over the erroneous responses and count, for each error type, how many errors there were.
We could log warnings like:

5124 metricdefs encountered the error: es_rejected_execution_exception
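For illustration, a minimal Go sketch of that grouping, assuming the bulk response has already been parsed into a list of failed items that each carry an error-type string (the bulkItemError type and its fields are made up for the example, not the actual elastigo response structs):

```go
package esidx

import "log"

// bulkItemError is a hypothetical stand-in for one failed item in a bulk response.
type bulkItemError struct {
	Id      string // metricdef id, e.g. "1.40f59ff5bcedb1d6df731ab130042326"
	ErrType string // e.g. "es_rejected_execution_exception"
}

// logGroupedFailures emits one warning per error type instead of one line per metricdef.
func logGroupedFailures(failures []bulkItemError) {
	counts := make(map[string]int)
	for _, f := range failures {
		counts[f.ErrType]++
	}
	for errType, n := range counts {
		log.Printf("[W] ES: %d metricdefs encountered the error: %s", n, errType)
	}
}
```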

@woodsaj
Member

woodsaj commented Aug 25, 2016

Grouping by error type sounds like a good idea.

You also might want to decrease max-conns for the ES index so you don't overwhelm Elasticsearch.
This will likely cause ESIndex.Add() to block while it tries to add the metrics to the BulkIndexer, because adding to the BulkIndexer pushes the docs onto an internal buffered channel with a fixed size of 100.
https://github.com/mattbaird/elastigo/blob/master/lib/corebulk.go#L111
https://github.com/mattbaird/elastigo/blob/master/lib/corebulk.go#L287
https://github.com/mattbaird/elastigo/blob/master/lib/corebulk.go#L248

We could work around this by maintaining our own configurable buffer, so that we neither block ingestion nor overwhelm ES.
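A rough sketch of that idea, with made-up names (doc, bufSize, indexDoc) rather than the actual metrictank/elastigo API:

```go
package esidx

import "log"

// doc stands in for a metricdef document to be indexed.
type doc struct {
	id   string
	body []byte
}

// bufferedIndexer puts a configurable buffer in front of the real indexer, so
// Add() doesn't block ingestion when the downstream (e.g. elastigo's fixed
// 100-doc channel) can't keep up.
type bufferedIndexer struct {
	queue chan doc
}

// newBufferedIndexer starts one drainer goroutine that hands docs to indexDoc,
// which stands in for "add to the BulkIndexer".
func newBufferedIndexer(bufSize int, indexDoc func(doc) error) *bufferedIndexer {
	b := &bufferedIndexer{queue: make(chan doc, bufSize)}
	go func() {
		for d := range b.queue {
			if err := indexDoc(d); err != nil {
				log.Printf("[W] ES: failed to index %s: %s", d.id, err)
			}
		}
	}()
	return b
}

// Add enqueues without ever blocking: when the buffer is full it returns false
// so the caller can count the drop (or decide to block/retry) instead.
func (b *bufferedIndexer) Add(d doc) bool {
	select {
	case b.queue <- d:
		return true
	default:
		return false
	}
}
```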

@Dieterbe Dieterbe assigned Dieterbe and unassigned woodsaj Aug 26, 2016
@Dieterbe
Contributor Author

Pushed some commits (untested, because I haven't been able to get the MT datasource working again in raintank-docker).

We could work around this by maintaining our own configurable buffer, so that we neither block ingestion nor overwhelm ES.

We have to think about this a bit more explicitly. There's a whole spectrum of how much backpressure/blocking we want to apply versus how many metricdef indexing failures we want to tolerate.

On one extreme, we could block metric ingestion until all metricdefs have been safely indexed (not practical with ES).
On the other, we could process metricdefs entirely asynchronously and not add any backpressure.

Currently we're somewhere in the middle: we block on updating the in-memory index and on adding to the ES bulkqueue channel, but the actual bulk-index buffering, flushing (and retries) are async (though limited via max-buffer-docs, buffer-delay-max and the 100-slot channel). Metricdefs can get lost if they were in the bulkqueue channel, the bulkqueue buffer, or our retry buffer (only the ones that failed to index).

Adding to the in-memory index should probably always block, so that we have the metricdefs for the data and can serve it (though splitting metric data from metric metadata will complicate this or make it moot).
We can probably remove all ES-induced backpressure, but only if you run multiple instances so that you always have at least one up and running. If you only have one instance and it crashes after a long period of not being able to flush data to ES, you'll have lost metric metadata. We could perhaps change the Kafka offset logic to also take into account which metricdefs haven't been successfully committed yet. With the separate streams concept we could track both offsets separately.
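To make that offset idea concrete, a tiny hypothetical sketch (names made up): only commit the smaller of the data offset and the oldest offset whose metricdef hasn't made it into ES yet.

```go
package esidx

// safeCommitOffset returns the Kafka offset we could safely commit: the data
// offset, held back to the oldest offset whose metricdef is still pending in ES.
func safeCommitOffset(dataOffset int64, pendingDefOffsets []int64) int64 {
	commit := dataOffset
	for _, o := range pendingDefOffsets {
		if o < commit {
			commit = o
		}
	}
	return commit
}
```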

With the carbon input, I guess all bets are off, since there's no acknowledgement mechanism in the protocol anyway.

@Dieterbe Dieterbe removed the this week label Oct 3, 2016
Dieterbe added a commit that referenced this pull request Jan 27, 2017
* it doesn't perform well
* we don't use it and discourage it
* it is not well maintained, and has some known issues

fix #500
fix #387
closes #297
Dieterbe added a commit that referenced this pull request Jan 31, 2017
* it doesn't perform well
* we don't use it and discourage it
* it is not well maintained, and has some known issues

fix #500
fix #387
closes #297
@Dieterbe Dieterbe closed this in 3a43353 Jan 31, 2017
@Dieterbe Dieterbe deleted the no-individual-failures branch January 2, 2018 16:01