ES index: don't log individual def failures #297
Conversation
Otherwise the log can get flooded when ES can't keep up. We have the metrics reporting already, so there is no need to log at this resolution.
These are not debug messages. They indicate that there is a problem and should be in the log file. When benchmarking the index I was able to index 20k metrics/s into ES and I never saw a failure.
I was doing 50k/s and getting thousands of these messages.
There's nothing actionable in there for an operator, at least not at the individual metricdef level. It's much better to report at the level of an entire batch request/response, which we already do at https://github.com/raintank/metrictank/blob/master/idx/elasticsearch/elasticsearch.go#L353-L355. Maybe the solution is to iterate over the erroneous responses and count how many errors there were of each error type, as in the sketch below.
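For illustration, a minimal Go sketch of that idea: walk the per-item results of one bulk response, tally failures by error string, and emit one log line per error type for the whole batch. The `bulkItemResult` type, its fields, and the sample error string are hypothetical stand-ins, not the actual client types used in metrictank.

```go
package main

import "log"

// bulkItemResult is a hypothetical stand-in for whatever the ES client
// returns per document in a bulk response; adapt to the real type.
type bulkItemResult struct {
	ID    string
	Error string // empty when the document indexed successfully
}

// reportBulkErrors aggregates the failures of one bulk request by error type,
// so we log one line per error type instead of one per metricdef.
func reportBulkErrors(items []bulkItemResult) {
	counts := make(map[string]int)
	for _, it := range items {
		if it.Error != "" {
			counts[it.Error]++
		}
	}
	for errType, n := range counts {
		log.Printf("ES bulk index: %d docs failed with %q", n, errType)
	}
}

func main() {
	// toy usage: two rejections collapse into a single log line
	reportBulkErrors([]bulkItemResult{
		{ID: "a", Error: "es_rejected_execution_exception"},
		{ID: "b", Error: "es_rejected_execution_exception"},
		{ID: "c", Error: ""},
	})
}
```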
Grouping by error type sounds like a good idea. You also might want to decrease max-conns for the ES index so you don't overwhelm Elasticsearch. We could work around this by maintaining our own configurable buffer, so that we neither block ingestion nor overwhelm ES.
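A minimal sketch of such a buffer, assuming a plain bounded channel with a non-blocking send: when the buffer is full the def is dropped and counted rather than stalling ingestion or piling more load onto ES. The names and the drop-on-full policy are illustrative, not an agreed design.

```go
package main

import (
	"log"
	"sync/atomic"
)

type def struct{ name string }

var defsDropped uint64 // would be exposed as a metric in practice

// enqueue does a non-blocking send: ingestion never stalls, and when the
// buffer is full the def is dropped and counted instead of overwhelming ES.
func enqueue(buf chan def, d def) {
	select {
	case buf <- d:
	default:
		atomic.AddUint64(&defsDropped, 1)
	}
}

func main() {
	// buffer size would come from a config setting (hypothetical here)
	buf := make(chan def, 4)
	for i := 0; i < 10; i++ {
		enqueue(buf, def{name: "some.metric"})
	}
	log.Printf("buffered=%d dropped=%d", len(buf), atomic.LoadUint64(&defsDropped))
}
```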
Pushed some commits (untested, because I haven't been able to get the MT datasource working again in raintank-docker).
We have to think about this a bit more explicitly. There's a whole spectrum of how much backpressure/blocking we want to apply vs. how many metricdef indexing failures we want to tolerate. On one extreme, we could block metric ingestion until all metricdefs have been safely indexed (not practical with ES).

Currently we're somewhere in the middle: we block on updating the in-memory index and on adding to the ES bulkqueue channel, but the actual bulk-index buffering, flushing and retries are async (though limited via max-buffer-docs, buffer-delay-max and the channel capacity of 100). Metricdefs can get lost if they were in the bulkqueue channel, the bulkqueue buffer, or in our retry buffer (only the ones that failed to index).

Adding to the in-memory index should probably always block, so that we have the metricdefs for the data and can serve it (though splitting metric data from metric metadata will complicate this or make it moot). With the carbon input I guess all bets are off anyway, since there's no acknowledgement mechanism in the protocol. A sketch of the middle-ground setup follows below.
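To make the "somewhere in the middle" option concrete, here is a hedged sketch: the caller blocks on the in-memory update and on a bounded bulkqueue channel, while a separate goroutine buffers and flushes to ES asynchronously, triggered by either a document count or a time limit. Function names, defaults and the exact flush logic are assumptions, not the actual metrictank code.

```go
package main

import (
	"log"
	"time"
)

type def struct{ name string }

var (
	maxBufferDocs  = 1000             // assumed analogue of max-buffer-docs
	bufferDelayMax = 10 * time.Second // assumed analogue of buffer-delay-max
	bulkQueue      = make(chan def, 100) // bounded: a full queue blocks Add
)

// memoryIdxAdd stands in for the synchronous in-memory index update.
func memoryIdxAdd(d def) {}

// indexBulk stands in for the actual bulk request to ES (plus retry handling).
func indexBulk(docs []def) { log.Printf("bulk indexing %d defs", len(docs)) }

// Add is the blocking part: the caller waits until the def is in the
// in-memory index and accepted by the bounded bulkqueue channel.
func Add(d def) {
	memoryIdxAdd(d)
	bulkQueue <- d
}

// flusher is the async part: it buffers defs and flushes when either
// maxBufferDocs is reached or roughly bufferDelayMax has elapsed.
func flusher() {
	buf := make([]def, 0, maxBufferDocs)
	tick := time.NewTicker(bufferDelayMax)
	for {
		select {
		case d := <-bulkQueue:
			buf = append(buf, d)
			if len(buf) >= maxBufferDocs {
				indexBulk(buf)
				buf = buf[:0]
			}
		case <-tick.C:
			if len(buf) > 0 {
				indexBulk(buf)
				buf = buf[:0]
			}
		}
	}
}

func main() {
	go flusher()
	Add(def{name: "some.metric"}) // returns once the def is queued
	time.Sleep(time.Second)       // in a real run the flush fires on size or delay
}
```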