
Problem with increasing connections while getting metrics from Elasticsearch in client mode #452

Closed
eresgie opened this issue Dec 17, 2015 · 11 comments
Labels: bug (unexpected problem or unintended behavior)

Comments

@eresgie

eresgie commented Dec 17, 2015

Hello.

I'm using Telegraf (ver. 0.2.4) to monitor a few Elasticsearch (ver. 1.7.3) instances. One of them is configured in client mode (it holds no data; essentially it acts as a load balancer). When I start Telegraf, metrics are gathered and sent to the Influx database, but the Telegraf process keeps opening network connections to Elasticsearch and never closes them. The count keeps increasing until the limit of open files is reached.

22994 is the PID of the Telegraf process:

# ls /proc/22994/fd | wc -l
128

Within 2 hours the number of open files reaches the limit, causing:

Dec 17 12:07:30 es-lb-0 telegraf: 2015/12/17 12:07:30 Error in plugin [net]: error getting net io info: open /proc/net/dev: too many open files
Dec 17 12:07:30 es-lb-0 telegraf: panic: runtime error: index out of range

Then Telegraf restarts (or clears its active connections), or sometimes it even stops gathering metrics.

When the Elasticsearch plugin is not loaded, everything is fine.
I've tried all possible Elasticsearch plugin configuration options (local = true/false), without any positive results.
The number of open files is increasing in the Elasticsearch process, too, so I'm sure it's related.

Process file descriptors (output shortened for brevity):

# ls -al /proc/22994/fd
[...]
lrwx------. 1 telegraf telegraf 64 Dec 17 13:16 90 -> socket:[8634053]
lrwx------. 1 telegraf telegraf 64 Dec 17 13:14 91 -> socket:[8634093]
lrwx------. 1 telegraf telegraf 64 Dec 17 13:16 92 -> socket:[8634632]
lrwx------. 1 telegraf telegraf 64 Dec 17 13:16 93 -> socket:[8634743]
lrwx------. 1 telegraf telegraf 64 Dec 17 13:16 94 -> socket:[8634671]
lrwx------. 1 telegraf telegraf 64 Dec 17 13:16 95 -> socket:[8634776]
lrwx------. 1 telegraf telegraf 64 Dec 17 13:16 96 -> socket:[8635496]
lrwx------. 1 telegraf telegraf 64 Dec 17 13:16 97 -> socket:[8635534]
lrwx------. 1 telegraf telegraf 64 Dec 17 13:17 98 -> socket:[8635002]
lrwx------. 1 telegraf telegraf 64 Dec 17 13:16 99 -> socket:[8635163]

I can help with further testing; for now, my workaround is to remove the Elasticsearch plugin from Telegraf.

@zstyblik
Contributor

My guess is this is caused by not calling r.Body.Close() and not reading the response body. I don't see anything like that in the plugin's code.

The client must close the response body when finished with it:

resp, err := http.Get("http://example.com/")
if err != nil {
    // handle error
}
defer resp.Body.Close()
body, err := ioutil.ReadAll(resp.Body)
// ... use body ...

I had a similar issue in Python with requests.

@sparrc added the bug (unexpected problem or unintended behavior) label on Dec 17, 2015
@sparrc
Contributor

sparrc commented Dec 17, 2015

thanks for the report, if @zstyblik is correct then that is the issue, we should always be closing the response body.

@zstyblik
Contributor

@sparrc - I don't know Telegraf's architecture, so it's a "guess" based on reading through the Elasticsearch plugin. But I doubt such functionality (reading the response body and closing the connection) is implemented somewhere else inside. Anyway, that part is copy-pasted from the net/http documentation, plus my own experience from Python land :)

@sparrc
Contributor

sparrc commented Dec 17, 2015

I think you're right, we should be closing this: https://github.com/influxdb/telegraf/blob/master/plugins/elasticsearch/elasticsearch.go#L201
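
For anyone following along, the failure mode we're talking about looks roughly like the hypothetical sketch below (gatherOnce and the URL are made up for illustration; this is not the actual plugin source):

    // Hypothetical illustration of the leak: the response body is never
    // closed, so every call leaves a socket (and its file descriptor) open.
    package main

    import (
        "encoding/json"
        "log"
        "net/http"
        "time"
    )

    func gatherOnce(client *http.Client, url string, v interface{}) error {
        r, err := client.Get(url)
        if err != nil {
            return err
        }
        // Missing: defer r.Body.Close()
        // Without closing (or fully draining) the body, the connection
        // behind r.Body is not released and its file descriptor stays open.
        return json.NewDecoder(r.Body).Decode(v)
    }

    func main() {
        client := &http.Client{Timeout: 5 * time.Second}
        for i := 0; i < 10; i++ {
            var stats map[string]interface{}
            // Placeholder URL; any HTTP endpoint that answers will do.
            if err := gatherOnce(client, "http://localhost:9200/_nodes/stats", &stats); err != nil {
                log.Println(err)
            }
            // Watch `ls /proc/<pid>/fd | wc -l` grow across iterations.
        }
    }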

@zstyblik
Contributor

@sparrc indeed :) It could be something like ...

    r, err := e.client.Get(url)
    if err != nil {
        return err
    }
    // Always close the body so the underlying connection (and its file
    // descriptor) is released.
    defer r.Body.Close()
    if r.StatusCode != http.StatusOK {
        // Drain the body before the deferred Close so the connection can
        // be reused by the transport.
        ioutil.ReadAll(r.Body)
        return fmt.Errorf("elasticsearch: API responded with status-code %d, expected %d",
            r.StatusCode, http.StatusOK)
    }
    if err = json.NewDecoder(r.Body).Decode(v); err != nil {
        return err
    }
    return nil

? And import ioutil or use another Reader, I guess.

@sparrc
Contributor

sparrc commented Dec 17, 2015

why does it require the ioutil.ReadAll? I think only the defer is necessary

@zstyblik
Contributor

Because you're supposed to read the body of the response even if you don't care about it. Whether that's just good practice or strictly optional, I don't know. I always do it regardless :)
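
The idiom I usually reach for is to drain the body into ioutil.Discard before the deferred Close, which (as far as I understand it) lets the transport put the keep-alive connection back into its idle pool. A minimal, self-contained sketch with a placeholder URL:

    package main

    import (
        "fmt"
        "io"
        "io/ioutil"
        "net/http"
    )

    // fetchStatus returns the HTTP status code of url. Draining the body
    // to EOF before the deferred Close lets net/http's transport reuse the
    // keep-alive connection; Close alone still frees the file descriptor,
    // but the connection will not be reused.
    func fetchStatus(url string) (int, error) {
        resp, err := http.Get(url)
        if err != nil {
            return 0, err
        }
        defer resp.Body.Close()
        if _, err := io.Copy(ioutil.Discard, resp.Body); err != nil {
            return 0, err
        }
        return resp.StatusCode, nil
    }

    func main() {
        // Placeholder URL for illustration.
        code, err := fetchStatus("http://localhost:9200/")
        if err != nil {
            fmt.Println("error:", err)
            return
        }
        fmt.Println("status:", code)
    }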

@jipperinbham
Contributor

@sparrc I was going to work on this and #430 but wasn't sure if I should create a branch from master or 0.2.4 given all the changes going on with master

@sparrc
Contributor

sparrc commented Dec 17, 2015

@jipperinbham do it on master, the changes I'm making are on the 0.3.0 branch and will remain there until sometime in January

@jipperinbham
Contributor

@zstyblik re: whether or not the body should be read, from what I've read you are right: the underlying transport will not reuse the connection if Body.Close() is called without the body having been read first.

In this situation, I tend to think it's ok not to reuse the underlying connection, because we received something other than HTTP/1.1 200 OK and would want to set up the connection again anyway.

thoughts?

@zstyblik
Contributor

@jipperinbham sorry, but I don't know that area well. I'd expect this to be related to the connection pool rather than to connection reuse as such, i.e. another request being sent over the exact same, already-open connection, and freeing that connection up early. shrug
