Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Temporary loss of socket is not recoverable #86

Open
Enrico2 opened this issue Aug 21, 2019 · 4 comments
Open

Temporary loss of socket is not recoverable #86

Enrico2 opened this issue Aug 21, 2019 · 4 comments

Comments

@Enrico2
Copy link

Enrico2 commented Aug 21, 2019

We're running the Datadog agent as a DaemonSet (although I suspect the same issue will happen when running as a sidecar) in Kubernetes and the client is a singleton in a JVM process running in a pod.

The communication between the client and agent is done via UDS.

When instantiating NonBlockingStatsDClient it open()s the DatagramChannel:
https://github.com/DataDog/java-dogstatsd-client/blob/master/src/main/java/com/timgroup/statsd/NonBlockingStatsDClient.java#L642

Everything works as expected.

Now let's say we updated our Datadog agent's template. We apply the template and that causes the DaemonSet to gradually restart the agent processes running on nodes.

When the agent restarts, the socket path is temporarily unavailable and we see a

 java.io.IOException: No such file or directory
    at jnr.unixsocket.UnixDatagramChannel.send(UnixDatagramChannel.java:164)
    at com.timgroup.statsd.StatsDSender.blockingSend(StatsDSender.java:88)
    at com.timgroup.statsd.StatsDSender.run(StatsDSender.java:71)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)

The exception handler we supply to NonBlockingStatsDClient cannot do anything to tell the client to re-open the DatagramChannel, or retry in any way.

When we've seen this happen before, we had to restart all of our pods.

From a code perspective, our only option in this case is to mark the client as "dead" and have other code observe that state and re-instantiate it. But, if we do that, we lose all of the metrics that are currently in the queue.

Instead, the client should be able to recover from such an error.

@Enrico2
Copy link
Author

Enrico2 commented Aug 21, 2019

Correction. This case can happen when the error handler attempts to throw. This ends up throwing in the daemon thread and killing it.

The loss of metrics still happens though, because of a queue.poll() and then an exception is thrown. Perhaps the code should .peek(), and remove in case it was successful?

@maulin-vasavada
Copy link

@Enrico2 is this issue fixed in latest version of the library?

@Enrico2
Copy link
Author

Enrico2 commented Apr 29, 2021

@maulin-vasavada Sorry, I don't know. I'm no longer working at the company where this code was used.
From reading the original report though, it seems there are clear steps for reproduction, so you can check.

@maulin-vasavada
Copy link

Thanks @Enrico2

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants