Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TCPWriter: Close connections on network timeout errors #267

Merged
merged 9 commits into from
Mar 17, 2021

Conversation

tabboud
Copy link
Contributor

@tabboud tabboud commented Mar 16, 2021

Before this PR

Connections could get stuck in a "connection timed out" state and the underlying net.Conn would never be reset causing logs to be dropped.

According to the docs for tls.Conn.SetDeadline() (https://pkg.go.dev/crypto/tls#Conn.SetDeadline), after a write has timed out once, the TLS state is corrupt and thus the same error is returned. My hypothesis is that once we hit the timeout, we got stuck in this state since we never closed the connection.

After this PR

==COMMIT_MSG==
Close TCP connections if there are network timeouts
==COMMIT_MSG==

Notes

I wasn't able to repro the exact issues we were seeing below, since the connection timed out error comes from the ETIMEDOUT syscall on Linux.

2021/03/16 18:38:25 write failed: write tcp <src-ip>:<src-port> -> <dst-ip>:<dst-port>: write: connection timed out

However when digging into syscall.ETIMEDOUT, I realized it happens to implement the net.Error interface and returns true for Timeout() and ALSO returns true for Temporary() if there was a timeout!

Relevant issue: golang/go#31449

Possible downsides?

  • Not directly caused by this change, but we upon any timeout, we will still drop the log line.

This change is Reviewable

@tabboud tabboud requested a review from bmoylan March 16, 2021 21:48
@changelog-app
Copy link

changelog-app bot commented Mar 16, 2021

Generate changelog in changelog/@unreleased

Type

  • Feature
  • Improvement
  • Fix
  • Break
  • Deprecation
  • Manual task
  • Migration

Description

Close TCP connections if there are network timeouts

Check the box to generate changelog(s)

  • Generate changelog entry

@@ -106,7 +106,7 @@ func (d *TCPWriter) Write(p []byte) (n int, err error) {
n, err := conn.Write(envelope[total:])
total += n
if err != nil {
if nerr, ok := err.(net.Error); !(ok && (nerr.Temporary() || nerr.Timeout())) {
if nerr, ok := err.(net.Error); !ok || !nerr.Temporary() {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am slightly nervous that a timeout error could also be considered temporary, but looking at the tls.Conn code it doesn't seem like that is the case. Mind double checking this logic?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just went ahead and added the explicit check for a timeout as well to be safe.

Copy link
Contributor

@bmoylan bmoylan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 2 of 3 files at r2, 1 of 1 files at r3.
Reviewable status: all files reviewed, 1 unresolved discussion (waiting on @tabboud)

@bulldozer-bot bulldozer-bot bot merged commit 96719d3 into develop Mar 17, 2021
@bulldozer-bot bulldozer-bot bot deleted the ta/fix-timeout branch March 17, 2021 15:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants