
Use go-connlimit to ratelimit with 429 responses #9608

Merged
shoenig merged 3 commits into master from f-go-connlimit on Dec 10, 2020

Conversation

@shoenig shoenig (Member) commented Dec 10, 2020

Expands on #9359 from @benbuzbee

In addition, this PR fixes the original tests, which produced the correct result for the wrong reason: in TLS mode, the clients were creating plain HTTP connections, which the server shunts the same way it shunts rate-limited connections (i.e. io.EOF). Instead, use a tls net.Conn for the TLS tests.
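A minimal sketch of the idea, using a hypothetical helper (this is not the PR's actual test code):

package agent

import (
	"crypto/tls"
	"net"
)

// newTLSClientConn wraps the test client's TCP connection in a TLS client
// and completes the handshake when the agent listens on HTTPS. Speaking
// plain HTTP to an HTTPS listener is also terminated with io.EOF, which is
// what let the original tests pass for the wrong reason.
func newTLSClientConn(addr string, cfg *tls.Config) (net.Conn, error) {
	raw, err := net.Dial("tcp", addr)
	if err != nil {
		return nil, err
	}
	conn := tls.Client(raw, cfg)
	if err := conn.Handshake(); err != nil {
		raw.Close()
		return nil, err
	}
	return conn, nil
}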

Also rebased the original commit to remove an extraneous file.

Closes #9359

Ben Buzbee and others added 2 commits December 9, 2020 17:57
This is essentially a port of Consul's similar fix
Changes are:
go get -u github.com/hashicorp/go-connlimit
go mod vendor
Use new HTTP429 handler

hashicorp/consul@20d1ea7
@tgross tgross (Member) left a comment

LGTM!

Certificates: []tls.Certificate{cert},
InsecureSkipVerify: true, // good enough
})
require.NoError(t, err)
Member

If we don't pass the testing.T from the test case into this function and it's closing over the top-level testing.T, I think we lose the association with the test case, which makes errors harder to track down. Might want to either pass in the test case's testing.T or have the function return the error and do the assertion in the caller.
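A minimal sketch of the two options, with hypothetical helper names (not the PR's code):

package agent

import (
	"net"
	"testing"

	"github.com/stretchr/testify/require"
)

// dialAgent shows the first option: take the subtest's *testing.T as a
// parameter rather than closing over the top-level one, so a failed dial
// is reported against the correct test case.
func dialAgent(t *testing.T, addr string) net.Conn {
	t.Helper()
	conn, err := net.Dial("tcp", addr)
	require.NoError(t, err)
	return conn
}

// tryDialAgent shows the second option: return the error and let the
// caller, which owns the test case's testing.T, make the assertion.
func tryDialAgent(addr string) (net.Conn, error) {
	return net.Dial("tcp", addr)
}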

shoenig (Member Author)

Good catch, fixed!

@shoenig shoenig merged commit 958942e into master Dec 10, 2020
@shoenig shoenig deleted the f-go-connlimit branch December 10, 2020 17:05
@shoenig shoenig added this to the 1.0.1 milestone Dec 10, 2020
@schmichael schmichael (Member) left a comment

Yikes, great catch in the tests.

I'm still concerned about this approach for 2 reasons:

  1. Writing a response before reading the request will not work with every HTTP client.
  2. Allowing an attacker to block new connections from other addresses seems like a DoS. Even if it's only 10ms, I don't think there's anything preventing an attacker from filling the kernel's listen backlog with pending connections, each of which will cause a 10ms timeout.

I think 1 will make 2 worse. Imagine a Python script which uses a synchronous HTTP library that only reads after successfully writing. Perhaps this script is run as a system job across all nodes to check a bit of Nomad state and react appropriately. This works fine until the user deploys this system in a remote region behind a NAT. The NAT makes every incoming connection appear to come from the same IP, so the limit is reached. However, the synchronous Python HTTP library hits the 10ms timeout every time because it will never read until its write succeeds (although the write "succeeding" probably depends on its size and various buffers between the systems... so the simple case of a GET may work as expected?).

If there are 500 remote nodes, they could easily DoS the cluster by filling the accept backlog with 500 10ms timeouts. Operators may be unable to use log monitoring (either via the UI or API) because of the unintentional DoS. (#7893 would partially solve this.)

There's a lot of socket/TCP subtlety involved here that maybe I'm wrong about? Maybe this is safe? Confirming via testing sounds incredibly difficult considering the wide variety of HTTP clients, requests, socket implementations, and middleboxes involved.

Regardless, I think a better solution from a safety and usability standpoint would be to implicitly create soft and hard limits:

The soft limit would read the request and respond with a 429 + normal timeout.

The hard limit would behave as before: immediately reject the connection.

The hard limit could implicitly be set to the soft limit + 1. I can't imagine 1 extra connection over the limit will break anybody's security model.

This would allow us to keep all of the existing hard limit code intact, and add the soft limit as a plain old HTTP handler.
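A minimal sketch of this proposal, assuming the soft limit is enforced per client IP as plain HTTP middleware (names and structure are hypothetical, not Nomad's or go-connlimit's implementation):

package agent

import (
	"net"
	"net/http"
	"sync"
)

// softLimiter tracks in-flight requests per client IP and answers 429 once
// the soft limit is reached. The existing go-connlimit hard limit, which
// drops the connection outright, would stay in front of it and be set
// slightly above the soft limit.
type softLimiter struct {
	mu       sync.Mutex
	inflight map[string]int
	limit    int
}

func newSoftLimiter(limit int) *softLimiter {
	return &softLimiter{inflight: make(map[string]int), limit: limit}
}

func (l *softLimiter) wrap(next http.Handler) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		ip, _, err := net.SplitHostPort(r.RemoteAddr)
		if err != nil {
			ip = r.RemoteAddr
		}

		l.mu.Lock()
		over := l.inflight[ip] >= l.limit
		if !over {
			l.inflight[ip]++
		}
		l.mu.Unlock()

		if over {
			// net/http has already read the request headers by the time the
			// handler runs, which sidesteps the write-before-read concern for
			// header-only requests.
			http.Error(w, "too many concurrent requests", http.StatusTooManyRequests)
			return
		}
		defer func() {
			l.mu.Lock()
			l.inflight[ip]--
			l.mu.Unlock()
		}()

		next.ServeHTTP(w, r)
	})
}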

@tgross tgross (Member) commented Dec 10, 2020

That reasoning all makes sense to me; we can't assume clients are well-behaved and non-blocking.

I like the idea of keeping the hard limit but adding a soft limit handler. But I might adjust the relationship of the hard limit and soft limit. In the accidental case you described with 500 nodes, only a single one of those in-flight requests would get an HTTP 429 before the remaining connections get dropped. And if the client is blocking as you've described, no one will see the HTTP 429 before we start dropping connections. So maybe we want to give ourselves a little more breathing room than soft limit + 1? (Like 10% or something?)

@schmichael schmichael (Member) commented

Ah good point. hard_limit = soft_limit + 1 only really helps if it's a single client hitting the limit. 10% seems good to me. Linux+Go are quite good at handling a large number of connections, and our default of 100 is pretty low, so I don't foresee an extra 10 being a big deal.
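A minimal sketch of that arithmetic, with a hypothetical helper name:

package agent

// hardLimit derives the hard limit from the configured soft limit with
// roughly 10% headroom, keeping at least one extra connection. With the
// default soft limit of 100 this yields a hard limit of 110.
func hardLimit(softLimit int) int {
	headroom := softLimit / 10
	if headroom < 1 {
		headroom = 1
	}
	return softLimit + headroom
}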

shoenig added a commit that referenced this pull request Dec 14, 2020
#9608 introduced the use of the
built-in HTTP 429 response handler provided by go-connlimit. There is
concern, though, about plausible DoS attacks that needs to be addressed,
so this PR reverts that functionality.

It keeps a fix in the tests around the use of an HTTPS-enabled client
for when the server is listening on HTTPS. Previously, the tests would
fail deterministically with io.EOF because that's how the TLS server
terminates invalid connections.

Now, the result is much less deterministic. The state of the client
connection and the server socket depends on when the connection is
closed and how far along the handshake was.
shoenig added a commit that referenced this pull request Dec 14, 2020
@github-actions github-actions bot commented Dec 6, 2022

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 6, 2022