Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tests: fix rpc limit tests #12364

Merged
merged 3 commits into from
Mar 24, 2022
Merged

tests: fix rpc limit tests #12364

merged 3 commits into from
Mar 24, 2022

Conversation

lgfa29
Copy link
Contributor

@lgfa29 lgfa29 commented Mar 24, 2022

Autopilot makes a lot of requests to Status.RaftStats which end up affecting the connection limit enforcer.

I couldn't find a way to stop it, or ignore these requests in the counter (the connection limiter rejects the request before we can read and figure out what method is being called), so I'm downgrading the Raft version for this test.

It may be worth investigating the impact of this in real clusters as the configured request limit may be off. I tried setting c.RPCMaxConnsPerClient = tc.limit + 1 to account for the potential concurrent Status.RaftStats but that wasn't enough.

@lgfa29 lgfa29 requested review from tgross and jrasell March 24, 2022 01:54
Copy link
Member

@jrasell jrasell left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it worth renaming the test detailing it is testing Raft v2 only?

Do we plan to follow up with a test that covers v2 behaviour? The test seems to cover some useful protections that we would want to ensure are working correctly.

Comment on lines 872 to 876
// With Raft v3 Autopilot makes a lot of RPC requests to
// Status.RaftStats using the StatsFetcher. This makes it very
// likely that there will be a concurrent request that will
// mess with the request limit enforcer.
c.RaftConfig.ProtocolVersion = 2
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rather than switching to v2, do we have any configuration that could throttle-down (or turn off) the autopilot library for this test? Given that we're going to be running with v3 by default I'd rather not switch this here if we can avoid it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not without modifying the current code I think. I tried to call s.autopilot.Stop() but the requests kept happening.

What did work was to have a config flag to prevent it from starting in the first place:

diff --git a/nomad/config.go b/nomad/config.go
index 6ea8cfd09..4ff57f443 100644
--- a/nomad/config.go
+++ b/nomad/config.go
@@ -349,6 +349,8 @@ type Config struct {
 	// DeploymentQueryRateLimit is in queries per second and is used by the
 	// DeploymentWatcher to throttle the amount of simultaneously deployments
 	DeploymentQueryRateLimit float64
+
+	disableAutopilot bool
 }
 
 // DefaultConfig returns the default configuration. Only used as the basis for
diff --git a/nomad/leader.go b/nomad/leader.go
index 7db6cc57e..4a2affa7c 100644
--- a/nomad/leader.go
+++ b/nomad/leader.go
@@ -288,8 +288,10 @@ func (s *Server) establishLeadership(stopCh chan struct{}) error {
 	s.handlePausableWorkers(true)
 
 	// Initialize and start the autopilot routine
-	s.getOrCreateAutopilotConfig()
-	s.autopilot.Start()
+	if !s.config.disableAutopilot {
+		s.getOrCreateAutopilotConfig()
+		s.autopilot.Start()
+	}
 
 	// Initialize scheduler configuration
 	s.getOrCreateSchedulerConfig()
diff --git a/nomad/rpc_test.go b/nomad/rpc_test.go
index 070829c00..539d01678 100644
--- a/nomad/rpc_test.go
+++ b/nomad/rpc_test.go
@@ -873,7 +873,7 @@ func TestRPC_Limits_OK(t *testing.T) {
 				// Status.RaftStats using the StatsFetcher. This makes it very
 				// likely that there will be a concurrent request that will
 				// mess with the request limit enforcer.
-				c.RaftConfig.ProtocolVersion = 2
+				c.disableAutopilot = true
 			})
 			defer func() {
 				cleanup()

Would this be a better approach?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That looks good to me! (Probably worth adding a comment on the bool to let folks know why it's there.)

@lgfa29
Copy link
Contributor Author

lgfa29 commented Mar 24, 2022

Is it worth renaming the test detailing it is testing Raft v2 only?

Do we plan to follow up with a test that covers v2 behaviour? The test seems to cover some useful protections that we would want to ensure are working correctly.

Yeah, I wasn't happy either, but I couldn't find another way around it.

The root of the problem is that connection rate limit is grouped by IP and in the test setup we have a single node, so both server-to-server RPCs and test-to-server RPCs come from the same address.

I thought that maybe having this as an E2E test would avoid this problem, but then we would have to make sure the test runs in isolation and before any other tests, so the rate limiter is empty.

I will try to see if I can "mask" the TestServer to something else.

@lgfa29
Copy link
Contributor Author

lgfa29 commented Mar 24, 2022

@tgross @jrasell option number 2: bind the server to another IP so Autopilot requests don't come from 127.0.0.1.

diff --git a/nomad/rpc_test.go b/nomad/rpc_test.go
index 070829c00..0ae94f0c7 100644
--- a/nomad/rpc_test.go
+++ b/nomad/rpc_test.go
@@ -17,6 +17,7 @@ import (
 	"time"
 
 	"github.com/hashicorp/go-msgpack/codec"
+	"github.com/hashicorp/go-sockaddr"
 	msgpackrpc "github.com/hashicorp/net-rpc-msgpackrpc"
 	"github.com/hashicorp/nomad/ci"
 	cstructs "github.com/hashicorp/nomad/client/structs"
@@ -869,11 +870,14 @@ func TestRPC_Limits_OK(t *testing.T) {
 				c.RPCHandshakeTimeout = tc.timeout
 				c.RPCMaxConnsPerClient = tc.limit
 
-				// With Raft v3 Autopilot makes a lot of RPC requests to
-				// Status.RaftStats using the StatsFetcher. This makes it very
-				// likely that there will be a concurrent request that will
-				// mess with the request limit enforcer.
-				c.RaftConfig.ProtocolVersion = 2
+				// Bind the server to a private IP so that Autopilot's
+				// StatsFetcher requests come from a different IP than the test
+				// requests, otherwise they would interfere with the connection
+				// rate limiter since limits are imposed by IP address.
+				ip, err := sockaddr.GetPrivateIP()
+				require.NoError(t, err)
+				c.RPCAddr.IP = []byte(ip)
+				c.SerfConfig.MemberlistConfig.BindAddr = ip
 			})
 			defer func() {
 				cleanup()

I have to test if this works in CI, as it depends on the availability of a private IP, but I think that's a fair assumption. If that doesn't work I will go with the disableAutopilot flag approach.

@tgross
Copy link
Member

tgross commented Mar 24, 2022

option number 2: bind the server to another IP

Oh, that's nice too! 👍

@lgfa29
Copy link
Contributor Author

lgfa29 commented Mar 24, 2022

I don't know why GitHub Actions is not being triggered (status is green now). I'm going to go with the different IP approach for now since it doesn't require test-specific code changes.

@lgfa29 lgfa29 requested review from jrasell and tgross March 24, 2022 18:38
Copy link
Member

@tgross tgross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Thanks @lgfa29!

@lgfa29
Copy link
Contributor Author

lgfa29 commented Mar 24, 2022

It was definitely a tricky one to solve 😄

Thanks for all the feedback!

@lgfa29 lgfa29 merged commit 47b07f5 into main Mar 24, 2022
@lgfa29 lgfa29 deleted the fix-rpc-limits-test branch March 24, 2022 19:31
@github-actions
Copy link

I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions.
If you have found a problem that seems related to this change, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Oct 24, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

3 participants