Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

grpc: ensure grpc resolver correctly uses lan/wan addresses on servers #17270

Merged
merged 10 commits into from
May 11, 2023

Conversation

rboyer
Copy link
Member

@rboyer rboyer commented May 9, 2023

Description

The grpc resolver implementation is fed from changes to the router.Router. Within the router there is a map of various areas storing the addressing information for servers in those areas. All map entries are of the WAN variety except a single special entry for the LAN.

Addressing information in the LAN "area" are local addresses intended for use when making a client-to-server or server-to-server request.

The client agent correctly updates this LAN area when receiving lan serf events, so by extension the grpc resolver works fine in that scenario.

The server agent only initially populates a single entry in the LAN area (for itself) on startup, and then never mutates that area map again. For normal RPCs a different structure is used for LAN routing.

Additionally when selecting a server to contact in the local datacenter it will randomly select addresses from either the LAN or WAN addressed entries in the map.

Unfortunately this means that the grpc resolver stack as it exists on server agents is either broken or only accidentally functions by having servers dial each other over the WAN-accessible address. If the operator disables the serf wan port completely likely this incidental functioning would break.

This PR enforces that local requests for servers (both for stale reads or leader forwarded requests) exclusively use the LAN "area" information and also fixes it so that servers keep that area up to date in the router.

A test for the grpc resolver logic was added, as well as a higher level full-stack test to ensure the externally perceived bug does not return.

@rboyer rboyer self-assigned this May 9, 2023
})

// Check them all for the bad error
const grpcError = `failed to find Consul server for global address`
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This error would periodically show up when executing various API calls that ultimately used gRPC to fulfill them under the covers.

@@ -280,6 +281,18 @@ func (r *Router) maybeInitializeManager(area *areaInfo, dc string) *Manager {

// addServer does the work of AddServer once the write lock is held.
func (r *Router) addServer(areaID types.AreaID, area *areaInfo, s *metadata.Server) error {
if areaID == types.AreaLAN {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As much as I want to add this prevention, there's nothing stopping someone from having a node named blah.dc1 in dc1 which would have a wan full name of blah.dc1.dc1. At least 2 tests broke on this safety check.

@rboyer rboyer marked this pull request as ready for review May 10, 2023 16:21
The grpc resolver implementation is fed from changes to the
router.Router. Within the router there is a map of various areas storing
the addressing information for servers in those areas. All map entries
are of the WAN variety except a single special entry for the LAN.

Addressing information in the LAN "area" are local addresses intended
for use when making a client-to-server or server-to-server request.

The client agent correctly updates this LAN area when receiving lan serf
events, so by extension the grpc resolver works fine in that scenario.

The server agent only initially populates a single entry in the LAN area
(for itself) on startup, and then never mutates that area map again.
For normal RPCs a different structure is used for LAN routing.

Additionally when selecting a server to contact in the local datacenter
it will randomly select addresses from either the LAN or WAN addressed
entries in the map.

Unfortunately this means that the grpc resolver stack as it exists on
server agents is either broken or only accidentally functions by having
servers dial each other over the WAN-accessible address. If the operator
disables the serf wan port completely likely this incidental functioning
would break.

This PR enforces that local requests for servers (both for stale reads
or leader forwarded requests) exclusively use the LAN "area" information
and also fixes it so that servers keep that area up to date in the
router.

A test for the grpc resolver logic was added, as well as a higher level
full-stack test to ensure the externally perceived bug does not return.
This reverts commit 0dc5bbd.

testutil.RunStep(t, "no server experienced the server resolution error", func(t *testing.T) {
// Check them all for the bad error
const grpcError = `failed to find Consul server for global address`
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we const-ify this in the code? it might get updated by some UX initiative and make this never fail as a side effect.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants