# Problem with 0.6 automated protocol upshift #1346
While upgrading the nyc demo cluster I ran into an issue where only one of the three servers was sending its coordinates. After adding some debug logging and restarting, it started working, so it's not clear what's going on. TLS is enabled, so I couldn't tell from tcpdump whether the servers weren't sending the coordinates or were rejecting them for some reason.

I'm going to add some more instrumentation and see if I can reproduce it while upgrading one of the other DCs. All the servers showed up in the WAN coordinates even when the /v1/coordinate/nodes endpoint didn't show anything for a server.

Additionally, I realized that if we have a mixed configuration of servers and/or a client that doesn't yet see all the servers, we could end up in a situation where a client fires off a coordinate update that creates an unsupported item in the Raft log for the other servers. I need to test this case and possibly add the ignore flag to this type of message so that we don't create problems during an upgrade.

---
Interesting - it was reproducible on the next set of servers. First off, there's a pretty jank version of Consul on these:
I ran (the potential) 0.6rc1 with this patch applied:

```diff
diff --git a/command/agent/agent.go b/command/agent/agent.go
index 99b4d25..0f523e1 100644
--- a/command/agent/agent.go
+++ b/command/agent/agent.go
@@ -555,6 +555,9 @@ func (a *Agent) CanServersUnderstandProtocol(version uint8) bool {
 			numServers++
 			if member.ProtocolMax >= version {
 				numWhoGrok++
+				fmt.Printf("[DEBUG] agent: XXX %s groks\n", member.Name)
+			} else {
+				fmt.Printf("[DEBUG] agent: XXX %s does not grok\n", member.Name)
 			}
 		}
 	}
@@ -596,12 +599,15 @@ func (a *Agent) sendCoordinate() {
 		min := a.config.SyncCoordinateIntervalMin
 		intv := rateScaledInterval(rate, min, len(a.LANMembers()))
 		intv = intv + randomStagger(intv)
+		a.logger.Printf("[DEBUG] agent: XXX sending coordinate in %9.6f seconds", intv.Seconds())
 		select {
 		case <-time.After(intv):
+			a.logger.Printf("[DEBUG] agent: XXX checking for coordinate send")
 			if !a.CanServersUnderstandProtocol(3) {
 				continue
 			}
+			a.logger.Printf("[DEBUG] agent: XXX sending coordinate")
 			var c *coordinate.Coordinate
 			var err error
diff --git a/consul/coordinate_endpoint.go b/consul/coordinate_endpoint.go
index 4f429be..3af0000 100644
--- a/consul/coordinate_endpoint.go
+++ b/consul/coordinate_endpoint.go
@@ -76,6 +76,7 @@ func (c *Coordinate) batchApplyUpdates() error {
 			break
 		}
+		c.srv.logger.Printf("[DEBUG] consul.coordinate: XXX Applying coordinate for node %s", node)
 		updates[i] = &structs.Coordinate{node, coord}
 		i++
 	}
@@ -101,6 +102,8 @@ func (c *Coordinate) Update(args *structs.CoordinateUpdateRequest, reply *struct
 		return err
 	}
+	c.srv.logger.Printf("[DEBUG] consul.coordinate: XXX Updating coordinate for node %s", args.Node)
+
 	// Since this is a coordinate coming from some place else we harden this
 	// and look for dimensionality problems proactively.
 	coord, err := c.srv.serfLAN.GetCoordinate()
diff --git a/consul/state/state_store.go b/consul/state/state_store.go
index 36a61ec..16c13f3 100644
--- a/consul/state/state_store.go
+++ b/consul/state/state_store.go
@@ -2339,6 +2339,7 @@ func (s *StateStore) CoordinateBatchUpdate(idx uint64, updates structs.Coordinat
 		if err != nil {
 			return fmt.Errorf("failed node lookup: %s", err)
 		}
+		fmt.Printf("[DEBUG] state_store: XXX Coordinate update for %s %v\n", update.Node, node)
 		if node == nil {
 			continue
 		}
```

The first and second servers I rolled looked like this:
Even though all three were updated, only the last one I rolled was able to see that all three servers would grok, so it started sending coordinates at them. Need to see why the version info is different on the first two servers; it looks like there's a min/max calculation that's not taking place. Will add more debug logging and see what happens.
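For reference, the gate being debugged here boils down to the following check, reassembled from the patched `CanServersUnderstandProtocol` in the diff above (a simplified sketch, not the verbatim shipped code):

```go
package main

import "github.com/hashicorp/serf/serf"

// canServersUnderstandProtocol reports whether every Consul server in the
// LAN member list advertises support for the given protocol version. The
// agent holds off on sending coordinates until all servers grok version 3.
func canServersUnderstandProtocol(members []serf.Member, version uint8) bool {
	numServers, numWhoGrok := 0, 0
	for _, member := range members {
		if member.Tags["role"] == "consul" {
			numServers++
			if member.ProtocolMax >= version {
				numWhoGrok++
			}
		}
	}
	return numServers > 0 && numServers == numWhoGrok
}
```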
---

Regarding the second question (the ignore flag): we probably want that as a way to make a downgrade possible if something goes wrong and someone wants to back out. Right now people will get a panic if they try running an older version of Consul and it tries to apply the Raft log entries for the new coordinate updates. This is a bad experience, especially given how unimportant these entries are.
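The general shape of such a flag, assuming the convention of reserving a high bit of the message-type byte so an FSM can skip entries it doesn't recognize instead of panicking (identifiers here are illustrative, not necessarily the shipped names):

```go
package fsm

import (
	"fmt"
	"log"
)

// MessageType is the first byte of every Raft log entry payload.
type MessageType uint8

// IgnoreUnknownTypeFlag is OR'd into a MessageType to tell an FSM that
// doesn't recognize the type to skip the entry rather than panic.
const IgnoreUnknownTypeFlag MessageType = 128

// apply sketches how an FSM could dispatch a log entry safely.
func apply(buf []byte) interface{} {
	msgType := MessageType(buf[0])
	ignoreUnknown := msgType&IgnoreUnknownTypeFlag != 0
	msgType &^= IgnoreUnknownTypeFlag // strip the flag before dispatch

	switch msgType {
	// ... known message types are handled here ...
	default:
		if ignoreUnknown {
			log.Printf("[WARN] fsm: ignoring unknown message type (%d)", msgType)
			return nil
		}
		panic(fmt.Errorf("failed to apply request: %#v", buf))
	}
}
```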
---

Added more debugging output:

```diff
diff --git a/command/agent/agent.go b/command/agent/agent.go
index 99b4d25..f7acffe 100644
--- a/command/agent/agent.go
+++ b/command/agent/agent.go
@@ -553,8 +553,12 @@ func (a *Agent) CanServersUnderstandProtocol(version uint8) bool {
 	for _, member := range members {
 		if member.Tags["role"] == "consul" {
 			numServers++
+			fmt.Printf("[DEBUG] agent: XXX %v\n", member)
 			if member.ProtocolMax >= version {
 				numWhoGrok++
+				fmt.Printf("[DEBUG] agent: XXX %s groks\n", member.Name)
+			} else {
+				fmt.Printf("[DEBUG] agent: XXX %s does not grok\n", member.Name)
 			}
 		}
 	}
```

When rolling two of the nodes, now I get (on sfo1-server-1):
(on sfo1-server-3)
Compare these lines:
What's interesting is that sfo1-server-1 sees sfo1-server-3's versions as
---

Need to prove it, but I'm thinking it might be related to these lines in memberlist: https://github.com/hashicorp/memberlist/blob/master/state.go#L707-L711 I did a shutdown without a leave (that's how this consul server was set up when I did a
---

Yep, doing a
---

The process watcher restarted it pretty quickly. Here's how things looked from sfo1-server-1's perspective as sfo1-server-3 was rolled to the new version:
---

It didn't see an alive -> dead -> alive transition, just an update to the metadata. Check out this code from memberlist: https://github.com/hashicorp/memberlist/blob/master/state.go#L794-L797

```go
// Notify the delegate of any relevant updates
if m.config.Events != nil {
	if oldState == stateDead {
		// if Dead -> Alive, notify of join
		m.config.Events.NotifyJoin(&state.Node)
	} else if !bytes.Equal(oldMeta, state.Meta) {
		// if Meta changed, trigger an update notification
		m.config.Events.NotifyUpdate(&state.Node)
	}
}
```

So clearly the metadata changed because of the version (address and port information was the same), but the handler for this doesn't capture the new version information: https://github.com/hashicorp/serf/blob/master/serf/serf.go#L975-L978

```go
// Update the member attributes
member.Addr = net.IP(n.Addr)
member.Port = n.Port
member.Tags = s.decodeTags(n.Meta)
```

This doesn't properly update the member's version information. If we set that here, things should work ok.
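For what it's worth, a minimal sketch of capturing the versions in that handler, assuming memberlist's `Node` exposes the protocol ranges as `PMin`/`PMax`/`PCur` and `DMin`/`DMax`/`DCur` (the real fix landed later as hashicorp/serf#335):

```go
// Update the member attributes
member.Addr = net.IP(n.Addr)
member.Port = n.Port
member.Tags = s.decodeTags(n.Meta)

// Also refresh the protocol version info so that a fast in-place
// upgrade (no leave, quick restart) is visible to the other members.
member.ProtocolMin = n.PMin
member.ProtocolMax = n.PMax
member.ProtocolCur = n.PCur
member.DelegateMin = n.DMin
member.DelegateMax = n.DMax
member.DelegateCur = n.DCur
```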
---

Ugh - I thoroughly confused myself here:
The thing I found above should still be fixed because it's confusing, but Serf probably doesn't really care about the memberlist version, and it certainly isn't the proper fix for Consul's upshift.
---

Testing back on the DO cluster, I rolled the first two nodes with the Consul fix from #1353 but not the Serf fix from hashicorp/serf#335, which would have masked the problem.
The first server started posting coordinates after seeing this update, which is the correct behavior.
---

I also tested downgrading a server:
This looks like the right behavior. The other servers started holding off on coordinate updates again:
---

Summary for the future:

```go
// Member is a single member of the Serf cluster.
type Member struct {
	Name   string
	Addr   net.IP
	Port   uint16
	Tags   map[string]string // <- Consul version in vsn, vsn_min, vsn_max
	Status MemberStatus

	// The minimum, maximum, and current values of the protocol versions
	// and delegate (Serf) protocol versions that each member can understand
	// or is speaking.
	ProtocolMin uint8 // <- memberlist version
	ProtocolMax uint8
	ProtocolCur uint8
	DelegateMin uint8 // <- Serf version
	DelegateMax uint8
	DelegateCur uint8
}
```

I was looking at
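In other words, the Consul protocol version lives in the tags, not in the `Protocol*` fields, so an upshift check has to parse the `vsn_*` tags. A hypothetical helper to make the distinction concrete (tag names come from the annotations above; this is not Consul's actual code):

```go
package main

import (
	"fmt"
	"strconv"

	"github.com/hashicorp/serf/serf"
)

// consulMaxVersion extracts the highest Consul protocol version a member
// advertises from its gossip tags, as opposed to member.ProtocolMax,
// which is the memberlist protocol version.
func consulMaxVersion(m serf.Member) (uint8, error) {
	v, err := strconv.ParseUint(m.Tags["vsn_max"], 10, 8)
	if err != nil {
		return 0, fmt.Errorf("bad vsn_max tag for %q: %v", m.Name, err)
	}
	return uint8(v), nil
}
```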
---

Leaving open since there's one more PR outstanding.
---

…nor). Reduce future confusion by introducing a minor version that is gossiped out via the `mvn` Serf tag (Minor Version Number; `vsn` is already being used to communicate the Major Version Number). Background: hashicorp/consul/issues/1346#issuecomment-151663152
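A sketch of the tag side of that change; the function and parameter names are illustrative, and only the tag keys come from the commit message above:

```go
package main

import "strconv"

// setVersionTags sketches how the protocol versions could be serialized
// into the Serf tag map that gets gossiped to other members.
func setVersionTags(tags map[string]string, major, minor, min, max int) {
	tags["vsn"] = strconv.Itoa(major)   // major protocol version
	tags["vsn_min"] = strconv.Itoa(min) // lowest major version understood
	tags["vsn_max"] = strconv.Itoa(max) // highest major version understood
	tags["mvn"] = strconv.Itoa(minor)   // minor version (the new tag)
}
```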