Problem with 0.6 automated protocol upshift #1346

Closed
slackpad opened this issue Oct 27, 2015 · 12 comments · Fixed by #1353

@slackpad (Contributor)

While upgrading the nyc demo cluster I ran into an issue where only one of the three servers was sending its coordinates. After adding some debug logging and restarting, it started working, so it's not clear what's going on. TLS is enabled, so I couldn't tell from tcpdump whether the servers weren't sending coordinates or were rejecting them for some reason.

I'm going to add some more instrumentation and see if I can reproduce this while upgrading one of the other DCs. All the servers showed up in the WAN coordinates even when the /v1/coordinate/nodes endpoint didn't show anything for a server.

Additionally, I realized that if we have a mixed configuration of servers, and/or a client that doesn't yet see all the servers, then we could end up in a situation where a client fires off a coordinate update that creates an unsupported entry in the Raft log for the other servers. I need to test this case and possibly add the ignore flag to this type of message so that we don't create problems during an upgrade.
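
A minimal sketch of the ignore-flag idea, assuming we reserve a high bit of the message type byte as a "safe to skip" marker; the constant names and values here are illustrative, not necessarily what Consul will end up using:

    // Illustrative constants; the real values live in consul/structs and may differ.
    const (
        coordinateBatchUpdateType uint8 = 6   // hypothetical coordinate entry type
        ignoreUnknownTypeFlag     uint8 = 128 // high bit: skip this entry if unrecognized
    )

    // encodeRaftEntry prefixes the payload with the flagged type byte so that
    // servers which predate coordinates can skip the entry instead of panicking.
    func encodeRaftEntry(payload []byte) []byte {
        return append([]byte{coordinateBatchUpdateType | ignoreUnknownTypeFlag}, payload...)
    }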

@slackpad (Contributor, Author)

Interesting - it was reproducible on the next set of servers. First off, there's a pretty jank version of Consul on these:

root@ams2-server-3:~# consul version
Consul v0.5.1-1-gdc6795a
Consul Protocol: 2 (Understands back to: 1)

I ran the (potential) 0.6.0rc1 build with this patch applied:

diff --git a/command/agent/agent.go b/command/agent/agent.go
index 99b4d25..0f523e1 100644
--- a/command/agent/agent.go
+++ b/command/agent/agent.go
@@ -555,6 +555,9 @@ func (a *Agent) CanServersUnderstandProtocol(version uint8) bool {
                        numServers++
                        if member.ProtocolMax >= version {
                                numWhoGrok++
+                               fmt.Printf("[DEBUG] agent: XXX %s groks\n", member.Name)
+                       } else {
+                               fmt.Printf("[DEBUG] agent: XXX %s does not grok\n", member.Name)
                        }
                }
        }
@@ -596,12 +599,15 @@ func (a *Agent) sendCoordinate() {
                min := a.config.SyncCoordinateIntervalMin
                intv := rateScaledInterval(rate, min, len(a.LANMembers()))
                intv = intv + randomStagger(intv)
+               a.logger.Printf("[DEBUG] agent: XXX sending coordinate in %9.6f seconds", intv.Seconds())

                select {
                case <-time.After(intv):
+                       a.logger.Printf("[DEBUG] agent: XXX checking for coordinate send")
                        if !a.CanServersUnderstandProtocol(3) {
                                continue
                        }
+                       a.logger.Printf("[DEBUG] agent: XXX sending coordinate")

                        var c *coordinate.Coordinate
                        var err error
diff --git a/consul/coordinate_endpoint.go b/consul/coordinate_endpoint.go
index 4f429be..3af0000 100644
--- a/consul/coordinate_endpoint.go
+++ b/consul/coordinate_endpoint.go
@@ -76,6 +76,7 @@ func (c *Coordinate) batchApplyUpdates() error {
                        break
                }

+               c.srv.logger.Printf("[DEBUG] consul.coordinate: XXX Applying coordinate for node %s", node)
                updates[i] = &structs.Coordinate{node, coord}
                i++
        }
@@ -101,6 +102,8 @@ func (c *Coordinate) Update(args *structs.CoordinateUpdateRequest, reply *struct
                return err
        }

+       c.srv.logger.Printf("[DEBUG] consul.coordinate: XXX Updating coordinate for node %s", args.Node)
+
        // Since this is a coordinate coming from some place else we harden this
        // and look for dimensionality problems proactively.
        coord, err := c.srv.serfLAN.GetCoordinate()
diff --git a/consul/state/state_store.go b/consul/state/state_store.go
index 36a61ec..16c13f3 100644
--- a/consul/state/state_store.go
+++ b/consul/state/state_store.go
@@ -2339,6 +2339,7 @@ func (s *StateStore) CoordinateBatchUpdate(idx uint64, updates structs.Coordinat
                if err != nil {
                        return fmt.Errorf("failed node lookup: %s", err)
                }
+               fmt.Printf("[DEBUG] state_store: XXX Coordinate update for %s %v\n", update.Node, node)
                if node == nil {
                        continue
                }

The first and second servers I rolled looked like this:

[DEBUG] agent: XXX ams2-server-2 does not grok
[DEBUG] agent: XXX ams2-server-3 does not grok
[DEBUG] agent: XXX ams2-server-1 groks
[DEBUG] agent: XXX ams2-server-1 groks
[DEBUG] agent: XXX ams2-server-2 groks
[DEBUG] agent: XXX ams2-server-3 does not grok

This is even though all three had been updated. The last one I updated was able to see that all three servers would grok, so it started sending coordinates to them.

Need to see why the version info is different on the first two servers. It looks like there's a min/max calculation that's not taking place. Will add more debug logging and see what happens.

@slackpad (Contributor, Author)

Regarding the second question (the ignore flag): we probably want that as a way to make a downgrade possible if something happens and people want to back out. Right now they will get a panic if they try running an older version of Consul and it tries to apply the Raft log entries for the new coordinate updates. That's a bad experience, especially given how unimportant these entries are.
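
And a sketch of the receiving side, continuing the same illustrative constants (fmt and log imports omitted): an FSM that understands the flag convention can warn and skip rather than panic, which matches the [WARN] consul.fsm lines that show up later in this thread.

    // applyLogEntry sketches how the FSM could tolerate unknown-but-ignorable
    // entries; this is simplified relative to Consul's actual FSM.
    func applyLogEntry(buf []byte) {
        msgType := buf[0]
        ignorable := msgType&ignoreUnknownTypeFlag != 0
        msgType &^= ignoreUnknownTypeFlag // strip the flag before dispatching

        switch msgType {
        case coordinateBatchUpdateType:
            // ...apply the coordinate batch update...
        default:
            if ignorable {
                log.Printf("[WARN] consul.fsm: ignoring unknown message type (%d), upgrade to newer version", msgType)
                return
            }
            panic(fmt.Errorf("unrecognized raft log entry type: %d", msgType))
        }
    }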

@slackpad (Contributor, Author)

Added more debugging output:

diff --git a/command/agent/agent.go b/command/agent/agent.go
index 99b4d25..f7acffe 100644
--- a/command/agent/agent.go
+++ b/command/agent/agent.go
@@ -553,8 +553,12 @@ func (a *Agent) CanServersUnderstandProtocol(version uint8) bool {
        for _, member := range members {
                if member.Tags["role"] == "consul" {
                        numServers++
+                       fmt.Printf("[DEBUG] agent: XXX %v\n", member)
                        if member.ProtocolMax >= version {
                                numWhoGrok++
+                               fmt.Printf("[DEBUG] agent: XXX %s groks\n", member.Name)
+                       } else {
+                               fmt.Printf("[DEBUG] agent: XXX %s does not grok\n", member.Name)
                        }
                }
        }

When rolling two of the nodes, now I get:

(on sfo1-server-1)

[DEBUG] agent: XXX {sfo1-server-2-new 104.236.149.175 8301 map[expect:3 role:consul dc:sfo1 vsn:2 vsn_min:1 vsn_max:2 build:0.5.1:dc6795a5 port:8300] alive 1 2 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-2-new does not grok
[DEBUG] agent: XXX {sfo1-server-1 104.236.129.97 8301 map[expect:3 role:consul dc:sfo1 vsn:2 vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300] alive 1 3 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-1 groks
[DEBUG] agent: XXX {sfo1-server-3 104.131.157.52 8301 map[vsn:2 vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300 expect:3 role:consul dc:sfo1] alive 1 2 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-3 does not grok

(on sfo1-server-3)

[DEBUG] agent: XXX {sfo1-server-2-new 104.236.149.175 8301 map[dc:sfo1 vsn:2 vsn_min:1 vsn_max:2 build:0.5.1:dc6795a5 port:8300 expect:3 role:consul] alive 1 2 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-2-new does not grok
[DEBUG] agent: XXX {sfo1-server-3 104.131.157.52 8301 map[vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300 expect:3 role:consul dc:sfo1 vsn:2] alive 1 3 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-3 groks
[DEBUG] agent: XXX {sfo1-server-1 104.236.129.97 8301 map[vsn:2 vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300 expect:3 role:consul dc:sfo1] alive 1 3 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-1 groks

Compare these lines:

[DEBUG] agent: XXX {sfo1-server-3 104.131.157.52 8301 map[vsn:2 vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300 expect:3 role:consul dc:sfo1] alive 1 2 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-3 does not grok

[DEBUG] agent: XXX {sfo1-server-3 104.131.157.52 8301 map[vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300 expect:3 role:consul dc:sfo1 vsn:2] alive 1 3 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-3 groks

What's interesting is that sfo1-server-1 sees sfo1-server-3's versions as 1 2 2 2 4 4, while sfo1-server-3 sees its own versions as 1 3 2 2 4 4. This explains why it does not grok. Need to see how this gets set and how it could be different.

@slackpad (Contributor, Author)

Need to prove it but I'm thinking it might be related to these lines in memberlist:

https://github.com/hashicorp/memberlist/blob/master/state.go#L707-L711

I did a shutdown without a leave (that's how this consul server was set up; I did a kill <pid> of the consul process to get the watcher to restart it), so I'm thinking that memberlist might have skipped updating this node's version as it came back up and announced itself (thus the fishy previous value of 1 2 2 2 4 4 for all the versions). All my previous testing of the upshift used shutdown with leave.

@slackpad (Contributor, Author)

Yep, doing a consul leave and then letting things re-join fixed it.

@slackpad (Contributor, Author)

The process watcher restarted it pretty quickly. Here's how things looked from sfo1-server-1's perspective as sfo1-server-3 was rolled to the new version:

    2015/10/26 23:28:22 [INFO] serf: EventMemberUpdate: sfo1-server-3
    2015/10/26 23:28:22 [INFO] serf: EventMemberUpdate: sfo1-server-3.sfo1
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:24 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:24 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:24 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:24 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:26 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1

@slackpad (Contributor, Author)

The `2015/10/26 23:28:22 [INFO] serf: EventMemberUpdate: sfo1-server-3` log line explains it.

It didn't see an alive -> dead -> alive transition, just an update to the metadata. Check out this code from memberlist:

https://github.com/hashicorp/memberlist/blob/master/state.go#L794-L797

    // Notify the delegate of any relevant updates
    if m.config.Events != nil {
        if oldState == stateDead {
            // if Dead -> Alive, notify of join
            m.config.Events.NotifyJoin(&state.Node)

        } else if !bytes.Equal(oldMeta, state.Meta) {
            // if Meta changed, trigger an update notification
            m.config.Events.NotifyUpdate(&state.Node)
        }
    }

So clearly the metadata changed because of the version change (the address and port information was the same), but the handler for this doesn't capture the new version information:

https://github.com/hashicorp/serf/blob/master/serf/serf.go#L975-L978

    // Update the member attributes
    member.Addr = net.IP(n.Addr)
    member.Port = n.Port
    member.Tags = s.decodeTags(n.Meta)

This doesn't properly update the member's version information. If we set that here, things should work ok.
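
Presumably the fix is to copy the version fields over along with the other attributes, along these lines (n is the *memberlist.Node from the update; the actual change became hashicorp/serf#335 and may differ in detail):

    // Sketch: also refresh the protocol/delegate versions on update. Field
    // names follow memberlist's Node and Serf's Member.
    member.Addr = net.IP(n.Addr)
    member.Port = n.Port
    member.Tags = s.decodeTags(n.Meta)
    member.ProtocolMin = n.PMin
    member.ProtocolMax = n.PMax
    member.ProtocolCur = n.PCur
    member.DelegateMin = n.DMin
    member.DelegateMax = n.DMax
    member.DelegateCur = n.DCur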

@slackpad (Contributor, Author)

Ugh - I thoroughly confused myself here:

  1. The array I was printing holds the memberlist versions and the Serf versions; it has nothing to do with the Consul versions. I thought the versioning was delegated all the way up, but memberlist only supports one level, so Serf uses the delegate versioning and Consul puts all of its versioning into the vsn_ tags.
  2. This means I was keying off the wrong thing in the upshift! Need to fix that to look at vsn_max (see the sketch below), not the memberlist protocol version, which also happens to be bumping up to 3, which is why things seemed to be working ok.

The thing I found above should still be fixed because it's confusing, but Serf probably doesn't really care about the memberlist version, and it certainly isn't the proper fix for Consul's upshift.
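
Concretely, the upshift check should key off the Consul protocol range gossiped in the vsn_min/vsn_max tags rather than the memberlist ProtocolMax field. A sketch of the corrected logic, assuming strconv and github.com/hashicorp/serf/serf are imported, and simplified relative to whatever #1353 actually does:

    // canServersUnderstandProtocol (sketch): checks the Consul protocol range
    // gossiped in the vsn_min/vsn_max Serf tags instead of the memberlist
    // ProtocolMax field.
    func canServersUnderstandProtocol(members []serf.Member, version uint8) (bool, error) {
        numServers, numWhoGrok := 0, 0
        for _, member := range members {
            if member.Tags["role"] != "consul" {
                continue
            }
            numServers++

            vsnMin, err := strconv.Atoi(member.Tags["vsn_min"])
            if err != nil {
                return false, err
            }
            vsnMax, err := strconv.Atoi(member.Tags["vsn_max"])
            if err != nil {
                return false, err
            }
            if vsnMin <= int(version) && int(version) <= vsnMax {
                numWhoGrok++
            }
        }
        return numServers > 0 && numWhoGrok == numServers, nil
    }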

@slackpad (Contributor, Author)

Testing back on the DO cluster, I rolled the first two nodes with the Consul fix from #1353 but not the Serf fix from hashicorp/serf#335, which would have masked the problem.

2015/10/27 18:00:28 [INFO] serf: EventMemberUpdate: sfo1-server-2-new

The first server started posting coordinates after seeing this update, which is the correct behavior.

@slackpad (Contributor, Author)

I also tested downgrading a server:

    2015/10/27 18:03:59 [WARN] consul.fsm: ignoring unknown message type (6), upgrade to newer version
    2015/10/27 18:03:59 [WARN] consul.fsm: ignoring unknown message type (6), upgrade to newer version
    2015/10/27 18:03:59 [WARN] consul.fsm: ignoring unknown message type (6), upgrade to newer version
    2015/10/27 18:03:59 [WARN] consul.fsm: ignoring unknown message type (6), upgrade to newer version

This looks like the right behavior. The other servers started holding off on coordinate updates again:

    2015/10/27 18:05:59 [DEBUG] agent: skipping coordinate updates until servers are upgraded

@slackpad (Contributor, Author)

Summary for the future:

// Member is a single member of the Serf cluster.
type Member struct {
        Name   string
        Addr   net.IP
        Port   uint16
        Tags   map[string]string  // <- Consul version in vsn, vsn_min, vsn_max
        Status MemberStatus

        // The minimum, maximum, and current values of the protocol versions
        // and delegate (Serf) protocol versions that each member can understand
        // or is speaking.
        ProtocolMin uint8 // <- memberlist version
        ProtocolMax uint8
        ProtocolCur uint8
        DelegateMin uint8 // <- Serf version
        DelegateMax uint8
        DelegateCur uint8
}

I was looking at ProtocolMax, which was totally wrong, but I was looking for a 3 and we changed memberlist to 3 for the TCP fallback pings, so… it worked. I thought that Serf would bubble up the “delegated” version for Consul to the top-level one for some reason. There’s only one level of chaining: memberlist helps you support versioning for your delegates and carries the Serf version around for you, but Serf doesn’t have the same thing, so Consul stuffs this info into tags.
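
For completeness, the advertising side looks roughly like this (a sketch of where the vsn tags come from, assuming fmt and the serf package are imported; the exact wiring in Consul may differ):

    // serfConfig (sketch): a Consul server publishes its Consul protocol range
    // as Serf tags, since there is no third level of protocol version chaining.
    func serfConfig(protocolMin, protocolCur, protocolMax uint8) *serf.Config {
        conf := serf.DefaultConfig()
        conf.Tags = map[string]string{
            "role":    "consul",
            "vsn":     fmt.Sprintf("%d", protocolCur), // current Consul protocol
            "vsn_min": fmt.Sprintf("%d", protocolMin), // oldest understood
            "vsn_max": fmt.Sprintf("%d", protocolMax), // newest understood
        }
        return conf
    }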

@slackpad (Contributor, Author)

Leaving open since there's one more PR outstanding.

sean- added a commit to hashicorp/nomad that referenced this issue May 28, 2016
…nor).

Reduce future confusion by introducing a minor version that is gossiped out
via the `mvn` Serf tag (Minor Version Number; `vsn` is already being used to
communicate `Major Version Number`).

Background: hashicorp/consul/issues/1346#issuecomment-151663152
sean- added a commit to hashicorp/nomad that referenced this issue Jun 1, 2016
sean- added a commit to hashicorp/nomad that referenced this issue Jun 10, 2016