Problem with 0.6 automated protocol upshift #1346

Closed
slackpad opened this issue Oct 27, 2015 · 12 comments · Fixed by #1353

@slackpad (Contributor)

While upgrading the nyc demo cluster I ran into an issue where only one of the three servers was sending its coordinates. After adding some debug logging and restarting, it started working, so it's not clear what's going on. TLS is enabled, so I couldn't tell from tcpdump whether the servers weren't sending coordinates or were rejecting them for some reason.

I'm going to add some more instrumentation and see if I can reproduce this while upgrading one of the other DCs. All the servers showed up in the WAN coordinates even when the /v1/coordinate/nodes endpoint didn't show anything for a server.

Additionally, I realized that if we have a mixed configuration of servers, and/or a client that doesn't yet see all the servers, then we could end up in a situation where a client fires off a coordinate update that creates an unsupported entry in the Raft log for the other servers. I need to test this case and possibly add the ignore flag to this type of message so that we don't create problems during an upgrade.
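
A minimal sketch of the ignore-flag idea, assuming we reserve a high bit of the message type byte as a "safe to skip" marker; the constant names and values here are illustrative, not necessarily what Consul will end up using:

    // Illustrative constants; the real values live in consul/structs and may differ.
    const (
        coordinateBatchUpdateType uint8 = 6   // hypothetical coordinate entry type
        ignoreUnknownTypeFlag     uint8 = 128 // high bit: skip this entry if unrecognized
    )

    // encodeRaftEntry prefixes the payload with the flagged type byte so that
    // servers which predate coordinates can skip the entry instead of panicking.
    func encodeRaftEntry(payload []byte) []byte {
        return append([]byte{coordinateBatchUpdateType | ignoreUnknownTypeFlag}, payload...)
    }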

@slackpad (Contributor, Author)

Interesting - it was reproducible on the next set of servers. First off, there's a pretty jank version of Consul on these:

root@ams2-server-3:~# consul version
Consul v0.5.1-1-gdc6795a
Consul Protocol: 2 (Understands back to: 1)

I ran the (potential) 0.6.0rc1 build with this patch applied:

diff --git a/command/agent/agent.go b/command/agent/agent.go
index 99b4d25..0f523e1 100644
--- a/command/agent/agent.go
+++ b/command/agent/agent.go
@@ -555,6 +555,9 @@ func (a *Agent) CanServersUnderstandProtocol(version uint8) bool {
                        numServers++
                        if member.ProtocolMax >= version {
                                numWhoGrok++
+                               fmt.Printf("[DEBUG] agent: XXX %s groks\n", member.Name)
+                       } else {
+                               fmt.Printf("[DEBUG] agent: XXX %s does not grok\n", member.Name)
                        }
                }
        }
@@ -596,12 +599,15 @@ func (a *Agent) sendCoordinate() {
                min := a.config.SyncCoordinateIntervalMin
                intv := rateScaledInterval(rate, min, len(a.LANMembers()))
                intv = intv + randomStagger(intv)
+               a.logger.Printf("[DEBUG] agent: XXX sending coordinate in %9.6f seconds", intv.Seconds())

                select {
                case <-time.After(intv):
+                       a.logger.Printf("[DEBUG] agent: XXX checking for coordinate send")
                        if !a.CanServersUnderstandProtocol(3) {
                                continue
                        }
+                       a.logger.Printf("[DEBUG] agent: XXX sending coordinate")

                        var c *coordinate.Coordinate
                        var err error
diff --git a/consul/coordinate_endpoint.go b/consul/coordinate_endpoint.go
index 4f429be..3af0000 100644
--- a/consul/coordinate_endpoint.go
+++ b/consul/coordinate_endpoint.go
@@ -76,6 +76,7 @@ func (c *Coordinate) batchApplyUpdates() error {
                        break
                }

+               c.srv.logger.Printf("[DEBUG] consul.coordinate: XXX Applying coordinate for node %s", node)
                updates[i] = &structs.Coordinate{node, coord}
                i++
        }
@@ -101,6 +102,8 @@ func (c *Coordinate) Update(args *structs.CoordinateUpdateRequest, reply *struct
                return err
        }

+       c.srv.logger.Printf("[DEBUG] consul.coordinate: XXX Updating coordinate for node %s", args.Node)
+
        // Since this is a coordinate coming from some place else we harden this
        // and look for dimensionality problems proactively.
        coord, err := c.srv.serfLAN.GetCoordinate()
diff --git a/consul/state/state_store.go b/consul/state/state_store.go
index 36a61ec..16c13f3 100644
--- a/consul/state/state_store.go
+++ b/consul/state/state_store.go
@@ -2339,6 +2339,7 @@ func (s *StateStore) CoordinateBatchUpdate(idx uint64, updates structs.Coordinat
                if err != nil {
                        return fmt.Errorf("failed node lookup: %s", err)
                }
+               fmt.Printf("[DEBUG] state_store: XXX Coordinate update for %s %v\n", update.Node, node)
                if node == nil {
                        continue
                }

The first and second servers I rolled looked like this:

[DEBUG] agent: XXX ams2-server-2 does not grok
[DEBUG] agent: XXX ams2-server-3 does not grok
[DEBUG] agent: XXX ams2-server-1 groks
[DEBUG] agent: XXX ams2-server-1 groks
[DEBUG] agent: XXX ams2-server-2 groks
[DEBUG] agent: XXX ams2-server-3 does not grok

This is even though all three had been updated. The last one I updated was able to see that all three servers would grok, so it started sending coordinates to them.

Need to see why the version info is different on the first two servers. It looks like there's a min/max calculation that's not taking place. Will add more debug logging and see what happens.

@slackpad (Contributor, Author)

Regarding the second question (the ignore flag): we probably want that as a way to make a downgrade possible if something happens and people want to back out. Right now they will get a panic if they try running an older version of Consul and it tries to apply the Raft log entries for the new coordinate updates. That's a bad experience, especially given how unimportant these entries are.
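
And a sketch of the receiving side, continuing the same illustrative constants (fmt and log imports omitted): an FSM that understands the flag convention can warn and skip rather than panic, which matches the [WARN] consul.fsm lines that show up later in this thread.

    // applyLogEntry sketches how the FSM could tolerate unknown-but-ignorable
    // entries; this is simplified relative to Consul's actual FSM.
    func applyLogEntry(buf []byte) {
        msgType := buf[0]
        ignorable := msgType&ignoreUnknownTypeFlag != 0
        msgType &^= ignoreUnknownTypeFlag // strip the flag before dispatching

        switch msgType {
        case coordinateBatchUpdateType:
            // ...apply the coordinate batch update...
        default:
            if ignorable {
                log.Printf("[WARN] consul.fsm: ignoring unknown message type (%d), upgrade to newer version", msgType)
                return
            }
            panic(fmt.Errorf("unrecognized raft log entry type: %d", msgType))
        }
    }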

@slackpad (Contributor, Author)

Added more debugging output:

diff --git a/command/agent/agent.go b/command/agent/agent.go
index 99b4d25..f7acffe 100644
--- a/command/agent/agent.go
+++ b/command/agent/agent.go
@@ -553,8 +553,12 @@ func (a *Agent) CanServersUnderstandProtocol(version uint8) bool {
        for _, member := range members {
                if member.Tags["role"] == "consul" {
                        numServers++
+                       fmt.Printf("[DEBUG] agent: XXX %v\n", member)
                        if member.ProtocolMax >= version {
                                numWhoGrok++
+                               fmt.Printf("[DEBUG] agent: XXX %s groks\n", member.Name)
+                       } else {
+                               fmt.Printf("[DEBUG] agent: XXX %s does not grok\n", member.Name)
                        }
                }
        }

When rolling two of the nodes, now I get:

(on sfo1-server-1)

[DEBUG] agent: XXX {sfo1-server-2-new 104.236.149.175 8301 map[expect:3 role:consul dc:sfo1 vsn:2 vsn_min:1 vsn_max:2 build:0.5.1:dc6795a5 port:8300] alive 1 2 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-2-new does not grok
[DEBUG] agent: XXX {sfo1-server-1 104.236.129.97 8301 map[expect:3 role:consul dc:sfo1 vsn:2 vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300] alive 1 3 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-1 groks
[DEBUG] agent: XXX {sfo1-server-3 104.131.157.52 8301 map[vsn:2 vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300 expect:3 role:consul dc:sfo1] alive 1 2 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-3 does not grok

(on sfo1-server-3)

[DEBUG] agent: XXX {sfo1-server-2-new 104.236.149.175 8301 map[dc:sfo1 vsn:2 vsn_min:1 vsn_max:2 build:0.5.1:dc6795a5 port:8300 expect:3 role:consul] alive 1 2 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-2-new does not grok
[DEBUG] agent: XXX {sfo1-server-3 104.131.157.52 8301 map[vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300 expect:3 role:consul dc:sfo1 vsn:2] alive 1 3 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-3 groks
[DEBUG] agent: XXX {sfo1-server-1 104.236.129.97 8301 map[vsn:2 vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300 expect:3 role:consul dc:sfo1] alive 1 3 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-1 groks

Compare these lines:

[DEBUG] agent: XXX {sfo1-server-3 104.131.157.52 8301 map[vsn:2 vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300 expect:3 role:consul dc:sfo1] alive 1 2 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-3 does not grok

[DEBUG] agent: XXX {sfo1-server-3 104.131.157.52 8301 map[vsn_min:1 vsn_max:3 build:0.6.0rc1:c29507f4 port:8300 expect:3 role:consul dc:sfo1 vsn:2] alive 1 3 2 2 4 4}
[DEBUG] agent: XXX sfo1-server-3 groks

What's interesting is that sfo1-server-1 sees sfo1-server-3's versions as 1 2 2 2 4 4, while sfo1-server-3 sees its own versions as 1 3 2 2 4 4. This explains why it does not grok. Need to see how this gets set and how it could be different.

@slackpad (Contributor, Author)

Need to prove it but I'm thinking it might be related to these lines in memberlist:

https://github.com/hashicorp/memberlist/blob/master/state.go#L707-L711

I did a shutdown without a leave (that's how this consul server was set up; I did a kill <pid> of the consul process to get the watcher to restart it), so I'm thinking that memberlist might have skipped updating this node's version as it came back up and announced itself (thus the fishy previous value of 1 2 2 2 4 4 for all the versions). All my previous testing of the upshift used shutdown with leave.

@slackpad (Contributor, Author)

Yep, doing a consul leave and then letting things re-join fixed it.

@slackpad (Contributor, Author)

The process watcher restarted it pretty quickly. Here's how things looked from sfo1-server-1's perspective as sfo1-server-3 was rolled to the new version:

    2015/10/26 23:28:22 [INFO] serf: EventMemberUpdate: sfo1-server-3
    2015/10/26 23:28:22 [INFO] serf: EventMemberUpdate: sfo1-server-3.sfo1
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:23 [DEBUG] serf: messageJoinType: sfo1-server-3
    2015/10/26 23:28:24 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:24 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:24 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:24 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:25 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1
    2015/10/26 23:28:26 [DEBUG] serf: messageJoinType: sfo1-server-3.sfo1

@slackpad (Contributor, Author)

The `2015/10/26 23:28:22 [INFO] serf: EventMemberUpdate: sfo1-server-3` log line explains it.

It didn't see an alive -> dead -> alive transition, just an update to the metadata. Check out this code from memberlist:

https://github.com/hashicorp/memberlist/blob/master/state.go#L794-L797

    // Notify the delegate of any relevant updates
    if m.config.Events != nil {
        if oldState == stateDead {
            // if Dead -> Alive, notify of join
            m.config.Events.NotifyJoin(&state.Node)

        } else if !bytes.Equal(oldMeta, state.Meta) {
            // if Meta changed, trigger an update notification
            m.config.Events.NotifyUpdate(&state.Node)
        }
    }

So clearly the metadata changed because of the version change (the address and port information was the same), but the handler for this doesn't capture the new version information:

https://github.com/hashicorp/serf/blob/master/serf/serf.go#L975-L978

    // Update the member attributes
    member.Addr = net.IP(n.Addr)
    member.Port = n.Port
    member.Tags = s.decodeTags(n.Meta)

This doesn't properly update the member's version information. If we set that here, things should work ok.
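
Presumably the fix is to copy the version fields over along with the other attributes, along these lines (n is the *memberlist.Node from the update; the actual change became hashicorp/serf#335 and may differ in detail):

    // Sketch: also refresh the protocol/delegate versions on update. Field
    // names follow memberlist's Node and Serf's Member.
    member.Addr = net.IP(n.Addr)
    member.Port = n.Port
    member.Tags = s.decodeTags(n.Meta)
    member.ProtocolMin = n.PMin
    member.ProtocolMax = n.PMax
    member.ProtocolCur = n.PCur
    member.DelegateMin = n.DMin
    member.DelegateMax = n.DMax
    member.DelegateCur = n.DCur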

@slackpad (Contributor, Author)

Ugh - I thoroughly confused myself here:

  1. The array I was printing holds the memberlist versions and the Serf versions; it has nothing to do with the Consul versions. I thought the versioning was delegated all the way up, but memberlist only supports one level, so Serf uses the delegate versioning and Consul puts all of its versioning into the vsn_ tags.
  2. This means I was keying off the wrong thing in the upshift! Need to fix that to look at vsn_max (see the sketch below), not the memberlist protocol version, which also happens to be bumping up to 3, which is why things seemed to be working ok.

The thing I found above should still be fixed because it's confusing, but Serf probably doesn't really care about the memberlist version, and it certainly isn't the proper fix for Consul's upshift.
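
Concretely, the upshift check should key off the Consul protocol range gossiped in the vsn_min/vsn_max tags rather than the memberlist ProtocolMax field. A sketch of the corrected logic, assuming strconv and github.com/hashicorp/serf/serf are imported, and simplified relative to whatever #1353 actually does:

    // canServersUnderstandProtocol (sketch): checks the Consul protocol range
    // gossiped in the vsn_min/vsn_max Serf tags instead of the memberlist
    // ProtocolMax field.
    func canServersUnderstandProtocol(members []serf.Member, version uint8) (bool, error) {
        numServers, numWhoGrok := 0, 0
        for _, member := range members {
            if member.Tags["role"] != "consul" {
                continue
            }
            numServers++

            vsnMin, err := strconv.Atoi(member.Tags["vsn_min"])
            if err != nil {
                return false, err
            }
            vsnMax, err := strconv.Atoi(member.Tags["vsn_max"])
            if err != nil {
                return false, err
            }
            if vsnMin <= int(version) && int(version) <= vsnMax {
                numWhoGrok++
            }
        }
        return numServers > 0 && numWhoGrok == numServers, nil
    }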

@slackpad (Contributor, Author)

Testing back on the DO cluster, I rolled the first two nodes with the Consul fix from #1353 but not the Serf fix from hashicorp/serf#335, which would have masked the problem.

2015/10/27 18:00:28 [INFO] serf: EventMemberUpdate: sfo1-server-2-new

The first server started posting coordinates after seeing this update, which is the correct behavior.

@slackpad (Contributor, Author)

I also tested downgrading a server:

    2015/10/27 18:03:59 [WARN] consul.fsm: ignoring unknown message type (6), upgrade to newer version
    2015/10/27 18:03:59 [WARN] consul.fsm: ignoring unknown message type (6), upgrade to newer version
    2015/10/27 18:03:59 [WARN] consul.fsm: ignoring unknown message type (6), upgrade to newer version
    2015/10/27 18:03:59 [WARN] consul.fsm: ignoring unknown message type (6), upgrade to newer version

This looks like the right behavior. The other servers started holding off on coordinate updates again:

    2015/10/27 18:05:59 [DEBUG] agent: skipping coordinate updates until servers are upgraded

@slackpad (Contributor, Author)

Summary for the future:

// Member is a single member of the Serf cluster.
type Member struct {
        Name   string
        Addr   net.IP
        Port   uint16
        Tags   map[string]string  // <- Consul version in vsn, vsn_min, vsn_max
        Status MemberStatus

        // The minimum, maximum, and current values of the protocol versions
        // and delegate (Serf) protocol versions that each member can understand
        // or is speaking.
        ProtocolMin uint8 // <- memberlist version
        ProtocolMax uint8
        ProtocolCur uint8
        DelegateMin uint8 // <- Serf version
        DelegateMax uint8
        DelegateCur uint8
}

I was looking at ProtocolMax, which was totally wrong, but I was looking for a 3 and we changed memberlist to 3 for the TCP fallback pings, so… it worked. I thought that Serf would bubble up the “delegated” version for Consul to the top-level one for some reason. There’s only one level of chaining: memberlist helps you support versioning for your delegates and carries the Serf version around for you, but Serf doesn’t have the same thing, so Consul stuffs this info into tags.
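
For completeness, the advertising side looks roughly like this (a sketch of where the vsn tags come from, assuming fmt and the serf package are imported; the exact wiring in Consul may differ):

    // serfConfig (sketch): a Consul server publishes its Consul protocol range
    // as Serf tags, since there is no third level of protocol version chaining.
    func serfConfig(protocolMin, protocolCur, protocolMax uint8) *serf.Config {
        conf := serf.DefaultConfig()
        conf.Tags = map[string]string{
            "role":    "consul",
            "vsn":     fmt.Sprintf("%d", protocolCur), // current Consul protocol
            "vsn_min": fmt.Sprintf("%d", protocolMin), // oldest understood
            "vsn_max": fmt.Sprintf("%d", protocolMax), // newest understood
        }
        return conf
    }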

@slackpad (Contributor, Author)

Leaving open since there's one more PR outstanding.

sean- added a commit to hashicorp/nomad that referenced this issue May 28, 2016
…nor).

Reduce future confusion by introducing a minor version that is gossiped out
via the `mvn` Serf tag (Minor Version Number; `vsn` is already being used to
communicate `Major Version Number`).

Background: hashicorp/consul/issues/1346#issuecomment-151663152
sean- added a commit to hashicorp/nomad that referenced this issue Jun 1, 2016
sean- added a commit to hashicorp/nomad that referenced this issue Jun 10, 2016