This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

fleetd: detect the existing machine ID #1561

Merged
4 commits merged into coreos:master from the dongsu/fleetd-detect-machine-id branch on Apr 21, 2016

Conversation

@dongsupark (Contributor) commented Apr 19, 2016

Detect the existing machine ID when fleetd starts up, to avoid confusing errors when creating multiple machines in a cluster. In addition to the fleetd fix, this PR also contains a functional test to cover that case.

The functional test was suggested by @antrik.
/cc @wuqixuan
Based on #1288
Fixes: #615
See also #1241

// If the two machine IDs differ from each other,
// set m1's ID to the same one as m0's, to intentionally
// trigger the error case of a duplicated machine ID.
if strings.Compare(m0_machine_id, m1_machine_id) != 0 {
Contributor

Just set the ID of m1 unconditionally -- no need to get and compare it first...
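A rough sketch of what the unconditional variant could look like, reusing cluster.MemberCommand and machineIdFile from the snippets in this PR; this is an illustration, not the code that was actually merged:

// Read m0's machine ID, then copy it onto m1 with no prior comparison,
// so the duplicate-ID error path is always exercised.
stdout, err := cluster.MemberCommand(m0, "cat", machineIdFile)
if err != nil {
	t.Fatalf("m0: failed to read machine ID: %v", err)
}
m0_machine_id := strings.TrimSpace(stdout)

if _, err := cluster.MemberCommand(m1, "echo", m0_machine_id, "|", "sudo", "tee", machineIdFile); err != nil {
	t.Fatalf("m1: failed to overwrite machine ID: %v", err)
}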

@tixxdz added this to the v0.13.0 milestone on Apr 20, 2016
stdout, _ = cluster.MemberCommand(m1, "systemctl", "show", "--property=Result", "fleet")
if strings.TrimSpace(stdout) != "Result=success" {
t.Fatalf("Result for fleet unit not reported as success: %s", stdout)
}
Contributor

So fleet should actually be running, but failing to perform any operations? I guess that makes sense in a way (the conflicting machine might go away later) -- but it's a bit surprising at first. Maybe add a comment explaining this?

(I wonder, is this behaviour actually mentioned in the fleetd documentation?...)

Contributor

@antrik: yes, as discussed I'm surprised too... could you please check the documentation and open a PR for that one?

@antrik (Contributor) commented Apr 20, 2016

@dongsupark "suggested by antrik" is a bit confusing: I only suggested how to perform the functional test -- not the original fix :-)

} else {
t.Fatalf("Error is expected, but got success.\nstderr: %s", stderr)
}
}
Contributor

Looks promising! So on top of all the comments, it would be nice to add the following after this:

stdout, err = cluster.MemberCommand(m0, "echo", new_uniq_machine_id, "|", "sudo", "tee", machineIdFile)
on machine 0, then restart fleet on machine 0 so that we break the loop here (https://github.com/coreos/fleet/pull/1561/files#diff-91bbeda7eb98a7adc57b9e47e2cf5c2bR179) on machine 1 and fleet continues to work properly -- after all your previous tests, of course.

Now if restarting fleet on m0 doesn't work, then just kill machine 0, let machine 1 stay up to do its stuff and register itself cleanly, and follow up with a final stdout, stderr, err := cluster.Fleetctl(m1, "list-machines", "--no-legend")?

Thank you!
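For illustration, a rough sketch of those suggested follow-up steps, assuming the helpers already used in this test (cluster.MemberCommand, cluster.Fleetctl) and a fresh ID in new_uniq_machine_id; the names and exact checks are placeholders:

// Give m0 a fresh, unique machine ID and restart fleet there, so m1 can
// break out of its retry loop and register itself cleanly.
if _, err := cluster.MemberCommand(m0, "echo", new_uniq_machine_id, "|", "sudo", "tee", machineIdFile); err != nil {
	t.Fatalf("m0: failed to replace machine ID: %v", err)
}
if _, err := cluster.MemberCommand(m0, "sudo", "systemctl", "restart", "fleet"); err != nil {
	t.Fatalf("m0: failed to restart fleet: %v", err)
}

// Afterwards both machines should show up with distinct IDs.
stdout, stderr, err := cluster.Fleetctl(m1, "list-machines", "--no-legend")
if err != nil {
	t.Fatalf("m1: list-machines failed: %v\nstderr: %s", err, stderr)
}
if len(strings.Split(strings.TrimSpace(stdout), "\n")) != 2 {
	t.Fatalf("expected 2 machines, got:\n%s", stdout)
}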

@dongsupark force-pushed the dongsu/fleetd-detect-machine-id branch from 851244a to 0661d34 on April 20, 2016 14:58
@dongsupark (Contributor, Author)
Updated:

  • In TestDetectMachineId, add more test cases where m0's ID becomes different from m1's. In that case, m0 and m1 are expected to keep working gracefully.
  • Export NewMachineID() from the Cluster interface so it can be used in functional tests.
  • Split the "restart fleet + systemctl show" calls into a separate function, restartFleetService() (a rough sketch of such a helper follows below).
  • Remove an unnecessary comparison between m0 and m1 before setting the same machine ID.
  • When intentionally triggering an error, handle the error from "fleetctl list-machines" more precisely.
  • Fixed comments.

Thanks!
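For reference, a hypothetical sketch of what such a restartFleetService() helper might look like, built from the restart command and the systemctl check already shown above, and assuming the Cluster and Member types from the functional test platform package; the actual helper in this PR may be shaped differently:

// Hypothetical helper: restart the fleet unit on a member and verify that
// systemd reports the unit result as success.
func restartFleetService(cluster platform.Cluster, m platform.Member) error {
	if _, err := cluster.MemberCommand(m, "sudo", "systemctl", "restart", "fleet"); err != nil {
		return err
	}
	stdout, err := cluster.MemberCommand(m, "systemctl", "show", "--property=Result", "fleet")
	if err != nil {
		return err
	}
	if strings.TrimSpace(stdout) != "Result=success" {
		return fmt.Errorf("fleet unit result not reported as success: %s", stdout)
	}
	return nil
}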

// This is an expected error. PASS.
} else {
t.Fatalf("m1: Failed to get list of machines. err: %v\nstderr: %s", err, stderr)
}
Contributor

Empty then blocks are usually considered bad style... Just reverse the condition :-)

(Keep the comment though to make things clear.)

Alternatively, you could merge the inner and outer if conditions into an if / else if construct... Not sure which approach is more readable in this case.
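To make the two options concrete (errIsExpectedDuplicate stands in for whatever check the test actually performs on err and stderr; this is only a sketch, not the merged code):

// Option 1: reverse the condition so there is no empty then-block.
if !errIsExpectedDuplicate {
	t.Fatalf("m1: Failed to get list of machines. err: %v\nstderr: %s", err, stderr)
}
// Otherwise this is the expected error. PASS.

// Option 2: merge the inner and outer conditions into an if / else-if chain.
if err == nil {
	t.Fatalf("Error is expected, but got success.\nstderr: %s", stderr)
} else if !errIsExpectedDuplicate {
	t.Fatalf("m1: Failed to get list of machines. err: %v\nstderr: %s", err, stderr)
}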

@dongsupark (Contributor, Author)
It looks like this PR somehow causes TestSingleNodeConnectivityLoss to fail. Of course it's hard to say why an unrelated test fails after fixing a bug w.r.t. the machine ID.
I suspect a missing commit, which will be implemented in PR #1563, has something to do with the error. Will investigate.


// Trigger another test case where m0's ID becomes different from m1's.
// Then m0 and m1 are expected to work properly with distinct machine IDs,
// after fleet.service has been restarted on both m0 and m1.
Contributor

Why do you want to restart on both? AIUI, changing the ID of m0 and restarting this one should be sufficient to resolve the conflict, so m1 should automatically become functional as well?...

@antrik (Contributor) commented Apr 20, 2016

@dongsupark is the failure reproducible or intermittent?

@dongsupark force-pushed the dongsu/fleetd-detect-machine-id branch from 0661d34 to adec6fd on April 21, 2016 07:58
@dongsupark (Contributor, Author) commented Apr 21, 2016

is the failure reproducible or intermittent?

It's almost always reproducible; a persistent regression.
This regression has nothing to do with the functional test TestDetectMachineId. The culprit is only the fleetd fix.

@dongsupark (Contributor, Author)
I think what's happening is this:

  • TestSingleNodeConnectivityLoss blocks connections to etcd for a while and then restores them, to verify that fleetd gets reattached to etcd. This mechanism is based on the fleetd monitor, which invokes Server.run() in the restart context immediately after everything is ready. Server.run() then first of all enters a heartbeat loop.
  • Before this PR, in the restart context, the heartbeat loop only called Heart.Beat(). That succeeded quickly, as Beat() already handled both cases, even when the machine was already registered.
  • With the fleetd fix in this PR, however, the heartbeat loop calls Heart.Register(), which always returns an error like "key already exists" because the machine was already registered, so the loop keeps retrying the registration. That's normally harmless, as the monitor is supposed to kick it out of the loop once the timeout expires. Unfortunately, the functional test cannot wait that long for it to recover.

Solutions could be:

  1. First try to Register() the machine, and Beat() starting from the second attempt. This is done by "[WIP] fleetd: register heartbeat only for the first attempt" #1563. But now I think it's sub-optimal, as it fails on the first attempt anyway.
  2. Distinguish the restart context from the normal start context. Then call Register() only in the start context, and Beat() in the restart context.

I'll update this PR according to solution 2.
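A hypothetical sketch of solution 2 (the Heart method signatures and the isRestart flag are assumptions for illustration; the actual fleetd change may be structured differently):

// On a clean start, Register() the machine so a duplicate machine ID is
// detected as an error; on a restart, just Beat() to refresh the existing
// registration instead of failing with "key already exists".
func beatOrRegister(hrt heart.Heart, ttl time.Duration, isRestart bool) (uint64, error) {
	if isRestart {
		return hrt.Beat(ttl)
	}
	return hrt.Register(ttl)
}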

wuqixuan and others added 2 commits April 21, 2016 13:43
Now support detecting the existing machine-id on startup.

Fixes coreos#1241 coreos#615
When beating the Heart in the clean start context, call Heart.Register()
to avoid registering a machine with the same ID as an existing one. In the
restart context, however, call Heart.Beat() to allow re-registration with
the same ID. That way fleetd can handle the machine presence in a
graceful way.

Without this patch, functional tests like TestSingleNodeConnectivityLoss
fail, because Heart.Register() would always return an error, especially
in the restart context. That could result in fleetd being unable to
recover within the expected time frame.
@dongsupark force-pushed the dongsu/fleetd-detect-machine-id branch from adec6fd to 9fd7615 on April 21, 2016 11:44
@dongsupark (Contributor, Author)
Updated.

  • Improved the heartbeat logic to distinguish the restart context from other ones.
  • Updated the functional test as suggested.

@antrik (Contributor) commented Apr 21, 2016

@dongsupark well, if it's really just a timeout issue, we could just increase the timeout in the test case by another TTL length or so...

Regardless, we need to think about whether it's more correct for restart to try a new registration or update the existing one. I tend to agree that updating (as you implemented it now) is probably better -- but I haven't thought much about it...

@@ -384,14 +384,14 @@ func (nc *nspawnCluster) CreateMember() (m Member, err error) {
 	return nc.createMember(id)
 }
 
-func newMachineID() string {
+func (nc *nspawnCluster) NewMachineID() string {
Contributor

Why did you turn it into a method on nc, if it doesn't need nc for anything?...

Contributor Author

Why did you turn it into a method on nc, if it doesn't need nc for anything?

No reason. I just thought that was an ideal way to export a method from the Cluster interface. Of course we could make it a general helper in functional/util or whatever.

Contributor

Oh, right, this might be specific to a certain cluster type... I was just thinking of exporting it as a function from where it is, but that obviously would conflict with other implementations. Sorry for the noise.
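For context, a minimal sketch of what a general helper in functional/util could look like, assuming a machine ID is simply 32 lowercase hex characters like systemd's /etc/machine-id; the implementation that actually landed may differ:

// Hypothetical helper: generate a random 32-character hex machine ID.
func NewMachineID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil { // crypto/rand
		panic(err)
	}
	return hex.EncodeToString(b) // encoding/hex
}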

@dongsupark (Contributor, Author)
well, if it's really just a timeout issue, we could just increase the timeout in the test case by another TTL length or so...

Maybe yes, we could solve it by tuning the timeout. But as far as I have observed, the original change in fleetd affected not only the connectivity test but other tests as well -- not always, but sometimes -- so I cannot prove it right now. I'm not surprised though, as it's not the first time I've seen occasional random failures in functional tests.
Thus I tend to rely on a fix in fleetd instead of some tuning in the functional tests -- of course only if my fix has no logical errors or other downsides.

Regardless, we need to think about whether it's more correct for restart to try a new registration or update the existing one. I tend to agree that updating (as you implemented it now) is probably better -- but I haven't thought much about it...

Updating through Beat() is better, as it already performs the conditional action "create or update".

@antrik (Contributor) commented Apr 21, 2016

@dongsupark well, I didn't mean to say that working around it by increasing the timeout is necessarily a good idea -- just pointing out that it would be an option if that was the only problem... But if it uncovers a real problem, it's better to solve the real problem of course :-)

As for the create vs update question, I was just wondering whether a reload should be treated more like a fresh start, i.e. trying to acquire the machine ID again. But that of course can't work if the old entry isn't purged from the registry yet -- so as I already said, your current approach is probably correct :-)

Dongsu Park added 2 commits April 21, 2016 16:41
Move newMachineID() from platform/nspawn.go to util.NewMachineID(),
to make it available for functional tests. This will be necessary for
the following test cases where machine IDs need to be regenerated.
A new test, TestDetectMachineId, checks that etcd registration fails
when a duplicate entry for /etc/machine-id gets registered by
different machines. Note that registration is expected to fail in this case.

Goal of the test is to cover the improvement patch by @wuqixuan
("fleetd: Detecting the existing machine-id").

See also coreos#1288,
coreos#1241,
coreos#615.

Suggested-by: Olaf Buddenhagen <olaf@endocode.com>
Cc: wuqixuan <wuqixuan@huawei.com>
Cc: Djalal Harouni <djalal@endocode.com>
@dongsupark force-pushed the dongsu/fleetd-detect-machine-id branch from 9fd7615 to 176825f on April 21, 2016 14:43
@tixxdz (Contributor) commented Apr 21, 2016

lgtm, thank you!

@tixxdz merged commit d357cf2 into coreos:master on Apr 21, 2016
@dongsupark deleted the dongsu/fleetd-detect-machine-id branch on April 22, 2016 10:13
Successfully merging this pull request may close these issues.

agent: check for existing machine-ids on startup