This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

fleetd: detect the existing machine ID #1561

Merged
4 commits merged into coreos:master from the dongsu/fleetd-detect-machine-id branch on Apr 21, 2016

Conversation

@dongsupark (Contributor) commented Apr 19, 2016

Detect the existing machine ID when fleetd starts up, to avoid confusing errors when creating multiple machines in a cluster. In addition to the fleetd fix, this PR also contains a functional test to cover that case.

The functional test was suggested by @antrik.
/cc @wuqixuan
Based on #1288
Fixes: #615
See also #1241

// If the two machine IDs differ from each other,
// set m1's ID to the same one as m0's, to intentionally
// trigger the error case of a duplicated machine ID.
if strings.Compare(m0_machine_id, m1_machine_id) != 0 {
Contributor

Just set the ID of m1 unconditionally -- no need to get and compare it first...
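A rough sketch of what the unconditional variant could look like, reusing cluster.MemberCommand and machineIdFile from the snippets in this PR; this is an illustration, not the code that was actually merged:

// Read m0's machine ID, then copy it onto m1 with no prior comparison,
// so the duplicate-ID error path is always exercised.
stdout, err := cluster.MemberCommand(m0, "cat", machineIdFile)
if err != nil {
	t.Fatalf("m0: failed to read machine ID: %v", err)
}
m0_machine_id := strings.TrimSpace(stdout)

if _, err := cluster.MemberCommand(m1, "echo", m0_machine_id, "|", "sudo", "tee", machineIdFile); err != nil {
	t.Fatalf("m1: failed to overwrite machine ID: %v", err)
}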

@tixxdz added this to the v0.13.0 milestone on Apr 20, 2016
stdout, _ = cluster.MemberCommand(m1, "systemctl", "show", "--property=Result", "fleet")
if strings.TrimSpace(stdout) != "Result=success" {
t.Fatalf("Result for fleet unit not reported as success: %s", stdout)
}
Contributor

So fleet should actually be running, but failing to perform any operations? I guess that makes sense in a way (the conflicting machine might go away later) -- but it's a bit surprising at first. Maybe add a comment explaining this?

(I wonder, is this behaviour actually mentioned in the fleetd documentation?...)

Contributor

@antrik: yes, as discussed I'm surprised too... could you please check the documentation and open a PR for that one?

@antrik (Contributor) commented Apr 20, 2016

@dongsupark "suggested by antrik" is a bit confusing: I only suggested how to perform the functional test -- not the original fix :-)

} else {
t.Fatalf("Error is expected, but got success.\nstderr: %s", stderr)
}
}
Contributor

Looks promising! So on top of all the comments, it would be nice to add the following after this:

stdout, err = cluster.MemberCommand(m0, "echo", new_uniq_machine_id, "|", "sudo", "tee", machineIdFile)
on machine 0, then restart fleet on machine 0 so that we break the loop here (https://github.com/coreos/fleet/pull/1561/files#diff-91bbeda7eb98a7adc57b9e47e2cf5c2bR179) on machine 1 and fleet continues to work properly -- after all your previous tests, of course.

Now if restarting fleet on m0 doesn't work, then just kill machine 0, let machine 1 stay up to do its stuff and register itself cleanly, and follow up with a final stdout, stderr, err := cluster.Fleetctl(m1, "list-machines", "--no-legend")?

Thank you!
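For illustration, a rough sketch of those suggested follow-up steps, assuming the helpers already used in this test (cluster.MemberCommand, cluster.Fleetctl) and a fresh ID in new_uniq_machine_id; the names and exact checks are placeholders:

// Give m0 a fresh, unique machine ID and restart fleet there, so m1 can
// break out of its retry loop and register itself cleanly.
if _, err := cluster.MemberCommand(m0, "echo", new_uniq_machine_id, "|", "sudo", "tee", machineIdFile); err != nil {
	t.Fatalf("m0: failed to replace machine ID: %v", err)
}
if _, err := cluster.MemberCommand(m0, "sudo", "systemctl", "restart", "fleet"); err != nil {
	t.Fatalf("m0: failed to restart fleet: %v", err)
}

// Afterwards both machines should show up with distinct IDs.
stdout, stderr, err := cluster.Fleetctl(m1, "list-machines", "--no-legend")
if err != nil {
	t.Fatalf("m1: list-machines failed: %v\nstderr: %s", err, stderr)
}
if len(strings.Split(strings.TrimSpace(stdout), "\n")) != 2 {
	t.Fatalf("expected 2 machines, got:\n%s", stdout)
}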

@dongsupark force-pushed the dongsu/fleetd-detect-machine-id branch from 851244a to 0661d34 on April 20, 2016 14:58
@dongsupark (Contributor, Author)
Updated:

  • In TestDetectMachineId, add more test cases where m0's ID becomes different from m1's. In that case, m0 and m1 are expected to keep working gracefully.
  • Export NewMachineID() from the Cluster interface so it can be used in functional tests.
  • Split the "restart fleet + systemctl show" calls into a separate function, restartFleetService() (a rough sketch of such a helper follows below).
  • Remove an unnecessary comparison between m0 and m1 before setting the same machine ID.
  • When intentionally triggering an error, handle the error from "fleetctl list-machines" more precisely.
  • Fixed comments.

Thanks!
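For reference, a hypothetical sketch of what such a restartFleetService() helper might look like, built from the restart command and the systemctl check already shown above, and assuming the Cluster and Member types from the functional test platform package; the actual helper in this PR may be shaped differently:

// Hypothetical helper: restart the fleet unit on a member and verify that
// systemd reports the unit result as success.
func restartFleetService(cluster platform.Cluster, m platform.Member) error {
	if _, err := cluster.MemberCommand(m, "sudo", "systemctl", "restart", "fleet"); err != nil {
		return err
	}
	stdout, err := cluster.MemberCommand(m, "systemctl", "show", "--property=Result", "fleet")
	if err != nil {
		return err
	}
	if strings.TrimSpace(stdout) != "Result=success" {
		return fmt.Errorf("fleet unit result not reported as success: %s", stdout)
	}
	return nil
}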

// This is an expected error. PASS.
} else {
t.Fatalf("m1: Failed to get list of machines. err: %v\nstderr: %s", err, stderr)
}
Contributor

Empty then blocks are usually considered bad style... Just reverse the condition :-)

(Keep the comment though to make things clear.)

Alternatively, you could merge the inner and outer if conditions into an if / else if construct... Not sure which approach is more readable in this case.
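To make the two options concrete (errIsExpectedDuplicate stands in for whatever check the test actually performs on err and stderr; this is only a sketch, not the merged code):

// Option 1: reverse the condition so there is no empty then-block.
if !errIsExpectedDuplicate {
	t.Fatalf("m1: Failed to get list of machines. err: %v\nstderr: %s", err, stderr)
}
// Otherwise this is the expected error. PASS.

// Option 2: merge the inner and outer conditions into an if / else-if chain.
if err == nil {
	t.Fatalf("Error is expected, but got success.\nstderr: %s", stderr)
} else if !errIsExpectedDuplicate {
	t.Fatalf("m1: Failed to get list of machines. err: %v\nstderr: %s", err, stderr)
}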

@dongsupark (Contributor, Author)
It looks like this PR somehow causes TestSingleNodeConnectivityLoss to fail. Of course it's hard to say why an unrelated test fails after fixing a bug w.r.t. the machine ID.
I suspect a missing commit, which will be implemented in PR #1563, has something to do with the error. Will investigate.


// Trigger another test case where m0's ID becomes different from m1's.
// Then m0 and m1 are expected to work properly with distinct machine IDs,
// after fleet.service has been restarted on both m0 and m1.
Contributor

Why do you want to restart on both? AIUI, changing the ID of m0 and restarting this one should be sufficient to resolve the conflict, so m1 should automatically become functional as well?...

@antrik (Contributor) commented Apr 20, 2016

@dongsupark is the failure reproducible or intermittent?

@dongsupark force-pushed the dongsu/fleetd-detect-machine-id branch from 0661d34 to adec6fd on April 21, 2016 07:58
@dongsupark (Contributor, Author) commented Apr 21, 2016

is the failure reproducible or intermittent?

It's almost always reproducible; a persistent regression.
This regression has nothing to do with the functional test TestDetectMachineId. The culprit is only the fleetd fix.

@dongsupark (Contributor, Author)
I think what's happening is this:

  • TestSingleNodeConnectivityLoss blocks connections to etcd for a while and then restores them, to verify that fleetd gets reattached to etcd. This mechanism is based on the fleetd monitor, which invokes Server.run() in the restart context immediately after everything is ready. Server.run() then first of all enters a heartbeat loop.
  • Before this PR, in the restart context, the heartbeat loop only called Heart.Beat(). That succeeded quickly, as Beat() already handled both cases, even when the machine was already registered.
  • With the fleetd fix in this PR, however, the heartbeat loop calls Heart.Register(), which always returns an error like "key already exists" because the machine was already registered, so the loop keeps retrying the registration. That's normally harmless, as the monitor is supposed to kick it out of the loop once the timeout expires. Unfortunately, the functional test cannot wait that long for it to recover.

Solutions could be:

  1. First try to Register() the machine, and Beat() starting from the second attempt. This is done by "[WIP] fleetd: register heartbeat only for the first attempt" #1563. But now I think it's sub-optimal, as it fails on the first attempt anyway.
  2. Distinguish the restart context from the normal start context. Then call Register() only in the start context, and Beat() in the restart context.

I'll update this PR according to solution 2.
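A hypothetical sketch of solution 2 (the Heart method signatures and the isRestart flag are assumptions for illustration; the actual fleetd change may be structured differently):

// On a clean start, Register() the machine so a duplicate machine ID is
// detected as an error; on a restart, just Beat() to refresh the existing
// registration instead of failing with "key already exists".
func beatOrRegister(hrt heart.Heart, ttl time.Duration, isRestart bool) (uint64, error) {
	if isRestart {
		return hrt.Beat(ttl)
	}
	return hrt.Register(ttl)
}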

wuqixuan and others added 2 commits April 21, 2016 13:43
Now support detecting the existing machine-id on startup.

Fixes coreos#1241 coreos#615
When beating the Heart in the clean start context, call Heart.Register()
to avoid registering a machine with the same ID as an existing one. In the
restart context, however, call Heart.Beat() to allow re-registration with
the same ID. That way fleetd can handle the machine presence in a
graceful way.

Without this patch, functional tests like TestSingleNodeConnectivityLoss
fail, because Heart.Register() would always return an error, especially
in the restart context. That could result in fleetd being unable to
recover within the expected time frame.
@dongsupark force-pushed the dongsu/fleetd-detect-machine-id branch from adec6fd to 9fd7615 on April 21, 2016 11:44
@dongsupark (Contributor, Author)
Updated.

  • Improved the heartbeat logic to distinguish the restart context from other ones.
  • Updated the functional test as suggested.

@antrik (Contributor) commented Apr 21, 2016

@dongsupark well, if it's really just a timeout issue, we could just increase the timeout in the test case by another TTL length or so...

Regardless, we need to think about whether it's more correct for restart to try a new registration or update the existing one. I tend to agree that updating (as you implemented it now) is probably better -- but I haven't thought much about it...

@@ -384,14 +384,14 @@ func (nc *nspawnCluster) CreateMember() (m Member, err error) {
 	return nc.createMember(id)
 }
 
-func newMachineID() string {
+func (nc *nspawnCluster) NewMachineID() string {
Contributor

Why did you turn it into a method on nc, if it doesn't need nc for anything?...

Contributor Author

Why did you turn it into a method on nc, if it doesn't need nc for anything?

No reason. I just thought that was an ideal way to export a method from the Cluster interface. Of course we could make it a general helper in functional/util or whatever.

Contributor

Oh, right, this might be specific to a certain cluster type... I was just thinking of exporting it as a function from where it is, but that obviously would conflict with other implementations. Sorry for the noise.
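For context, a minimal sketch of what a general helper in functional/util could look like, assuming a machine ID is simply 32 lowercase hex characters like systemd's /etc/machine-id; the implementation that actually landed may differ:

// Hypothetical helper: generate a random 32-character hex machine ID.
func NewMachineID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil { // crypto/rand
		panic(err)
	}
	return hex.EncodeToString(b) // encoding/hex
}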

@dongsupark (Contributor, Author)
well, if it's really just a timeout issue, we could just increase the timeout in the test case by another TTL length or so...

Maybe yes, we could solve it by tuning the timeout. But as far as I have observed, the original change in fleetd affected not only the connectivity test but other tests as well -- not always, but sometimes -- so I cannot prove it right now. I'm not surprised though, as it's not the first time I've seen occasional random failures in functional tests.
Thus I tend to rely on a fix in fleetd instead of some tuning in the functional tests -- of course only if my fix has no logical errors or other downsides.

Regardless, we need to think about whether it's more correct for restart to try a new registration or update the existing one. I tend to agree that updating (as you implemented it now) is probably better -- but I haven't thought much about it...

Updating through Beat() is better, as it already performs the conditional action "create or update".

@antrik (Contributor) commented Apr 21, 2016

@dongsupark well, I didn't mean to say that working around it by increasing the timeout is necessarily a good idea -- just pointing out that it would be an option if that was the only problem... But if it uncovers a real problem, it's better to solve the real problem of course :-)

As for the create vs update question, I was just wondering whether a reload should be treated more like a fresh start, i.e. trying to acquire the machine ID again. But that of course can't work if the old entry isn't purged from the registry yet -- so as I already said, your current approach is probably correct :-)

Dongsu Park added 2 commits April 21, 2016 16:41
Move newMachineID() from platform/nspawn.go to util.NewMachineID(),
to make it available for functional tests. This will be necessary for
the following test cases where machine IDs need to be regenerated.
A new test, TestDetectMachineId, checks that etcd registration fails
when a duplicate entry for /etc/machine-id gets registered by
different machines. Note that registration is expected to fail in this case.

Goal of the test is to cover the improvement patch by @wuqixuan
("fleetd: Detecting the existing machine-id").

See also coreos#1288,
coreos#1241,
coreos#615.

Suggested-by: Olaf Buddenhagen <olaf@endocode.com>
Cc: wuqixuan <wuqixuan@huawei.com>
Cc: Djalal Harouni <djalal@endocode.com>
@dongsupark force-pushed the dongsu/fleetd-detect-machine-id branch from 9fd7615 to 176825f on April 21, 2016 14:43
@tixxdz (Contributor) commented Apr 21, 2016

lgtm, thank you!

@tixxdz merged commit d357cf2 into coreos:master on Apr 21, 2016
@dongsupark deleted the dongsu/fleetd-detect-machine-id branch on April 22, 2016 10:13
Successfully merging this pull request may close these issues.

agent: check for existing machine-ids on startup