fleetd: detect the existing machine ID #1561
Conversation
// If the two machine IDs are different with each other,
// set the m1's ID to the same one as m0, to intentionally
// trigger an error case of duplication of machine ID.
if strings.Compare(m0_machine_id, m1_machine_id) != 0 {
Just set the ID of m1 unconditionally -- no need to get and compare it first...
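A rough sketch of that simplification, as a fragment of the same test (m0_machine_id, machineIdFile and the cluster helpers are taken from the snippets quoted in this conversation; the exact error handling is an assumption):

// Overwrite m1's machine ID with m0's value unconditionally; no need to
// read and compare m1's current ID first.
stdout, err := cluster.MemberCommand(m1, "echo", m0_machine_id, "|", "sudo", "tee", machineIdFile)
if err != nil {
	t.Fatalf("m1: failed to overwrite %s: %v\nstdout: %s", machineIdFile, err, stdout)
}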
stdout, _ = cluster.MemberCommand(m1, "systemctl", "show", "--property=Result", "fleet")
if strings.TrimSpace(stdout) != "Result=success" {
	t.Fatalf("Result for fleet unit not reported as success: %s", stdout)
}
So fleet should actually be running, but failing to perform any operations? I guess that makes sense in a way (the conflicting machine might go away later) -- but it's a bit surprising at first. Maybe add a comment explaining this?
(I wonder, is this behaviour actually mentioned in the fleetd documentation?...)
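For illustration, the explanatory comment asked for above could read roughly like this (the wording is an assumption, not taken from the PR):

// Even with a conflicting machine ID, fleet on m1 keeps running: it merely
// fails to register and keeps retrying, since the conflicting machine might
// go away later. systemd therefore still reports Result=success for the unit.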
@antrik: yes, as discussed I'm surprised too... could you please check the documentation and open a PR for that one?
@dongsupark "suggested by antrik" is a bit confusing: I only suggested how to perform the functional test -- not the original fix :-)
	} else {
		t.Fatalf("Error is expected, but got success.\nstderr: %s", stderr)
	}
}
Looks promising! So on top of all the comments, it would be nice to add the following after this:
stdout, err = cluster.MemberCommand(m0, "echo", new_uniq_machine_id, "|", "sudo", "tee", machineIdFile)
on machine 0, then restart fleet on machine 0 so we break the loop here https://github.com/coreos/fleet/pull/1561/files#diff-91bbeda7eb98a7adc57b9e47e2cf5c2bR179 on machine 1, and fleet continues to work properly after, of course, all your previous tests (see the sketch after this comment).
Now if restarting fleet on m0 doesn't work, then just kill machine 0, leave machine 1 up to do its stuff and register itself cleanly, and follow up with a last stdout, stderr, err := cluster.Fleetctl(m1, "list-machines", "--no-legend")?
Thank you!
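A rough sketch of those follow-up steps, as a fragment of the same test; new_uniq_machine_id, machineIdFile and the cluster helpers mirror the snippets quoted in this conversation, while the systemctl restart invocation and the final check are assumptions:

// 1. Give m0 a fresh, unique machine ID.
stdout, err = cluster.MemberCommand(m0, "echo", new_uniq_machine_id, "|", "sudo", "tee", machineIdFile)
if err != nil {
	t.Fatalf("m0: failed to overwrite %s: %v\nstdout: %s", machineIdFile, err, stdout)
}

// 2. Restart fleet on m0 so it re-registers under the new ID and m1 can
//    break out of its registration retry loop.
if _, err = cluster.MemberCommand(m0, "sudo", "systemctl", "restart", "fleet"); err != nil {
	t.Fatalf("m0: failed to restart fleet: %v", err)
}

// 3. Listing the machines from m1 should now succeed and show both members.
//    (A real test would poll here, since m1 may need a retry interval to
//    register itself again.)
stdout, stderr, err := cluster.Fleetctl(m1, "list-machines", "--no-legend")
if err != nil {
	t.Fatalf("m1: list-machines failed: %v\nstderr: %s", err, stderr)
}
if len(strings.Split(strings.TrimSpace(stdout), "\n")) != 2 {
	t.Fatalf("m1: expected 2 machines, got:\n%s", stdout)
}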
Force-pushed from 851244a to 0661d34.
Updated
Thanks!
		// This is an expected error. PASS.
	} else {
		t.Fatalf("m1: Failed to get list of machines. err: %v\nstderr: %s", err, stderr)
	}
Empty then blocks are usually considered bad style... Just reverse the condition :-)
(Keep the comment though to make things clear.)
Alternatively, you could merge the inner and outer if conditions into an if + elseif construct... Not sure which approach is more readable in this case.
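For illustration, the reversed condition could look roughly like this (a sketch assuming the err and stderr variables from the quoted snippet):

if err == nil {
	t.Fatalf("m1: list-machines succeeded, but an error was expected due to the duplicated machine ID.\nstderr: %s", stderr)
}
// err != nil is the expected error here -- PASS.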
It looks like this PR somehow causes TestSingleNodeConnectivityLoss to fail.
|
// Trigger another test case of m0's ID getting different from m1's.
// Then it's expected that m0 and m1 would be working properly with distinct
// machine IDs, after having restarted fleet.service both on m0 and m1.
Why do you want to restart on both? AIUI, changing the ID of m0 and restarting this one should be sufficient to resolve the conflict, so m1 should automatically become functional as well?...
@dongsupark is the failure reproducible or intermittent?
Force-pushed from 0661d34 to adec6fd.
It's almost always reproducible. A persistent regression.
I think what's happening is this: in the restart context, Heart.Register() always returns an error, because the old registration with the same machine ID is still present in the registry, so fleetd cannot recover within the expected time frame.
A solution could be:
1. increase the timeout in the affected test, or
2. call Heart.Beat() instead of Heart.Register() in the restart context, so that the existing registration is updated.
I'll update this PR according to solution 2.
Now support detecting the existing machine-id on startup. Fixes coreos#1241 coreos#615
When beating the Heart in the clean start context, call Heart.Register() to avoid such a case of registering machine with the same ID. In the restart context, however, call Heart.Beat() to allow registration with the same ID. That way fleetd can handle the machine presence in a graceful way. Without this patch, functional tests like TestSingleNodeConnectivityLoss fail, because Heart.Register() would always return an error, especially when it's in the restart context. That could result in a case of fleetd being unable to recover in an expected time frame.
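A minimal sketch of the decision described above. The Heart interface here only mirrors the two calls named in the commit message; its signatures, the restarting flag and the helper name are assumptions for illustration, not fleet's actual API (only the standard time package is needed):

type Heart interface {
	Register(ttl time.Duration) error
	Beat(ttl time.Duration) error
}

// beatOrRegister illustrates the behaviour described above: on a clean start,
// Register() fails if another machine already holds the same machine ID; in
// the restart context, Beat() refreshes the existing entry so fleetd does not
// trip over its own previous registration.
func beatOrRegister(hrt Heart, restarting bool, ttl time.Duration) error {
	if restarting {
		return hrt.Beat(ttl)
	}
	return hrt.Register(ttl)
}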
Force-pushed from adec6fd to 9fd7615.
Updated.
@dongsupark well, if it's really just a timeout issue, we could just increase the timeout in the test case by another TTL length or so... Regardless, we need to think about whether it's more correct for restart to try a new registration or update the existing one. I tend to agree that updating (as you implemented it now) is probably better -- but I haven't thought much about it...
@@ -384,14 +384,14 @@ func (nc *nspawnCluster) CreateMember() (m Member, err error) {
 	return nc.createMember(id)
 }
 
-func newMachineID() string {
+func (nc *nspawnCluster) NewMachineID() string {
Why did you turn it into a method on nc, if it doesn't need nc for anything?...
Why did you turn it into a method on nc, if it doesn't need nc for anything?
No reason. I just thought that was an ideal way for a method to be exported from the Cluster interface. Of course we could make it a general helper in functional/util or whatever.
Oh, right, this might be specific to a certain cluster type... I was just thinking of exporting it as a function from where it is, but that obviously would conflict with other implementations. Sorry for the noise.
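If it were made a general helper in functional/util as mentioned above, it could look roughly like this (a sketch; the random-hex generation is an assumption and the real implementation may produce the ID differently; it uses crypto/rand and fmt):

// NewMachineID returns a random 32-character lowercase hex string, the
// format systemd expects in /etc/machine-id.
func NewMachineID() string {
	b := make([]byte, 16)
	if _, err := rand.Read(b); err != nil {
		panic(err)
	}
	return fmt.Sprintf("%x", b)
}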
Maybe yes, we could solve it by tuning timeout. But as far as I have observed, the original change in fleetd affected not only the connectivity test, but other tests. Not always, but sometimes. So I cannot prove it right now. Though I'm not surprised, as it's not the first time for me to see occasional random failures in functional tests.
Updating through Heart.Beat() in the restart context seems to be the more correct behaviour to me.
@dongsupark well, I didn't mean to say that working around by increasing the timeout is necessarily a good idea -- just pointing out that it would be an option if it was the only problem... But if it uncovers a real problem, it's better to solve the real problem of course :-) As for the create vs update question, I was just wondering whether a reload should be treated more like a fresh start, i.e. trying to acquire the machine ID again. But that of course can't work if the old entry isn't purged from the registry yet -- so as I already said, your current approach is probably correct :-)
Move newMachineID() from platform/nspawn.go to util.NewMachineID(), to make it available for functional tests. This will be necessary for the following test cases where machine IDs need to be regenerated.
A new test TestDetectMachineId checks if an etcd registration fails when a duplicated entry for /etc/machine-id gets registered to different machines. Note that it's expected to fail in this case. The goal of the test is to cover the improvement patch by @wuqixuan ("fleetd: Detecting the existing machine-id"). See also coreos#1288, coreos#1241, coreos#615. Suggested-by: Olaf Buddenhagen <olaf@endocode.com> Cc: wuqixuan <wuqixuan@huawei.com> Cc: Djalal Harouni <djalal@endocode.com>
Force-pushed from 9fd7615 to 176825f.
lgtm, thank you!
Detect the existing machine ID when fleetd starts up, to avoid uncomfortable errors when creating multiple machines in a cluster. In addition to the fix in fleetd, this PR also contains a functional test to cover such a case.
The functional test was suggested by @antrik.
/cc @wuqixuan
Based on #1288
Fixes: #615
See also #1241