agent: check for existing machine-ids on startup #615
Comments
What situations can cause this to happen?
Cloned VMs (come up a few times on the mailing list).
+1
If we do this, I believe we would lose the ability for an agent to go down with SIGKILL (i.e. no state purge) and come back up without any service interruption. If we make an agent wait on its state key to time out before it can rejoin, that will cause unnecessary churn in the schedule. Once we switch to a more direct presence mechanism (direct API requests) this change won't be as disruptive.
Ah, I thought #697 fixed this, but it looks like it didn't. I must've misread the release notes.
(The whole 'accidentally running fleet instead of fleetctl' issue)
@carmstrong we should really just move
👍
+1 renaming to fleetd
I strongly vote for including a sanity check to ensure unique machine-ids. I came across this due to cloned VMs and spent hours debugging until @jonboulle pointed me in the right direction.
Now support detecting the existing machine-id on startup. Fixes coreos#1241 coreos#615
+1 this was a head scratcher
A new test, TestDetectMachineId, checks whether an etcd registration fails when a duplicated entry for /etc/machine-id gets registered to different machines. Note that it's expected to fail in this case. The goal of the test is to cover the improvement patch by @wuqixuan ("fleetd: Detecting the existing machine-id"). See also coreos#1288, coreos#1241, coreos#615.
Suggested-by: Olaf Buddenhagen <olaf@endocode.com>
Cc: wuqixuan <wuqixuan@huawei.com>
A new test, TestDetectMachineId, checks whether an etcd registration fails when a duplicated entry for /etc/machine-id gets registered to different machines. Note that it's expected to fail in this case. The goal of the test is to cover the improvement patch by @wuqixuan ("fleetd: Detecting the existing machine-id"). See also coreos#1288, coreos#1241, coreos#615.
Suggested-by: Olaf Buddenhagen <olaf@endocode.com>
Cc: wuqixuan <wuqixuan@huawei.com>
Cc: Djalal Harouni <djalal@endocode.com>
If multiple systems in a cluster happen to have the same machine-id, all hell can break loose. It would be nice if the agent could do a simple sanity check when first registering itself in the cluster: if a machine with its machine-id already exists, it can log something useful and/or back off and try again (e.g. in case a previously pushed machineState just hasn't TTLed yet).
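To make the proposal concrete, here is a rough sketch of what such a check could look like. It is not fleet's actual code: `Registry`, `Claim`, `RegisterWithBackoff` and `ErrDuplicateMachineID` are all hypothetical names, and an in-memory map with expiry times stands in for TTL'd state keys in etcd. It also assumes that a machine can be told apart from its clone by its address, so that an agent restarting at the same address can reclaim its own unexpired record without waiting for the TTL.

```go
package main

import (
	"errors"
	"log"
	"sync"
	"time"
)

// ErrDuplicateMachineID reports that a different machine already holds the ID.
// Hypothetical name, used only in this sketch.
var ErrDuplicateMachineID = errors.New("machine-id already registered by another machine")

// entry is one machine-state record with an expiry time,
// standing in for a state key with a TTL.
type entry struct {
	addr    string
	expires time.Time
}

// Registry is a hypothetical in-memory stand-in for the cluster registry.
type Registry struct {
	mu      sync.Mutex
	entries map[string]entry
}

// NewRegistry returns an empty registry.
func NewRegistry() *Registry {
	return &Registry{entries: make(map[string]entry)}
}

// Claim records machineID as owned by addr for the given ttl.
// It fails only if an unexpired record exists for a different address,
// so a restarted agent at the same address can rejoin immediately.
func (r *Registry) Claim(machineID, addr string, ttl time.Duration) error {
	r.mu.Lock()
	defer r.mu.Unlock()

	now := time.Now()
	if e, ok := r.entries[machineID]; ok && now.Before(e.expires) && e.addr != addr {
		return ErrDuplicateMachineID
	}
	r.entries[machineID] = entry{addr: addr, expires: now.Add(ttl)}
	return nil
}

// RegisterWithBackoff tries Claim up to attempts times, logging each conflict
// and doubling the wait between tries, in case a stale record simply
// has not expired yet. It returns the last error if every attempt fails.
func RegisterWithBackoff(r *Registry, machineID, addr string, ttl, wait time.Duration, attempts int) error {
	if attempts < 1 {
		attempts = 1
	}
	var err error
	for i := 0; i < attempts; i++ {
		if err = r.Claim(machineID, addr, ttl); err == nil {
			return nil
		}
		log.Printf("machine-id %q is already present in the cluster (attempt %d of %d): %v",
			machineID, i+1, attempts, err)
		if i < attempts-1 {
			time.Sleep(wait)
			wait *= 2
		}
	}
	return err
}
```

With this shape, a cloned VM that shares a machine-id but has a different address gets `ErrDuplicateMachineID` and a log line pointing at the cause, while an agent killed with SIGKILL and restarted on the same host is not forced to wait for its old state to time out. Whether the address is a reliable enough discriminator in a real cluster is an open question.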