This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

Monitor heartbeat attempt can time out, but succeed #750

Closed
bcwaldon opened this issue Aug 6, 2014 · 7 comments

Comments

@bcwaldon
Contributor

bcwaldon commented Aug 6, 2014

An attempt to heartbeat machine presence can succeed server-side but time out client-side. This causes all subsequent attempts to fail, since the presence key is created with prevExist=false.

Aug 06 06:43:21 node-02 fleet[668]: I0806 06:43:21.360308 00668 monitor.go:54] Monitor heartbeat function returned err, retrying in 2.5s: timeout reached
…
Aug 06 06:43:25 node-02 fleet[668]: I0806 06:43:25.354149 00668 monitor.go:54] Monitor heartbeat function returned err, retrying in 2.5s: 105: Key already exists (/_coreos.com/fleet/machines/00b7e1bbb2464ed9a4ef4282b11f2161/object)
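
To illustrate the failure mode, here is a minimal sketch against the etcd v2 KeysAPI (github.com/coreos/etcd/client); the key path and the 105 error are taken from the log above, while the endpoint, value, and TTL are made up for the example:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/client"
)

func main() {
	c, err := client.New(client.Config{Endpoints: []string{"http://127.0.0.1:2379"}})
	if err != nil {
		panic(err)
	}
	kapi := client.NewKeysAPI(c)

	key := "/_coreos.com/fleet/machines/00b7e1bbb2464ed9a4ef4282b11f2161/object"
	// fleet creates the presence key with prevExist=false (create-only).
	opts := &client.SetOptions{PrevExist: client.PrevNoExist, TTL: 10 * time.Second}

	// This models the write that reaches etcd even though fleet's client times
	// out waiting for the response, so fleet records the attempt as failed.
	_, err = kapi.Set(context.Background(), key, "alive", opts)
	fmt.Println("first attempt:", err) // nil: the key is created

	// The retry replays the same create-only write. The key now exists, so
	// every retry fails with "105: Key already exists" until the TTL expires.
	_, err = kapi.Set(context.Background(), key, "alive", opts)
	fmt.Println("retry:", err)
}
```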
@bcwaldon bcwaldon added the bug label Aug 6, 2014
@jonboulle
Contributor

Somewhat related: #615

@bcwaldon bcwaldon added this to the v0.8.1 milestone Sep 3, 2014
@bcwaldon bcwaldon modified the milestones: v0.8.2, v0.8.1 Sep 12, 2014
@bcwaldon bcwaldon removed this from the v0.8.4 milestone Oct 20, 2014
@wuqixuan
Contributor

@jonboulle @bcwaldon It's not exactly the same as #615.
There are three cases that should be handled:

  1. prevent another machine with the same ID (agent: check for existing machine-ids on startup #615)
  2. allow the existing same ID on the second and later heartbeats (Monitor heartbeat attempt can time out, but succeed #750)
  3. prevent another daemon on the same host.

@jonboulle jonboulle added kind/bug and removed bug labels Sep 24, 2015
@jonboulle jonboulle added this to the v0.13.0 milestone Jan 25, 2016
@kayrus
Contributor

kayrus commented Apr 11, 2016

Does anyone know how to reproduce this issue?

@dongsupark
Contributor

  1. prevent another machine with the same ID (agent: check for existing machine-ids on startup #615)
  2. allow the existing same ID on the second and later heartbeats (Monitor heartbeat attempt can time out, but succeed #750)
  3. prevent another daemon on the same host.

The 1st one, preventing other machines from being registered with the same ID, was addressed by #1561.
I'm not sure what we can do about the 3rd one, preventing another daemon on the same host.

Let me get the 2nd one straight: allowing the same existing ID when heartbeating for the second time. AFAIK no PR has addressed this yet.
Could it be done simply by calling s.hrt.Register() only on the first attempt, and s.hrt.Beat() from the second attempt onward? That would address this issue relatively easily.

Though I'm not sure whether this should be done within the v0.13 milestone.
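
A minimal sketch of that Register-then-Beat idea (the Heart interface and the Monitor fields below are assumptions for illustration, not fleet's actual types):

```go
package monitor

import "time"

// Heart is assumed to expose a create-only Register() and a refresh-only Beat().
type Heart interface {
	Register(ttl time.Duration) error // create presence key (prevExist=false)
	Beat(ttl time.Duration) error     // refresh presence key (prevExist=true)
}

type Monitor struct {
	hrt      Heart
	ttl      time.Duration
	attempts int
}

// heartbeat calls Register() only on the very first attempt; every later
// attempt calls Beat(), so a create that actually succeeded despite a
// client-side timeout no longer makes all subsequent retries fail.
func (m *Monitor) heartbeat() error {
	m.attempts++
	if m.attempts == 1 {
		return m.hrt.Register(m.ttl)
	}
	return m.hrt.Beat(m.ttl)
}
```

Whether Beat() should also fall back to Register() once the key has expired is a separate question; the sketch only covers the timeout-then-retry case from this issue.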

@antrik
Contributor

antrik commented Apr 20, 2016

At first I thought 3) would be addressed by the fix for 1) too -- but I realise now that it is indeed a separate issue: if you start two fleetds on the same machine, they will be "fighting" over the machine ID... Not sure this is really a problem that needs to be fixed though -- at least not an urgent one.

Anyway, this issue here is clearly about case 2.

dongsupark pushed a commit to endocode/fleet that referenced this issue Apr 20, 2016
When beating the Heart for the 1st time, call Heart.Register() to avoid
registering another machine with the same ID.
Starting from the next heartbeat, however, call Heart.Beat() to allow
registration with the same ID. That way fleetd can handle machine
presence gracefully.

Suggested-by: wuqixuan <wuqixuan@huawei.com>
Fixes: coreos#750
@dongsupark
Contributor

  1. prevent another machine with the same ID (agent: check for existing machine-ids on startup #615)
  2. allow the existing same ID on the second and later heartbeats (Monitor heartbeat attempt can time out, but succeed #750)
  3. prevent another daemon on the same host.

The 1st one is already fixed by #1561.
As for the 2nd issue, originally I thought I could fix it via #1563, but that was not the case; it actually caused other side effects. So I had to come up with another solution, which was already merged into #1561.
I don't know about the 3rd issue. At least I don't think it should be fixed as part of this issue.

So I think this issue can be closed.

@tixxdz
Contributor

tixxdz commented Apr 27, 2016

Closing this one.

Thanks all!

@tixxdz tixxdz closed this as completed Apr 27, 2016