This repository has been archived by the owner on Jan 30, 2020. It is now read-only.

Monitor heartbeat attempt can time out, but succeed #750

Closed
bcwaldon opened this issue Aug 6, 2014 · 7 comments

Comments

@bcwaldon
Contributor

bcwaldon commented Aug 6, 2014

An attempt to heartbeat machine presence can succeed server-side but time out client-side. This causes all subsequent attempts to fail, since the presence key is created with prevExist=false.

Aug 06 06:43:21 node-02 fleet[668]: I0806 06:43:21.360308 00668 monitor.go:54] Monitor heartbeat function returned err, retrying in 2.5s: timeout reached
…
Aug 06 06:43:25 node-02 fleet[668]: I0806 06:43:25.354149 00668 monitor.go:54] Monitor heartbeat function returned err, retrying in 2.5s: 105: Key already exists (/_coreos.com/fleet/machines/00b7e1bbb2464ed9a4ef4282b11f2161/object)
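
To illustrate the failure mode, here is a minimal sketch against the etcd v2 KeysAPI (github.com/coreos/etcd/client); the key path and the 105 error are taken from the log above, while the endpoint, value, and TTL are made up for the example:

```go
package main

import (
	"context"
	"fmt"
	"time"

	"github.com/coreos/etcd/client"
)

func main() {
	c, err := client.New(client.Config{Endpoints: []string{"http://127.0.0.1:2379"}})
	if err != nil {
		panic(err)
	}
	kapi := client.NewKeysAPI(c)

	key := "/_coreos.com/fleet/machines/00b7e1bbb2464ed9a4ef4282b11f2161/object"
	// fleet creates the presence key with prevExist=false (create-only).
	opts := &client.SetOptions{PrevExist: client.PrevNoExist, TTL: 10 * time.Second}

	// This models the write that reaches etcd even though fleet's client times
	// out waiting for the response, so fleet records the attempt as failed.
	_, err = kapi.Set(context.Background(), key, "alive", opts)
	fmt.Println("first attempt:", err) // nil: the key is created

	// The retry replays the same create-only write. The key now exists, so
	// every retry fails with "105: Key already exists" until the TTL expires.
	_, err = kapi.Set(context.Background(), key, "alive", opts)
	fmt.Println("retry:", err)
}
```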
@bcwaldon bcwaldon added the bug label Aug 6, 2014
@jonboulle
Contributor

Somewhat related: #615

@bcwaldon bcwaldon added this to the v0.8.1 milestone Sep 3, 2014
@bcwaldon bcwaldon modified the milestones: v0.8.2, v0.8.1 Sep 12, 2014
@bcwaldon bcwaldon removed this from the v0.8.4 milestone Oct 20, 2014
@wuqixuan
Contributor

@jonboulle @bcwaldon It's not exactly the same as #615.
There are three cases that should be handled:

  1. prevent another machine with the same ID (agent: check for existing machine-ids on startup #615)
  2. allow the existing same ID on the second and later heartbeats (Monitor heartbeat attempt can time out, but succeed #750)
  3. prevent another daemon on the same host.

@jonboulle jonboulle added kind/bug and removed bug labels Sep 24, 2015
@jonboulle jonboulle added this to the v0.13.0 milestone Jan 25, 2016
@kayrus
Contributor

kayrus commented Apr 11, 2016

Does anyone know how to reproduce this issue?

@dongsupark
Contributor

  1. prevent another machine with the same ID (agent: check for existing machine-ids on startup #615)
  2. allow the existing same ID on the second and later heartbeats (Monitor heartbeat attempt can time out, but succeed #750)
  3. prevent another daemon on the same host.

The 1st one, preventing other machines from being registered with the same ID, was addressed by #1561.
I'm not sure what we can do about the 3rd one, preventing another daemon on the same host.

Let me get the 2nd one straight: allowing the same existing ID when heartbeating for the second time. AFAIK no PR has addressed this yet.
Could it be done simply by calling s.hrt.Register() only on the first attempt, and s.hrt.Beat() from the second attempt onward? That would address this issue relatively easily.

Though I'm not sure whether this should be done within the v0.13 milestone.
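
A minimal sketch of that Register-then-Beat idea (the Heart interface and the Monitor fields below are assumptions for illustration, not fleet's actual types):

```go
package monitor

import "time"

// Heart is assumed to expose a create-only Register() and a refresh-only Beat().
type Heart interface {
	Register(ttl time.Duration) error // create presence key (prevExist=false)
	Beat(ttl time.Duration) error     // refresh presence key (prevExist=true)
}

type Monitor struct {
	hrt      Heart
	ttl      time.Duration
	attempts int
}

// heartbeat calls Register() only on the very first attempt; every later
// attempt calls Beat(), so a create that actually succeeded despite a
// client-side timeout no longer makes all subsequent retries fail.
func (m *Monitor) heartbeat() error {
	m.attempts++
	if m.attempts == 1 {
		return m.hrt.Register(m.ttl)
	}
	return m.hrt.Beat(m.ttl)
}
```

Whether Beat() should also fall back to Register() once the key has expired is a separate question; the sketch only covers the timeout-then-retry case from this issue.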

@antrik
Contributor

antrik commented Apr 20, 2016

At first I thought 3) would be addressed by the fix for 1) too -- but I realise now that it is indeed a separate issue: if you start two fleetds on the same machine, they will be "fighting" over the machine ID... Not sure this is really a problem that needs to be fixed though -- at least not an urgent one.

Anyway, this issue here is clearly about case 2.

dongsupark pushed a commit to endocode/fleet that referenced this issue Apr 20, 2016
When beating the Heart for the 1st time, call Heart.Register() to avoid
registering another machine with the same ID.
Starting from the next heartbeat, however, call Heart.Beat() to allow
registration with the same ID. That way fleetd can handle machine
presence gracefully.

Suggested-by: wuqixuan <wuqixuan@huawei.com>
Fixes: coreos#750
@dongsupark
Contributor

  1. prevent another machine with the same ID (agent: check for existing machine-ids on startup #615)
  2. allow the existing same ID on the second and later heartbeats (Monitor heartbeat attempt can time out, but succeed #750)
  3. prevent another daemon on the same host.

The 1st one is already fixed by #1561.
As for the 2nd issue, originally I thought I could fix it via #1563, but that was not the case; it actually caused other side effects. So I had to come up with another solution, which was already merged into #1561.
I don't know about the 3rd issue. At least I don't think it should be fixed as part of this issue.

So I think this issue can be closed.

@tixxdz
Contributor

tixxdz commented Apr 27, 2016

Closing this one.

Thanks all!

@tixxdz tixxdz closed this as completed Apr 27, 2016