RFE: periodic heartbeat in bluechi-controller #857

engelmi · 2024-03-27T12:51:03Z

Please describe what you would like to see

#652 raised the problem that bluechi-controller detects a disconnect of a bluechi-agent quite late, for example when the cable is unplugged and a command is issued to that agent before the connection timeout is hit. This leaves a "zombie" agent in the bluechi-controller: it still lists the agent as online and refuses reconnects from that agent because its name is still in use.
This can be mitigated via the TCP KeepAlive options introduced in #674, but detection still takes quite a while, depending on various complex TCP settings such as retransmissions.

The bluechi-agent, on the other hand, can detect a disconnect fairly quickly thanks to its Heartbeat feature. A similar periodic check of the connection status on the application layer could be used in the bluechi-controller as well: based on each node's last seen timestamp, it could actively disconnect nodes.

Note: I think this only makes sense for fairly reliable networks, and it should be disabled by default (so there is no overhead).

Please describe the solution you'd like

The bluechi-controller uses the same event-based mechanism that the bluechi-agent uses for its heartbeat. At a configurable interval, it checks for each online node that the last seen timestamp is not older than a configurable threshold. If it is older, the controller actively disconnects the node.
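The check described above could look roughly like the following. This is only a minimal sketch to illustrate the idea, not the controller's actual code: the `Node` type, its fields, and both function names are hypothetical, timestamps are assumed to be in milliseconds, and "disconnecting" is reduced to flipping a flag.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical, simplified view of a node as tracked by the controller."""
    name: str
    online: bool
    last_seen_ms: int  # assumed: time of the last message from the agent, in ms


def find_stale_nodes(nodes, now_ms, threshold_ms):
    """Return the online nodes whose last seen timestamp is older than the threshold."""
    return [n for n in nodes if n.online and now_ms - n.last_seen_ms > threshold_ms]


def heartbeat_check(nodes, now_ms, threshold_ms):
    """Run one periodic check and return the names of the nodes that were disconnected."""
    stale = find_stale_nodes(nodes, now_ms, threshold_ms)
    for node in stale:
        # Stand-in for the real disconnect logic (closing the connection,
        # freeing the node name for a reconnect, emitting signals, ...).
        node.online = False
    return [n.name for n in stale]
```

In a real controller, `heartbeat_check` would be driven by a timer event that fires once per interval. Nodes that are already offline are skipped, so a repeated check does not disconnect the same node twice.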

- New configuration options for bluechi-controller
  - HeartbeatInterval: the interval, in milliseconds, at which the last seen timestamps of nodes are checked; a value of 0 disables the check.
  - NodeHeartbeatThreshold: if now - last_seen_timestamp > NodeHeartbeatThreshold, actively disconnect the node.
- Implement verify and disconnect logic
- Implement integration tests
- Extend documentation (man pages, examples, etc.)
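To make the semantics of the two proposed options concrete, here is a small sketch of how they might be read and interpreted. It rests on several assumptions: the `[bluechi-controller]` section name, the sample values, NodeHeartbeatThreshold being in milliseconds, and the helper `load_heartbeat_settings` are all illustrative. Only the option names and the "0 disables it" rule come from the proposal.

```python
import configparser

# Hypothetical configuration snippet; section name and values are assumptions.
SAMPLE_CONF = """
[bluechi-controller]
HeartbeatInterval=2000
NodeHeartbeatThreshold=6000
"""


def load_heartbeat_settings(text):
    """Return (interval_ms, threshold_ms, enabled) parsed from INI-style text."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep option names case-sensitive
    parser.read_string(text)
    # A missing interval falls back to 0, matching "deactivated by default".
    interval_ms = parser.getint("bluechi-controller", "HeartbeatInterval", fallback=0)
    # No default is proposed for the threshold, so a missing value stays None here.
    threshold_ms = parser.getint("bluechi-controller", "NodeHeartbeatThreshold", fallback=None)
    enabled = interval_ms > 0  # a value of 0 disables the periodic check
    return interval_ms, threshold_ms, enabled
```

With the sample above, the check would run every 2000 ms and disconnect nodes not seen for more than 6000 ms. Choosing a threshold that is several times the interval is a design choice in this example, not a requirement of the proposal.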


alexlarsson · 2024-03-28T09:05:38Z

This makes a lot of sense to me.

engelmi added the enhancement (New feature or request) and backlog (This is next up in priority) labels Mar 27, 2024

This was referenced Mar 28, 2024

Optimize LastSeenTimestamp property retrieval in controller #853

Closed

Optimize LastTimeSeen property #552

Closed

ueno mentioned this issue Apr 8, 2024

controller: Proactively disconnect node based on heartbeat #870

Closed

engelmi added this to the v0.9 milestone Apr 8, 2024

dofmind mentioned this issue Jun 21, 2024

controller: Proactively disconnect node based on heartbeat #911

Merged

engelmi mentioned this issue Jun 26, 2024

Extend documentation for detecting disconnected nodes #913

Open

engelmi closed this as completed in #911 Jun 26, 2024

dofmind mentioned this issue Aug 7, 2024

Add to check the liveness for connection with controller #921

Closed
