RFE: periodic heartbeat in bluechi-controller #857

Closed
5 tasks done
engelmi opened this issue Mar 27, 2024 · 1 comment · Fixed by #911
Labels
backlog (This is next up in priority), enhancement (New feature or request)
Comments

@engelmi
Member

engelmi commented Mar 27, 2024

Please describe what you would like to see

#652 raised the problem that bluechi-controller detects a disconnect of a bluechi-agent quite late if, for example, the cable is unplugged and a command is issued to that agent before the connection timeout is hit. This leaves a "zombie" agent in bluechi-controller: it still lists the agent as online and refuses reconnects from that agent because its name is still in use.
The TCP KeepAlive options introduced in #674 mitigate this, but detection still takes quite a while, depending on various TCP settings such as retransmissions.
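
To illustrate why keepalive-based detection depends on several settings, here is a minimal Python sketch (not BlueChi code). It assumes the common keepalive model in which an idle connection is probed after an idle period and declared dead after a number of unanswered probes. The helper names and the example values are hypothetical and are not BlueChi's defaults. When unacknowledged data is in flight, retransmission timers rather than keepalive probes usually govern detection, so the estimate only covers the idle case.

```python
import socket


def keepalive_detection_seconds(idle: int, interval: int, probes: int) -> int:
    """Approximate time until an idle, dead connection is reported.

    idle:     seconds of inactivity before the first probe is sent
    interval: seconds between unanswered probes
    probes:   number of unanswered probes before the connection is dropped
    """
    return idle + interval * probes


def enable_keepalive(sock: socket.socket, idle: int, interval: int, probes: int) -> None:
    """Turn on TCP keepalive for a socket (hypothetical helper)."""
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)
    # These per-socket options are platform specific (available on Linux),
    # so only set the ones that this platform's socket module exposes.
    for name, value in (
        ("TCP_KEEPIDLE", idle),
        ("TCP_KEEPINTVL", interval),
        ("TCP_KEEPCNT", probes),
    ):
        option = getattr(socket, name, None)
        if option is not None:
            sock.setsockopt(socket.IPPROTO_TCP, option, value)
```

With example values of a 1 second idle time, a 1 second probe interval, and 6 probes, `keepalive_detection_seconds(1, 1, 6)` gives 7 seconds for an idle connection.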

The bluechi-agent, on the other hand, detects a disconnect rather quickly thanks to its Heartbeat feature. A similar periodic, application-layer check of the connection status could be used in bluechi-controller as well: based on each node's last seen timestamp, it could actively disconnect nodes.

Note: I think this only makes sense for rather reliable networks, so it should be disabled by default (and thus add no overhead).

Please describe the solution you'd like

The bluechi-controller uses the same event-based mechanism that bluechi-agent uses for its heartbeat. At a configurable interval, it checks for each online node that the last seen timestamp is not older than a configurable threshold. If it is older, the controller actively disconnects the node.

  • New configuration options for bluechi-controller
    • HeartbeatInterval: The interval, in milliseconds, for checking the last seen timestamps of nodes. A value of 0 disables the check.
    • NodeHeartbeatThreshold: If now - last_seen_timestamp > NodeHeartbeatThreshold, actively disconnect the node.
  • Implement verify and disconnect logic
  • Implement integration tests
  • Extend documentation (man pages, examples, etc.)
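
The verify and disconnect logic proposed above can be sketched roughly as follows. This is a minimal Python illustration of the rule, not the actual bluechi-controller implementation (which is not shown in this issue). The `Node` class, `heartbeat_enabled`, and `check_nodes` are hypothetical names, and the sketch assumes the threshold uses the same millisecond unit as HeartbeatInterval.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Node:
    """Hypothetical stand-in for a node tracked by the controller."""
    name: str
    last_seen_ms: int
    online: bool = True

    def disconnect(self) -> None:
        # A real controller would also close the connection and free the name.
        self.online = False


def heartbeat_enabled(heartbeat_interval_ms: int) -> bool:
    """A HeartbeatInterval of 0 disables the periodic check."""
    return heartbeat_interval_ms > 0


def check_nodes(nodes: Iterable[Node], threshold_ms: int, now_ms: int) -> List[str]:
    """Disconnect every online node whose last seen timestamp is too old.

    Returns the names of the nodes that were disconnected in this pass.
    """
    disconnected = []
    for node in nodes:
        if not node.online:
            continue  # only online nodes are checked
        if now_ms - node.last_seen_ms > threshold_ms:
            node.disconnect()
            disconnected.append(node.name)
    return disconnected
```

For example, with a threshold of 2000 ms and a current time of 6000 ms, a node last seen at 1000 ms would be disconnected, while a node last seen at 5500 ms would stay online. A periodic timer firing every HeartbeatInterval milliseconds would call `check_nodes` only when `heartbeat_enabled` returns true.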
engelmi added the enhancement and backlog labels on Mar 27, 2024
@alexlarsson
Contributor

This makes a lot of sense to me.
