RFE: hirtectl connection-status [NODE-NAME] #281

Closed
dougsland opened this issue May 7, 2023 · 23 comments · Fixed by #700
Assignees
Labels
breaking Requested changes will break existing usages enhancement New feature or request jira Issues that are synced to Jira
Milestone

Comments

@dougsland
Contributor

Please describe what you would like to see

Show the current node's information and connection status, for example whether it is able to send systemd commands to another node.

Example, feel free to change/adapt output:

# hirtectl connection-status (no params)
Node Name (server running the command above): foobar
Listening and ready to accept connections via  port: 842  NOTE: (if control/main node - see /etc/hirte/agent.conf ManagerPort)
Connected and able to send commands to: NODE1 (see /etc/hirte/agent.conf - AllowedNodes)

# hirtectl connection-status
NodeName: foobar
Not accepting connection
Not connected to other nodes

If param is provided, assume it's a node name to check if it's able to communicate and send systemd commands remotely.

# hirtectl connection-status node1
Node Name: foobar
Connected and able to send commands to: node1

# echo $?
0

# hirtectl connection-status node1
Node Name: foobar
Unable to connect to: node1

# echo $?
1

Please describe your use case

Start a hirte server and a hirte agent in parallel in 2 containers (in the background) and then try to send a command. The command might fail because the hirte server is not yet communicating with the agent:

Example, execute in PARALLEL:

  • Host Machine: create a control-container via podman -> set hirte server and agent
  • Host Machine: create a node1 container via podman -> set a hirte agent
  • Host Machine: list all services in the control/node1 machines: podman exec control-container hirtectl list-units

It might fail because the servers are not communicating yet. We need a way to poll and see whether hirte is ready to receive commands like list-units.
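The polling described above could be scripted roughly as follows. This is only a sketch: `wait_until` and `units_listed` are hypothetical helper names, and treating a non-empty `list-units` output as "ready" is an assumption, not something hirte defines.

```python
import subprocess
import time
from typing import Callable


def wait_until(check: Callable[[], bool], timeout: float = 30.0,
               interval: float = 1.0,
               clock: Callable[[], float] = time.monotonic,
               sleep: Callable[[float], None] = time.sleep) -> bool:
    """Call check() repeatedly until it returns True or the timeout expires."""
    deadline = clock() + timeout
    while True:
        if check():
            return True
        if clock() >= deadline:
            return False
        sleep(interval)


def units_listed() -> bool:
    """Hypothetical readiness probe built on the command from the use case."""
    # Assumption: exit code 0 plus non-empty output means at least one
    # agent has registered and its units are visible.
    result = subprocess.run(
        ["podman", "exec", "control-container", "hirtectl", "list-units"],
        capture_output=True, text=True,
    )
    return result.returncode == 0 and bool(result.stdout.strip())
```

A test script would then call `wait_until(units_listed, timeout=60)` before sending further commands.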

@engelmi engelmi added the enhancement New feature or request label May 8, 2023
@mkemel
Member

mkemel commented May 9, 2023

Hi Douglas,
Hirte is always ready to receive the list-units command. If no nodes are registered to hirte yet, it returns an empty list.
I think the better feature would be
hirtectl list-nodes

@engelmi
Member

engelmi commented May 9, 2023

@mkemel The use case also includes querying the status of a specific node. How would you design the list-nodes command in this case? For example, hirtectl list-nodes $node1 seems a bit off.

When looking at a similar issue #255 where we want to query the status of a service on a specific node, I am wondering if we can (kind of) unify the commands for hirtectl here, something like:

# prints basically the status of the whole hirte system 
# all nodes (as expected in hirte) and their status (online/offline)
hirtectl status

# prints the node status and info as proposed earlier
hirtectl status $node

# prints the unit status on the node 
hirtectl status $node $unit

What do you think? @dougsland @mkemel

Edit: I'll add more comprehensive examples later.
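To make the shape of the unified command concrete, here is a minimal sketch of the argument handling with two optional positionals. It uses Python's argparse for illustration only. It is not how hirtectl parses its arguments, and `build_parser` and `describe` are hypothetical names.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="hirtectl")
    sub = parser.add_subparsers(dest="command", required=True)
    status = sub.add_parser("status", help="show system, node, or unit status")
    # Both positionals are optional: zero, one, or two arguments are accepted.
    status.add_argument("node", nargs="?", help="node name")
    status.add_argument("unit", nargs="?", help="unit name (requires node)")
    return parser


def describe(argv: list[str]) -> str:
    """Report which of the three proposed status views the arguments select."""
    args = build_parser().parse_args(argv)
    if args.node is None:
        return "status of all nodes"
    if args.unit is None:
        return f"status of node {args.node}"
    return f"status of unit {args.unit} on node {args.node}"
```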

@mkemel
Member

mkemel commented May 9, 2023

Correct me if I'm wrong, but I think that the moment an agent disconnects, we no longer have any info on it. If we list nodes, all of them are online, unless we list all the nodes in AllowedNodeNames and then say that the ones not in the nodes list are offline.
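The idea of deriving offline nodes from the configured list can be sketched as below. `node_states` is a hypothetical helper. `allowed` stands in for the names configured in AllowedNodeNames, and `connected` for whatever set of nodes the controller currently holds.

```python
def node_states(allowed: list[str], connected: set[str]) -> dict[str, str]:
    """Map every expected node to "online" or "offline".

    A node counts as offline when it is configured but not currently
    connected; nodes that connect without being configured are ignored here.
    """
    return {
        name: ("online" if name in connected else "offline")
        for name in allowed
    }
```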

@mkemel
Member

mkemel commented May 9, 2023

Other than that, I like the unification under status command idea

@dougsland
Contributor Author

dougsland commented May 9, 2023

@mkemel The use case also includes to query the status for a specific node. How would you design the list-nodes command in this case? For example, hirtectl list-nodes $node1 seems a bit off.

When looking at a similar issue #255 where we want to query the status of a service on a specific node, I am wondering if we can (kind of) unify the commands for hirtectl here, something like:

# prints basically the status of the whole hirte system 
# all nodes (as expected in hirte) and their status (online/offline)
hirtectl status

# prints the node status and info as proposed earlier
hirtectl status $node

# prints the unit status on the node 
hirtectl status $node $unit

What do you think? @dougsland @mkemel

Edit: I'll add more comprehensive examples later.

For list-nodes I opened this one: #291
The status name is a good one too. :) I imagined connection-status because I wanted to mimic the command line interface of nmcli (i.e. nmcli conn status). When command line tools share a similar interface, they are easier to remember. :)

Feel free to adapt the name. Thanks for taking care of this.

@dougsland
Contributor Author

dougsland commented May 9, 2023

Correct me if I'm wrong, but I think that the moment agent disconnects - we don't have any info on it anymore. If we list
nodes - all of them are online. Unless we list all the nodes in AllowedNodeNames, and then say that ones that are not in
the nodes list are offline

Agreed. Offline is just fine.

@mkemel
Member

mkemel commented May 9, 2023

Agreed. Offline is just fine.

I'm not sure we need it. IMO just listing the connected nodes (i.e. online nodes) should be enough.

@dougsland
Contributor Author

Agreed. Offline is just fine.

I'm not sure we need it. IMO just listing the connected nodes (i.e. online nodes) should be enough.

Hmm, I would disagree. I am sure that, sooner or later, we will have cases opened or users complaining: 'Where are my nodes?' 'What is the status of node xyz123?' To simplify and even reduce the amount of data generated and the time taken, --online, --active, --all should be our friends IMHO.
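A rough sketch of how such filters could behave is shown below. It covers only --online and --all, because the thread does not define what --active would mean. The parser name, `parse_filter`, and `filter_nodes` are hypothetical and do not reflect any existing hirtectl option.

```python
import argparse


def parse_filter(argv: list[str]) -> str:
    """Return the selected filter mode; defaults to "all"."""
    parser = argparse.ArgumentParser(prog="list-nodes-sketch")
    group = parser.add_mutually_exclusive_group()
    group.add_argument("--online", dest="mode", action="store_const", const="online")
    group.add_argument("--all", dest="mode", action="store_const", const="all")
    parser.set_defaults(mode="all")
    return parser.parse_args(argv).mode


def filter_nodes(states: dict[str, str], mode: str) -> dict[str, str]:
    """Keep every node for "all", otherwise only nodes whose state equals mode."""
    if mode == "all":
        return dict(states)
    return {name: state for name, state in states.items() if state == mode}
```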

@dougsland
Contributor Author

dougsland commented May 9, 2023

Another example: for offline nodes in the list, include the last stderr error known from the polling status (if possible), so it is easy for admins to identify what's wrong and how to fix it.

# hirtectl connection-status
NodeName: foobar
Not accepting connection
Not connected to other nodes
Last stderr from hirte agent/daemon:
      "Error, port already in use"

@dougsland
Contributor Author

@mkemel Does that make sense? I'm not sure that was a good example, because we might not be able to capture the stderr in such a scenario, but "unable to connect", "unreachable", "unknown host", or "last seen 1 hour ago" would certainly fit.

@engelmi
Member

engelmi commented May 10, 2023

I'm not sure we need it. IMO just listing the connected nodes (i.e. online nodes) should be enough.

I agree with @dougsland. As you said @mkemel, Hirte has a list of expected nodes and when listing all nodes, I'd also expect all of them to be there - otherwise I'd first check the Hirte configuration "maybe I didn't add them?".

hum, I would disagree. I am sure, soon or later, we will have opened cases or users complaining 'where my nodes' ? 'what the status of node xyz123?' To simplify and even reduce the "amount of data generated and time" --online, --active, --all should be our friends IMHO.

Yes, adding filters is a pretty nice (additional) feature. I'd focus first on getting the basic one done; adding those afterwards is simple :)

Another example for the offline nodes be in the list, include the last stderr error "know" from the pooling status (if possible) so easy to adms to identify what's wrong and how to fix it.

# hirtectl connection-status
NodeName: foobar
Not accepting connection
Not connected to other nodes
Last stderr from hirte agent/daemon:
      "Error, port already in use"

In hirte we have the central unit (the controller, or simply "hirte") and the controlled nodes ("agents"). We decided that it is more robust for the agents to connect to the controller, and not the other way around. And since hirtectl uses the API of the controller, we wouldn't be able to retrieve any information on an initial connection failure. We could, however, try to get health info on connected nodes. I think @sdunnagan already implemented a POC for this, but we need to refine it.

@engelmi
Member

engelmi commented May 10, 2023

So, I was thinking that this could be the output of those 3 commands:

# hirtectl status
Hirte Controller: 10.0.2.1:8420

NODE       STATUS      LAST SEEN        IP
laptop     online      now              10.0.2.2
pi         offline     1 day ago       

# hirtectl status laptop
NODE       STATUS      LAST SEEN        IP
laptop     online      now              10.0.2.2

# hirtectl status laptop simple.service
UNIT              LOADED      ACTIVE     SUBSTATE    ENABLED
simple.service    loaded      inactive   dead        yes

What do you think? @dougsland @mkemel
We could also extend the list-units by the loaded and enabled field.

Edit: Just saw #291 and it's pretty much the same as the hirtectl status proposed here :) Thinking about it, maybe just hirtectl status might be too fluffy of a cmd.
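For illustration, the node table proposed above could be rendered from plain records along these lines. The column names come from the proposal. `render_table` and the record keys (`node`, `status`, `last_seen`, `ip`) are assumptions of this sketch.

```python
def render_table(rows: list[dict[str, str]]) -> str:
    """Render node records as a left-aligned, space-padded table."""
    headers = ["NODE", "STATUS", "LAST SEEN", "IP"]
    keys = ["node", "status", "last_seen", "ip"]
    # Missing values (e.g. no IP for an offline node) become empty cells.
    cells = [headers] + [[row.get(key, "") for key in keys] for row in rows]
    widths = [max(len(line[i]) for line in cells) for i in range(len(headers))]
    return "\n".join(
        "  ".join(cell.ljust(width) for cell, width in zip(line, widths)).rstrip()
        for line in cells
    )
```

Column widths are computed from the data, so long node names do not break the alignment.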

@mkemel
Member

mkemel commented May 10, 2023

@engelmi Good point
To track 'LAST SEEN' we would need to keep a node in nodes_list on disconnect instead of removing it. And, in any case, since hirte does not persist state, it would be correct only until hirte restarts. This can be done, but I would wait for comments from @alexlarsson and @pypingou first on how they see it.
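The trade-off described here can be sketched with a small in-memory registry: entries survive a disconnect, but everything is lost when the process restarts. `NodeRegistry` is a hypothetical class for illustration and does not mirror hirte's actual nodes_list.

```python
import time
from typing import Callable, Optional


class NodeRegistry:
    """In-memory node tracking that keeps entries after a disconnect."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self._online: set[str] = set()
        self._last_seen: dict[str, float] = {}

    def connect(self, name: str) -> None:
        self._online.add(name)
        self._last_seen[name] = self._clock()

    def disconnect(self, name: str) -> None:
        # Keep the entry: only flip the state and record when the node left.
        self._online.discard(name)
        self._last_seen[name] = self._clock()

    def status(self, name: str) -> tuple[str, Optional[float]]:
        """Return (state, last-seen timestamp or None if never seen)."""
        state = "online" if name in self._online else "offline"
        return state, self._last_seen.get(name)
```

A freshly created registry reports `("offline", None)` for every node, which corresponds to the state right after a restart.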

@mkemel
Member

mkemel commented May 10, 2023

Edit: Just saw #291 and its pretty much the same as the hirtectl status proposed here :) Thinking about it, maybe just hirtectl status might be too fluffy of a cmd.

Also good point.
The thing with status is that it is associated with systemctl status. So maybe we should indeed keep it for that, and the hirte manager and agent statuses would be shown by another command.

@rhatdan
Contributor

rhatdan commented May 10, 2023

I like @engelmi's output, although I would also add either a --format or a --json option to make it machine readable.

# hirtectl conn
Hirte Controller: 10.0.2.1:8420

NODE       STATUS      LAST SEEN        IP
laptop     online      now              10.0.2.2
pi         offline     1 day ago       
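A machine-readable variant of the same data might look like the sketch below. The JSON field names (`controller`, `nodes`) and the `to_json` helper are assumptions made for illustration. The thread does not specify a schema.

```python
import json


def to_json(controller: str, nodes: list[dict[str, str]]) -> str:
    """Serialize the controller address and node records as indented JSON."""
    document = {"controller": controller, "nodes": nodes}
    # sort_keys keeps the output stable, which helps when diffing in scripts.
    return json.dumps(document, indent=2, sort_keys=True)
```

Scripts could then read the result with any JSON parser instead of splitting table columns.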

@dougsland
Contributor Author

dougsland commented May 10, 2023

I like @engelmi output. although I would also add either a --format or a --json output to make it machine readable.

# hirtectl conn
Hirte Controller: 10.0.2.1:8420

NODE       STATUS      LAST SEEN        IP
laptop     online      now              10.0.2.2
pi         offline     1 day ago       

+1 For @engelmi output, looks like we have a winner 💯
+1 For @rhatdan suggestion to have --json too, very useful.

@dougsland
Contributor Author

So, I was thinking that this could be the output of those 3 commands:

# hirtectl status
Hirte Controller: 10.0.2.1:8420

NODE       STATUS      LAST SEEN        IP
laptop     online      now              10.0.2.2
pi         offline     1 day ago       

# hirtectl status laptop
NODE       STATUS      LAST SEEN        IP
laptop     online      now              10.0.2.2

# hirtectl status laptop simple.service
UNIT              LOADED      ACTIVE     SUBSTATE    ENABLED
simple.service    loaded      inactive   dead        yes

What do you think? @dougsland @mkemel We could also extend the list-units by the loaded and enabled field.

Edit: Just saw #291 and its pretty much the same as the hirtectl status proposed here :) Thinking about it, maybe just hirtectl status might be too fluffy of a cmd.

@engelmi sure thing. Let's make your status output fly, and we can refine it along the way. I will close as dup for now.

@dougsland
Contributor Author

dougsland commented May 10, 2023

@engelmi can we make sure we display:

  • the ipv4 and ipv6 (if available) of the nodes in the output too?
  • communication port

Thanks!

@rhatdan
Contributor

rhatdan commented May 10, 2023

OK, let's go with hirte nodes ...

@sandrobonazzola
Contributor

Any progress on this? I opened containers/qm#104 by mistake on the wrong repo but was looking for what's being discussed here, a list of known nodes.

@engelmi
Member

engelmi commented Jul 15, 2023

@sandrobonazzola Issue #324 is kind of a blocking issue for further extending hirtectl, including this one. Due to our limited capacity at the moment there hasn't been much progress.
hirtectl monitor node-connection has been implemented recently and can be used to continuously view all nodes and the status. If it is fine to use python, you could have a look at pyhirte. Getting all nodes is a 3 line script, e.g. in this example

@engelmi engelmi added this to the v0.7 milestone Nov 14, 2023
@mkemel mkemel added the jira Issues that are synced to Jira label Nov 21, 2023
@mkemel mkemel self-assigned this Nov 21, 2023
@mkemel mkemel self-assigned this Jan 8, 2024
@mkemel
Member

mkemel commented Jan 9, 2024

How would we change the requirement in this issue given #695 ?

@engelmi
Member

engelmi commented Jan 10, 2024

How would we change the requirement in this issue given #695 ?

#695 shouldn't impact this issue, as bluechictl status is about the individual nodes (and units), not the system's overall status. For this issue, the properties in org.eclipse.bluechi.Node.xml are needed.
We currently don't track the IP of the agent. If needed, it probably makes sense to extract that part to another issue and implement it in a different PR.

The bluechictl status <node> <unit> we already have. So all that is missing is adding bluechictl status <node> and bluechictl status, I think. Maybe we can also include a -w/--watch option to listen for changes? What do you think? @mkemel
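One way a watch mode could report changes is to compare successive status snapshots and print only the differences. The sketch below assumes snapshots are plain name-to-state mappings. `status_changes` is a hypothetical helper and says nothing about how bluechictl would obtain the snapshots.

```python
from typing import Iterator


def status_changes(previous: dict[str, str],
                   current: dict[str, str]) -> Iterator[str]:
    """Yield one line per node whose state differs between two snapshots."""
    for name in sorted(previous.keys() | current.keys()):
        # Nodes missing from a snapshot are reported as "unknown".
        before = previous.get(name, "unknown")
        after = current.get(name, "unknown")
        if before != after:
            yield f"{name}: {before} -> {after}"
```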

@engelmi engelmi added the breaking Requested changes will break existing usages label Jan 16, 2024