Add bluechi is online tool #964

Merged: engelmi merged 3 commits into eclipse-bluechi:main from engelmi:add-bluechi-is-online-tool on Nov 4, 2024

Conversation

engelmi (Member) commented Oct 18, 2024

Relates to: #962

So far, this PR adds the bluechi-is-online CLI tool, a man page and the RPM package to the spec file.

engelmi (Member Author) commented Oct 18, 2024

@dofmind
This PR is still in progress, but you can already build and test the bluechi-is-online binary (using the usual meson install). Here are some examples of how to use it (I will document them later):

######################
# Example 1: Stop service(s) when bluechi-agent loses connection 
$ cat /etc/systemd/system/monitor-bluechi-agent.service
[Unit]
Description=Monitor bluechi-agents connection to controller

[Service]
Type=simple
ExecStart=/usr/local/bin/bluechi-is-online agent --initial-wait=5000 --monitor

$ cat /etc/systemd/system/workload.service
[Unit]
Description=Some workload that should stop running when bluechi-agent disconnects
BindsTo=monitor-bluechi-agent.service
After=monitor-bluechi-agent.service

[Service]
...

######################
# Example 2: Start a service when bluechi-agent loses connection 
$ cat /etc/systemd/system/handle-bluechi-agent-offline.service
[Unit]
Description=Handle BlueChi Agent going offline and start do-stuff.service
OnFailure=do-stuff.service

[Service]
Type=simple
ExecStart=/usr/local/bin/bluechi-is-online agent --initial-wait=5000 --monitor

I'm not sure yet whether BlueChi will provide general-purpose systemd units for this; I don't have an idea of what those could look like at the moment. If you do, please let me know. And if you have time to test bluechi-is-online, please let me know what you think so we can incorporate your feedback right away.

coveralls commented Oct 18, 2024

Coverage Status

coverage: 83.236% (+0.1%) from 83.106%
when pulling 1c839d7 on engelmi:add-bluechi-is-online-tool
into ada5cf5 on eclipse-bluechi:main.

dofmind (Contributor) commented Oct 21, 2024

Thanks for this PR. I tested bluechi-is-online on my system with multiple nodes. The basic behavior of bluechi-is-online worked as we expected. However, there are three issues.

  1. When bluechi-agent loses connection, monitor-bluechi-agent.service stops but does not restart. I added Restart=on-failure. I also need a condition that bluechi-agent must be online when restarting, so I added ExecStartPre=/usr/bin/wait-for-agent-online.sh using the following script.
$ cat scripts/wait-for-agent-online.sh 
#!/bin/sh

main() {
    # Poll once per second until bluechi-is-online reports the agent as online.
    until /usr/bin/bluechi-is-online agent; do
        sleep 1
    done
}

main "$@"
  2. My system uses the SwitchController DBus method of Agent when the leader node changes. When the leader node changes and bluechi-agent executes the SwitchController DBus method, the bluechi-agent status changes to offline and then reconnects to the bluechi-controller of the new leader node. The monitor-bluechi-agent.service may stop even though bluechi-agent does not lose the connection physically.

  3. Before applying bluechi-is-online, I made bluechi-agent exit with 1 when it doesn't receive a heartbeat from the controller. But now, if bluechi-agent disconnects, bluechi-agent will try to reconnect to bluechi-controller.

Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Did not receive heartbeat from controller since '2500'ms. Disconnecting it...
Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Disconnected from controller
Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Trying to connect to controller (try 1)
Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 21 20:09:11 42dot-ak7 bluechi-agent[522173]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected

If the leader node changes before the error "Registering as 'ak7_master_main' failed: Transport endpoint is not connected" is reported, the SwitchController DBus method fails as follows and has no effect on bluechi-agent.

root@42dot-ak7:~# dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.102,port=842'
Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

engelmi (Member Author) commented Oct 22, 2024

Thanks for your feedback! @dofmind

1. When bluechi-agent loses connection, monitor-bluechi-agent.service stops but does not restart. I added `Restart=on-failure`. I also need a condition that bluechi-agent must be online when restarting, so I added `ExecStartPre=/usr/bin/wait-for-agent-online.sh` using the following script.

I'd suggest using UpheldBy= (the inverse of Upholds=) in monitor-bluechi-agent.service:

$ cat /etc/systemd/system/monitor-bluechi-agent.service
[Unit]
Description=Monitor bluechi-agents connection to controller

[Service]
Type=simple
ExecStart=/usr/local/bin/bluechi-is-online agent --initial-wait=5000 --monitor

[Install]
UpheldBy=bluechi-agent.service

This way the monitoring service gets restarted as long as the bluechi-agent.service is active.
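
An equivalent relationship can presumably also be declared from the other side, with Upholds= in a drop-in for bluechi-agent.service, which needs no `systemctl enable` step. The following is a minimal sketch, assuming the installed systemd understands Upholds= in the [Unit] section. The helper name write_uphold_dropin and the drop-in file name uphold-monitor.conf are made up for illustration.

```sh
#!/bin/sh
# Sketch: write a drop-in that makes bluechi-agent.service uphold the monitor unit.
# Assumption: the installed systemd supports Upholds= in the [Unit] section.
# The helper name and the drop-in file name are hypothetical.
write_uphold_dropin() {
    # $1: systemd unit directory, e.g. /etc/systemd/system
    dropin_dir="$1/bluechi-agent.service.d"
    mkdir -p "$dropin_dir" || return 1
    cat > "$dropin_dir/uphold-monitor.conf" <<'EOF'
[Unit]
Upholds=monitor-bluechi-agent.service
EOF
}

# Usage (as root), followed by a reload so systemd picks up the drop-in:
#   write_uphold_dropin /etc/systemd/system && systemctl daemon-reload
```

With this variant the monitor unit itself needs no [Install] section.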

2. My system uses the SwitchController DBus method of Agent when the leader node changes. When the leader node changes and bluechi-agent executes the SwitchController DBus method, the bluechi-agent status changes to offline and then reconnects to the bluechi-controller of the new leader node. The monitor-bluechi-agent.service may stop even though bluechi-agent does not lose the connection physically.

Although I think the behavior of bluechi-is-online is correct here (since the agent really disconnected), I understand that this connection "wiggling" isn't desired. The ControllerAddress property emits a changed signal, which is also triggered for SwitchController right before disconnecting. In bluechi-is-online, we can use that signal and not exit, since a disconnect is expected to happen... I'll add a new CLI option to set this (maybe with a timeout?).
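
To see what bluechi-is-online would react to, the property can be read with busctl. This is only a sketch: the bus name, object path and interface are taken from the dbus-send call earlier in this thread, ControllerAddress is the property named above, and the BUSCTL variable and the helper name are hypothetical.

```sh
#!/bin/sh
# Sketch: query the agent's ControllerAddress property over the system bus.
# BUSCTL is a hypothetical override so the command can be swapped out or dry-run.
get_controller_address() {
    ${BUSCTL:-busctl} --system get-property \
        org.eclipse.bluechi.Agent \
        /org/eclipse/bluechi \
        org.eclipse.bluechi.Agent \
        ControllerAddress
}

# Usage:
#   get_controller_address
# To watch the signals emitted by the agent live instead:
#   busctl --system monitor org.eclipse.bluechi.Agent
```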

3. Before applying bluechi-is-online, I made bluechi-agent exit with 1 when it doesn't receive a heartbeat from the controller. But now, if bluechi-agent disconnects, bluechi-agent will try to reconnect to bluechi-controller.
Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Did not receive heartbeat from controller since '2500'ms. Disconnecting it...
Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Disconnected from controller
Oct 21 20:08:35 42dot-ak7 bluechi-agent[522173]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected
Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Trying to connect to controller (try 1)
Oct 21 20:09:08 42dot-ak7 bluechi-agent[522173]: Connecting to controller on tcp:host=192.168.16.101,port=842
Oct 21 20:09:11 42dot-ak7 bluechi-agent[522173]: Registering as 'ak7_master_main' failed: Transport endpoint is not connected

If the leader node changes before the error "Registering as 'ak7_master_main' failed: Transport endpoint is not connected" is reported, the SwitchController DBus method fails as follows and has no effect on bluechi-agent.

root@42dot-ak7:~# dbus-send --system --dest=org.eclipse.bluechi.Agent --print-reply --type=method_call /org/eclipse/bluechi org.eclipse.bluechi.Agent.SwitchController string:'tcp:host=192.168.16.102,port=842'
Error org.freedesktop.DBus.Error.NoReply: Did not receive a reply. Possible causes include: the remote application did not send a reply, the message bus security policy blocked the reply, the reply timeout expired, or the network connection was broken.

This sounds like a bug in BlueChi?! This needs more investigation, I think.
If I understood it correctly, this issue can be reproduced by:

  1. setting up 3 nodes: two running a controller and the third an agent
  2. stopping the controller (yanking the cable or so)
  3. triggering SwitchController before the "Registering as..." message appears in the bluechi-agent log

dofmind (Contributor) commented Oct 24, 2024

I'd suggest using UpheldBy= (the inverse of Upholds=) in monitor-bluechi-agent.service:

I couldn't test using UpheldBy= because my systemd versions (246.9 and 250.5) don't support it. I'll try it after I backport the patch to systemd to support UpheldBy=.

The ControllerAddress property emits a changed signal, which is also triggered for SwitchController right before disconnecting. In bluechi-is-online, we can use that signal and not exit, since a disconnect is expected to happen... I'll add a new CLI option to set this (maybe with a timeout?).

A new CLI option with a timeout sounds good.

This sounds like a bug in BlueChi?! This needs more investigation, I think.
If I understood it correctly, this issue can be reproduced by:

That's right, I will create an issue for this.

engelmi (Member Author) commented Oct 24, 2024

I couldn't test using UpheldBy= because my systemd versions (246.9 and 250.5) don't support it. I'll try it after I backport the patch to systemd to support UpheldBy=.

Ah ok, then UpheldBy= can't be used, of course. And I think I misunderstood the condition you wanted to apply: that bluechi-agent must be online when restarting. I think you could achieve that by adding an ExecStartPre= to your unit that uses the new tool's --initial-wait option. This should keep the unit in an activating state so that dependent services are not started. For example:

[Service]
Type=simple
ExecStartPre=/usr/local/bin/bluechi-is-online agent --initial-wait=5000
ExecStart=/usr/local/bin/bluechi-is-online agent --monitor

A new CLI option with a timeout sounds good.

I just added the new option --switch-timeout=<ms>. If bluechi-is-online is called with this option, it waits the specified amount of time for the agent to reconnect after a controller switch before it exits with code 1. If the agent reconnects within that time frame, bluechi-is-online continues to monitor the state.
@dofmind Please give it a try if you have time.
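
The grace-period behaviour described for --switch-timeout can be pictured roughly as follows. This is only a shell sketch of the idea, not the tool's implementation, which lives in src/is-online and reacts to D-Bus signals rather than polling. PROBE, STEP_MS and the function name are assumptions for illustration, and a sleep that accepts fractional seconds (GNU coreutils, BusyBox) is assumed.

```sh
#!/bin/sh
# Sketch: after a disconnect, allow up to timeout_ms for the agent to come back.
# Returns 0 if the probe succeeds within the window, 1 otherwise.
# PROBE and STEP_MS are hypothetical knobs, not options of bluechi-is-online.
PROBE="${PROBE:-bluechi-is-online agent}"
STEP_MS=100

wait_for_reconnect() {
    timeout_ms="$1"
    waited_ms=0
    while [ "$waited_ms" -lt "$timeout_ms" ]; do
        # Unquoted on purpose so that "command arg" is split into words.
        if $PROBE; then
            return 0
        fi
        sleep 0.1   # fractional sleep is not POSIX, but widely supported
        waited_ms=$((waited_ms + STEP_MS))
    done
    return 1
}

# Usage: wait_for_reconnect 1000 || exit 1
```

A caller would go back to monitoring on return code 0 and exit with 1 otherwise, mirroring the two outcomes described above.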

I noticed, however, a problem with the order of the changed signals for the connection state and the address. Currently, we first emit the change in the connection state and then the change of the address, which should be reversed, in my view. I prepared a small PR to fix this: #968

This sounds like a bug in BlueChi?! This needs more investigation, I think.
If I understood it correctly, this issue can be reproduced by:

That's right, I will create an issue for this.

Thank you!

dofmind (Contributor) commented Oct 25, 2024

After applying the updated is-online application and the small PR #968, I tested using the --wait (instead of --initial-wait) and --switch-timeout options, and both worked perfectly.

[Service]
Type=simple
ExecStartPre=/usr/bin/bluechi-is-online agent --wait=5000
ExecStart=/usr/bin/bluechi-is-online agent --monitor --switch-timeout=1000
Restart=on-failure

I created an issue about triggering the SwitchController DBus method: #966. Once that is resolved, I'll finally be able to apply the is-online solution on my system. Thanks.

@engelmi engelmi force-pushed the add-bluechi-is-online-tool branch 3 times, most recently from 5c37433 to 2429839 Compare October 31, 2024 12:20
@engelmi engelmi marked this pull request as ready for review October 31, 2024 12:20

engelmi (Member Author) commented Oct 31, 2024

Integration tests will be added in a later PR. I think it makes sense to decouple the tests for bluechi-is-online from the usual integration tests (similar to what is described here: #840), but that will take more time/effort.

ArtiomDivak (Contributor) left a comment

LGTM

Signed-off-by: Michael Engel <mengel@redhat.com>
Relates to: eclipse-bluechi#962

Signed-off-by: Michael Engel <mengel@redhat.com>
Relates to: eclipse-bluechi#962

Added new section for tooling, moved the ansible page to it and
created bluechi-is-online page there, too.

Signed-off-by: Michael Engel <mengel@redhat.com>

mkemel (Member) left a comment

LGTM

engelmi (Member Author) commented Nov 4, 2024

Hmm... the OpenScanHub job is stuck. Since everything else passed, let's go ahead and merge this.

@engelmi engelmi merged commit bad95d5 into eclipse-bluechi:main Nov 4, 2024
21 of 22 checks passed